1. 16 Dec 2021, 2 commits
    • btrfs: fix warning when freeing leaf after subvolume creation failure · 212a58fd
      Committed by Filipe Manana
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the root item for the new subvolume into the root tree, we can
      trigger the following warning:
      
      [78961.741046] WARNING: CPU: 0 PID: 4079814 at fs/btrfs/extent-tree.c:3357 btrfs_free_tree_block+0x2af/0x310 [btrfs]
      [78961.743344] Modules linked in:
      [78961.749440]  dm_snapshot dm_thin_pool (...)
      [78961.773648] CPU: 0 PID: 4079814 Comm: fsstress Not tainted 5.16.0-rc4-btrfs-next-108 #1
      [78961.775198] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
      [78961.777266] RIP: 0010:btrfs_free_tree_block+0x2af/0x310 [btrfs]
      [78961.778398] Code: 17 00 48 85 (...)
      [78961.781067] RSP: 0018:ffffaa4001657b28 EFLAGS: 00010202
      [78961.781877] RAX: 0000000000000213 RBX: ffff897f8a796910 RCX: 0000000000000000
      [78961.782780] RDX: 0000000000000000 RSI: 0000000011004000 RDI: 00000000ffffffff
      [78961.783764] RBP: ffff8981f490e800 R08: 0000000000000001 R09: 0000000000000000
      [78961.784740] R10: 0000000000000000 R11: 0000000000000001 R12: ffff897fc963fcc8
      [78961.785665] R13: 0000000000000001 R14: ffff898063548000 R15: ffff898063548000
      [78961.786620] FS:  00007f31283c6b80(0000) GS:ffff8982ace00000(0000) knlGS:0000000000000000
      [78961.787717] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [78961.788598] CR2: 00007f31285c3000 CR3: 000000023fcc8003 CR4: 0000000000370ef0
      [78961.789568] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [78961.790585] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [78961.791684] Call Trace:
      [78961.792082]  <TASK>
      [78961.792359]  create_subvol+0x5d1/0x9a0 [btrfs]
      [78961.793054]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [78961.794009]  ? preempt_count_add+0x49/0xa0
      [78961.794705]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [78961.795712]  ? _copy_from_user+0x66/0xa0
      [78961.796382]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [78961.797392]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [78961.798172]  ? __slab_free+0x10a/0x360
      [78961.798820]  ? rcu_read_lock_sched_held+0x12/0x60
      [78961.799664]  ? lock_release+0x223/0x4a0
      [78961.800321]  ? lock_acquired+0x19f/0x420
      [78961.800992]  ? rcu_read_lock_sched_held+0x12/0x60
      [78961.801796]  ? trace_hardirqs_on+0x1b/0xe0
      [78961.802495]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [78961.803358]  ? kmem_cache_free+0x321/0x3c0
      [78961.804071]  ? __x64_sys_ioctl+0x83/0xb0
      [78961.804711]  __x64_sys_ioctl+0x83/0xb0
      [78961.805348]  do_syscall_64+0x3b/0xc0
      [78961.805969]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [78961.806830] RIP: 0033:0x7f31284bc957
      [78961.807517] Code: 3c 1c 48 f7 d8 (...)
      
      This is because we are calling btrfs_free_tree_block() on an extent
      buffer that is dirty. Fix that by cleaning the extent buffer, with
      btrfs_clean_tree_block(), before freeing it.
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix invalid delayed ref after subvolume creation failure · 7a163608
      Committed by Filipe Manana
      When creating a subvolume, at ioctl.c:create_subvol(), if we fail to
      insert the new root's root item into the root tree, we are freeing the
      metadata extent we reserved for the new root to prevent a metadata
      extent leak, as we don't abort the transaction at that point (since
      there is nothing at that point that is irreversible).
      
      However we allocated the metadata extent for the new root which we are
      creating for the new subvolume, so its delayed reference refers to the
      ID of this new root. But when we free the metadata extent we pass the
      root of the subvolume where the new subvolume is located to
      btrfs_free_tree_block() - this is incorrect because this will generate
      a delayed reference that refers to the ID of the parent subvolume's root,
      and not to the ID of the new root.
      
      This results in a failure when running delayed references that leads to
      a transaction abort and a trace like the following:
      
      [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs]
      [3868.739857] Code: 68 0f 85 e6 fb ff (...)
      [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246
      [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002
      [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88
      [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88
      [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000
      [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88
      [3868.750725] FS:  00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000
      [3868.752275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0
      [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [3868.757803] Call Trace:
      [3868.758281]  <TASK>
      [3868.758655]  ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs]
      [3868.759827]  __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs]
      [3868.761047]  btrfs_run_delayed_refs+0x86/0x210 [btrfs]
      [3868.762069]  ? lock_acquired+0x19f/0x420
      [3868.762829]  btrfs_commit_transaction+0x69/0xb20 [btrfs]
      [3868.763860]  ? _raw_spin_unlock+0x29/0x40
      [3868.764614]  ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs]
      [3868.765870]  create_subvol+0x1d8/0x9a0 [btrfs]
      [3868.766766]  btrfs_mksubvol+0x447/0x4c0 [btrfs]
      [3868.767669]  ? preempt_count_add+0x49/0xa0
      [3868.768444]  __btrfs_ioctl_snap_create+0x123/0x190 [btrfs]
      [3868.769639]  ? _copy_from_user+0x66/0xa0
      [3868.770391]  btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs]
      [3868.771495]  btrfs_ioctl+0xd1e/0x35c0 [btrfs]
      [3868.772364]  ? __slab_free+0x10a/0x360
      [3868.773198]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.774121]  ? lock_release+0x223/0x4a0
      [3868.774863]  ? lock_acquired+0x19f/0x420
      [3868.775634]  ? rcu_read_lock_sched_held+0x12/0x60
      [3868.776530]  ? trace_hardirqs_on+0x1b/0xe0
      [3868.777373]  ? _raw_spin_unlock_irqrestore+0x3e/0x60
      [3868.778280]  ? kmem_cache_free+0x321/0x3c0
      [3868.779011]  ? __x64_sys_ioctl+0x83/0xb0
      [3868.779718]  __x64_sys_ioctl+0x83/0xb0
      [3868.780387]  do_syscall_64+0x3b/0xc0
      [3868.781059]  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [3868.781953] RIP: 0033:0x7f281c59e957
      [3868.782585] Code: 3c 1c 48 f7 d8 4c (...)
      [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957
      [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003
      [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36
      [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003
      [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc
      [3868.793765]  </TASK>
      [3868.794037] irq event stamp: 0
      [3868.794548] hardirqs last  enabled at (0): [<0000000000000000>] 0x0
      [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.797086] softirqs last  enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040
      [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0
      [3868.799284] ---[ end trace be24c7002fe27747 ]---
      [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2
      [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627
      [3868.802056]  item 0 key (237436928 169 0) itemoff 16250 itemsize 33
      [3868.802863]          extent refs 1 gen 1265 flags 2
      [3868.803447]          ref#0: tree block backref root 1610
      (...)
      [3869.064354]  item 114 key (241008640 169 0) itemoff 12488 itemsize 33
      [3869.065421]          extent refs 1 gen 1268 flags 2
      [3869.066115]          ref#0: tree block backref root 1689
      (...)
      [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622  owner 0 offset 0
      [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry
      [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry
      
      Fix this by passing the new subvolume's root ID to btrfs_free_tree_block().
      This requires changing the root argument of btrfs_free_tree_block() from
      struct btrfs_root * to a u64, since at this point during the subvolume
      creation we have not yet created the struct btrfs_root for the new
      subvolume, and btrfs_free_tree_block() only needs a root ID and nothing
      else from a struct btrfs_root.
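
The mismatch described above can be modeled in a few lines of userspace C. This is only a sketch under stated assumptions: `sketch_extent_tree`, `sketch_add_ref()` and `sketch_drop_ref()` are hypothetical names invented for illustration and are not btrfs functions. The point it shows is that a back reference is matched on both the block address and the owning root ID, so dropping it with the parent subvolume's root ID finds nothing.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_REFS 8

/* Hypothetical model of one tree block back reference. */
struct sketch_backref {
	uint64_t bytenr;   /* logical address of the tree block */
	uint64_t root_id;  /* ID of the root that owns the block */
	bool in_use;
};

/* Hypothetical, fixed-size stand-in for the extent tree. */
struct sketch_extent_tree {
	struct sketch_backref refs[SKETCH_MAX_REFS];
};

/* Record that root_id owns the block at bytenr. */
static int sketch_add_ref(struct sketch_extent_tree *tree,
			  uint64_t bytenr, uint64_t root_id)
{
	assert(tree != NULL);
	for (size_t i = 0; i < SKETCH_MAX_REFS; i++) {
		if (!tree->refs[i].in_use) {
			tree->refs[i] = (struct sketch_backref){ bytenr, root_id, true };
			return 0;
		}
	}
	return -ENOSPC;
}

/* Drop a reference; both bytenr and root_id must match an existing one. */
static int sketch_drop_ref(struct sketch_extent_tree *tree,
			   uint64_t bytenr, uint64_t root_id)
{
	assert(tree != NULL);
	for (size_t i = 0; i < SKETCH_MAX_REFS; i++) {
		struct sketch_backref *ref = &tree->refs[i];

		if (ref->in_use && ref->bytenr == bytenr &&
		    ref->root_id == root_id) {
			ref->in_use = false;
			return 0;
		}
	}
	return -ENOENT;  /* same errno as the "No such entry" in the log */
}
```

With the numbers from the log above, adding a reference for block 241008640 owned by root 1689 and then dropping it as root 1622 returns `-ENOENT` in this model, while dropping it as root 1689 succeeds.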
      
      This was triggered by test case generic/475 from fstests.
      
      Fixes: 67addf29 ("btrfs: fix metadata extent leak after failure to create subvolume")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 08 Dec 2021, 1 commit
  3. 16 Nov 2021, 1 commit
  4. 29 Oct 2021, 1 commit
    • btrfs: send: prepare for v2 protocol · e77fbf99
      Committed by David Sterba
      This is preparatory work for send protocol update to version 2 and
      higher.
      
      We have many pending protocol update requests but still don't have the
      basic protocol revision in place. The first thing that must happen is
      adding the actual versioning support.

      The protocol version is a u32 and is a new member of the send ioctl
      struct. The validity of the version field is indicated by a new flag
      bit. Old kernels would fail when a higher version is requested. Protocol
      version 0 will pick the highest supported version,
      BTRFS_SEND_STREAM_VERSION, which is also exported in sysfs.
      
      The version is still unchanged and will be increased once we have new
      incompatible commands or stream updates.
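
The negotiation rules above can be sketched as a small standalone function. Everything prefixed `SKETCH_`/`sketch_` is a hypothetical stand-in, not the kernel's definition. The flag value, the choice of `-EPROTO` as the error, and the fallback to version 1 when the flag is absent are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the values are chosen for illustration only. */
#define SKETCH_SEND_FLAG_VERSION   (1u << 3)
#define SKETCH_SEND_STREAM_VERSION 1u

/* Hypothetical, reduced version of the send ioctl arguments. */
struct sketch_send_args {
	uint64_t flags;
	uint32_t version;  /* only meaningful if the version flag is set */
};

/* Returns the protocol version to use (>= 1) or a negative errno. */
static int sketch_pick_send_version(const struct sketch_send_args *args)
{
	assert(args != NULL);

	/* Assumption: without the flag the version field is ignored (v1). */
	if (!(args->flags & SKETCH_SEND_FLAG_VERSION))
		return 1;

	/* A version newer than what we support cannot be honored. */
	if (args->version > SKETCH_SEND_STREAM_VERSION)
		return -EPROTO;  /* assumed errno */

	/* Version 0 means "pick the highest supported version". */
	if (args->version == 0)
		return (int)SKETCH_SEND_STREAM_VERSION;

	return (int)args->version;
}
```

A caller that does not care about a specific revision would set the flag and leave `version` at 0, which keeps working as the supported maximum grows.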

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 27 Oct 2021, 14 commits
    • btrfs: fix deadlock when defragging transparent huge pages · 24bcb454
      Committed by Omar Sandoval
      Attempting to defragment a Btrfs file containing a transparent huge page
      immediately deadlocks with the following stack trace:
      
        #0  context_switch (kernel/sched/core.c:4940:2)
        #1  __schedule (kernel/sched/core.c:6287:8)
        #2  schedule (kernel/sched/core.c:6366:3)
        #3  io_schedule (kernel/sched/core.c:8389:2)
        #4  wait_on_page_bit_common (mm/filemap.c:1356:4)
        #5  __lock_page (mm/filemap.c:1648:2)
        #6  lock_page (./include/linux/pagemap.h:625:3)
        #7  pagecache_get_page (mm/filemap.c:1910:4)
        #8  find_or_create_page (./include/linux/pagemap.h:420:9)
        #9  defrag_prepare_one_page (fs/btrfs/ioctl.c:1068:9)
        #10 defrag_one_range (fs/btrfs/ioctl.c:1326:14)
        #11 defrag_one_cluster (fs/btrfs/ioctl.c:1421:9)
        #12 btrfs_defrag_file (fs/btrfs/ioctl.c:1523:9)
        #13 btrfs_ioctl_defrag (fs/btrfs/ioctl.c:3117:9)
        #14 btrfs_ioctl (fs/btrfs/ioctl.c:4872:10)
        #15 vfs_ioctl (fs/ioctl.c:51:10)
        #16 __do_sys_ioctl (fs/ioctl.c:874:11)
        #17 __se_sys_ioctl (fs/ioctl.c:860:1)
        #18 __x64_sys_ioctl (fs/ioctl.c:860:1)
        #19 do_syscall_x64 (arch/x86/entry/common.c:50:14)
        #20 do_syscall_64 (arch/x86/entry/common.c:80:7)
        #21 entry_SYSCALL_64+0x7c/0x15b (arch/x86/entry/entry_64.S:113)
      
      A huge page is represented by a compound page, which consists of a
      struct page for each PAGE_SIZE page within the huge page. The first
      struct page is the "head page", and the remaining are "tail pages".
      
      Defragmentation attempts to lock each page in the range. However,
      lock_page() on a tail page actually locks the corresponding head page.
      So, if defragmentation tries to lock more than one struct page in a
      compound page, it tries to lock the same head page twice and deadlocks
      with itself.
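
The self-deadlock can be illustrated with a toy userspace model. This is not kernel code: `sketch_page` and the helpers below are hypothetical, and a non-blocking try-lock stands in for lock_page() so that the point where the real code would sleep forever shows up as a `false` return instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a struct page inside a compound page. */
struct sketch_page {
	struct sketch_page *head;  /* the head page points to itself */
	bool locked;               /* only meaningful on the head page */
};

/* Make pages[0] the head page and pages[1..n-1] its tail pages. */
static void sketch_init_compound(struct sketch_page *pages, size_t n)
{
	assert(n > 0);
	for (size_t i = 0; i < n; i++) {
		pages[i].head = &pages[0];
		pages[i].locked = false;
	}
}

/*
 * Locking any page of the compound page really locks the head page.
 * Returns false where a blocking lock would have to wait.
 */
static bool sketch_trylock_page(struct sketch_page *page)
{
	struct sketch_page *head = page->head;

	if (head->locked)
		return false;
	head->locked = true;
	return true;
}

static void sketch_unlock_page(struct sketch_page *page)
{
	page->head->locked = false;
}
```

Locking `pages[1]` succeeds and marks `pages[0]` as locked; a following attempt on `pages[2]` by the same caller fails, because it targets the same head page. With a blocking lock that second attempt never returns.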
      
      Ideally, we should be able to defragment transparent huge pages.
      However, THP for filesystems is currently read-only, so a lot of code is
      not ready to use huge pages for I/O. For now, let's just return
      ETXTBSY.
      
      This can be reproduced with the following on a kernel with
      CONFIG_READ_ONLY_THP_FOR_FS=y:
      
        $ cat create_thp_file.c
        #include <fcntl.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdint.h>
        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/mman.h>
      
        static const char zeroes[1024 * 1024];
        static const size_t FILE_SIZE = 2 * 1024 * 1024;
      
        int main(int argc, char **argv)
        {
                if (argc != 2) {
                        fprintf(stderr, "usage: %s PATH\n", argv[0]);
                        return EXIT_FAILURE;
                }
                int fd = creat(argv[1], 0777);
                if (fd == -1) {
                        perror("creat");
                        return EXIT_FAILURE;
                }
                size_t written = 0;
                while (written < FILE_SIZE) {
                        ssize_t ret = write(fd, zeroes,
                                            sizeof(zeroes) < FILE_SIZE - written ?
                                            sizeof(zeroes) : FILE_SIZE - written);
                        if (ret < 0) {
                                perror("write");
                                return EXIT_FAILURE;
                        }
                        written += ret;
                }
                close(fd);
                fd = open(argv[1], O_RDONLY);
                if (fd == -1) {
                        perror("open");
                        return EXIT_FAILURE;
                }
      
                /*
                 * Reserve some address space so that we can align the file mapping to
                 * the huge page size.
                 */
                void *placeholder_map = mmap(NULL, FILE_SIZE * 2, PROT_NONE,
                                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (placeholder_map == MAP_FAILED) {
                        perror("mmap (placeholder)");
                        return EXIT_FAILURE;
                }
      
                void *aligned_address =
                        (void *)(((uintptr_t)placeholder_map + FILE_SIZE - 1) & ~(FILE_SIZE - 1));
      
                void *map = mmap(aligned_address, FILE_SIZE, PROT_READ | PROT_EXEC,
                                 MAP_SHARED | MAP_FIXED, fd, 0);
                if (map == MAP_FAILED) {
                        perror("mmap");
                        return EXIT_FAILURE;
                }
                if (madvise(map, FILE_SIZE, MADV_HUGEPAGE) < 0) {
                        perror("madvise");
                        return EXIT_FAILURE;
                }
      
                char *line = NULL;
                size_t line_capacity = 0;
                FILE *smaps_file = fopen("/proc/self/smaps", "r");
                if (!smaps_file) {
                        perror("fopen");
                        return EXIT_FAILURE;
                }
                for (;;) {
                        for (size_t off = 0; off < FILE_SIZE; off += 4096)
                                ((volatile char *)map)[off];
      
                        ssize_t ret;
                        bool this_mapping = false;
                        while ((ret = getline(&line, &line_capacity, smaps_file)) > 0) {
                                unsigned long start, end, huge;
                                if (sscanf(line, "%lx-%lx", &start, &end) == 2) {
                                        this_mapping = (start <= (uintptr_t)map &&
                                                        (uintptr_t)map < end);
                                } else if (this_mapping &&
                                           sscanf(line, "FilePmdMapped: %ld", &huge) == 1 &&
                                           huge > 0) {
                                        return EXIT_SUCCESS;
                                }
                        }
      
                        sleep(6);
                        rewind(smaps_file);
                        fflush(smaps_file);
                }
        }
        $ ./create_thp_file huge
        $ btrfs fi defrag -czstd ./huge

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: use btrfs_get_dev_args_from_path in dev removal ioctls · 1a15eb72
      Committed by Josef Bacik
      For device removal and replace we call btrfs_find_device_by_devspec,
      which if we give it a device path and nothing else will call
      btrfs_get_dev_args_from_path, which opens the block device and reads the
      super block and then looks up our device based on that.
      
      However, at this point we're holding the sb write "lock", so reading the
      block device pulls in the dependency of ->open_mutex, which produces the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #405 Not tainted
      ------------------------------------------------------
      losetup/11576 is trying to acquire lock:
      ffff9bbe8cded938 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_get_by_dev.part.0+0x56/0x3c0
             blkdev_get_by_path+0x98/0xa0
             btrfs_get_bdev_and_sb+0x1b/0xb0
             btrfs_find_device_by_devspec+0x12b/0x1c0
             btrfs_rm_device+0x127/0x610
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11576:
       #0: ffff9bbe88e4fc68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11576 Comm: losetup Not tainted 5.14.0-rc2+ #405
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7f31b02404cb
      
      Instead what we want to do is populate our device lookup args before we
      grab any locks, and then pass these args into btrfs_rm_device().  From
      there we can find the device and do the appropriate removal.

      Suggested-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: handle device lookup with btrfs_dev_lookup_args · 562d7b15
      Committed by Josef Bacik
      We have a lot of device lookup functions that all do something slightly
      different.  Clean this up by adding a struct to hold the different
      lookup criteria, and then pass this around to btrfs_find_device() so it
      can do the proper matching based on the lookup criteria.
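
The idea of one criteria struct feeding a single lookup function can be sketched as follows. The types and the matching rules here are hypothetical and simplified for illustration; they are not the definitions this commit adds. In particular, treating `devid == 0` and `uuid == NULL` as "don't care" is an assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_UUID_SIZE 16

/* Hypothetical, reduced device record. */
struct sketch_device {
	uint64_t devid;
	uint8_t uuid[SKETCH_UUID_SIZE];
	bool missing;
};

/* Hypothetical lookup criteria; unset fields are skipped when matching. */
struct sketch_lookup_args {
	uint64_t devid;       /* 0 means "any devid" (sketch convention) */
	const uint8_t *uuid;  /* NULL means "any UUID" */
	bool missing;         /* true: match only devices flagged missing */
};

static bool sketch_dev_matches(const struct sketch_lookup_args *args,
			       const struct sketch_device *dev)
{
	if (args->missing)
		return dev->missing;
	if (args->devid != 0 && dev->devid != args->devid)
		return false;
	if (args->uuid && memcmp(dev->uuid, args->uuid, SKETCH_UUID_SIZE) != 0)
		return false;
	return true;
}

/* Single entry point: return the first device matching the criteria. */
static struct sketch_device *
sketch_find_device(struct sketch_device *devs, size_t n,
		   const struct sketch_lookup_args *args)
{
	assert(args != NULL);
	for (size_t i = 0; i < n; i++) {
		if (sketch_dev_matches(args, &devs[i]))
			return &devs[i];
	}
	return NULL;
}
```

Callers that previously needed separate "find by devid", "find by UUID" and "find missing" helpers only differ in how they fill in the criteria struct.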

      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: enable defrag for subpage case · c22a3572
      Committed by Qu Wenruo
      With the new infrastructure which has taken subpage into consideration,
      now we should be safe to allow defrag to work for subpage case.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: remove the old infrastructure · c6357573
      Committed by Qu Wenruo
      Now the old defrag infrastructure can all be removed.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: use defrag_one_cluster() to implement btrfs_defrag_file() · 7b508037
      Committed by Qu Wenruo
      The function defrag_one_cluster() is able to defrag one range well
      enough, so we only need to do the preparation for it, including:
      
      - Clamp and align the defrag range
      - Exclude invalid cases
      - Proper inode locking
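
The first two preparation steps in the list above (clamping and aligning the range, and rejecting invalid cases) might look roughly like this in isolation. It is a sketch with hypothetical names, not the code from this patch; it assumes the range is clamped to the file size and aligned outward to the sector size, and it ignores overflow for file sizes close to the 64-bit limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: byte range [start, end). */
struct sketch_range {
	uint64_t start;
	uint64_t end;  /* exclusive */
};

/*
 * Clamp [start, start + len) to the file size and align it outward to
 * sectorsize. Returns false for ranges with nothing to defrag.
 */
static bool sketch_prepare_defrag_range(uint64_t start, uint64_t len,
					uint64_t isize, uint32_t sectorsize,
					struct sketch_range *out)
{
	/* The alignment math below needs a power-of-two sector size. */
	assert(sectorsize != 0 && (sectorsize & (sectorsize - 1)) == 0);

	/* Invalid cases: empty file, empty range, or start beyond EOF. */
	if (isize == 0 || len == 0 || start >= isize)
		return false;

	/* Clamp the length so the range ends at EOF at the latest. */
	if (len > isize - start)
		len = isize - start;

	uint64_t mask = (uint64_t)sectorsize - 1;

	out->start = start & ~mask;               /* round down */
	out->end = (start + len + mask) & ~mask;  /* round up */
	return true;
}
```

For example, a request for 3000 bytes at offset 5000 with a 4096-byte sector size becomes the range [4096, 8192) in this sketch.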
      
      The old infrastructure will not be removed in this patch, as that would
      make it too noisy to review.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: introduce helper to defrag one cluster · b18c3ab2
      Committed by Qu Wenruo
      This new helper, defrag_one_cluster(), will defrag one cluster (at most
      256K):
      
      - Collect all initial targets
      
      - Kick in readahead when possible
      
      - Call defrag_one_range() on each initial target
        With some extra range clamping.
      
      - Update @sectors_defragged parameter
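
How a caller might walk a range one cluster at a time can be sketched as below. The names are hypothetical, and the choice to end each cluster on a 256K-aligned boundary is an assumption of this sketch; the text above only states that a cluster is at most 256K.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_CLUSTER_SIZE (256u * 1024u)

/*
 * Exclusive end of the cluster that contains cur, capped at range_end.
 * Clusters end on 256K-aligned offsets (assumption of this sketch).
 */
static uint64_t sketch_cluster_end(uint64_t cur, uint64_t range_end)
{
	assert(cur < range_end);

	uint64_t boundary = (cur / SKETCH_CLUSTER_SIZE + 1) * SKETCH_CLUSTER_SIZE;

	return boundary < range_end ? boundary : range_end;
}

/* Count how many cluster-sized steps are needed to cover the range. */
static unsigned int sketch_count_clusters(uint64_t start, uint64_t range_end)
{
	unsigned int n = 0;

	for (uint64_t cur = start; cur < range_end;
	     cur = sketch_cluster_end(cur, range_end))
		n++;
	return n;
}
```

Each iteration of the loop corresponds to one call of the per-cluster helper, so no single step ever covers more than 256K.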
      
      This involves one behavior change: the defragged sectors accounting is
      no longer as accurate as the old behavior, as the initial targets are
      not consistent.

      We can have new holes punched inside an initial target, and we will
      skip such holes later.
      But the defragged sectors accounting doesn't need to be that accurate
      anyway, so I don't want to pass that extra accounting burden into
      defrag_one_range().

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: introduce helper to defrag a range · e9eec721
      Committed by Qu Wenruo
      A new helper, defrag_one_range(), is introduced to defrag one range.
      
      This function will mostly prepare the needed pages and extent status for
      defrag_one_locked_target().
      
      As we can only have a consistent view of the extent map with the page
      and extent bits locked, we need to re-check the range passed in to get a
      real target list for defrag_one_locked_target().

      Since defrag_collect_targets() will call defrag_lookup_extent() and lock
      the extent range, we also need to teach those two functions to skip the
      extent lock.  Thus a new parameter, @locked, is introduced to skip the
      extent lock if the caller has already locked the range.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: introduce helper to defrag a contiguous prepared range · 22b398ee
      Committed by Qu Wenruo
      A new helper, defrag_one_locked_target(), is introduced to do the real
      part of defrag.
      
      The caller needs to ensure both page and extent bits are locked, that no
      ordered extent exists for the range, and that all writeback is finished.

      The core defrag part is pretty straightforward:
      
      - Reserve space
      - Set extent bits to defrag
      - Update involved pages to be dirty

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: introduce helper to collect target file extents · eb793cf8
      Committed by Qu Wenruo
      Introduce a helper, defrag_collect_targets(), to collect all possible
      targets to be defragged.
      
      This function will not consider things like max_sectors_to_defrag, so
      the caller is responsible for ensuring we don't exceed the limit.

      This function will be the first stage of the later defrag rework.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: factor out page preparation into a helper · 5767b50c
      Committed by Qu Wenruo
      In cluster_pages_for_defrag(), we have a complex code block inside one
      for() loop.

      The code block prepares one page for defrag, ensuring that:
      
      - The page is locked and set up properly.
      - No ordered extent exists in the page range.
      - The page is uptodate.
      
      This behavior is pretty common and will be reused by later defrag
      rework.
      
      So factor out the code into its own helper, defrag_prepare_one_page(),
      for later usage, and clean up the code a little.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: replace hard coded PAGE_SIZE with sectorsize · 76068cae
      Committed by Qu Wenruo
      When testing subpage defrag support, I kept hitting strange inode
      nbytes errors. After a lot of debugging, it turned out that
      defrag_lookup_extent() is using PAGE_SIZE as the size for
      lookup_extent_mapping().

      Since lookup_extent_mapping() is calling __lookup_extent_mapping() with
      @strict == 1, this means any extent map smaller than one page will be
      ignored, preventing subpage defrag from grabbing a correct extent map.

      There are quite a few PAGE_SIZE usages in ioctl.c, but most of them are
      correct, and fall into one of the following cases:
      
      - ioctl structure size check
        We want ioctl structure to be contained inside one page.
      
      - real page operations
      
      The remaining cases in defrag_lookup_extent() and
      check_defrag_in_cache() will be addressed in this patch.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: also check PagePrivate for subpage cases in cluster_pages_for_defrag() · cae79686
      Committed by Qu Wenruo
      In function cluster_pages_for_defrag() we have a window where we unlock
      the page, and either start the ordered range or read the content from
      disk.
      
      When we re-lock the page, we need to make sure it still has the correct
      page->private for subpage.
      
      Thus add the extra PagePrivate check here to handle subpage cases
      properly.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: defrag: pass file_ra_state instead of file to btrfs_defrag_file() · 1ccc2e8a
      Committed by Qu Wenruo
      Currently btrfs_defrag_file() accepts both "struct inode" and "struct
      file" as parameters.  We can easily grab "struct inode" from "struct
      file" using the file_inode() helper.

      The only reason we need "struct file" is to re-use its f_ra.

      Change this to pass a "struct file_ra_state" parameter, so that it's
      clearer what we really want.  While we're here, also add some comments
      on the function btrfs_defrag_file().
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1ccc2e8a
  6. 26 10月, 2021 1 次提交
  7. 19 10月, 2021 1 次提交
  8. 18 10月, 2021 1 次提交
    • A
      gup: Turn fault_in_pages_{readable,writeable} into fault_in_{readable,writeable} · bb523b40
      Andreas Gruenbacher committed
      Turn fault_in_pages_{readable,writeable} into versions that return the
      number of bytes not faulted in, similar to copy_to_user, instead of
      returning a non-zero value when any of the requested pages couldn't be
      faulted in.  This supports the existing users that require all pages to
      be faulted in as well as new users that are happy if any pages can be
      faulted in.
      
      Rename the functions to fault_in_{readable,writeable} to make sure
      this change doesn't silently break things.
      
      Neither of these functions is entirely trivial and it doesn't seem
      useful to inline them, so move them to mm/gup.c.
      Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
      bb523b40
  9. 07 Sep, 2021 1 commit
    • J
      btrfs: delay blkdev_put until after the device remove · 3fa421de
      Josef Bacik committed
      When removing the device we call blkdev_put() on the device once we've
      removed it, and because we have an EXCL open we need to take the
      ->open_mutex on the block device to clean it up.  Unfortunately during
      device remove we are holding the sb writers lock, which results in the
      following lockdep splat:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.14.0-rc2+ #407 Not tainted
      ------------------------------------------------------
      losetup/11595 is trying to acquire lock:
      ffff973ac35dd138 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x67/0x5e0
      
      but task is already holding lock:
      ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (&lo->lo_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             lo_open+0x28/0x60 [loop]
             blkdev_get_whole+0x25/0xf0
             blkdev_get_by_dev.part.0+0x168/0x3c0
             blkdev_open+0xd2/0xe0
             do_dentry_open+0x161/0x390
             path_openat+0x3cc/0xa20
             do_filp_open+0x96/0x120
             do_sys_openat2+0x7b/0x130
             __x64_sys_openat+0x46/0x70
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #3 (&disk->open_mutex){+.+.}-{3:3}:
             __mutex_lock+0x7d/0x750
             blkdev_put+0x3a/0x220
             btrfs_rm_device.cold+0x62/0xe5
             btrfs_ioctl+0x2a31/0x2e70
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      -> #2 (sb_writers#12){.+.+}-{0:0}:
             lo_write_bvec+0xc2/0x240 [loop]
             loop_process_work+0x238/0xd00 [loop]
             process_one_work+0x26b/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
             process_one_work+0x245/0x560
             worker_thread+0x55/0x3c0
             kthread+0x140/0x160
             ret_from_fork+0x1f/0x30
      
      -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
             __lock_acquire+0x10ea/0x1d90
             lock_acquire+0xb5/0x2b0
             flush_workqueue+0x91/0x5e0
             drain_workqueue+0xa0/0x110
             destroy_workqueue+0x36/0x250
             __loop_clr_fd+0x9a/0x660 [loop]
             block_ioctl+0x3f/0x50
             __x64_sys_ioctl+0x80/0xb0
             do_syscall_64+0x38/0x90
             entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      other info that might help us debug this:
      
      Chain exists of:
        (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(&lo->lo_mutex);
                                     lock(&disk->open_mutex);
                                     lock(&lo->lo_mutex);
        lock((wq_completion)loop0);
      
       *** DEADLOCK ***
      
      1 lock held by losetup/11595:
       #0: ffff973ac9812c68 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x660 [loop]
      
      stack backtrace:
      CPU: 0 PID: 11595 Comm: losetup Not tainted 5.14.0-rc2+ #407
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
      Call Trace:
       dump_stack_lvl+0x57/0x72
       check_noncircular+0xcf/0xf0
       ? stack_trace_save+0x3b/0x50
       __lock_acquire+0x10ea/0x1d90
       lock_acquire+0xb5/0x2b0
       ? flush_workqueue+0x67/0x5e0
       ? lockdep_init_map_type+0x47/0x220
       flush_workqueue+0x91/0x5e0
       ? flush_workqueue+0x67/0x5e0
       ? verify_cpu+0xf0/0x100
       drain_workqueue+0xa0/0x110
       destroy_workqueue+0x36/0x250
       __loop_clr_fd+0x9a/0x660 [loop]
       ? blkdev_ioctl+0x8d/0x2a0
       block_ioctl+0x3f/0x50
       __x64_sys_ioctl+0x80/0xb0
       do_syscall_64+0x38/0x90
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      RIP: 0033:0x7fc21255d4cb
      
      So instead save the bdev and do the put once we've dropped the sb
      writers lock in order to avoid the lockdep recursion.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      3fa421de
  10. 23 Aug, 2021 13 commits
    • C
      btrfs: allow idmapped INO_LOOKUP_USER ioctl · 6623d9a0
      Christian Brauner committed
      INO_LOOKUP_USER is an unprivileged version of the INO_LOOKUP ioctl.
      The main difference between the two is that INO_LOOKUP is a filesystem
      wide operation whereas INO_LOOKUP_USER is scoped beneath the file
      descriptor passed with the ioctl.
      Specifically, INO_LOOKUP_USER must adhere to the following restrictions:
      
      - The caller must be privileged over each inode of each path component
        for the path they are trying to lookup.
      
      - The path for the subvolume the caller is trying to lookup must be reachable
        from the inode associated with the file descriptor passed with the ioctl.
      
      The second condition makes it possible to scope the lookup of the path
      to the mount identified by the file descriptor passed with the ioctl.
      This allows us to enable this ioctl on idmapped mounts.
      
      Specifically, this is possible because all child subvolumes of a parent
      subvolume are reachable when the parent subvolume is mounted. So if the
      user had access to open the parent subvolume or has been given the fd
      then they can lookup the path if they had access to it provided they
      were privileged over each path component.
      
      Note, the INO_LOOKUP_USER ioctl allows a user to learn the path and name
      of a subvolume even though they would otherwise be restricted from doing
      so via regular VFS-based lookup.
      
      So think about a parent subvolume with multiple child subvolumes.
      Someone could mount the parent subvolume and restrict access to the child
      subvolumes by overmounting them with empty directories. At this point
      the user can't traverse the child subvolumes and they can't open files
      in the child subvolumes.  However, they can still learn the path of
      child subvolumes as long as they have access to the parent subvolume by
      using the INO_LOOKUP_USER ioctl.
      
      The underlying assumption here is that it's ok that the lookup ioctls
      can't really take mounts into account other than the original mount the
      fd belongs to during lookup. Since this assumption is baked into the
      original INO_LOOKUP_USER ioctl we can extend it to idmapped mounts.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      6623d9a0
    • C
      btrfs: allow idmapped SUBVOL_SETFLAGS ioctl · 39e1674f
      Christian Brauner committed
      Setting flags on subvolumes or snapshots is a core feature of btrfs. The
      SUBVOL_SETFLAGS ioctl is especially important as it allows making
      subvolumes and snapshots read-only or read-write. Allow setting flags on
      btrfs subvolumes and snapshots on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      39e1674f
    • C
      btrfs: allow idmapped SET_RECEIVED_SUBVOL ioctls · e4fed17a
      Christian Brauner committed
      The SET_RECEIVED_SUBVOL ioctls are used to set information about
      a received subvolume. Make it possible to set information about a
      received subvolume on idmapped mounts. This is a fairly straightforward
      operation since all the permission checking helpers are already capable
      of handling idmapped mounts. So we just need to pass down the mount's
      userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      e4fed17a
    • C
      btrfs: relax restrictions for SNAP_DESTROY_V2 with subvolids · aabb34e7
      Christian Brauner committed
      So far we prevented the deletion of subvolumes and snapshots by
      subvolume id, which is possible with the BTRFS_SUBVOL_SPEC_BY_ID flag.
      
      This restriction is necessary on idmapped mounts because deletion by id
      allows filesystem wide subvolume and snapshot deletions and thus can
      escape the scope of what's exposed under the mount identified by the fd
      passed with the ioctl.
      
      Deletion by subvolume id works by looking for an alias of the parent of
      the subvolume or snapshot to be deleted. The parent alias can be
      anywhere in the filesystem. However, as long as the alias of the parent
      that is found is the same as the one identified by the file descriptor
      passed through the ioctl we can allow the deletion.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      aabb34e7
    • C
      btrfs: allow idmapped SNAP_DESTROY ioctls · c4ed533b
      Christian Brauner committed
      Destroying subvolumes and snapshots is an important feature of btrfs.
      Both operations are available to unprivileged users if the filesystem
      has been mounted with the "user_subvol_rm_allowed" mount option. Allow
      subvolume and snapshot deletion on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      
      Subvolumes and snapshots can either be deleted by specifying their name
      or - if BTRFS_IOC_SNAP_DESTROY_V2 is used - by their subvolume or
      snapshot id if the BTRFS_SUBVOL_SPEC_BY_ID is set.
      
      This feature is blocked on idmapped mounts as this allows filesystem
      wide subvolume deletions and thus can escape the scope of what's exposed
      under the mount identified by the fd passed with the ioctl.
      
      This means that even the root or CAP_SYS_ADMIN capable user can't delete
      a subvolume via BTRFS_SUBVOL_SPEC_BY_ID. This is intentional.
      
      The root user is currently already subject to permission checks in
      btrfs_may_delete() including whether the inode's i_uid/i_gid of the
      directory the subvolume is located in have a mapping in the caller's
      idmapping. For this to fail isn't currently possible since a btrfs
      filesystem can't be mounted with a non-initial idmapping but it shows
      that even the root user would fail to delete a subvolume if the relevant
      inode isn't mapped in their idmapping. The idmapped mount case is the
      same in principle.
      
      This isn't a huge problem since a root user wanting to delete arbitrary
      subvolumes can always create another (even detached) mount without
      an idmapping attached.
      
      In addition, we will allow BTRFS_SUBVOL_SPEC_BY_ID for cases where the
      subvolume to delete is directly located under the inode referenced by the
      fd passed for the ioctl() in a follow-up commit.
      
      Here is an example where a btrfs subvolume is deleted through a
      subvolume mount that does not expose the subvolume to be deleted, but it
      can still be deleted by using the subvolume id:
      
        /* Compile the following program as "delete_by_spec". */
      
        #define _GNU_SOURCE
        #include <fcntl.h>
        #include <inttypes.h>
        #include <linux/btrfs.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <sys/stat.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static int rm_subvolume_by_id(int fd, uint64_t subvolid)
        {
      	 struct btrfs_ioctl_vol_args_v2 args = {};
      	 int ret;
      
      	 args.flags = BTRFS_SUBVOL_SPEC_BY_ID;
      	 args.subvolid = subvolid;
      
      	 ret = ioctl(fd, BTRFS_IOC_SNAP_DESTROY_V2, &args);
      	 if (ret < 0)
      		 return -1;
      
      	 return 0;
        }
      
        int main(int argc, char *argv[])
        {
      	 uint64_t subvolid;
      
      	 if (argc < 3)
      		 exit(1);
      
      	 fprintf(stderr, "Opening %s\n", argv[1]);
      	 int fd = open(argv[1], O_RDONLY | O_CLOEXEC | O_DIRECTORY);
      	 if (fd < 0)
      		 exit(2);
      
      	 /* Subvolume ids are 64 bit, so parse into a uint64_t, not an int. */
      	 subvolid = strtoull(argv[2], NULL, 10);
      
      	 fprintf(stderr, "Deleting subvolume with subvolid %" PRIu64 "\n", subvolid);
      	 int ret = rm_subvolume_by_id(fd, subvolid);
      	 if (ret < 0)
      		 exit(3);
      
      	 exit(0);
        }
      
        truncate -s 10G btrfs.img
        mkfs.btrfs btrfs.img
        export LOOPDEV=$(sudo losetup -f --show btrfs.img)
        sudo mount ${LOOPDEV} /mnt
        sudo chown $(id -u):$(id -g) /mnt
        btrfs subvolume create /mnt/A
        # The parent path must exist before /mnt/B/C can be created.
        btrfs subvolume create /mnt/B
        btrfs subvolume create /mnt/B/C
        # Get subvolume id via:
        sudo btrfs subvolume show /mnt/A
        # Save subvolid
        SUBVOLID=<nr>
        sudo umount /mnt
        sudo mount ${LOOPDEV} -o subvol=B/C,user_subvol_rm_allowed /mnt
        ./delete_by_spec /mnt ${SUBVOLID}
      
      With idmapped mounts this can potentially be used by users to delete
      subvolumes/snapshots they would otherwise not have access to as the
      idmapping would be applied to an inode that is not exposed in the mount
      of the subvolume.
      
      The fact that this is a filesystem wide operation suggests it might be a
      good idea to expose this under a separate ioctl that clearly indicates
      this. In essence, the file descriptor passed with the ioctl is merely
      used to identify the filesystem on which to operate when
      BTRFS_SUBVOL_SPEC_BY_ID is used.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c4ed533b
    • C
      btrfs: allow idmapped SNAP_CREATE/SUBVOL_CREATE ioctls · 4d4340c9
      Christian Brauner committed
      Creating subvolumes and snapshots is one of the core features of btrfs
      and is even available to unprivileged users. Make it possible to use
      subvolume and snapshot creation on idmapped mounts. This is a fairly
      straightforward operation since all the permission checking helpers are
      already capable of handling idmapped mounts. So we just need to pass
      down the mount's userns.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      4d4340c9
    • C
      btrfs: check whether fsgid/fsuid are mapped during subvolume creation · 5474bf40
      Christian Brauner committed
      When a new subvolume is created btrfs currently doesn't check whether
      the fsgid/fsuid of the caller actually have a mapping in the user
      namespace attached to the filesystem. The VFS always checks this to make
      sure that the caller's fsgid/fsuid can be represented on-disk. This is
      most relevant for filesystems that can be mounted inside user namespaces
      but it is in general a good hardening measure to prevent unrepresentable
      gid/uid from being written to disk.
      
      Since we want to support idmapped mounts for btrfs ioctls to create
      subvolumes in follow-up patches this becomes important since we want to
      make sure the fsgid/fsuid of the caller as mapped according to the
      idmapped mount can be represented on-disk. Simply add the missing
      fsuidgid_has_mapping() line from the VFS may_create() version to
      btrfs_may_create().
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      5474bf40
    • G
      btrfs: allocate btrfs_ioctl_defrag_range_args on stack · c853a578
      Goldwyn Rodrigues committed
      Instead of using kmalloc() to allocate btrfs_ioctl_defrag_range_args,
      allocate btrfs_ioctl_defrag_range_args on the stack. The size is
      reasonably small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_defrag_range_args) = 48
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      c853a578
    • G
      btrfs: allocate btrfs_ioctl_quota_rescan_args on stack · 0afb603a
      Goldwyn Rodrigues committed
      Instead of using kmalloc() to allocate btrfs_ioctl_quota_rescan_args,
      allocate btrfs_ioctl_quota_rescan_args on the stack. The size is
      reasonably small and ioctls are called in process context.
      
      sizeof(btrfs_ioctl_quota_rescan_args) = 64
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0afb603a
    • M
      btrfs: introduce btrfs_search_backwards function · 0ff40a91
      Marcos Paulo de Souza committed
      It's a common practice to start a search using offset (u64)-1, which is
      the u64 maximum value, meaning that we want the search_slot function to
      be positioned at the last item with the same objectid and type.
      
      Once we are in this position, it's a matter of searching backwards
      by calling btrfs_previous_item, which will check if we need to go to
      a previous leaf and do other necessary checks, only to be sure that we
      are at the last offset of the same objectid and type.
      
      The new btrfs_search_backwards function does all these steps when
      necessary, and can be used to avoid code duplication.
      Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      0ff40a91
    • B
      btrfs: initial fsverity support · 14605409
      Boris Burkov committed
      Add support for fsverity in btrfs. To support the generic interface in
      fs/verity, we add two new item types in the fs tree for inodes with
      verity enabled. One stores the per-file verity descriptor and btrfs
      verity item and the other stores the Merkle tree data itself.
      
      Verity checking is done in end_page_read just before a page is marked
      uptodate. This naturally handles a variety of edge cases like holes,
      preallocated extents, and inline extents. Some care needs to be taken to
      not try to verify pages past the end of the file, which are accessed by
      the generic buffered file reading code under some circumstances like
      reading to the end of the last page and trying to read again. Direct IO
      on a verity file falls back to buffered reads.
      
      Verity relies on PageChecked for the Merkle tree data itself to avoid
      re-walking up shared paths in the tree. For this reason, we need to
      cache the Merkle tree data. Since the file is immutable after verity is
      turned on, we can cache it at an index past EOF.
      
      Use the new inode ro_flags to store verity on the inode item, so that we
      can enable verity on a file, then roll back to an older kernel and still
      mount the file system and read the file. Since we can't safely write the
      file anymore without ruining the invariants of the Merkle tree, we mark
      a ro_compat flag on the file system when a file has verity enabled.
      Acked-by: Eric Biggers <ebiggers@google.com>
      Co-developed-by: Chris Mason <clm@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      14605409
    • B
      btrfs: add ro compat flags to inodes · 77eea05e
      Boris Burkov committed
      Currently, inode flags are fully backwards incompatible in btrfs. If we
      introduce a new inode flag, then tree-checker will detect it and fail.
      This can even cause us to fail to mount entirely. To make it possible to
      introduce new flags which can be read-only compatible, like VERITY, we
      add new ro flags to btrfs without treating them quite so harshly in
      tree-checker. A read-only file system can survive an unexpected flag,
      and can be mounted.
      
      As for the implementation, it unfortunately gets a little complicated.
      
      The on-disk representation of the inode, btrfs_inode_item, has an __le64
      for flags but the in-memory representation, btrfs_inode, uses a u32.
      David Sterba had the nice idea that we could reclaim those wasted 32 bits
      on disk and use them for the new ro_compat flags.
      
      It turns out that the tree-checker code which checks for unknown flags
      is broken, and ignores the upper 32 bits we are hoping to use. The issue
      is that the flags use the literal 1 rather than 1ULL, so the flags are
      signed ints, and one of them is specifically (1 << 31). As a result, the
      mask which ORs the flags is a negative integer on machines where int is
      32 bit twos complement. When tree-checker evaluates the expression:
      
        (btrfs_inode_flags(leaf, iitem) & ~BTRFS_INODE_FLAG_MASK)
      
      The mask is something like 0x80000abc, which gets promoted to u64 with
      sign extension to 0xffffffff80000abc. Negating that 64 bit mask leaves
      all the upper bits zeroed, and we can't detect unexpected flags.
      
      This suggests that we can't use those bits after all. Luckily, we have
      good reason to believe that they are zero anyway. Inode flags are
      metadata, which is always checksummed, so any bit flips that would
      introduce 1s would cause a checksum failure anyway (excluding the
      improbable case of the checksum getting corrupted in exactly the
      matching way).
      
      Further, unless the 1 << 31 flag is used, the cast to u64 of the 32 bit
      inode flag should preserve its value and not add leading ones
      (at least for twos complement). The only place that flag
      (BTRFS_INODE_ROOT_ITEM_INIT) is used is in a special inode embedded in
      the root item, and indeed for that inode we see 0xffffffff80000000 as
      the flags on disk. However, that inode is never seen by tree checker,
      nor is it used in a context where verity might be meaningful.
      Theoretically, a future ro flag might cause trouble on that inode, so we
      should proactively clean up that mess before it does.
      
      With the introduction of the new ro flags, keep two separate unsigned
      masks and check them against the appropriate u32. Since we no longer run
      afoul of sign extension, this also stops writing out 0xffffffff80000000
      in root_item inodes going forward.
      Signed-off-by: Boris Burkov <boris@bur.io>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      77eea05e
    • Q
      btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Qu Wenruo committed
      Now that we support data and metadata read-write for subpage, remove
      the RO requirement for subpage mount.
      
      There are some extra limitations though:
      
      - For now, subpage RW mount is still considered experimental
        Thus that mount warning will still be there.
      
      - No compression support
        There is still quite a lot of hard-coded PAGE_SIZE usage, and quite a
        few call sites use extent_clear_unlock_delalloc() to unlock locked_page.
        This will screw up subpage helpers.
      
        Now for subpage RW mount, no matter what mount option or inode attr is
        set, no writes will be compressed, although reading compressed data
        works without problems.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly due to the fact that filemap_fdatawrite_range() will
        trigger more writes than the range specified.
        In fallocate calls, this behavior can make us write back a range which
        could be inlined, before we enlarge the i_size.
      
        This is a very special corner case, and even current btrfs check won't
        report an error on such inline extent + regular extent.
        But considering how much effort has been put into preventing such
        inline + regular, I'd prefer to cut off inline extents completely until
        we have a good solution.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      95ea0486
  11. 22 Jun, 2021 1 commit
  12. 21 Jun, 2021 3 commits