1. 27 Oct, 2021 4 commits
  2. 26 Oct, 2021 1 commit
  3. 07 Sep, 2021 1 commit
  4. 23 Aug, 2021 2 commits
    • btrfs: allow read-write for 4K sectorsize on 64K page size systems · 95ea0486
      Qu Wenruo committed
      Now that data and metadata read-write are supported for subpage, remove
      the RO requirement for subpage mounts.
      
      There are some extra limitations though:
      
      - For now, subpage RW mount is still considered experimental
        Thus that mount warning will still be there.
      
      - No compression support
        There are still quite a few hard-coded PAGE_SIZE uses, and quite a few
        call sites use extent_clear_unlock_delalloc() to unlock locked_page.
        This would break the subpage helpers.

        For a subpage RW mount, writes are therefore never compressed, no matter
        which mount option or inode attribute is set. Reading compressed data
        works without problems.
      
      - No defrag for subpage case
        The defrag support for subpage case will come in later patches, which
        will also rework the defrag workflow.
      
      - No inline extent will be created
        This is mostly because filemap_fdatawrite_range() will
        trigger more writes than the range specified.
        In fallocate calls, this behavior can make us write back a range that
        could be inlined before we enlarge the i_size.

        This is a very special corner case, and even the current btrfs check won't
        report an error on such an inline extent + regular extent combination.
        But considering how much effort has been put into preventing such inline +
        regular layouts, I'd prefer to cut off inline extents completely until we
        have a good solution.

      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: subpage: reject raid56 filesystem and profile conversion · c8050b3b
      Qu Wenruo committed
      RAID56 is not only unsafe due to its write-hole problem, but also has
      tons of hardcoded PAGE_SIZE.
      
      Disable it for subpage support for now.

      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 29 Jul, 2021 1 commit
    • btrfs: calculate number of eb pages properly in csum_tree_block · 7280305e
      David Sterba committed
      When building with -Warray-bounds on systems with 64K pages, there is a
      warning:
      
        fs/btrfs/disk-io.c: In function ‘csum_tree_block’:
        fs/btrfs/disk-io.c:226:34: warning: array subscript 1 is above array bounds of ‘struct page *[1]’ [-Warray-bounds]
          226 |   kaddr = page_address(buf->pages[i]);
              |                        ~~~~~~~~~~^~~
        ./include/linux/mm.h:1630:48: note: in definition of macro ‘page_address’
         1630 | #define page_address(page) lowmem_page_address(page)
              |                                                ^~~~
        In file included from fs/btrfs/ctree.h:32,
                         from fs/btrfs/disk-io.c:23:
        fs/btrfs/extent_io.h:98:15: note: while referencing ‘pages’
           98 |  struct page *pages[1];
              |               ^~~~~
      
      The compiler has no way to know that in that case the nodesize is exactly
      PAGE_SIZE, so the resulting number of pages will be correct (1).
      
      Let's use num_extent_pages that makes the case nodesize == PAGE_SIZE
      explicitly 1.
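
      The page-count rule described above can be sketched as follows. This is
      a minimal illustration only: eb_num_pages_sketch() and its parameters
      are hypothetical names, not the kernel helper, and the real
      num_extent_pages() may be written differently.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for the kernel helper: number of pages backing an
 * extent buffer of `len` bytes when pages are `page_size` bytes.
 * A node that is not larger than a page still occupies one page, so the
 * result is never 0.
 */
static size_t eb_num_pages_sketch(size_t len, size_t page_size)
{
	/* Page sizes are powers of two. */
	assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

	size_t n = len / page_size;

	return n ? n : 1;
}
```

      With a 16K node on a 64K page system the division yields 0, and the
      sketch returns 1, so the page array is only ever indexed at 0.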

      Reported-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 22 Jun, 2021 5 commits
    • btrfs: rip out btrfs_space_info::total_bytes_pinned · 138a12d8
      Josef Bacik committed
      We used this in may_commit_transaction() in order to determine if we
      needed to commit the transaction.  However we no longer have that logic
      and thus have no use of this counter anymore, so delete it.

      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: send: fix crash when memory allocations trigger reclaim · 35b22c19
      Filipe Manana committed
      When doing a send we don't expect the task to ever start a transaction
      after the initial check that verifies if commit roots match the regular
      roots. This is because after that we set current->journal_info with a
      stub (special value) that signals we are in send context, so that we take
      a read lock on an extent buffer when reading it from disk and verifying
      it is valid (its generation matches the generation stored in the parent).
      This stub was introduced in 2014 by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO") in order to fix a concurrency issue
      between send and balance.
      
      However there is one particular exception where we end up needing to start
      a transaction and when this happens it results in a crash with a stack
      trace like the following:
      
      [60015.902283] kernel: WARNING: CPU: 3 PID: 58159 at arch/x86/include/asm/kfence.h:44 kfence_protect_page+0x21/0x80
      [60015.902292] kernel: Modules linked in: uinput rfcomm snd_seq_dummy (...)
      [60015.902384] kernel: CPU: 3 PID: 58159 Comm: btrfs Not tainted 5.12.9-300.fc34.x86_64 #1
      [60015.902387] kernel: Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A88XN-WIFI, BIOS F6 12/24/2015
      [60015.902389] kernel: RIP: 0010:kfence_protect_page+0x21/0x80
      [60015.902393] kernel: Code: ff 0f 1f 84 00 00 00 00 00 55 48 89 fd (...)
      [60015.902396] kernel: RSP: 0018:ffff9fb583453220 EFLAGS: 00010246
      [60015.902399] kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9fb583453224
      [60015.902401] kernel: RDX: ffff9fb583453224 RSI: 0000000000000000 RDI: 0000000000000000
      [60015.902402] kernel: RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
      [60015.902404] kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000002
      [60015.902406] kernel: R13: ffff9fb583453348 R14: 0000000000000000 R15: 0000000000000001
      [60015.902408] kernel: FS:  00007f158e62d8c0(0000) GS:ffff93bd37580000(0000) knlGS:0000000000000000
      [60015.902410] kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [60015.902412] kernel: CR2: 0000000000000039 CR3: 00000001256d2000 CR4: 00000000000506e0
      [60015.902414] kernel: Call Trace:
      [60015.902419] kernel:  kfence_unprotect+0x13/0x30
      [60015.902423] kernel:  page_fault_oops+0x89/0x270
      [60015.902427] kernel:  ? search_module_extables+0xf/0x40
      [60015.902431] kernel:  ? search_bpf_extables+0x57/0x70
      [60015.902435] kernel:  kernelmode_fixup_or_oops+0xd6/0xf0
      [60015.902437] kernel:  __bad_area_nosemaphore+0x142/0x180
      [60015.902440] kernel:  exc_page_fault+0x67/0x150
      [60015.902445] kernel:  asm_exc_page_fault+0x1e/0x30
      [60015.902450] kernel: RIP: 0010:start_transaction+0x71/0x580
      [60015.902454] kernel: Code: d3 0f 84 92 00 00 00 80 e7 06 0f 85 63 (...)
      [60015.902456] kernel: RSP: 0018:ffff9fb5834533f8 EFLAGS: 00010246
      [60015.902458] kernel: RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
      [60015.902460] kernel: RDX: 0000000000000801 RSI: 0000000000000000 RDI: 0000000000000039
      [60015.902462] kernel: RBP: ffff93bc0a7eb800 R08: 0000000000000001 R09: 0000000000000000
      [60015.902463] kernel: R10: 0000000000098a00 R11: 0000000000000001 R12: 0000000000000001
      [60015.902464] kernel: R13: 0000000000000000 R14: ffff93bc0c92b000 R15: ffff93bc0c92b000
      [60015.902468] kernel:  btrfs_commit_inode_delayed_inode+0x5d/0x120
      [60015.902473] kernel:  btrfs_evict_inode+0x2c5/0x3f0
      [60015.902476] kernel:  evict+0xd1/0x180
      [60015.902480] kernel:  inode_lru_isolate+0xe7/0x180
      [60015.902483] kernel:  __list_lru_walk_one+0x77/0x150
      [60015.902487] kernel:  ? iput+0x1a0/0x1a0
      [60015.902489] kernel:  ? iput+0x1a0/0x1a0
      [60015.902491] kernel:  list_lru_walk_one+0x47/0x70
      [60015.902495] kernel:  prune_icache_sb+0x39/0x50
      [60015.902497] kernel:  super_cache_scan+0x161/0x1f0
      [60015.902501] kernel:  do_shrink_slab+0x142/0x240
      [60015.902505] kernel:  shrink_slab+0x164/0x280
      [60015.902509] kernel:  shrink_node+0x2c8/0x6e0
      [60015.902512] kernel:  do_try_to_free_pages+0xcb/0x4b0
      [60015.902514] kernel:  try_to_free_pages+0xda/0x190
      [60015.902516] kernel:  __alloc_pages_slowpath.constprop.0+0x373/0xcc0
      [60015.902521] kernel:  ? __memcg_kmem_charge_page+0xc2/0x1e0
      [60015.902525] kernel:  __alloc_pages_nodemask+0x30a/0x340
      [60015.902528] kernel:  pipe_write+0x30b/0x5c0
      [60015.902531] kernel:  ? set_next_entity+0xad/0x1e0
      [60015.902534] kernel:  ? switch_mm_irqs_off+0x58/0x440
      [60015.902538] kernel:  __kernel_write+0x13a/0x2b0
      [60015.902541] kernel:  kernel_write+0x73/0x150
      [60015.902543] kernel:  send_cmd+0x7b/0xd0
      [60015.902545] kernel:  send_extent_data+0x5a3/0x6b0
      [60015.902549] kernel:  process_extent+0x19b/0xed0
      [60015.902551] kernel:  btrfs_ioctl_send+0x1434/0x17e0
      [60015.902554] kernel:  ? _btrfs_ioctl_send+0xe1/0x100
      [60015.902557] kernel:  _btrfs_ioctl_send+0xbf/0x100
      [60015.902559] kernel:  ? enqueue_entity+0x18c/0x7b0
      [60015.902562] kernel:  btrfs_ioctl+0x185f/0x2f80
      [60015.902564] kernel:  ? psi_task_change+0x84/0xc0
      [60015.902569] kernel:  ? _flat_send_IPI_mask+0x21/0x40
      [60015.902572] kernel:  ? check_preempt_curr+0x2f/0x70
      [60015.902576] kernel:  ? selinux_file_ioctl+0x137/0x1e0
      [60015.902579] kernel:  ? expand_files+0x1cb/0x1d0
      [60015.902582] kernel:  ? __x64_sys_ioctl+0x82/0xb0
      [60015.902585] kernel:  __x64_sys_ioctl+0x82/0xb0
      [60015.902588] kernel:  do_syscall_64+0x33/0x40
      [60015.902591] kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
      [60015.902595] kernel: RIP: 0033:0x7f158e38f0ab
      [60015.902599] kernel: Code: ff ff ff 85 c0 79 9b (...)
      [60015.902602] kernel: RSP: 002b:00007ffcb2519bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
      [60015.902605] kernel: RAX: ffffffffffffffda RBX: 00007ffcb251ae00 RCX: 00007f158e38f0ab
      [60015.902607] kernel: RDX: 00007ffcb2519cf0 RSI: 0000000040489426 RDI: 0000000000000004
      [60015.902608] kernel: RBP: 0000000000000004 R08: 00007f158e297640 R09: 00007f158e297640
      [60015.902610] kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 0000000000000000
      [60015.902612] kernel: R13: 0000000000000002 R14: 00007ffcb251aee0 R15: 0000558c1a83e2a0
      [60015.902615] kernel: ---[ end trace 7bbc33e23bb887ae ]---
      
      This happens because when writing to the pipe, by calling kernel_write(),
      we end up doing page allocations using GFP_HIGHUSER | __GFP_ACCOUNT as the
      gfp flags, which allow reclaim to happen if there is memory pressure. This
      allocation happens at fs/pipe.c:pipe_write().
      
      If the reclaim is triggered, inode eviction can be triggered and that in
      turn can result in starting a transaction if the inode has a link count
      of 0. The transaction start happens early on during eviction, when we call
      btrfs_commit_inode_delayed_inode() at btrfs_evict_inode(). This happens if
      there is currently an open file descriptor for an inode with a link count
      of 0 and the reclaim task gets a reference on the inode before that
      descriptor is closed, in which case the reclaim task ends up doing the
      final iput that triggers the inode eviction.
      
      When we have assertions enabled (CONFIG_BTRFS_ASSERT=y), this triggers
      the following assertion at transaction.c:start_transaction():
      
          /* Send isn't supposed to start transactions. */
          ASSERT(current->journal_info != BTRFS_SEND_TRANS_STUB);
      
      And when assertions are not enabled, it triggers a crash since after that
      assertion we cast current->journal_info into a transaction handle pointer
      and then dereference it:
      
         if (current->journal_info) {
             WARN_ON(type & TRANS_EXTWRITERS);
             h = current->journal_info;
             refcount_inc(&h->use_count);
             (...)
      
      Which obviously results in a crash due to an invalid memory access.
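
      To illustrate why the dereference faults, here is a minimal sketch of a
      sentinel stored in a pointer-typed field. All names are hypothetical,
      and the sentinel value (void *)1 is an assumption made for the example.
      The guard shown is not the fix adopted by this commit, which removes
      the stub entirely.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed sentinel: a small non-NULL value that is not a valid pointer. */
#define SEND_TRANS_STUB_SKETCH ((void *)1)

/* Hypothetical, reduced transaction handle. */
struct trans_handle_sketch {
	int use_count;
};

/*
 * Returns the handle only if journal_info really points to one. NULL and
 * the stub are both rejected, so the caller never dereferences the sentinel.
 * Code that only tests for non-NULL would treat the stub as a handle and
 * access memory at a tiny, unmapped address.
 */
static struct trans_handle_sketch *
handle_from_journal_info_sketch(void *journal_info)
{
	if (journal_info == NULL || journal_info == SEND_TRANS_STUB_SKETCH)
		return NULL;
	return journal_info;
}
```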
      
      The same type of issue can happen during other memory allocations we
      do directly in the send code with kmalloc (and friends) as they use
      GFP_KERNEL and therefore may trigger reclaim too, which started to
      happen since 2016 after commit e780b0d1 ("btrfs: send: use
      GFP_KERNEL everywhere").
      
      The issue could be solved by setting up a NOFS context for the entire
      send operation so that reclaim could not be triggered when allocating
      memory or pages through kernel_write(). However that is not very friendly
      and we can in fact get rid of the send stub because:
      
      1) The stub was introduced way back in 2014 by commit a26e8c9f
         ("Btrfs: don't clear uptodate if the eb is under IO") to solve an
         issue exclusive to when send and balance are running in parallel,
         however there were other problems between balance and send and we no
         longer allow balance and send to run concurrently since
         commit 9e967495 ("Btrfs: prevent send failures and crashes due
         to concurrent relocation"). More generically the issues are between
         send and relocation, and that last commit eliminated only the
         possibility of having send and balance run concurrently, but shrinking
         a device also can trigger relocation, and on zoned filesystems we have
         relocation of partially used block groups triggered automatically as
         well. The previous patch that has a subject of:
      
         "btrfs: ensure relocation never runs while we have send operations running"
      
         Addresses all the remaining cases that can trigger relocation.
      
      2) We can actually allow starting and even committing transactions while
         in a send context if needed because send is not holding any locks that
         would block the start or the commit of a transaction.
      
      So get rid of all the logic added by commit a26e8c9f ("Btrfs: don't
      clear uptodate if the eb is under IO"). We can now always call
      clear_extent_buffer_uptodate() at verify_parent_transid() since send is
      the only case that uses commit roots without having a transaction open or
      without holding the commit_root_sem.

      Reported-by: Chris Murphy <lists@colorremedies.com>
      Link: https://lore.kernel.org/linux-btrfs/CAJCQCtRQ57=qXo3kygwpwEBOU_CA_eKvdmjP52sU=eFvuVOEGw@mail.gmail.com/
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: ensure relocation never runs while we have send operations running · 1cea5cf0
      Filipe Manana committed
      Relocation and send do not play well together because while send is
      running a block group can be relocated, a transaction committed and
      the respective disk extents get re-allocated and written to or discarded
      while send is about to do something with the extents.
      
      This was explained in commit 9e967495 ("Btrfs: prevent send failures
      and crashes due to concurrent relocation"), which prevented balance and
      send from running in parallel but it did not address one remaining case
      where chunk relocation can happen: shrinking a device (and device deletion
      which shrinks a device's size to 0 before deleting the device).
      
      We now also have one more case where relocation is triggered: on zoned
      filesystems partially used block groups get relocated by a background
      thread, introduced in commit 18bb8bbf ("btrfs: zoned: automatically
      reclaim zones").
      
      So make sure that instead of preventing balance from running when there
      are ongoing send operations, we prevent relocation from happening.
      This uses the infrastructure recently added by a patch that has the
      subject: "btrfs: add cancellable chunk relocation support".
      
      It also adds a spinlock used solely for mutual exclusion between
      send and relocation. Previously fs_info->balance_mutex was used, which
      would make an attempt to run send block while waiting for balance to
      finish, which can take a lot of time on large filesystems.
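
      The exclusion scheme described above can be sketched as a counter of
      running sends plus a relocation flag. This is a single-threaded
      illustration with hypothetical names. In the kernel both fields would
      be read and updated under the new spinlock, which the sketch omits, and
      the assumption that send fails fast rather than waiting is inferred
      from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state shared by send and relocation. */
struct excl_state_sketch {
	unsigned int send_in_progress;	/* number of running send operations */
	bool reloc_running;		/* a chunk relocation is in flight */
};

/* Send refuses to start during relocation instead of sleeping. */
static bool send_begin_sketch(struct excl_state_sketch *s)
{
	if (s->reloc_running)
		return false;
	s->send_in_progress++;
	return true;
}

static void send_end_sketch(struct excl_state_sketch *s)
{
	assert(s->send_in_progress > 0);
	s->send_in_progress--;
}

/* Relocation refuses to start while any send is running. */
static bool reloc_begin_sketch(struct excl_state_sketch *s)
{
	if (s->send_in_progress > 0 || s->reloc_running)
		return false;
	s->reloc_running = true;
	return true;
}

static void reloc_end_sketch(struct excl_state_sketch *s)
{
	s->reloc_running = false;
}
```

      Because each check only inspects two fields, the critical section is
      short, which is what makes a spinlock a reasonable fit compared to
      holding a mutex for the duration of a balance.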

      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: shorten integrity checker extent data mount option · cbeaae4f
      David Sterba committed
      Subjectively, CHECK_INTEGRITY_INCLUDING_EXTENT_DATA is quite long and
      calling it CHECK_INTEGRITY_DATA still keeps the meaning and matches the
      mount option name.

      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: fix typos in comments · 1a9fd417
      David Sterba committed
      Fix typos that have snuck in since the last round. Found by codespell.

      Signed-off-by: David Sterba <dsterba@suse.com>
  7. 21 Jun, 2021 4 commits
  8. 04 Jun, 2021 1 commit
  9. 21 Apr, 2021 2 commits
  10. 19 Apr, 2021 2 commits
  11. 18 Mar, 2021 2 commits
    • btrfs: fix subvolume/snapshot deletion not triggered on mount · 8d488a8c
      Filipe Manana committed
      During the mount procedure we are calling btrfs_orphan_cleanup() against
      the root tree, which will find all orphan items in this tree. When an
      orphan item corresponds to a deleted subvolume/snapshot (instead of an
      inode space cache), the cleanup must not delete the orphan item, because that will
      cause btrfs_find_orphan_roots() to not find the orphan item and therefore
      not add the corresponding subvolume root to the list of dead roots, which
      results in the subvolume's tree never being deleted by the cleanup thread.
      
      The same applies to the remount from RO to RW path.
      
      Fix this by making btrfs_find_orphan_roots() run before calling
      btrfs_orphan_cleanup() against the root tree.
      
      A test case for fstests will follow soon.

      Reported-by: Robbie Ko <robbieko@synology.com>
      Link: https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c5da@synology.com/
      Fixes: 638331fa ("btrfs: fix transaction leak and crash after cleaning up orphans on RO mount")
      CC: stable@vger.kernel.org # 5.11+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: initialize device::fs_info always · 820a49da
      Josef Bacik committed
      Neal reported a panic when trying to use -o rescue=all:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000030
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP NOPTI
        CPU: 0 PID: 696 Comm: mount Tainted: G        W         5.12.0-rc2+ #296
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
        RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200
        RSP: 0018:ffffafaec1483bb8 EFLAGS: 00010286
        RAX: 0000000000000000 RBX: ffff9a5715bcb298 RCX: 0000000000000070
        RDX: ffff9a5703248000 RSI: ffff9a57052ea150 RDI: ffff9a5715bca400
        RBP: ffff9a57052ea150 R08: 0000000000000070 R09: ffff9a57052ea150
        R10: 000130faf0741c10 R11: 0000000000000000 R12: ffff9a5703700000
        R13: 0000000000000000 R14: ffff9a5715bcb278 R15: ffff9a57052ea150
        FS:  00007f600d122c40(0000) GS:ffff9a577bc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000030 CR3: 0000000112a46005 CR4: 0000000000370ef0
        Call Trace:
         ? btrfs_init_dev_stats+0x1f/0xf0
         ? kmem_cache_alloc+0xef/0x1f0
         btrfs_init_dev_stats+0x5f/0xf0
         open_ctree+0x10cb/0x1720
         btrfs_mount_root.cold+0x12/0xea
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         vfs_kern_mount.part.0+0x71/0xb0
         btrfs_mount+0x10d/0x380
         legacy_get_tree+0x27/0x40
         vfs_get_tree+0x25/0xb0
         path_mount+0x433/0xa00
         __x64_sys_mount+0xe3/0x120
         do_syscall_64+0x33/0x40
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      This happens because when we call btrfs_init_dev_stats we do
      device->fs_info->dev_root.  However device->fs_info isn't initialized
      because we were only calling btrfs_init_devices_late() if we properly
      read the device root.  However, we don't actually need the device root to
      init the devices; this function simply assigns the devices their
      ->fs_info pointer, so it needs to be done unconditionally
      so that we can properly dereference device->fs_info in rescue
      cases.

      Reported-by: Neal Gompa <ngompa13@gmail.com>
      CC: stable@vger.kernel.org # 5.11+
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  12. 12 Feb, 2021 1 commit
    • btrfs: initialize fs_info::csum_size earlier in open_ctree · 83c68bbc
      Su Yue committed
      User reported that btrfs-progs misc-tests/028-superblock-recover fails:
      
            [TEST/misc]   028-superblock-recover
        unexpected success: mounted fs with corrupted superblock
        test failed for case 028-superblock-recover
      
      The test case expects that a broken image with a bad superblock will be
      refused at mount time. However, the test image passed the superblock
      csum check and was successfully mounted.
      
      Commit 55fc29be ("btrfs: use cached value of fs_info::csum_size
      everywhere") replaced all calls to btrfs_super_csum_size() with
      fs_info::csum_size. This includes a place where fs_info->csum_size
      is not yet initialized, so btrfs_check_super_csum() passes because memcmp()
      with len 0 always returns 0.
      
      Fix it by caching csum size in btrfs_fs_info::csum_size once we know the
      csum type in superblock is valid in open_ctree().
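
      The zero-length comparison can be reproduced outside the kernel. In
      this minimal sketch, super_csum_matches_sketch() is a hypothetical
      stand-in for the comparison step, with csum_size playing the role of
      the cached fs_info::csum_size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical checksum comparison. With csum_size == 0, memcmp() compares
 * nothing and returns 0, so any stored checksum appears to match.
 */
static bool super_csum_matches_sketch(const uint8_t *stored,
				      const uint8_t *computed,
				      size_t csum_size)
{
	assert(stored != NULL && computed != NULL);

	return memcmp(stored, computed, csum_size) == 0;
}
```

      Two different 4-byte checksums compare as equal when csum_size is still
      0 and as different once csum_size is set to 4, which is why the size
      has to be cached before the superblock check runs.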
      
      Link: https://github.com/kdave/btrfs-progs/issues/250
      Fixes: 55fc29be ("btrfs: use cached value of fs_info::csum_size everywhere")
      Signed-off-by: Su Yue <l@damenly.su>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  13. 09 Feb, 2021 14 commits
    • btrfs: zoned: reorder log node allocation on zoned filesystem · 3ddebf27
      Naohiro Aota committed
      This is the 3/3 patch to enable tree-log on zoned filesystems.
      
      The allocation order of nodes of "fs_info->log_root_tree" and nodes of
      "root->log_root" is not the same as the writing order of them. So, the
      writing causes unaligned write errors.
      
      Reorder the allocation of them by delaying allocation of the root node of
      "fs_info->log_root_tree," so that the node buffers can go out sequentially
      to devices.
      
      Cc: Filipe Manana <fdmanana@gmail.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: extend zoned allocator to use dedicated tree-log block group · 40ab3be1
      Naohiro Aota committed
      This is the 1/3 patch to enable tree log on zoned filesystems.
      
      The tree-log feature does not work on a zoned filesystem as is. Blocks for
      a tree-log tree are allocated mixed with other metadata blocks and btrfs
      writes and syncs the tree-log blocks to devices at the time of fsync(),
      which has a different timing than a global transaction commit. As a
      result, both writing tree-log blocks and writing other metadata blocks
      become non-sequential writes that zoned filesystems must avoid.
      
      Introduce a dedicated block group for tree-log blocks, so that tree-log
      blocks and other metadata blocks can be separate write streams.  As a
      result, each write stream can now be written to devices separately.
      "fs_info->treelog_bg" tracks the dedicated block group, and btrfs assigns
      "treelog_bg" on demand at tree-log block allocation time.
      
      This commit extends the zoned block allocator to use the block group.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: split alloc_log_tree() · 6ab6ebb7
      Naohiro Aota committed
      This is a preparation patch for the next patch. Split alloc_log_tree()
      into two parts. The first part, which allocates the tree structure, remains
      in alloc_log_tree(), and the second part, which allocates the tree node, is
      moved into btrfs_alloc_log_tree_node().

      Also export the latter part, as it is to be used in the next patch.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: do not use async metadata checksum on zoned filesystems · 4eef29ef
      Naohiro Aota committed
      On zoned filesystems, btrfs uses per-fs zoned_meta_io_lock to serialize
      the metadata write IOs.
      
      Even with this serialization, write bios sent from btree_write_cache_pages
      can be reordered by async checksum workers as these workers are per CPU
      and not per zone.
      
      To preserve write bio ordering, we disable async metadata checksum on a
      zoned filesystem. This does not result in lower performance with HDDs as
      a single CPU core is fast enough to do checksum for a single zone write
      stream with the maximum possible bandwidth of the device. If multiple
      zones are being written simultaneously, HDD seek overhead lowers the
      achievable maximum bandwidth, resulting again in a per zone checksum
      serialization not affecting the performance.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: serialize metadata IO · 0bc09ca1
      Naohiro Aota committed
      We cannot use zone append for writing metadata, because the B-tree nodes
      have references to each other using logical addresses. Without knowing
      the address in advance, we cannot construct the tree in the first place.
      So we need to serialize write IOs for metadata.
      
      We cannot add a mutex around allocation and submission because metadata
      blocks are allocated in an earlier stage to build up B-trees.
      
      Add a zoned_meta_io_lock and hold it during metadata IO submission in
      btree_write_cache_pages() to serialize IOs.
      
      Furthermore, this adds a per-block group metadata IO submission pointer
      "meta_write_pointer" to ensure sequential writing, which can break when
      attempting to write back blocks in an unfinished transaction. If the
      writing out failed because of a hole and the write out is for data
      integrity (WB_SYNC_ALL), it returns EAGAIN.
      
      A caller like fsync() code should handle this properly e.g. by falling
      back to a full transaction commit.
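
      The write-pointer check described above can be sketched as follows. The
      struct, function name and return convention are hypothetical and only
      illustrate the decision. The kernel code splits this logic across
      several functions and uses its own types.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-block-group state: next logical address to be written. */
struct block_group_sketch {
	uint64_t meta_write_pointer;
};

/*
 * Decide whether the extent buffer at [start, start + len) may be submitted.
 * Returns 1 if it continues the sequential stream (the pointer is advanced),
 * 0 if it is skipped for now, and -EAGAIN if the write was needed for data
 * integrity (sync_all, i.e. WB_SYNC_ALL) but a hole precedes it.
 */
static int check_meta_write_sketch(struct block_group_sketch *bg,
				   uint64_t start, uint64_t len, bool sync_all)
{
	assert(len != 0);

	if (start != bg->meta_write_pointer)
		return sync_all ? -EAGAIN : 0;

	bg->meta_write_pointer = start + len;
	return 1;
}
```

      A caller that receives -EAGAIN would then fall back to a full
      transaction commit, as the message above suggests for fsync().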

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing · cfe94440
      Naohiro Aota committed
      Zoned filesystems use REQ_OP_ZONE_APPEND bios for writing to actual
      devices.
      
      Let btrfs_end_bio() and btrfs_op() be aware of it, by mapping
      REQ_OP_ZONE_APPEND to BTRFS_MAP_WRITE and using btrfs_op() instead of
      bio_op().
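
      The mapping can be sketched with stand-in enums. The enum names and
      values below are illustrative only, not the block layer's or btrfs's
      definitions, and btrfs_op_sketch() takes the operation directly rather
      than a bio.

```c
#include <assert.h>

/* Stand-ins for block-layer request ops and btrfs map ops. */
enum req_op_sketch { OP_READ, OP_WRITE, OP_DISCARD, OP_ZONE_APPEND };
enum map_op_sketch { MAP_READ, MAP_WRITE, MAP_DISCARD };

static enum map_op_sketch btrfs_op_sketch(enum req_op_sketch op)
{
	switch (op) {
	case OP_DISCARD:
		return MAP_DISCARD;
	case OP_WRITE:
	case OP_ZONE_APPEND:	/* zone append is treated as a write */
		return MAP_WRITE;
	case OP_READ:
		return MAP_READ;
	}

	/* Unknown op: flag it in debug builds, fall back to read otherwise. */
	assert(!"unknown request op");
	return MAP_READ;
}
```

      Completion code that asks "was this a write?" via the mapped value then
      handles zone append bios without a separate case.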

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: redirty released extent buffers · d3575156
      Naohiro Aota committed
      Tree manipulating operations like merging nodes often release
      once-allocated tree nodes. Such nodes are cleaned so that pages in the
      node are not uselessly written out. On zoned volumes, however, such
      optimization blocks the following IOs as the cancellation of the write
      out of the freed blocks breaks the sequential write sequence expected by
      the device.
      
      Introduce a list of clean and unwritten extent buffers that have been
      released in a transaction. Redirty the buffers so that
      btree_write_cache_pages() can send proper bios to the devices.
      
      In addition, clear the entire content of the extent buffer so as not to
      confuse raw block scanners, e.g. 'btrfs check'. Once the content is cleared,
      csum_dirty_buffer() complains about a bytenr mismatch, so skip the
      check and the checksum using the newly introduced buffer flag
      EXTENT_BUFFER_NO_CHECK.

      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: do not load fs_info::zoned from incompat flag · b53429ba
      Johannes Thumshirn committed
      Don't set the zoned flag in fs_info as soon as we encounter the
      incompat filesystem flag for a zoned filesystem on mount. The zoned flag
      in fs_info is in a union together with the zone_size, so setting it too
      early will result in setting an incorrect zone_size as well.
      
      Once the correct zone_size is read from the device, we can rely on the
      zoned flag in fs_info as well to determine if the filesystem is zoned.
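
      The aliasing problem can be shown with a small union. The struct below
      is a hypothetical slice of fs_info; the field types are an assumption
      made for the sketch (C11 anonymous union, both members 64-bit).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical slice of fs_info: both names alias the same storage. */
struct fs_info_sketch {
	union {
		uint64_t zone_size;	/* zone size in bytes, 0 if not zoned */
		uint64_t zoned;		/* non-zero means "zoned" */
	};
};

/* Early approach: mark zoned as soon as the incompat flag is seen. */
static void mark_zoned_from_incompat_sketch(struct fs_info_sketch *fs)
{
	fs->zoned = 1;	/* also makes zone_size read back as 1 byte */
}

/* Approach described above: store the size read from the device. */
static void set_zone_size_from_device_sketch(struct fs_info_sketch *fs,
					     uint64_t zone_size)
{
	assert(zone_size != 0);
	fs->zone_size = zone_size;	/* zoned is now non-zero as well */
}
```

      After the first function runs, any code reading zone_size sees a bogus
      1-byte zone. After the second, the same storage answers both "is it
      zoned?" and "how large is a zone?" correctly.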

      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: zoned: defer loading zone info after opening trees · 73651042
      Naohiro Aota committed
      This is a preparation patch to implement zone emulation on a regular
      device.
      
      To emulate a zoned filesystem on a regular (non-zoned) device, we need to
      decide an emulated zone size. Instead of making it a compile-time static
      value, we'll make it configurable at mkfs time. Since we have one zone ==
      one device extent restriction, we can determine the emulated zone size
      from the size of a device extent. We can extend btrfs_get_dev_zone_info()
      to show a regular device filled with conventional zones once the zone size
      is decided.
      
      The current call site of btrfs_get_dev_zone_info() during the mount
      process comes before the file system trees are loaded, so we don't know
      the size of a device extent at that point. Thus we can't slice a regular
      device into conventional zones.

      This patch introduces btrfs_get_dev_zone_info_all_devices() to load the
      zone info for all the devices, and calls it from open_ctree() after
      loading the trees.
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
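      As a rough illustration of slicing a regular device into emulated
      conventional zones once the zone size is known, consider the sketch
      below. The function names are hypothetical, and counting a trailing
      partial region as its own zone is an assumption of this example rather
      than something the commit message states.

```c
#include <stdint.h>

/* Number of emulated zones needed to cover a device of dev_bytes.
 * Returns 0 while the zone size is still unknown (e.g. before the trees,
 * and therefore the device extent size, have been loaded). */
static inline uint64_t emulated_zone_count(uint64_t dev_bytes,
                                           uint64_t zone_bytes)
{
    if (zone_bytes == 0)
        return 0;
    /* Round up so a trailing partial region still gets a zone. */
    return dev_bytes / zone_bytes + (dev_bytes % zone_bytes != 0);
}

/* Index of the emulated zone containing a byte offset.
 * Precondition: zone_bytes != 0. */
static inline uint64_t emulated_zone_index(uint64_t offset,
                                           uint64_t zone_bytes)
{
    return offset / zone_bytes;
}
```

      With a 10 GiB device and a 256 MiB zone size, emulated_zone_count()
      yields 40 zones. A zone size of 0 models the situation the commit
      fixes, where zone info was requested before the device extent size
      could be known.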
    • btrfs: allow read-only mount of 4K sector size fs on 64K page system · 0bb3eb3e
      Committed by Qu Wenruo
      This adds the basic RO mount ability for 4K sector size on 64K page
      systems.
      
      Currently we only plan to support 4K and 64K page systems.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: introduce subpage metadata validation check · 371cdc07
      Committed by Qu Wenruo
      The subpage metadata validation check differs in two ways:
      
      - Read must finish in one bvec
        Since we're just reading one subpage range in one page, it should
        never be split into two bios or two bvecs.
      
      - How to grab the existing eb
        Instead of grabbing the eb using page->private, we have to search the
        radix tree, as we don't have any direct pointer at hand.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: improve preemptive background space flushing · 576fa348
      Committed by Josef Bacik
      Currently, if we ever have to flush space because we do not have enough,
      we allocate a ticket and attach it to the space_info, and then
      systematically flush things in the filesystem that hold space
      reservations until our space is reclaimed.
      
      However, this has a latency cost: we must go to sleep and wait for the
      flushing to make progress before we are woken up and allowed to continue
      doing our work.
      
      In order to address that we used to kick off the async worker to flush
      space preemptively, so that we could be reclaiming space hopefully
      before any tasks needed to stop and wait for space to reclaim.
      
      When I introduced the ticketed ENOSPC stuff this broke slightly, in that
      we were using tickets to indicate whether we were done flushing.
      No tickets, no more flushing.  However, this meant that we essentially
      never preemptively flushed.  This caused a write performance regression
      that Nikolay noticed in an unrelated patch that removed the committing
      of the transaction during btrfs_end_transaction().
      
      Before that patch, btrfs_end_transaction() would see that we were low
      on space and would commit the transaction.  This was bad because in
      this particular case you could end up with thousands and thousands of
      transactions being committed during the 5 minute reproducer.  With the
      patch to remove this behavior we got much more sane transaction
      commits, but we ended up slower because we would write for a while,
      flush, write for a while, flush again.
      
      To address this we need to reinstate a preemptive flushing mechanism.
      However, it is distinctly different from our ticketed flushing in that
      it doesn't have tickets to base its decisions on.  Instead of bolting
      this logic into our existing flushing work, add another worker to handle
      this preemptive flushing.  Here we will attempt to be slightly
      intelligent about the things that we flush, attempting to balance
      between whichever pool is taking up the most space.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
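      The idea of balancing preemptive flushing toward whichever pool holds
      the most space might be sketched as follows. Everything here is
      hypothetical: the pool set (delalloc, delayed items, delayed refs), the
      struct, and the enum are assumptions chosen for illustration, and the
      selection rule is a plain "largest reservation wins", not the kernel's
      actual heuristic.

```c
#include <stdint.h>

/* Hypothetical flush targets for the preemptive worker. */
enum flush_target {
    FLUSH_NONE,
    FLUSH_DELALLOC,
    FLUSH_DELAYED_ITEMS,
    FLUSH_DELAYED_REFS,
};

/* Hypothetical snapshot of how much space each pool currently reserves. */
struct reservations {
    uint64_t delalloc_bytes;
    uint64_t delayed_items_bytes;
    uint64_t delayed_refs_bytes;
};

/* Pick the pool holding the most reserved space; on ties the earlier
 * candidate wins, and FLUSH_NONE is returned if nothing is reserved. */
static inline enum flush_target
pick_flush_target(const struct reservations *r)
{
    enum flush_target target = FLUSH_NONE;
    uint64_t best = 0;

    if (r->delalloc_bytes > best) {
        best = r->delalloc_bytes;
        target = FLUSH_DELALLOC;
    }
    if (r->delayed_items_bytes > best) {
        best = r->delayed_items_bytes;
        target = FLUSH_DELAYED_ITEMS;
    }
    if (r->delayed_refs_bytes > best) {
        best = r->delayed_refs_bytes;
        target = FLUSH_DELAYED_REFS;
    }
    return target;
}
```

      Unlike ticketed flushing, nothing in this decision depends on a waiter
      being queued, which is the property the commit message calls out.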
    • btrfs: track ordered bytes instead of just dio ordered bytes · 5deb17e1
      Committed by Josef Bacik
      We track dio_bytes because the shrink delalloc code needs to know if we
      have more DIO in flight than we have normal buffered IO.  The reason is
      that we can't "flush" DIO; we have to just wait on the ordered extents
      to finish.
      
      However this is true of all ordered extents.  If we have more ordered
      space outstanding than dirty pages we should be waiting on ordered
      extents.  We already are ok on this front technically, because we always
      do a FLUSH_DELALLOC_WAIT loop, but I want to use the ordered counter in
      the preemptive flushing code as well, so change this to count all
      ordered bytes instead of just DIO ordered bytes.
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
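      A minimal sketch of the comparison described above, assuming a pair of
      hypothetical counters (the struct and function names are made up for
      this example): every ordered extent is accounted, whether it came from
      DIO or buffered IO, and waiting is preferred once ordered bytes exceed
      the dirty (delalloc) bytes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-filesystem counters. */
struct space_counters {
    uint64_t ordered_bytes;  /* all outstanding ordered extents */
    uint64_t delalloc_bytes; /* dirty pages not yet turned into ordered IO */
};

/* Account an ordered extent; DIO and buffered IO are counted alike. */
static inline void account_ordered(struct space_counters *c, uint64_t bytes)
{
    c->ordered_bytes += bytes;
}

/* More ordered space outstanding than dirty pages: flushing cannot help,
 * so the caller should wait on ordered extents instead. */
static inline bool should_wait_on_ordered(const struct space_counters *c)
{
    return c->ordered_bytes > c->delalloc_bytes;
}
```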
    • btrfs: make btrfs_root::free_objectid hold the next available objectid · 23125104
      Committed by Nikolay Borisov
      Adjust the way free_objectid is initialized: it now stores
      BTRFS_FIRST_FREE_OBJECTID rather than the somewhat arbitrary
      BTRFS_FIRST_FREE_OBJECTID - 1. This change has the added benefit that
      it is no longer necessary to explicitly initialize free_objectid for a
      newly created fs root.
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
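      The "holds the next available objectid" convention can be sketched as a
      post-increment allocator. The struct and function names below are
      hypothetical, and the SKETCH_* constants only mirror the kernel's
      BTRFS_FIRST_FREE_OBJECTID and BTRFS_LAST_FREE_OBJECTID for the purpose
      of the example.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-ins for the kernel constants, assumed for this sketch. */
#define SKETCH_FIRST_FREE_OBJECTID 256ULL
#define SKETCH_LAST_FREE_OBJECTID  (-256ULL)

/* Hypothetical stand-in for the relevant part of btrfs_root. */
struct root_sketch {
    uint64_t free_objectid; /* the NEXT objectid to hand out */
};

static inline void root_sketch_init(struct root_sketch *root)
{
    root->free_objectid = SKETCH_FIRST_FREE_OBJECTID;
}

/* Hand out the stored value, then advance it. With the old convention
 * (store the last used id) this would have been a pre-increment instead. */
static inline int get_free_objectid(struct root_sketch *root, uint64_t *out)
{
    if (root->free_objectid >= SKETCH_LAST_FREE_OBJECTID)
        return -ENOSPC; /* objectid space exhausted */
    *out = root->free_objectid++;
    return 0;
}
```

      Because the field already names the next free id, the first allocation
      on a fresh root returns the first free objectid itself, with no
      off-by-one adjustment at initialization time.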