1. 23 1月, 2019 1 次提交
    • J
      btrfs: wait on ordered extents on abort cleanup · 01634ac5
      Josef Bacik 提交于
      commit 74d5d229b1bf60f93bff244b2dfc0eb21ec32a07 upstream.
      
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      01634ac5
  2. 21 12月, 2018 1 次提交
    • O
      Btrfs: fix missing delayed iputs on unmount · b4c7c826
      Omar Sandoval 提交于
      [ Upstream commit d6fd0ae2 ]
      
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9b ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      b4c7c826
  3. 06 12月, 2018 1 次提交
    • N
      btrfs: Always try all copies when reading extent buffers · aaf249e3
      Nikolay Borisov 提交于
      commit f8397d69daef06d358430d3054662fb597e37c00 upstream.
      
      When a metadata read is served the endio routine btree_readpage_end_io_hook
      is called which eventually runs the tree-checker. If tree-checker fails
      to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
      leads to btree_read_extent_buffer_pages wrongly assuming that all
      available copies of this extent buffer are wrong and failing prematurely.
      Fix this modify btree_read_extent_buffer_pages to read all copies of
      the data.
      
      This failure was exhibitted in xfstests btrfs/124 which would
      spuriously fail its balance operations. The reason was that when balance
      was run following re-introduction of the missing raid1 disk
      __btrfs_map_block would map the read request to stripe 0, which
      corresponded to devid 2 (the disk which is being removed in the test):
      
          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
      	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
      	io_align 65536 io_width 65536 sector_size 4096
      	num_stripes 2 sub_stripes 1
      		stripe 0 devid 2 offset 2156920832
      		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
      		stripe 1 devid 1 offset 3553624064
      		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
      
      This caused read requests for a checksum item that to be routed to the
      stale disk which triggered the aforementioned logic involving
      EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
      the balance operation.
      
      Fixes: a826d6dc ("Btrfs: check items for correctness as we search")
      CC: stable@vger.kernel.org # 4.4+
      Suggested-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      aaf249e3
  4. 21 11月, 2018 1 次提交
    • L
      btrfs: fix pinned underflow after transaction aborted · ec26ad25
      Lu Fengqi 提交于
      commit fcd5e74288f7d36991b1f0fb96b8c57079645e38 upstream.
      
      When running generic/475, we may get the following warning in dmesg:
      
      [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
      [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
      [ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
      [ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
      [ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
      [ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
      [ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
      [ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
      [ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
      [ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 6902.137836] Call Trace:
      [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
      [ 6902.140181]  ? kthread_stop+0x146/0x1f0
      [ 6902.141277]  generic_shutdown_super+0x6c/0x100
      [ 6902.142517]  kill_anon_super+0x14/0x30
      [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
      [ 6902.144790]  deactivate_locked_super+0x2f/0x70
      [ 6902.146014]  cleanup_mnt+0x3b/0x70
      [ 6902.147020]  task_work_run+0x9e/0xd0
      [ 6902.148036]  do_syscall_64+0x470/0x600
      [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 6902.151640] RIP: 0033:0x7f45077a6a7b
      [ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
      [ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
      [ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
      [ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
      [ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
      [ 6902.167553] irq event stamp: 0
      [ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 6902.176407] ---[ end trace 463138c2986b275c ]---
      [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
      [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
      
      In the above line there's "pinned=18446744073708158976" which is an
      unsigned u64 value of -1392640, an obvious underflow.
      
      When transaction_kthread is running cleanup_transaction(), another
      fsstress is running btrfs_commit_transaction(). The
      btrfs_finish_extent_commit() may get the same range as
      btrfs_destroy_pinned_extent() got, which causes the pinned underflow.
      
      Fixes: d4b450cd ("Btrfs: fix race between transaction commit and empty block group removal")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ec26ad25
  5. 18 8月, 2018 1 次提交
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  6. 06 8月, 2018 18 次提交
  7. 30 5月, 2018 1 次提交
    • L
      Btrfs: add parent_transid parameter to veirfy_level_key · ff76a864
      Liu Bo 提交于
      As verify_level_key() is checked after verify_parent_transid(), i.e.
      
      if (verify_parent_transid())
         ret = -EIO;
      else if (verify_level_key())
         ret = -EUCLEAN;
      
      if parent_transid is 0, verify_parent_transid() skips verifying
      parent_transid and considers eb as valid, and if verify_level_key()
      reports something wrong, we're not going to know if it's caused by
      corrupted metadata or non-checkecd eb (e.g. stale eb).
      
      The stale eb can be from an outdated raid1 mirror after a degraded
      mount, see eg "btrfs: fix reading stale metadata blocks after degraded
      raid1 mounts" (02a3307a) for more details.
      
      @parent_transid is able to tell whether the eb's generation has been
      verified by the caller.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff76a864
  8. 29 5月, 2018 7 次提交
    • O
      Btrfs: get rid of unused orphan infrastructure · a575ceeb
      Omar Sandoval 提交于
      Now that we don't keep long-standing reservations for orphan items,
      root->orphan_block_rsv isn't used. We can git rid of it, along with:
      
      - root->orphan_lock, which was used to protect root->orphan_block_rsv
      - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
      - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
        in root->orphan_block_rsv
      - btrfs_orphan_commit_root(), which was the last user of any of these
        and does nothing else
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a575ceeb
    • Q
      btrfs: Do super block verification before writing it to disk · 75cb857d
      Qu Wenruo 提交于
      There are already 2 reports about strangely corrupted super blocks,
      where csum still matches but extra garbage gets slipped into super block.
      
      The corruption would looks like:
      ------
      superblock: bytenr=65536, device=/dev/sdc1
      ---------------------------------------------------------
      csum_type               41700 (INVALID)
      csum                    0x3b252d3a [match]
      bytenr                  65536
      flags                   0x1
                              ( WRITTEN )
      magic                   _BHRfS_M [match]
      ...
      incompat_flags          0x5b22400000000169
                              ( MIXED_BACKREF |
                                COMPRESS_LZO |
                                BIG_METADATA |
                                EXTENDED_IREF |
                                SKINNY_METADATA |
                                unknown flag: 0x5b22400000000000 )
      ...
      ------
      Or
      ------
      superblock: bytenr=65536, device=/dev/mapper/x
      ---------------------------------------------------------
      csum_type              35355 (INVALID)
      csum_size              32
      csum                   0xf0dbeddd [match]
      bytenr                 65536
      flags                  0x1
                             ( WRITTEN )
      magic                  _BHRfS_M [match]
      ...
      incompat_flags         0x176d200000000169
                             ( MIXED_BACKREF |
                               COMPRESS_LZO |
                               BIG_METADATA |
                               EXTENDED_IREF |
                               SKINNY_METADATA |
                               unknown flag: 0x176d200000000000 )
      ------
      
      Obviously, csum_type and incompat_flags get some garbage, but its csum
      still matches, which means kernel calculates the csum based on corrupted
      super block memory.
      And after manually fixing these values, the filesystem is completely
      healthy without any problem exposed by btrfs check.
      
      Although the cause is still unknown, at least detect it and prevent further
      corruption.
      
      Both reports have same symptoms, there's an overwrite on offset 192 of
      the superblock, by 4 bytes. The superblock structure is not allocated or
      freed and stays in the memory for the whole filesystem lifetime, so it's
      not a use-after-free kind of error on someone else's leaked page.
      
      As a vague point for the problable cause is mentioning of other system
      freezing related to graphic card drivers.
      Reported-by: NKen Swenson <flat@imo.uto.moe>
      Reported-by: NBen Parsons <9parsonsb@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add brief analysis of the reports ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      75cb857d
    • Q
      btrfs: Refactor btrfs_check_super_valid · 069ec957
      Qu Wenruo 提交于
      Refactor btrfs_check_super_valid:
      
      1) Rename it to btrfs_validate_mount_super()
         Now it's more obvious when the function should be called.
      
      2) Extract core check routine into validate_super()
         Later write time check can reuse it, and if needed, we could also
         use validate_super() to check each super block.
      
      3) Add more comments about btrfs_validate_mount_super()
         Mostly about what it doesn't check and when it should be called.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to validate_super ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      069ec957
    • Q
      btrfs: Move btrfs_check_super_valid() to avoid forward declaration · 21a852b0
      Qu Wenruo 提交于
      Move btrfs_check_super_valid() before its single caller to avoid forward
      declaration.
      
      Though such code motion is not recommended as it pollutes git history,
      in this case the following patches would need to add new forward
      declarations for static functions that we want to avoid.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      21a852b0
    • A
      btrfs: move btrfs_raid_group values to btrfs_raid_attr table · 41a6e891
      Anand Jain 提交于
      Add a new member struct btrfs_raid_attr::bg_flag so that
      btrfs_raid_array can maintain the bit map flag of the raid type, and
      so we can drop btrfs_raid_group.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      41a6e891
    • D
      btrfs: track running balance in a simpler way · 3009a62f
      David Sterba 提交于
      Currently fs_info::balance_running is 0 or 1 and does not use the
      semantics of atomics. The pause and cancel check for 0, that can happen
      only after __btrfs_balance exits for whatever reason.
      
      Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
      times but will block on the balance_mutex that protects the
      fs_info::flags bit.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3009a62f
    • D
      btrfs: kill btrfs_fs_info::volume_mutex · dccdb07b
      David Sterba 提交于
      Mutual exclusion of device add/rm and balance was done by the volume
      mutex up to version 3.7. The commit 5ac00add ("Btrfs: disallow
      mutually exclusive admin operations from user mode") added a bit that
      essentially tracked the same information.
      
      The status bit has an advantage over a mutex that it can be set without
      restrictions of function context, so it started to be used in the
      mount-time resuming of balance or device replace.
      
      But we don't really need to track the same information in two ways.
      
      1) After the previous cleanups, the main ioctl handlers for
         add/del/resize copy the EXCL_OP bit next to the volume mutex, here
         it's clearly safe.
      
      2) Resuming balance during mount or after rw remount will set only the
         EXCL_OP bit and the volume_mutex is held in the kernel thread that
         calls btrfs_balance.
      
      3) Resuming device replace during mount or after rw remount is done
         after balance and is excluded by the EXCL_OP bit. It does not take
         the volume_mutex at all and completely relies on the EXCL_OP bit.
      
      4) The resuming of balance and dev-replace cannot hapen at the same time
         as the ioctls cannot be started in parallel. Nevertheless, a crafted
         image could trigger that and a warning is printed.
      
      5) Balance is normally excluded by EXCL_OP and also uses own mutex to
         protect against concurrent access to its status data. There's some
         trickery to maintain the right lock nesting in case we need to
         reexamine the status in btrfs_ioctl_balance. The volume_mutex is
         removed and the unlock/lock sequence is left in place as we might
         expect other waiters to proceed.
      
      6) Similar to 5, the unlock/lock sequence is kept in
         btrfs_cancel_balance to allow waiters to continue.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dccdb07b
  9. 17 5月, 2018 1 次提交
    • N
      btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Nikolay Borisov 提交于
      When a transaction is aborted btrfs_cleanup_transaction is called to
      cleanup all the various in-flight bits and pieces which migth be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boiled down to calling btrfs_invalidate_inodes whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially lead inodes/their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation could be simulated by running generic/019
      which would cause such inodes to be left hanging, followed by
      generic/176 which causes memory pressure and page eviction which lead
      to touching the freed super block instance. This situation is
      additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the sitaution by doing the following:
      
      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list, this
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During test I observed cases
      where inodes on the delalloc list were having an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.
      
      2. Since calling btrfs_releasepage might queue delayed iputs move the
      call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
      calling run_delayed_iputs for the last time. This is necessary to ensure
      that delayed iputs are run.
      
      Note: this patch is tagged for 4.14 stable but the fix applies to older
      versions too but needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fe816d0f
  10. 18 4月, 2018 1 次提交
    • Q
      btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo 提交于
      Unlike previous method that tries to commit transaction inside
      qgroup_reserve(), this time we will try to commit transaction using
      fs_info->transaction_kthread to avoid nested transaction and no need to
      worry about locking context.
      
      Since it's an asynchronous function call and we won't wait for
      transaction commit, unlike previous method, we must call it before we
      hit the qgroup limit.
      
      So this patch will use the ratio and size of qgroup meta_pertrans
      reservation as indicator to check if we should trigger a transaction
      commit.  (meta_prealloc won't be cleaned in transaction committ, it's
      useless anyway)
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a514d638
  11. 13 4月, 2018 1 次提交
    • Q
      btrfs: Only check first key for committed tree blocks · 5d41be6f
      Qu Wenruo 提交于
      When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
      kernel warning due to first key verification:
      
      [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.523830] Modules linked in:
      [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.527101] Call Trace:
      [ 4239.527251]  read_tree_block+0x42/0x70
      [ 4239.527434]  read_node_slot+0xd2/0x110
      [ 4239.527632]  push_leaf_right+0xad/0x1b0
      [ 4239.527809]  split_leaf+0x4ea/0x700
      [ 4239.527988]  ? leaf_space_used+0xbc/0xe0
      [ 4239.528192]  ? btrfs_set_lock_blocking_rw+0x99/0xb0
      [ 4239.528416]  btrfs_search_slot+0x8cc/0xa40
      [ 4239.528605]  btrfs_insert_empty_items+0x71/0xc0
      [ 4239.528798]  __btrfs_run_delayed_refs+0xa98/0x1680
      [ 4239.529013]  btrfs_run_delayed_refs+0x10b/0x1b0
      [ 4239.529205]  btrfs_commit_transaction+0x33/0xaf0
      [ 4239.529445]  ? start_transaction+0xa8/0x4f0
      [ 4239.529630]  btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
      [ 4239.529833]  btrfs_check_data_free_space+0x54/0xa0
      [ 4239.530045]  btrfs_delalloc_reserve_space+0x25/0x70
      [ 4239.531907]  btrfs_direct_IO+0x233/0x3d0
      [ 4239.532098]  generic_file_direct_write+0xcb/0x170
      [ 4239.532296]  btrfs_file_write_iter+0x2bb/0x5f4
      [ 4239.532491]  aio_write+0xe2/0x180
      [ 4239.532669]  ? lock_acquire+0xac/0x1e0
      [ 4239.532839]  ? __might_fault+0x3e/0x90
      [ 4239.533032]  do_io_submit+0x594/0x860
      [ 4239.533223]  ? do_io_submit+0x594/0x860
      [ 4239.533398]  SyS_io_submit+0x10/0x20
      [ 4239.533560]  ? SyS_io_submit+0x10/0x20
      [ 4239.533729]  do_syscall_64+0x75/0x1d0
      [ 4239.533979]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [ 4239.534182] RIP: 0033:0x7f8519741697
      
      The problem here is, at btree_read_extent_buffer_pages() we don't have
      acquired read/write lock on that extent buffer, only basic info like
      level/bytenr is reliable.
      
      So race condition leads to such false alert.
      
      However in current call site, it's impossible to acquire proper lock
      without race window.
      To fix the problem, we only verify first key for committed tree blocks
      (whose generation is no larger than fs_info->last_trans_committed), so
      the content of such tree blocks will not change and there is no need to
      get read/write lock.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5d41be6f
  12. 12 4月, 2018 2 次提交
    • D
      btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba 提交于
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c1d7c514
    • L
      Btrfs: clean up resources during umount after trans is aborted · af722733
      Liu Bo 提交于
      Currently if some fatal errors occur, like all IO get -EIO, resources
      would be cleaned up when
      a) transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases, resources may be left alone after transaction
      gets aborted and umount may run into some ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaciton(), there're several places at the
      beginning where we just call btrfs_end_transaction() without cleaning up
      resources.  For case b), it is possible that the trans handle doesn't have
      any dirty stuff, then only trans hanlde is marked as aborted while
      BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
      all resources won't stay in memory after umount.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      af722733
  13. 31 3月, 2018 4 次提交
    • L
      Btrfs: print error messages when failing to read trees · f50f4353
      Liu Bo 提交于
      When mount fails to read trees like fs tree, checksum tree, extent
      tree, etc, there is not enough information about where went wrong.
      
      With this, messages like
      
      "BTRFS warning (device sdf): failed to read root (objectid=7): -5"
      
      would help us a bit.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f50f4353
    • Q
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo 提交于
      We have several reports about node pointer points to incorrect child
      tree blocks, which could have even wrong owner and level but still with
      valid generation and checksum.
      
      Although btrfs check could handle it and print error message like:
      leaf parent key incorrect 60670574592
      
      Kernel doesn't have enough check on this type of corruption correctly.
      At least add such check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level is known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note of this verification is, it needs extra info from nodeptr
      or ROOT_ITEM, so it can't fit into current tree-checker framework, which
      is limited to node/leaf boundary.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      581c1760
    • Q
      btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space · 8287475a
      Qu Wenruo 提交于
      For quota disabled->enable case, it's possible that at reservation time
      quota was not enabled so no bytes were really reserved, while at release
      time, quota was enabled so we will try to release some bytes we didn't
      really own.
      
      Such situation can cause metadata reserveation underflow, for both types,
      also less possible for per-trans type since quota enable will commit
      transaction.
      
      To address this, record qgroup meta reserved bytes into
      root::qgroup_meta_rsv_pertrans and ::prealloc.
      So at releasing time we won't free any bytes we didn't reserve.
      
      For DATA, it's already handled by io_tree, so nothing needs to be done
      there.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8287475a
    • Q
      btrfs: qgroup: Don't use root->qgroup_meta_rsv for qgroup · e1211d0e
      Qu Wenruo 提交于
      Since qgroup has seperate metadata reservation types now, we can
      completely get rid of the old root->qgroup_meta_rsv, which mostly acts
      as current META_PERTRANS reservation type.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e1211d0e