1. 08 11月, 2018 1 次提交
    • O
      Btrfs: fix missing delayed iputs on unmount · d6fd0ae2
      Omar Sandoval 提交于
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9b ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d6fd0ae2
  2. 06 11月, 2018 1 次提交
    • L
      btrfs: fix pinned underflow after transaction aborted · fcd5e742
      Lu Fengqi 提交于
      When running generic/475, we may get the following warning in dmesg:
      
      [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
      [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
      [ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
      [ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
      [ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
      [ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
      [ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
      [ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
      [ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
      [ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 6902.137836] Call Trace:
      [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
      [ 6902.140181]  ? kthread_stop+0x146/0x1f0
      [ 6902.141277]  generic_shutdown_super+0x6c/0x100
      [ 6902.142517]  kill_anon_super+0x14/0x30
      [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
      [ 6902.144790]  deactivate_locked_super+0x2f/0x70
      [ 6902.146014]  cleanup_mnt+0x3b/0x70
      [ 6902.147020]  task_work_run+0x9e/0xd0
      [ 6902.148036]  do_syscall_64+0x470/0x600
      [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 6902.151640] RIP: 0033:0x7f45077a6a7b
      [ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
      [ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
      [ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
      [ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
      [ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
      [ 6902.167553] irq event stamp: 0
      [ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 6902.176407] ---[ end trace 463138c2986b275c ]---
      [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
      [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
      
      In the above line there's "pinned=18446744073708158976" which is an
      unsigned u64 value of -1392640, an obvious underflow.
      
      When transaction_kthread is running cleanup_transaction(), another
      fsstress is running btrfs_commit_transaction(). The
      btrfs_finish_extent_commit() may get the same range as
      btrfs_destroy_pinned_extent() got, which causes the pinned underflow.
      
      Fixes: d4b450cd ("Btrfs: fix race between transaction commit and empty block group removal")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fcd5e742
  3. 15 10月, 2018 6 次提交
    • J
      btrfs: assert on non-empty delayed iputs · e187831e
      Josef Bacik 提交于
      I ran into an issue where there was some reference being held on an
      inode that I couldn't track.  This assert wasn't triggered, but it at
      least rules out we're doing something stupid.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e187831e
    • D
      btrfs: dev-replace: move replace members out of fs_info · 7f8d236a
      David Sterba 提交于
      The replace_wait and bio_counter were mistakenly added to fs_info in
      commit c404e0dc ("Btrfs: fix use-after-free in the finishing
      procedure of the device replace"), but they logically belong to
      fs_info::dev_replace. Besides, bio_counter is a very generic name and is
      confusing in bare fs_info context.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f8d236a
    • D
      btrfs: remove btrfs_dev_replace::read_locks · 3280f874
      David Sterba 提交于
      This member seems to be copied from the extent_buffer locking scheme and
      is at least used to assert that the read lock/unlock is properly nested.
      In some way. While the _inc/_dec are called inside the read lock
      section, the asserts are both inside and outside, so the ordering is not
      guaranteed and we can see read/inc/dec ordered in any way
      (theoretically).
      
      A missing call of btrfs_dev_replace_clear_lock_blocking could cause
      unexpected read_locks count, so this at least looks like a valid
      assertion, but this will become unnecessary with later updates.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3280f874
    • L
      Btrfs: delayed-refs: use rb_first_cached for ref_tree · e3d03965
      Liu Bo 提交于
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href->ref_tree need to get the first entry, this
      converts href->ref_tree to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e3d03965
    • L
      Btrfs: delayed-refs: use rb_first_cached for href_root · 5c9d028b
      Liu Bo 提交于
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href_root need to get the first entry, this
      converts href_root to use rb_first_cached().
      
      This patch is first in the sequenct of similar updates to other rbtrees
      and this is analysis of the expected behaviour and improvements.
      
      There's a common pattern:
      
      while (node = rb_first) {
              entry = rb_entry(node)
              next = rb_next(node)
              rb_erase(node)
              cleanup(entry)
      }
      
      rb_first needs to traverse the tree up to logN depth, rb_erase can
      completely reshuffle the tree. With the caching we'll skip the traversal
      in rb_first.  That's a cached memory access vs looped pointer
      dereference trade-off that IMHO has a clear winner.
      
      Measurements show there's not much difference in a sample tree with
      10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
      caching and pointer chasing are unpredictable though.
      
      Further optimzations can be done to avoid the expensive rb_erase step.
      In some cases it's ok to process the nodes in any order, so the tree can
      be traversed in post-order, not rebalancing the children nodes and just
      calling free. Care must be taken regarding the next node.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog from mail discussions ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5c9d028b
    • M
      btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro 提交于
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objecitd;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objecitd' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fd786e6
  4. 18 8月, 2018 1 次提交
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  5. 06 8月, 2018 18 次提交
  6. 30 5月, 2018 1 次提交
    • L
      Btrfs: add parent_transid parameter to veirfy_level_key · ff76a864
      Liu Bo 提交于
      As verify_level_key() is checked after verify_parent_transid(), i.e.
      
      if (verify_parent_transid())
         ret = -EIO;
      else if (verify_level_key())
         ret = -EUCLEAN;
      
      if parent_transid is 0, verify_parent_transid() skips verifying
      parent_transid and considers eb as valid, and if verify_level_key()
      reports something wrong, we're not going to know if it's caused by
      corrupted metadata or non-checkecd eb (e.g. stale eb).
      
      The stale eb can be from an outdated raid1 mirror after a degraded
      mount, see eg "btrfs: fix reading stale metadata blocks after degraded
      raid1 mounts" (02a3307a) for more details.
      
      @parent_transid is able to tell whether the eb's generation has been
      verified by the caller.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ff76a864
  7. 29 5月, 2018 7 次提交
    • O
      Btrfs: get rid of unused orphan infrastructure · a575ceeb
      Omar Sandoval 提交于
      Now that we don't keep long-standing reservations for orphan items,
      root->orphan_block_rsv isn't used. We can git rid of it, along with:
      
      - root->orphan_lock, which was used to protect root->orphan_block_rsv
      - root->orphan_inodes, which was used as a refcount for root->orphan_block_rsv
      - BTRFS_INODE_ORPHAN_META_RESERVED, which was used to track reservations
        in root->orphan_block_rsv
      - btrfs_orphan_commit_root(), which was the last user of any of these
        and does nothing else
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a575ceeb
    • Q
      btrfs: Do super block verification before writing it to disk · 75cb857d
      Qu Wenruo 提交于
      There are already 2 reports about strangely corrupted super blocks,
      where csum still matches but extra garbage gets slipped into super block.
      
      The corruption would looks like:
      ------
      superblock: bytenr=65536, device=/dev/sdc1
      ---------------------------------------------------------
      csum_type               41700 (INVALID)
      csum                    0x3b252d3a [match]
      bytenr                  65536
      flags                   0x1
                              ( WRITTEN )
      magic                   _BHRfS_M [match]
      ...
      incompat_flags          0x5b22400000000169
                              ( MIXED_BACKREF |
                                COMPRESS_LZO |
                                BIG_METADATA |
                                EXTENDED_IREF |
                                SKINNY_METADATA |
                                unknown flag: 0x5b22400000000000 )
      ...
      ------
      Or
      ------
      superblock: bytenr=65536, device=/dev/mapper/x
      ---------------------------------------------------------
      csum_type              35355 (INVALID)
      csum_size              32
      csum                   0xf0dbeddd [match]
      bytenr                 65536
      flags                  0x1
                             ( WRITTEN )
      magic                  _BHRfS_M [match]
      ...
      incompat_flags         0x176d200000000169
                             ( MIXED_BACKREF |
                               COMPRESS_LZO |
                               BIG_METADATA |
                               EXTENDED_IREF |
                               SKINNY_METADATA |
                               unknown flag: 0x176d200000000000 )
      ------
      
      Obviously, csum_type and incompat_flags get some garbage, but its csum
      still matches, which means kernel calculates the csum based on corrupted
      super block memory.
      And after manually fixing these values, the filesystem is completely
      healthy without any problem exposed by btrfs check.
      
      Although the cause is still unknown, at least detect it and prevent further
      corruption.
      
      Both reports have same symptoms, there's an overwrite on offset 192 of
      the superblock, by 4 bytes. The superblock structure is not allocated or
      freed and stays in the memory for the whole filesystem lifetime, so it's
      not a use-after-free kind of error on someone else's leaked page.
      
      As a vague point for the problable cause is mentioning of other system
      freezing related to graphic card drivers.
      Reported-by: NKen Swenson <flat@imo.uto.moe>
      Reported-by: NBen Parsons <9parsonsb@gmail.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add brief analysis of the reports ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      75cb857d
    • Q
      btrfs: Refactor btrfs_check_super_valid · 069ec957
      Qu Wenruo 提交于
      Refactor btrfs_check_super_valid:
      
      1) Rename it to btrfs_validate_mount_super()
         Now it's more obvious when the function should be called.
      
      2) Extract core check routine into validate_super()
         Later write time check can reuse it, and if needed, we could also
         use validate_super() to check each super block.
      
      3) Add more comments about btrfs_validate_mount_super()
         Mostly about what it doesn't check and when it should be called.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ rename to validate_super ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      069ec957
    • Q
      btrfs: Move btrfs_check_super_valid() to avoid forward declaration · 21a852b0
      Qu Wenruo 提交于
      Move btrfs_check_super_valid() before its single caller to avoid forward
      declaration.
      
      Though such code motion is not recommended as it pollutes git history,
      in this case the following patches would need to add new forward
      declarations for static functions that we want to avoid.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      21a852b0
    • A
      btrfs: move btrfs_raid_group values to btrfs_raid_attr table · 41a6e891
      Anand Jain 提交于
      Add a new member struct btrfs_raid_attr::bg_flag so that
      btrfs_raid_array can maintain the bit map flag of the raid type, and
      so we can drop btrfs_raid_group.
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      41a6e891
    • D
      btrfs: track running balance in a simpler way · 3009a62f
      David Sterba 提交于
      Currently fs_info::balance_running is 0 or 1 and does not use the
      semantics of atomics. The pause and cancel check for 0, that can happen
      only after __btrfs_balance exits for whatever reason.
      
      Parallel calls to balance ioctl may enter btrfs_ioctl_balance multiple
      times but will block on the balance_mutex that protects the
      fs_info::flags bit.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3009a62f
    • D
      btrfs: kill btrfs_fs_info::volume_mutex · dccdb07b
      David Sterba 提交于
      Mutual exclusion of device add/rm and balance was done by the volume
      mutex up to version 3.7. The commit 5ac00add ("Btrfs: disallow
      mutually exclusive admin operations from user mode") added a bit that
      essentially tracked the same information.
      
      The status bit has an advantage over a mutex that it can be set without
      restrictions of function context, so it started to be used in the
      mount-time resuming of balance or device replace.
      
      But we don't really need to track the same information in two ways.
      
      1) After the previous cleanups, the main ioctl handlers for
         add/del/resize copy the EXCL_OP bit next to the volume mutex, here
         it's clearly safe.
      
      2) Resuming balance during mount or after rw remount will set only the
         EXCL_OP bit and the volume_mutex is held in the kernel thread that
         calls btrfs_balance.
      
      3) Resuming device replace during mount or after rw remount is done
         after balance and is excluded by the EXCL_OP bit. It does not take
         the volume_mutex at all and completely relies on the EXCL_OP bit.
      
      4) The resuming of balance and dev-replace cannot hapen at the same time
         as the ioctls cannot be started in parallel. Nevertheless, a crafted
         image could trigger that and a warning is printed.
      
      5) Balance is normally excluded by EXCL_OP and also uses own mutex to
         protect against concurrent access to its status data. There's some
         trickery to maintain the right lock nesting in case we need to
         reexamine the status in btrfs_ioctl_balance. The volume_mutex is
         removed and the unlock/lock sequence is left in place as we might
         expect other waiters to proceed.
      
      6) Similar to 5, the unlock/lock sequence is kept in
         btrfs_cancel_balance to allow waiters to continue.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dccdb07b
  8. 17 5月, 2018 1 次提交
    • N
      btrfs: Fix delalloc inodes invalidation during transaction abort · fe816d0f
      Nikolay Borisov 提交于
      When a transaction is aborted btrfs_cleanup_transaction is called to
      cleanup all the various in-flight bits and pieces which migth be
      active. One of those is delalloc inodes - inodes which have dirty
      pages which haven't been persisted yet. Currently the process of
      freeing such delalloc inodes in exceptional circumstances such as
      transaction abort boiled down to calling btrfs_invalidate_inodes whose
      sole job is to invalidate the dentries for all inodes related to a
      root. This is in fact wrong and insufficient since such delalloc inodes
      will likely have pending pages or ordered-extents and will be linked to
      the sb->s_inode_list. This means that unmounting a btrfs instance with
      an aborted transaction could potentially lead inodes/their pages
      visible to the system long after their superblock has been freed. This
      in turn leads to a "use-after-free" situation once page shrink is
      triggered. This situation could be simulated by running generic/019
      which would cause such inodes to be left hanging, followed by
      generic/176 which causes memory pressure and page eviction which lead
      to touching the freed super block instance. This situation is
      additionally detected by the unmount code of VFS with the following
      message:
      
      "VFS: Busy inodes after unmount of Self-destruct in 5 seconds.  Have a nice day..."
      
      Additionally btrfs hits WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      in free_fs_root for the same reason.
      
      This patch aims to rectify the sitaution by doing the following:
      
      1. Change btrfs_destroy_delalloc_inodes so that it calls
      invalidate_inode_pages2 for every inode on the delalloc list, this
      ensures that all the pages of the inode are released. This function
      boils down to calling btrfs_releasepage. During test I observed cases
      where inodes on the delalloc list were having an i_count of 0, so this
      necessitates using igrab to be sure we are working on a non-freed inode.
      
      2. Since calling btrfs_releasepage might queue delayed iputs move the
      call out to btrfs_cleanup_transaction in btrfs_error_commit_super before
      calling run_delayed_iputs for the last time. This is necessary to ensure
      that delayed iputs are run.
      
      Note: this patch is tagged for 4.14 stable but the fix applies to older
      versions too but needs to be backported manually due to conflicts.
      
      CC: stable@vger.kernel.org # 4.14.x: 2b877331: btrfs: Split btrfs_del_delalloc_inode into 2 functions
      CC: stable@vger.kernel.org # 4.14.x
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add comment to igrab ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fe816d0f
  9. 18 4月, 2018 1 次提交
    • Q
      btrfs: qgroup: Commit transaction in advance to reduce early EDQUOT · a514d638
      Qu Wenruo 提交于
      Unlike previous method that tries to commit transaction inside
      qgroup_reserve(), this time we will try to commit transaction using
      fs_info->transaction_kthread to avoid nested transaction and no need to
      worry about locking context.
      
      Since it's an asynchronous function call and we won't wait for
      transaction commit, unlike previous method, we must call it before we
      hit the qgroup limit.
      
      So this patch will use the ratio and size of qgroup meta_pertrans
      reservation as indicator to check if we should trigger a transaction
      commit.  (meta_prealloc won't be cleaned in transaction committ, it's
      useless anyway)
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a514d638
  10. 13 4月, 2018 1 次提交
    • Q
      btrfs: Only check first key for committed tree blocks · 5d41be6f
      Qu Wenruo 提交于
      When looping btrfs/074 with many cpus (>= 8), it's possible to trigger
      kernel warning due to first key verification:
      
      [ 4239.523446] WARNING: CPU: 5 PID: 2381 at fs/btrfs/disk-io.c:460 btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.523830] Modules linked in:
      [ 4239.524630] RIP: 0010:btree_read_extent_buffer_pages+0x1ad/0x210
      [ 4239.527101] Call Trace:
      [ 4239.527251]  read_tree_block+0x42/0x70
      [ 4239.527434]  read_node_slot+0xd2/0x110
      [ 4239.527632]  push_leaf_right+0xad/0x1b0
      [ 4239.527809]  split_leaf+0x4ea/0x700
      [ 4239.527988]  ? leaf_space_used+0xbc/0xe0
      [ 4239.528192]  ? btrfs_set_lock_blocking_rw+0x99/0xb0
      [ 4239.528416]  btrfs_search_slot+0x8cc/0xa40
      [ 4239.528605]  btrfs_insert_empty_items+0x71/0xc0
      [ 4239.528798]  __btrfs_run_delayed_refs+0xa98/0x1680
      [ 4239.529013]  btrfs_run_delayed_refs+0x10b/0x1b0
      [ 4239.529205]  btrfs_commit_transaction+0x33/0xaf0
      [ 4239.529445]  ? start_transaction+0xa8/0x4f0
      [ 4239.529630]  btrfs_alloc_data_chunk_ondemand+0x1b0/0x4e0
      [ 4239.529833]  btrfs_check_data_free_space+0x54/0xa0
      [ 4239.530045]  btrfs_delalloc_reserve_space+0x25/0x70
      [ 4239.531907]  btrfs_direct_IO+0x233/0x3d0
      [ 4239.532098]  generic_file_direct_write+0xcb/0x170
      [ 4239.532296]  btrfs_file_write_iter+0x2bb/0x5f4
      [ 4239.532491]  aio_write+0xe2/0x180
      [ 4239.532669]  ? lock_acquire+0xac/0x1e0
      [ 4239.532839]  ? __might_fault+0x3e/0x90
      [ 4239.533032]  do_io_submit+0x594/0x860
      [ 4239.533223]  ? do_io_submit+0x594/0x860
      [ 4239.533398]  SyS_io_submit+0x10/0x20
      [ 4239.533560]  ? SyS_io_submit+0x10/0x20
      [ 4239.533729]  do_syscall_64+0x75/0x1d0
      [ 4239.533979]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      [ 4239.534182] RIP: 0033:0x7f8519741697
      
      The problem here is, at btree_read_extent_buffer_pages() we don't have
      acquired read/write lock on that extent buffer, only basic info like
      level/bytenr is reliable.
      
      So race condition leads to such false alert.
      
      However in current call site, it's impossible to acquire proper lock
      without race window.
      To fix the problem, we only verify first key for committed tree blocks
      (whose generation is no larger than fs_info->last_trans_committed), so
      the content of such tree blocks will not change and there is no need to
      get read/write lock.
      Reported-by: NNikolay Borisov <nborisov@suse.com>
      Fixes: 581c1760 ("btrfs: Validate child tree block's level and first key")
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5d41be6f
  11. 12 4月, 2018 2 次提交
    • D
      btrfs: replace GPL boilerplate by SPDX -- sources · c1d7c514
      David Sterba 提交于
      Remove GPL boilerplate text (long, short, one-line) and keep the rest,
      ie. personal, company or original source copyright statements. Add the
      SPDX header.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c1d7c514
    • L
      Btrfs: clean up resources during umount after trans is aborted · af722733
      Liu Bo 提交于
      Currently if some fatal errors occur, like all IO get -EIO, resources
      would be cleaned up when
      a) transaction is being committed or
      b) BTRFS_FS_STATE_ERROR is set
      
      However, in some rare cases, resources may be left alone after transaction
      gets aborted and umount may run into some ASSERT(), e.g.
      ASSERT(list_empty(&block_group->dirty_list));
      
      For case a), in btrfs_commit_transaciton(), there're several places at the
      beginning where we just call btrfs_end_transaction() without cleaning up
      resources.  For case b), it is possible that the trans handle doesn't have
      any dirty stuff, then only trans hanlde is marked as aborted while
      BTRFS_FS_STATE_ERROR is not set, so resources remain in memory.
      
      This makes btrfs also check BTRFS_FS_STATE_TRANS_ABORTED to make sure that
      all resources won't stay in memory after umount.
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      af722733