1. 29 5月, 2018 8 次提交
  2. 28 5月, 2018 2 次提交
  3. 02 5月, 2018 1 次提交
    • E
      btrfs: Take trans lock before access running trans in check_delayed_ref · 998ac6d2
      ethanwu 提交于
      In preivous patch:
      Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist
      We avoid starting btrfs transaction and get this information from
      fs_info->running_transaction directly.
      
      When accessing running_transaction in check_delayed_ref, there's a
      chance that current transaction will be freed by commit transaction
      after the NULL pointer check of running_transaction is passed.
      
      After looking all the other places using fs_info->running_transaction,
      they are either protected by trans_lock or holding the transactions.
      
      Fix this by using trans_lock and increasing the use_count.
      
      Fixes: e4c3b2dc ("Btrfs: kill trans in run_delalloc_nocow and btrfs_cross_ref_exist")
      CC: stable@vger.kernel.org # 4.14+
      Signed-off-by: Nethanwu <ethanwu@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      998ac6d2
  4. 21 4月, 2018 1 次提交
    • N
      btrfs: Fix race condition between delayed refs and blockgroup removal · 5e388e95
      Nikolay Borisov 提交于
      When the delayed refs for a head are all run, eventually
      cleanup_ref_head is called which (in case of deletion) obtains a
      reference for the relevant btrfs_space_info struct by querying the bg
      for the range. This is problematic because when the last extent of a
      bg is deleted a race window emerges between removal of that bg and the
      subsequent invocation of cleanup_ref_head. This can result in cache being null
      and either a null pointer dereference or assertion failure.
      
      	task: ffff8d04d31ed080 task.stack: ffff9e5dc10cc000
      	RIP: 0010:assfail.constprop.78+0x18/0x1a [btrfs]
      	RSP: 0018:ffff9e5dc10cfbe8 EFLAGS: 00010292
      	RAX: 0000000000000044 RBX: 0000000000000000 RCX: 0000000000000000
      	RDX: ffff8d04ffc1f868 RSI: ffff8d04ffc178c8 RDI: ffff8d04ffc178c8
      	RBP: ffff8d04d29e5ea0 R08: 00000000000001f0 R09: 0000000000000001
      	R10: ffff9e5dc0507d58 R11: 0000000000000001 R12: ffff8d04d29e5ea0
      	R13: ffff8d04d29e5f08 R14: ffff8d04efe29b40 R15: ffff8d04efe203e0
      	FS:  00007fbf58ead500(0000) GS:ffff8d04ffc00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 00007fe6c6975648 CR3: 0000000013b2a000 CR4: 00000000000006f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	 __btrfs_run_delayed_refs+0x10e7/0x12c0 [btrfs]
      	 btrfs_run_delayed_refs+0x68/0x250 [btrfs]
      	 btrfs_should_end_transaction+0x42/0x60 [btrfs]
      	 btrfs_truncate_inode_items+0xaac/0xfc0 [btrfs]
      	 btrfs_evict_inode+0x4c6/0x5c0 [btrfs]
      	 evict+0xc6/0x190
      	 do_unlinkat+0x19c/0x300
      	 do_syscall_64+0x74/0x140
      	 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      	RIP: 0033:0x7fbf589c57a7
      
      To fix this, introduce a new flag "is_system" to head_ref structs,
      which is populated at insertion time. This allows to decouple the
      querying for the spaceinfo from querying the possibly deleted bg.
      
      Fixes: d7eae340 ("Btrfs: rework delayed ref total_bytes_pinned accounting")
      CC: stable@vger.kernel.org # 4.14+
      Suggested-by: NOmar Sandoval <osandov@osandov.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5e388e95
  5. 18 4月, 2018 1 次提交
    • Q
      btrfs: qgroup: Use independent and accurate per inode qgroup rsv · ff6bc37e
      Qu Wenruo 提交于
      Unlike reservation calculation used in inode rsv for metadata, qgroup
      doesn't really need to care about things like csum size or extent usage
      for the whole tree COW.
      
      Qgroups care more about net change of the extent usage.
      That's to say, if we're going to insert one file extent, it will mostly
      find its place in COWed tree block, leaving no change in extent usage.
      Or causing a leaf split, resulting in one new net extent and increasing
      qgroup number by nodesize.
      Or in an even more rare case, increase the tree level, increasing qgroup
      number by 2 * nodesize.
      
      So here instead of using the complicated calculation for extent
      allocator, which cares more about accuracy and no error, qgroup doesn't
      need that over-estimated reservation.
      
      This patch will maintain 2 new members in btrfs_block_rsv structure for
      qgroup, using much smaller calculation for qgroup rsv, reducing false
      EDQUOT.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      ff6bc37e
  6. 12 4月, 2018 1 次提交
  7. 06 4月, 2018 1 次提交
  8. 31 3月, 2018 13 次提交
    • D
      btrfs: use lockdep_assert_held for mutexes · a32bf9a3
      David Sterba 提交于
      Using lockdep_assert_held is preferred, replace mutex_is_locked.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a32bf9a3
    • Q
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo 提交于
      We have several reports about node pointer points to incorrect child
      tree blocks, which could have even wrong owner and level but still with
      valid generation and checksum.
      
      Although btrfs check could handle it and print error message like:
      leaf parent key incorrect 60670574592
      
      Kernel doesn't have enough check on this type of corruption correctly.
      At least add such check to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only parent bytenr and level is known and we need to resolve the key
         all by ourselves, skip @first_key check.
      
      Another note of this verification is, it needs extra info from nodeptr
      or ROOT_ITEM, so it can't fit into current tree-checker framework, which
      is limited to node/leaf boundary.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      581c1760
    • Q
      btrfs: qgroup: Use separate meta reservation type for delalloc · 43b18595
      Qu Wenruo 提交于
      Before this patch, btrfs qgroup is mixing per-transcation meta rsv with
      preallocated meta rsv, making it quite easy to underflow qgroup meta
      reservation.
      
      Since we have the new qgroup meta rsv types, apply it to delalloc
      reservation.
      
      Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
      rsv type.
      
      And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
      they will convert corresponding META_PREALLOC reservation to
      META_PERTRANS.
      
      This is mainly due to the fact that current qgroup numbers will only be
      updated in btrfs_commit_transaction(), that's to say if we don't keep
      such placeholder reservation, we can exceed qgroup limitation.
      
      And for callers freeing outstanding extent in error handler, we will
      just free META_PREALLOC bytes.
      
      This behavior makes callers of btrfs_qgroup_release_meta() or
      btrfs_qgroup_convert_meta() to be aware of which type they are.
      So in this patch, btrfs_delalloc_release_metadata() and its callers get
      an extra parameter to info qgroup to do correct meta convert/release.
      
      The good news is, even we use the wrong type (convert or free), it won't
      cause obvious bug, as prealloc type is always in good shape, and the
      type only affects how per-trans meta is increased or not.
      
      So the worst case will be at most metadata limitation can be sometimes
      exceeded (no convert at all) or metadata limitation is reached too soon
      (no free at all).
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      43b18595
    • Q
      btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans · 733e03a0
      Qu Wenruo 提交于
      Btrfs uses 2 different methods to reseve metadata qgroup space.
      
      1) Reserve at btrfs_start_transaction() time
         This is quite straightforward, caller will use the trans handler
         allocated to modify b-trees.
      
         In this case, reserved metadata should be kept until qgroup numbers
         are updated.
      
      2) Reserve by using block_rsv first, and later btrfs_join_transaction()
         This is more complicated, caller will reserve space using block_rsv
         first, and then later call btrfs_join_transaction() to get a trans
         handle.
      
         In this case, before we modify trees, the reserved space can be
         modified on demand, and after btrfs_join_transaction(), such reserved
         space should also be kept until qgroup numbers are updated.
      
      Since these two types behave differently, split the original "META"
      reservation type into 2 sub-types:
      
        META_PERTRANS:
          For above case 1)
      
        META_PREALLOC:
          For reservations that happened before btrfs_join_transaction() of
          case 2)
      
      NOTE: This patch will only convert existing qgroup meta reservation
      callers according to its situation, not ensuring all callers are at
      correct timing.
      Such fix will be added in later patches.
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ update comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      733e03a0
    • J
      btrfs: defer adding raid type kobject until after chunk relocation · 75cb379d
      Jeff Mahoney 提交于
      Any time the first block group of a new type is created, we add a new
      kobject to sysfs to hold the attributes for that type.  Kobject-internal
      allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
      While it appears as if this can occur any time a block group is created,
      the only times the first block group of a new type can be created in
      memory is at mount and when we create the first new block group during
      raid conversion.
      
      This patch adds a new list to track pending kobject additions and then
      handles them after we do chunk relocation.  Between relocating the
      target chunk (or forcing allocation of a new chunk in the case of data)
      and removing the old chunk, we're in a safe place for fs-reclaim to
      occur.  We're holding the volume mutex, which is already held across
      page faults, and the delete_unused_bgs_mutex, which will only stall
      the cleaner thread.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      75cb379d
    • J
      btrfs: remove dead create_space_info calls · dc2d3005
      Jeff Mahoney 提交于
      Since commit 2be12ef7 (btrfs: Separate space_info create/update), we've
      separated out the creation and updating of the space info structures.
      That commit was a straightforward refactoring of the two parts of
      update_space_info, but we can go a step further.  Since commits
      c59021f8 (Btrfs: fix OOPS of empty filesystem after balance) and
      b742bb82 (Btrfs: Link block groups of different raid types), we know
      that the space_info structures will be created at mount and there will
      only ever be, at most, three of them.
      
      This patch cleans out the create_space_info calls after __find_space_info
      returns NULL since __find_space_info *can't* return NULL.
      
      The initial cause for reviewing this was the kobject_add calls from
      create_space_info occuring in sites where fs-reclaim wasn't allowed.  Now
      we are certain they occur only early in the mount process and are safe.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      dc2d3005
    • N
      btrfs: Drop fs_info parameter from __btrfs_run_delayed_refs · 0a1e458a
      Nikolay Borisov 提交于
      It's provided by transaction handle.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0a1e458a
    • N
      btrfs: Drop fs_info parameter from btrfs_finish_extent_commit · 5ead2dd0
      Nikolay Borisov 提交于
      It's provided by the transaction handle.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5ead2dd0
    • N
      btrfs: drop fs_info parameter from btrfs_run_delayed_refs · c79a70b1
      Nikolay Borisov 提交于
      It's provided by the transaction handle.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c79a70b1
    • N
      btrfs: Remove unused flush var in shrink_delalloc · 39d7d09d
      Nikolay Borisov 提交于
      Added by 08e007d2 ("Btrfs: improve the noflush reservation") and
      made redundant by 17024ad0 ("Btrfs: fix early ENOSPC due to
      delalloc").
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      39d7d09d
    • N
      btrfs: Remove unused extent_root var from caching_thread · 101d2dc0
      Nikolay Borisov 提交于
      Added by b4570aa9 ("btrfs: fix compiling with CONFIG_BTRFS_DEBUG
      enabled.") and obsoleted by 2ff7e61e ("btrfs: take an fs_info
      directly when the root is not used otherwise").
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      101d2dc0
    • N
      btrfs: Document parameters of btrfs_reserve_extent · 6f47c706
      Nikolay Borisov 提交于
      This function is the entry to the extent allocator and as such has
      quite a number of parameters. Some of those have subtle effects on the
      allocation algorithm. Document the parameters.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6f47c706
    • N
      btrfs: Remove btrfs_fs_info::open_ioctl_trans · 92e2f7e3
      Nikolay Borisov 提交于
      Since userspace transaction have been removed we no longer have use
      for this field so delete it.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      92e2f7e3
  9. 26 3月, 2018 9 次提交
  10. 20 3月, 2018 1 次提交
  11. 02 2月, 2018 1 次提交
    • F
      Btrfs: fix null pointer dereference when replacing missing device · 627e0873
      Filipe Manana 提交于
      When we are replacing a missing device we mount the filesystem with the
      degraded mode option in which case we are allowed to have a btrfs device
      structure without a backing device member (its bdev member is NULL) and
      therefore we can't dereference that member. Commit 38b5f68e
      ("btrfs: drop btrfs_device::can_discard to query directly") started to
      dereference that member when discarding extents, resulting in a null
      pointer dereference:
      
       [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing
       [ 3145.364116] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sdg started
       [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
       [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs]
       [ 3145.415085] PGD 0 P4D 0
       [ 3145.415085] Oops: 0000 [#1] PREEMPT SMP PTI
       [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2
       [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: G        W        4.15.0-rc9-btrfs-next-55+ #1
       [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
       [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs]
       [ 3145.415085] RSP: 0018:ffffc90004813c60 EFLAGS: 00010293
       [ 3145.415085] RAX: ffff88020d39cc00 RBX: ffff88020c4ea2a0 RCX: 0000000000000002
       [ 3145.415085] RDX: 0000000000000000 RSI: ffff88020c4ea240 RDI: 0000000000000000
       [ 3145.415085] RBP: 0000000000000000 R08: 0000000000004000 R09: 0000000000000000
       [ 3145.415085] R10: ffffc90004813ae8 R11: 0000000000000000 R12: 0000000000000000
       [ 3145.415085] R13: ffff88020c418000 R14: 0000000000000000 R15: 0000000000000000
       [ 3145.415085] FS:  00007f565681f8c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
       [ 3145.415085] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 3145.415085] CR2: 00000000000000e0 CR3: 000000020d208006 CR4: 00000000001606f0
       [ 3145.415085] Call Trace:
       [ 3145.415085]  btrfs_finish_extent_commit+0x9a/0x1be [btrfs]
       [ 3145.415085]  btrfs_commit_transaction+0x649/0x7a0 [btrfs]
       [ 3145.415085]  ? start_transaction+0x2b0/0x3b3 [btrfs]
       [ 3145.415085]  btrfs_dev_replace_start+0x274/0x30c [btrfs]
       [ 3145.415085]  btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs]
       [ 3145.415085]  btrfs_ioctl+0x1a91/0x1d62 [btrfs]
       [ 3145.415085]  ? lock_acquire+0x16a/0x1af
       [ 3145.415085]  ? vfs_ioctl+0x1b/0x28
       [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
       [ 3145.415085]  vfs_ioctl+0x1b/0x28
       [ 3145.415085]  do_vfs_ioctl+0x5a9/0x5e0
       [ 3145.415085]  ? _raw_spin_unlock_irq+0x34/0x46
       [ 3145.415085]  ? entry_SYSCALL_64_fastpath+0x5/0x8b
       [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
       [ 3145.415085]  SyS_ioctl+0x52/0x76
       [ 3145.415085]  entry_SYSCALL_64_fastpath+0x1e/0x8b
       [ 3145.415085] RIP: 0033:0x7f56558b3c47
       [ 3145.415085] RSP: 002b:00007ffdcfac4c58 EFLAGS: 00000202
       [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00
       [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: ffffc90004813c60
       [ 3145.415085] CR2: 00000000000000e0
       [ 3145.458185] ---[ end trace 06302e7ac31902bf ]---
      
      This is trivially reproduced by running the test btrfs/027 from fstests
      like this:
      
        $ MOUNT_OPTIONS="-o discard" ./check btrfs/027
      
      Fix this by skipping devices without a backing device before attempting
      to discard.
      
      Fixes: 38b5f68e ("btrfs: drop btrfs_device::can_discard to query directly")
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      627e0873
  12. 22 1月, 2018 1 次提交