1. 18 Apr, 2018 1 commit
    • Q
      btrfs: qgroup: Use independent and accurate per inode qgroup rsv · ff6bc37e
      Qu Wenruo committed
      Unlike reservation calculation used in inode rsv for metadata, qgroup
      doesn't really need to care about things like csum size or extent usage
      for the whole tree COW.
      
      Qgroups care more about net change of the extent usage.
      That's to say, if we're going to insert one file extent, it will mostly
      find its place in COWed tree block, leaving no change in extent usage.
      Or causing a leaf split, resulting in one new net extent and increasing
      qgroup number by nodesize.
      Or in an even more rare case, increase the tree level, increasing qgroup
      number by 2 * nodesize.
      
      So here instead of using the complicated calculation for extent
      allocator, which cares more about accuracy and no error, qgroup doesn't
      need that over-estimated reservation.
      
      This patch will maintain 2 new members in btrfs_block_rsv structure for
      qgroup, using much smaller calculation for qgroup rsv, reducing false
      EDQUOT.
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
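The three cases listed in this commit message (no change, a leaf split, a tree level increase) can be sketched as a small userspace model. This is an illustration only, not the kernel code: the enum and the function name `qgroup_net_change` are hypothetical.

```c
#include <stdint.h>

/* Hypothetical outcomes of inserting one item into a COWed tree. */
enum insert_outcome {
    INSERT_FITS_IN_LEAF, /* item lands in the already COWed tree block */
    INSERT_SPLITS_LEAF,  /* leaf split: one new net tree block */
    INSERT_GROWS_TREE    /* rarer: the tree level increases */
};

/* Net change, in bytes, that the qgroup would see for one insertion. */
uint64_t qgroup_net_change(enum insert_outcome outcome, uint32_t nodesize)
{
    switch (outcome) {
    case INSERT_FITS_IN_LEAF:
        return 0;
    case INSERT_SPLITS_LEAF:
        return nodesize;
    case INSERT_GROWS_TREE:
        return 2ULL * nodesize;
    }
    return 0;
}
```

With a 16 KiB nodesize, the largest of these cases is 32 KiB, which is the kind of small number the message contrasts with the over-estimated reservation used by the extent allocator.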
  2. 12 Apr, 2018 1 commit
  3. 06 Apr, 2018 1 commit
  4. 31 Mar, 2018 13 commits
    • D
      btrfs: use lockdep_assert_held for mutexes · a32bf9a3
      David Sterba committed
      Using lockdep_assert_held is preferred, so replace mutex_is_locked.
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Q
      btrfs: Validate child tree block's level and first key · 581c1760
      Qu Wenruo committed
      We have several reports of node pointers pointing to incorrect child
      tree blocks, which can even have the wrong owner and level while still
      having a valid generation and checksum.
      
      Although btrfs check can handle it and print an error message like:
      leaf parent key incorrect 60670574592
      
      the kernel doesn't have enough checks for this type of corruption.
      At least add such checks to read_tree_block() and btrfs_read_buffer(),
      where we need two new parameters @level and @first_key to verify the
      child tree block.
      
      The new @level check is mandatory and all call sites are already
      modified to extract expected level from its call chain.
      
      While @first_key is optional, the following call sites are skipping such
      check:
      1) Root node/leaf
         As ROOT_ITEM doesn't contain the first key, skip @first_key check.
      2) Direct backref
         Only the parent bytenr and level are known, and we would need to
         resolve the key all by ourselves, so skip the @first_key check.
      
      Note that this verification needs extra info from the nodeptr or
      ROOT_ITEM, so it can't fit into the current tree-checker framework,
      which is limited to node/leaf boundaries.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
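The check described in this commit message (mandatory @level, optional @first_key) can be sketched roughly as follows. This is a simplified userspace model under stated assumptions: the struct layouts, the function names and the -1 return value are hypothetical and do not reproduce the kernel's read_tree_block() interface.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a btrfs key. */
struct key {
    uint64_t objectid;
    uint8_t type;
    uint64_t offset;
};

/* Simplified stand-in for a tree block that has just been read. */
struct tree_block {
    int level;
    struct key first_key;
};

static int keys_equal(const struct key *a, const struct key *b)
{
    return a->objectid == b->objectid && a->type == b->type &&
           a->offset == b->offset;
}

/*
 * Returns 0 if the child block matches what the parent expects, -1
 * otherwise. The level check is always done; passing NULL for
 * expected_first_key skips the key check (root node/leaf, direct backref).
 */
int verify_child_block(const struct tree_block *blk, int expected_level,
                       const struct key *expected_first_key)
{
    if (blk->level != expected_level)
        return -1;
    if (expected_first_key == NULL)
        return 0;
    return keys_equal(&blk->first_key, expected_first_key) ? 0 : -1;
}
```

The expected values come from the parent node pointer, which is why such a check cannot live in a checker that only sees one node or leaf at a time.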
    • Q
      btrfs: qgroup: Use separate meta reservation type for delalloc · 43b18595
      Qu Wenruo committed
      Before this patch, btrfs qgroup mixes per-transaction meta rsv with
      preallocated meta rsv, making it quite easy to underflow the qgroup meta
      reservation.
      
      Since we have the new qgroup meta rsv types, apply it to delalloc
      reservation.
      
      Now for delalloc, most of its reserved space will use META_PREALLOC qgroup
      rsv type.
      
      And for callers reducing outstanding extent like btrfs_finish_ordered_io(),
      they will convert corresponding META_PREALLOC reservation to
      META_PERTRANS.
      
      This is mainly due to the fact that current qgroup numbers will only be
      updated in btrfs_commit_transaction(), that's to say if we don't keep
      such placeholder reservation, we can exceed qgroup limitation.
      
      And for callers freeing outstanding extent in error handler, we will
      just free META_PREALLOC bytes.
      
      This behavior requires callers of btrfs_qgroup_release_meta() or
      btrfs_qgroup_convert_meta() to be aware of which type they are.
      So in this patch, btrfs_delalloc_release_metadata() and its callers get
      an extra parameter to inform qgroup to do the correct meta convert/release.
      
      The good news is, even if we use the wrong type (convert or free), it won't
      cause an obvious bug, as the prealloc type is always in good shape, and the
      type only affects whether per-trans meta is increased or not.
      
      So in the worst case the metadata limit can sometimes be exceeded
      (no convert at all) or the metadata limit is reached too soon
      (no free at all).
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
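The convert-or-free choice described in this commit message can be modelled with two counters. The sketch below is a userspace illustration only: the struct, the function names and the clamping behaviour are assumptions made for the example, not the kernel's qgroup implementation.

```c
#include <stdint.h>

/* Hypothetical per-qgroup metadata reservation, split by type. */
struct qgroup_meta_rsv {
    uint64_t prealloc; /* META_PREALLOC: reserved before joining a transaction */
    uint64_t pertrans; /* META_PERTRANS: kept until qgroup numbers are updated */
};

void meta_reserve_prealloc(struct qgroup_meta_rsv *r, uint64_t bytes)
{
    r->prealloc += bytes;
}

/* Take up to 'bytes' out of prealloc; clamped so it cannot underflow. */
static uint64_t take_prealloc(struct qgroup_meta_rsv *r, uint64_t bytes)
{
    if (bytes > r->prealloc)
        bytes = r->prealloc;
    r->prealloc -= bytes;
    return bytes;
}

/* Success path: keep the space as a placeholder until commit. */
void meta_convert(struct qgroup_meta_rsv *r, uint64_t bytes)
{
    r->pertrans += take_prealloc(r, bytes);
}

/* Error path: the space was never used, just give it back. */
void meta_free_prealloc(struct qgroup_meta_rsv *r, uint64_t bytes)
{
    (void)take_prealloc(r, bytes);
}

/* At commit the qgroup numbers are updated, so placeholders can go. */
void meta_commit(struct qgroup_meta_rsv *r)
{
    r->pertrans = 0;
}
```

In this model, choosing the wrong operation only changes how much ends up in `pertrans`, which mirrors the message's point that the worst case is a limit exceeded or reached too soon rather than corrupted accounting.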
    • Q
      btrfs: qgroup: Split meta rsv type into meta_prealloc and meta_pertrans · 733e03a0
      Qu Wenruo committed
      Btrfs uses 2 different methods to reserve metadata qgroup space.
      
      1) Reserve at btrfs_start_transaction() time
         This is quite straightforward, caller will use the trans handle
         allocated to modify b-trees.
      
         In this case, reserved metadata should be kept until qgroup numbers
         are updated.
      
      2) Reserve by using block_rsv first, and later btrfs_join_transaction()
         This is more complicated, caller will reserve space using block_rsv
         first, and then later call btrfs_join_transaction() to get a trans
         handle.
      
         In this case, before we modify trees, the reserved space can be
         modified on demand, and after btrfs_join_transaction(), such reserved
         space should also be kept until qgroup numbers are updated.
      
      Since these two types behave differently, split the original "META"
      reservation type into 2 sub-types:
      
        META_PERTRANS:
          For above case 1)
      
        META_PREALLOC:
          For reservations that happened before btrfs_join_transaction() of
          case 2)
      
      NOTE: This patch only converts existing qgroup meta reservation
      callers according to their situation; it does not ensure that all
      callers reserve at the correct time.
      Such fixes will be added in later patches.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      [ update comments ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: defer adding raid type kobject until after chunk relocation · 75cb379d
      Jeff Mahoney committed
      Any time the first block group of a new type is created, we add a new
      kobject to sysfs to hold the attributes for that type.  Kobject-internal
      allocations always use GFP_KERNEL, making them prone to fs-reclaim races.
      While it appears as if this can occur any time a block group is created,
      the only times the first block group of a new type can be created in
      memory is at mount and when we create the first new block group during
      raid conversion.
      
      This patch adds a new list to track pending kobject additions and then
      handles them after we do chunk relocation.  Between relocating the
      target chunk (or forcing allocation of a new chunk in the case of data)
      and removing the old chunk, we're in a safe place for fs-reclaim to
      occur.  We're holding the volume mutex, which is already held across
      page faults, and the delete_unused_bgs_mutex, which will only stall
      the cleaner thread.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • J
      btrfs: remove dead create_space_info calls · dc2d3005
      Jeff Mahoney committed
      Since commit 2be12ef7 (btrfs: Separate space_info create/update), we've
      separated out the creation and updating of the space info structures.
      That commit was a straightforward refactoring of the two parts of
      update_space_info, but we can go a step further.  Since commits
      c59021f8 (Btrfs: fix OOPS of empty filesystem after balance) and
      b742bb82 (Btrfs: Link block groups of different raid types), we know
      that the space_info structures will be created at mount and there will
      only ever be, at most, three of them.
      
      This patch cleans out the create_space_info calls after __find_space_info
      returns NULL since __find_space_info *can't* return NULL.
      
      The initial cause for reviewing this was the kobject_add calls from
      create_space_info occurring in sites where fs-reclaim wasn't allowed.  Now
      we are certain they occur only early in the mount process and are safe.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Drop fs_info parameter from __btrfs_run_delayed_refs · 0a1e458a
      Nikolay Borisov committed
      It's provided by the transaction handle.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Drop fs_info parameter from btrfs_finish_extent_commit · 5ead2dd0
      Nikolay Borisov committed
      It's provided by the transaction handle.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: drop fs_info parameter from btrfs_run_delayed_refs · c79a70b1
      Nikolay Borisov committed
      It's provided by the transaction handle.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove unused flush var in shrink_delalloc · 39d7d09d
      Nikolay Borisov committed
      Added by 08e007d2 ("Btrfs: improve the noflush reservation") and
      made redundant by 17024ad0 ("Btrfs: fix early ENOSPC due to
      delalloc").
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove unused extent_root var from caching_thread · 101d2dc0
      Nikolay Borisov committed
      Added by b4570aa9 ("btrfs: fix compiling with CONFIG_BTRFS_DEBUG
      enabled.") and obsoleted by 2ff7e61e ("btrfs: take an fs_info
      directly when the root is not used otherwise").
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Document parameters of btrfs_reserve_extent · 6f47c706
      Nikolay Borisov committed
      This function is the entry to the extent allocator and as such has
      quite a number of parameters. Some of those have subtle effects on the
      allocation algorithm. Document the parameters.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Anand Jain <anand.jain@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove btrfs_fs_info::open_ioctl_trans · 92e2f7e3
      Nikolay Borisov committed
      Since userspace transactions have been removed, we no longer have a use
      for this field, so delete it.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  5. 26 Mar, 2018 9 commits
  6. 02 Feb, 2018 1 commit
    • F
      Btrfs: fix null pointer dereference when replacing missing device · 627e0873
      Filipe Manana committed
      When we are replacing a missing device we mount the filesystem with the
      degraded mode option in which case we are allowed to have a btrfs device
      structure without a backing device member (its bdev member is NULL) and
      therefore we can't dereference that member. Commit 38b5f68e
      ("btrfs: drop btrfs_device::can_discard to query directly") started to
      dereference that member when discarding extents, resulting in a null
      pointer dereference:
      
       [ 3145.322257] BTRFS warning (device sdf): devid 2 uuid 4d922414-58eb-4880-8fed-9c3840f6c5d5 is missing
       [ 3145.364116] BTRFS info (device sdf): dev_replace from <missing disk> (devid 2) to /dev/sdg started
       [ 3145.413489] BUG: unable to handle kernel NULL pointer dereference at 00000000000000e0
       [ 3145.415085] IP: btrfs_discard_extent+0x6a/0xf8 [btrfs]
       [ 3145.415085] PGD 0 P4D 0
       [ 3145.415085] Oops: 0000 [#1] PREEMPT SMP PTI
       [ 3145.415085] Modules linked in: ppdev ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper evdev psmouse parport_pc serio_raw i2c_piix4 i2
       [ 3145.415085] CPU: 0 PID: 11989 Comm: btrfs Tainted: G        W        4.15.0-rc9-btrfs-next-55+ #1
       [ 3145.415085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
       [ 3145.415085] RIP: 0010:btrfs_discard_extent+0x6a/0xf8 [btrfs]
       [ 3145.415085] RSP: 0018:ffffc90004813c60 EFLAGS: 00010293
       [ 3145.415085] RAX: ffff88020d39cc00 RBX: ffff88020c4ea2a0 RCX: 0000000000000002
       [ 3145.415085] RDX: 0000000000000000 RSI: ffff88020c4ea240 RDI: 0000000000000000
       [ 3145.415085] RBP: 0000000000000000 R08: 0000000000004000 R09: 0000000000000000
       [ 3145.415085] R10: ffffc90004813ae8 R11: 0000000000000000 R12: 0000000000000000
       [ 3145.415085] R13: ffff88020c418000 R14: 0000000000000000 R15: 0000000000000000
       [ 3145.415085] FS:  00007f565681f8c0(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
       [ 3145.415085] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 3145.415085] CR2: 00000000000000e0 CR3: 000000020d208006 CR4: 00000000001606f0
       [ 3145.415085] Call Trace:
       [ 3145.415085]  btrfs_finish_extent_commit+0x9a/0x1be [btrfs]
       [ 3145.415085]  btrfs_commit_transaction+0x649/0x7a0 [btrfs]
       [ 3145.415085]  ? start_transaction+0x2b0/0x3b3 [btrfs]
       [ 3145.415085]  btrfs_dev_replace_start+0x274/0x30c [btrfs]
       [ 3145.415085]  btrfs_dev_replace_by_ioctl+0x45/0x59 [btrfs]
       [ 3145.415085]  btrfs_ioctl+0x1a91/0x1d62 [btrfs]
       [ 3145.415085]  ? lock_acquire+0x16a/0x1af
       [ 3145.415085]  ? vfs_ioctl+0x1b/0x28
       [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
       [ 3145.415085]  vfs_ioctl+0x1b/0x28
       [ 3145.415085]  do_vfs_ioctl+0x5a9/0x5e0
       [ 3145.415085]  ? _raw_spin_unlock_irq+0x34/0x46
       [ 3145.415085]  ? entry_SYSCALL_64_fastpath+0x5/0x8b
       [ 3145.415085]  ? trace_hardirqs_on_caller+0x14c/0x1a6
       [ 3145.415085]  SyS_ioctl+0x52/0x76
       [ 3145.415085]  entry_SYSCALL_64_fastpath+0x1e/0x8b
       [ 3145.415085] RIP: 0033:0x7f56558b3c47
       [ 3145.415085] RSP: 002b:00007ffdcfac4c58 EFLAGS: 00000202
       [ 3145.415085] Code: be 02 00 00 00 4c 89 ef e8 b9 e7 03 00 85 c0 89 c5 75 75 48 8b 44 24 08 45 31 f6 48 8d 58 60 eb 52 48 8b 03 48 8b b8 a0 00 00 00 <48> 8b 87 e0 00
       [ 3145.415085] RIP: btrfs_discard_extent+0x6a/0xf8 [btrfs] RSP: ffffc90004813c60
       [ 3145.415085] CR2: 00000000000000e0
       [ 3145.458185] ---[ end trace 06302e7ac31902bf ]---
      
      This is trivially reproduced by running the test btrfs/027 from fstests
      like this:
      
        $ MOUNT_OPTIONS="-o discard" ./check btrfs/027
      
      Fix this by skipping devices without a backing device before attempting
      to discard.
      
      Fixes: 38b5f68e ("btrfs: drop btrfs_device::can_discard to query directly")
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
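The shape of the fix described in this commit message (skip any device whose bdev member is NULL before touching it) can be sketched as below. This is a simplified userspace model: the struct definitions, the `supports_discard` field and the function `discard_stripes` are hypothetical stand-ins, not the kernel's btrfs_discard_extent().

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel structures. */
struct block_device {
    int supports_discard;
};

struct device {
    struct block_device *bdev; /* NULL for a missing device (degraded mount) */
};

struct stripe {
    struct device *dev;
    uint64_t length;
};

/* Returns the number of bytes for which a discard would be issued. */
uint64_t discard_stripes(const struct stripe *stripes, size_t n)
{
    uint64_t discarded = 0;

    for (size_t i = 0; i < n; i++) {
        const struct device *dev = stripes[i].dev;

        /* Check the pointer first: a missing device has no bdev. */
        if (dev->bdev == NULL)
            continue;
        if (!dev->bdev->supports_discard)
            continue;
        discarded += stripes[i].length;
    }
    return discarded;
}
```

The ordering matters: the NULL test has to come before any access through `bdev`, which is the dereference that faulted in the trace above.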
  7. 22 Jan, 2018 7 commits
  8. 07 Dec, 2017 1 commit
  9. 21 Nov, 2017 1 commit
    • J
      btrfs: clear space cache inode generation always · 8e138e0d
      Josef Bacik committed
      We discovered a box that had double allocations, and suspected the space
      cache may be to blame.  While auditing the write out path I noticed that
      if we've already set up the space cache we will just carry on.  This
      means that if we hit any error after cache_save_setup but before we
      actually write the cache out, we won't reset the inode generation, so
      whatever was already written will be considered correct, except it'll be
      stale.  Fix this by _always_ resetting the generation on the block group
      inode, so that we only ever have a valid or invalid cache.
      
      With this patch I was no longer able to reproduce cache corruption with
      dm-log-writes and my bpf error injection tool.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  10. 13 Nov, 2017 1 commit
    • D
      Pass mode to wait_on_atomic_t() action funcs and provide default actions · 5e4def20
      David Howells committed
      Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
      extra argument and make it 'unsigned int' throughout.
      
      Also, consolidate a bunch of identical action functions into a default
      function that can do the appropriate thing for the mode.
      
      Also, change the argument name in the bit_wait*() function declarations to
      reflect the fact that it's the mode and not the bit number.
      
      [Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
      should be done differently, though he's not immediately sure as to how]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      cc: Ingo Molnar <mingo@kernel.org>
  11. 02 Nov, 2017 3 commits
    • J
      btrfs: track refs in a rb_tree instead of a list · 0e0adbcf
      Josef Bacik committed
      If we get a significant amount of delayed refs for a single block (think
      modifying multiple snapshots) we can end up spending an ungodly amount
      of time looping through all of the entries trying to see if they can be
      merged.  This is because we only add them to a list, so we have O(2n)
      for every ref head.  This doesn't make any sense as we likely have refs
      for different roots, and so they cannot be merged.  Tracking in a tree
      will allow us to break as soon as we hit an entry that doesn't match,
      making our worst case O(n).
      
      With this we can also merge entries more easily.  Before we had to hope
      that matching refs were on the ends of our list, but with the tree we
      can search down to exact matches and merge them at insert time.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
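The idea of merging at insert time can be illustrated with an ordered tree keyed on whatever makes two refs mergeable. The sketch below is a deliberately simplified userspace model: it uses a plain unbalanced binary search tree rather than the kernel's rb_tree, keys only on a hypothetical `root` id, and all names are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical delayed ref: 'root' is the sort key, 'ref_mod' the count. */
struct ref_node {
    uint64_t root;
    int ref_mod;
    struct ref_node *left;
    struct ref_node *right;
};

/*
 * Insert a ref. Returns 1 if it was merged into an existing node with the
 * same key, 0 if a new node was added, -1 if allocation failed.
 */
int ref_insert(struct ref_node **link, uint64_t root, int ref_mod)
{
    struct ref_node *node;

    while (*link != NULL) {
        struct ref_node *cur = *link;

        if (root < cur->root) {
            link = &cur->left;
        } else if (root > cur->root) {
            link = &cur->right;
        } else {
            cur->ref_mod += ref_mod; /* exact match: merge in place */
            return 1;
        }
    }

    node = calloc(1, sizeof(*node));
    if (node == NULL)
        return -1;
    node->root = root;
    node->ref_mod = ref_mod;
    *link = node;
    return 0;
}

size_t ref_count(const struct ref_node *n)
{
    return n ? 1 + ref_count(n->left) + ref_count(n->right) : 0;
}

void ref_free(struct ref_node *n)
{
    if (n == NULL)
        return;
    ref_free(n->left);
    ref_free(n->right);
    free(n);
}
```

Because the search descends directly to the position where a matching entry would be, refs for different roots never have to be compared against each other one by one, which is the cost the list-based approach paid.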
    • J
      btrfs: make the delalloc block rsv per inode · 69fe2d75
      Josef Bacik committed
      The way we handle delalloc metadata reservations has gotten
      progressively more complicated over the years.  There is so much cruft
      and weirdness around keeping the reserved count and outstanding counters
      consistent and handling the error cases that it's impossible to
      understand.
      
      Fix this by making the delalloc block rsv per-inode.  This way we can
      calculate the actual size of the outstanding metadata reservations every
      time we make a change, and then reserve the delta based on that amount.
      This greatly simplifies the code everywhere, and makes the error
      handling in btrfs_delalloc_reserve_metadata far less terrifying.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
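The "calculate the actual size, then reserve the delta" approach from this commit message can be sketched in a few lines. This is an assumption-laden illustration: the struct, the function name `rsv_delta` and the flat per-extent cost are hypothetical, and the kernel's real size calculation is more involved.

```c
#include <stdint.h>

/* Hypothetical per-inode reservation state. */
struct inode_rsv {
    uint64_t outstanding_extents; /* extents currently needing metadata */
    uint64_t reserved;            /* bytes already reserved for this inode */
};

/*
 * Recompute the target size from the current state and return the delta:
 * positive means more bytes must be reserved, negative means bytes can be
 * released. bytes_per_extent is an assumed flat cost per extent.
 */
int64_t rsv_delta(const struct inode_rsv *rsv, uint64_t bytes_per_extent)
{
    uint64_t target = rsv->outstanding_extents * bytes_per_extent;

    return (int64_t)target - (int64_t)rsv->reserved;
}
```

Since the target is recomputed from scratch on every change, an error path only has to undo its own change to the state and ask for the delta again, rather than track what was transferred where.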
    • J
      Btrfs: rework outstanding_extents · 8b62f87b
      Josef Bacik committed
      Right now we jump through a lot of weird hoops around outstanding_extents
      in order to keep the extent count consistent.  This is because we logically
      transfer the outstanding_extent count from the initial reservation
      through the set_delalloc_bits.  This makes it pretty difficult to get a
      handle on how and when we need to mess with outstanding_extents.
      
      Fix this by revamping the rules of how we deal with outstanding_extents.
      Now instead everybody that is holding on to a delalloc extent is
      required to increase the outstanding extents count for itself.  This
      means we'll have something like this
      
      btrfs_delalloc_reserve_metadata	- outstanding_extents = 1
       btrfs_set_extent_delalloc	- outstanding_extents = 2
      btrfs_release_delalloc_extents	- outstanding_extents = 1
      
      for an initial file write.  Now take the append write where we extend an
      existing delalloc range but still under the maximum extent size
      
      btrfs_delalloc_reserve_metadata - outstanding_extents = 2
        btrfs_set_extent_delalloc
          btrfs_set_bit_hook		- outstanding_extents = 3
          btrfs_merge_extent_hook	- outstanding_extents = 2
      btrfs_delalloc_release_extents	- outstanding_extents = 1
      
      In order to make the ordered extent transition we of course must now
      make ordered extents carry their own outstanding_extent reservation, so
      for cow_file_range we end up with
      
      btrfs_add_ordered_extent	- outstanding_extents = 2
      clear_extent_bit		- outstanding_extents = 1
      btrfs_remove_ordered_extent	- outstanding_extents = 0
      
      This makes all manipulations of outstanding_extents much more explicit.
      Every successful call to btrfs_delalloc_reserve_metadata _must_ now be
      combined with btrfs_release_delalloc_extents, even in the error case, as
      that is the only function that actually modifies the
      outstanding_extents counter.
      
      The drawback to this is now we are much more likely to have transient
      cases where outstanding_extents is much larger than it actually should
      be.  This could happen before as we manipulated the delalloc bits, but
      now it happens basically at every write.  This may put more pressure on
      the ENOSPC flushing code, but I think making this code simpler is worth
      the cost.  I have another change coming to mitigate this side-effect
      somewhat.
      
      I also added trace points for the counter manipulation.  These were used
      by a bpf script I wrote to help track down leak issues.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
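The rule in this commit message, that every holder of a delalloc extent takes and later drops its own count, can be modelled with a single counter. The sketch below is a userspace illustration with hypothetical names (`hold_extent`, `drop_extent`); the comments map each call to the steps listed in the message.

```c
/* Hypothetical per-inode counter. */
struct inode_counters {
    unsigned int outstanding_extents;
};

/* A holder of a delalloc extent takes its own count. */
void hold_extent(struct inode_counters *c)
{
    c->outstanding_extents++;
}

/* A holder gives its count back. Returns -1 on an unbalanced drop. */
int drop_extent(struct inode_counters *c)
{
    if (c->outstanding_extents == 0)
        return -1;
    c->outstanding_extents--;
    return 0;
}

/*
 * Initial file write followed by the ordered extent transition,
 * mirroring the sequences in the commit message:
 *
 *   hold  (reserve metadata)      -> 1
 *   hold  (set extent delalloc)   -> 2
 *   drop  (release extents)       -> 1
 *   hold  (add ordered extent)    -> 2
 *   drop  (clear extent bit)      -> 1
 *   drop  (remove ordered extent) -> 0
 */
```

Each hold has exactly one matching drop by the same party, so a leak shows up as a counter that never returns to zero, which is what the trace points mentioned in the message were used to track down.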
  12. 30 Oct, 2017 1 commit