1. 26 Jul, 2016 10 commits
    • Btrfs: cleanup BUG_ON in merge_bio · 6f034ece
      Liu Bo committed
      One can use btrfs-corrupt-block to hit the BUG_ON() in merge_bio(),
      so this patch aims to stop anyone from panicking the whole system
      through their btrfs.
      
      Since the error in merge_bio() can only come from __btrfs_map_block()
      when the chunk tree mapping has something insane, and __btrfs_map_block()
      has already printed the reason, we can just return errors in
      merge_bio().
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Fix slab accounting flags · fba4b697
      Nikolay Borisov committed
      BTRFS is using a variety of slab caches to satisfy internal needs.
      Those slab caches are always allocated with the SLAB_RECLAIM_ACCOUNT flag,
      meaning allocations from the caches are going to be accounted as
      SReclaimable. At the same time, btrfs is not registering any shrinkers
      whatsoever, thus preventing memory from the slabs from being shrunk. This
      means those caches are not in fact reclaimable.
      
      To fix this, remove SLAB_RECLAIM_ACCOUNT from all caches apart from the
      inode cache, since that one is freed by the generic VFS super_block
      shrinker. Also set the transaction-related caches as SLAB_TEMPORARY,
      to better document the lifetime of the objects (it just translates
      to SLAB_RECLAIM_ACCOUNT).
      Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: use the correct struct for BTRFS_IOC_LOGICAL_INO · 7af7c616
      Hans van Kranenburg committed
      BTRFS_IOC_LOGICAL_INO takes a btrfs_ioctl_logical_ino_args as argument,
      not a btrfs_ioctl_ino_path_args. The lines were probably copy/pasted
      when the code was written.
      
      Since btrfs_ioctl_logical_ino_args and btrfs_ioctl_ino_path_args have
      the same size, the actual IOCTL definition here does not change.
      
      But it makes the code less confusing for the reader.
      Signed-off-by: Hans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Replace -ENOENT by -ERANGE in btrfs_get_acl() · a60617d0
      Salah Triki committed
      size contains the value returned by posix_acl_from_xattr(), which
      returns -ERANGE, -ENODATA, zero, or an integer greater than zero. So
      replace -ENOENT by -ERANGE.
      Signed-off-by: Salah Triki <salah.triki@gmail.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • btrfs: Handle uninitialised inode eviction · 3d48d981
      Nikolay Borisov committed
      The code flow in btrfs_new_inode allows btrfs_evict_inode to be
      called with a not fully initialised inode (e.g. ->root member not
      being set). This can happen when btrfs_set_inode_index in
      btrfs_new_inode fails, which in turn calls iput for the newly
      allocated inode. That leads to the VFS calling into btrfs_evict_inode,
      which results in a null pointer dereference. To handle this situation, check whether
      the passed inode has root set and just free it in case it doesn't.
      Signed-off-by: Nikolay Borisov <kernel@kyup.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
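The guard described in this commit can be sketched in user-space C. This is a minimal model, not the kernel code: `demo_inode`, `demo_root`, and `demo_evict_inode` are hypothetical stand-ins for the real inode, root, and eviction routine.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct inode or btrfs_root. */
struct demo_root {
    int id;
};

struct demo_inode {
    struct demo_root *root;  /* NULL if setup failed before it was assigned */
    bool cleared;            /* set once the inode has been released */
    bool tree_cleanup_done;  /* set only on the full eviction path */
};

/* Evict an inode, bailing out early if it was never attached to a root. */
void demo_evict_inode(struct demo_inode *inode)
{
    if (inode->root == NULL) {
        /* Nothing was set up, so there is no tree state to tear down. */
        inode->cleared = true;
        return;
    }
    /* Full path: dereferencing inode->root is safe from here on. */
    inode->tree_cleanup_done = (inode->root->id >= 0);
    inode->cleared = true;
}
```

Without the early return, the full path would dereference a NULL `root` for an inode that failed halfway through creation.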
    • Btrfs: fix read_node_slot to return errors · fb770ae4
      Liu Bo committed
      We use read_node_slot() to read a btree node, and it has two failure cases:
      a) the slot is out of range, which means 'no such entry'
      b) we fail to read the block, due to checksum failures, corrupted
         content, or a missing uptodate flag.
      Currently we return NULL in both cases. This patch makes it return -ENOENT
      in case a) and -EIO in case b), and fixes its callers,
      as well as the caller of btrfs_search_forward(), to catch the new errors.
      
      The problem was reported by Peter Becker, and I managed to
      hit the same BUG_ON by mounting my fuzz image.
      Reported-by: Peter Becker <floyd.net@gmail.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
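The two failure cases can be told apart by encoding an errno in the returned pointer, in the style of the kernel's ERR_PTR helpers. The sketch below is a user-space model under that assumption: the `demo_*` helpers and structs are hypothetical and only imitate the pattern, they are not the btrfs implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_ERRNO 4095

/* User-space imitations of the error-pointer helpers (assumed names). */
static inline void *demo_err_ptr(long err) { return (void *)(intptr_t)err; }
static inline long demo_ptr_err(const void *p) { return (long)(intptr_t)p; }
static inline int demo_is_err(const void *p)
{
    /* Error codes occupy the top DEMO_MAX_ERRNO addresses. */
    return (uintptr_t)p >= (uintptr_t)-DEMO_MAX_ERRNO;
}

struct demo_block {
    int valid;  /* 0 models a checksum failure or missing uptodate flag */
};

struct demo_node {
    int nritems;
    struct demo_block *children[4];
};

/* Return the child at slot, or an encoded -ENOENT / -EIO. */
struct demo_block *demo_read_node_slot(struct demo_node *parent, int slot)
{
    struct demo_block *b;

    if (slot < 0 || slot >= parent->nritems)
        return demo_err_ptr(-ENOENT);  /* case a: no such entry */
    b = parent->children[slot];
    if (b == NULL || !b->valid)
        return demo_err_ptr(-EIO);     /* case b: block unreadable */
    return b;
}
```

A caller checks `demo_is_err()` first and then uses `demo_ptr_err()` to decide whether the slot was simply absent or the read failed.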
    • Btrfs: fix double free of fs root · 876d2cf1
      Liu Bo committed
      I got this warning while mounting a btrfs image,
      
      [ 3020.509606] ------------[ cut here ]------------
      [ 3020.510107] WARNING: CPU: 3 PID: 5581 at lib/idr.c:1051 ida_remove+0xca/0x190
      [ 3020.510853] ida_remove called for id=42 which is not allocated.
      [ 3020.511466] Modules linked in:
      [ 3020.511802] CPU: 3 PID: 5581 Comm: mount Not tainted 4.7.0-rc5+ #274
      [ 3020.512438] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
      [ 3020.513385]  0000000000000286 0000000021295d86 ffff88006c66b8f0 ffffffff8182ba5a
      [ 3020.514153]  0000000000000000 0000000000000009 ffff88006c66b930 ffffffff810e0ed7
      [ 3020.514928]  0000041b00000000 ffffffff8289a8c0 ffff88007f437880 0000000000000000
      [ 3020.515717] Call Trace:
      [ 3020.515965]  [<ffffffff8182ba5a>] dump_stack+0xc9/0x13f
      [ 3020.516487]  [<ffffffff810e0ed7>] __warn+0x147/0x160
      [ 3020.517005]  [<ffffffff810e0f4f>] warn_slowpath_fmt+0x5f/0x80
      [ 3020.517572]  [<ffffffff8182e6ca>] ida_remove+0xca/0x190
      [ 3020.518075]  [<ffffffff813a2bcc>] free_anon_bdev+0x2c/0x60
      [ 3020.518609]  [<ffffffff81657a9f>] free_fs_root+0x13f/0x160
      [ 3020.519138]  [<ffffffff8165c679>] btrfs_get_fs_root+0x379/0x3d0
      [ 3020.519710]  [<ffffffff81e6e975>] ? __mutex_unlock_slowpath+0x155/0x2c0
      [ 3020.520366]  [<ffffffff816615b1>] open_ctree+0x2e91/0x3200
      [ 3020.520965]  [<ffffffff8161ede2>] btrfs_mount+0x1322/0x15b0
      [ 3020.521536]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
      [ 3020.522167]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
      [ 3020.522780]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
      [ 3020.523305]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
      [ 3020.523872]  [<ffffffff8161dee1>] btrfs_mount+0x421/0x15b0
      [ 3020.524402]  [<ffffffff81e60e74>] ? kmemleak_alloc_percpu+0x44/0x170
      [ 3020.525045]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
      [ 3020.525657]  [<ffffffff8115f5e1>] ? lockdep_init_map+0x61/0x210
      [ 3020.526289]  [<ffffffff813a4f59>] mount_fs+0x49/0x2c0
      [ 3020.526803]  [<ffffffff813d840c>] vfs_kern_mount+0xac/0x1b0
      [ 3020.527365]  [<ffffffff813dc27a>] do_mount+0x41a/0x1770
      [ 3020.527899]  [<ffffffff812e800d>] ? strndup_user+0x6d/0xc0
      [ 3020.528447]  [<ffffffff812e7f68>] ? memdup_user+0x78/0xb0
      [ 3020.528987]  [<ffffffff813ddad0>] SyS_mount+0x150/0x160
      [ 3020.529493]  [<ffffffff81e72b7c>] entry_SYSCALL_64_fastpath+0x1f/0xbd
      
      It turns out that we free the fs root twice: btrfs_init_fs_root() calls
      free_anon_bdev(root->anon_dev), and later btrfs_get_fs_root() calls
      free_fs_root(), which does another free_anon_bdev() and ends up with the
      above warning.
      
      Instead of resetting root->anon_dev to 0 after free_anon_bdev(), we can let
      btrfs_init_fs_root() return directly since its callers have already done
      the free job by calling free_fs_root().
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Reviewed-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: error out if generic_bin_search get invalid arguments · 5e24e9af
      Liu Bo committed
      With btrfs-corrupt-block, one can set a btree node/leaf's fields. If
      we assign a negative value to a node/leaf field, we can get various hangs,
      e.g. if extent_root's nritems is -2ULL, then we get stuck in
      btrfs_read_block_groups() because it has a while loop and
      btrfs_search_slot() on extent_root will always return the first
      child.
      
      This patch lets us know what's happening and returns -EINVAL to callers
      instead of returning the first item.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
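A bounds check of this kind can be sketched as follows. This is an illustrative binary search over a plain array, assuming the upper bound arrives as a signed int; `demo_bin_search` and its signature are hypothetical and do not mirror generic_bin_search() exactly.

```c
#include <errno.h>

/*
 * Search the sorted range keys[low..high) for key.
 * Returns 0 and sets *slot to the match, 1 and sets *slot to the
 * insertion point, or -EINVAL when the bounds are inconsistent.
 */
int demo_bin_search(const unsigned long *keys, int low, int high,
                    unsigned long key, int *slot)
{
    if (low > high)
        return -EINVAL;  /* e.g. a corrupted item count read as negative */

    while (low < high) {
        int mid = low + (high - low) / 2;

        if (keys[mid] < key) {
            low = mid + 1;
        } else if (keys[mid] > key) {
            high = mid;
        } else {
            *slot = mid;
            return 0;
        }
    }
    *slot = low;
    return 1;
}
```

Without the first check, a negative `high` skips the loop entirely and the function reports slot 0 as the insertion point, which matches the "always return the first child" behaviour the message describes.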
    • Btrfs: check inconsistence between chunk and block group · 6fb37b75
      Liu Bo committed
      With btrfs-corrupt-block, one can drop one chunk item and mounting
      will end up with a panic in btrfs_full_stripe_len().
      
      This does not remove the BUG_ON, but instead checks it a bit
      earlier when we find the block group item.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  2. 21 Jul, 2016 3 commits
  3. 08 Jul, 2016 18 commits
    • Btrfs: use FLUSH_LIMIT for relocation in reserve_metadata_bytes · 8ca17f0f
      Josef Bacik committed
      We used to allow you to set FLUSH_ALL and then just wouldn't do things like
      commit transactions or wait on ordered extents if we noticed you were in a
      transaction.  However now that all the flushing for FLUSH_ALL is asynchronous
      we've lost the ability to tell, and we could end up deadlocking.  So instead use
      FLUSH_LIMIT in reserve_metadata_bytes in relocation and then return -EAGAIN if
      we error out to preserve the previous behavior.  I've also added an ASSERT() to
      catch anybody else who tries to do this.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fill relocation block rsv after allocation · ac2fabac
      Josef Bacik committed
      Since we set the reloc control before we've reserved our space for relocation we
      could race with a root being dirtied and not actually have space to do our init
      reloc root.  So once we've allocated it and set it up go ahead and make our
      reservation before setting the relocate control, that way anybody who tries to
      do the reloc root init has space to use.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: always use trans->block_rsv for orphans · 40acc3ee
      Josef Bacik committed
      This is the case all the time anyway except for relocation which could be doing
      a reloc root for a non ref counted root, in which case we'd end up with some
      random block rsv rather than the one we have our reservation in.  If there isn't
      enough space in the block rsv we are trying to steal from we'll BUG() because we
      expect there to be space for the orphan to make its reservation.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: change how we calculate the global block rsv · ae2e4728
      Josef Bacik committed
      Traditionally we've calculated the global block rsv by guessing how much of the
      metadata used amount was the extent tree, and then taking the data size and
      figuring out how large the csum tree would have to be to hold that much data.
      
      This is imprecise and falls down on MIXED file systems as we can't trust the
      data used amount.  This resulted in failures for xfstests generic/333 because it
      creates lots of clones, which explodes out the extent tree.  Our global reserve
      calculations were woefully inaccurate in this case which meant we got into a
      situation where we did not have enough reserved to do our work.
      
      We know we only use the global block rsv for the extent, csum, and root trees,
      so just get the bytes used for these trees and use that as the basis of our
      global reserve.  Since these are not reference counted trees the bytes_used
      value will be accurate.  This fixed the transaction aborts seen with
      generic/333.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
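The new basis for the global reserve is just a sum over three trees. A minimal sketch, assuming each tree exposes a byte counter: `demo_tree` and `demo_global_rsv_basis` are hypothetical names, and any scaling or capping the real calculation applies on top of this sum is deliberately left out.

```c
#include <stdint.h>

/* Hypothetical stand-in for a tree root with an accurate usage counter. */
struct demo_tree {
    uint64_t bytes_used;
};

/* Basis of the global reserve: bytes used by the extent, csum and root trees. */
uint64_t demo_global_rsv_basis(const struct demo_tree *extent,
                               const struct demo_tree *csum,
                               const struct demo_tree *root)
{
    return extent->bytes_used + csum->bytes_used + root->bytes_used;
}
```

Because the inputs are measured rather than estimated from the data-used amount, a workload that inflates the extent tree (many clones, for instance) is reflected directly in the result.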
    • Btrfs: use root when checking need_async_flush · 87241c2e
      Josef Bacik committed
      Instead of doing fs_info->fs_root in need_async_flush, which may not be set
      during recovery when mounting, just pass the root itself in, which makes more
      sense as that's what btrfs_calc_reclaim_metadata_size takes.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reported-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: don't bother kicking async if there's nothing to reclaim · d38b349c
      Josef Bacik committed
      We do this check when we start the async reclaimer thread, might as well check
      before we kick it off to save us some cycles.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix release reserved extents trace points · 31bada7c
      Josef Bacik committed
      We were doing trace_btrfs_release_reserved_extent() in pin_down_extent which
      isn't quite right because we will go through and free that extent later when we
      unpin, so it messes up apps that are accounting for the reservation space.  We
      were also unconditionally doing it in __btrfs_free_reserved_extent(), when we
      only actually free the reservation instead of pinning the extent.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: add fsid to some tracepoints · dce3afa5
      Josef Bacik committed
      When tracing enospc problems on a box with multiple file systems mounted I need
      to be able to differentiate between the two file systems.  Most of the important
      trace points I'm looking at already have an fsid, but the reserved extent trace
      points do not, so add that to make it possible to figure out which trace point
      belongs to which file system.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: add tracepoints for flush events · f376df2b
      Josef Bacik committed
      We want to track when we're triggering flushing from our reservation code and
      what flushing is being done when we start flushing.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix delalloc reservation amount tracepoint · f485c9ee
      Josef Bacik committed
      We can sometimes drop the reservation we had for our inode, so we need to remove
      that amount from to_reserve so that our tracepoint reports a valid amount of
      space.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: trace pinned extents · c51e7bb1
      Josef Bacik committed
      Pinned extents are an important metric to keep track of for enospc.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: introduce ticketed enospc infrastructure · 957780eb
      Josef Bacik committed
      Our enospc flushing sucks.  It is born from a time where we were early
      enospc'ing constantly because multiple threads would race in for the same
      reservation and randomly starve other ones out.  So I came up with this solution
      to block any other reservations from happening while one guy tried to flush
      stuff to satisfy his reservation.  This gives us pretty good correctness, but
      completely crap latency.
      
      The solution I've come up with is ticketed reservations.  Basically we try to
      make our reservation, and if we can't we put a ticket on a list in order and
      kick off an async flusher thread.  This async flusher thread does the same old
      flushing we always did, just asynchronously.  As space is freed and added back
      to the space_info it checks and sees if we have any tickets that need
      satisfying, and adds space to the tickets and wakes up anything we've satisfied.
      
      Once the flusher thread stops making progress it wakes up all the current
      tickets and tells them to take a hike.
      
      There is a priority list for things that can't flush: since the async flusher
      could do anything, we need to avoid deadlocks.  These guys get priority for
      having their reservation made, and will still do manual flushing themselves in
      case the async flusher isn't running.
      
      This patch gives us significantly better latencies.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
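The ticket flow described in this message can be modelled in a few functions. The sketch below is a single-threaded user-space toy, not the btrfs code: all `demo_*` names are hypothetical, the flusher thread is replaced by explicit calls, and the priority list is left out.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_TICKETS 8

struct demo_ticket {
    uint64_t bytes;  /* bytes still needed */
    int error;       /* 0, or -ENOSPC once the ticket has been failed */
    int done;        /* nonzero once satisfied or failed */
};

struct demo_space_info {
    uint64_t free_bytes;
    struct demo_ticket *queue[DEMO_MAX_TICKETS];  /* circular FIFO */
    size_t head, count;
};

/* Try to reserve now: returns 1 if satisfied, 0 if queued, -ENOSPC if full. */
int demo_reserve(struct demo_space_info *si, struct demo_ticket *t,
                 uint64_t bytes)
{
    t->bytes = bytes;
    t->error = 0;
    t->done = 0;
    /* Only skip the queue when nobody is already waiting in line. */
    if (si->count == 0 && si->free_bytes >= bytes) {
        si->free_bytes -= bytes;
        t->bytes = 0;
        t->done = 1;
        return 1;
    }
    if (si->count == DEMO_MAX_TICKETS) {
        t->error = -ENOSPC;
        t->done = 1;
        return -ENOSPC;
    }
    si->queue[(si->head + si->count) % DEMO_MAX_TICKETS] = t;
    si->count++;
    return 0;
}

/* Space was freed: hand it to waiting tickets in order, partially if needed. */
void demo_add_space(struct demo_space_info *si, uint64_t bytes)
{
    si->free_bytes += bytes;
    while (si->count > 0 && si->free_bytes > 0) {
        struct demo_ticket *t = si->queue[si->head];

        if (si->free_bytes < t->bytes) {
            t->bytes -= si->free_bytes;  /* partial fill, keep waiting */
            si->free_bytes = 0;
            break;
        }
        si->free_bytes -= t->bytes;
        t->bytes = 0;
        t->done = 1;                     /* would wake the waiter */
        si->head = (si->head + 1) % DEMO_MAX_TICKETS;
        si->count--;
    }
}

/* The flusher stopped making progress: fail every remaining ticket. */
void demo_fail_all(struct demo_space_info *si)
{
    while (si->count > 0) {
        struct demo_ticket *t = si->queue[si->head];

        t->error = -ENOSPC;
        t->done = 1;
        si->head = (si->head + 1) % DEMO_MAX_TICKETS;
        si->count--;
    }
}
```

Serving tickets strictly in arrival order is what keeps a small late reservation from starving an earlier, larger one, which is the fairness problem the old blocking scheme was built to solve.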
    • Btrfs: add tracepoint for adding block groups · c83f8eff
      Josef Bacik committed
      I'm writing a tool to visualize the enospc system inside btrfs; I need this
      tracepoint in order to keep track of the block groups in the system.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: warn_on for unaccounted spaces · d555b6c3
      Josef Bacik committed
      These were hidden behind enospc_debug, which isn't helpful as they indicate
      actual bugs, unlike the rest of the enospc_debug stuff which is really debug
      information.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: change delayed reservation fallback behavior · c48f49d6
      Josef Bacik committed
      We reserve space for the inode update when we first reserve space for writing to
      a file.  However there are lots of ways that we can use this reservation and not
      have it for subsequent ordered extents.  Previously we'd fall through and try to
      reserve metadata bytes for this, then we'd just steal the full reservation from
      the delalloc_block_rsv, and if that didn't have enough space we'd steal the full
      reservation from the global reserve.  The problem with this is we can easily
      just return ENOSPC and fall back to updating the inode item directly.  In the
      worst case (assuming 4k nodesize) we'd steal 64KiB from the global reserve if we
      fall all the way through, however if we just fall back and update the inode
      directly we'd only steal 4k * BTRFS_PATH_MAX in the worst case which is 32KiB.
      
      We would have also just added the extent item for the inode so we likely will
      have already cow'ed down most of the way to the leaf containing the inode item,
      so more often than not we only need one or two nodesizes' worth of
      reservations.  Given the reservation for the extent itself is also a worst case
      we will likely already have space to cover the inode update.
      
      This change will make us behave better in the theoretical worst case, and much
      better in the case that we don't have our reservation and cannot reserve more
      metadata.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
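The 64KiB and 32KiB figures can be reproduced with a short calculation. This sketch assumes a maximum tree height of 8 (the multiplier the message calls BTRFS_PATH_MAX) and a factor of 2 in the full per-item reservation; both values and the `demo_*` names are assumptions chosen to match the numbers quoted above.

```c
#include <stdint.h>

#define DEMO_MAX_LEVEL 8  /* assumed maximum tree height */

/* Full worst-case reservation for num_items (assumed: two nodes per level). */
uint64_t demo_full_reservation(uint64_t nodesize, unsigned int num_items)
{
    return nodesize * DEMO_MAX_LEVEL * 2 * num_items;
}

/* Worst case for updating the inode item directly: one node per level. */
uint64_t demo_direct_update_cost(uint64_t nodesize)
{
    return nodesize * DEMO_MAX_LEVEL;
}
```

With a 4096-byte nodesize the first function yields 65536 bytes for one item and the second yields 32768 bytes, so the fallback halves what is taken from the global reserve in the worst case.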
    • Btrfs: always reserve metadata for delalloc extents · 48c3d480
      Josef Bacik committed
      There are a few races in the metadata reservation stuff.  First we add the bytes
      to the block_rsv well after we've set the bit on the inode saying that we have
      space for it and after we've reserved the bytes.  So use the normal
      btrfs_block_rsv_add helper for this case.  Secondly we can flush delalloc
      extents when we try to reserve space for our write, which means that we could
      have used up the space for the inode and we wouldn't know because we only check
      before the reservation.  So instead make sure we are always reserving space for
      the inode update, and then if we don't need it release those bytes afterward.
      Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik committed
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency's sake.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
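A migrate helper with an update_size argument might look like the sketch below. It is a simplified model with a hypothetical `demo_block_rsv` struct holding only two counters; locking and the other fields of the real block reserve are omitted.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-field model of a block reserve. */
struct demo_block_rsv {
    uint64_t size;      /* how much the reserve is meant to hold */
    uint64_t reserved;  /* how much it currently holds */
};

/* Move num_bytes from src to dst; grow dst's target size only on request. */
int demo_block_rsv_migrate(struct demo_block_rsv *src,
                           struct demo_block_rsv *dst,
                           uint64_t num_bytes, bool update_size)
{
    if (src->reserved < num_bytes)
        return -ENOSPC;
    src->reserved -= num_bytes;
    dst->reserved += num_bytes;
    if (update_size)
        dst->size += num_bytes;
    return 0;
}
```

A truncate-style caller would pass `false` so the destination keeps its fixed target size no matter how many times bytes are migrated into it.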
    • Btrfs: add bytes_readonly to the spaceinfo at once · e40edf2d
      Josef Bacik committed
      For some reason we're adding bytes_readonly to the space info after we update
      the space info with the block group info.  This creates a tiny race where we
      could over-reserve space because we haven't yet taken out the bytes_readonly
      bit.  Since we already know this information at the time we call
      update_space_info, just pass it along so it can be updated all at once.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  4. 04 Jul, 2016 3 commits
  5. 03 Jul, 2016 6 commits