1. 21 8月, 2017 5 次提交
    • L
      Btrfs: add one more sanity check for shared ref type · 64ecdb64
      Liu Bo 提交于
      Every shared ref has a parent tree block, which can be get from
      btrfs_extent_inline_ref_offset().  And the tree block must be aligned
      to the nodesize, so we'd know this inline ref is not valid if this
      block's bytenr is not aligned to the nodesize, in which case, most
      likely the ref type has been misused.
      
      This adds the above mentioned check and also updates
      print_extent_item() called by btrfs_print_leaf() to point out the
      invalid ref while printing the tree structure.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      64ecdb64
    • L
      Btrfs: convert to use btrfs_get_extent_inline_ref_type · 3de28d57
      Liu Bo 提交于
      Since we have a helper which can do sanity check, this converts all
      btrfs_extent_inline_ref_type to it.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3de28d57
    • L
      Btrfs: add a helper to retrive extent inline ref type · 167ce953
      Liu Bo 提交于
      An invalid value of extent inline ref type may be read from a
      malicious image which may force btrfs to crash.
      
      This adds a helper which does sanity check for the ref type, so we can
      know if it's sane, return he type, otherwise return an error.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minimal tweak const types, causing warnings due to other cleanup patches ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      167ce953
    • N
      btrfs: Remove chunk_objectid argument from btrfs_make_block_group · 0174484d
      Nikolay Borisov 提交于
      btrfs_make_block_group is always called with chunk_objectid set to
      BTRFS_FIRST_CHUNK_TREE_OBJECTID. There's no reason why this behavior will
      change anytime soon, so let's remove the argument and decrease the cognitive
      load when reading the code path. No functional change
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0174484d
    • H
      btrfs: Do not use data_alloc_cluster in ssd mode · 583b7231
      Hans van Kranenburg 提交于
          This patch provides a band aid to improve the 'out of the box'
      behaviour of btrfs for disks that are detected as being an ssd.  In a
      general purpose mixed workload scenario, the current ssd mode causes
      overallocation of available raw disk space for data, while leaving
      behind increasing amounts of unused fragmented free space. This
      situation leads to early ENOSPC problems which are harming user
      experience and adoption of btrfs as a general purpose filesystem.
      
      This patch modifies the data extent allocation behaviour of the ssd mode
      to make it behave identical to nossd mode.  The metadata behaviour and
      additional ssd_spread option stay untouched so far.
      
      Recommendations for future development are to reconsider the current
      oversimplified nossd / ssd distinction and the broken detection
      mechanism based on the rotational attribute in sysfs and provide
      experienced users with a more flexible way to choose allocator behaviour
      for data and metadata, optimized for certain use cases, while keeping
      sane 'out of the box' default settings.  The internals of the current
      btrfs code have more potential than what currently gets exposed to the
      user to choose from.
      
          The SSD story...
      
          In the first year of btrfs development, around early 2008, btrfs
      gained a mount option which enables specific functionality for
      filesystems on solid state devices. The first occurance of this
      functionality is in commit e18e4809, labeled "Add mount -o ssd, which
      includes optimizations for seek free storage".
      
      The effect on allocating free space for doing (data) writes is to
      'cluster' writes together, writing them out in contiguous space, as
      opposed to a 'tetris' way of putting all separate writes into any free
      space fragment that fits (which is what the -o nossd behaviour does).
      
      A somewhat simplified explanation of what happens is that, when for
      example, the 'cluster' size is set to 2MiB, when we do some writes, the
      data allocator will search for a free space block that is 2MiB big, and
      put the writes in there. The ssd mode itself might allow a 2MiB cluster
      to be composed of multiple free space extents with some existing data in
      between, while the additional ssd_spread mount option kills off this
      option and requires fully free space.
      
      The idea behind this is (commit 536ac8ae): "The [...] clusters make it
      more likely a given IO will completely overwrite the ssd block, so it
      doesn't have to do an internal rwm cycle."; ssd block meaning nand erase
      block. So, effectively this means applying a "locality based algorithm"
      and trying to outsmart the actual ssd.
      
      Since then, various changes have been made to the involved code, but the
      basic idea is still present, and gets activated whenever the ssd mount
      option is active. This also happens by default, when the rotational flag
      as seen at /sys/block/<device>/queue/rotational is set to 0.
      
          However, there's a number of problems with this approach.
      
          First, what the optimization is trying to do is outsmart the ssd by
      assuming there is a relation between the physical address space of the
      block device as seen by btrfs and the actual physical storage of the
      ssd, and then adjusting data placement. However, since the introduction
      of the Flash Translation Layer (FTL) which is a part of the internal
      controller of an ssd, these attempts are futile. The use of good quality
      FTL in consumer ssd products might have been limited in 2008, but this
      situation has changed drastically soon after that time. Today, even the
      flash memory in your automatic cat feeding machine or your grandma's
      wheelchair has a full featured one.
      
      Second, the behaviour as described above results in the filesystem being
      filled up with badly fragmented free space extents because of relatively
      small pieces of space that are freed up by deletes, but not selected
      again as part of a 'cluster'. Since the algorithm prefers allocating a
      new chunk over going back to tetris mode, the end result is a filesystem
      in which all raw space is allocated, but which is composed of
      underutilized chunks with a 'shotgun blast' pattern of fragmented free
      space. Usually, the next problematic thing that happens is the
      filesystem wanting to allocate new space for metadata, which causes the
      filesystem to fail in spectacular ways.
      
      Third, the default mount options you get for an ssd ('ssd' mode enabled,
      'discard' not enabled), in combination with spreading out writes over
      the full address space and ignoring freed up space leads to worst case
      behaviour in providing information to the ssd itself, since it will
      never learn that all the free space left behind is actually free.  There
      are two ways to let an ssd know previously written data does not have to
      be preserved, which are sending explicit signals using discard or
      fstrim, or by simply overwriting the space with new data.  The worst
      case behaviour is the btrfs ssd_spread mount option in combination with
      not having discard enabled. It has a side effect of minimizing the reuse
      of free space previously written in.
      
      Fourth, the rotational flag in /sys/ does not reliably indicate if the
      device is a locally attached ssd. For example, iSCSI or NBD displays as
      non-rotational, while a loop device on an ssd shows up as rotational.
      
      The combination of the second and third problem effectively means that
      despite all the good intentions, the btrfs ssd mode reliably causes the
      ssd hardware and the filesystem structures and performance to be choked
      to death. The clickbait version of the title of this story would have
      been "Btrfs ssd optimizations considered harmful for ssds".
      
      The current nossd 'tetris' mode (even still without discard) allows a
      pattern of overwriting much more previously used space, causing many
      more implicit discards to happen because of the overwrite information
      the ssd gets. The actual location in the physical address space, as seen
      from the point of view of btrfs is irrelevant, because the actual writes
      to the low level flash are reordered anyway thanks to the FTL.
      
          Changes made in the code
      
      1. Make ssd mode data allocation identical to tetris mode, like nossd.
      2. Adjust and clean up filesystem mount messages so that we can easily
      identify if a kernel has this patch applied or not, when providing
      support to end users. Also, make better use of the *_and_info helpers to
      only trigger messages on actual state changes.
      
          Backporting notes
      
      Notes for whoever wants to backport this patch to their 4.9 LTS kernel:
      * First apply commit 951e7966 "btrfs: drop the nossd flag when
        remounting with -o ssd", or fixup the differences manually.
      * The rest of the conflicts are because of the fs_info refactoring. So,
        for example, instead of using fs_info, it's root->fs_info in
        extent-tree.c
      Signed-off-by: NHans van Kranenburg <hans.van.kranenburg@mendix.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      583b7231
  2. 18 8月, 2017 1 次提交
  3. 16 8月, 2017 10 次提交
  4. 24 7月, 2017 2 次提交
    • O
      Btrfs: fix early ENOSPC due to delalloc · 17024ad0
      Omar Sandoval 提交于
      If a lot of metadata is reserved for outstanding delayed allocations, we
      rely on shrink_delalloc() to reclaim metadata space in order to fulfill
      reservation tickets. However, shrink_delalloc() has a shortcut where if
      it determines that space can be overcommitted, it will stop early. This
      made sense before the ticketed enospc system, but now it means that
      shrink_delalloc() will often not reclaim enough space to fulfill any
      tickets, leading to an early ENOSPC. (Reservation tickets don't care
      about being able to overcommit, they need every byte accounted for.)
      
      Fix it by getting rid of the shortcut so that shrink_delalloc() reclaims
      all of the metadata it is supposed to. This fixes early ENOSPCs we were
      seeing when doing a btrfs receive to populate a new filesystem, as well
      as early ENOSPCs Christoph saw when doing a big cp -r onto Btrfs.
      
      Fixes: 957780eb ("Btrfs: introduce ticketed enospc infrastructure")
      Tested-by: NChristoph Anton Mitterer <mail@christoph.anton.mitterer.name>
      Cc: stable@vger.kernel.org
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      17024ad0
    • J
      btrfs: fix lockup in find_free_extent with read-only block groups · 14443937
      Jeff Mahoney 提交于
      If we have a block group that is all of the following:
      1) uncached in memory
      2) is read-only
      3) has a disk cache state that indicates we need to recreate the cache
      
      AND the file system has enough free space fragmentation such that the
      request for an extent of a given size can't be honored;
      
      AND have a single CPU core;
      
      AND it's the block group with the highest starting offset such that
      there are no opportunities (like reading from disk) for the loop to
      yield the CPU;
      
      We can end up with a lockup.
      
      The root cause is simple.  Once we're in the position that we've read in
      all of the other block groups directly and none of those block groups
      can honor the request, there are no more opportunities to sleep.  We end
      up trying to start a caching thread which never gets run if we only have
      one core.  This *should* present as a hung task waiting on the caching
      thread to make some progress, but it doesn't.  Instead, it degrades into
      a busy loop because of the placement of the read-only check.
      
      During the first pass through the loop, block_group->cached will be set
      to BTRFS_CACHE_STARTED and have_caching_bg will be set.  Then we hit the
      read-only check and short circuit the loop.  We're not yet in
      LOOP_CACHING_WAIT, so we skip that loop back before going through the
      loop again for other raid groups.
      
      Then we move to LOOP_CACHING_WAIT state.
      
      During the this pass through the loop, ->cached will still be
      BTRFS_CACHE_STARTED, which means it's not cached, so we'll enter
      cache_block_group, do a lot of nothing, and return, and also set
      have_caching_bg again.  Then we hit the read-only check and short circuit
      the loop.  The same thing happens as before except now we DO trigger
      the LOOP_CACHING_WAIT && have_caching_bg check and loop back up to the
      top.  We do this forever.
      
      There are two fixes in this patch since they address the same underlying
      bug.
      
      The first is to add a cond_resched to the end of the loop to ensure
      that the caching thread always has an opportunity to run.  This will
      fix the soft lockup issue, but find_free_extent will still loop doing
      nothing until the thread has completed.
      
      The second is to move the read-only check to the top of the loop.  We're
      never going to return an allocation within a read-only block group so
      we may as well skip it early.  The check for ->cached == BTRFS_CACHE_ERROR
      would cause the same problem except that BTRFS_CACHE_ERROR is considered
      a "done" state and we won't re-set have_caching_bg again.
      
      Many thanks to Stephan Kulow <coolo@suse.de> for his excellent help in
      the testing process.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      14443937
  5. 30 6月, 2017 10 次提交
  6. 20 6月, 2017 7 次提交
  7. 01 6月, 2017 2 次提交
    • J
      btrfs: fix race with relocation recovery and fs_root setup · a9b3311e
      Jeff Mahoney 提交于
      If we have to recover relocation during mount, we'll ultimately have to
      evict the orphan inode.  That goes through the reservation dance, where
      priority_reclaim_metadata_space and flush_space expect fs_info->fs_root
      to be valid.  That's the next thing to be set up during mount, so we
      crash, almost always in flush_space trying to join the transaction
      but priority_reclaim_metadata_space is possible as well.  This call
      path has been problematic in the past WRT whether ->fs_root is valid
      yet.  Commit 957780eb (Btrfs: introduce ticketed enospc
      infrastructure) added new users that are called in the direct path
      instead of the async path that had already been worked around.
      
      The thing is that we don't actually need the fs_root, specifically, for
      anything.  We either use it to determine whether the root is the
      chunk_root for use in choosing an allocation profile or as a root to pass
      btrfs_join_transaction before immediately committing it.  Anything that
      isn't the chunk root works in the former case and any root works in
      the latter.
      
      A simple fix is to use a root we know will always be there: the
      extent_root.
      
      Cc: <stable@vger.kernel.org> # v4.8+
      Fixes: 957780eb (Btrfs: introduce ticketed enospc infrastructure)
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a9b3311e
    • J
      btrfs: fix memory leak in update_space_info failure path · 896533a7
      Jeff Mahoney 提交于
      If we fail to add the space_info kobject, we'll leak the memory
      for the percpu counter.
      
      Fixes: 6ab0a202 (btrfs: publish allocation data in sysfs)
      Cc: <stable@vger.kernel.org> # v3.14+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      896533a7
  8. 18 4月, 2017 3 次提交