1. 21 Jan, 2015 (3 commits)
  2. 11 Dec, 2014 (1 commit)
    • Btrfs: fix fs corruption on transaction abort if device supports discard · 678886bd
      Filipe Manana authored
      When we abort a transaction we iterate over all the ranges marked as dirty
      in fs_info->freed_extents[0] and fs_info->freed_extents[1], clear them
      from those trees, add them back (unpin) to the free space caches and, if
      the fs was mounted with "-o discard", perform a discard on those regions.
      Also, after adding the regions to the free space caches, a fitrim ioctl call
      can see those ranges in a block group's free space cache and perform a discard
      on the ranges, so the same issue can happen without "-o discard" as well.
      
      This causes corruption, affecting one or multiple btree nodes (in the worst
      case leaving the fs unmountable) because some of those ranges (the ones in
      the fs_info->pinned_extents tree) correspond to btree nodes/leafs that are
      referred by the last committed super block - breaking the rule that anything
      that was committed by a transaction is untouched until the next transaction
      commits successfully.
      
      I ran into this while running in a loop (for several hours) the fstest that
      I recently submitted:
      
        [PATCH] fstests: add btrfs test to stress chunk allocation/removal and fstrim
      
      The corruption always happened when a transaction aborted and then fsck complained
      like this:
      
         _check_btrfs_filesystem: filesystem on /dev/sdc is inconsistent
         *** fsck.btrfs output ***
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         Check tree block failed, want=94945280, have=0
         read block failed check_tree_block
         Couldn't open file system
      
      In this case 94945280 corresponded to the root of a tree.
      Using ftrace, what I observed was the following sequence of steps:
      
         1) transaction N started, fs_info->pinned_extents pointed to
            fs_info->freed_extents[0];
      
         2) node/eb 94945280 is created;
      
         3) eb is persisted to disk;
      
         4) transaction N commit starts, fs_info->pinned_extents now points to
            fs_info->freed_extents[1], and transaction N completes;
      
         5) transaction N + 1 starts;
      
         6) eb is COWed, and btrfs_free_tree_block() called for this eb;
      
         7) eb range (94945280 to 94945280 + 16Kb) is added to
            fs_info->pinned_extents (fs_info->freed_extents[1]);
      
         8) Something goes wrong in transaction N + 1, like hitting ENOSPC
            for example, and the transaction is aborted, turning the fs into
            readonly mode. The stack trace I got for example:
      
            [112065.253935]  [<ffffffff8140c7b6>] dump_stack+0x4d/0x66
            [112065.254271]  [<ffffffff81042984>] warn_slowpath_common+0x7f/0x98
            [112065.254567]  [<ffffffffa0325990>] ? __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.261674]  [<ffffffff810429e5>] warn_slowpath_fmt+0x48/0x50
            [112065.261922]  [<ffffffffa032949e>] ? btrfs_free_path+0x26/0x29 [btrfs]
            [112065.262211]  [<ffffffffa0325990>] __btrfs_abort_transaction+0x50/0x10b [btrfs]
            [112065.262545]  [<ffffffffa036b1d6>] btrfs_remove_chunk+0x537/0x58b [btrfs]
            [112065.262771]  [<ffffffffa033840f>] btrfs_delete_unused_bgs+0x1de/0x21b [btrfs]
            [112065.263105]  [<ffffffffa0343106>] cleaner_kthread+0x100/0x12f [btrfs]
            (...)
            [112065.264493] ---[ end trace dd7903a975a31a08 ]---
            [112065.264673] BTRFS: error (device sdc) in btrfs_remove_chunk:2625: errno=-28 No space left
            [112065.264997] BTRFS info (device sdc): forced readonly
      
         9) The cleaner kthread sees that the BTRFS_FS_STATE_ERROR bit is set in
            fs_info->fs_state and calls btrfs_cleanup_transaction(), which in
            turn calls btrfs_destroy_pinned_extent();
      
         10) Then btrfs_destroy_pinned_extent() iterates over all the ranges
             marked as dirty in fs_info->freed_extents[], and for each one
             it calls discard, if the fs was mounted with "-o discard", and
             adds the range to the free space cache of the respective block
             group;
      
         11) btrfs_trim_block_group(), invoked from the fitrim ioctl code path,
             sees the free space entries and performs a discard;
      
         12) After an umount and mount (or fsck), our eb's location on disk was full
             of zeroes, but it should have been untouched, because it was marked as
             dirty in the fs_info->pinned_extents tree, and was therefore used by the
             trees that the last committed superblock points to.
      
      Fix this by not performing a discard and not adding the ranges to the free space
      caches - it's useless from this point on, since the fs is now in readonly mode and
      we won't write free space caches to disk anymore (otherwise we would leak space),
      nor write any new superblock. Not adding the ranges to the free space caches also
      prevents other code paths from allocating that space and writing to it, which is
      safer and simpler.
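      
      A minimal user-space sketch of that idea (the structure and helper names below are
      illustrative assumptions, not the actual btrfs code): when cleaning up after an
      abort, the recorded pinned ranges are simply dropped, without re-adding them to a
      free space cache and without issuing discards:
      
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct pinned_range {
                unsigned long long start;
                unsigned long long len;
                struct pinned_range *next;
        };

        /* Model of the abort-time cleanup: walk the ranges recorded as pinned
         * and just forget them, leaving the on-disk blocks (still referenced
         * by the last committed superblock) untouched. */
        static void destroy_pinned_ranges(struct pinned_range **head, bool aborted)
        {
                struct pinned_range *r = *head;

                while (r) {
                        struct pinned_range *next = r->next;

                        if (!aborted)   /* normal path: return space, maybe discard */
                                printf("unpin + discard %llu+%llu\n", r->start, r->len);
                        free(r);
                        r = next;
                }
                *head = NULL;
        }

        int main(void)
        {
                struct pinned_range *head = malloc(sizeof(*head));

                head->start = 94945280ULL;
                head->len = 16 * 1024;
                head->next = NULL;
                destroy_pinned_ranges(&head, true);     /* transaction aborted */
                return 0;
        }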
      
      This isn't a new problem, as it's been present since 2011 (git commit
      acce952b).
      
      Cc: stable@vger.kernel.org  # any kernel released after 2011-01-06
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      678886bd
  3. 03 Dec, 2014 (1 commit)
    • Btrfs: fix race between fs trimming and block group remove/allocation · 04216820
      Filipe Manana authored
      Our fs trim operation, which is completely transactionless (it doesn't start
      or join an existing transaction), consists of visiting all block groups
      and then, for each one, iterating its free space entries and performing a
      discard operation against the space range represented by each free space
      entry. However, before performing a discard, the corresponding free space
      entry is removed from the free space rbtree, and when the discard completes
      it is added back to the free space rbtree.
      
      If a block group remove operation happens while the discard is ongoing (or
      before it starts and after a free space entry is hidden), we end up not
      waiting for the discard to complete, removing the extent map that maps
      logical addresses to physical addresses and the corresponding chunk metadata
      from the chunk and device trees. After that, and before the discard
      completes, the currently running transaction can finish and a new one start,
      allowing new block groups that map to the same physical addresses to
      be allocated and written to.
      
      So fix this by keeping the extent map in memory until the discard completes
      so that the same physical addresses aren't reused before it completes.
      
      If the physical locations that are under a discard operation end up being
      used for a new metadata block group for example, and dirty metadata extents
      are written before the discard finishes (the VM might call writepages() of
      our btree inode's i_mapping for example, or an fsync log commit happens) we
      end up overwriting metadata with zeroes, which leads to errors from fsck
      like the following:
      
              checking extents
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              owner ref check failed [833912832 16384]
              Errors found in extent allocation tree or chunk allocation
              checking free space cache
              checking fs roots
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              Check tree block failed, want=833912832, have=0
              read block failed check_tree_block
              root 5 root dir 256 error
              root 5 inode 260 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_3 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 262 errors 2001, no inode item, link count wrong
                      unresolved ref dir 256 index 0 namelen 8 name foobar_5 filetype 1 errors 6, no dir index, no inode ref
              root 5 inode 263 errors 2001, no inode item, link count wrong
              (...)
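      
      To make the fix above concrete, here is a small user-space model (the field and
      function names, and the simple in-flight counter, are illustrative assumptions
      rather than the actual btrfs code) of deferring the extent map removal while a
      trim is still running:
      
        #include <stdio.h>

        struct extent_map { unsigned long long start, len; };

        struct block_group {
                struct extent_map *em;
                int trimming;           /* number of in-flight trim operations */
                int removed;            /* block group was deleted meanwhile   */
        };

        static void remove_block_group(struct block_group *bg)
        {
                bg->removed = 1;
                if (bg->trimming == 0)          /* safe: no discard in flight */
                        bg->em = NULL;          /* drop the mapping now        */
                /* else: the trim path drops it when the discard completes */
        }

        /* in the kernel this runs concurrently with the remove path */
        static void trim_block_group(struct block_group *bg)
        {
                bg->trimming++;
                printf("discarding %llu+%llu\n", bg->em->start, bg->em->len);
                bg->trimming--;
                if (bg->removed && bg->trimming == 0)
                        bg->em = NULL;          /* deferred removal            */
        }

        int main(void)
        {
                struct extent_map em = { 833912832ULL, 16 * 1024 };
                struct block_group bg = { &em, 0, 0 };

                trim_block_group(&bg);
                remove_block_group(&bg);
                return 0;
        }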
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      04216820
  4. 22 Nov, 2014 (1 commit)
    • Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik authored
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are safe, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
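      
      The fix can be modeled in user space roughly like this (the list handling and
      names are assumptions for illustration, not the kernel implementation): every
      ordered extent that gets logged is also recorded against the running
      transaction, and the commit waits for all of them before it finishes:
      
        #include <stdio.h>

        #define MAX_LOGGED 16

        struct ordered_extent { unsigned long long start, len; int complete; };

        /* per-transaction list of ordered extents that were logged */
        static struct ordered_extent *logged[MAX_LOGGED];
        static int nr_logged;

        static void log_one_extent(struct ordered_extent *oe)
        {
                /* write the extent item into the log tree (elided), then
                 * remember it so the transaction commit can wait on it */
                logged[nr_logged++] = oe;
        }

        static void commit_transaction(void)
        {
                for (int i = 0; i < nr_logged; i++) {
                        while (!logged[i]->complete)
                                ;       /* a wait_event() in the real code */
                }
                nr_logged = 0;
                printf("commit: all logged ordered extents completed\n");
        }

        int main(void)
        {
                struct ordered_extent oe = { 0, 4096, 0 };

                log_one_extent(&oe);
                oe.complete = 1;        /* the ordered extent finishes */
                commit_transaction();
                return 0;
        }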
      
      cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      50d9aa99
  5. 21 Nov, 2014 (1 commit)
  6. 12 Nov, 2014 (2 commits)
    • btrfs: switch inode_cache option handling to pending changes · 7e1876ac
      David Sterba authored
      The pending mount option(s) now share namespace and bits with the normal
      options, and the existing one for (inode_cache) is unset unconditionally
      at each transaction commit.
      
      Introduce a separate namespace for pending changes and enhance the
      descriptions of the intended change to use separate bits for each
      action.
      Signed-off-by: David Sterba <dsterba@suse.cz>
      7e1876ac
    • btrfs: add support for processing pending changes · 572d9ab7
      David Sterba authored
      There are some actions that modify global filesystem state but cannot be
      performed at the time of request; they are applied later, at transaction
      commit time, when the filesystem is in a known state.
      
      For example, enabling new incompat features on the fly or issuing a
      transaction commit from unsafe contexts (sysfs handlers).
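      
      A user-space sketch of the mechanism (the bit names and the apply step are
      assumptions for illustration, not the exact kernel identifiers): a request only
      records a pending bit, and the transaction commit later turns the recorded bits
      into actual changes of filesystem state:
      
        #include <stdio.h>

        /* pending-change bits, one per deferred action (names are made up) */
        enum {
                PENDING_SET_INODE_CACHE   = 1 << 0,
                PENDING_CLEAR_INODE_CACHE = 1 << 1,
        };

        static unsigned long pending_changes;   /* set from unsafe contexts */
        static unsigned long mount_opts;        /* real filesystem state    */

        static void request_pending_change(unsigned long bit)
        {
                pending_changes |= bit;         /* cheap, no transaction needed */
        }

        /* called from the transaction commit, when the fs is in a known state */
        static void apply_pending_changes(void)
        {
                if (pending_changes & PENDING_SET_INODE_CACHE)
                        mount_opts |= 1;
                if (pending_changes & PENDING_CLEAR_INODE_CACHE)
                        mount_opts &= ~1UL;
                pending_changes = 0;
        }

        int main(void)
        {
                request_pending_change(PENDING_SET_INODE_CACHE);
                apply_pending_changes();
                printf("inode_cache %s\n", (mount_opts & 1) ? "on" : "off");
                return 0;
        }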
      Signed-off-by: David Sterba <dsterba@suse.cz>
      572d9ab7
  7. 28 Oct, 2014 (1 commit)
  8. 04 Oct, 2014 (2 commits)
    • btrfs: add more superblock checks · c926093e
      David Sterba authored
      Populate btrfs_check_super_valid() with checks that try to verify the
      consistency of the superblock using additional conditions that may arise
      from corrupted devices or bitflips. Some of the tests are only hints and
      issue warnings instead of failing the mount, basically when the checks are
      derived from the data found in the superblock.
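      
      The flavour of these checks can be sketched as follows (the specific conditions
      and limits below are illustrative examples, not the exact list used by
      btrfs_check_super_valid()): hard inconsistencies fail the mount, while values
      that are merely suspicious only produce a warning:
      
        #include <stdio.h>

        struct super_model {
                unsigned int sectorsize;
                unsigned int nodesize;
                unsigned long long total_bytes;
                unsigned long long bytes_used;
        };

        static int is_power_of_2(unsigned int v)
        {
                return v && !(v & (v - 1));
        }

        static int check_super(const struct super_model *sb)
        {
                /* hard errors: refuse to mount */
                if (!is_power_of_2(sb->sectorsize) || !is_power_of_2(sb->nodesize)) {
                        fprintf(stderr, "invalid sector/node size\n");
                        return -1;
                }
                /* soft checks derived from superblock data: warn only */
                if (sb->bytes_used > sb->total_bytes)
                        fprintf(stderr, "warning: bytes_used beyond total_bytes\n");
                return 0;
        }

        int main(void)
        {
                struct super_model sb = { 4096, 16384, 1ULL << 30, 1ULL << 20 };

                return check_super(&sb) ? 1 : 0;
        }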
      
      Tested on a broken image provided by Qu.
      Reported-by: Qu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Chris Mason <clm@fb.com>
      c926093e
    • Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Filipe Manana authored
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this later case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the 3 new flags for the btree inode also covers the case where
      writepages() returns success, writeback was started for all dirty pages,
      and, before filemap_fdatawait_range() is called, that writeback had already
      finished with errors - because we were not using AS_EIO/AS_ENOSPC,
      filemap_fdatawait_range() would return success, as it could not know
      that writeback errors happened (the pages were no longer tagged for
      writeback).
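      
      The mechanism can be modeled like this (the flag names and functions are
      assumptions for illustration, not the flags added by this patch): the writeback
      end-io path records the error in a persistent per-inode flag depending on
      whether the extent belongs to a log tree, and the transaction commit checks the
      non-log flag and aborts if it is set:
      
        #include <stdio.h>

        /* illustrative per-btree-inode error flags */
        enum {
                BTREE_ERR     = 1 << 0,   /* error on a non-log tree extent */
                BTREE_LOG_ERR = 1 << 1,   /* error on a log tree extent     */
        };

        static unsigned long btree_ino_flags;

        static void btree_write_endio(int err, int is_log_extent)
        {
                if (err)
                        btree_ino_flags |= is_log_extent ? BTREE_LOG_ERR : BTREE_ERR;
        }

        static int commit_transaction(void)
        {
                if (btree_ino_flags & BTREE_ERR) {
                        fprintf(stderr, "abort: btree writeback failed earlier\n");
                        return -1;      /* never commit a superblock pointing
                                           at garbage nodes/leafs */
                }
                printf("commit ok\n");
                return 0;
        }

        int main(void)
        {
                btree_write_endio(-5 /* EIO */, 0);
                return commit_transaction() ? 1 : 0;
        }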
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      656f30db
  9. 02 Oct, 2014 (10 commits)
  10. 23 Sep, 2014 (1 commit)
    • Btrfs: remove empty block groups automatically · 47ab2a6c
      Josef Bacik authored
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting, or notice on mount that a
      block group has a used count of 0, then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
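      
      Roughly, the flow looks like the model below (the list handling and names are
      illustrative assumptions, not the btrfs code): when a block group's used count
      drops to zero it is queued, and the cleaner thread re-checks and deletes it
      later:
      
        #include <stdio.h>

        #define NR_BGS 3

        struct block_group {
                unsigned long long used;
                int queued_for_delete;
        };

        static struct block_group bgs[NR_BGS] = { { 0 }, { 4096 }, { 0 } };

        static void note_bg_emptied(struct block_group *bg)
        {
                if (bg->used == 0)
                        bg->queued_for_delete = 1;   /* queue, don't delete yet */
        }

        static void cleaner_thread(void)
        {
                for (int i = 0; i < NR_BGS; i++) {
                        struct block_group *bg = &bgs[i];

                        /* double check: it may have been reused meanwhile */
                        if (bg->queued_for_delete && bg->used == 0)
                                printf("deleting empty block group %d\n", i);
                        bg->queued_for_delete = 0;
                }
        }

        int main(void)
        {
                for (int i = 0; i < NR_BGS; i++)
                        note_bg_emptied(&bgs[i]);
                cleaner_thread();
                return 0;
        }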
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      47ab2a6c
  11. 18 Sep, 2014 (9 commits)
  12. 09 Sep, 2014 (1 commit)
    • block, bdi: an active gendisk always has a request_queue associated with it · ff9ea323
      Tejun Heo authored
      bdev_get_queue() returns the request_queue associated with the
      specified block_device.  blk_get_backing_dev_info() makes use of
      bdev_get_queue() to determine the associated bdi given a block_device.
      
      All the callers of bdev_get_queue() including
      blk_get_backing_dev_info() assume that bdev_get_queue() may return
      NULL and implement NULL handling; however, bdev_get_queue() requires
      the passed in block_device is opened and attached to its gendisk.
      Because an active gendisk always has a valid request_queue associated
      with it, bdev_get_queue() can never return NULL and neither can
      blk_get_backing_dev_info().
      
      Make it clear that neither of the two functions can return NULL and
      remove NULL handling from all the callers.
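      
      In user-space terms the invariant looks like this (a simplified model, not the
      real kernel structures): because an opened block device always hangs off a
      gendisk that owns a request_queue, the getter can never observe a NULL queue
      and callers need no NULL handling:
      
        #include <assert.h>
        #include <stdio.h>

        struct request_queue { int id; };
        struct gendisk { struct request_queue *queue; };
        struct block_device { struct gendisk *bd_disk; };

        /* an opened, active block_device always has bd_disk->queue set */
        static struct request_queue *bdev_get_queue(struct block_device *bdev)
        {
                return bdev->bd_disk->queue;    /* never NULL by invariant */
        }

        int main(void)
        {
                struct request_queue q = { 1 };
                struct gendisk disk = { &q };
                struct block_device bdev = { &disk };

                /* caller: no NULL check required anymore */
                printf("queue id %d\n", bdev_get_queue(&bdev)->id);
                assert(bdev_get_queue(&bdev) != NULL);
                return 0;
        }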
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Chris Mason <clm@fb.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ff9ea323
  13. 08 Sep, 2014 (1 commit)
    • percpu_counter: add @gfp to percpu_counter_init() · 908c7f19
      Tejun Heo authored
      Percpu allocator now supports allocation mask.  Add @gfp to
      percpu_counter_init() so that !GFP_KERNEL allocation masks can be used
      with percpu_counters too.
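      
      The shape of the conversion, modeled in user space (the struct and the malloc
      call are stand-ins; only the idea of threading a gfp mask through the
      initializer mirrors the kernel change):
      
        #include <stdio.h>
        #include <stdlib.h>

        typedef unsigned int gfp_t;
        #define GFP_KERNEL 0x01u
        #define GFP_NOFS   0x02u

        struct percpu_counter { long long count; void *counters; };

        /* model: the allocation mask is now an explicit argument */
        static int percpu_counter_init(struct percpu_counter *fbc, long long amount,
                                       gfp_t gfp)
        {
                (void)gfp;                      /* the kernel passes it down  */
                fbc->count = amount;
                fbc->counters = malloc(64);     /* stand-in for a percpu alloc */
                return fbc->counters ? 0 : -12; /* -ENOMEM */
        }

        int main(void)
        {
                struct percpu_counter c;

                /* callers now pass a mask, e.g. GFP_KERNEL or GFP_NOFS */
                if (percpu_counter_init(&c, 0, GFP_KERNEL))
                        return 1;
                printf("count %lld\n", c.count);
                free(c.counters);
                return 0;
        }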
      
      We could have left percpu_counter_init() alone and added
      percpu_counter_init_gfp(); however, the number of users isn't that
      high and introducing _gfp variants to all percpu data structures would
      be quite ugly, so let's just do the conversion.  This is the one with
      the most users.  Other percpu data structures are a lot easier to
      convert.
      
      This patch doesn't make any functional difference.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Jan Kara <jack@suse.cz>
      Acked-by: "David S. Miller" <davem@davemloft.net>
      Cc: x86@kernel.org
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      908c7f19
  14. 24 Aug, 2014 (1 commit)
    • Btrfs: fix task hang under heavy compressed write · 9e0af237
      Liu Bo authored
      This has been reported and discussed for a long time, and this hang occurs in
      both 3.15 and 3.16.
      
      Btrfs now migrates to use kernel workqueue, but it introduces this hang problem.
      
      Btrfs has a kind of work that is queued in an ordered way, which means that
      its ordered_func() must be processed FIFO, so it usually looks like --
      
      normal_work_helper(arg)
          work = container_of(arg, struct btrfs_work, normal_work);
      
          work->func() <---- (we name it work X)
          for ordered_work in wq->ordered_list
                  ordered_work->ordered_func()
                  ordered_work->ordered_free()
      
      The hang is a rare case. First, when we find free space, we get an uncached
      block group, then we go to read its free space cache inode for free space
      information, so it will
      
      file a readahead request
          btrfs_readpages()
               for page that is not in page cache
                      __do_readpage()
                           submit_extent_page()
                                 btrfs_submit_bio_hook()
                                       btrfs_bio_wq_end_io()
                                       submit_bio()
                                       end_workqueue_bio() <--(ret by the 1st endio)
                                            queue a work(named work Y) for the 2nd
                                            also the real endio()
      
      So the hang occurs when work Y's work_struct and work X's work_struct happens
      to share the same address.
      
      A bit more explanation,
      
      A,B,C -- struct btrfs_work
      arg   -- struct work_struct
      
      kthread:
      worker_thread()
          pick up a work_struct from @worklist
          process_one_work(arg)
      	worker->current_work = arg;  <-- arg is A->normal_work
      	worker->current_func(arg)
      		normal_work_helper(arg)
      		     A = container_of(arg, struct btrfs_work, normal_work);
      
      		     A->func()
      		     A->ordered_func()
      		     A->ordered_free()  <-- A gets freed
      
      		     B->ordered_func()
      			  submit_compressed_extents()
      			      find_free_extent()
      				  load_free_space_inode()
      				      ...   <-- (the above readahead stack)
      				      end_workqueue_bio()
      					   btrfs_queue_work(work C)
      		     B->ordered_free()
      
      So if work A sits early in wq->ordered_list and there are more ordered
      works queued after it, such as B, its memory could have been freed before
      normal_work_helper() returns, which means that the kernel workqueue code in
      worker_thread() still has worker->current_work pointing to work
      A->normal_work, i.e. arg's address.
      
      Meanwhile, work C is allocated after work A is freed, work C->normal_work
      and work A->normal_work are likely to share the same address(I confirmed this
      with ftrace output, so I'm not just guessing, it's rare though).
      
      When another kthread picks up work C->normal_work to process, and finds our
      kthread is processing it (see find_worker_executing_work()), it'll treat
      work C as a collision and skip it, which ends up with nobody processing work C.
      
      So the situation is that our kthread is waiting forever on work C.
      
      Besides, there are other cases that can lead to deadlock, but the real
      problem is that all btrfs workqueues share one work->func, normal_work_helper.
      So this patch makes each workqueue have its own helper function, which is
      only a wrapper of normal_work_helper.
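      
      A user-space sketch of that fix (the macro and names are illustrative, not the
      exact btrfs code): each workqueue gets its own thin helper wrapping the common
      body, so two works that happen to reuse the same memory address still differ in
      their function pointer and the workqueue core no longer mistakes them for a
      collision:
      
        #include <stdio.h>

        struct work_struct { void (*func)(struct work_struct *); };

        static void normal_work_helper(struct work_struct *work)
        {
                (void)work;             /* common body shared by all queues */
        }

        /* one wrapper per btrfs workqueue */
        #define DEFINE_WQ_HELPER(name)                                  \
                static void name##_helper(struct work_struct *work)     \
                {                                                       \
                        normal_work_helper(work);                       \
                }

        DEFINE_WQ_HELPER(endio)
        DEFINE_WQ_HELPER(delalloc)

        int main(void)
        {
                struct work_struct a = { endio_helper };
                struct work_struct b = { delalloc_helper };

                /* same address could be reused, but ->func now tells them apart */
                printf("distinct helpers: %d\n", a.func != b.func);
                return 0;
        }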
      
      With this patch, I no longer hit the above hang.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      9e0af237
  15. 19 Aug, 2014 (1 commit)
    • Btrfs: Fix wrong device size when we are resizing the device · 7df69d3e
      Miao Xie authored
      total_bytes of a device is just an in-memory variable which is used to record
      the size of the device, and it might be changed before we resize a device;
      if the resize operation fails, it will be rolled back. But some code used it
      to update the on-disk metadata of the device, which would cause the problem
      that the on-disk metadata of the device was not consistent. We should use the
      other variable, disk_total_bytes, to update the on-disk metadata of the device,
      because that variable is updated only when the resize operation is successful.
      Fix it.
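      
      The distinction can be modeled like this (field and function names are
      illustrative, not the btrfs structures): the in-memory size may already hold a
      tentative new value while a resize is in progress, so on-disk metadata must be
      written from the value that only advances once the resize has succeeded:
      
        #include <stdio.h>

        struct device_model {
                unsigned long long total_bytes;      /* in-memory, tentative */
                unsigned long long disk_total_bytes; /* committed, on-disk   */
        };

        static void write_dev_item(const struct device_model *dev)
        {
                /* on-disk metadata must reflect only committed sizes */
                printf("dev item size: %llu\n", dev->disk_total_bytes);
        }

        static int resize_device(struct device_model *dev,
                                 unsigned long long new_size, int fail)
        {
                dev->total_bytes = new_size;                   /* tentative  */
                if (fail) {
                        dev->total_bytes = dev->disk_total_bytes; /* roll back */
                        return -1;
                }
                dev->disk_total_bytes = new_size;              /* commit     */
                return 0;
        }

        int main(void)
        {
                struct device_model dev = { 1000, 1000 };

                resize_device(&dev, 2000, 1);   /* failed resize       */
                write_dev_item(&dev);           /* still prints 1000   */
                return 0;
        }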
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      7df69d3e
  16. 15 Aug, 2014 (1 commit)
    • btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason authored
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: Chris Mason <clm@fb.com>
      8d875f95
  17. 03 Jul, 2014 (1 commit)
  18. 29 Jun, 2014 (1 commit)
  19. 10 Jun, 2014 (1 commit)
    • Btrfs: async delayed refs · a79b7d4b
      Chris Mason authored
      Delayed extent operations are triggered during transaction commits.
      The goal is to queue up a healthy batch of changes to the extent
      allocation tree and run through them in bulk.
      
      This farms them off to async helper threads.  The goal is to have the
      bulk of the delayed operations being done in the background, but this is
      also important to limit our stack footprint.
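      
      A rough user-space model of the idea (the batch and the helper thread are
      illustrative stand-ins for the delayed-ref machinery, not btrfs code): the
      commit path hands a batch of extent-tree updates to a background worker and
      keeps its own stack shallow:
      
        /* build with -pthread */
        #include <pthread.h>
        #include <stdio.h>

        #define BATCH 4

        static int delayed_refs[BATCH] = { 1, 2, 3, 4 };

        static void *run_delayed_refs(void *arg)
        {
                int *refs = arg;

                for (int i = 0; i < BATCH; i++)
                        printf("updating extent tree for ref %d\n", refs[i]);
                return NULL;
        }

        int main(void)
        {
                pthread_t helper;

                /* commit path: farm the batch off to an async helper ...   */
                pthread_create(&helper, NULL, run_delayed_refs, delayed_refs);

                /* ... and only wait when the commit actually needs it done */
                pthread_join(helper, NULL);
                return 0;
        }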
      Signed-off-by: Chris Mason <clm@fb.com>
      a79b7d4b