1. 23 9月, 2014 3 次提交
    • J
      Btrfs: try not to ENOSPC on log replay · 1d52c78a
      Josef Bacik 提交于
      When doing log replay we may have to update inodes, which traditionally goes
      through our delayed inode stuff.  This will try to move space over from the
      trans handle, but we don't reserve space in our trans handle on replay since we
      don't know how much we will need, so instead we try to flush.  But because we
      have a trans handle open we won't flush anything, so if we are out of reserve
      space we will simply return ENOSPC.  Since we know that if an operation made it
      into the log then we definitely had space before the box bought the farm then we
      don't need to worry about doing this space reservation.  Use the
      fs_info->log_root_recovering flag to skip the delayed inode stuff and update the
      item directly.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1d52c78a
    • J
      Btrfs: don't do async reclaim during log replay · f6acfd50
      Josef Bacik 提交于
      Trying to reproduce a log enospc bug I hit a panic in the async reclaim code
      during log replay.  This is because we use fs_info->fs_root as our root for
      shrinking and such.  Technically we can use whatever root we want, but let's
      just not allow async reclaim while we're doing log replay.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      f6acfd50
    • J
      Btrfs: remove empty block groups automatically · 47ab2a6c
      Josef Bacik 提交于
      One problem that has plagued us is that a user will use up all of his space with
      data, remove a bunch of that data, and then try to create a bunch of small files
      and run out of space.  This happens because all the chunks were allocated for
      data since the metadata requirements were so low.  But now there's a bunch of
      empty data block groups and not enough metadata space to do anything.  This
      patch solves this problem by automatically deleting empty block groups.  If we
      notice the used count go down to 0 when deleting or on mount notice that a block
      group has a used count of 0 then we will queue it to be deleted.
      
      When the cleaner thread runs we will double check to make sure the block group
      is still empty and then we will delete it.  This patch has the side effect of no
      longer having a bunch of BUG_ON()'s in the chunk delete code, which will be
      helpful for both this and relocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      47ab2a6c
  2. 19 9月, 2014 2 次提交
    • F
      Btrfs: fix data corruption after fast fsync and writeback error · 8407f553
      Filipe Manana 提交于
      When we do a fast fsync, we start all ordered operations and then while
      they're running in parallel we visit the list of modified extent maps
      and construct their matching file extent items and write them to the
      log btree. After that, in btrfs_sync_log() we wait for all the ordered
      operations to finish (via btrfs_wait_logged_extents).
      
      The problem with this is that we were completely ignoring errors that
      can happen in the extent write path, such as -ENOSPC, a temporary -ENOMEM
      or -EIO errors for example. When such error happens, it means we have parts
      of the on disk extent that weren't written to, and so we end up logging
      file extent items that point to these extents that contain garbage/random
      data - so after a crash/reboot plus log replay, we get our inode's metadata
      pointing to those extents.
      
      This worked in contrast with the full (non-fast) fsync path, where we
      start all ordered operations, wait for them to finish and then write
      to the log btree. In this path, after each ordered operation completes
      we check if it's flagged with an error (BTRFS_ORDERED_IOERR) and return
      -EIO if so (via btrfs_wait_ordered_range).
      
      So if an error happens with any ordered operation, just return a -EIO
      error to userspace, so that it knows that not all of its previous writes
      were durably persisted and the application can take proper action (like
      redo the writes for e.g.) - and definitely not leave any file extent items
      in the log refer to non fully written extents.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8407f553
    • F
      Btrfs: fix fsync race leading to invalid data after log replay · 669249ee
      Filipe Manana 提交于
      When the fsync callback (btrfs_sync_file) starts, it first waits for
      the writeback of any dirty pages to start and finish without holding
      the inode's mutex (to reduce contention). After this it acquires the
      inode's mutex and repeats that process via btrfs_wait_ordered_range
      only if we're doing a full sync (BTRFS_INODE_NEEDS_FULL_SYNC flag
      is set on the inode).
      
      This is not safe for a non full sync - we need to start and wait for
      writeback to finish for any pages that might have been made dirty
      before acquiring the inode's mutex and after that first step mentioned
      before. Why this is needed is explained by the following comment added
      to btrfs_sync_file:
      
        "Right before acquiring the inode's mutex, we might have new
         writes dirtying pages, which won't immediately start the
         respective ordered operations - that is done through the
         fill_delalloc callbacks invoked from the writepage and
         writepages address space operations. So make sure we start
         all ordered operations before starting to log our inode. Not
         doing this means that while logging the inode, writeback
         could start and invoke writepage/writepages, which would call
         the fill_delalloc callbacks (cow_file_range,
         submit_compressed_extents). These callbacks add first an
         extent map to the modified list of extents and then create
         the respective ordered operation, which means in
         tree-log.c:btrfs_log_inode() we might capture all existing
         ordered operations (with btrfs_get_logged_extents()) before
         the fill_delalloc callback adds its ordered operation, and by
         the time we visit the modified list of extent maps (with
         btrfs_log_changed_extents()), we see and process the extent
         map they created. We then use the extent map to construct a
         file extent item for logging without waiting for the
         respective ordered operation to finish - this file extent
         item points to a disk location that might not have yet been
         written to, containing random data - so after a crash a log
         replay will make our inode have file extent items that point
         to disk locations containing invalid data, as we returned
         success to userspace without waiting for the respective
         ordered operation to finish, because it wasn't captured by
         btrfs_get_logged_extents()."
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      669249ee
  3. 18 9月, 2014 35 次提交