1. 28 March 2011, 12 commits
    • Btrfs: add btrfs_trim_fs() to handle FITRIM · f7039b1d
      Authored by Li Dongyang
      We take a free extent out of the allocator, trim it, and then put it back.
      Before we trim the block group we should make sure the block group is
      cached, so this also includes a small change that lets cache_block_group()
      run without a transaction. (A userspace sketch of driving FITRIM follows
      this entry.)
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f7039b1d
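      Not part of the patch — a minimal userspace sketch of how FITRIM is
      driven against a mounted btrfs filesystem. FITRIM and struct
      fstrim_range come from <linux/fs.h>; the mount point path is an
      assumption here.

        /* fstrim-sketch.c: ask the filesystem to discard all of its free space */
        #include <stdio.h>
        #include <string.h>
        #include <limits.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(int argc, char **argv)
        {
            struct fstrim_range range;
            int fd = open(argc > 1 ? argv[1] : "/mnt/btrfs", O_RDONLY);

            if (fd < 0) {
                perror("open");
                return 1;
            }
            memset(&range, 0, sizeof(range));
            range.len = ULLONG_MAX;   /* trim the whole filesystem */
            range.minlen = 0;         /* no minimum extent size */
            if (ioctl(fd, FITRIM, &range) < 0) {
                perror("FITRIM");
                close(fd);
                return 1;
            }
            printf("trimmed %llu bytes\n", (unsigned long long)range.len);
            close(fd);
            return 0;
        }

      On success the kernel reports the number of bytes it trimmed back in
      range.len, which is the count this trim path fills in.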
    • Btrfs: adjust btrfs_discard_extent() return errors and trimmed bytes · 5378e607
      Authored by Li Dongyang
      Callers of btrfs_discard_extent() should check whether we are mounted
      with -o discard, because we want FITRIM to work even when the fs is not
      mounted with -o discard. We should also use REQ_DISCARD when mapping the
      free extent so that we get a full mapping. Finally, we only return an
      error if:
      1. the error is not EOPNOTSUPP, or
      2. no device supports discard.
      (An illustrative sketch of this error-handling rule follows this entry.)
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      5378e607
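      An illustrative, self-contained restatement of the error rule described
      above — not the kernel code; discard_one() is a stand-in for issuing the
      discard to a single device:

        #include <errno.h>

        /* Discard on every device; ignore per-device EOPNOTSUPP, remember
         * real errors, and fail with EOPNOTSUPP only if no device at all
         * supported the discard. */
        static int discard_on_all_devices(int (*discard_one)(int dev), int ndevs)
        {
            int i, ret, err = 0, supported = 0;

            for (i = 0; i < ndevs; i++) {
                ret = discard_one(i);
                if (ret == 0)
                    supported++;           /* this device honoured the discard */
                else if (ret != -EOPNOTSUPP)
                    err = ret;             /* keep only "real" errors */
            }
            if (!supported)
                return -EOPNOTSUPP;        /* nobody could discard */
            return err;
        }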
    • Btrfs: make btrfs_map_block() return entire free extent for each device of RAID0/1/10/DUP · fce3bb9a
      Authored by Li Dongyang
      btrfs_map_block() only returns a single stripe length, but when trimming
      an extent we want the full extent mapped to each disk, so add a length
      field to btrfs_bio_stripe and fill it when we are mapping for
      REQ_DISCARD. (A small structural sketch follows this entry.)
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      fce3bb9a
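      The shape of the change, sketched with stand-in names (struct
      example_bio_stripe and example_device are hypothetical; the real
      structures live in fs/btrfs/volumes.h):

        #include <stdint.h>

        struct example_device;              /* stand-in for the per-device struct */

        /* Each stripe of a mapping now carries its own length, so a
         * REQ_DISCARD mapping can describe the full extent on every device
         * instead of a single stripe-sized chunk. */
        struct example_bio_stripe {
            struct example_device *dev;
            uint64_t physical;              /* start of the range on this device */
            uint64_t length;                /* filled only when mapping for REQ_DISCARD */
        };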
    • Btrfs: make update_reserved_bytes() public · b4d00d56
      Authored by Li Dongyang
      Make the function public, since we need to update the reserved-extent
      accounting after taking an extent out for trimming.
      Signed-off-by: Li Dongyang <lidongyang@novell.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      b4d00d56
    • btrfs: return EXDEV when linking from different subvolumes · 3ab3564f
      Authored by Mark Fasheh
      btrfs_link returns EPERM if a cross-subvolume link is attempted.
      
      However, in this case I believe EXDEV to be the more appropriate value.
      From the link(2) man page:
      
      EXDEV  oldpath and newpath are not on the same mounted file system.  (Linux
             permits a file system to be mounted at multiple points, but link()
             does not work across different mount points, even if the same file
             system is mounted on both.)
      
      This matters because an application may behave differently depending on
      the return code. (An illustrative check follows this entry.)
      Signed-off-by: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      3ab3564f
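      The core of the change, restated as a self-contained illustration (the
      real check compares the btrfs roots of the two inodes inside
      btrfs_link()):

        #include <errno.h>

        /* Refuse a hard link whose source and target live in different
         * subvolumes, and report it as a cross-device link rather than a
         * permission problem. */
        static int check_same_subvolume(const void *src_root, const void *dst_root)
        {
            if (src_root != dst_root)
                return -EXDEV;    /* was -EPERM before this change */
            return 0;
        }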
    • Btrfs: Per file/directory controls for COW and compression · 75e7cb7f
      Authored by Liu Bo
      Data compression and data COW are currently controlled for the entire FS
      by mount options. ioctls are needed to set this on a per-file or
      per-directory basis. This has been proposed previously, but VFS
      developers wanted us to use generic ioctls rather than btrfs-specific
      ones.
      
      According to Chris's comment, there should be just one true compression
      method (probably LZO) stored in the super block. However, before that we
      will wait until the method is stable enough to be adopted into the super
      block, so I list it as a long-term goal and just store the setting in
      RAM today.
      
      After applying this patch, we can use the generic FS_IOC_SETFLAGS ioctl
      to control the data-COW and compression attributes of files and
      directories. (A userspace usage sketch follows this entry.)
      
      NOTE:
       - The compression type is selected by the following rule: if btrfs is
         mounted with a compress option (zlib or lzo), that type is used;
         otherwise we use the default compression type (zlib today).
      
      v1->v2:
      - rebase onto the latest btrfs.
      v2->v3:
      - fix a problem where a file's NOCOW setting from the mount option could
        be overridden by inheritance from the parent directory.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      75e7cb7f
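      A userspace sketch of the interface described above, using the generic
      ioctls from <linux/fs.h>. The flag names (FS_NOCOW_FL, FS_COMPR_FL) are
      the ones defined in later kernel headers; whether a given kernel honours
      them on btrfs depends on its version, so treat this as illustrative:

        #include <stdio.h>
        #include <fcntl.h>
        #include <unistd.h>
        #include <sys/ioctl.h>
        #include <linux/fs.h>

        int main(int argc, char **argv)
        {
            int fd, flags;

            if (argc < 2) {
                fprintf(stderr, "usage: %s <file>\n", argv[0]);
                return 1;
            }
            fd = open(argv[1], O_RDONLY);
            if (fd < 0 || ioctl(fd, FS_IOC_GETFLAGS, &flags) < 0) {
                perror(argv[1]);
                return 1;
            }
            flags |= FS_NOCOW_FL;      /* disable copy-on-write for this file */
            flags &= ~FS_COMPR_FL;     /* make sure per-file compression is off */
            if (ioctl(fd, FS_IOC_SETFLAGS, &flags) < 0) {
                perror("FS_IOC_SETFLAGS");
                close(fd);
                return 1;
            }
            close(fd);
            return 0;
        }

      chattr +C / chattr +c from e2fsprogs drive the same ioctl, which is
      exactly the "generic rather than btrfs-specific" interface the VFS
      developers asked for.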
    • btrfs: use GFP_NOFS instead of GFP_KERNEL · fc0e4a31
      Authored by Miao Xie
      In filesystem context we must allocate memory with GFP_NOFS, otherwise
      reclaim may start another filesystem operation and hang the kswapd
      thread. (A tiny illustrative fragment follows this entry.)
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      fc0e4a31
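      A minimal kernel-side fragment illustrating the rule (example_alloc() is
      a hypothetical helper; the point is only the GFP flag):

        #include <linux/slab.h>

        /* Called from a path that may already hold filesystem locks or be in
         * writeback: GFP_NOFS keeps memory reclaim from re-entering the
         * filesystem, which GFP_KERNEL could do and deadlock. */
        static void *example_alloc(size_t size)
        {
            return kmalloc(size, GFP_NOFS);
        }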
    • Btrfs: check return value of read_tree_block() · 97d9a8a4
      Authored by Tsutomu Itoh
      This patch checks the return value of read_tree_block() and performs
      error handling when it is NULL. (An illustrative fragment follows this
      entry.)
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      97d9a8a4
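      The pattern being added, as a fragment (the argument list reflects the
      read_tree_block() signature of that era and is shown only for
      illustration):

        struct extent_buffer *eb;

        eb = read_tree_block(root, bytenr, blocksize, generation);
        if (!eb)                /* allocation or read failure */
            return -EIO;        /* propagate instead of crashing later */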
    • btrfs: properly access unaligned checksum buffer · 7e75bf3f
      Authored by David Sterba
      On Fri, Mar 18, 2011 at 11:56:53AM -0400, Chris Mason wrote:
      > Thanks for fielding this one.  Does put_unaligned_le32 optimize away on
      > platforms with efficient access?  It would be great if we didn't need
      > the #ifdef.
      
      (quick test: the assembly output is the same for put_unaligned_le32 and a
      direct assignment on my x86_64)
      I was originally following the examples in
      Documentation/unaligned-memory-access.txt. From other code it seems to me
      that the define CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is intended for
      larger portions of code. The macros/wrappers for {put,get}_unaligned* are
      chosen via arch/<arch>/include/asm/unaligned.h accordingly, so it is safe
      to use put_unaligned_le32 without the ifdef. (A short usage fragment
      follows this entry.)
      
      dave
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      7e75bf3f
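      The helper in question, as a fragment (variable names illustrative):

        #include <asm/unaligned.h>

        /* Store a 32-bit little-endian checksum into a buffer that may not be
         * 4-byte aligned. On architectures with efficient unaligned access
         * this compiles down to a plain store, so no #ifdef is needed. */
        put_unaligned_le32(csum, csum_buf);   /* instead of *(u32 *)csum_buf = csum */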
    • Btrfs: cleanup some BUG_ON() · db5b493a
      Authored by Tsutomu Itoh
      This patch changes some BUG_ON() calls into error returns (though most
      callers still use BUG_ON()). (The before/after pattern is sketched after
      this entry.)
      Signed-off-by: Tsutomu Itoh <t-itoh@jp.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      db5b493a
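      The before/after shape of the cleanup (do_something() is a placeholder
      for whichever callee used to be wrapped in BUG_ON):

        ret = do_something();
        if (ret)             /* was: BUG_ON(ret), which crashed the kernel */
            return ret;      /* hand the error back to the caller instead */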
    • Btrfs: add initial tracepoint support for btrfs · 1abe9b8a
      Authored by liubo
      Tracepoints can provide insight into why btrfs hits bugs and are greatly
      helpful for debugging, e.g.:
                    dd-7822  [000]  2121.641088: btrfs_inode_request: root = 5(FS_TREE), gen = 4, ino = 256, blocks = 8, disk_i_size = 0, last_trans = 8, logged_trans = 0
                    dd-7822  [000]  2121.641100: btrfs_inode_new: root = 5(FS_TREE), gen = 8, ino = 257, blocks = 0, disk_i_size = 0, last_trans = 0, logged_trans = 0
       btrfs-transacti-7804  [001]  2146.935420: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29368320 (orig_level = 0), cow_buf = 29388800 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.935473: btrfs_cow_block: root = 1(ROOT_TREE), refs = 2, orig_buf = 29364224 (orig_level = 0), cow_buf = 29392896 (cow_level = 0)
       btrfs-transacti-7804  [001]  2146.972221: btrfs_transaction_commit: root = 1(ROOT_TREE), gen = 8
         flush-btrfs-2-7821  [001]  2155.824210: btrfs_chunk_alloc: root = 3(CHUNK_TREE), offset = 1103101952, size = 1073741824, num_stripes = 1, sub_stripes = 0, type = DATA
         flush-btrfs-2-7821  [001]  2155.824241: btrfs_cow_block: root = 2(EXTENT_TREE), refs = 2, orig_buf = 29388800 (orig_level = 0), cow_buf = 29396992 (cow_level = 0)
         flush-btrfs-2-7821  [001]  2155.824255: btrfs_cow_block: root = 4(DEV_TREE), refs = 2, orig_buf = 29372416 (orig_level = 0), cow_buf = 29401088 (cow_level = 0)
         flush-btrfs-2-7821  [000]  2155.824329: btrfs_cow_block: root = 3(CHUNK_TREE), refs = 2, orig_buf = 20971520 (orig_level = 0), cow_buf = 20975616 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898019: btrfs_cow_block: root = 5(FS_TREE), refs = 2, orig_buf = 29384704 (orig_level = 0), cow_buf = 29405184 (cow_level = 0)
       btrfs-endio-wri-7800  [001]  2155.898043: btrfs_cow_block: root = 7(CSUM_TREE), refs = 2, orig_buf = 29376512 (orig_level = 0), cow_buf = 29409280 (cow_level = 0)
      
      Here is what I have added:
      
      1) ordered_extent:
              btrfs_ordered_extent_add
              btrfs_ordered_extent_remove
              btrfs_ordered_extent_start
              btrfs_ordered_extent_put
      
      These provide critical information to understand how ordered_extents are
      updated.
      
      2) extent_map:
              btrfs_get_extent
      
      extent_map is used in both the read and write cases, and it is useful for
      tracking how btrfs-specific IO is running.
      
      3) writepage:
              __extent_writepage
              btrfs_writepage_end_io_hook
      
      Pages are critical resources and produce a lot of corner cases during
      writeback, so it is valuable to know how a page is written to disk.
      
      4) inode:
              btrfs_inode_new
              btrfs_inode_request
              btrfs_inode_evict
      
      These show where and when an inode is created and when it is evicted.
      
      5) sync:
              btrfs_sync_file
              btrfs_sync_fs
      
      These show sync arguments.
      
      6) transaction:
              btrfs_transaction_commit
      
      In a transaction-based filesystem, it is useful to know the generation
      and who performs the commit.
      
      7) back reference and cow:
      	btrfs_delayed_tree_ref
      	btrfs_delayed_data_ref
      	btrfs_delayed_ref_head
      	btrfs_cow_block
      
      Btrfs natively supports back references; these tracepoints are helpful
      for understanding btrfs's COW mechanism.
      
      8) chunk:
      	btrfs_chunk_alloc
      	btrfs_chunk_free
      
      A chunk is a link between a physical offset and a logical offset and
      represents space information in btrfs; these are helpful for tracing
      space usage.
      
      9) reserved_extent:
      	btrfs_reserved_extent_alloc
      	btrfs_reserved_extent_free
      
      These can show how btrfs uses its space. (A hedged sketch of a
      TRACE_EVENT declaration follows this entry.)
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      1abe9b8a
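      A hedged sketch of how one of these tracepoints can be declared with the
      TRACE_EVENT() macro. The fields are simplified and not the exact
      definition from include/trace/events/btrfs.h:

        TRACE_EVENT(btrfs_transaction_commit,

            TP_PROTO(struct btrfs_root *root),

            TP_ARGS(root),

            TP_STRUCT__entry(
                __field(u64, generation)
                __field(u64, root_objectid)
            ),

            TP_fast_assign(
                __entry->generation    = root->fs_info->generation;
                __entry->root_objectid = root->root_key.objectid;
            ),

            TP_printk("root = %llu, gen = %llu",
                      (unsigned long long)__entry->root_objectid,
                      (unsigned long long)__entry->generation)
        );

      At the call site the event is emitted with
      trace_btrfs_transaction_commit(root), and it can be enabled at runtime
      through /sys/kernel/debug/tracing/events/btrfs/.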
    • Btrfs: use RCU instead of a spinlock to protect the root node · 240f62c8
      Authored by Chris Mason
      The pointer to the extent buffer for the root of each tree is protected
      by a spinlock so that we can safely read the pointer and take a
      reference on the extent buffer.
      
      But now that the extent buffers are freed via RCU, we can safely use
      rcu_read_lock instead. (A read-side fragment follows this entry.)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      240f62c8
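      The read-side pattern after the change, as a fragment:

        struct extent_buffer *eb;

        rcu_read_lock();
        eb = rcu_dereference(root->node);   /* no spinlock needed any more */
        /* take a reference on eb while still inside the RCU read section */
        rcu_read_unlock();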
  2. 26 March 2011, 3 commits
    • Btrfs: mark the bio with an error if we have a failure in dio · c0da7aa1
      Authored by Josef Bacik
      I noticed that dio_end_io calls the appropriate endio function with an
      error, but the endio functions don't actually do anything with that
      error; they assume that if there was an error then the bio will not be
      uptodate.  So if we had checksum failures we would never pass back EIO.
      So if there is an error in our endio functions, make sure to clear the
      uptodate flag on the bio.  Thanks,
      (An illustrative fragment follows this entry.)
      Signed-off-by: Josef Bacik <josef@redhat.com>
      c0da7aa1
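      The gist of the fix, as a fragment against the bio API of that era
      (bi_flags/BIO_UPTODATE; newer kernels track this in bi_status instead):

        /* In the dio endio path: record the error on the bio itself so the
         * completion code reports EIO instead of silently succeeding. */
        if (err)
            clear_bit(BIO_UPTODATE, &bio->bi_flags);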
    • Btrfs: don't allocate dip->csums when doing writes · 98bc3149
      Authored by Josef Bacik
      When doing direct writes we store the checksums in the ordered-sum
      structures of the ordered extent and write them out when the write
      completes, so we don't even use the dip->csums array.  So if we're
      writing, don't bother allocating dip->csums since we won't use it
      anyway.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      98bc3149
    • Btrfs: cleanup how we setup free space clusters · 4e69b598
      Authored by Josef Bacik
      This patch makes the free space cluster refilling code a little easier to
      understand, and fixes some things with the bitmap part of it.  Currently
      we want to refill a cluster with either
      
      1) all normal extent entries (those without bitmaps), or
      2) a bitmap entry with enough space.
      
      The current code has ugly jump-around logic that first tries to fill up
      the cluster with extent entries and then, if it can't do that, tries to
      find a bitmap to use.  So instead split this out into two functions, one
      that tries to find only normal entries, and one that tries to find
      bitmaps.
      
      This also fixes something suboptimal we would do with bitmaps.  If we
      used a bitmap we would just tell the cluster that we were pointing at a
      bitmap, and it would do the tree search in the block group for that entry
      every time we tried to make an allocation.  Instead of doing that, now we
      just add it to the cluster's group.
      
      I tested this with my ENOSPC tests and xfstests and it survived.
      Signed-off-by: Josef Bacik <josef@redhat.com>
      4e69b598
  3. 21 March 2011, 3 commits
    • Btrfs: don't be as aggressive about using bitmaps · 32cb0840
      Authored by Josef Bacik
      We have been creating bitmaps for small extents unconditionally forever.
      This was great for testing that the bitmap code worked, but is overkill
      normally.  So instead of always adding small chunks of free space to
      bitmaps, only start doing so once we go past half of our extent
      threshold.  This keeps us from creating a bitmap for just one small free
      extent at the front of the block group, and makes the allocator a little
      faster as a result.  Thanks,
      (A sketch of the heuristic follows this entry.)
      Signed-off-by: Josef Bacik <josef@redhat.com>
      32cb0840
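      The heuristic, sketched with approximate field names (the real ones live
      in fs/btrfs/free-space-cache.c):

        /* Keep small free extents out of bitmaps until regular extent entries
         * have already consumed at least half of the block group's extent
         * threshold. */
        if (block_group->free_extents * 2 <= block_group->extents_thresh)
            return 0;   /* don't create or use a bitmap for this extent yet */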
    • Btrfs: deal with min_bytes appropriately when looking for a cluster · d0a365e8
      Authored by Josef Bacik
      We do all this fun stuff with min_bytes, but we either don't use it in
      the case of plain extents, or use it completely wrong in the case of
      bitmaps.  So fix this for both cases:
      
      1) In the extent case, stop looking for space once window_free >=
      min_bytes, instead of bytes + empty_size.
      
      2) In the bitmap case, we were looking for stretches of free space that
      were at least min_bytes in size, which was not right at all.  So instead
      search for stretches of free space that are at least bytes in size (this
      will make a difference when we have > page size blocks) and then only
      search for min_bytes amount of free space.
      
      Thanks,
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <josef@redhat.com>
      d0a365e8
    • Btrfs: check free space in block group before searching for a cluster · 7d0d2e8e
      Authored by Josef Bacik
      The free space cluster code is heavy duty, so there is no sense in going
      through the entire song and dance if there isn't enough space in the
      block group to begin with.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      7d0d2e8e
  4. 18 March 2011, 16 commits
  5. 12 March 2011, 1 commit
    • Btrfs: break out of shrink_delalloc earlier · 36e39c40
      Authored by Chris Mason
      Josef had changed shrink_delalloc to exit after three shrink
      attempts, which wasn't quite enough because new writers could
      race in and steal free space.
      
      But it also fixed deadlocks and stalls as we tried to recover
      delalloc reservations.  The code was tweaked to loop 1024
      times, and would reset the counter any time a small amount
      of progress was made.  This was too drastic, and with a
      lot of writers we can end up stuck in shrink_delalloc forever.
      
      The shrink_delalloc loop is fairly complex because the caller is looping
      too, and the caller will go ahead and force a transaction commit to make
      sure we reclaim space.
      
      This reworks things to exit shrink_delalloc when we've forced some
      writeback and the delalloc reservations have gone down.  This means
      the writeback has not just started but has also finished at
      least some of the metadata changes required to reclaim delalloc
      space.
      
      Even if we've got this wrong and return ENOSPC too early, that is still
      a big improvement over the current behavior of hanging the machine.
      
      Test 224 in xfstests hammers on this nicely, and with 1000 writers
      trying to fill a 1GB drive we get our first ENOSPC at 93% full.  The
      other writers are able to continue until we get 100%.
      
      This is a worst case test for btrfs because the 1000 writers are doing
      small IO, and the small FS size means we don't have a lot of room
      for metadata chunks.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      36e39c40
  6. 11 March 2011, 2 commits
  7. 09 March 2011, 1 commit
  8. 08 March 2011, 1 commit
    • Btrfs: deal with short returns from copy_from_user · 31339acd
      Authored by Chris Mason
      When copy_from_user is only able to copy some of the bytes we requested,
      we may end up creating a partially up-to-date page.  To avoid garbage in
      the page, we need to treat a partial copy as a zero-length copy.
      
      This makes the rest of the file_write code drop the page and retry the
      whole copy instead of marking the partially up-to-date page as dirty.
      (An illustrative fragment follows this entry.)
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      cc: stable@kernel.org
      31339acd
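      The rule, as a fragment of the copy loop (the surrounding context is
      simplified; the point is zeroing out a short copy into a page that was
      not already up to date):

        copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
        flush_dcache_page(page);
        if (!PageUptodate(page) && copied < bytes)
            copied = 0;   /* drop the page and redo the whole copy, rather
                             than dirty a partially-filled page */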
  9. 07 March 2011, 1 commit
    • Btrfs: fix regressions in copy_from_user handling · b1bf862e
      Authored by Chris Mason
      Commit 914ee295 fixed deadlocks in
      btrfs_file_write where we would catch page faults on pages we had
      locked.
      
      But, there were a few problems:
      
      1) The x86-32 iov_iter_copy_from_user_atomic code always fails to copy
      data when the amount to copy is more than 4K and the offset to start
      copying from is not page aligned.  The result was btrfs_file_write
      looping forever, retrying iov_iter_copy_from_user_atomic.
      
      We deal with this by changing btrfs_file_write to drop down to single
      page copies when iov_iter_copy_from_user_atomic starts returning failure.
      
      2) The btrfs_file_write code was leaking delalloc reservations when
      iov_iter_copy_from_user_atomic returned zero.  The looping above would
      result in the entire filesystem running out of delalloc reservations and
      constantly trying to flush things to disk.
      
      3) btrfs_file_write will lock down page cache pages, make sure
      any writeback is finished, do the copy_from_user and then release them.
      Before the loop runs we check the first and last pages in the write to
      see if they are only being partially modified.  If the start or end of
      the write isn't aligned, we make sure the corresponding pages are
      up to date so that we don't introduce garbage into the file.
      
      With the copy_from_user changes, we're allowing the VM to reclaim the
      pages after a partial update from copy_from_user, but we're not
      making sure the page cache page is up to date when we loop around to
      resume the write.
      
      We deal with this by pushing the up to date checks down into the page
      prep code.  This fits better with how the rest of file_write works.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      Reported-by: Mitch Harder <mitch.harder@sabayonlinux.org>
      cc: stable@kernel.org
      b1bf862e