1. 02 3月, 2013 1 次提交
    • J
      Btrfs: delete inline extents when we find them during logging · 124fe663
      Josef Bacik 提交于
      Apparently when we do inline extents we allow the data to overlap the last chunk
      of the btrfs_file_extent_item, which means that we can possibly have a
      btrfs_file_extent_item that isn't actually as large as a btrfs_file_extent_item.
      This messes with us when we try to overwrite the extent when logging new extents
      since we expect for it to be the right size.  To fix this just delete the item
      and try to do the insert again which will give us the proper sized
      btrfs_file_extent_item.  This fixes a panic where map_private_extent_buffer
      would blow up because we're trying to write past the end of the leaf.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      124fe663
  2. 01 3月, 2013 14 次提交
  3. 27 2月, 2013 5 次提交
    • Q
      btrfs: cleanup for open-coded alignment · fda2832f
      Qu Wenruo 提交于
      Though most of the btrfs codes are using ALIGN macro for page alignment,
      there are still some codes using open-coded alignment like the
      following:
      ------
              u64 mask = ((u64)root->stripesize - 1);
              u64 ret = (val + mask) & ~mask;
      ------
      Or even hidden one:
      ------
              num_bytes = (end - start + blocksize) & ~(blocksize - 1);
      ------
      
      Sometimes these open-coded alignment is not so easy to understand for
      newbie like me.
      
      This commit changes the open-coded alignment to the ALIGN macro for a
      better readability.
      
      Also there is a previous patch from David Sterba with similar changes,
      but the patch is for 3.2 kernel and seems not merged.
      http://www.spinics.net/lists/linux-btrfs/msg12747.html
      
      Cc: David Sterba <dave@jikos.cz>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fda2832f
    • L
      Btrfs: do not change inode flags in rename · 8c4ce81e
      Liu Bo 提交于
      Before we forced to change a file's NOCOW and COMPRESS flag due to
      the parent directory's, but this ends up a bad idea, because it
      confuses end users a lot about file's NOCOW status, eg. if someone
      change a file to NOCOW via 'chattr' and then rename it in the current
      directory which is without NOCOW attribute, the file will lose the
      NOCOW flag silently.
      
      This diables 'change flags in rename', so from now on we'll only
      inherit flags from the parent directory on creation stage while in
      other places we can use 'chattr' to set NOCOW or COMPRESS flags.
      Reported-by: NMarios Titas <redneb8888@gmail.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      8c4ce81e
    • L
      Btrfs: use reserved space for creating a snapshot · 2382c5cc
      Liu Bo 提交于
      While inserting dir index and updating inode for a snapshot, we'd
      add delayed items which consume trans->block_rsv, if we don't have
      any space reserved in this trans handle, we either just return or
      reserve space again.
      
      But before creating pending snapshots during committing transaction,
      we've done a release on this trans handle, so we don't have space reserved
      in it at this stage.
      
      What we're using is block_rsv of pending snapshots which has already
      reserved well enough space for both inserting dir index and updating
      inode, so we need to set trans handle to indicate that we have space
      now.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2382c5cc
    • A
      clear chunk_alloc flag on retryable failure · a81cb9a2
      Alexandre Oliva 提交于
      I've experienced filesystem freezes with permanent spikes in the active
      process count for quite a while, particularly on filesystems whose
      available raw space has already been fully allocated to chunks.
      
      While looking into this, I found a pretty obvious error in
      do_chunk_alloc: it sets space_info->chunk_alloc, but if
      btrfs_alloc_chunk returns an error other than ENOSPC, it returns leaving
      that flag set, which causes any other threads waiting for
      space_info->chunk_alloc to become zero to spin indefinitely.
      
      I haven't double-checked that this patch fixes the failure I've observed
      fully (it's not exactly trivial to trigger), but it surely is a bug and
      the fix is trivial, so...  Please put it in :-)
      
      What I saw in that function also happens to explain why in some cases I
      see filesystems allocate a huge number of chunks that remain unused
      (leading to the scenario above, of not having more chunks to allocate).
      It happens for data and metadata, but not necessarily both.  I'm
      guessing some thread sets the force_alloc flag on the corresponding
      space_info, and then several threads trying to get disk space end up
      attempting to allocate a new chunk concurrently.  All of them will see
      the force_alloc flag and bump their local copy of force up to the level
      they see first, and they won't clear it even if another thread succeeds
      in allocating a chunk, thus clearing the force flag.  Then each thread
      that observed the force flag will, on its turn, force the allocation of
      a new chunk.  And any threads that come in while it does that will see
      the force flag still set and pick it up, and so on.  This sounds like a
      problem to me, but...  what should the correct behavior be?  Clear
      force_flag once we copy it to a local force?  Reset force to the
      incoming value on every loop?  Set the flag to our incoming force if we
      have it at first, clear our local flag, and move it from the space_info
      when we determined that we are the thread that's going to perform the
      allocation?
      
      btrfs: clear chunk_alloc flag on retryable failure
      
      From: Alexandre Oliva <oliva@gnu.org>
      
      If btrfs_alloc_chunk fails with e.g. ENOMEM, we exit do_chunk_alloc
      without clearing chunk_alloc in space_info.  As a result, any further
      calls to do_chunk_alloc on that filesystem will start busy-waiting for
      chunk_alloc to be cleared, but it never will be.  This patch adjusts
      do_chunk_alloc so that it clears this flag in case of an error.
      Signed-off-by: NAlexandre Oliva <oliva@gnu.org>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      a81cb9a2
    • J
      Btrfs: fix backref walking race with tree deletions · ca60ebfa
      Jan Schmidt 提交于
      When a subvolume is removed, we remove the root item from the root tree,
      while the tree blocks and backrefs remain for a while. When backref walking
      comes across one of those orphan tree blocks, it can find a backref for a
      no longer existing root. This is all good, we only must tolerate
      __resolve_indirect_ref returning an error and continue with the good refs
      found.
      Reported-by: NAlex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ca60ebfa
  4. 26 2月, 2013 1 次提交
    • J
      Btrfs: make sure NODATACOW also gets NODATASUM set · f2bdf9a8
      Josef Bacik 提交于
      A user reported hitting the BUG_ON() in btrfs_finished_ordered_io() where we had
      csums on a NOCOW extent.  This can happen if we have NODATACOW set but not
      NODATASUM set, which can happen in two cases, either we mount with -o nodatacow
      and then write into preallocated space, or chattr +C a directory and move a file
      into that directory.  Liu has fixed the move case in a different place, but this
      fixes the mount -o nodatacow case.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f2bdf9a8
  5. 21 2月, 2013 19 次提交
    • M
      Btrfs: fix remount vs autodefrag · dc81cdc5
      Miao Xie 提交于
      If we remount the fs to close the auto defragment or make the fs R/O,
      we should stop the auto defragment.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      dc81cdc5
    • M
      Btrfs: fix wrong outstanding_extents when doing DIO write · 172a5049
      Miao Xie 提交于
      When running the 083th case of xfstests on the filesystem with
      "compress-force=lzo", the following WARNINGs were triggered.
        WARNING: at fs/btrfs/inode.c:7908
        WARNING: at fs/btrfs/inode.c:7909
        WARNING: at fs/btrfs/inode.c:7911
        WARNING: at fs/btrfs/extent-tree.c:4510
        WARNING: at fs/btrfs/extent-tree.c:4511
      
      This problem was introduced by the patch "Btrfs: fix deadlock due
      to unsubmitted". In this patch, there are two bugs which caused
      the above problem.
      
      The 1st one is a off-by-one bug, if the DIO write return 0, it is
      also a short write, we need release the reserved space for it. But
      we didn't do it in that patch. Fix it by change "ret > 0" to
      "ret >= 0".
      
      The 2nd one is ->outstanding_extents was increased twice when
      a short write happened. As we know, ->outstanding_extents is
      a counter to keep track of the number of extent items we may
      use duo to delalloc, when we reserve the free space for a
      delalloc write, we assume that the write will introduce just
      one extent item, so we increase ->outstanding_extents by 1 at
      that time. And then we will increase it every time we split the
      write, it is done at the beginning of btrfs_get_blocks_direct().
      So when a short write happens, we needn't increase
      ->outstanding_extents again. But this patch done.
      
      In order to fix the 2nd problem, I re-write the logic for
      ->outstanding_extents operation. We don't increase it at the
      beginning of btrfs_get_blocks_direct(), instead, we just
      increase it when the split actually happens.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      172a5049
    • L
      Btrfs: snapshot-aware defrag · 38c227d8
      Liu Bo 提交于
      This comes from one of btrfs's project ideas,
      As we defragment files, we break any sharing from other snapshots.
      The balancing code will preserve the sharing, and defrag needs to grow this
      as well.
      
      Now we're able to fill the blank with this patch, in which we make full use of
      backref walking stuff.
      
      Here is the basic idea,
      o  set the writeback ranges started by defragment with flag EXTENT_DEFRAG
      o  at endio, after we finish updating fs tree, we use backref walking to find
         all parents of the ranges and re-link them with the new COWed file layout by
         adding corresponding backrefs.
      Signed-off-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      38c227d8
    • C
      Btrfs: fix max chunk size on raid5/6 · 86db2578
      Chris Mason 提交于
      We try to limit the size of a chunk to 10GB, which keeps the unit of
      work reasonable during balance and resize operations.  The limit checks
      were taking into account the number of copies of the data we had but
      what they really should be doing is comparing against the logical
      size of the chunk we're creating.
      
      This moves the code around a little to use the count of data stripes
      from raid5/6.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      86db2578
    • Z
      btrfs: limit fallocate extent reservation to 256MB · 24542bf7
      Zach Brown 提交于
      Very large fallocate requests are cpu bound and result in extents with a
      repeating pattern of ever decreasing size:
      
      $ time fallocate -l 1T file
      real	0m13.039s
      
      ( an excerpt of the extents from btrfs-debug-tree: )
        prealloc data disk byte 1536292564992 nr 397312
        prealloc data disk byte 1536292962304 nr 196608
        prealloc data disk byte 1536293158912 nr 98304
        prealloc data disk byte 1536293257216 nr 49152
        prealloc data disk byte 1536293306368 nr 24576
        prealloc data disk byte 1536293330944 nr 12288
        prealloc data disk byte 1536293343232 nr 8192
        prealloc data disk byte 1536293351424 nr 4096
        prealloc data disk byte 1536293355520 nr 4096
        prealloc data disk byte 1536293359616 nr 4096
      
      The excessive cpu use comes from __btrfs_prealloc_file_range() trying to
      allocate the entire remaining size after each extent is allocated.
      btrfs_reserve_extent() repeatedly cuts this requested size in half until
      it gets down to the size that the allocators can return.  We limit the
      problem for now by capping each reservation at 256 meg.
      
      The small extents come from a masking bug when decreasing the requested
      reservation size.  The high 32bits are cleared and the remaining low
      bits might happen to reserve a small size.   Fix this by using
      round_down() which properly casts the mask.
      
      After these fixes huge fallocate requests are fast and result in nice
      large extents:
      
      $ time fallocate -l 1T file
      real	0m0.082s
      
        prealloc data disk byte 1112425889792 nr 268435456
        prealloc data disk byte 1112694325248 nr 268435456
        prealloc data disk byte 1112962760704 nr 268435456
      Reported-by: NEric Sandeen <sandeen@redhat.com>
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      24542bf7
    • T
      btrfs: Init io_lock after cloning btrfs device struct · 1cba0cdf
      Thomas Gleixner 提交于
      __btrfs_close_devices() clones btrfs device structs with
      memcpy(). Some of the fields in the clone are reinitialized, but it's
      missing to init io_lock. In mainline this goes unnoticed, but on RT it
      leaves the plist pointing to the original about to be freed lock
      struct.
      
      Initialize io_lock after cloning, so no references to the original
      struct are left.
      Reported-and-tested-by: NMike Galbraith <efault@gmx.de>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      1cba0cdf
    • C
      Merge branch 'raid56-experimental' into for-linus-3.9 · e942f883
      Chris Mason 提交于
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      
      Conflicts:
      	fs/btrfs/ctree.h
      	fs/btrfs/extent-tree.c
      	fs/btrfs/inode.c
      	fs/btrfs/volumes.c
      e942f883
    • C
      Merge branch 'master' of... · b2c6b3e0
      Chris Mason 提交于
      Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next into for-linus-3.9
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      
      Conflicts:
      	fs/btrfs/disk-io.c
      b2c6b3e0
    • M
      Btrfs: fix missing release of qgroup reservation in commit_transaction() · 272d26d0
      Miao Xie 提交于
      We forget to free qgroup reservation in commit_transaction(),fix it.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NWang Shilong <wangsl-fnst@cn.fujitsu.com>
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      272d26d0
    • W
      Btrfs: fix missing check before disabling quota · 683cebda
      Wang Shilong 提交于
      The original code forget to check whether quota has been disabled firstly,
      and it will return 'EINVAL' and return error to users if quota has been
      disabled,it will be unfriendly and confusing for users to see that.
      So just return directly if quota has been disabled.
      Signed-off-by: NWang Shilong <wangsl-fnst@cn.fujitsu.com>
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      683cebda
    • L
      Btrfs: fix cleaner thread not working with inode cache option · fa6ac876
      Liu Bo 提交于
      Right now inode cache inode is treated as the same as space cache
      inode, ie. keep inode in memory till putting super.
      
      But this leads to an awkward situation.
      
      If we're going to delete a snapshot/subvolume, btrfs will not
      actually delete it and return free space, but will add it to dead
      roots list until the last inode on this snap/subvol being destroyed.
      Then we'll fetch deleted roots and cleanup them via cleaner thread.
      
      So here is the problem, if we enable inode cache option, each
      snap/subvol has a cached inode which is used to store inode allcation
      information.  And this cache inode will be kept in memory, as the above
      said.  So with inode cache, snap/subvol can only be added into
      dead roots list during freeing roots stage in umount, so that we can
      ONLY get space back after another remount(we cleanup dead roots on mount).
      
      But the real thing is we'll no more use the snap/subvol if we mark it
      deleted, so we can safely iput its cache inode when we delete snap/subvol.
      
      Another thing is that we need to change the rules of droping inode, we
      don't keep snap/subvol's cache inode in memory till end so that we can
      add snap/subvol into dead roots list in time.
      Reported-by: NMitch Harder <mitch.harder@sabayonlinux.org>
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      fa6ac876
    • M
      Btrfs: fix uncompleted transaction · d4edf39b
      Miao Xie 提交于
      In some cases, we need commit the current transaction, but don't want
      to start a new one if there is no running transaction, so we introduce
      the function - btrfs_attach_transaction(), which can catch the current
      transaction, and return -ENOENT if there is no running transaction.
      
      But no running transaction doesn't mean the current transction completely,
      because we removed the running transaction before it completes. In some
      cases, it doesn't matter. But in some special cases, such as freeze fs, we
      hope the transaction is fully on disk, it will introduce some bugs, for
      example, we may feeze the fs and dump the data in the disk, if the transction
      doesn't complete, we would dump inconsistent data. So we need fix the above
      problem for those cases.
      
      We fixes this problem by introducing a function:
      	btrfs_attach_transaction_barrier()
      if we hope all the transaction is fully on the disk, even they are not
      running, we can use this function.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      d4edf39b
    • M
      Btrfs: fix the deadlock between the transaction start/attach and commit · 178260b2
      Miao Xie 提交于
      Now btrfs_commit_transaction() does this
      
      ret = btrfs_run_ordered_operations(root, 0)
      
      which async flushes all inodes on the ordered operations list, it introduced
      a deadlock that transaction-start task, transaction-commit task and the flush
      workers waited for each other.
      (See the following URL to get the detail
       http://marc.info/?l=linux-btrfs&m=136070705732646&w=2)
      
      As we know, if ->in_commit is set, it means someone is committing the
      current transaction, we should not try to join it if we are not JOIN
      or JOIN_NOLOCK, wait is the best choice for it. In this way, we can avoid
      the above problem. In this way, there is another benefit: there is no new
      transaction handle to block the transaction which is on the way of commit,
      once we set ->in_commit.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      178260b2
    • M
      Btrfs: fix the qgroup reserved space is released prematurely · 4b824906
      Miao Xie 提交于
      In start_transactio(), we will try to join the transaction again after
      the current transaction is committed, so we should not release the
      reserved space of the qgroup. Fix it.
      
      Cc: Arne Jansen <sensille@gmx.net>
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      4b824906
    • Z
      btrfs: define BTRFS_MAGIC as a u64 value · cdb4c574
      Zach Brown 提交于
      super.magic is an le64 but it's treated as an unterminated string when
      compared against BTRFS_MAGIC which is defined as a string.  Instead
      define BTRFS_MAGIC as a normal hex value and use endian helpers to
      compare it to the super's magic.
      
      I tested this by mounting an fs made before the change and made sure
      that it didn't introduce sparse errors.  This matches a similar cleanup
      that is pending in btrfs-progs.  David Sterba pointed out that we should
      fix the kernel side as well :).
      Signed-off-by: NZach Brown <zab@redhat.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      cdb4c574
    • J
      Btrfs: set/change the label of a mounted file system · a8bfd4ab
      jeff.liu 提交于
      With this new ioctl(2) BTRFS_IOC_SET_FSLABEL, we can set/change the label of a mounted file system.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Reviewed-by: NGoffredo Baroncelli <kreijack@inwind.it>
      Reviewed-by: NDavid Sterba <dsterba@suse.cz>
      Reviewed-by: NGoffredo Baroncelli <kreijack@inwind.it>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      a8bfd4ab
    • J
      Btrfs: Add a new ioctl to get the label of a mounted file system · 867ab667
      jeff.liu 提交于
      Add a new ioctl(2) BTRFS_IOC_GET_FSLABLE, so that we can get the label upon a mounted filesystem.
      Signed-off-by: NJie Liu <jeff.liu@oracle.com>
      Signed-off-by: NAnand Jain <anand.jain@oracle.com>
      Cc: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Goffredo Baroncelli <kreijack@inwind.it>
      Cc: David Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      867ab667
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
    • J
      Btrfs: relax the block group size limit for bitmaps · dde5740f
      Josef Bacik 提交于
      Dave pointed out that xfstests 273 will tell you that it failed to load the
      space cache for a block group when it remounts.  This is because we run out
      of space writing out the block group cache.  This is ok and is working as it
      should, but let's try to be a bit nicer.  This happens because the block
      group was 100mb, but bitmap entries cover 128mb, so we were only getting
      extent entries for this block group, which ended up being too many to fit in
      the free space cache.  So relax the bitmap size requirements to block groups
      that are at least half the size a bitmap will cover or larger, that way we
      can still keep the amount of space used in the free space cache low enough
      to be able to write it out.  With this patch I no longer fail to write out
      the free space cache.  Thanks,
      Reported-by: NDavid Sterba <dsterba@suse.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      dde5740f