1. 30 May, 2012 1 commit
    • Btrfs: use i_version instead of our own sequence · 0c4d2d95
      Committed by Josef Bacik
      We've been keeping around the inode sequence number in hopes that somebody
      would use it, but nobody uses it and people actually use i_version which
      serves the same purpose, so use i_version where we used the incore inode's
      sequence number and that way the sequence is updated properly across the
      board, and not just in file write.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  2. 28 April, 2012 1 commit
  3. 19 April, 2012 3 commits
    • Btrfs: always store the mirror we read the eb from · 5cf1ab56
      Committed by Josef Bacik
      A user reported a panic where we were trying to fix a bad mirror but the
      mirror number we were giving was 0, which is invalid.  This is because we
      don't do the transid verification until after the read, so as far as the
      read code is concerned the read was a success.  So instead store the mirror
      we read from so that if there is some failure post read we know which mirror
      to try next and which mirror needs to be fixed if we find a good copy of the
      block.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • btrfs: fix race in reada · 8c9c2bf7
      Committed by Arne Jansen
      When inserting into the radix tree returns EEXIST, get the existing
      entry without giving up the spinlock in between.
      There was a race for both the zones trees and the extent tree.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
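The pattern this commit describes, resolving an EEXIST from an insert by fetching the existing entry while the lock is still held, can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: the fixed-size `struct table` and the helpers `table_insert`, `table_lookup`, and `get_or_insert` are hypothetical stand-ins for the radix tree and its API, and a pthread mutex stands in for the spinlock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

#define TABLE_SIZE 16

/* Hypothetical stand-in for the radix tree: a tiny direct-mapped table. */
struct table {
    pthread_mutex_t lock;
    void *slots[TABLE_SIZE];
};

/* Caller must hold t->lock. Returns -EEXIST if the slot is already taken. */
static int table_insert(struct table *t, unsigned key, void *item)
{
    assert(key < TABLE_SIZE);
    if (t->slots[key])
        return -EEXIST;
    t->slots[key] = item;
    return 0;
}

/* Caller must hold t->lock. */
static void *table_lookup(struct table *t, unsigned key)
{
    assert(key < TABLE_SIZE);
    return t->slots[key];
}

/*
 * Insert item, or return the entry that got there first. The lookup runs
 * under the same lock acquisition as the failed insert, so the existing
 * entry cannot disappear between the two steps.
 */
static void *get_or_insert(struct table *t, unsigned key, void *item)
{
    void *result = item;

    pthread_mutex_lock(&t->lock);
    if (table_insert(t, key, item) == -EEXIST)
        result = table_lookup(t, key);
    pthread_mutex_unlock(&t->lock);
    return result;
}
```

Dropping the lock between the failed insert and the lookup would reopen the window in which another thread could remove the entry, which is the kind of race the commit message refers to.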
    • Btrfs: avoid setting ->d_op twice · 848cce0d
      Committed by Li Zefan
      Follow these instructions, and you'll trigger a warning at the
      beginning of d_set_d_op():
      
        # mkfs.btrfs /dev/loop3
        # mount /dev/loop3 /mnt
        # btrfs sub create /mnt/sub
        # btrfs sub snap /mnt /mnt/snap
        # touch /mnt/snap/sub
        touch: cannot touch `tmp': Permission denied
      
      __d_alloc() sets d_op to sb->s_d_op (btrfs_dentry_operations), and
      then simple_lookup() resets it to simple_dentry_operations, which
      triggers the warning.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
  4. 29 March, 2012 1 commit
    • Btrfs: fix recursive defragment with autodefrag option · 4cb13e5d
      Committed by Liu Bo
      $ mkfs.btrfs disk
      $ mount disk /mnt -o autodefrag
      $ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
      $ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
        seek=$i conv=notrunc 2> /dev/null; done && sync
      
      Then "foobar" gets defragmented again and again.
      The same happens with the option "-o autodefrag,compress".

      Reasons:
      When the cleaner kthread fetches inodes from the defrag tree and defrags
      them, it dirties pages and submits them. This leads to another DATA COW,
      during which the inode being processed is inserted into the defrag tree again.

      This patch sets a rule for the COW code: insert an inode only when we are
      really going to defragment something.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  5. 27 March, 2012 2 commits
  6. 22 March, 2012 8 commits
  7. 20 March, 2012 1 commit
  8. 23 February, 2012 1 commit
  9. 16 February, 2012 1 commit
  10. 15 February, 2012 1 commit
    • btrfs: delalloc for page dirtied out-of-band in fixup worker · 87826df0
      Committed by Jeff Mahoney
       We encountered an issue that was easily observable on s/390 systems but
       could really happen anywhere. The timing just seemed to hit reliably
       on s/390 with limited memory.
      
       The gist is that when an unexpected set_page_dirty() happened, we'd
       run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
       properly set up for delalloc.
      
       This patch does the following:
       - Performs the missing delalloc in the fixup worker
       - Allows the start hook to return -EBUSY, which informs __extent_writepage
         that it should mark the page skipped and not redirty it. This is
         required since the fixup worker can fail with -ENOSPC and the page
         will have already been redirtied. That causes an Oops in
         drop_outstanding_extents later. Retrying the fixup worker could
         lead to an infinite loop. Deferring the page redirty also saves us
         some cycles since the page would be stuck in a resubmit-redirty loop
         until the fixup worker completes. It's not harmful, just wasteful.
       - If the fixup worker fails, we mark the page and mapping as errored,
         and end the writeback, similar to what we would do had the page
         actually been submitted to writeback.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
  11. 27 January, 2012 1 commit
    • Btrfs: fix reservations in btrfs_page_mkwrite · 9998eb70
      Committed by Chris Mason
      Josef fixed btrfs_page_mkwrite to properly release reserved
      extents if there was an error.  But if we fail to get a reservation
      and we fail to dirty the inode (for ENOSPC reasons), we'll end up
      trying to release a reservation we never had.
      
      This makes sure we only release if we were able to reserve.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
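The rule "only release if we were able to reserve" can be illustrated with a small error-path sketch. Everything here is hypothetical and simplified: `struct space_pool`, `reserve_space`, `release_space`, and `page_mkwrite_sketch` are invented names for illustration, and the `update_ret` parameter merely stands in for the result of the inode update; the real btrfs_page_mkwrite is considerably more involved.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical space pool standing in for the real reservation accounting. */
struct space_pool {
    long free_bytes;
    long reserved_bytes;
};

static int reserve_space(struct space_pool *p, long bytes)
{
    if (p->free_bytes < bytes)
        return -ENOSPC;
    p->free_bytes -= bytes;
    p->reserved_bytes += bytes;
    return 0;
}

static void release_space(struct space_pool *p, long bytes)
{
    /* Releasing more than was reserved is exactly the bug to avoid. */
    assert(p->reserved_bytes >= bytes);
    p->reserved_bytes -= bytes;
    p->free_bytes += bytes;
}

/*
 * Reserve, then run a step that may fail. The 'reserved' flag makes sure
 * the error path gives back only what was actually taken.
 */
static int page_mkwrite_sketch(struct space_pool *p, long bytes, int update_ret)
{
    bool reserved = false;
    int ret;

    ret = reserve_space(p, bytes);
    if (ret)
        goto out;
    reserved = true;

    ret = update_ret;   /* stands in for the inode update result */
    if (ret)
        goto out;

    return 0;           /* success: the reservation stays in use */
out:
    if (reserved)
        release_space(p, bytes);
    return ret;
}
```

With a pool of 4096 free bytes, a failed 8192-byte reservation leaves the counters untouched, while a successful reservation followed by a failed update returns the bytes to the pool.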
  12. 17 January, 2012 5 commits
    • Btrfs: add a delalloc mutex to inodes for delalloc reservations · f248679e
      Committed by Josef Bacik
      I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
      that and there's no real way to get rid of those, so just stop using i_mutex to
      protect delalloc metadata reservations and use a delalloc mutex instead.  This
      shouldn't be contended often at all, only if you are writing and mmap writing to
      the file at the same time.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: protect orphan block rsv with spin_lock · 90290e19
      Committed by Josef Bacik
      We've been seeing warnings coming out of the orphan commit stuff forever from
      ceph.  Turns out it's because we're racing with checking if the orphan block
      reserve is set, because we clear it outside of the spin_lock.  So leave the
      normal fastpath checks where they are, but take the spin_lock and _recheck_ to
      make sure we haven't had an orphan block rsv added in the meantime.  Then clear
      the root's orphan block rsv and release the lock.  With this patch a user said
      the warnings went away and they usually showed up pretty soon after he started
      ceph.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
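The "keep the fast-path check, then take the lock and recheck" approach can be sketched as follows. This is a userspace approximation, not the kernel implementation: `struct root_sketch`, its `orphan_count` field, and `detach_orphan_rsv` are hypothetical names, the pending-orphan counter is an assumption used to make the recheck concrete, and a pthread mutex replaces the spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-root orphan state. */
struct root_sketch {
    pthread_mutex_t lock;
    atomic_int orphan_count;   /* orphans still pending */
    void *orphan_rsv;          /* reserve shared by pending orphans */
};

/* Detach and return the reserve only if no orphans are pending. */
static void *detach_orphan_rsv(struct root_sketch *root)
{
    void *rsv;

    /* Unlocked fast path: cheap, but the answer may already be stale. */
    if (atomic_load(&root->orphan_count) != 0)
        return NULL;

    pthread_mutex_lock(&root->lock);
    /* Recheck under the lock: an orphan may have been added meanwhile. */
    if (atomic_load(&root->orphan_count) != 0) {
        pthread_mutex_unlock(&root->lock);
        return NULL;
    }
    rsv = root->orphan_rsv;
    root->orphan_rsv = NULL;   /* cleared while still holding the lock */
    pthread_mutex_unlock(&root->lock);
    return rsv;
}
```

The point of the sketch is that the pointer is cleared inside the locked region, so a concurrent adder that also takes the lock can never observe a half-finished teardown.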
    • Btrfs: release space on error in page_mkwrite · ec39e180
      Committed by Josef Bacik
      If updating the inode gave us an ENOSPC we were just returning in page_mkwrite,
      which is a problem since we make our reservation right before trying to update
      the inode, so fix the out label so that we actually free our reservation.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • Btrfs: fix btrfsck error 400 when truncating a compressed · f70a9a6b
      Committed by Miao Xie
      Steps to reproduce:
       # mkfs.btrfs /dev/sdb5
       # mount /dev/sdb5 -o compress=lzo /mnt
       # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
       # sync
       # truncate -s 64K /mnt/tmpfile
       root 5 inode 257 errors 400
      
      This is caused by a wrong if condition, the one that checks whether we should
      subtract the bytes of the dropped range from the inode's i_blocks/i_bytes.
      When we truncate a compressed extent, btrfs subtracts the bytes of the whole
      extent, which is wrong. We should subtract the size that is actually truncated,
      whether or not the extent is compressed. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
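The accounting rule in this commit, subtract only the bytes actually truncated rather than the whole extent, amounts to a small piece of range arithmetic. The helper below is a hypothetical illustration (`dropped_bytes` is not a btrfs function) and ignores everything else the truncate path does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes to subtract from the inode's byte count when an extent covering
 * [extent_start, extent_start + extent_len) is cut at new_size.
 * The result is the same whether or not the extent is compressed.
 */
static uint64_t dropped_bytes(uint64_t extent_start, uint64_t extent_len,
                              uint64_t new_size)
{
    uint64_t extent_end = extent_start + extent_len;

    assert(extent_end >= extent_start);   /* guard against overflow */

    if (new_size >= extent_end)
        return 0;                         /* extent untouched */
    if (new_size <= extent_start)
        return extent_len;                /* extent dropped entirely */
    return extent_end - new_size;         /* only the truncated tail */
}
```

Applied to the reproduction steps above, a 128K extent starting at offset 0 and truncated to 64K yields 64K to subtract, not the full 128K.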
    • Btrfs: do not use btrfs_end_transaction_throttle everywhere · 7ad85bb7
      Committed by Josef Bacik
      A user reported a problem where things like open with O_CREAT would take up to
      30 seconds when he had nfs activity on the same mount.  This is because all of
      our quick metadata operations, like create, symlink etc all do
      btrfs_end_transaction_throttle, which if the transaction is blocked will wait
      for the commit to complete before it returns.  This adds a ridiculous amount of
      latency and isn't really needed.  The normal btrfs_end_transaction will mark the
      transaction as blocked and wake the transaction kthread up if it thinks the
      transaction needs to end (this being in the running out of global reserve space
      scenario), and this is all that is really needed since we've already done
      everything we're going to do, we just need to return.  This should help people
      with the latency they were seeing when using synchronous heavy workloads.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
  13. 05 January, 2012 1 commit
  14. 04 January, 2012 5 commits
  15. 23 December, 2011 1 commit
  16. 22 December, 2011 1 commit
    • Btrfs: mark delayed refs as for cow · 66d7e7f0
      Committed by Arne Jansen
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
  17. 16 December, 2011 4 commits
    • Btrfs: don't panic if orphan item already exists · ee4d89f0
      Committed by Josef Bacik
      I've been hitting this BUG_ON() in btrfs_orphan_add when running xfstest 269 in
      a loop.  This is because we will add an orphan item, do the truncate, the
      truncate will fail for whatever reason (*cough*ENOSPC*cough*) and then we're
      left with an orphan item still in the fs.  Then we come back later to do another
      truncate and it blows up because we already have an orphan item.  This is ok so
      just fix the BUG_ON() to only BUG() if ret is not EEXIST.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
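The change described here, treating an already-present orphan item as acceptable while still failing hard on any other error, can be sketched like this. The names `struct orphan_store`, `insert_orphan_item`, and `orphan_add_sketch` are hypothetical, and `assert()` plays the role that BUG_ON() plays in the kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the on-disk orphan item of a single inode. */
struct orphan_store {
    bool has_item;
};

static int insert_orphan_item(struct orphan_store *s)
{
    if (s->has_item)
        return -EEXIST;   /* left over from an earlier failed truncate */
    s->has_item = true;
    return 0;
}

/* An existing item is fine; any other failure is still treated as fatal. */
static int orphan_add_sketch(struct orphan_store *s)
{
    int ret = insert_orphan_item(s);

    assert(ret == 0 || ret == -EEXIST);   /* stands in for BUG_ON() */
    return 0;
}
```

Calling `orphan_add_sketch` twice on the same store succeeds both times, which mirrors the retry-after-failed-truncate scenario in the commit message.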
    • Btrfs: fix leaked space in truncate · 7041ee97
      Committed by Josef Bacik
      We were occasionally leaking space when running xfstest 269.  This is because if
      we failed to start the transaction in the truncate loop we'd just goto out, but
      we need to break so that the inode is removed from the orphan list and the space
      is properly freed.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: fix how we do delalloc reservations and how we free reservations on error · 660d3f6c
      Committed by Josef Bacik
      Running xfstests 269 with some tracing my scripts kept spitting out errors about
      releasing bytes that we didn't actually have reserved.  This took me down a huge
      rabbit hole and it turns out the way we deal with reserved_extents is wrong,
      we need to only be setting it if the reservation succeeds, otherwise the free()
      method will come in and unreserve space that isn't actually reserved yet, which
      can lead to other warnings and such.  The math was all working out right in the
      end, but it caused all sorts of other issues in addition to making my scripts
      yell and scream and generally make it impossible for me to track down the
      original issue I was looking for.  The other problem is with our error handling
      in the reservation code.  There are two cases that we need to deal with:

      1) We raced with free.  In this case free won't free anything because csum_bytes
      is modified before we drop the lock in our reservation path, so free rightly
      doesn't release any space because the reservation code may be depending on that
      reservation.  However if we fail, we need the reservation side to do the free at
      that point since that space is no longer in use.  So as it stands the code was
      doing this fine and it worked out, except in case #2
      
      2) We don't race with free.  Nobody comes in and changes anything, and our
      reservation fails.  In this case we didn't reserve anything anyway and we just
      need to clean up csum_bytes but not free anything.  So we keep track of
      csum_bytes before we drop the lock and if it hasn't changed we know we can just
      decrement csum_bytes and carry on.
      
      Because of the case where we can race with free()'s since we have to drop our
      spin_lock to do the reservation, I'm going to serialize all reservations with
      the i_mutex.  We already get this for free in the heavy use paths, truncate and
      file write all hold the i_mutex, just needed to add it to page_mkwrite and
      various ioctl/balance things.  With this patch my space leak scripts no longer
      scream bloody murder.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
    • Btrfs: deal with enospc from dirtying inodes properly · 22c44fe6
      Committed by Josef Bacik
      Now that we're properly keeping track of delayed inode space we've been getting
      a lot of warnings out of btrfs_dirty_inode() when running xfstest 83.  This is
      because a bunch of people call mark_inode_dirty, which is void so we can't
      return ENOSPC.  This needs to be fixed in a few areas:
      
      1) file_update_time - this updates the mtime and such when writing to a file,
      which will call mark_inode_dirty.  So copy file_update_time into btrfs so we can
      call btrfs_dirty_inode directly and return an error if we get one appropriately.
      
      2) fix symlinks to use btrfs_setattr for ->setattr.  For some reason we weren't
      setting ->setattr for symlinks, even though we should have been.  This catches
      one of the cases where we were getting errors in mark_inode_dirty.
      
      3) Fix btrfs_setattr and btrfs_setsize to call btrfs_dirty_inode directly
      instead of mark_inode_dirty.  This lets us return errors properly for truncate
      and chown/anything related to setattr.
      
      4) Add a new btrfs_fs_dirty_inode which will just call btrfs_dirty_inode and
      print an error if we have one.  The only remaining user we can't control for
      this is touch_atime(), but we don't really want to keep people from walking
      down the tree if we don't have space to save the atime update, so just complain
      but don't worry about it.
      
      With this patch xfstests 83 complains a handful of times instead of hundreds of
      times.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
  18. 15 December, 2011 2 commits
    • BTRFS: Establish i_ops before calling d_instantiate · ad19db71
      Committed by Casey Schaufler
      The Smack LSM hook for security_d_instantiate checks
      the inode's i_op->getxattr value to determine if the
      containing filesystem supports extended attributes.
      The BTRFS filesystem sets the inode's i_op value only
      after it has instantiated the inode. This results in
      Smack incorrectly giving new BTRFS inodes attributes
      from the filesystem defaults on the assumption that
      values can't be stored on the filesystem. This patch
      moves the assignment of inode operation vectors ahead
      of the calls to d_instantiate, letting Smack know that
      the filesystem supports extended attributes. There
      should be no impact on the performance or behavior of
      BTRFS.
      Signed-off-by: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
    • btrfs: keep orphans for subvolume deletion · f8e9e0b0
      Committed by Arne Jansen
      Since we have the free space caches, btrfs_orphan_cleanup also runs for
      the tree_root. Unfortunately this also cleans up the orphans used to mark
      subvol deletions in progress.
      
      Currently if a subvol deletion gets interrupted twice by umount/mount, the
      deletion will not be continued and the space permanently lost, though it
      would be possible to write a tool to recover those lost subvol deletions.
      This patch checks if the orphan belongs to a subvol (dead root) and skips
      the deletion.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>