1. 26 July 2012, 1 commit
    • Btrfs: introduce subvol uuids and times · 8ea05e3a
      Committed by Alexander Block
      This patch introduces uuids for subvolumes. Each
      subvolume has its own uuid. If it was snapshotted,
      it also carries a parent_uuid. If it was received,
      it also carries a received_uuid.

      It also introduces the subvolume times ctime/otime/stime/rtime. The
      first two are comparable to the times found in inodes: otime
      is the origin/creation time and ctime is the change time.
      stime/rtime are only valid on received subvolumes:
      stime is the time of the subvolume when it was
      sent, and rtime is the time of the subvolume when it was
      received.

      In addition to the times, we keep a transid for each
      time. They are updated in the same places as the times.

      btrfs receive uses stransid and rtransid to find out
      whether a received subvolume changed in the meantime.

      If an older kernel mounts a filesystem with the
      extended fields, all of these fields become invalid. The next
      mount with a new kernel will detect this and reset the
      fields.
      Signed-off-by: Alexander Block <ablock84@googlemail.com>
      Reviewed-by: David Sterba <dave@jikos.cz>
      Reviewed-by: Arne Jansen <sensille@gmx.net>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Reviewed-by: Alex Lyakas <alex.bolshoy.btrfs@gmail.com>
      8ea05e3a
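      To make the new metadata concrete, below is a minimal, hypothetical C sketch of the
      per-subvolume fields described above. Field names follow the commit message; the real
      on-disk structure is btrfs's root item, which uses little-endian integer types and a
      different layout, so this is an illustration rather than the actual kernel definition.

        #include <stdint.h>

        #define UUID_SIZE 16

        struct ts_le {                          /* seconds + nanoseconds, as stored on disk */
                uint64_t sec;
                uint32_t nsec;
        };

        struct subvol_extra_fields {            /* illustrative only */
                uint8_t  uuid[UUID_SIZE];           /* this subvolume's own uuid            */
                uint8_t  parent_uuid[UUID_SIZE];    /* set when it was snapshotted          */
                uint8_t  received_uuid[UUID_SIZE];  /* set when it was received             */

                uint64_t ctransid;                  /* transid of the last change           */
                uint64_t otransid;                  /* transid at creation (origin)         */
                uint64_t stransid;                  /* transid when sent                    */
                uint64_t rtransid;                  /* transid when received                */

                struct ts_le ctime;                 /* change time                          */
                struct ts_le otime;                 /* origin/creation time                 */
                struct ts_le stime;                 /* send time (received subvolumes only) */
                struct ts_le rtime;                 /* receive time (received subvolumes only) */
        };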
  2. 03 July 2012, 2 commits
    • Btrfs: fix wrong check during log recovery · 6bf02314
      Committed by Liu Bo
      When we're evicting an inode during log recovery, we need to ensure that the inode
      is no longer in the orphan state, which means the inode's runtime_flags must _not_
      have BTRFS_INODE_HAS_ORPHAN_ITEM set.  The BUG_ON was being triggered because of a
      wrong check for this flag.
      Reviewed-by: David Sterba <dsterba@suse.cz>
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      6bf02314
    • Btrfs: fix dio write vs buffered read race · c3473e83
      Committed by Josef Bacik
      Miao pointed out there's a problem with mixing dio writes and buffered
      reads.  If the read happens between us invalidating the page range and
      actually locking the extent, we can bring pages into the page cache.  Then
      once the write finishes, if somebody tries to read again they will just find
      uptodate pages and we'll read stale data.  So we need to lock the extent and
      check for uptodate bits in the range.  If there are uptodate bits we need to
      unlock and invalidate again.  This keeps the race from happening, since
      we hold the extent locked until we create the ordered extent, and the
      read side always waits for ordered extents.  There was also a race in
      how we updated i_size: previously we were relying on the generic DIO code
      to adjust the i_size after the DIO had completed, but that happens outside
      of the extent lock, which means reads could come in and not see the updated
      i_size.  So instead move this work into where we create the extents; that
      way the ordered i_size update works properly in the endio
      handlers.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      c3473e83
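      The locking dance described above amounts to a retry loop.  Here is a hypothetical
      sketch of that pattern; the helper names (lock_extent_range, range_has_uptodate_pages,
      unlock_extent_range, invalidate_range) are illustrative stand-ins, not the actual
      btrfs functions.

        struct inode;

        /* Illustrative prototypes standing in for the real extent-lock and
         * page-cache helpers used by the DIO write path. */
        void lock_extent_range(struct inode *inode, long long start, long long end);
        void unlock_extent_range(struct inode *inode, long long start, long long end);
        int  range_has_uptodate_pages(struct inode *inode, long long start, long long end);
        void invalidate_range(struct inode *inode, long long start, long long end);

        static void lock_range_for_dio_write(struct inode *inode,
                                             long long start, long long end)
        {
                for (;;) {
                        lock_extent_range(inode, start, end);

                        /* A buffered read may have repopulated the page cache between
                         * our earlier invalidation and taking the extent lock. */
                        if (!range_has_uptodate_pages(inode, start, end))
                                return;   /* clean range: safe to do the DIO write */

                        /* Stale pages slipped in: unlock, invalidate, and retry. */
                        unlock_extent_range(inode, start, end);
                        invalidate_range(inode, start, end);
                }
        }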
  3. 21 June 2012, 1 commit
  4. 15 June 2012, 5 commits
    • Btrfs: fix missing inherited flag in rename · bc178237
      Committed by Liu Bo
      When we move a file into a directory with the compression flag set, we need to
      inherit BTRFS_INODE_COMPRESS and clear BTRFS_INODE_NOCOMPRESS as well.
      But if we move a file into a directory without the compression flag, we need
      to clear both of them.

      This is how our setflags code deals with the compression flag, so keep
      the same behaviour here.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      bc178237
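      The inheritance rule above is small enough to capture in a few lines.  This is a
      self-contained sketch, not the btrfs rename code itself; the flag values are made up
      for illustration, only the names follow the commit message.

        /* Illustrative flag bits (the real values live in the btrfs headers). */
        #define BTRFS_INODE_COMPRESS   (1u << 0)
        #define BTRFS_INODE_NOCOMPRESS (1u << 1)

        /* Apply the rule described above when an inode moves into a directory
         * whose flags are 'dir_flags'. */
        static unsigned int inherit_compress_flags(unsigned int dir_flags,
                                                   unsigned int inode_flags)
        {
                if (dir_flags & BTRFS_INODE_COMPRESS) {
                        inode_flags |= BTRFS_INODE_COMPRESS;    /* inherit compression */
                        inode_flags &= ~BTRFS_INODE_NOCOMPRESS; /* drop the opt-out    */
                } else {
                        /* Target directory is not compressed: clear both flags. */
                        inode_flags &= ~(BTRFS_INODE_COMPRESS | BTRFS_INODE_NOCOMPRESS);
                }
                return inode_flags;
        }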
    • Btrfs: call filemap_fdatawrite twice for compression · 7ddf5a42
      Committed by Josef Bacik
      I removed this in an earlier commit and I was wrong.  Because compression
      can return from filemap_fdatawrite() without having actually set any of its
      pages as writeback, it can make filemap_fdatawait() do essentially nothing,
      and then we won't find any ordered extents because they may not have been
      created yet.  So not only does this make fsync() completely useless, but it
      will also screw up if you truncate on a non-page-aligned offset, since we
      zero out the end, then wait on ordered extents, and then call drop caches.
      We can drop the cache before the io completes, and then when we try to unpin the
      extent we just wrote we won't find it and everything goes sideways.  So fix
      this by putting it back, and put a giant comment there to keep me from trying
      to remove it in the future.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      7ddf5a42
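      The resulting flush/wait ordering can be sketched as below.  This is a rough
      kernel-context illustration, not the actual btrfs ordered-extent wait code, and the
      wrapper name flush_and_wait_range is made up; filemap_fdatawrite_range() and
      filemap_fdatawait_range() are the generic kernel helpers the commit message refers to.

        #include <linux/fs.h>

        /* Flush twice: the first pass may hand pages to the compression workers
         * without marking them writeback, so a second pass is needed to make
         * sure that IO is actually submitted before we wait on it. */
        static void flush_and_wait_range(struct address_space *mapping,
                                         loff_t start, loff_t end)
        {
                filemap_fdatawrite_range(mapping, start, end);  /* first pass           */
                filemap_fdatawrite_range(mapping, start, end);  /* catch compressed IO  */
                filemap_fdatawait_range(mapping, start, end);   /* now wait for it all  */
        }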
    • Btrfs: keep inode pinned when compressing writes · 8180ef88
      Committed by Josef Bacik
      A user reported lots of problems using compression on the new code, and it
      turns out part of the problem was that igrab() was failing when we added a
      new ordered extent.  This is because when writing out an inode under
      compression we immediately return without actually doing anything to the
      pages, and then another thread at some point down the line actually does
      the ordered dance.  The problem is that between the point where we start writeback
      and the point where we actually add the ordered extent, the inode could be reclaimed,
      which makes igrab() return NULL.  So we need to do an igrab() when we
      create the async extent and then drop it when we are done with it.  This
      makes sure the inode stays pinned in memory until the ordered extent can get a
      reference on it and we are good to go.  With this patch we no longer panic
      in btrfs_finish_ordered_io().  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      8180ef88
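      The pinning described above follows a simple take-then-drop pattern.  The sketch
      below is a rough kernel-context illustration: struct async_work and the two helper
      functions are hypothetical, while igrab() and iput() are the real VFS reference
      helpers the commit message talks about.

        #include <linux/fs.h>

        struct async_work {
                struct inode *inode;    /* pinned for the lifetime of the async work */
                /* ... compression range, pages, etc. ... */
        };

        static void start_async_compress(struct async_work *work, struct inode *inode)
        {
                /* Pin the inode before handing the range to the worker thread,
                 * so reclaim can't free it while writeback is still pending. */
                work->inode = igrab(inode);
        }

        static void finish_async_compress(struct async_work *work)
        {
                /* By now the ordered extent holds its own reference; drop our pin. */
                iput(work->inode);
        }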
    • Btrfs: unlock everything properly in the error case for nocow · 17ca04af
      Committed by Josef Bacik
      I was getting hung on umount when a transaction was aborted, because a range
      of one of the free space inodes was still locked.  This is because the nocow
      path doesn't unlock anything on error.  This fixes the problem, and I
      verified that this is what was happening.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      17ca04af
    • Btrfs: pass locked_page into extent_clear_unlock_delalloc if theres an error · beb42dd7
      Committed by Josef Bacik
      While doing my enospc work I got a transaction abortion that resulted in a
      panic when we tried to unlock_page() an already unlocked page.  This is
      because we weren't calling extent_clear_unlock_delalloc with the locked page,
      so it was unlocking all the pages in the range.  This is wrong since
      __extent_writepage expects to have the page locked still unless we return
      *page_started as 1.  This should keep us from panicking.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      beb42dd7
  5. 02 June 2012, 1 commit
  6. 30 May 2012, 5 commits
    • Btrfs: fall back to non-inline if we don't have enough space · 2adcac1a
      Committed by Josef Bacik
      If cow_file_range_inline fails with ENOSPC we abort the transaction, which
      isn't very nice.  This really shouldn't be happening anyway, but there's no
      sense in turning it into a horrible error when we can easily just go allocate
      normal data space for this stuff.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      2adcac1a
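      The fallback described above boils down to treating ENOSPC from the inline attempt
      as "do a normal allocation instead" rather than as a fatal error.  A hypothetical
      sketch of that decision follows; the helper names (try_inline_write,
      cow_file_range_normally) are made up and this is not the actual btrfs control flow.

        #include <errno.h>

        /* Illustrative stand-ins for the inline attempt and the normal COW path. */
        int try_inline_write(void *inode, long long start, long long end);
        int cow_file_range_normally(void *inode, long long start, long long end);

        static int write_range(void *inode, long long start, long long end)
        {
                int ret = try_inline_write(inode, start, end);

                if (ret == 0)
                        return 0;               /* the data fit inline, we're done */
                if (ret != -ENOSPC)
                        return ret;             /* a real error: propagate it      */

                /* -ENOSPC just means "no room to inline": fall back to allocating
                 * ordinary data extents instead of aborting the transaction. */
                return cow_file_range_normally(inode, start, end);
        }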
    • Btrfs: fix how we deal with the orphan block rsv · 8a35d95f
      Committed by Josef Bacik
      Ceph was hitting this race where we would remove an inode from the per-root
      orphan list before we would release the space we had reserved for the inode.
      We actually don't need a list or anything, we just need to make sure the
      root doesn't try to free up the orphan reserve until after the inodes have
      released their reservations.  So use an atomic counter instead of a list on
      the root and only decrement the counter after we've released our
      reservation.  I've tested this as well as several others and we no longer
      see the warnings that you would see while running ceph.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      8a35d95f
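      The counting scheme described above maps onto a plain atomic counter.  Below is a
      userspace illustration using C11 stdatomic (the kernel uses its own atomic_t); the
      names root_ctx, orphan_reserve and orphan_release are hypothetical, not the btrfs ones.

        #include <stdatomic.h>
        #include <stdbool.h>

        struct root_ctx {
                atomic_int orphan_inodes;   /* inodes still holding an orphan reservation */
        };

        /* An inode takes its orphan reservation: bump the per-root count. */
        static void orphan_reserve(struct root_ctx *root)
        {
                atomic_fetch_add(&root->orphan_inodes, 1);
        }

        /* An inode releases its reservation.  Returns true only for the last
         * holder, i.e. only then is it safe to free the per-root orphan
         * block reservation. */
        static bool orphan_release(struct root_ctx *root)
        {
                return atomic_fetch_sub(&root->orphan_inodes, 1) == 1;
        }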
    • Btrfs: convert the inode bit field to use the actual bit operations · 72ac3c0d
      Committed by Josef Bacik
      While I was working on an orphan problem, Miao pointed out that messing
      with a bitfield whose different ranges are protected by different locks
      doesn't work out right.  It turns out we've been doing this forever: we
      have different parts of the bit field protected by either no lock at all or
      by different locks, which could cause all sorts of weird problems, including the
      issue I was hitting.  So instead introduce a runtime_flags field that we use the
      normal atomic bit operations on, so we can keep our
      no/different locking for the different flags, and make force_compress
      its own field so it can be treated normally.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      72ac3c0d
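      The point of the change is that adjacent members of a C bitfield cannot be updated
      independently under different locks, whereas individual bits of one word can be if
      they are flipped with atomic bit operations.  The kernel code uses set_bit(),
      clear_bit() and test_bit() on an unsigned long runtime_flags; the self-contained
      userspace sketch below illustrates the same idea with C11 stdatomic and made-up names.

        #include <stdatomic.h>
        #include <stdbool.h>

        enum { INODE_ORPHAN_BIT, INODE_DELALLOC_BIT };   /* illustrative flag bits */

        struct inode_runtime {
                atomic_ulong runtime_flags;   /* one word, each bit updated atomically */
        };

        static void rt_set_bit(struct inode_runtime *i, unsigned int bit)
        {
                atomic_fetch_or(&i->runtime_flags, 1UL << bit);
        }

        static void rt_clear_bit(struct inode_runtime *i, unsigned int bit)
        {
                atomic_fetch_and(&i->runtime_flags, ~(1UL << bit));
        }

        static bool rt_test_bit(struct inode_runtime *i, unsigned int bit)
        {
                return atomic_load(&i->runtime_flags) & (1UL << bit);
        }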
    • Btrfs: finish ordered extents in their own thread · 5fd02043
      Committed by Josef Bacik
      We noticed that ordered extent completion doesn't really rely on having
      a page, and that it can be done independently of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes, so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but at the moment it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      5fd02043
    • Btrfs: use i_version instead of our own sequence · 0c4d2d95
      Committed by Josef Bacik
      We've been keeping around the inode sequence number in the hope that somebody
      would use it, but nobody uses it, and people actually use i_version, which
      serves the same purpose.  So use i_version where we used the in-core inode's
      sequence number; that way the sequence is updated properly across the
      board, and not just in file write.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      0c4d2d95
  7. 06 May 2012, 1 commit
  8. 28 April 2012, 1 commit
  9. 19 April 2012, 3 commits
    • Btrfs: always store the mirror we read the eb from · 5cf1ab56
      Committed by Josef Bacik
      A user reported a panic where we were trying to fix a bad mirror but the
      mirror number we were giving was 0, which is invalid.  This is because we
      don't do the transid verification until after the read, so as far as the
      read code is concerned the read was a success.  So instead store the mirror
      we read from so that if there is some failure post read we know which mirror
      to try next and which mirror needs to be fixed if we find a good copy of the
      block.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      5cf1ab56
    • btrfs: fix race in reada · 8c9c2bf7
      Committed by Arne Jansen
      When inserting into the radix tree returns EEXIST, get the existing
      entry without giving up the spinlock in between.
      There was a race for both the zones trees and the extent tree.
      Signed-off-by: Arne Jansen <sensille@gmx.net>
      8c9c2bf7
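      The fix applies a simple rule: if the insertion fails with -EEXIST, look up the
      existing entry while still holding the same spinlock, so nothing can change the tree
      in between.  The sketch below is a kernel-context illustration of that pattern with a
      made-up wrapper name (insert_or_get); it is not the actual reada code.

        #include <linux/errno.h>
        #include <linux/radix-tree.h>
        #include <linux/spinlock.h>

        /* Insert 'new_entry' at 'index'; on -EEXIST return the entry that is
         * already there, resolved without ever dropping the lock. */
        static void *insert_or_get(struct radix_tree_root *tree, spinlock_t *lock,
                                   unsigned long index, void *new_entry)
        {
                void *entry = new_entry;
                int ret;

                spin_lock(lock);
                ret = radix_tree_insert(tree, index, new_entry);
                if (ret == -EEXIST)
                        entry = radix_tree_lookup(tree, index);  /* still under the lock */
                spin_unlock(lock);

                return entry;
        }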
    • Btrfs: avoid setting ->d_op twice · 848cce0d
      Committed by Li Zefan
      Follow these steps, and you'll trigger a warning at the
      beginning of d_set_d_op():
      
        # mkfs.btrfs /dev/loop3
        # mount /dev/loop3 /mnt
        # btrfs sub create /mnt/sub
        # btrfs sub snap /mnt /mnt/snap
        # touch /mnt/snap/sub
        touch: cannot touch `tmp': Permission denied
      
      __d_alloc() set d_op to sb->s_d_op (btrfs_dentry_operations), and
      then simple_lookup() reset it to simple_dentry_operations, which
      triggered the warning.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      848cce0d
  10. 29 March 2012, 1 commit
    • Btrfs: fix recursive defragment with autodefrag option · 4cb13e5d
      Committed by Liu Bo
      $ mkfs.btrfs disk
      $ mount disk /mnt -o autodefrag
      $ dd if=/dev/zero of=/mnt/foobar bs=4k count=10 2>/dev/null && sync
      $ for i in `seq 9 -2 0`; do dd if=/dev/zero of=/mnt/foobar bs=4k count=1 \
        seek=$i conv=notrunc 2> /dev/null; done && sync
      
      then we'll end up defragmenting "foobar" again and again.
      The same happens with the "-o autodefrag,compress" options.

      Reason:
      When the cleaner kthread fetches inodes from the defrag tree and defragments
      them, it dirties pages and submits them; this leads to another data COW,
      where the inode being processed is inserted into the defrag tree again.

      This patch sets a rule for the COW code: only insert an inode into the defrag
      tree when we're really going to defragment it.
      Signed-off-by: Liu Bo <liubo2009@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      4cb13e5d
  11. 27 March 2012, 2 commits
  12. 22 March 2012, 8 commits
  13. 20 March 2012, 1 commit
  14. 23 February 2012, 1 commit
  15. 16 February 2012, 1 commit
  16. 15 February 2012, 1 commit
    • btrfs: delalloc for page dirtied out-of-band in fixup worker · 87826df0
      Committed by Jeff Mahoney
       We encountered an issue that was easily observable on s/390 systems but
       could really happen anywhere. The timing just seemed to hit reliably
       on s/390 with limited memory.
      
       The gist is that when an unexpected set_page_dirty() happened, we'd
       run into the BUG() in btrfs_writepage_fixup_worker since it wasn't
       properly set up for delalloc.
      
       This patch does the following:
       - Performs the missing delalloc in the fixup worker
       - Allows the start hook to return -EBUSY, which informs __extent_writepage
         that it should mark the page as skipped and not redirty it. This is
         required since the fixup worker can fail with -ENOSPC and the page
         will have already been redirtied. That would cause an Oops in
         drop_outstanding_extents later. Retrying the fixup worker could
         lead to an infinite loop. Deferring the page redirty also saves us
         some cycles, since the page would otherwise be stuck in a resubmit-redirty
         loop until the fixup worker completes. It's not harmful, just wasteful.
       - If the fixup worker fails, we mark the page and mapping as errored,
         and end the writeback, similar to what we would do had the page
         actually been submitted to writeback.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      87826df0
  17. 27 January 2012, 1 commit
    • Btrfs: fix reservations in btrfs_page_mkwrite · 9998eb70
      Committed by Chris Mason
      Josef fixed btrfs_page_mkwrite to properly release reserved
      extents if there was an error.  But if we fail to get a reservation
      and we fail to dirty the inode (for ENOSPC reasons), we'll end up
      trying to release a reservation we never had.
      
      This makes sure we only release if we were able to reserve.
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      9998eb70
  18. 17 January 2012, 4 commits
    • Btrfs: add a delalloc mutex to inodes for delalloc reservations · f248679e
      Committed by Josef Bacik
      I was using i_mutex for this, but we're getting bogus lockdep warnings by doing
      that and there's no real way to get rid of those, so just stop using i_mutex to
      protect delalloc metadata reservations and use a delalloc mutex instead.  This
      shouldn't be contended often at all, only if you are writing and mmap-writing to
      the file at the same time.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      f248679e
    • Btrfs: protect orphan block rsv with spin_lock · 90290e19
      Committed by Josef Bacik
      We've been seeing warnings coming out of the orphan commit stuff forever from
      ceph.  It turns out it's because we're racing when checking whether the orphan block
      reserve is set, because we clear it outside of the spin_lock.  So leave the
      normal fastpath checks where they are, but take the spin_lock and _recheck_ to
      make sure we haven't had an orphan block rsv added in the meantime.  Then clear
      the root's orphan block rsv and release the lock.  With this patch a user said
      the warnings, which usually showed up pretty soon after starting ceph,
      went away.  Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      90290e19
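      The fix is essentially the classic check/lock/recheck pattern.  Below is a
      self-contained userspace illustration using a pthread mutex in place of the kernel
      spinlock; the names root_ctx, block_rsv and take_orphan_rsv are hypothetical, not
      the btrfs ones.

        #include <pthread.h>
        #include <stddef.h>

        struct block_rsv;                       /* opaque reservation object */

        struct root_ctx {
                pthread_mutex_t   orphan_lock;
                struct block_rsv *orphan_rsv;   /* may be set/cleared by other threads */
        };

        /* Detach the orphan reservation if one is present, or return NULL. */
        static struct block_rsv *take_orphan_rsv(struct root_ctx *root)
        {
                struct block_rsv *rsv = NULL;

                /* Cheap unlocked peek: if nothing is set, skip the lock entirely. */
                if (!root->orphan_rsv)
                        return NULL;

                /* Recheck under the lock: another thread may have cleared (or
                 * just installed) the reservation since the peek above. */
                pthread_mutex_lock(&root->orphan_lock);
                if (root->orphan_rsv) {
                        rsv = root->orphan_rsv;
                        root->orphan_rsv = NULL;
                }
                pthread_mutex_unlock(&root->orphan_lock);

                return rsv;
        }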
    • Btrfs: release space on error in page_mkwrite · ec39e180
      Committed by Josef Bacik
      If updating the inode gave us an ENOSPC, we were just returning from page_mkwrite,
      which is a problem since we make our reservation right before trying to update
      the inode.  Fix the out label so that we actually free our reservation.
      Thanks,
      Signed-off-by: Josef Bacik <josef@redhat.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      ec39e180
    • Btrfs: fix btrfsck error 400 when truncating a compressed · f70a9a6b
      Committed by Miao Xie
      Steps to reproduce:
       # mkfs.btrfs /dev/sdb5
       # mount /dev/sdb5 -o compress=lzo /mnt
       # dd if=/dev/zero of=/mnt/tmpfile bs=128K count=1
       # sync
       # truncate -s 64K /mnt/tmpfile
       root 5 inode 257 errors 400
      
      This is caused by a wrong if condition, which is used to check whether we should
      subtract the bytes of the dropped range from the inode's i_blocks/i_bytes.
      When we truncate a compressed extent, btrfs subtracts the bytes of the whole
      extent, which is wrong. We should subtract the real size that we truncate,
      whether or not it is a compressed extent. Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <chris.mason@oracle.com>
      f70a9a6b