1. 17 12月, 2012 1 次提交
  2. 13 12月, 2012 4 次提交
  3. 13 10月, 2012 1 次提交
    • E
      btrfs: Fix compilation with user namespace support enabled · e9069f47
      Eric W. Biederman 提交于
      When compiling with user namespace support btrfs fails like:
      
      fs/btrfs/tree-log.c: In function ‘fill_inode_item’:
      fs/btrfs/tree-log.c:2955:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_uid’
      fs/btrfs/ctree.h:2026:1: note: expected ‘u32’ but argument is of type ‘kuid_t’
      fs/btrfs/tree-log.c:2956:2: error: incompatible type for argument 3 of ‘btrfs_set_inode_gid’
      fs/btrfs/ctree.h:2027:1: note: expected ‘u32’ but argument is of type ‘kgid_t’
      
      Fix this by using i_uid_read and i_gid_read in
      
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      e9069f47
  4. 09 10月, 2012 9 次提交
  5. 04 10月, 2012 1 次提交
    • J
      Btrfs: do not hold the write_lock on the extent tree while logging · ff44c6e3
      Josef Bacik 提交于
      Dave Sterba pointed out a sleeping while atomic bug while doing fsync.  This
      is because I'm an idiot and didn't realize that rwlock's were spin locks, so
      we've been holding this thing while doing allocations and such which is not
      good.  This patch fixes this by dropping the write lock before we do
      anything heavy and re-acquire it when it is done.  We also need to take a
      ref on the em's in case their corresponding pages are evicted and mark them
      as being logged so that releasepage does not remove them and doesn't remove
      them from our local list.  Thanks,
      Reported-by: NDave Sterba <dave@jikos.cz>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      ff44c6e3
  6. 02 10月, 2012 9 次提交
    • M
      Btrfs: fix unprotected ->log_batch · 2ecb7923
      Miao Xie 提交于
      We forget to protect ->log_batch when syncing a file, this patch fix
      this problem by atomic operation. And ->log_batch is used to check
      if there are parallel sync operations or not, so it is unnecessary to
      reset it to 0 after the sync operation of the current log tree complete.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      2ecb7923
    • J
      Btrfs: add hole punching · 2aaa6655
      Josef Bacik 提交于
      This patch adds hole punching via fallocate.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2aaa6655
    • J
      Btrfs: remove unused hint byte argument for btrfs_drop_extents · 2671485d
      Josef Bacik 提交于
      I audited all users of btrfs_drop_extents and found that nobody actually uses
      the hint_byte argument.  I'm sure it was used for something at some point but
      it's not used now, and the way the pinning works the disk bytenr would never be
      immediately useful anyway so lets just remove it.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2671485d
    • L
      Btrfs: check if an inode has no checksum when logging it · d2794405
      Liu Bo 提交于
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      If an inode is a BTRFS_INODE_NODATASUM one, we don't need to look for csum
      items any more.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      d2794405
    • L
      Btrfs: fix a bug in checking whether a inode is already in log · 46d8bc34
      Liu Bo 提交于
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      The current btrfs checks if an inode is in log by comparing
      root's last_log_commit to inode's last_sub_trans[2].
      
      But the problem is that this root->last_log_commit is shared among
      inodes.
      
      Say we have N inodes to be logged, after the first inode,
      root's last_log_commit is updated and the N-1 remained files will
      be skipped.
      
      This fixes the bug by keeping a local copy of root's last_log_commit
      inside each inode and this local copy will be maintained itself.
      
      [1]: we regard each log transaction as a subset of btrfs's transaction,
      i.e. sub_trans
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      46d8bc34
    • L
      Btrfs: improve fsync by filtering extents that we want · 4e2f84e6
      Liu Bo 提交于
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      The above Josef's patch performs very good in random sync write test,
      because we won't have too much extents to merge.
      
      However, it does not performs good on the test:
      dd if=/dev/zero of=foobar bs=4k count=12500 oflag=sync
      
      The reason is when we do sequencial sync write, we need to merge the
      current extent just with the previous one, so that we can get accumulated
      extents to log:
      
      A(4k) --> AA(8k) --> AAA(12k) --> AAAA(16k) ...
      
      So we'll have to flush more and more checksum into log tree, which is the
      bottleneck according to my tests.
      
      But we can avoid this by telling fsync the real extents that are needed
      to be logged.
      
      With this, I did the above dd sync write test (size=50m),
      
               w/o (orig)   w/ (josef's)   w/ (this)
      SATA      104KB/s       109KB/s       121KB/s
      ramdisk   1.5MB/s       1.5MB/s       10.7MB/s (613%)
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      4e2f84e6
    • L
      Btrfs: cleanup extents after we finish logging inode · 06d3d22b
      Liu Bo 提交于
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      We should cleanup those extents after we've finished logging inode,
      otherwise we may do redundant work on them.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      06d3d22b
    • J
      Btrfs: only warn if we hit an error when doing the tree logging · 0fa83cdb
      Josef Bacik 提交于
      I hit this a couple times while working on my fsync patch (all my bugs, not
      normal operation), but with my new stuff we could have new errors from cases
      I have not encountered, so instead of BUG()'ing we should be WARN()'ing so
      that we are notified there is a problem but the user doesn't lose their
      data.  We can easily commit the transaction in the case that the tree
      logging fails and still be fine, so let's try and be as nice to the user as
      possible.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      0fa83cdb
    • J
      Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik 提交于
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worst yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      5dc562c5
  7. 24 7月, 2012 1 次提交
  8. 03 7月, 2012 1 次提交
    • C
      Btrfs: run delayed directory updates during log replay · b6305567
      Chris Mason 提交于
      While we are resolving directory modifications in the
      tree log, we are triggering delayed metadata updates to
      the filesystem btrees.
      
      This commit forces the delayed updates to run so the
      replay code can find any modifications done.  It stops
      us from crashing because the directory deleltion replay
      expects items to be removed immediately from the tree.
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      cc: stable@kernel.org
      b6305567
  9. 30 5月, 2012 3 次提交
    • J
      Btrfs: fix return code in drop_objectid_items · 5bdbeb21
      Josef Bacik 提交于
      So dpkg fsync()'s the file and the directory containing the file whenever it
      writes to a file which is really slow in btrfs.  This is partly because
      fsync()'ing a directory _always_ committed the transaction instead of just
      going to the tree log.  This is because drop_objectid_items() would return 1
      since it does a btrfs_search_slot() which returns 1.  In tree-log jargon
      this means that we have to commit the transaction to be safe.  So just check
      if ret is greater than 0 and set it to 0 if it does.  With this patch we now
      use the tree-log instead of committing the entire transaction, which is
      twice as fast on my box.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5bdbeb21
    • J
      Btrfs: check to see if the inode is in the log before fsyncing · 22ee6985
      Josef Bacik 提交于
      We have this check down in the actual logging code, but this is after we
      start a transaction and all that good stuff.  So move the helper
      inode_in_log() out so we can call it in fsync() and avoid starting a
      transaction altogether and just exit if we've already fsync()'ed this file
      recently.  You would notice this issue if you fsync()'ed a file over and
      over again until the transaction committed.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      22ee6985
    • T
      Btrfs: return value of btrfs_read_buffer is checked correctly · 018642a1
      Tsutomu Itoh 提交于
      btrfs_read_buffer() has the possibility of returning the error.
      Therefore, I add the code in which the return value of btrfs_read_buffer()
      is checked.
      Signed-off-by: NTsutomu Itoh <t-itoh@jp.fujitsu.com>
      018642a1
  10. 06 5月, 2012 1 次提交
    • C
      Btrfs: avoid sleeping in verify_parent_transid while atomic · b9fab919
      Chris Mason 提交于
      verify_parent_transid needs to lock the extent range to make
      sure no IO is underway, and so it can safely clear the
      uptodate bits if our checks fail.
      
      But, a few callers are using it with spinlocks held.  Most
      of the time, the generation numbers are going to match, and
      we don't want to switch to a blocking lock just for the error
      case.  This adds an atomic flag to verify_parent_transid,
      and changes it to return EAGAIN if it needs to block to
      properly verifiy things.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      b9fab919
  11. 22 3月, 2012 2 次提交
  12. 27 1月, 2012 1 次提交
  13. 22 12月, 2011 1 次提交
    • A
      Btrfs: mark delayed refs as for cow · 66d7e7f0
      Arne Jansen 提交于
      Add a for_cow parameter to add_delayed_*_ref and pass the appropriate value
      from every call site. The for_cow parameter will later on be used to
      determine if a ref will change anything with respect to qgroups.
      
      Delayed refs coming from relocation are always counted as for_cow, as they
      don't change subvol quota.
      
      Also pass in the fs_info for later use.
      
      btrfs_find_all_roots() will use this as an optimization, as changes that are
      for_cow will not change anything with respect to which root points to a
      certain leaf. Thus, we don't need to add the current sequence number to
      those delayed refs.
      Signed-off-by: NArne Jansen <sensille@gmx.net>
      Signed-off-by: NJan Schmidt <list.btrfs@jan-o-sch.net>
      66d7e7f0
  14. 06 11月, 2011 3 次提交
    • D
      btrfs: separate superblock items out of fs_info · 6c41761f
      David Sterba 提交于
      fs_info has now ~9kb, more than fits into one page. This will cause
      mount failure when memory is too fragmented. Top space consumers are
      super block structures super_copy and super_for_commit, ~2.8kb each.
      Allocate them dynamically. fs_info will be ~3.5kb. (measured on x86_64)
      
      Add a wrapper for freeing fs_info and all of it's dynamically allocated
      members.
      Signed-off-by: NDavid Sterba <dsterba@suse.cz>
      6c41761f
    • C
      Btrfs: fix extent pinning bugs in the tree log · e688b725
      Chris Mason 提交于
      The tree log had two important bugs that could cause corruptions after a
      crash.  Sometimes we were allowing tree log blocks to be reused after
      the tree log was committed but before the transaction commit was done.
      
      This allowed a future metadata write to overwrite the tree log data.  It
      is fixed by adding a new variant of freeing reserved extents that always
      pins them.  Credit goes to Stefan Behrens and Arne Jansen for many many
      hours spent tracking this bug down.
      
      During tree log replay, we do a pass through the tree log and pin all
      the extents we find.  This makes sure the replay code won't go in and
      use any of those blocks for new allocations during replay.  The problem
      is the free space cache isn't honoring these pinned extents.  So the
      allocator can end up handing them out, leading to all kinds of problems
      during replay.
      
      The fix here is to force any free space cache to load while we pin the
      extents, and then to make sure we remove the pinned extents from the
      free space rbtree.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      Reported-by: NStefan Behrens <sbehrens@giantdisaster.de>
      e688b725
    • C
      Btrfs: don't wait as long for more batches during SSD log commit · cd354ad6
      Chris Mason 提交于
      When we're doing log commits, we try to wait for more writers to come in
      and make the commit bigger.  This helps improve performance on rotating
      disks, but on SSDs it adds latencies.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      cd354ad6
  15. 02 11月, 2011 1 次提交
  16. 17 8月, 2011 1 次提交
    • L
      Btrfs: fix an oops of log replay · 34f3e4f2
      liubo 提交于
      When btrfs recovers from a crash, it may hit the oops below:
      
      ------------[ cut here ]------------
      kernel BUG at fs/btrfs/inode.c:4580!
      [...]
      RIP: 0010:[<ffffffffa03df251>]  [<ffffffffa03df251>] btrfs_add_link+0x161/0x1c0 [btrfs]
      [...]
      Call Trace:
       [<ffffffffa03e7b31>] ? btrfs_inode_ref_index+0x31/0x80 [btrfs]
       [<ffffffffa04054e9>] add_inode_ref+0x319/0x3f0 [btrfs]
       [<ffffffffa0407087>] replay_one_buffer+0x2c7/0x390 [btrfs]
       [<ffffffffa040444a>] walk_down_log_tree+0x32a/0x480 [btrfs]
       [<ffffffffa0404695>] walk_log_tree+0xf5/0x240 [btrfs]
       [<ffffffffa0406cc0>] btrfs_recover_log_trees+0x250/0x350 [btrfs]
       [<ffffffffa0406dc0>] ? btrfs_recover_log_trees+0x350/0x350 [btrfs]
       [<ffffffffa03d18b2>] open_ctree+0x1442/0x17d0 [btrfs]
      [...]
      
      This comes from that while replaying an inode ref item, we forget to
      check those old conflicting DIR_ITEM and DIR_INDEX items in fs/file tree,
      then we will come to conflict corners which lead to BUG_ON().
      Signed-off-by: NLiu Bo <liubo2009@cn.fujitsu.com>
      Tested-by: NAndy Lutomirski <luto@mit.edu>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      34f3e4f2