1. 01 Sep 2013, 10 commits
  2. 10 Aug 2013, 10 commits
    • btrfs: don't loop on large offsets in readdir · db62efbb
      Committed by Zach Brown
      When btrfs readdir() hits the last entry it sets the readdir offset to a
      huge value to stop buggy apps from breaking when the same name is
      returned by readdir() with concurrent rename()s.
      
      But unconditionally setting the offset to INT_MAX causes readdir() to
      loop returning any entries with offsets past INT_MAX.  It only takes a
      few hours of constant file creation and removal to create entries past
      INT_MAX.
      
      So let's set the huge offset to LLONG_MAX if the last entry has already
      overflowed a 32-bit loff_t.  Without large offsets the behaviour is identical.
      With large offsets 64bit apps will work and 32bit apps will be no more
      broken than they currently are if they see large offsets.
      Signed-off-by: Zach Brown <zab@redhat.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
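
      A minimal standalone sketch of the offset choice described above; the helper
      name and the use of limits.h constants are illustrative, not the kernel code
      itself:

      #include <limits.h>   /* INT_MAX, LLONG_MAX */

      /* Pick the "end of directory" offset: stay at INT_MAX so 32-bit apps keep
       * working, but once real entries have passed INT_MAX, jump to LLONG_MAX so
       * a 64-bit readdir loop still terminates. */
      static long long choose_eof_offset(long long last_entry_offset)
      {
              if (last_entry_offset < INT_MAX)
                      return INT_MAX;    /* unchanged behaviour for small offsets */
              return LLONG_MAX;          /* avoid looping over post-INT_MAX entries */
      }
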
    • Btrfs: check to see if root_list is empty before adding it to dead roots · cfad392b
      Committed by Josef Bacik
      A user reported a panic when running with autodefrag and deleting snapshots.
      This is because we could end up trying to add the root to the dead roots list
      twice.  To fix this, check whether the root's list entry is empty (i.e. the
      root is not already queued) before adding it to the dead roots list.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
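
      The guard described here is roughly the pattern below; an illustrative
      fragment using the kernel list API, assuming dead_roots is protected by
      fs_info->trans_lock as in this era, with surrounding context omitted:

      /* Only queue the root once; root_list is non-empty if it is already queued. */
      spin_lock(&fs_info->trans_lock);
      if (list_empty(&root->root_list))
              list_add_tail(&root->root_list, &fs_info->dead_roots);
      spin_unlock(&fs_info->trans_lock);
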
    • Btrfs: release both paths before logging dir/changed extents · f3b15ccd
      Committed by Josef Bacik
      The ceph guys tripped over this bug where we were still holding onto the
      original path that we used to copy the inode with when logging.  This is based
      on Chris's fix which was reported to fix the problem.  We need to drop the paths
      in two cases anyway so just move the drop up so that we don't have duplicate
      code.  Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
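
      The "drop the paths earlier" idea boils down to the fragment below; the
      exact call sites in the inode-logging code differ, so treat this as a sketch:

      /* Release both paths up front, once, instead of duplicating the release
       * in each branch that follows; none of the later logging work needs the
       * leaf locks they hold. */
      btrfs_release_path(path);
      btrfs_release_path(dst_path);
      /* ... then log the changed extents and/or directory items ... */
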
    • Btrfs: allow splitting of hole em's when dropping extent cache · ee20a983
      Committed by Josef Bacik
      I noticed while running multi-threaded fsync tests that sometimes fsck would
      complain about an improper gap.  This happens because we fail to add a hole
      extent to the file, which was happening when we'd split a hole EM because
      btrfs_drop_extent_cache was just discarding the whole em instead of splitting
      it.  So this patch fixes this by allowing us to split a hole em properly, which
      means that added holes actually get logged properly and we no longer see this
      fsck error.  Thankfully we're tolerant of these sorts of problems, so a user would
      not see any adverse effects of this bug, other than fsck complaining.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: make sure the backref walker catches all refs to our extent · ed8c4913
      Committed by Josef Bacik
      Because we don't mess with the offset into the extent for compressed extents,
      we will properly find both extents for this case
      
      [extent a][extent b][rest of extent a]
      
      but because we already added a ref for the front half we won't add the inode
      information for the second half.  This causes us to leak that memory and not
      print out the other offset when we do logical-resolve.  So fix this by calling
      ulist_add_merge and then add our eie to the existing entry if there is one.
      With this patch we get both offsets out of logical-resolve.  With this and the
      other 2 patches I've sent we now pass btrfs/276 on my vm with compress-force=lzo
      set.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
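
      The ulist_add_merge() pattern described above looks roughly like this; the
      aux encoding and the chaining of the extent_inode_elem (eie) follow the
      3.11-era code as far as can be told, but the helper itself is illustrative:

      /* Merge one in-memory backref (eie) into the parents ulist keyed by the
       * leaf's logical start, chaining onto any entry that is already there. */
      static int add_parent_ref(struct ulist *parents, u64 leaf_start,
                                struct extent_inode_elem *eie)
      {
              u64 old = 0;
              int ret;

              ret = ulist_add_merge(parents, leaf_start, (uintptr_t)eie,
                                    &old, GFP_NOFS);
              if (ret < 0)
                      return ret;              /* allocation failure */
              if (!ret && old) {               /* the value was already present */
                      struct extent_inode_elem *prev;

                      prev = (struct extent_inode_elem *)(uintptr_t)old;
                      while (prev->next)
                              prev = prev->next;
                      prev->next = eie;        /* append instead of leaking eie */
              }
              return 0;
      }
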
    • Btrfs: fix backref walking when we hit a compressed extent · 8ca15e05
      Committed by Josef Bacik
      If you do btrfs inspect-internal logical-resolve on a compressed extent that has
      been partly overwritten it won't find anything.  This is because we try and
      match the extent offset we've searched for based on the extent offset in the
      data extent entry.  However this doesn't work for compressed extents because the
      offsets are for the uncompressed size, not the compressed size.  So instead only
      do this check if we are not compressed, that way we can get an actual entry for
      the physical offset rather than nothing for compressed.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: do not offset physical if we're compressed · b76bb701
      Committed by Josef Bacik
      xfstest btrfs/276 was freaking out on slower boxes partly because fiemap was
      offsetting the physical based on the extent offset.  This is perfectly fine with
      uncompressed extents, however the extent offset is into the uncompressed area,
      not the compressed.  So we can return a physical value that isn't at all within
      the area we have allocated on disk.  Fix this by returning the start of the
      extent if it is compressed no matter what the offset.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
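
      The fiemap change amounts to the arithmetic below; variable names are
      illustrative rather than the exact ones in the fiemap code:

      u64 disko;

      if (compressed)
              /* extent_offset counts uncompressed bytes, so adding it to the
               * on-disk start could point outside the compressed allocation;
               * report the start of the on-disk extent instead. */
              disko = em->block_start;
      else
              disko = em->block_start + extent_offset;
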
    • Btrfs: fix extent buffer leak after backref walking · b5b9b5b3
      Committed by Liu Bo
      Commit 47fb091f ("Btrfs: fix unlock after free on rewinded tree blocks")
      takes an extra increment on the reference of the allocated dummy extent buffer,
      so now we cannot free this dummy one and end up with an extent buffer leak.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
    • Btrfs: fix a bug of snapshot-aware defrag to make it work on partial extents · e68afa49
      Committed by Liu Bo
      For partial extents, snapshot-aware defrag does not work as expected,
      since
      a) we use the wrong logical offset to search for parents, which should be
         disk_bytenr + extent_offset, not just disk_bytenr,
      b) 'offset' returned by the backref walking just refers to key.offset, not
         the 'offset' stored in btrfs_extent_data_ref which is
         (key.offset - extent_offset).
      
      The reproducer:
      $ mkfs.btrfs sda
      $ mount sda /mnt
      $ btrfs sub create /mnt/sub
      $ for i in `seq 5 -1 1`; do dd if=/dev/zero of=/mnt/sub/foo bs=5k count=1 seek=$i conv=notrunc oflag=sync; done
      $ btrfs sub snap /mnt/sub /mnt/snap1
      $ btrfs sub snap /mnt/sub /mnt/snap2
      $ sync; btrfs filesystem defrag /mnt/sub/foo;
      $ umount /mnt
      $ btrfs-debug-tree sda  (Here we can check whether the defrag operation is snapshot-aware.)
      
      This addresses the above two problems.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
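
      The two corrections amount to the arithmetic below, written out with
      illustrative variable names:

      /* (a) search for parents at the real on-disk logical address,
       *     not just at the start of the data extent */
      u64 logical = disk_bytenr + extent_offset;

      /* (b) the 'offset' handed back by backref walking is key.offset (the file
       *     offset of the reference); the offset stored in btrfs_extent_data_ref
       *     is key.offset minus the offset into the extent */
      u64 data_ref_offset = key_offset - extent_offset;
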
    • btrfs: fix file truncation if FALLOC_FL_KEEP_SIZE is specified · 7cddc193
      Committed by Jie Liu
      Create a small file and fallocate it to a big size with the
      FALLOC_FL_KEEP_SIZE option, then truncate it back to the
      small size again; the disk free space is not given back
      in this case, i.e.,
      
      total 4
      -rw-r--r-- 1 root root 512 Jun 28 11:35 test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      
      -rw-r--r-- 1 root root 512 Jun 28 11:35 /mnt/test
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G  5.1G  2.2G  70% /mnt
      
      With this fix, the space released by the truncate is given back:
      Filesystem      Size  Used Avail Use% Mounted on
      ....
      /dev/sdb1       8.0G   56K  7.2G   1% /mnt
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
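
      A minimal userspace reproducer of the scenario above (preallocate far past
      EOF with FALLOC_FL_KEEP_SIZE, then truncate back down); the sizes, path and
      lack of error handling are arbitrary:

      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <unistd.h>
      #include <linux/falloc.h>

      int main(void)
      {
              int fd = open("/mnt/test", O_CREAT | O_RDWR, 0644);

              if (fd < 0)
                      return 1;
              write(fd, "small", 5);                       /* tiny file */
              /* preallocate 5G beyond EOF without changing i_size */
              fallocate(fd, FALLOC_FL_KEEP_SIZE, 0, 5ULL << 30);
              /* truncate back to the small size; before the fix the 5G of
               * preallocated space was not returned to the free pool */
              ftruncate(fd, 5);
              close(fd);
              return 0;
      }
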
  3. 20 Jul 2013, 4 commits
    • Btrfs: fix wrong write offset when replacing a device · 115930cb
      Committed by Stefan Behrens
      Miao Xie reported the following issue:
      
      The filesystem was corrupted after we did a device replace.
      
      Steps to reproduce:
       # mkfs.btrfs -f -m single -d raid10 <device0>..<device3>
       # mount <device0> <mnt>
       # btrfs replace start -rfB 1 <device4> <mnt>
       # umount <mnt>
       # btrfsck <device4>
      
      The reason for the issue is that we changed the write offset by mistake,
      introduced by commit 625f1c8d.
      
      We read the data from the source device at first, and then write the
      data into the corresponding place of the new device. In order to
      implement the "-r" option, the source location is remapped using
      btrfs_map_block(). The read takes place on the mapped location, and
      the write needs to take place on the unmapped location. Currently
      the write is using the mapped location, and this commit changes it
      back by undoing the change to the write address that the aforementioned
      commit added by mistake.
      Reported-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: <stable@vger.kernel.org> # 3.10+
      Signed-off-by: Stefan Behrens <sbehrens@giantdisaster.de>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: re-add root to dead root list if we stop dropping it · d29a9f62
      Committed by Josef Bacik
      If we stop dropping a root for whatever reason we need to add it back to the
      dead root list so that we will re-start the dropping next transaction commit.
      The other case where this happens is when we recover a drop: we will add a root
      without adding it to the fs radix tree, so we can leak its root and commit root
      extent buffers.  Adding the root to the dead root list makes this cleanup happen.
      Thanks,
      
      Cc: stable@vger.kernel.org
      Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix lock leak when resuming snapshot deletion · fec386ac
      Committed by Josef Bacik
      We aren't setting path->locks[level] when we resume a snapshot deletion which
      means we won't unlock the buffer when we free the path.  This causes deadlocks
      if we happen to re-allocate the block before we've evicted the extent buffer
      from cache.  Thanks,
      
      Cc: stable@vger.kernel.org
      Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: update drop progress before stopping snapshot dropping · 3c8f2422
      Committed by Josef Bacik
      Alex pointed out a problem, and a fix, in the drop-one-snapshot-at-a-time
      patch.  If we decide we need to exit for whatever reason (umount for
      example) we will just exit the snapshot dropping without updating the drop
      progress.  So the next time we go to resume we will BUG_ON() because we can't
      find the extent we left off at because we never updated it.  This patch fixes
      the problem.
      
      Cc: stable@vger.kernel.org
      Reported-by: Alex Lyakas <alex.btrfs@zadarastorage.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
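
      The shape of the fix, combined with the companion change d29a9f62 above, is
      to record where the walk stopped and requeue the root before bailing out.
      Field and helper names follow the 3.11-era code as far as can be told, but
      this fragment is illustrative rather than the patch itself:

      /* Stopping early (error or the fs is closing): remember our position in
       * the root item so a resumed drop can find it, and requeue the root so
       * the drop is restarted at the next transaction commit. */
      if (err || btrfs_fs_closing(fs_info)) {
              btrfs_node_key(path->nodes[level], &root_item->drop_progress,
                             path->slots[level]);
              root_item->drop_level = level;
              btrfs_add_dead_root(root);
      }
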
  4. 03 Jul 2013, 1 commit
    • vfs: export lseek_execute() to modules · 46a1c2c7
      Committed by Jie Liu
      For those file systems (btrfs/ext4/ocfs2/tmpfs) that support the
      SEEK_DATA/SEEK_HOLE functions, we end up handling the same
      matter in lseek_execute(): updating the current file offset
      to the desired offset if it is valid.  ceph also does
      similar things in ceph_llseek().
      
      To reduce the duplication, this patch makes lseek_execute()
      publicly accessible so that we can call it directly from the
      underlying file systems.
      
      Thanks Dave Chinner for this suggestion.
      
      [AV: call it vfs_setpos(), don't bring the removed 'inode' argument back]
      
      v2->v1:
      - Add kernel-doc comments for lseek_execute()
      - Call lseek_execute() in ceph->llseek()
      Signed-off-by: Jie Liu <jeff.liu@oracle.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chris Mason <chris.mason@fusionio.com>
      Cc: Josef Bacik <jbacik@fusionio.com>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Ted Tso <tytso@mit.edu>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Sage Weil <sage@inktank.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
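
      After this change a file system's llseek for SEEK_DATA/SEEK_HOLE can finish
      with the shared helper instead of open-coding the position update.  A
      simplified sketch; find_data_or_hole() is a placeholder for the fs-specific
      search, and the real btrfs/ceph callers differ in detail:

      /* vfs_setpos() validates 'offset' against 'maxsize', stores it as the new
       * file position if it is valid (clearing f_version when the position
       * changes), and returns the offset or -EINVAL. */
      static loff_t example_llseek(struct file *file, loff_t offset, int whence)
      {
              struct inode *inode = file_inode(file);

              if (whence == SEEK_DATA || whence == SEEK_HOLE)
                      offset = find_data_or_hole(inode, offset, whence);
              if (offset < 0)
                      return offset;
              return vfs_setpos(file, offset, inode->i_sb->s_maxbytes);
      }
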
  5. 02 Jul 2013, 15 commits
    • Btrfs: wait ordered range before doing direct io · 0e267c44
      Committed by Josef Bacik
      My recent truncate patch uncovered this bug, but I can reproduce it without the
      truncate patch.  If you mount with -o compress-force, do a direct write to some
      area, do a buffered write to some other area, and then do a direct read you will
      get the wrong data for where you did the buffered write.  This is because the
      generic direct io helpers only call filemap_write_and_wait once, and for
      compression we need it twice.  So to be safe add the btrfs_wait_ordered_range to
      the start of the direct io function to make sure any compressed writes have
      truly been written.  This patch makes xfstests 130 pass when you mount with -o
      compress-force=lzo.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
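
      The essence of the fix is a single wait at the top of the direct IO path; a
      fragment assuming the 3.11-era btrfs_wait_ordered_range() signature:

      /* With compression a single filemap_write_and_wait() is not enough: the
       * ordered extent (and the real data) may still be outstanding.  Wait for
       * the whole range up front before doing direct IO against it. */
      btrfs_wait_ordered_range(inode, offset, count);
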
    • Btrfs: only do the tree_mod_log_free_eb if this is our last ref · 7fb7d76f
      Committed by Josef Bacik
      There is another bug in the tree mod log stuff in that we're calling
      tree_mod_log_free_eb every single time a block is cow'ed.  The problem with this
      is that if this block is shared by multiple snapshots we will call this multiple
      times per block, so if we go to rewind the mod log for this block we'll BUG_ON()
      in __tree_mod_log_rewind because we try to rewind a free twice.  We only want to
      call tree_mod_log_free_eb if we are actually freeing the block.  With this patch
      I no longer hit the panic in __tree_mod_log_rewind.  Thanks,
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Jan Schmidt <list.btrfs@jan-o-sch.net>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: hold the tree mod lock in __tree_mod_log_rewind · f1ca7e98
      Committed by Josef Bacik
      We need to hold the tree mod log lock in __tree_mod_log_rewind since we walk
      forward in the tree mod entries, otherwise we'll end up with random entries and
      trip the BUG_ON() at the front of __tree_mod_log_rewind.  This fixes the panics
      people were seeing when running
      
      find /whatever -type f -exec btrfs fi defrag {} \;
      
      Thanks,
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
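
      A sketch of the locking shape described above, assuming the tree mod log is
      protected by the fs_info->tree_mod_log_lock rwlock as in this era;
      rewind_entries() is a placeholder for the real replay loop:

      read_lock(&fs_info->tree_mod_log_lock);
      /* Replay the buffer's mod-log entries while holding the lock; without it,
       * concurrent insertions hand us random entries and the BUG_ON() at the top
       * of __tree_mod_log_rewind() fires. */
      rewind_entries(eb_rewin, tm, time_seq);
      read_unlock(&fs_info->tree_mod_log_lock);
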
    • Btrfs: make backref walking code handle skinny metadata · 261c84b6
      Committed by Josef Bacik
      I missed fixing the backref stuff when I introduced the skinny metadata.  If you
      try and do things like snapshot aware defrag with skinny metadata you are going
      to see tons of warnings related to the backref count being less than 0.  This is
      because the delayed refs will be found for stuff just fine, but it won't find
      the skinny metadata extent refs.  With this patch I'm not seeing warnings
      anymore.  Thanks,
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix crash regarding to ulist_add_merge · 35f0399d
      Committed by Liu Bo
      Several users reported this crash (a NULL pointer dereference or general
      protection fault).  The story is that we added an rbtree to speed up ulist
      iteration, and we use krealloc() to handle ulist growth.  krealloc() uses
      memcpy to copy the old data to the new memory area, which is fine for an
      array, as it doesn't contain pointers, but not for an rbtree, whose nodes
      point at each other.
      
      So krealloc() will mess up our rbtree and it ends up with a crash.
      Reviewed-by: Wang Shilong <wangsl-fnst@cn.fujitsu.com>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
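
      A tiny userspace illustration of why realloc-style growth breaks a
      pointer-linked structure: nodes that live inside the reallocated buffer keep
      pointing at the old addresses, exactly like rb_node parent/child pointers
      after krealloc().

      #include <stdio.h>
      #include <stdlib.h>

      struct node {                  /* stand-in for a struct embedding rb_node-style links */
              struct node *next;
              int val;
      };

      int main(void)
      {
              struct node *arr = malloc(2 * sizeof(*arr));

              if (!arr)
                      return 1;
              arr[0] = (struct node){ .next = &arr[1], .val = 1 };
              arr[1] = (struct node){ .next = NULL,    .val = 2 };

              /* realloc may move the whole buffer; the data is copied but
               * arr[0].next still holds the old address of arr[1] */
              struct node *grown = realloc(arr, 1000 * sizeof(*grown));

              if (grown && grown[0].next != &grown[1])
                      printf("links went stale after realloc\n");
              free(grown);
              return 0;
      }
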
    • Btrfs: fix several potential problems in copy_nocow_pages_for_inode · edd1400b
      Committed by Miao Xie
      - It makes no sense to deal with an inode in the dead tree.
      - Fix the race between dio and page copy by waiting for dio completion.
      - Avoid the race between page copy and truncate/punch hole.
      - Check whether the page is in the page cache or not.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: cleanup the code of copy_nocow_pages_for_inode() · 826aa0a8
      Committed by Miao Xie
      - It makes no sense to continue doing something after an error has
        happened; just return with this patch.
      - Remove some checks in copy_nocow_pages_for_inode(), such as the page check
        after the write and the inode check at the end of the function, because we
        are sure they exist.
      - Remove the unnecessary goto in the return value check of the write.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix oops when recovering the file data by scrub function · 26b25891
      Committed by Miao Xie
      We get an oops while running the btrfs replace start test:
      ------------[ cut here ]------------
      kernel BUG at mm/filemap.c:608!
      [SNIP]
      Call Trace:
        [<ffffffffa04b36c7>] copy_nocow_pages_for_inode+0x217/0x3f0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04bb8ce>] iterate_extent_inodes+0x1ae/0x300 [btrfs]
        [<ffffffffa04bbab2>] iterate_inodes_from_logical+0x92/0xb0 [btrfs]
        [<ffffffffa04b34b0>] ? scrub_print_warning_inode+0x230/0x230 [btrfs]
        [<ffffffffa04b3b07>] copy_nocow_pages_worker+0x97/0x150 [btrfs]
        [<ffffffffa048eed4>] worker_loop+0x134/0x540 [btrfs]
        [<ffffffff816274ea>] ? __schedule+0x3ca/0x7f0
        [<ffffffffa048eda0>] ? btrfs_queue_worker+0x300/0x300 [btrfs]
        [<ffffffff8106f2f0>] kthread+0xc0/0xd0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
        [<ffffffff8163181c>] ret_from_fork+0x7c/0xb0
        [<ffffffff8106f230>] ? flush_kthread_worker+0x80/0x80
      [SNIP]
       RIP  [<ffffffff8111f4c5>] unlock_page+0x35/0x40
        RSP <ffff88010316bb98>
       ---[ end trace 421e79ad0dd72c7d ]---
      
      It is because we forgot to lock the page again after we read data into
      the page.  Fix it.
      Signed-off-by: Lin Feng <linfeng@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
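
      A sketch of the fix as described: re-take the page lock after the read so
      the later unlock_page() has a held lock to release.  Error handling and the
      surrounding copy loop are omitted:

      /* the page is returned locked by find_or_create_page() */
      if (!PageUptodate(page)) {
              btrfs_readpage(NULL, page);  /* takes over the lock, unlocks at IO end */
              lock_page(page);             /* the missing re-lock: waits for the read */
              if (!PageUptodate(page))
                      err = -EIO;
      }
      /* ... copy the recovered data to the target device ... */
      unlock_page(page);                   /* now balanced with a held lock */
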
    • Btrfs: make the chunk allocator completely tree lockless · 6df9a95e
      Committed by Josef Bacik
      When adjusting the enospc rules for relocation I ran into a deadlock because we
      were relocating the only system chunk and that forced us to try and allocate a
      new system chunk while holding locks in the chunk tree, which caused us to
      deadlock.  To fix this I've moved all of the dev extent addition and chunk
      addition out to the delayed chunk completion stuff.  We still keep the in-memory
      stuff which makes sure everything is consistent.
      
      One change I had to make was to search the commit root of the device tree to
      find a free dev extent, and hold onto any chunk em's that we allocated in that
      transaction so we do not allocate the same dev extent twice.  This has the side
      effect of fixing a bug with balance that has been there ever since balance
      existed.  Basically you can free a block group and its dev extent and then
      immediately allocate that dev extent for a new block group and write stuff to
      that dev extent, all within the same transaction.  So if you happen to crash
      during a balance you could come back to a completely broken file system.  This
      patch should keep these sort of things from happening in the future since we
      won't be able to allocate free'd dev extents until after the transaction
      commits.  This has passed all of the xfstests and my super annoying stress test
      followed by a balance.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: cleanup orphaned root orphan item · 68a7342c
      Committed by Josef Bacik
      I hit a weird problem where my root item had been deleted but the orphan item had
      not.  This isn't necessarily a problem, but it keeps the file system from being
      mounted.  To fix this we just need to axe the orphan item if we can't find the
      fs root when we're putting them all together.  With this patch I was able to
      successfully mount my file system.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: fix wrong mirror number tuning · a70c6172
      Committed by Miao Xie
      Now that reading the data from the target device of the replace operation is
      allowed, a mirror number greater than the number of stripes of a chunk is valid;
      we will tune it later, when we find there is no target device.  Fix it.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • e6da5d2e
    • Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Committed by Miao Xie
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary, because the extents that btrfs_sector_sum points to are
      continuous, we can find out the expected checksums by btrfs_ordered_sum's
      bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
      removing bytenr, there is only one member in the structure, so it makes
      no sense to keep the structure, just remove it, and use a u32 array to
      store the checksum value.
      
      With this change, we don't use a while loop to get the checksums one by
      one.  Now we can get several checksum values at a time, which improved the
      performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
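
      With btrfs_sector_sum gone, a checksum is found by plain index arithmetic on
      btrfs_ordered_sum; a sketch with simplified types (the real code also batches
      checksum insertion and lookup):

      /* ordered->bytenr is the logical address the checksum array starts at and
       * sums[] holds one u32 crc per sector, so the checksum for 'bytenr' is just
       * an index away instead of a walk over btrfs_sector_sum entries. */
      static u32 csum_for_bytenr(const struct btrfs_ordered_sum *ordered,
                                 u64 bytenr, u32 sectorsize)
      {
              unsigned long index = (bytenr - ordered->bytenr) / sectorsize;

              return ordered->sums[index];
      }
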
    • Btrfs: check if we can nocow if we don't have data space · 7ee9e440
      Committed by Josef Bacik
      We always just try and reserve data space when we write, but if we are out of
      space but have prealloc'ed extents we should still successfully write.  This
      patch will try and see if we can write to prealloc'ed space and, if we can, go
      ahead and allow the write to continue.  With this patch we now pass xfstests
      generic/274.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
    • Btrfs: stop using try_to_writeback_inodes_sb_nr to flush delalloc · 925a6efb
      Committed by Josef Bacik
      try_to_writeback_inodes_sb_nr returns 1 if writeback is already underway, which
      is completely fraking useless for us as we need to make sure pages are actually
      written before we go and check if there are ordered extents.  So replace this
      with an open coding of try_to_writeback_inodes_sb_nr minus the writeback
      underway check so that we are sure to actually have flushed some dirty pages out
      and will have ordered extents to use.  With this patch xfstests generic/273 now
      passes.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>