1. 26 Apr, 2017 1 commit
    • Btrfs: fix reported number of inode blocks · a7e3b975
      Committed by Filipe Manana
      Currently when there are buffered writes that were not yet flushed and
      they fall within allocated ranges of the file (that is, not in holes or
      beyond eof assuming there are no prealloc extents beyond eof), btrfs
      simply reports an incorrect number of used blocks through the stat(2)
      system call (or any of its variants), regardless of mount options or
      inode flags (compress, compress-force, nodatacow). This is because the
      number of blocks used that is reported is based on the current number
      of bytes in the vfs inode plus the number of delalloc bytes in the btrfs
      inode. The latter covers bytes that fall both within allocated regions
      of the file and within holes.
      
      Example scenarios where the number of reported blocks is wrong while the
      buffered writes are not flushed:
      
        $ mkfs.btrfs -f /dev/sdc
        $ mount /dev/sdc /mnt/sdc
      
        $ xfs_io -f -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (259.336 MiB/sec and 66390.0415 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 0 64K" /mnt/sdc/foo1
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (192.308 MiB/sec and 49230.7692 ops/sec)
      
        # The following should have reported 64K...
        $ du -h /mnt/sdc/foo1
        128K	/mnt/sdc/foo1
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo1
        64K	/mnt/sdc/foo1
      
        $ xfs_io -f -c "falloc -k 0 128K" -c "pwrite -S 0xaa 0 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 0
        64 KiB, 16 ops; 0.0000 sec (520.833 MiB/sec and 133333.3333 ops/sec)
      
        $ sync
      
        $ xfs_io -c "pwrite -S 0xbb 64K 64K" /mnt/sdc/foo2
        wrote 65536/65536 bytes at offset 65536
        64 KiB, 16 ops; 0.0000 sec (260.417 MiB/sec and 66666.6667 ops/sec)
      
        # The following should have reported 128K...
        $ du -h /mnt/sdc/foo2
        192K	/mnt/sdc/foo2
      
        $ sync
      
        # After flushing the buffered write, it now reports the correct value.
        $ du -h /mnt/sdc/foo2
        128K	/mnt/sdc/foo2
      
      So the number of used file blocks is simply incorrect, unlike in other
      filesystems such as ext4 and xfs for example, but only while the buffered
      writes are not flushed.
      
      Fix this by tracking the number of delalloc bytes that fall within holes
      and beyond eof of a file, and use this new counter instead when reporting
      the number of used blocks for an inode.
      
      A separate problem is that the delalloc bytes counter is reset when
      writeback starts (by clearing the EXTENT_DELALLOC flag from the
      respective range in the inode's iotree), while the vfs inode's bytes
      counter is only incremented when writeback finishes (through
      insert_reserved_file_extent()). Therefore, while writeback is ongoing we
      report a wrong number of blocks used by an inode if the write operation
      covers a previously unallocated range. While this change does not fix
      this problem, it does minimize it a lot by shortening that time window,
      as the new delalloc bytes counter (new_delalloc_bytes) is only
      decremented when writeback finishes, right before updating the vfs
      inode's bytes counter. Fully fixing this second problem is not trivial
      and will be addressed later by a different patch.
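
      The accounting change can be illustrated with a small model. This is
      only a sketch, not the kernel code: the Inode class, the field names
      other than new_delalloc_bytes, and the 512-byte unit used for the block
      count are assumptions made for illustration, and block-size rounding is
      left out.

```python
from dataclasses import dataclass

SECTOR = 512  # assumed unit of the reported block count
K = 1024


@dataclass
class Inode:
    inode_bytes: int = 0         # bytes already accounted in the vfs inode
    delalloc_bytes: int = 0      # all unflushed buffered bytes
    new_delalloc_bytes: int = 0  # unflushed bytes in holes or beyond eof only


def blocks_before_fix(ino):
    # Old behaviour: every delalloc byte is added, even when it only
    # overwrites a range that is already allocated.
    return (ino.inode_bytes + ino.delalloc_bytes) // SECTOR


def blocks_after_fix(ino):
    # New behaviour: only delalloc bytes that will need new space are added.
    return (ino.inode_bytes + ino.new_delalloc_bytes) // SECTOR


# foo1: 64K flushed, then the same 64K rewritten and not yet flushed.
foo1 = Inode(inode_bytes=64 * K, delalloc_bytes=64 * K, new_delalloc_bytes=0)
# foo2: 128K preallocated, then 64K written into the prealloc range.
foo2 = Inode(inode_bytes=128 * K, delalloc_bytes=64 * K, new_delalloc_bytes=0)

for name, ino in (("foo1", foo1), ("foo2", foo2)):
    before = blocks_before_fix(ino) * SECTOR // K
    after = blocks_after_fix(ino) * SECTOR // K
    print(f"{name}: {before}K before the fix, {after}K after")
```

      With these inputs the model reproduces the 128K/64K and 192K/128K
      figures from the two scenarios above.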
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
  2. 28 Feb, 2017 5 commits
  3. 17 Feb, 2017 1 commit
  4. 14 Feb, 2017 2 commits
  5. 26 Sep, 2016 1 commit
    • Btrfs: add a flags field to btrfs_fs_info · afcdd129
      Committed by Josef Bacik
      We have a lot of random ints in btrfs_fs_info that can be put into flags.
      This is mostly equivalent, with the exception of how we deal with quota
      being turned on or off: instead of keeping a pending state that the current
      quota_enabled gets set to, we now set a flag when we are turning quota on
      or off and deal with that appropriately.  Thanks,
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  6. 26 May, 2016 1 commit
  7. 13 May, 2016 1 commit
    • Btrfs: add semaphore to synchronize direct IO writes with fsync · 5f9a8a51
      Committed by Filipe Manana
      Due to the optimization of lockless direct IO writes (the inode's i_mutex
      is not held) introduced in commit 38851cc1 ("Btrfs: implement unlocked
      dio write"), we started having races between such writes with concurrent
      fsync operations that use the fast fsync path. These races were addressed
      in the patches titled "Btrfs: fix race between fsync and lockless direct
      IO writes" and "Btrfs: fix race between fsync and direct IO writes for
      prealloc extents". The races happened because the direct IO path, like
      every other write path, created extent maps followed by the
      corresponding ordered extents, while the fast fsync path first collected
      ordered extents and then collected extent maps. This made it possible
      to log file extent items (based on the collected extent maps) without
      waiting for the corresponding ordered extents to complete (get their IO
      done). The two fixes mentioned before added a solution that consists of
      making the direct IO path create the ordered extents first and then the
      extent maps, while the fsync path attempts to collect any new ordered
      extents once it collects the extent maps. This was simple and did not
      require adding any synchronization primitive to any data structure (struct
      btrfs_inode for example), but it makes things more fragile for future
      development and adds an exceptional approach compared to the other
      write paths.
      
      This change adds a read-write semaphore to the btrfs inode structure and
      makes the direct IO path create the extent maps and the ordered extents
      while holding read access on that semaphore, while the fast fsync path
      collects extent maps and ordered extents while holding write access on
      that semaphore. The logic for direct IO write path is encapsulated in a
      new helper function that is used both for cow and nocow direct IO writes.
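
      The locking scheme can be sketched in user space as follows. This is an
      illustration only: the RWSemaphore class is a hypothetical minimal
      reader-writer lock built on threading.Condition, and the two lists stand
      in for the inode's extent maps and ordered extents. The direct IO writers
      take the lock for reading, the fast fsync path takes it for writing.

```python
import threading


class RWSemaphore:
    """Minimal reader-writer lock: many readers or exactly one writer."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def down_read(self):
        with self._cond:
            while self._writer:
                self._cond.wait()
            self._readers += 1

    def up_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def down_write(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()
            self._writer = True

    def up_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()


sem = RWSemaphore()
extent_maps = []
ordered_extents = []
snapshots = []


def dio_write(start, count):
    for i in range(start, start + count):
        sem.down_read()  # direct IO writers may run concurrently
        try:
            extent_maps.append(i)      # create the extent map ...
            ordered_extents.append(i)  # ... then the ordered extent
        finally:
            sem.up_read()


def fast_fsync(times):
    for _ in range(times):
        sem.down_write()  # excludes every direct IO writer
        try:
            snapshots.append((len(extent_maps), len(ordered_extents)))
        finally:
            sem.up_write()


threads = [threading.Thread(target=dio_write, args=(n * 1000, 1000))
           for n in range(3)]
threads.append(threading.Thread(target=fast_fsync, args=(200,)))
for t in threads:
    t.start()
for t in threads:
    t.join()

consistent = all(ems == oes for ems, oes in snapshots)
print("every fsync snapshot saw matching extent maps and ordered extents:",
      consistent)
```

      Because a writer can never be between its two appends while fsync holds
      the lock for writing, the order in which each side touches the two
      collections no longer matters.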
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
  8. 07 Jan, 2016 1 commit
    • btrfs: put delayed item hook into inode · 8089fe62
      Committed by David Sterba
      Inodes for delayed iput allocate a trivial helper structure. Let's place
      the list hook directly into the inode and save a kmalloc (killing a
      __GFP_NOFAIL as a bonus), at the cost of increasing the size of btrfs_inode.
      
      The inode can be put into the delayed_iputs list more than once and we
      have to keep the count. This means we can't use list_splice to
      process a bunch of inodes, because we'd lose track of the count if the
      inode is put into the delayed iputs again while it's being processed.
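
      A rough model of the counting scheme follows. It is a sketch under
      assumptions, not the kernel implementation: the Inode class, the
      on_delayed_list and delayed_iput_count fields, and the two helper
      functions are hypothetical names, and a deque stands in for the list.

```python
from collections import deque


class Inode:
    def __init__(self, name, refs):
        self.name = name
        self.refs = refs
        self.on_delayed_list = False  # stands in for a non-empty list hook
        self.delayed_iput_count = 0   # extra puts queued while already listed


def add_delayed_iput(delayed, inode):
    if inode.on_delayed_list:
        # Already queued: remember one more pending put instead of
        # allocating a second list entry.
        inode.delayed_iput_count += 1
    else:
        inode.on_delayed_list = True
        delayed.append(inode)


def run_delayed_iputs(delayed, iput):
    # Entries are handled one at a time rather than spliced off in bulk,
    # so a put queued while the list is being processed is not lost.
    while delayed:
        inode = delayed.popleft()
        if inode.delayed_iput_count:
            inode.delayed_iput_count -= 1
            delayed.append(inode)  # still owes at least one more put
        else:
            inode.on_delayed_list = False
        iput(inode)


def drop_ref(inode):
    inode.refs -= 1


queue = deque()
ino = Inode("a", refs=3)
for _ in range(3):
    add_delayed_iput(queue, ino)
run_delayed_iputs(queue, drop_ref)
print(ino.refs, ino.delayed_iput_count, len(queue))
```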
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 22 Sep, 2015 1 commit
    • Btrfs: Direct I/O: Fix space accounting · 50745b0a
      Committed by chandan
      The following call trace is seen when the generic/095 test is executed:
      
      WARNING: CPU: 3 PID: 2769 at /home/chandan/code/repos/linux/fs/btrfs/inode.c:8967 btrfs_destroy_inode+0x284/0x2a0()
      Modules linked in:
      CPU: 3 PID: 2769 Comm: umount Not tainted 4.2.0-rc5+ #31
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20150306_163512-brownie 04/01/2014
       ffffffff81c08150 ffff8802ec9cbce8 ffffffff81984058 ffff8802ffd8feb0
       0000000000000000 ffff8802ec9cbd28 ffffffff81050385 ffff8802ec9cbd38
       ffff8802d12f8588 ffff8802d12f8588 ffff8802f15ab000 ffff8800bb96c0b0
      Call Trace:
       [<ffffffff81984058>] dump_stack+0x45/0x57
       [<ffffffff81050385>] warn_slowpath_common+0x85/0xc0
       [<ffffffff81050465>] warn_slowpath_null+0x15/0x20
       [<ffffffff81340294>] btrfs_destroy_inode+0x284/0x2a0
       [<ffffffff8117ce07>] destroy_inode+0x37/0x60
       [<ffffffff8117cf39>] evict+0x109/0x170
       [<ffffffff8117cfd5>] dispose_list+0x35/0x50
       [<ffffffff8117dd3a>] evict_inodes+0xaa/0x100
       [<ffffffff81165667>] generic_shutdown_super+0x47/0xf0
       [<ffffffff81165951>] kill_anon_super+0x11/0x20
       [<ffffffff81302093>] btrfs_kill_super+0x13/0x110
       [<ffffffff81165c99>] deactivate_locked_super+0x39/0x70
       [<ffffffff811660cf>] deactivate_super+0x5f/0x70
       [<ffffffff81180e1e>] cleanup_mnt+0x3e/0x90
       [<ffffffff81180ebd>] __cleanup_mnt+0xd/0x10
       [<ffffffff81069c06>] task_work_run+0x96/0xb0
       [<ffffffff81003a3d>] do_notify_resume+0x3d/0x50
       [<ffffffff8198cbc2>] int_signal+0x12/0x17
      
      This means that the inode had non-zero "outstanding extents" during
      eviction. This occurs because, during direct I/O, a task that successfully
      used up its reserved data space sets the BTRFS_INODE_DIO_READY bit and does
      not clear the bit after finishing the DIO write. A later DIO write could
      then fail, and its unused reserved space would not be freed because of the
      previously set BTRFS_INODE_DIO_READY bit.
      
      Clearing the BTRFS_INODE_DIO_READY bit in btrfs_direct_IO() caused the
      following issue:
      |-----------------------------------+-------------------------------------|
      | Task A                            | Task B                              |
      |-----------------------------------+-------------------------------------|
      | Start direct i/o write on inode X.|                                     |
      | reserve space                     |                                     |
      | Allocate ordered extent           |                                     |
      | release reserved space            |                                     |
      | Set BTRFS_INODE_DIO_READY bit.    |                                     |
      |                                   | splice()                            |
      |                                   | Transfer data from pipe buffer to   |
      |                                   | destination file.                   |
      |                                   | - kmap(pipe buffer page)            |
      |                                   | - Start direct i/o write on         |
      |                                   |   inode X.                          |
      |                                   |   - reserve space                   |
      |                                   |   - dio_refill_pages()              |
      |                                   |     - sdio->blocks_available == 0   |
      |                                   |     - Since a kernel address is     |
      |                                   |       being passed instead of a     |
      |                                   |       user space address,           |
      |                                   |       iov_iter_get_pages() returns  |
      |                                   |       -EFAULT.                      |
      |                                   |   - Since BTRFS_INODE_DIO_READY is  |
      |                                   |     set, we don't release reserved  |
      |                                   |     space.                          |
      |                                   |   - Clear BTRFS_INODE_DIO_READY bit.|
      | -EIOCBQUEUED is returned.         |                                     |
      |-----------------------------------+-------------------------------------|
      
      Hence this commit introduces "struct btrfs_dio_data" to track the usage of
      reserved data space. The remaining unused "reserve space" can now be freed
      reliably.
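
      The idea of tracking the reservation per write, instead of through a bit
      shared by all writers of the inode, can be sketched as follows. This is
      a toy model: SpaceInfo, DioData, direct_write() and the 4096-byte chunk
      size are hypothetical and only loosely follow the description above.

```python
class SpaceInfo:
    """Toy data space accounting (hypothetical, not the kernel structure)."""

    def __init__(self):
        self.reserved = 0

    def reserve(self, nbytes):
        self.reserved += nbytes

    def release(self, nbytes):
        self.reserved -= nbytes


class DioData:
    """Per-write state, in the spirit of struct btrfs_dio_data."""

    def __init__(self, reserve):
        self.reserve = reserve  # bytes reserved by this write, not yet used


def direct_write(space, length, fail_after=None):
    """Write `length` bytes; optionally fail once `fail_after` bytes are mapped."""
    space.reserve(length)
    dio = DioData(reserve=length)
    mapped = 0
    chunk = 4096
    while mapped < length:
        if fail_after is not None and mapped >= fail_after:
            break  # e.g. the -EFAULT case shown in the table above
        step = min(chunk, length - mapped)
        # Allocating an ordered extent consumes part of the reservation.
        space.release(step)
        dio.reserve -= step
        mapped += step
    if dio.reserve:
        # Whatever this particular write did not use is handed back,
        # independent of what any other task did on the same inode.
        space.release(dio.reserve)
        dio.reserve = 0
    return mapped


space = SpaceInfo()
direct_write(space, 64 * 1024)                # task A: succeeds
direct_write(space, 64 * 1024, fail_after=0)  # task B: fails immediately
print("leaked reservation:", space.reserved)
```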
      Signed-off-by: Chandan Rajendra <chandan@linux.vnet.ibm.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  10. 02 Jul, 2015 1 commit
  11. 27 Mar, 2015 1 commit
    • Btrfs: fix metadata inconsistencies after directory fsync · 2f2ff0ee
      Committed by Filipe Manana
      We can get into an inconsistency between inodes and directory entries
      after fsyncing a directory. The issue is that while a directory gets
      its new dentries persisted in the fsync log and replayed at mount time,
      the link count of the inode that the directory entries point to doesn't
      get updated, staying with an incorrect link count (smaller than the
      correct value). This later leads to stale file handle errors when
      accessing (including attempting to delete) some of the links if all the
      other ones are removed, which also makes it impossible to delete the
      parent directories, since the dentries cannot be removed.
      
      Another issue is that (unlike ext3/4, xfs, f2fs, reiserfs, nilfs2),
      when fsyncing a directory, new files aren't logged (their metadata and
      dentries) nor any child directories. So this patch fixes this issue too,
      since it has the same resolution as the incorrect inode link count issue
      mentioned before.
      
      This is very easy to reproduce, and the following excerpt from my test
      case for xfstests shows how:
      
        _scratch_mkfs >> $seqres.full 2>&1
        _init_flakey
        _mount_flakey
      
        # Create our main test file and directory.
        $XFS_IO_PROG -f -c "pwrite -S 0xaa 0 8K" $SCRATCH_MNT/foo | _filter_xfs_io
        mkdir $SCRATCH_MNT/mydir
      
        # Make sure all metadata and data are durably persisted.
        sync
      
        # Add a hard link to 'foo' inside our test directory and fsync only the
        # directory. The btrfs fsync implementation had a bug that caused the new
        # directory entry to be visible after the fsync log replay but the inode
        # of our file remained with a link count of 1.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_2
      
        # Add a few more links and new files.
        # This is just to verify nothing breaks or gives incorrect results after the
        # fsync log is replayed.
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/foo_3
        $XFS_IO_PROG -f -c "pwrite -S 0xff 0 64K" $SCRATCH_MNT/hello | _filter_xfs_io
        ln $SCRATCH_MNT/hello $SCRATCH_MNT/mydir/hello_2
      
        # Add some subdirectories and new files and links to them. This is to verify
        # that after fsyncing our top level directory 'mydir', all the subdirectories
        # and their files/links are registered in the fsync log and exist after the
        # fsync log is replayed.
        mkdir -p $SCRATCH_MNT/mydir/x/y/z
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/foo_y_link
        ln $SCRATCH_MNT/foo $SCRATCH_MNT/mydir/x/y/z/foo_z_link
        touch $SCRATCH_MNT/mydir/x/y/z/qwerty
      
        # Now fsync only our top directory.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/mydir
      
        # And fsync now our new file named 'hello', just to verify later that it has
        # the expected content and that the previous fsync on the directory 'mydir' had
        # no bad influence on this fsync.
        $XFS_IO_PROG -c "fsync" $SCRATCH_MNT/hello
      
        # Simulate a crash/power loss.
        _load_flakey_table $FLAKEY_DROP_WRITES
        _unmount_flakey
      
        _load_flakey_table $FLAKEY_ALLOW_WRITES
        _mount_flakey
      
        # Verify the content of our file 'foo' remains the same as before, 8192 bytes,
        # all with the value 0xaa.
        echo "File 'foo' content after log replay:"
        od -t x1 $SCRATCH_MNT/foo
      
        # Remove the first name of our inode. Because of the directory fsync bug, the
        # inode's link count was 1 instead of 5, so removing the 'foo' name ended up
        # deleting the inode and the other names became stale directory entries (still
        # visible to applications). Attempting to remove or access the remaining
        # dentries pointing to that inode resulted in stale file handle errors and
        # made it impossible to remove the parent directories since it was impossible
        # for them to become empty.
        echo "file 'foo' link count after log replay: $(stat -c %h $SCRATCH_MNT/foo)"
        rm -f $SCRATCH_MNT/foo
      
        # Now verify that all files, links and directories created before fsyncing our
        # directory exist after the fsync log was replayed.
        [ -f $SCRATCH_MNT/mydir/foo_2 ] || echo "Link mydir/foo_2 is missing"
        [ -f $SCRATCH_MNT/mydir/foo_3 ] || echo "Link mydir/foo_3 is missing"
        [ -f $SCRATCH_MNT/hello ] || echo "File hello is missing"
        [ -f $SCRATCH_MNT/mydir/hello_2 ] || echo "Link mydir/hello_2 is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/foo_y_link ] || \
            echo "Link mydir/x/y/foo_y_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/foo_z_link ] || \
            echo "Link mydir/x/y/z/foo_z_link is missing"
        [ -f $SCRATCH_MNT/mydir/x/y/z/qwerty ] || \
            echo "File mydir/x/y/z/qwerty is missing"
      
        # We expect our file here to have a size of 64Kb and all the bytes having the
        # value 0xff.
        echo "file 'hello' content after log replay:"
        od -t x1 $SCRATCH_MNT/hello
      
        # Now remove all files/links, under our test directory 'mydir', and verify we
        # can remove all the directories.
        rm -f $SCRATCH_MNT/mydir/x/y/z/*
        rmdir $SCRATCH_MNT/mydir/x/y/z
        rm -f $SCRATCH_MNT/mydir/x/y/*
        rmdir $SCRATCH_MNT/mydir/x/y
        rmdir $SCRATCH_MNT/mydir/x
        rm -f $SCRATCH_MNT/mydir/*
        rmdir $SCRATCH_MNT/mydir
      
        # An fsck, run by the fstests framework every time a test finishes, also detected
        # the inconsistency and printed the following error message:
        #
        # root 5 inode 257 errors 2001, no inode item, link count wrong
        #    unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
        #    unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
        status=0
        exit
      
      The expected golden output for the test is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 5
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
      
      Which is the output after this patch and when running the test against
      ext3/4, xfs, f2fs, reiserfs or nilfs2. Without this patch, the test's
      output is:
      
        wrote 8192/8192 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        wrote 65536/65536 bytes at offset 0
        XXX Bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
        File 'foo' content after log replay:
        0000000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa
        *
        0020000
        file 'foo' link count after log replay: 1
        Link mydir/foo_2 is missing
        Link mydir/foo_3 is missing
        Link mydir/x/y/foo_y_link is missing
        Link mydir/x/y/z/foo_z_link is missing
        File mydir/x/y/z/qwerty is missing
        file 'hello' content after log replay:
        0000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
        *
        0200000
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y/z': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x/y': No such file or directory
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/x': No such file or directory
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_2': Stale file handle
        rm: cannot remove '/home/fdmanana/btrfs-tests/scratch_1/mydir/foo_3': Stale file handle
        rmdir: failed to remove '/home/fdmanana/btrfs-tests/scratch_1/mydir': Directory not empty
      
      Fsck, without this fix, also complains about the wrong link count:
      
        root 5 inode 257 errors 2001, no inode item, link count wrong
            unresolved ref dir 258 index 2 namelen 5 name foo_2 filetype 1 errors 4, no inode ref
            unresolved ref dir 258 index 3 namelen 5 name foo_3 filetype 1 errors 4, no inode ref
      
      So fix this by logging the inodes that the dentries point to when
      fsyncing a directory.
      
      A test case for xfstests follows.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  12. 03 Feb, 2015 1 commit
  13. 04 Oct, 2014 1 commit
    • Btrfs: be aware of btree inode write errors to avoid fs corruption · 656f30db
      Committed by Filipe Manana
      While we have a transaction ongoing, the VM might decide at any time
      to call btree_inode->i_mapping->a_ops->writepages(), which will start
      writeback of dirty pages belonging to btree nodes/leafs. This call
      might return an error or the writeback might finish with an error
      before we attempt to commit the running transaction. If this happens,
      we might have no way of knowing that such error happened when we are
      committing the transaction - because the pages might no longer be
      marked dirty nor tagged for writeback (if a subsequent modification
      to the extent buffer didn't happen before the transaction commit) which
      makes filemap_fdata[write|wait]_range unable to find such pages (even
      if they're marked with SetPageError).
      So if this happens we must abort the transaction, otherwise we commit
      a super block with btree roots that point to btree nodes/leafs whose
      content on disk is invalid - either garbage or the content of some
      node/leaf from a past generation that got cowed or deleted and is no
      longer valid (for this latter case we end up getting error messages like
      "parent transid verify failed on 10826481664 wanted 25748 found 29562"
      when reading btree nodes/leafs from disk).
      
      Note that setting and checking AS_EIO/AS_ENOSPC in the btree inode's
      i_mapping would not be enough because we need to distinguish between
      log tree extents (not fatal) vs non-log tree extents (fatal) and
      because the next call to filemap_fdatawait_range() will catch and clear
      such errors in the mapping - and that call might be from a log sync and
      not from a transaction commit, which means we would not know about the
      error at transaction commit time. Also, checking for the eb flag
      EXTENT_BUFFER_IOERR at transaction commit time isn't done and would
      not be completely reliable, as the eb might be removed from memory and
      read back when trying to get it, which clears that flag right before
      reading the eb's pages from disk, making us not know about the previous
      write error.
      
      Using the 3 new flags for the btree inode also lets us achieve what
      AS_EIO/AS_ENOSPC are meant to provide in the case where writepages()
      returns success after starting writeback for all dirty pages and, before
      filemap_fdatawait_range() is called, the writeback for all of those
      pages has already finished with errors. Because we were not using
      AS_EIO/AS_ENOSPC, filemap_fdatawait_range() would return success in that
      case, as it could not know that writeback errors happened (the pages
      were no longer tagged for writeback).
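
      A simplified model of such sticky error flags is shown below. The flag
      names, the split into one flag for non-log trees and two for log trees,
      and the BtreeInode class are assumptions made for illustration; the
      point is only that the error survives after the pages stop being dirty
      or tagged for writeback, and that a log sync does not consume the error
      a transaction commit needs to see.

```python
TREE_ERR = 1 << 0  # write error on a non-log tree block: fatal
LOG1_ERR = 1 << 1  # write error on a block of one log tree (assumed split)
LOG2_ERR = 1 << 2  # write error on a block of the other log tree


class BtreeInode:
    def __init__(self):
        self.flags = 0

    def end_write(self, ok, log_index=None):
        """Called when writeback of one tree block finishes."""
        if ok:
            return
        if log_index is None:
            self.flags |= TREE_ERR
        else:
            self.flags |= LOG1_ERR if log_index == 0 else LOG2_ERR


def sync_log(inode, log_index):
    flag = LOG1_ERR if log_index == 0 else LOG2_ERR
    failed = bool(inode.flags & flag)
    inode.flags &= ~flag  # a log sync consumes only its own flag
    return "log sync failed" if failed else "log synced"


def commit_transaction(inode):
    # The flag is still set even though no page is dirty or under
    # writeback any more, so the commit cannot miss the error.
    if inode.flags & TREE_ERR:
        return "transaction aborted"
    return "committed"


btree = BtreeInode()
btree.end_write(ok=False, log_index=0)  # log tree block failed: not fatal
first_sync = sync_log(btree, 0)
first_commit = commit_transaction(btree)

btree.end_write(ok=False)               # non-log tree block failed: fatal
second_sync = sync_log(btree, 1)        # does not clear the fatal flag
second_commit = commit_transaction(btree)
print(first_sync, first_commit, second_sync, second_commit, sep=" | ")
```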
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  14. 18 Sep, 2014 4 commits
    • Btrfs: implement repair function when direct read fails · 8b110e39
      Committed by Miao Xie
      This patch implements a data repair function for when a direct read fails.
      
      The details of the implementation are:
      - When we find that the data is not right, we try to read the data from
        another mirror.
      - When the io on the mirror ends, we insert the endio work into a
        dedicated btrfs workqueue, not the common read endio workqueue. The
        original endio work is still blocked in the btrfs endio workqueue, so
        inserting the endio work of the io on the mirror into that same
        workqueue would cause a deadlock.
      - After we get the right data, we write it back to the corrupted mirror.
      - If the data on the new mirror is still corrupted, we try the next
        mirror until we read the right data or all the mirrors are traversed.
      - After the above work, we set the uptodate flag according to the result.
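
      The mirror traversal in the list above can be sketched as follows. This
      is a simplified illustration: read_with_repair() is a hypothetical
      helper, mirrors are plain bytearrays, zlib.crc32 stands in for the real
      data checksum, and the workqueue handling is not modelled. Only the
      mirror that originally failed is rewritten here.

```python
import zlib


def read_with_repair(mirrors, expected_csum, failed_mirror):
    """Try the other mirrors after `failed_mirror` returned bad data.

    `mirrors` is a list of bytearray copies of the same block.
    Returns (data, uptodate).
    """
    count = len(mirrors)
    for step in range(1, count):
        candidate = (failed_mirror + step) % count
        data = bytes(mirrors[candidate])
        if zlib.crc32(data) == expected_csum:
            # Good copy found: write it back over the corrupted mirror.
            mirrors[failed_mirror][:] = data
            return data, True
        # Still corrupted: move on to the next mirror.
    # All mirrors traversed without finding good data.
    return None, False


good = b"A" * 4096
copies = [bytearray(b"X" * 4096), bytearray(b"Y" * 4096), bytearray(good)]
data, uptodate = read_with_repair(copies, zlib.crc32(good), failed_mirror=0)
print("uptodate:", uptodate, "mirror 0 repaired:", bytes(copies[0]) == good)
```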
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: do file data check by sub-bio's self · c1dc0896
      Committed by Miao Xie
      Direct IO splits the original bio into several sub-bios because of the raid
      stripe limit, and the filesystem waits for all sub-bios and then runs the
      final end io process.
      
      But it was very hard to implement data repair when a dio read failure happens,
      because in the final end io function we didn't know which mirror the data was
      read from. So in order to implement data repair, we have to move the file data
      check from the final end io function to the sub-bio end io function, in which
      we can get the mirror number of the device we accessed. This patch does this
      work as the first step of the direct io data repair implementation.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: load checksum data once when submitting a direct read io · 23ea8e5a
      Committed by Miao Xie
      The current code loads the checksum data several times when we split
      a whole direct read io because of the raid stripe limit, which makes us
      search the csum tree several times. This just wastes time and makes
      contention on the csum tree root more serious. This patch improves on
      this by loading the data once.
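
      The difference between the two approaches can be shown with a toy
      model. CsumTree, split() and the two read functions are hypothetical
      names; the model only counts how often the checksum tree is searched
      when a read is split at stripe boundaries.

```python
class CsumTree:
    """Toy checksum tree that counts how often it is searched."""

    def __init__(self, csums, sectorsize=4096):
        self.csums = csums  # one checksum per sector, by sector index
        self.sectorsize = sectorsize
        self.searches = 0

    def lookup(self, offset, length):
        self.searches += 1
        first = offset // self.sectorsize
        return self.csums[first:first + length // self.sectorsize]


def split(offset, length, stripe_len):
    """Split [offset, offset + length) at stripe boundaries, like sub-bios."""
    end = offset + length
    while offset < end:
        nxt = min(end, (offset // stripe_len + 1) * stripe_len)
        yield offset, nxt - offset
        offset = nxt


def read_per_sub_bio(tree, offset, length, stripe_len):
    # Old behaviour: one tree search for every sub-bio.
    return [tree.lookup(off, ln) for off, ln in split(offset, length, stripe_len)]


def read_load_once(tree, offset, length, stripe_len):
    # New behaviour: a single search, then each sub-bio takes its slice.
    all_csums = tree.lookup(offset, length)
    out = []
    for off, ln in split(offset, length, stripe_len):
        first = (off - offset) // tree.sectorsize
        out.append(all_csums[first:first + ln // tree.sectorsize])
    return out


old_tree = CsumTree(list(range(64)))
new_tree = CsumTree(list(range(64)))
old = read_per_sub_bio(old_tree, 0, 256 * 1024, 64 * 1024)
new = read_load_once(new_tree, 0, 256 * 1024, 64 * 1024)
print("searches:", old_tree.searches, "vs", new_tree.searches)
```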
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
    • Btrfs: make defragment work with nodatacow option · 47059d93
      Committed by Wang Shilong
      Btrfs defragment relies on the COW feature, which means it did not
      work with the nodatacow option. This problem was detected by xfstests
      generic/018 with the nodatacow mount option.
      
      Fix this problem by forcing cow for an extent that has the
      @EXTENT_DEFRAG state set.
      Signed-off-by: Wang Shilong <wangsl.fnst@cn.fujitsu.com>
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  15. 17 Sep, 2014 1 commit
    • Btrfs: set inode's logged_trans/last_log_commit after ranged fsync · 125c4cf9
      Committed by Filipe Manana
      When a ranged fsync finishes and there are still extent maps in the modified
      list, still set the inode's logged_trans and last_log_commit. This is important
      in case an inode is fsync'ed and unlinked in the same transaction, to ensure its
      inode ref gets deleted from the log and the respective dentries in its parent
      are deleted from the log too (if the parent directory was fsync'ed in the same
      transaction).
      
      Instead make btrfs_inode_in_log() return false if the list of modified extent
      maps isn't empty.
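
      The resulting check could look roughly like the sketch below. It is an
      assumption-laden illustration: the Inode class and the last_sub_trans
      comparison are not taken from this message, only the logged_trans,
      last_log_commit and modified extent map list are.

```python
from dataclasses import dataclass, field


@dataclass
class Inode:
    logged_trans: int = 0
    last_sub_trans: int = 0   # assumed field, not named in this message
    last_log_commit: int = 0
    modified_extents: list = field(default_factory=list)


def inode_in_log(inode, generation):
    # Extent maps still waiting to be logged mean the log is not complete,
    # regardless of what logged_trans and last_log_commit say.
    if inode.modified_extents:
        return False
    return (inode.logged_trans == generation
            and inode.last_sub_trans <= inode.last_log_commit)


ranged = Inode(logged_trans=7, last_sub_trans=2, last_log_commit=2,
               modified_extents=["extent map outside the fsync range"])
full = Inode(logged_trans=7, last_sub_trans=2, last_log_commit=2)
print(inode_in_log(ranged, 7), inode_in_log(full, 7))
```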
      
      This is an incremental change on top of the v4 version of the patch:
      
          "Btrfs: fix fsync data loss after a ranged fsync"
      
      which was added to its v5, but didn't make it in time.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Chris Mason <clm@fb.com>
  16. 15 Aug, 2014 1 commit
    • btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Committed by Chris Mason
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: Chris Mason <clm@fb.com>
  17. 10 Jun, 2014 1 commit
    • btrfs: Drop EXTENT_UPTODATE check in hole punching and direct locking · fc4adbff
      Committed by Alex Gartrell
      In these instances, we are trying to determine if a page has been accessed
      since we began the operation for the sake of retry.  This is easily
      accomplished by doing a gang lookup in the page mapping radix tree, and it
      saves us the dependency on the flag (so that we might eventually delete
      it).
      
      btrfs_page_exists_in_range borrows heavily from find_get_page, replacing
      the radix tree look up with a gang lookup of 1, so that we can find the
      next highest page >= index and see if it falls into our lock range.
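
      The lookup can be modelled in a few lines. This is a sketch only: a
      sorted list of page indexes stands in for the page mapping radix tree,
      bisect stands in for the gang lookup of 1, and the 4096-byte page size
      is an assumption.

```python
import bisect


def page_exists_in_range(cached_indexes, start, end, page_size=4096):
    """Return True if any cached page overlaps the byte range [start, end].

    `cached_indexes` is a sorted list of page indexes present in the mapping.
    """
    start_idx = start // page_size
    end_idx = end // page_size
    # "Gang lookup of 1": find the first cached page with index >= start_idx.
    pos = bisect.bisect_left(cached_indexes, start_idx)
    if pos == len(cached_indexes):
        return False
    # It only counts if it falls inside the locked range.
    return cached_indexes[pos] <= end_idx


cached = [2, 10]
print(page_exists_in_range(cached, 4096, 3 * 4096 - 1))
print(page_exists_in_range(cached, 3 * 4096, 10 * 4096 - 1))
```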
      Signed-off-by: Chris Mason <clm@fb.com>
      Signed-off-by: Alex Gartrell <agartrell@fb.com>
  18. 18 Apr, 2014 1 commit
  19. 11 Mar, 2014 1 commit
  20. 29 Jan, 2014 2 commits
    • Btrfs: add support for inode properties · 63541927
      Committed by Filipe David Borba Manana
      This change adds infrastructure to allow for generic properties for
      inodes. Properties are name/value pairs that can be associated with
      inodes for different purposes. They are stored as xattrs with the
      prefix "btrfs."
      
      Properties can be inherited - this means when a directory inode has
      inheritable properties set, these are added to new inodes created
      under that directory. Further, subvolumes can also have properties
      associated with them, and they can be inherited from their parent
      subvolume. Naturally, directory properties have priority over subvolume
      properties (in practice a subvolume property is just a regular
      property associated with the root inode, objectid 256, of the
      subvolume's fs tree).
      
      This change also adds one specific property implementation, named
      "compression", whose values can be "lzo" or "zlib" and it's an
      inheritable property.
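
      A small model of the inheritance rule follows. It is a sketch under
      assumptions: set_property() and new_inode_xattrs() are hypothetical
      helpers and plain dicts stand in for the xattrs of an inode. Only the
      "btrfs." prefix, the "compression" property and its two values come
      from the description above.

```python
PREFIX = "btrfs."
VALID = {"compression": {"lzo", "zlib"}}  # the one property named above
INHERITABLE = {"compression"}


def set_property(xattrs, name, value):
    """Store a property as an xattr after validating its value."""
    if value not in VALID.get(name, ()):
        raise ValueError(f"invalid value {value!r} for property {name!r}")
    xattrs[PREFIX + name] = value


def new_inode_xattrs(parent_xattrs):
    """xattrs a new inode starts with: the parent's inheritable properties."""
    return {key: value for key, value in parent_xattrs.items()
            if key.startswith(PREFIX) and key[len(PREFIX):] in INHERITABLE}


# The subvolume property is just a property on the subvolume's root inode.
subvol_root = {}
set_property(subvol_root, "compression", "zlib")

mydir = new_inode_xattrs(subvol_root)      # directory created under the root
set_property(mydir, "compression", "lzo")  # the directory sets its own value

new_file = new_inode_xattrs(mydir)         # file created under that directory
print(new_file)
```

      Since a new inode only looks at its parent directory in this model, a
      value set on a directory naturally takes priority over the one that
      came from the subvolume root.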
      
      The corresponding changes to btrfs-progs were also implemented.
      A patch with xfstests for this feature will follow once there's
      agreement on this change/feature.
      
      Further, the script at the bottom of this commit message was used to
      do some benchmarks to measure any performance penalties of this feature.
      
      Basically the tests correspond to:
      
      Test 1 - create a filesystem and mount it with compress-force=lzo,
      then sequentially create N files of 64Kb each, measure how long it took
      to create the files, unmount the filesystem, mount the filesystem and
      perform an 'ls -lha' against the test directory holding the N files, and
      report the time the command took.
      
      Test 2 - create a filesystem and don't use any compression option when
      mounting it - instead set the compression property of the subvolume's
      root to 'lzo'. Then create N files of 64Kb, and report the time it took.
      Then unmount the filesystem, mount it again and perform an 'ls -lha' like
      in the former test. This means every single file ends up with a property
      (xattr) associated to it.
      
      Test 3 - same as test 2, but uses 4 properties - 3 are duplicates of the
      compression property, have no real effect other than adding more work
      when inheriting properties and taking more btree leaf space.
      
      Test 4 - same as test 3 but with 10 properties per file.
      
      Results (in seconds, and averages of 5 runs each), for different N
      numbers of files follow.
      
      * Without properties (test 1)
      
                          file creation time        ls -lha time
      10 000 files              3.49                   0.76
      100 000 files            47.19                   8.37
      1 000 000 files         518.51                 107.06
      
      * With 1 property (compression property set to lzo - test 2)
      
                          file creation time        ls -lha time
      10 000 files              3.63                    0.93
      100 000 files            48.56                    9.74
      1 000 000 files         537.72                  125.11
      
      * With 4 properties (test 3)
      
                          file creation time        ls -lha time
      10 000 files              3.94                    1.20
      100 000 files            52.14                   11.48
      1 000 000 files         572.70                  142.13
      
      * With 10 properties (test 4)
      
                          file creation time        ls -lha time
      10 000 files              4.61                    1.35
      100 000 files            58.86                   13.83
      1 000 000 files         656.01                  177.61
      
      The increased latencies with properties are essentially because of:
      
      *) When creating an inode, we now synchronously write 1 more item
         (an xattr item) for each property inherited from the parent dir
         (or subvolume). This could be done in an asynchronous way such
         as we do for dir index items (delayed-inode.c), which could help
         reduce the file creation latency;
      
      *) With properties, we now have larger fs trees. For this particular
         test each xattr item uses 75 bytes of leaf space in the fs tree.
         This could be less by using a new item for xattr items, instead of
         the current btrfs_dir_item, since we could cut the 'location' and
         'type' fields (saving 18 bytes) and maybe 'transid' too (saving a
         total of 26 bytes per xattr item) from the btrfs_dir_item type.
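The 75 byte figure and the possible savings can be checked with a small userspace C sketch. The struct definitions below are stand-ins written from my understanding of the on-disk layout (a 17 byte key, a 25 byte leaf item header, a 30 byte btrfs_dir_item); the struct and function names are hypothetical, so treat the layout as an assumption rather than a copy of the kernel headers. The `packed` attribute assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of an on-disk key: objectid, type, offset. */
struct disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__((packed));

/* Assumed layout of the per-item header stored in a leaf. */
struct leaf_item_header {
	struct disk_key key;
	uint32_t offset;
	uint32_t size;
} __attribute__((packed));

/* Assumed layout of btrfs_dir_item, which xattr items reuse. */
struct dir_item {
	struct disk_key location;
	uint64_t transid;
	uint16_t data_len;
	uint16_t name_len;
	uint8_t type;
} __attribute__((packed));

/* Leaf space used by one xattr item: header + dir item + name + value. */
static size_t xattr_item_leaf_bytes(const char *name, const char *value)
{
	return sizeof(struct leaf_item_header) + sizeof(struct dir_item) +
	       strlen(name) + strlen(value);
}

/* Bytes saved by dropping the 'location' and 'type' fields. */
static size_t saved_without_location_and_type(void)
{
	return sizeof(struct disk_key) + sizeof(uint8_t);
}

/* Bytes saved by dropping 'transid' as well. */
static size_t saved_without_transid_too(void)
{
	return saved_without_location_and_type() + sizeof(uint64_t);
}
```

Under these assumptions, the "btrfs.compression" xattr with value "lzo" takes 25 + 30 + 17 + 3 = 75 bytes, and the two savings come out as 18 and 26 bytes, matching the numbers quoted above.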
      
      Also tried batching the xattr insertions (ignoring proper hash
      collision handling, since it didn't exist) when creating files that
      inherit properties from their parent inode/subvolume, but the end
      results were (surprisingly) essentially the same.
      
      Test script:
      
      $ cat test.pl
        #!/usr/bin/perl -w
      
        use strict;
        use Time::HiRes qw(time);
        use IO::Handle;  # needed for $f->autoflush() on perl older than 5.14
        use constant NUM_FILES => 10_000;
        use constant FILE_SIZES => (64 * 1024);
        use constant DEV => '/dev/sdb4';
        use constant MNT_POINT => '/home/fdmanana/btrfs-tests/dev';
        use constant TEST_DIR => (MNT_POINT . '/testdir');
      
        system("mkfs.btrfs", "-l", "16384", "-f", DEV) == 0 or die "mkfs.btrfs failed!";
      
        # following line for testing without properties
        #system("mount", "-o", "compress-force=lzo", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        # following 2 lines for testing with properties
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
        system("btrfs", "prop", "set", MNT_POINT, "compression", "lzo") == 0 or die "set prop failed!";
      
        system("mkdir", TEST_DIR) == 0 or die "mkdir failed!";
        my ($t1, $t2);
      
        $t1 = time();
        for (my $i = 1; $i <= NUM_FILES; $i++) {
            my $p = TEST_DIR . '/file_' . $i;
            open(my $f, '>', $p) or die "Error opening file!";
            $f->autoflush(1);
            for (my $j = 0; $j < FILE_SIZES; $j += 4096) {
                print $f ('A' x 4096) or die "Error writing to file!";
            }
            close($f);
        }
        $t2 = time();
        print "Time to create " . NUM_FILES . ": " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
        system("mount", DEV, MNT_POINT) == 0 or die "mount failed!";
      
        $t1 = time();
        system("bash -c 'ls -lha " . TEST_DIR . " > /dev/null'") == 0 or die "ls failed!";
        $t2 = time();
        print "Time to ls -lha all files: " . ($t2 - $t1) . " seconds.\n";
        system("umount", DEV) == 0 or die "umount failed!";
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      63541927
    • M
      Btrfs: introduce the delayed inode ref deletion for the single link inode · 67de1176
      Miao Xie committed
      The inode reference item is close to the inode item, so we insert it
      simultaneously with the inode item when we create a file/directory. In fact,
      we can also handle the inode reference deletion in the same way. So this patch
      introduces delayed inode reference deletion for single link inodes (in most
      cases a file has no hard links, so we don't take hard links into account).
      
      This function is based on the delayed inode mechanism. After applying this patch,
      the time of file/directory deletion is reduced by ~10%.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Chris Mason <clm@fb.com>
      67de1176
  21. 12 Nov, 2013 1 commit
    • F
      Btrfs: improve inode hash function/inode lookup · 778ba82b
      Filipe David Borba Manana committed
      Currently the hash value used for adding an inode to the VFS's inode
      hash table consists of the plain inode number, which is a 64 bits
      integer. This results in hash table buckets (hlist_head lists) with
      too many elements for at least 2 important scenarios:
      
      1) When we have many subvolumes. Each subvolume has its own btree
         where its files and directories are added to, and each has its
         own objectid (inode number) namespace. This means that if we have
         N subvolumes, and all have inode number X associated to a file or
         directory, the corresponding inodes all map to the same hash table
         entry, resulting in a bucket (hlist_head list) with N elements;
      
      2) On 32 bits machines. The VFS hash values are unsigned longs, which
         are 32 bits wide on 32 bits machines, and the inode (objectid)
         numbers are 64 bits unsigned integers. We simply cast the inode
         numbers to hash values, which means that for all inodes with the
         same 32 bits lower half, the same hash bucket is used for all of
         them. For example, all inodes whose number (objectid) has
         0xffff_ffff as its lower 32 bits, from 0x0000_0000_ffff_ffff up
         to 0xffff_ffff_ffff_ffff, will end up in the same hash table
         bucket.
      
      This change ensures the inode's hash value depends both on the
      objectid (inode number) and its subvolume's (btree root) objectid.
      For 32 bits machines, this change gives better entropy by making
      the hash value depend on both the upper and lower 32 bits of the
      64 bits hash previously computed.
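A minimal userspace sketch of this idea follows. It is not the patch's code: the function names are hypothetical, and the multiplier `MIX_PRIME_64` is an assumed odd 64-bit constant chosen only to spread the root objectid. The sketch shows the two properties the text describes: the hash depends on both the inode's objectid and the root's objectid, and on 32 bits machines the upper and lower halves are folded together instead of truncating.

```c
#include <stdint.h>

/* Assumed odd 64-bit multiplier; the patch may use a different constant. */
#define MIX_PRIME_64 0x9e37fffffffc0001ULL

/* Old behaviour on 32 bits machines: a plain cast drops the upper half. */
static uint32_t truncate_to_32(uint64_t objectid)
{
	return (uint32_t)objectid;
}

/* 64-bit hash that depends on both the inode and its root (subvolume). */
static uint64_t inode_hash64(uint64_t objectid, uint64_t root_objectid)
{
	return objectid ^ (root_objectid * MIX_PRIME_64);
}

/* Fold for 32 bits machines: mix the upper and lower 32 bits together. */
static uint32_t fold_hash_to_32(uint64_t h)
{
	return (uint32_t)(h >> 32) ^ (uint32_t)h;
}
```

With a plain cast, 0x0000_0001_ffff_ffff and 0x0000_0002_ffff_ffff both truncate to 0xffff_ffff, whereas folding gives different values for them. Likewise, the same inode number in two different subvolumes yields different 64-bit hashes, because multiplying by an odd constant maps distinct root objectids to distinct values.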
      Signed-off-by: Filipe David Borba Manana <fdmanana@gmail.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      778ba82b
  22. 21 Sep, 2013 1 commit
    • J
      Btrfs: check roots last log commit when checking if an inode has been logged · a5874ce6
      Josef Bacik committed
      Liu introduced a local copy of the last log commit for an inode to make sure we
      actually log an inode even if a log commit has already taken place.  In order to
      make sure we didn't relog the same inode multiple times he set this local copy
      to the current trans when we log the inode, because usually we log the inode and
      then sync the log.  The exception to this is during rename, we will relog an
      inode if the name changed and it is already in the log.  The problem with this
      is then we go to sync the inode, and our check to see if the inode has already
      been logged is tripped and we don't sync the log.  To fix this we need to _also_
      check against the root's last log commit, because it could be less than what is
      in our local copy of the log commit.  This fixes a bug where we rename a file
      into a directory and then fsync the directory and then on remount the directory
      is no longer there.  Thanks,
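The combined check can be sketched in userspace C as follows. This is an illustration only: the struct and function names are hypothetical, and the field `logged_trans` is an assumed name for "the transaction in which the inode was last logged". The point it shows is that the inode counts as already logged only if both the inode-local copy and the root's last log commit have caught up with the inode's last modification.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-root log state. */
struct log_root_state {
	uint64_t last_log_commit;
};

/* Hypothetical stand-in for the per-inode log state. */
struct log_inode_state {
	uint64_t logged_trans;    /* transaction in which it was last logged */
	uint64_t last_sub_trans;  /* log sub-transaction of last modification */
	uint64_t last_log_commit; /* inode-local copy of the log commit */
	const struct log_root_state *root;
};

/* True only if the log sync can safely be skipped for this inode. */
static bool inode_already_in_log(const struct log_inode_state *inode,
				 uint64_t generation)
{
	return inode->logged_trans == generation &&
	       inode->last_sub_trans <= inode->last_log_commit &&
	       inode->last_sub_trans <= inode->root->last_log_commit;
}
```

In the rename case described above, the local copy has already been bumped by the relog while the root's last log commit is still behind, so the second comparison fails and the log gets synced.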
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      Signed-off-by: Chris Mason <chris.mason@fusionio.com>
      a5874ce6
  23. 01 Sep, 2013 1 commit
  24. 07 May, 2013 1 commit
  25. 21 Feb, 2013 2 commits
    • M
      Btrfs: serialize unlocked dio reads with truncate · 2e60a51e
      Miao Xie committed
      Currently, we can do unlocked dio reads, but the following race
      is possible:
      
      dio_read_task			truncate_task
      				->btrfs_setattr()
      ->btrfs_direct_IO
          ->__blockdev_direct_IO
            ->btrfs_get_block
      				  ->btrfs_truncate()
      				 #alloc truncated blocks
      				 #to other inode
            ->submit_io()
           #INFORMATION LEAK
      
      In order to avoid this problem, we must serialize unlocked dio reads with
      truncate. There are two approaches:
      - use extent lock to protect the extent that we truncate
      - use inode_dio_wait() to make sure the truncating task will wait for
        the read DIO.
      
      If we use the 1st one, we will hit the endless truncation problem caused by
      nonlocked read DIO once we implement nonlocked write DIO, because we
      still need to invoke inode_dio_wait() to avoid the race between write
      DIO and truncation. By that time, we would have to introduce
      
        btrfs_inode_{block, resume}_nolock_dio()
      
      again. That is, we would have to implement this patch again, so I chose the
      2nd way to fix the problem.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      2e60a51e
    • M
      Btrfs: use the inode own lock to protect its delalloc_bytes · df0af1a5
      Miao Xie committed
      We do not need a global lock to protect the delalloc_bytes of the
      inode; its own lock is enough. This way we reduce lock contention,
      and ->delalloc_lock only protects the delalloc inode list.
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      df0af1a5
  26. 17 Dec, 2012 2 commits
  27. 02 Oct, 2012 2 commits
    • L
      Btrfs: fix a bug in checking whether a inode is already in log · 46d8bc34
      Liu Bo committed
      This is based on Josef's "Btrfs: turbo charge fsync".
      
      The current btrfs checks if an inode is in the log by comparing
      root's last_log_commit to inode's last_sub_trans[1].
      
      But the problem is that this root->last_log_commit is shared among
      inodes.
      
      Say we have N inodes to be logged. After the first inode is logged,
      root's last_log_commit is updated and the remaining N-1 inodes will
      be skipped.
      
      This fixes the bug by keeping a local copy of root's last_log_commit
      inside each inode, and this local copy is maintained per inode.
      
      [1]: we regard each log transaction as a subset of btrfs's transaction,
      i.e. sub_trans
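The skipping behaviour can be modelled with a small userspace C sketch. It is a simplified model, not kernel code: the struct names, `count_logged` and the `use_local_copy` switch are hypothetical, and a "log commit" is reduced to copying the inode's last_sub_trans into the relevant last_log_commit field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the state shared by all inodes of a root. */
struct shared_root {
	uint64_t last_log_commit;
};

/* Hypothetical stand-in for the per-inode state. */
struct tracked_inode {
	uint64_t last_sub_trans;  /* sub-transaction of last modification */
	uint64_t last_log_commit; /* per-inode copy introduced by the fix */
};

/*
 * Walk the inodes in order and count how many actually get logged.
 * With use_local_copy == false the check uses the shared root value
 * (old behaviour); with true it uses the per-inode copy (the fix).
 */
static unsigned count_logged(struct tracked_inode *inodes, unsigned n,
			     struct shared_root *root, bool use_local_copy)
{
	unsigned logged = 0;

	for (unsigned i = 0; i < n; i++) {
		uint64_t committed = use_local_copy ?
			inodes[i].last_log_commit : root->last_log_commit;

		if (inodes[i].last_sub_trans <= committed)
			continue; /* considered already in the log */

		logged++;
		/* Model the log commit that follows logging this inode. */
		root->last_log_commit = inodes[i].last_sub_trans;
		inodes[i].last_log_commit = inodes[i].last_sub_trans;
	}
	return logged;
}
```

With four inodes all modified in the same sub-transaction, the shared check logs only the first one and skips the other three, while the per-inode check logs all four.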
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      46d8bc34
    • J
      Btrfs: turbo charge fsync · 5dc562c5
      Josef Bacik committed
      At least for the vm workload.  Currently on fsync we will
      
      1) Truncate all items in the log tree for the given inode if they exist
      
      and
      
      2) Copy all items for a given inode into the log
      
      The problem with this is that for things like VMs you can have lots of
      extents from the fragmented writing behavior, and worse yet you may have
      only modified a few extents, not the entire thing.  This patch fixes this
      problem by tracking which transid modified our extent, and then when we do
      the tree logging we find all of the extents we've modified in our current
      transaction, sort them and commit them.  We also only truncate up to the
      xattrs of the inode and copy that stuff in normally, and then just drop any
      extents in the range we have that exist in the log already.  Here are some
      numbers of a 50 meg fio job that does random writes and fsync()s after every
      write
      
      		Original	Patched
      SATA drive	82KB/s		140KB/s
      Fusion drive	431KB/s		2532KB/s
      
      So around 2-6 times faster depending on your hardware.  There are a few
      corner cases, for example if you truncate at all we have to do it the old
      way since there is no way to be sure what is in the log is ok.  This
      probably could be done smarter, but if you write-fsync-truncate-write-fsync
      you deserve what you get.  All this work is in RAM of course so if your
      inode gets evicted from cache and you read it in and fsync it we'll do it
      the slow way if we are still in the same transaction that we last modified
      the inode in.
      
      The biggest cool part of this is that it requires no changes to the recovery
      code, so if you fsync with this patch and crash and load an old kernel, it
      will run the recovery and be a-ok.  I have tested this pretty thoroughly
      with an fsync tester and everything comes back fine, as well as xfstests.
      Thanks,
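The core idea of logging only the extents modified in the current transaction can be sketched in userspace C. This is a rough model under stated assumptions: the struct and function names are hypothetical, each extent is assumed to carry the transid that last modified it in a `generation` field, and sorting by file offset is my assumption, since the message only says the extents are sorted.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory record of one extent of the inode. */
struct tracked_extent {
	uint64_t start;      /* file offset of the extent */
	uint64_t len;        /* length in bytes */
	uint64_t generation; /* transid that last modified it */
};

/* qsort comparator: order extents by file offset (assumed sort key). */
static int cmp_extent_start(const void *a, const void *b)
{
	const struct tracked_extent *x = a;
	const struct tracked_extent *y = b;

	return (x->start > y->start) - (x->start < y->start);
}

/*
 * Copy the extents modified in cur_transid into out (which must have
 * room for n entries), sort them, and return how many were collected.
 */
static size_t collect_modified_extents(const struct tracked_extent *all,
				       size_t n, uint64_t cur_transid,
				       struct tracked_extent *out)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (all[i].generation == cur_transid)
			out[count++] = all[i];
	}
	qsort(out, count, sizeof(*out), cmp_extent_start);
	return count;
}
```

For a file with many extents of which only a few were touched in the current transaction, only those few end up in `out`, which is what keeps the fsync cost proportional to the modified extents rather than to the whole file.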
      Signed-off-by: Josef Bacik <jbacik@fusionio.com>
      5dc562c5
  28. 24 Jul, 2012 1 commit