1. 22 10月, 2015 1 次提交
    • J
      Btrfs: change how we wait for pending ordered extents · 161c3549
      Josef Bacik 提交于
      We have a mechanism to make sure we don't lose updates for ordered extents that
      were logged in the transaction that is currently running.  We add the ordered
      extent to a transaction list and then the transaction waits on all the ordered
      extents in that list.  However are substantially large file systems this list
      can be extremely large, and can give us soft lockups, since the ordered extents
      don't remove themselves from the list when they do complete.
      
      To fix this we simply add a counter to the transaction that is incremented any
      time we have a logged extent that needs to be completed in the current
      transaction.  Then when the ordered extent finally completes it decrements the
      per transaction counter and wakes up the transaction if we are the last ones.
      This will eliminate the softlockup.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      161c3549
  2. 10 6月, 2015 1 次提交
    • F
      Btrfs: avoid syncing log in the fast fsync path when not necessary · b659ef02
      Filipe Manana 提交于
      Commit 3a8b36f3 ("Btrfs: fix data loss in the fast fsync path") added
      a performance regression for that causes an unnecessary sync of the log
      trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
      against a file, without no writes or any metadata updates to the inode in
      between them and if a transaction is committed before the second fsync is
      called.
      
      Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
      after a test sysbench test that measured a -62% decrease of file io
      requests per second for that tests' workload.
      
      The test is:
      
        echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
        mkfs -t btrfs /dev/sda2
        mount -t btrfs /dev/sda2 /fs/sda2
        cd /fs/sda2
        for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
        sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
          --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
          --file-num=1024 run
      
      A test on kvm guest, running a debug kernel gave me the following results:
      
      Without 3a8b36f3:             16.01 reqs/sec
      With 3a8b36f3:                 3.39 reqs/sec
      With 3a8b36f3 and this patch: 16.04 reqs/sec
      Reported-by: NHuang Ying <ying.huang@intel.com>
      Tested-by: NHuang, Ying <ying.huang@intel.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b659ef02
  3. 03 6月, 2015 1 次提交
    • L
      Btrfs: remove csum_bytes_left · 0c304304
      Liu Bo 提交于
      After commit 8407f553
      ("Btrfs: fix data corruption after fast fsync and writeback error"),
      during wait_ordered_extents(), we wait for ordered extent setting
      BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
      already got checksum information, so we don't need to check
      (csum_bytes_left == 0) in the whole logging path.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0c304304
  4. 22 11月, 2014 2 次提交
    • F
      Btrfs: collect only the necessary ordered extents on ranged fsync · 0870295b
      Filipe Manana 提交于
      Instead of collecting all ordered extents from the inode's ordered tree
      and then wait for all of them to complete, just collect the ones that
      overlap the fsync range.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0870295b
    • J
      Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik 提交于
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50d9aa99
  5. 15 8月, 2014 1 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
  6. 11 3月, 2014 4 次提交
  7. 12 11月, 2013 2 次提交
  8. 21 9月, 2013 1 次提交
  9. 01 9月, 2013 1 次提交
    • J
      Btrfs: allow partial ordered extent completion · 77cef2ec
      Josef Bacik 提交于
      We currently have this problem where you can truncate pages that have not yet
      been written for an ordered extent.  We do this because the truncate will be
      coming behind to clean us up anyway so what's the harm right?  Well if truncate
      fails for whatever reason we leave an orphan item around for the file to be
      cleaned up later.  But if the user goes and truncates up the file and tries to
      read from the area that had been discarded previously they will get a csum error
      because we never actually wrote that data out.
      
      This patch fixes this by allowing us to either discard the ordered extent
      completely, by which I mean we just free up the space we had allocated and not
      add the file extent, or adjust the length of the file extent we write.  We do
      this by setting the length we truncated down to in the ordered extent, and then
      we set the file extent length and ram bytes to this length.  The total disk
      space stays unchanged since we may be compressed and we can't just chop off the
      disk space, but at least this way the file extent only points to the valid data.
      Then when the file extent is free'd the extent and csums will be freed normally.
      
      This patch is needed for the next series which will give us more graceful
      recovery of failed truncates.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      77cef2ec
  10. 02 7月, 2013 1 次提交
    • M
      Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Miao Xie 提交于
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary, because the extents that btrfs_sector_sum points to are
      continuous, we can find out the expected checksums by btrfs_ordered_sum's
      bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
      removing bytenr, there is only one member in the structure, so it makes
      no sense to keep the structure, just remove it, and use a u32 array to
      store the checksum value.
      
      By this change, we don't use the while loop to get the checksums one by
      one. Now, we can get several checksum value at one time, it improved the
      performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f51a4a18
  11. 14 6月, 2013 1 次提交
  12. 07 5月, 2013 1 次提交
  13. 21 2月, 2013 1 次提交
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
  14. 20 2月, 2013 1 次提交
    • J
      Btrfs: wait on ordered extents at the last possible moment · 2ab28f32
      Josef Bacik 提交于
      Since we don't actually copy the extent information from the source tree in
      the fast case we don't need to wait for ordered io to be completed in order
      to fsync, we just need to wait for the io to be completed.  So when we're
      logging our file just attach all of the ordered extents to the log, and then
      when the log syncs just wait for IO_DONE on the ordered extents and then
      write the super.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2ab28f32
  15. 12 12月, 2012 2 次提交
  16. 19 11月, 2012 1 次提交
  17. 04 10月, 2012 1 次提交
  18. 02 10月, 2012 2 次提交
    • M
      Btrfs: use a slab for ordered extents allocation · 6352b91d
      Miao Xie 提交于
      The ordered extent allocation is in the fast path of the IO, so use a slab
      to improve the speed of the allocation.
      
       "Size of the struct is 280, so this will fall into the size-512 bucket,
        giving 8 objects per page, while own slab will pack 14 objects into a page.
      
        Another benefit I see is to check for leaked objects when the module is
        removed (and the cache destroy takes place)."
      						-- David Sterba
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      6352b91d
    • M
      Btrfs: fix file extent discount problem in the, snapshot · b9a8cc5b
      Miao Xie 提交于
      If a snapshot is created while we are writing some data into the file,
      the i_size of the corresponding file in the snapshot will be wrong, it will
      be beyond the end of the last file extent. And btrfsck will report:
        root 256 inode 257 errors 100
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
       # for ((i=0; i<4; i++))
       > do
       > btrfs sub snap . $i
       > done
      
      This because the algorithm of disk_i_size update is wrong. Though there are
      some ordered extents behind the current one which we use to update disk_i_size,
      it doesn't mean those extents will be dealt with in the same transaction. So
      We shouldn't use the offset of those extents to update disk_i_size. Or we will
      get the wrong i_size in the snapshot.
      
      We fix this problem by recording the max real i_size. If we find there is a
      ordered extent which is in front of the current one and doesn't complete, we
      will record the end of the current one into that ordered extent. Surely, if
      the current extent holds the end of other extent(it must be greater than
      the current one because it is behind the current one), we will record the
      number that the current extent holds. In this way, we can exclude the ordered
      extents that may not be dealth with in the same transaction, and be easy to
      know the real disk_i_size.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      b9a8cc5b
  19. 30 5月, 2012 1 次提交
    • J
      Btrfs: finish ordered extents in their own thread · 5fd02043
      Josef Bacik 提交于
      We noticed that the ordered extent completion doesn't really rely on having
      a page and that it could be done independantly of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but atm it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5fd02043
  20. 22 3月, 2012 1 次提交
  21. 22 12月, 2010 1 次提交
  22. 29 11月, 2010 1 次提交
  23. 25 5月, 2010 1 次提交
    • J
      Btrfs: add basic DIO read/write support · 4b46fce2
      Josef Bacik 提交于
      This provides basic DIO support for reading and writing.  It does not do the
      work to recover from mismatching checksums, that will come later.  A few design
      changes have been made from Jim's code (sorry Jim!)
      
      1) Use the generic direct-io code.  Jim originally re-wrote all the generic DIO
      code in order to account for all of BTRFS's oddities, but thanks to that work it
      seems like the best bet is to just ignore compression and such and just opt to
      fallback on buffered IO.
      
      2) Fallback on buffered IO for compressed or inline extents.  Jim's code did
      it's own buffering to make dio with compressed extents work.  Now we just
      fallback onto normal buffered IO.
      
      3) Use ordered extents for the writes so that all of the
      
      lock_extent()
      lookup_ordered()
      
      type checks continue to work.
      
      4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
      DIO writes.
      
      I've tested this with fsx and everything works great.  This patch depends on my
      dio and filemap.c patches to work.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4b46fce2
  24. 15 3月, 2010 2 次提交
  25. 09 3月, 2010 1 次提交
  26. 18 12月, 2009 2 次提交
  27. 02 10月, 2009 1 次提交
  28. 12 9月, 2009 1 次提交
    • C
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason 提交于
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page become
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accoutning is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8b62b72b
  29. 01 4月, 2009 1 次提交
    • C
      Btrfs: add extra flushing for renames and truncates · 5a3f23d5
      Chris Mason 提交于
      Renames and truncates are both common ways to replace old data with new
      data.  The filesystem can make an effort to make sure the new data is
      on disk before actually replacing the old data.
      
      This is especially important for rename, which many application use as
      though it were atomic for both the data and the metadata involved.  The
      current btrfs code will happily replace a file that is fully on disk
      with one that was just created and still has pending IO.
      
      If we crash after transaction commit but before the IO is done, we'll end
      up replacing a good file with a zero length file.  The solution used
      here is to create a list of inodes that need special ordering and force
      them to disk before the commit is done.  This is similar to the
      ext3 style data=ordering, except it is only done on selected files.
      
      Btrfs is able to get away with this because it does not wait on commits
      very often, even for fsync (which use a sub-commit).
      
      For renames, we order the file when it wasn't already
      on disk and when it is replacing an existing file.  Larger files
      are sent to filemap_flush right away (before the transaction handle is
      opened).
      
      For truncates, we order if the file goes from non-zero size down to
      zero size.  This is a little different, because at the time of the
      truncate the file has no dirty bytes to order.  But, we flag the inode
      so that it is added to the ordered list on close (via release method).  We
      also immediately add it to the ordered list of the current transaction
      so that we can try to flush down any writes the application sneaks in
      before commit.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      5a3f23d5
  30. 09 12月, 2008 1 次提交
    • C
      Btrfs: move data checksumming into a dedicated tree · d20f7043
      Chris Mason 提交于
      Btrfs stores checksums for each data block.  Until now, they have
      been stored in the subvolume trees, indexed by the inode that is
      referencing the data block.  This means that when we read the inode,
      we've probably read in at least some checksums as well.
      
      But, this has a few problems:
      
      * The checksums are indexed by logical offset in the file.  When
      compression is on, this means we have to do the expensive checksumming
      on the uncompressed data.  It would be faster if we could checksum
      the compressed data instead.
      
      * If we implement encryption, we'll be checksumming the plain text and
      storing that on disk.  This is significantly less secure.
      
      * For either compression or encryption, we have to get the plain text
      back before we can verify the checksum as correct.  This makes the raid
      layer balancing and extent moving much more expensive.
      
      * It makes the front end caching code more complex, as we have touch
      the subvolume and inodes as we cache extents.
      
      * There is potentitally one copy of the checksum in each subvolume
      referencing an extent.
      
      The solution used here is to store the extent checksums in a dedicated
      tree.  This allows us to index the checksums by phyiscal extent
      start and length.  It means:
      
      * The checksum is against the data stored on disk, after any compression
      or encryption is done.
      
      * The checksum is stored in a central location, and can be verified without
      following back references, or reading inodes.
      
      This makes compression significantly faster by reducing the amount of
      data that needs to be checksummed.  It will also allow much faster
      raid management code in general.
      
      The checksums are indexed by a key with a fixed objectid (a magic value
      in ctree.h) and offset set to the starting byte of the extent.  This
      allows us to copy the checksum items into the fsync log tree directly (or
      any other tree), without having to invent a second format for them.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      d20f7043
  31. 31 10月, 2008 1 次提交
    • Y
      Btrfs: Add fallocate support v2 · d899e052
      Yan Zheng 提交于
      This patch updates btrfs-progs for fallocate support.
      
      fallocate is a little different in Btrfs because we need to tell the
      COW system that a given preallocated extent doesn't need to be
      cow'd as long as there are no snapshots of it.  This leverages the
      -o nodatacow checks.
      Signed-off-by: NYan Zheng <zheng.yan@oracle.com>
      d899e052