1. 30 5月, 2016 1 次提交
    • F
      Btrfs: fix race setting block group readonly during device replace · f0e9b7d6
      Filipe Manana 提交于
      When we do a device replace, for each device extent we find from the
      source device, we set the corresponding block group to readonly mode to
      prevent writes into it from happening while we are copying the device
      extent from the source to the target device. However just before we set
      the block group to readonly mode some concurrent task might have already
      allocated an extent from it or decided it could perform a nocow write
      into one of its extents, which can make the device replace process to
      miss copying an extent since it uses the extent tree's commit root to
      search for extents and only once it finishes searching for all extents
      belonging to the block group it does set the left cursor to the logical
      end address of the block group - this is a problem if the respective
      ordered extents finish while we are searching for extents using the
      extent tree's commit root and no transaction commit happens while we
      are iterating the tree, since it's the delayed references created by the
      ordered extents (when they complete) that insert the extent items into
      the extent tree (using the non-commit root of course).
      Example:
      
                CPU 1                                            CPU 2
      
       btrfs_dev_replace_start()
         btrfs_scrub_dev()
           scrub_enumerate_chunks()
             --> finds device extent belonging
                 to block group X
      
                                     <transaction N starts>
      
                                                            starts buffered write
                                                            against some inode
      
                                                            writepages is run against
                                                            that inode forcing dellaloc
                                                            to run
      
                                                            btrfs_writepages()
                                                              extent_writepages()
                                                                extent_write_cache_pages()
                                                                  __extent_writepage()
                                                                    writepage_delalloc()
                                                                      run_delalloc_range()
                                                                        cow_file_range()
                                                                          btrfs_reserve_extent()
                                                                            --> allocates an extent
                                                                                from block group X
                                                                                (which is not yet
                                                                                 in RO mode)
                                                                          btrfs_add_ordered_extent()
                                                                            --> creates ordered extent Y
                                                              flush_epd_write_bio()
                                                                --> bio against the extent from
                                                                    block group X is submitted
      
             btrfs_inc_block_group_ro(bg X)
               --> sets block group X to readonly
      
             scrub_chunk(bg X)
               scrub_stripe(device extent from srcdev)
                 --> keeps searching for extent items
                     belonging to the block group using
                     the extent tree's commit root
                 --> it never blocks due to
                     fs_info->scrub_pause_req as no
                     one tries to commit transaction N
                 --> copies all extents found from the
                     source device into the target device
                 --> finishes search loop
      
                                                              bio completes
      
                                                              ordered extent Y completes
                                                              and creates delayed data
                                                              reference which will add an
                                                              extent item to the extent
                                                              tree when run (typically
                                                              at transaction commit time)
      
                                                                --> so the task doing the
                                                                    scrub/device replace
                                                                    at CPU 1 misses this
                                                                    and does not copy this
                                                                    extent into the new/target
                                                                    device
      
             btrfs_dec_block_group_ro(bg X)
               --> turns block group X back to RW mode
      
             dev_replace->cursor_left is set to the
             logical end offset of block group X
      
      So fix this by waiting for all cow and nocow writes after setting a block
      group to readonly mode.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      f0e9b7d6
  2. 26 5月, 2016 1 次提交
  3. 13 5月, 2016 1 次提交
  4. 22 10月, 2015 1 次提交
    • J
      Btrfs: change how we wait for pending ordered extents · 161c3549
      Josef Bacik 提交于
      We have a mechanism to make sure we don't lose updates for ordered extents that
      were logged in the transaction that is currently running.  We add the ordered
      extent to a transaction list and then the transaction waits on all the ordered
      extents in that list.  However are substantially large file systems this list
      can be extremely large, and can give us soft lockups, since the ordered extents
      don't remove themselves from the list when they do complete.
      
      To fix this we simply add a counter to the transaction that is incremented any
      time we have a logged extent that needs to be completed in the current
      transaction.  Then when the ordered extent finally completes it decrements the
      per transaction counter and wakes up the transaction if we are the last ones.
      This will eliminate the softlockup.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      161c3549
  5. 10 6月, 2015 1 次提交
    • F
      Btrfs: avoid syncing log in the fast fsync path when not necessary · b659ef02
      Filipe Manana 提交于
      Commit 3a8b36f3 ("Btrfs: fix data loss in the fast fsync path") added
      a performance regression for that causes an unnecessary sync of the log
      trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
      against a file, without no writes or any metadata updates to the inode in
      between them and if a transaction is committed before the second fsync is
      called.
      
      Huang Ying reported this to lkml (https://lkml.org/lkml/2015/3/18/99)
      after a test sysbench test that measured a -62% decrease of file io
      requests per second for that tests' workload.
      
      The test is:
      
        echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
        echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
        mkfs -t btrfs /dev/sda2
        mount -t btrfs /dev/sda2 /fs/sda2
        cd /fs/sda2
        for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
        sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
          --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
          --file-num=1024 run
      
      A test on kvm guest, running a debug kernel gave me the following results:
      
      Without 3a8b36f3:             16.01 reqs/sec
      With 3a8b36f3:                 3.39 reqs/sec
      With 3a8b36f3 and this patch: 16.04 reqs/sec
      Reported-by: NHuang Ying <ying.huang@intel.com>
      Tested-by: NHuang, Ying <ying.huang@intel.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b659ef02
  6. 03 6月, 2015 1 次提交
    • L
      Btrfs: remove csum_bytes_left · 0c304304
      Liu Bo 提交于
      After commit 8407f553
      ("Btrfs: fix data corruption after fast fsync and writeback error"),
      during wait_ordered_extents(), we wait for ordered extent setting
      BTRFS_ORDERED_IO_DONE or BTRFS_ORDERED_IOERR, at which point we've
      already got checksum information, so we don't need to check
      (csum_bytes_left == 0) in the whole logging path.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0c304304
  7. 22 11月, 2014 2 次提交
    • F
      Btrfs: collect only the necessary ordered extents on ranged fsync · 0870295b
      Filipe Manana 提交于
      Instead of collecting all ordered extents from the inode's ordered tree
      and then wait for all of them to complete, just collect the ones that
      overlap the fsync range.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      0870295b
    • J
      Btrfs: make sure logged extents complete in the current transaction V3 · 50d9aa99
      Josef Bacik 提交于
      Liu Bo pointed out that my previous fix would lose the generation update in the
      scenario I described.  It is actually much worse than that, we could lose the
      entire extent if we lose power right after the transaction commits.  Consider
      the following
      
      write extent 0-4k
      log extent in log tree
      commit transaction
      	< power fail happens here
      ordered extent completes
      
      We would lose the 0-4k extent because it hasn't updated the actual fs tree, and
      the transaction commit will reset the log so it isn't replayed.  If we lose
      power before the transaction commit we are save, otherwise we are not.
      
      Fix this by keeping track of all extents we logged in this transaction.  Then
      when we go to commit the transaction make sure we wait for all of those ordered
      extents to complete before proceeding.  This will make sure that if we lose
      power after the transaction commit we still have our data.  This also fixes the
      problem of the improperly updated extent generation.  Thanks,
      
      cc: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      50d9aa99
  8. 15 8月, 2014 1 次提交
    • C
      btrfs: disable strict file flushes for renames and truncates · 8d875f95
      Chris Mason 提交于
      Truncates and renames are often used to replace old versions of a file
      with new versions.  Applications often expect this to be an atomic
      replacement, even if they haven't done anything to make sure the new
      version is fully on disk.
      
      Btrfs has strict flushing in place to make sure that renaming over an
      old file with a new file will fully flush out the new file before
      allowing the transaction commit with the rename to complete.
      
      This ordering means the commit code needs to be able to lock file pages,
      and there are a few paths in the filesystem where we will try to end a
      transaction with the page lock held.  It's rare, but these things can
      deadlock.
      
      This patch removes the ordered flushes and switches to a best effort
      filemap_flush like ext4 uses. It's not perfect, but it should fix the
      deadlocks.
      Signed-off-by: NChris Mason <clm@fb.com>
      8d875f95
  9. 11 3月, 2014 4 次提交
  10. 12 11月, 2013 2 次提交
  11. 21 9月, 2013 1 次提交
  12. 01 9月, 2013 1 次提交
    • J
      Btrfs: allow partial ordered extent completion · 77cef2ec
      Josef Bacik 提交于
      We currently have this problem where you can truncate pages that have not yet
      been written for an ordered extent.  We do this because the truncate will be
      coming behind to clean us up anyway so what's the harm right?  Well if truncate
      fails for whatever reason we leave an orphan item around for the file to be
      cleaned up later.  But if the user goes and truncates up the file and tries to
      read from the area that had been discarded previously they will get a csum error
      because we never actually wrote that data out.
      
      This patch fixes this by allowing us to either discard the ordered extent
      completely, by which I mean we just free up the space we had allocated and not
      add the file extent, or adjust the length of the file extent we write.  We do
      this by setting the length we truncated down to in the ordered extent, and then
      we set the file extent length and ram bytes to this length.  The total disk
      space stays unchanged since we may be compressed and we can't just chop off the
      disk space, but at least this way the file extent only points to the valid data.
      Then when the file extent is free'd the extent and csums will be freed normally.
      
      This patch is needed for the next series which will give us more graceful
      recovery of failed truncates.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      Signed-off-by: NChris Mason <chris.mason@fusionio.com>
      77cef2ec
  13. 02 7月, 2013 1 次提交
    • M
      Btrfs: remove btrfs_sector_sum structure · f51a4a18
      Miao Xie 提交于
      Using the structure btrfs_sector_sum to keep the checksum value is
      unnecessary, because the extents that btrfs_sector_sum points to are
      continuous, we can find out the expected checksums by btrfs_ordered_sum's
      bytenr and the offset, so we can remove btrfs_sector_sum's bytenr. After
      removing bytenr, there is only one member in the structure, so it makes
      no sense to keep the structure, just remove it, and use a u32 array to
      store the checksum value.
      
      By this change, we don't use the while loop to get the checksums one by
      one. Now, we can get several checksum value at one time, it improved the
      performance by ~74% on my SSD (31MB/s -> 54MB/s).
      
      test command:
       # dd if=/dev/zero of=/mnt/btrfs/file0 bs=1M count=1024 oflag=sync
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      f51a4a18
  14. 14 6月, 2013 1 次提交
  15. 07 5月, 2013 1 次提交
  16. 21 2月, 2013 1 次提交
    • J
      Btrfs: place ordered operations on a per transaction list · 569e0f35
      Josef Bacik 提交于
      Miao made the ordered operations stuff run async, which introduced a
      deadlock where we could get somebody (sync) racing in and committing the
      transaction while a commit was already happening.  The new committer would
      try and flush ordered operations which would hang waiting for the commit to
      finish because it is done asynchronously and no longer inherits the callers
      trans handle.  To fix this we need to make the ordered operations list a per
      transaction list.  We can get new inodes added to the ordered operation list
      by truncating them and then having another process writing to them, so this
      makes it so that anybody trying to add an ordered operation _must_ start a
      transaction in order to add itself to the list, which will keep new inodes
      from getting added to the ordered operations list after we start committing.
      This should fix the deadlock and also keeps us from doing a lot more work
      than we need to during commit.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      569e0f35
  17. 20 2月, 2013 1 次提交
    • J
      Btrfs: wait on ordered extents at the last possible moment · 2ab28f32
      Josef Bacik 提交于
      Since we don't actually copy the extent information from the source tree in
      the fast case we don't need to wait for ordered io to be completed in order
      to fsync, we just need to wait for the io to be completed.  So when we're
      logging our file just attach all of the ordered extents to the log, and then
      when the log syncs just wait for IO_DONE on the ordered extents and then
      write the super.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fusionio.com>
      2ab28f32
  18. 12 12月, 2012 2 次提交
  19. 19 11月, 2012 1 次提交
  20. 04 10月, 2012 1 次提交
  21. 02 10月, 2012 2 次提交
    • M
      Btrfs: use a slab for ordered extents allocation · 6352b91d
      Miao Xie 提交于
      The ordered extent allocation is in the fast path of the IO, so use a slab
      to improve the speed of the allocation.
      
       "Size of the struct is 280, so this will fall into the size-512 bucket,
        giving 8 objects per page, while own slab will pack 14 objects into a page.
      
        Another benefit I see is to check for leaked objects when the module is
        removed (and the cache destroy takes place)."
      						-- David Sterba
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      6352b91d
    • M
      Btrfs: fix file extent discount problem in the, snapshot · b9a8cc5b
      Miao Xie 提交于
      If a snapshot is created while we are writing some data into the file,
      the i_size of the corresponding file in the snapshot will be wrong, it will
      be beyond the end of the last file extent. And btrfsck will report:
        root 256 inode 257 errors 100
      
      Steps to reproduce:
       # mkfs.btrfs <partition>
       # mount <partition> <mnt>
       # cd <mnt>
       # dd if=/dev/zero of=tmpfile bs=4M count=1024 &
       # for ((i=0; i<4; i++))
       > do
       > btrfs sub snap . $i
       > done
      
      This because the algorithm of disk_i_size update is wrong. Though there are
      some ordered extents behind the current one which we use to update disk_i_size,
      it doesn't mean those extents will be dealt with in the same transaction. So
      We shouldn't use the offset of those extents to update disk_i_size. Or we will
      get the wrong i_size in the snapshot.
      
      We fix this problem by recording the max real i_size. If we find there is a
      ordered extent which is in front of the current one and doesn't complete, we
      will record the end of the current one into that ordered extent. Surely, if
      the current extent holds the end of other extent(it must be greater than
      the current one because it is behind the current one), we will record the
      number that the current extent holds. In this way, we can exclude the ordered
      extents that may not be dealth with in the same transaction, and be easy to
      know the real disk_i_size.
      Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com>
      b9a8cc5b
  22. 30 5月, 2012 1 次提交
    • J
      Btrfs: finish ordered extents in their own thread · 5fd02043
      Josef Bacik 提交于
      We noticed that the ordered extent completion doesn't really rely on having
      a page and that it could be done independantly of ending the writeback on a
      page.  This patch makes us not do the threaded endio stuff for normal
      buffered writes and direct writes so we can end page writeback as soon as
      possible (in irq context) and only start threads to do the ordered work when
      it is actually done.  Compression needs to be reworked some to take
      advantage of this as well, but atm it has to do a find_get_page in its endio
      handler so it must be done in its own thread.  This makes direct writes
      quite a bit faster.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      5fd02043
  23. 22 3月, 2012 1 次提交
  24. 22 12月, 2010 1 次提交
  25. 29 11月, 2010 1 次提交
  26. 25 5月, 2010 1 次提交
    • J
      Btrfs: add basic DIO read/write support · 4b46fce2
      Josef Bacik 提交于
      This provides basic DIO support for reading and writing.  It does not do the
      work to recover from mismatching checksums, that will come later.  A few design
      changes have been made from Jim's code (sorry Jim!)
      
      1) Use the generic direct-io code.  Jim originally re-wrote all the generic DIO
      code in order to account for all of BTRFS's oddities, but thanks to that work it
      seems like the best bet is to just ignore compression and such and just opt to
      fallback on buffered IO.
      
      2) Fallback on buffered IO for compressed or inline extents.  Jim's code did
      it's own buffering to make dio with compressed extents work.  Now we just
      fallback onto normal buffered IO.
      
      3) Use ordered extents for the writes so that all of the
      
      lock_extent()
      lookup_ordered()
      
      type checks continue to work.
      
      4) Do the lock_extent() lookup_ordered() loop in readpage so we don't race with
      DIO writes.
      
      I've tested this with fsx and everything works great.  This patch depends on my
      dio and filemap.c patches to work.  Thanks,
      Signed-off-by: NJosef Bacik <josef@redhat.com>
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      4b46fce2
  27. 15 3月, 2010 2 次提交
  28. 09 3月, 2010 1 次提交
  29. 18 12月, 2009 2 次提交
  30. 02 10月, 2009 1 次提交
  31. 12 9月, 2009 1 次提交
    • C
      Btrfs: Use PagePrivate2 to track pages in the data=ordered code. · 8b62b72b
      Chris Mason 提交于
      Btrfs writes go through delalloc to the data=ordered code.  This
      makes sure that all of the data is on disk before the metadata
      that references it.  The tracking means that we have to make sure
      each page in an extent is fully written before we add that extent into
      the on-disk btree.
      
      This was done in the past by setting the EXTENT_ORDERED bit for the
      range of an extent when it was added to the data=ordered code, and then
      clearing the EXTENT_ORDERED bit in the extent state tree as each page
      finished IO.
      
      One of the reasons we had to do this was because sometimes pages are
      magically dirtied without page_mkwrite being called.  The EXTENT_ORDERED
      bit is checked at writepage time, and if it isn't there, our page become
      dirty without going through the proper path.
      
      These bit operations make for a number of rbtree searches for each page,
      and can cause considerable lock contention.
      
      This commit switches from the EXTENT_ORDERED bit to use PagePrivate2.
      As pages go into the ordered code, PagePrivate2 is set on each one.
      This is a cheap operation because we already have all the pages locked
      and ready to go.
      
      As IO finishes, the PagePrivate2 bit is cleared and the ordered
      accoutning is updated for each page.
      
      At writepage time, if the PagePrivate2 bit is missing, we go into the
      writepage fixup code to handle improperly dirtied pages.
      Signed-off-by: NChris Mason <chris.mason@oracle.com>
      8b62b72b