1. 26 4月, 2015 1 次提交
    • Y
      Btrfs: fill ->last_trans for delayed inode in btrfs_fill_inode. · 6e17d30b
      Yang Dongsheng 提交于
      We need to fill inode when we found a node for it in delayed_nodes_tree.
      But we did not fill the ->last_trans currently, it will cause the test
      of xfstest/generic/311 fail. Scenario of the 311 is shown as below:
      
      Problem:
      	(1). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(2). pwrite(test_fd, buf, 4096, 0)
      	(3). close(test_fd)
      	(4). drop_all_caches()	<-------- "echo 3 > /proc/sys/vm/drop_caches"
      	(5). test_fd = open(fname, O_RDWR|O_DIRECT)
      	(6). fsync(test_fd);
      				<-------- we did not get the correct log entry for the file
      Reason:
      	When we re-open this file in (5), we would find a node
      in delayed_nodes_tree and fill the inode we are lookup with the
      information. But the ->last_trans is not filled, then the fsync()
      will check the ->last_trans and found it's 0 then say this inode
      is already in our tree which is commited, not recording the extents
      for it.
      
      Fix:
      	This patch fill the ->last_trans properly and set the
      runtime_flags if needed in this situation. Then we can get the
      log entries we expected after (6) and generic/311 passed.
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Reviewed-by: NMiao Xie <miaoxie@huawei.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      6e17d30b
  2. 25 4月, 2015 1 次提交
    • J
      direct-io: only inc/dec inode->i_dio_count for file systems · fe0f07d0
      Jens Axboe 提交于
      do_blockdev_direct_IO() increments and decrements the inode
      ->i_dio_count for each IO operation. It does this to protect against
      truncate of a file. Block devices don't need this sort of protection.
      
      For a capable multiqueue setup, this atomic int is the only shared
      state between applications accessing the device for O_DIRECT, and it
      presents a scaling wall for that. In my testing, as much as 30% of
      system time is spent incrementing and decrementing this value. A mixed
      read/write workload improved from ~2.5M IOPS to ~9.6M IOPS, with
      better latencies too. Before:
      
      clat percentiles (usec):
       |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   34],
       | 30.00th=[   34], 40.00th=[   34], 50.00th=[   35], 60.00th=[   35],
       | 70.00th=[   35], 80.00th=[   35], 90.00th=[   37], 95.00th=[   80],
       | 99.00th=[   98], 99.50th=[  151], 99.90th=[  155], 99.95th=[  155],
       | 99.99th=[  165]
      
      After:
      
      clat percentiles (usec):
       |  1.00th=[   95],  5.00th=[  108], 10.00th=[  129], 20.00th=[  149],
       | 30.00th=[  155], 40.00th=[  161], 50.00th=[  167], 60.00th=[  171],
       | 70.00th=[  177], 80.00th=[  185], 90.00th=[  201], 95.00th=[  270],
       | 99.00th=[  390], 99.50th=[  398], 99.90th=[  418], 99.95th=[  422],
       | 99.99th=[  438]
      
      In other setups, Robert Elliott reported seeing good performance
      improvements:
      
      https://lkml.org/lkml/2015/4/3/557
      
      The more applications accessing the device, the worse it gets.
      
      Add a new direct-io flags, DIO_SKIP_DIO_COUNT, which tells
      do_blockdev_direct_IO() that it need not worry about incrementing
      or decrementing the inode i_dio_count for this caller.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Elliott, Robert (Server Storage) <elliott@hp.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      fe0f07d0
  3. 16 4月, 2015 1 次提交
  4. 13 4月, 2015 3 次提交
    • D
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Dongsheng Yang 提交于
      There are two problems in qgroup:
      
      a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
      qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
      is not available to user.
      
      b). When user is writing a inline data, qgroup will not reserve it,
      it means this is a window we can exceed the limit of a qgroup.
      
      The main idea of this patch is reserving the data size of write_bytes
      rather than the reserve_bytes. It means qgroup will not care about
      the data size btrfs will reserve for user, but only care about the
      data size user is going to write. Then reserve it when user want to
      write and release it in transaction committed.
      
      In this way, qgroup can be released from the complex procedure in
      btrfs and only do the reserve when user want to write and account
      when the data is written in commit_transaction().
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e2d1f923
    • D
      Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use. · 31193213
      Dongsheng Yang 提交于
      Currently, for pre_alloc or delay_alloc, the bytes will be accounted
      in space_info by the three guys.
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      But on the other hand, in qgroup, there are only two counters to account the
      bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
      bytes in space_info->bytes_may_use and qg->excl accounts bytes in
      space_info->used. So the bytes in space_info->reserved is not accounted
      in qgroup. If so, there is a window we can exceed the quota limit when
      bytes is in space_info->reserved.
      
      Example:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer     excl     max_rfer max_excl parent  child
      -------- ----     ----     -------- -------- ------  -----
      0/5      20987904 20987904 0        10485760 ---     ---
      
      qg->excl is 20987904 larger than max_excl 10485760.
      
      This patch introduce a new counter named may_use to qgroup, then
      there are three counters in qgroup to account bytes in space_info
      as below.
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      qgroup->may_use           --- qgroup->reserved     --- qgroup->excl
      
      With this patch applied:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer    excl    max_rfer max_excl parent  child
      -------- ----    ----    -------- -------- ------  -----
      0/5      9453568 9453568 0        10485760 ---     ---
      Reported-by: NCyril SCETBON <cyril.scetbon@free.fr>
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31193213
    • Z
      btrfs: Fix NO_SPACE bug caused by delayed-iput · d7c15171
      Zhao Lei 提交于
      Steps to reproduce:
        while true; do
          dd if=/dev/zero of=/btrfs_dir/file count=[fs_size * 75%]
          rm /btrfs_dir/file
          sync
        done
      
        And we'll see dd failed because btrfs return NO_SPACE.
      
      Reason:
        Normally, btrfs_commit_transaction() call btrfs_run_delayed_iputs()
        in end to free fs space for next write, but sometimes it hadn't
        done work on time, because btrfs-cleaner thread get delayed-iputs
        from list before, but do iput() after next write.
      
        This is log:
        [ 2569.050776] comm=btrfs-cleaner func=btrfs_evict_inode() begin
      
        [ 2569.084280] comm=sync func=btrfs_commit_transaction() call btrfs_run_delayed_iputs()
        [ 2569.085418] comm=sync func=btrfs_commit_transaction() done btrfs_run_delayed_iputs()
        [ 2569.087554] comm=sync func=btrfs_commit_transaction() end
      
        [ 2569.191081] comm=dd begin
        [ 2569.790112] comm=dd func=__btrfs_buffered_write() ret=-28
      
        [ 2569.847479] comm=btrfs-cleaner func=add_pinned_bytes() 0 + 32677888 = 32677888
        [ 2569.849530] comm=btrfs-cleaner func=add_pinned_bytes() 32677888 + 23834624 = 56512512
        ...
        [ 2569.903893] comm=btrfs-cleaner func=add_pinned_bytes() 943976448 + 21762048 = 965738496
        [ 2569.908270] comm=btrfs-cleaner func=btrfs_evict_inode() end
      
      Fix:
        Make btrfs_commit_transaction() wait current running btrfs-cleaner's
        delayed-iputs() done in end.
      
      Test:
        Use script similar to above(more complex),
        before patch:
          7 failed in 100 * 20 loop.
        after patch:
          0 failed in 100 * 20 loop.
      Signed-off-by: NZhao Lei <zhaolei@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d7c15171
  5. 12 4月, 2015 3 次提交
  6. 11 4月, 2015 4 次提交
    • J
      Btrfs: don't steal from the global reserve if we don't have the space · 3bce876f
      Josef Bacik 提交于
      btrfs_evict_inode() needs to be more careful about stealing from the
      global_rsv.  We dont' want to end up aborting commit with ENOSPC just
      because the evict_inode code was too greedy.
      Signed-off-by: NChris Mason <clm@fb.com>
      3bce876f
    • C
      Btrfs: refill block reserves during truncate · 28f75a0e
      Chris Mason 提交于
      When truncate starts, it allocates some space in the block reserves so
      that we'll have enough to update metadata along the way.
      
      For very large files, we can easily go through all of that space as we
      loop through the extents.  This changes truncate to refill the space
      reservation as it progresses through the file.
      Signed-off-by: NChris Mason <clm@fb.com>
      28f75a0e
    • J
      Btrfs: account for crcs in delayed ref processing · 1262133b
      Josef Bacik 提交于
      As we delete large extents, we end up doing huge amounts of COW in order
      to delete the corresponding crcs.  This adds accounting so that we keep
      track of that space and flushing of delayed refs so that we don't build
      up too much delayed crc work.
      
      This helps limit the delayed work that must be done at commit time and
      tries to avoid ENOSPC aborts because the crcs eat all the global
      reserves.
      Signed-off-by: NChris Mason <clm@fb.com>
      1262133b
    • C
      btrfs: actively run the delayed refs while deleting large files · 28ed1345
      Chris Mason 提交于
      When we are deleting large files with large extents, we are building up
      a huge set of delayed refs for processing.  Truncate isn't checking
      often enough to see if we need to back off and process those, or let
      a commit proceed.
      
      The end result is long stalls after the rm, and very long commit times.
      During the commits, other processes back up waiting to start new
      transactions and we get into trouble.
      Signed-off-by: NChris Mason <clm@fb.com>
      28ed1345
  7. 02 4月, 2015 1 次提交
  8. 26 3月, 2015 1 次提交
  9. 18 3月, 2015 3 次提交
    • J
      Btrfs: fix outstanding_extents accounting in DIO · e1cbbfa5
      Josef Bacik 提交于
      We are keeping track of how many extents we need to reserve properly based on
      the amount we want to write, but we were still incrementing outstanding_extents
      if we wrote less than what we requested.  This isn't quite right since we will
      be limited to our max extent size.  So instead lets do something horrible!  Keep
      track of how many outstanding_extents we reserved, and decrement each time we
      allocate an extent.  If we use our entire reserve make sure to jack up
      outstanding_extents on the inode so the accounting works out properly.  Thanks,
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      e1cbbfa5
    • J
      Btrfs: add sanity test for outstanding_extents accounting · 6a3891c5
      Josef Bacik 提交于
      I introduced a regression wrt outstanding_extents accounting.  These are tricky
      areas that aren't easily covered by xfstests as we could change MAX_EXTENT_SIZE
      at any time.  So add sanity tests to cover the various conditions that are
      tricky in order to make sure we don't introduce regressions in the future.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      6a3891c5
    • J
      Btrfs: account merges/splits properly · ba117213
      Josef Bacik 提交于
      My fix
      
      Btrfs: fix merge delalloc logic
      
      only fixed half of the problems, it didn't fix the case where we have two large
      extents on either side and then join them together with a new small extent.  We
      need to instead keep track of how many extents we have accounted for with each
      side of the new extent, and then see how many extents we need for the new large
      extent.  If they match then we know we need to keep our reservation, otherwise
      we need to drop our reservation.  This shows up with a case like this
      
      [BTRFS_MAX_EXTENT_SIZE+4K][4K HOLE][BTRFS_MAX_EXTENT_SIZE+4K]
      
      Previously the logic would have said that the number extents required for the
      new size (3) is larger than the number of extents required for the largest side
      (2) therefore we need to keep our reservation.  But this isn't the case, since
      both sides require a reservation of 2 which leads to 4 for the whole range
      currently reserved, but we only need 3, so we need to drop one of the
      reservations.  The same problem existed for splits, we'd think we only need 3
      extents when creating the hole but in reality we need 4.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      ba117213
  10. 14 3月, 2015 1 次提交
    • J
      Btrfs: fix merge delalloc logic · 8461a3de
      Josef Bacik 提交于
      My patch to properly count outstanding extents wrt MAX_EXTENT_SIZE introduced a
      regression when re-dirtying already dirty areas.  We have logic in split to make
      sure we are taking the largest space into account but didn't have it for merge,
      so it was sometimes making us think we were turning a tiny extent into a huge
      extent, when in reality we already had a huge extent and needed to use the other
      side in our logic.  This fixes the regression that was reported by a user on
      list.  Thanks,
      Reported-by: NMarkus Trippelsdorf <markus@trippelsdorf.de>
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      8461a3de
  11. 04 3月, 2015 1 次提交
  12. 03 3月, 2015 1 次提交
  13. 15 2月, 2015 3 次提交
    • J
      Btrfs: account for large extents with enospc · dcab6a3b
      Josef Bacik 提交于
      On our gluster boxes we stream large tar balls of backups onto our fses.  With
      160gb of ram this means we get really large contiguous ranges of dirty data, but
      the way our ENOSPC stuff works is that as long as it's contiguous we only hold
      metadata reservation for one extent.  The problem is we limit our extents to
      128mb, so we'll end up with at least 800 extents so our enospc accounting is
      quite a bit lower than what we need.  To keep track of this make sure we
      increase outstanding_extents for every multiple of the max extent size so we can
      be sure to have enough reserved metadata space.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      dcab6a3b
    • J
      Btrfs: don't set and clear delalloc for O_DIRECT writes · 3266789f
      Josef Bacik 提交于
      We do this to get the space accounting, but this is just needless churn on the
      io_tree, so just drop setting/clearing delalloc and just drop the reserved data
      space when we have a successfull allocation.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3266789f
    • J
      Btrfs: only adjust outstanding_extents when we do a short write · 3e05bde8
      Josef Bacik 提交于
      We have this weird dance where we always inc outstanding_extents when we do a
      O_DIRECT write, even if we allocate the entire range.  To get around this we
      also drop the metadata space if we successfully write.  This is an unnecessary
      dance, we only need to jack up outstanding_extents if we don't satisfy the
      entire range request in get_blocks_direct, otherwise we are good using our
      original reservation.  So drop the unconditional inc and the drop of the
      metadata space that we have for the unconditional inc.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      3e05bde8
  14. 03 2月, 2015 2 次提交
  15. 22 1月, 2015 4 次提交
  16. 21 1月, 2015 1 次提交
  17. 15 1月, 2015 1 次提交
  18. 03 1月, 2015 1 次提交
  19. 25 11月, 2014 1 次提交
    • F
      Btrfs: fix snapshot inconsistency after a file write followed by truncate · 9ea24bbe
      Filipe Manana 提交于
      If right after starting the snapshot creation ioctl we perform a write against a
      file followed by a truncate, with both operations increasing the file's size, we
      can get a snapshot tree that reflects a state of the source subvolume's tree where
      the file truncation happened but the write operation didn't. This leaves a gap
      between 2 file extent items of the inode, which makes btrfs' fsck complain about it.
      
      For example, if we perform the following file operations:
      
          $ mkfs.btrfs -f /dev/vdd
          $ mount /dev/vdd /mnt
          $ xfs_io -f \
                -c "pwrite -S 0xaa -b 32K 0 32K" \
                -c "fsync" \
                -c "pwrite -S 0xbb -b 32770 16K 32770" \
                -c "truncate 90123" \
                /mnt/foobar
      
      and the snapshot creation ioctl was just called before the second write, we often
      can get the following inode items in the snapshot's btree:
      
              item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
                      inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
              item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
                      inode ref index 282 namelen 10 name: foobar
              item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
                      extent data disk byte 1104855040 nr 32768
                      extent data offset 0 nr 32768 ram 32768
                      extent compression 0
              item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
                      extent data disk byte 0 nr 0
                      extent data offset 0 nr 40960 ram 40960
                      extent compression 0
      
      There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[
      for which there's no file extent item covering it. This is because the file write
      and file truncate operations happened both right after the snapshot creation ioctl
      called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the
      ordered extent that matches the write and, in btrfs_setsize(), we were able to call
      btrfs_cont_expand() before being able to commit the current transaction in the
      snapshot creation ioctl. So this made it possibe to insert the hole file extent
      item in the source subvolume (which represents the region added by the truncate)
      right before the transaction commit from the snapshot creation ioctl.
      
      Btrfs' fsck tool complains about such cases with a message like the following:
      
          "root 331 inode 257 errors 100, file extent discount"
      
      >From a user perspective, the expectation when a snapshot is created while those
      file operations are being performed is that the snapshot will have a file that
      either:
      
      1) is empty
      2) only the first write was captured
      3) only the 2 writes were captured
      4) both writes and the truncation were captured
      
      But never capture a state where only the first write and the truncation were
      captured (since the second write was performed before the truncation).
      
      A test case for xfstests follows.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      9ea24bbe
  20. 22 11月, 2014 1 次提交
    • F
      Btrfs: ensure ordered extent errors aren't missed on fsync · b38ef71c
      Filipe Manana 提交于
      When doing a fsync with a fast path we have a time window where we can miss
      the fact that writeback of some file data failed, and therefore we endup
      returning success (0) from fsync when we should return an error.
      The steps that lead to this are the following:
      
      1) We start all ordered extents by calling filemap_fdatawrite_range();
      
      2) We do some other work like locking the inode's i_mutex, start a transaction,
         start a log transaction, etc;
      
      3) We enter btrfs_log_inode(), acquire the inode's log_mutex and collect all the
         ordered extents from inode's ordered tree into a list;
      
      4) But by the time we do ordered extent collection, some ordered extents we started
         at step 1) might have already completed with an error, and therefore we didn't
         found them in the ordered tree and had no idea they finished with an error. This
         makes our fsync return success (0) to userspace, but has no bad effects on the log
         like for example insertion of file extent items into the log that point to unwritten
         extents, because the invalid extent maps were removed before the ordered extent
         completed (in inode.c:btrfs_finish_ordered_io).
      
      So after collecting the ordered extents just check if the inode's i_mapping has any
      error flags set (AS_EIO or AS_ENOSPC) and leave with an error if it does. Whenever
      writeback fails for a page of an ordered extent, we call mapping_set_error (done in
      extent_io.c:end_extent_writepage, called by extent_io.c:end_bio_extent_writepage)
      that sets one of those error flags in the inode's i_mapping flags.
      
      This change also has the side effect of fixing the issue where for fast fsyncs we
      never checked/cleared the error flags from the inode's i_mapping flags, which means
      that a full fsync performed after a fast fsync could get such errors that belonged
      to the fast fsync - because the full fsync calls btrfs_wait_ordered_range() which
      calls filemap_fdatawait_range(), and the later checks for and clears those flags,
      while for fast fsyncs we never call filemap_fdatawait_range() or anything else
      that checks for and clears the error flags from the inode's i_mapping.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      b38ef71c
  21. 21 11月, 2014 5 次提交
    • F
      Btrfs: report error after failure inlining extent in compressed write path · e6eb4314
      Filipe Manana 提交于
      If cow_file_range_inline() failed, when called from compress_file_range(),
      we were tagging the locked page for writeback, end its writeback and unlock it,
      but not marking it with an error nor setting AS_EIO in inode's mapping flags.
      
      This made it impossible for a caller of filemap_fdatawrite_range (writepages)
      or filemap_fdatawait_range() to know that an error happened. And the return
      value of compress_file_range() is useless because it's returned to a workqueue
      task and not to the task calling filemap_fdatawrite_range (writepages).
      
      This change applies on top of the previous patchset starting at the patch
      titled:
      
          "[1/5] Btrfs: set page and mapping error on compressed write failure"
      
      Which changed extent_clear_unlock_delalloc() to use SetPageError and
      mapping_set_error().
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e6eb4314
    • F
      Btrfs: add helper btrfs_fdatawrite_range · 728404da
      Filipe Manana 提交于
      To avoid duplicating this double filemap_fdatawrite_range() call for
      inodes with async extents (compressed writes) so often.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      728404da
    • F
      Btrfs: correctly flush compressed data before/after direct IO · 075bdbdb
      Filipe Manana 提交于
      For compressed writes, after doing the first filemap_fdatawrite_range() we
      don't get the pages tagged for writeback immediately. Instead we create
      a workqueue task, which is run by other kthread, and keep the pages locked.
      That other kthread compresses data, creates the respective ordered extent/s,
      tags the pages for writeback and unlocks them. Therefore we need a second
      call to filemap_fdatawrite_range() if we have compressed writes, as this
      second call will wait for the pages to become unlocked, then see they became
      tagged for writeback and finally wait for the writeback to finish.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      075bdbdb
    • F
      Btrfs: make inode.c:compress_file_range() return void · c44f649e
      Filipe Manana 提交于
      Its return value is useless, its single caller ignores it and can't do
      anything with it anyway, since it's a workqueue task and not the task
      calling filemap_fdatawrite_range (writepages) nor filemap_fdatawait_range().
      Failure is communicated to such functions via start and end of writeback
      with the respective pages tagged with an error and AS_EIO flag set in the
      inode's imapping.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      c44f649e
    • S
      Btrfs: fix incorrect compression ratio detection · 4bcbb332
      Shilong Wang 提交于
      Steps to reproduce:
       # mkfs.btrfs -f /dev/sdb
       # mount -t btrfs /dev/sdb /mnt -o compress=lzo
       # dd if=/dev/zero of=/mnt/data bs=$((33*4096)) count=1
      
      after previous steps, inode will be detected as bad compression ratio,
      and NOCOMPRESS flag will be set for that inode.
      
      Reason is that compress have a max limit pages every time(128K), if a
      132k write in, it will be splitted into two write(128k+4k), this bug
      is a leftover for commit 68bb462d(Btrfs: don't compress for a small write)
      
      Fix this problem by checking every time before compression, if it is a
      small write(<=blocksize), we bail out and fall into nocompression directly.
      Signed-off-by: NWang Shilong <wangshilong1991@gmail.com>
      Reviewed-by: NMiao Xie <miaox@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4bcbb332