1. 28 10月, 2016 1 次提交
    • C
      block: better op and flags encoding · ef295ecf
      Christoph Hellwig 提交于
      Now that we don't need the common flags to overflow outside the range
      of a 32-bit type we can encode them the same way for both the bio and
      request fields.  This in addition allows us to place the operation
      first (and make some room for more ops while we're at it) and to
      stop having to shift around the operation values.
      
      In addition this allows passing around only one value in the block layer
      instead of two (and eventuall also in the file systems, but we can do
      that later) and thus clean up a lot of code.
      
      Last but not least this allows decreasing the size of the cmd_flags
      field in struct request to 32-bits.  Various functions passing this
      value could also be updated, but I'd like to avoid the churn for now.
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      ef295ecf
  2. 11 10月, 2016 1 次提交
    • A
      [btrfs] fix check_direct_IO() for non-iovec iterators · cd27e455
      Al Viro 提交于
      looking for duplicate ->iov_base makes sense only for
      iovec-backed iterators; for kvec-backed ones it's pointless,
      for bvec-backed ones it's pointless and broken on 32bit (we
      walk through an array of struct bio_vec accessing them as if
      they were struct iovec; works by accident on 64bit, but on
      32bit it'll blow up) and for pipe-backed ones it's pointless
      and ends up oopsing.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cd27e455
  3. 08 10月, 2016 1 次提交
  4. 28 9月, 2016 1 次提交
  5. 27 9月, 2016 3 次提交
  6. 26 9月, 2016 2 次提交
  7. 22 9月, 2016 1 次提交
  8. 16 9月, 2016 1 次提交
  9. 14 9月, 2016 1 次提交
  10. 25 8月, 2016 1 次提交
    • W
      btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang 提交于
      This patch can fix some false ENOSPC errors, below test script can
      reproduce one false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      Above script will fail for ENOSPC reason, but indeed fs still has free
      space to satisfy this request. Please see call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      In CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
      operation will be delayed to be done in extent_clear_unlock_delalloc().
      Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
      btrfs_check_data_free_space() tries to reserve 100MB data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allcate new data chunk or call
      btrfs_start_delalloc_roots(), or commit current transaction in order to
      reserve some free space, obviously a lot of work. But indeed it's not
      necessary as long as decreasing bytes_may_use timely, we still have
      free space, decreasing 128M from bytes_may_use.
      
      To fix this issue, this patch chooses to update bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For compress path, real
      extent length may not be equal to file content length, so introduce a
      ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
      file content length. Then compress path can update bytes_may_use
      correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
      and RESERVE_FREE.
      
      As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
      run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
      PREALLOC, we also need to update bytes_may_use, but can not pass
      EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
      here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
      to update btrfs_space_info's bytes_may_use.
      
      Meanwhile __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both sucessful and failed
      path, btrfs_prealloc_file_range()'s callers does not need to call
      btrfs_free_reserved_data_space() any more.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      18513091
  11. 08 8月, 2016 1 次提交
    • J
      block: rename bio bi_rw to bi_opf · 1eff9d32
      Jens Axboe 提交于
      Since commit 63a4cc24, bio->bi_rw contains flags in the lower
      portion and the op code in the higher portions. This means that
      old code that relies on manually setting bi_rw is most likely
      going to be broken. Instead of letting that brokeness linger,
      rename the member, to force old and out-of-tree code to break
      at compile time instead of at runtime.
      
      No intended functional changes in this commit.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      1eff9d32
  12. 01 8月, 2016 2 次提交
    • F
      Btrfs: improve performance on fsync against new inode after rename/unlink · 44f714da
      Filipe Manana 提交于
      With commit 56f23fdb ("Btrfs: fix file/data loss caused by fsync after
      rename and new inode") we got simple fix for a functional issue when the
      following sequence of actions is done:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        move/rename existing file A from directory D to directory E
        create a new file named A at directory D
        fsync the new file
        power fail
      
      The solution was to simply detect such scenario and fallback to a full
      transaction commit when we detect it. However this turned out to had a
      significant impact on throughput (and a bit on latency too) for benchmarks
      using the dbench tool, which simulates real workloads from smbd (Samba)
      servers. For example on a test vm (with a debug kernel):
      
      Unpatched:
      Throughput 19.1572 MB/sec  32 clients  32 procs  max_latency=1005.229 ms
      
      Patched:
      Throughput 23.7015 MB/sec  32 clients  32 procs  max_latency=809.206 ms
      
      The patched results (this patch is applied) are similar to the results of
      a kernel with the commit 56f23fdb ("Btrfs: fix file/data loss caused
      by fsync after rename and new inode") reverted.
      
      This change avoids the fallback to a transaction commit and instead makes
      sure all the names of the conflicting inode (the one that had a name in a
      past transaction that matches the name of the new file in the same parent
      directory) are logged so that at log replay time we don't lose neither the
      new file nor the old file, and the old file gets the name it was renamed
      to.
      
      This also ends up avoiding a full transaction commit for a similar case
      that involves an unlink instead of a rename of the old file:
      
        at transaction N
        create file A at directory D
        at transaction N + M (where M >= 1)
        remove file A
        create a new file named A at directory D
        fsync the new file
        power fail
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      44f714da
    • F
      Btrfs: be more precise on errors when getting an inode from disk · 67710892
      Filipe Manana 提交于
      When we attempt to read an inode from disk, we end up always returning an
      -ESTALE error to the caller regardless of the actual failure reason, which
      can be an out of memory problem (when allocating a path), some error found
      when reading from the fs/subvolume btree (like a genuine IO error) or the
      inode does not exists. So lets start returning the real error code to the
      callers so that they don't treat all -ESTALE errors as meaning that the
      inode does not exists (such as during orphan cleanup). This will also be
      needed for a subsequent patch in the same series dealing with a special
      fsync case.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      67710892
  13. 26 7月, 2016 8 次提交
  14. 08 7月, 2016 1 次提交
    • J
      Btrfs: fix callers of btrfs_block_rsv_migrate · 25d609f8
      Josef Bacik 提交于
      So btrfs_block_rsv_migrate just unconditionally calls block_rsv_migrate_bytes.
      Not only this but it unconditionally changes the size of the block_rsv.  This
      isn't a bug strictly speaking, but it makes truncate block rsv's look funny
      because every time we migrate bytes over its size grows, even though we only
      want it to be a specific size.  So collapse this into one function that takes an
      update_size argument and make truncate and evict not update the size for
      consistency sake.  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      25d609f8
  15. 25 6月, 2016 1 次提交
    • O
      Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes · 02dbfc99
      Omar Sandoval 提交于
      Commit fe742fd4 ("Revert "btrfs: switch to ->iterate_shared()"")
      backed out the conversion to ->iterate_shared() for Btrfs because the
      delayed inode handling in btrfs_real_readdir() is racy. However, we can
      still do readdir in parallel if there are no delayed nodes.
      
      This is a temporary fix which upgrades the shared inode lock to an
      exclusive lock only when we have delayed items until we come up with a
      more complete solution. While we're here, rename the
      btrfs_{get,put}_delayed_items functions to make it very clear that
      they're just for readdir.
      
      Tested with xfstests and by doing a parallel kernel build:
      
      	while make tinyconfig && make -j4 && git clean dqfx; do
      		:
      	done
      
      along with a bunch of parallel finds in another shell:
      
      	while true; do
      		for ((i=0; i<4; i++)); do
      			find . >/dev/null &
      		done
      		wait
      	done
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      02dbfc99
  16. 23 6月, 2016 1 次提交
    • J
      Btrfs: track transid for delayed ref flushing · 31b9655f
      Josef Bacik 提交于
      Using the offwakecputime bpf script I noticed most of our time was spent waiting
      on the delayed ref throttling.  This is what is supposed to happen, but
      sometimes the transaction can commit and then we're waiting for throttling that
      doesn't matter anymore.  So change this stuff to be a little smarter by tracking
      the transid we were in when we initiated the throttling.  If the transaction we
      get is different then we can just bail out.  This resulted in a 50% speedup in
      my fs_mark test, and reduced the amount of time spent throttling by 60 seconds
      over the entire run (which is about 30 minutes).  Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31b9655f
  17. 18 6月, 2016 1 次提交
  18. 08 6月, 2016 5 次提交
  19. 04 6月, 2016 1 次提交
    • C
      Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent · 8dff9c85
      Chris Mason 提交于
      When dealing with inline extents, btrfs_get_extent will incorrectly try
      to insert a duplicate extent_map.  The dup hits -EEXIST from
      add_extent_map, but then we try to merge with the existing one and end
      up trying to insert a zero length extent_map.
      
      This actually works most of the time, except when there are extent maps
      past the end of the inline extent.  rocksdb will trigger this sometimes
      because it preallocates an extent and then truncates down.
      
      Josef made a script to trigger with xfs_io:
      
      	#!/bin/bash
      
      	xfs_io -f -c "pwrite 0 1000" inline
      	xfs_io -c "falloc -k 4k 1M" inline
      	xfs_io -c "pread 0 1000" -c "fadvise -d 0 1000" -c "pread 0 1000" inline
      	xfs_io -c "fadvise -d 0 1000" inline
      	cat inline
      
      You'll get EIOs trying to read inline after this because add_extent_map
      is returning EEXIST
      Signed-off-by: NChris Mason <clm@fb.com>
      8dff9c85
  20. 26 5月, 2016 1 次提交
  21. 19 5月, 2016 1 次提交
  22. 18 5月, 2016 1 次提交
  23. 13 5月, 2016 3 次提交
    • F
      Btrfs: add semaphore to synchronize direct IO writes with fsync · 5f9a8a51
      Filipe Manana 提交于
      Due to the optimization of lockless direct IO writes (the inode's i_mutex
      is not held) introduced in commit 38851cc1 ("Btrfs: implement unlocked
      dio write"), we started having races between such writes with concurrent
      fsync operations that use the fast fsync path. These races were addressed
      in the patches titled "Btrfs: fix race between fsync and lockless direct
      IO writes" and "Btrfs: fix race between fsync and direct IO writes for
      prealloc extents". The races happened because the direct IO path, like
      every other write path, does create extent maps followed by the
      corresponding ordered extents while the fast fsync path collected first
      ordered extents and then it collected extent maps. This made it possible
      to log file extent items (based on the collected extent maps) without
      waiting for the corresponding ordered extents to complete (get their IO
      done). The two fixes mentioned before added a solution that consists of
      making the direct IO path create first the ordered extents and then the
      extent maps, while the fsync path attempts to collect any new ordered
      extents once it collects the extent maps. This was simple and did not
      require adding any synchonization primitive to any data structure (struct
      btrfs_inode for example) but it makes things more fragile for future
      development endeavours and adds an exceptional approach compared to the
      other write paths.
      
      This change adds a read-write semaphore to the btrfs inode structure and
      makes the direct IO path create the extent maps and the ordered extents
      while holding read access on that semaphore, while the fast fsync path
      collects extent maps and ordered extents while holding write access on
      that semaphore. The logic for direct IO write path is encapsulated in a
      new helper function that is used both for cow and nocow direct IO writes.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      5f9a8a51
    • F
      Btrfs: fix race between block group relocation and nocow writes · f78c436c
      Filipe Manana 提交于
      Relocation of a block group waits for all existing tasks flushing
      dellaloc, starting direct IO writes and any ordered extents before
      starting the relocation process. However for direct IO writes that end
      up doing nocow (inode either has the flag nodatacow set or the write is
      against a prealloc extent) we have a short time window that allows for a
      race that makes relocation proceed without waiting for the direct IO
      write to complete first, resulting in data loss after the relocation
      finishes. This is illustrated by the following diagram:
      
                 CPU 1                                     CPU 2
      
       btrfs_relocate_block_group(bg X)
      
                                                     direct IO write starts against
                                                     an extent in block group X
                                                     using nocow mode (inode has the
                                                     nodatacow flag or the write is
                                                     for a prealloc extent)
      
                                                     btrfs_direct_IO()
                                                       btrfs_get_blocks_direct()
                                                         --> can_nocow_extent() returns 1
      
         btrfs_inc_block_group_ro(bg X)
           --> turns block group into RO mode
      
         btrfs_wait_ordered_roots()
           --> returns and does not know about
               the DIO write happening at CPU 2
               (the task there has not created
                yet an ordered extent)
      
         relocate_block_group(bg X)
           --> rc->stage == MOVE_DATA_EXTENTS
      
           find_next_extent()
             --> returns extent that the DIO
                 write is going to write to
      
           relocate_data_extent()
      
             relocate_file_extent_cluster()
      
               --> reads the extent from disk into
                   pages belonging to the relocation
                   inode and dirties them
      
                                                         --> creates DIO ordered extent
      
                                                       btrfs_submit_direct()
                                                         --> submits bio against a location
                                                             on disk obtained from an extent
                                                             map before the relocation started
      
         btrfs_wait_ordered_range()
           --> writes all the pages read before
               to disk (belonging to the
               relocation inode)
      
         relocation finishes
      
                                                       bio completes and wrote new data
                                                       to the old location of the block
                                                       group
      
      So fix this by tracking the number of nocow writers for a block group and
      make sure relocation waits for that number to go down to 0 before starting
      to move the extents.
      
      The same race can also happen with buffered writes in nocow mode since the
      patch I recently made titled "Btrfs: don't do unnecessary delalloc flushes
      when relocating", because we are no longer flushing all delalloc which
      served as a synchonization mechanism (due to page locking) and ensured
      the ordered extents for nocow buffered writes were created before we
      called btrfs_wait_ordered_roots(). The race with direct IO writes in nocow
      mode existed before that patch (no pages are locked or used during direct
      IO) and that fixed only races with direct IO writes that do cow.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      f78c436c
    • F
      Btrfs: fix race between fsync and direct IO writes for prealloc extents · 0b901916
      Filipe Manana 提交于
      When we do a direct IO write against a preallocated extent (fallocate)
      that does not go beyond the i_size of the inode, we do the write operation
      without holding the inode's i_mutex (an optimization that landed in
      commit 38851cc1 ("Btrfs: implement unlocked dio write")). This allows
      for a very tiny time window where a race can happen with a concurrent
      fsync using the fast code path, as the direct IO write path creates first
      a new extent map (no longer flagged as a prealloc extent) and then it
      creates the ordered extent, while the fast fsync path first collects
      ordered extents and then it collects extent maps. This allows for the
      possibility of the fast fsync path to collect the new extent map without
      collecting the new ordered extent, and therefore logging an extent item
      based on the extent map without waiting for the ordered extent to be
      created and complete. This can result in a situation where after a log
      replay we end up with an extent not marked anymore as prealloc but it was
      only partially written (or not written at all), exposing random, stale or
      garbage data corresponding to the unwritten pages and without any
      checksums in the csum tree covering the extent's range.
      
      This is an extension of what was done in commit de0ee0ed ("Btrfs: fix
      race between fsync and lockless direct IO writes").
      
      So fix this by creating first the ordered extent and then the extent
      map, so that this way if the fast fsync patch collects the new extent
      map it also collects the corresponding ordered extent.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      0b901916