1. 06 12月, 2016 6 次提交
  2. 30 11月, 2016 7 次提交
    • W
      btrfs: improve delayed refs iterations · 1d57ee94
      Wang Xiaoguang 提交于
      This issue was found when I tried to delete a heavily reflinked file,
      when deleting such files, other transaction operation will not have a
      chance to make progress, for example, start_transaction() will blocked
      in wait_current_trans(root) for long time, sometimes it even triggers
      soft lockups, and the time taken to delete such heavily reflinked file
      is also very large, often hundreds of seconds. Using perf top, it reports
      that:
      
      PerfTop:    7416 irqs/sec  kernel:99.8%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
      ---------------------------------------------------------------------------------------
          84.37%  [btrfs]             [k] __btrfs_run_delayed_refs.constprop.80
          11.02%  [kernel]            [k] delay_tsc
           0.79%  [kernel]            [k] _raw_spin_unlock_irq
           0.78%  [kernel]            [k] _raw_spin_unlock_irqrestore
           0.45%  [kernel]            [k] do_raw_spin_lock
           0.18%  [kernel]            [k] __slab_alloc
      It seems __btrfs_run_delayed_refs() took most cpu time, after some debug
      work, I found it's select_delayed_ref() causing this issue, for a delayed
      head, in our case, it'll be full of BTRFS_DROP_DELAYED_REF nodes, but
      select_delayed_ref() will firstly try to iterate node list to find
      BTRFS_ADD_DELAYED_REF nodes, obviously it's a disaster in this case, and
      waste much time.
      
      To fix this issue, we introduce a new ref_add_list in struct btrfs_delayed_ref_head,
      then in select_delayed_ref(), if this list is not empty, we can directly use
      nodes in this list. With this patch, it just took about 10~15 seconds to
      delte the same file. Now using perf top, it reports that:
      
      PerfTop:    2734 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cpu-clock],  (all, 4 CPUs)
      ----------------------------------------------------------------------------------------
      
          20.74%  [kernel]          [k] _raw_spin_unlock_irqrestore
          16.33%  [kernel]          [k] __slab_alloc
           5.41%  [kernel]          [k] lock_acquired
           4.42%  [kernel]          [k] lock_acquire
           4.05%  [kernel]          [k] lock_release
           3.37%  [kernel]          [k] _raw_spin_unlock_irq
      
      For normal files, this patch also gives help, at least we do not need to
      iterate whole list to found BTRFS_ADD_DELAYED_REF nodes.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1d57ee94
    • Q
      btrfs: Export and move leaf/subtree qgroup helpers to qgroup.c · 33d1f05c
      Qu Wenruo 提交于
      Move account_shared_subtree() to qgroup.c and rename it to
      btrfs_qgroup_trace_subtree().
      
      Do the same thing for account_leaf_items() and rename it to
      btrfs_qgroup_trace_leaf_items().
      
      Since all these functions are only for qgroup, move them to qgroup.c and
      export them is more appropriate.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      33d1f05c
    • Q
      btrfs: qgroup: Rename functions to make it follow reserve,trace,account steps · 50b3e040
      Qu Wenruo 提交于
      Rename btrfs_qgroup_insert_dirty_extent(_nolock) to
      btrfs_qgroup_trace_extent(_nolock), according to the new
      reserve/trace/account naming schema.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      50b3e040
    • J
      btrfs: Ensure proper sector alignment for btrfs_free_reserved_data_space · 0c476a5d
      Jeff Mahoney 提交于
      This fixes the WARN_ON on BTRFS_I(inode)->reserved_extents in
      btrfs_destroy_inode and the WARN_ON on nonzero delalloc bytes on umount
      with qgroups enabled.
      
      I was able to reproduce this by setting up a small (~500kb) quota limit
      and writing a file one byte at a time until I hit the limit.  The warnings
      would all hit on umount.
      
      The root cause is that we would reserve a block-sized range in both
      the reservation and the quota in btrfs_check_data_free_space, but if we
      encountered a problem (like e.g. EDQUOT), we would only release the single
      byte in the qgroup reservation.  That caused an iotree state split, which
      increased the number of outstanding extents, in turn disallowing releasing
      the metadata reservation.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0c476a5d
    • D
      btrfs: remove constant parameter to memset_extent_buffer and rename it · b159fa28
      David Sterba 提交于
      The only memset we do is to 0, so sink the parameter to the function and
      simplify all calls. Rename the function to reflect the behaviour.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b159fa28
    • D
      btrfs: remove trivial helper btrfs_find_tree_block · 62d1f9fe
      David Sterba 提交于
      During the time, the function has been shrunk to the point that it just
      calls find_extent_buffer, just passing the parameters.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      62d1f9fe
    • X
      btrfs: remove useless comments · 745699ef
      Xiaoguang Wang 提交于
      Fixes: ("btrfs: update btrfs_space_info's bytes_may_use timely")
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      745699ef
  3. 29 11月, 2016 2 次提交
  4. 25 10月, 2016 1 次提交
  5. 11 10月, 2016 1 次提交
  6. 27 9月, 2016 9 次提交
  7. 26 9月, 2016 5 次提交
  8. 22 9月, 2016 1 次提交
  9. 06 9月, 2016 1 次提交
    • W
      btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress · ce129655
      Wang Xiaoguang 提交于
      In btrfs_async_reclaim_metadata_space(), we use ticket's address to
      determine whether asynchronous metadata reclaim work is making progress.
      
      	ticket = list_first_entry(&space_info->tickets,
      				  struct reserve_ticket, list);
      	if (last_ticket == ticket) {
      		flush_state++;
      	} else {
      		last_ticket = ticket;
      		flush_state = FLUSH_DELAYED_ITEMS_NR;
      		if (commit_cycles)
      			commit_cycles--;
      	}
      
      But indeed it's wrong, we should not rely on local variable's address to
      do this check, because addresses may be same. In my test environment, I
      dd one 168MB file in a 256MB fs, found that for this file, every time
      wait_reserve_ticket() called, local variable ticket's address is same,
      
      For above codes, assume a previous ticket's address is addrA, last_ticket
      is addrA. Btrfs_async_reclaim_metadata_space() finished this ticket and
      wake up it, then another ticket is added, but with the same address addrA,
      now last_ticket will be same to current ticket, then current ticket's flush
      work will start from current flush_state, not initial FLUSH_DELAYED_ITEMS_NR,
      which may result in some enospc issues(I have seen this in my test machine).
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ce129655
  10. 05 9月, 2016 1 次提交
  11. 01 9月, 2016 1 次提交
  12. 25 8月, 2016 5 次提交
    • J
      Btrfs: fix em leak in find_first_block_group · 187ee58c
      Josef Bacik 提交于
      We need to call free_extent_map() on the em we look up.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      187ee58c
    • L
      Btrfs: clarify do_chunk_alloc()'s return value · 28b737f6
      Liu Bo 提交于
      Function start_transaction() can return ERR_PTR(1) when flush is
      BTRFS_RESERVE_FLUSH_LIMIT, so the call graph is
      
      start_transaction (return ERR_PTR(1))
        -> btrfs_block_rsv_add (return 1)
           -> reserve_metadata_bytes (return 1)
              -> flush_space (return 1)
                 -> do_chunk_alloc  (return 1)
      
      With BTRFS_RESERVE_FLUSH_LIMIT, if flush_space is already on the
      flush_state of ALLOC_CHUNK and it successfully allocates a new
      chunk, then instead of trying to reserve space again,
      reserve_metadata_bytes returns 1 immediately.
      
      Eventually the callers who call start_transaction() usually just
      do the IS_ERR() check which ERR_PTR(1) can pass, then it'll get
      a panic when dereferencing a pointer which is ERR_PTR(1).
      
      The following patch fixes the above problem.
      "btrfs: flush_space: treat return value of do_chunk_alloc properly"
      https://patchwork.kernel.org/patch/7778651/
      
      This add comments to clarify do_chunk_alloc()'s return value.
      Signed-off-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      28b737f6
    • W
      btrfs: update btrfs_space_info's bytes_may_use timely · 18513091
      Wang Xiaoguang 提交于
      This patch can fix some false ENOSPC errors, below test script can
      reproduce one false ENOSPC error:
      	#!/bin/bash
      	dd if=/dev/zero of=fs.img bs=$((1024*1024)) count=128
      	dev=$(losetup --show -f fs.img)
      	mkfs.btrfs -f -M $dev
      	mkdir /tmp/mntpoint
      	mount $dev /tmp/mntpoint
      	cd /tmp/mntpoint
      	xfs_io -f -c "falloc 0 $((64*1024*1024))" testfile
      
      Above script will fail for ENOSPC reason, but indeed fs still has free
      space to satisfy this request. Please see call graph:
      btrfs_fallocate()
      |-> btrfs_alloc_data_chunk_ondemand()
      |   bytes_may_use += 64M
      |-> btrfs_prealloc_file_range()
          |-> btrfs_reserve_extent()
              |-> btrfs_add_reserved_bytes()
              |   alloc_type is RESERVE_ALLOC_NO_ACCOUNT, so it does not
              |   change bytes_may_use, and bytes_reserved += 64M. Now
              |   bytes_may_use + bytes_reserved == 128M, which is greater
              |   than btrfs_space_info's total_bytes, false enospc occurs.
              |   Note, the bytes_may_use decrease operation will be done in
              |   end of btrfs_fallocate(), which is too late.
      
      Here is another simple case for buffered write:
                          CPU 1              |              CPU 2
                                             |
      |-> cow_file_range()                   |-> __btrfs_buffered_write()
          |-> btrfs_reserve_extent()         |   |
          |                                  |   |
          |                                  |   |
          |    .....                         |   |-> btrfs_check_data_free_space()
          |                                  |
          |                                  |
          |-> extent_clear_unlock_delalloc() |
      
      In CPU 1, btrfs_reserve_extent()->find_free_extent()->
      btrfs_add_reserved_bytes() do not decrease bytes_may_use, the decrease
      operation will be delayed to be done in extent_clear_unlock_delalloc().
      Assume in this case, btrfs_reserve_extent() reserved 128MB data, CPU2's
      btrfs_check_data_free_space() tries to reserve 100MB data space.
      If
      	100MB > data_sinfo->total_bytes - data_sinfo->bytes_used -
      		data_sinfo->bytes_reserved - data_sinfo->bytes_pinned -
      		data_sinfo->bytes_readonly - data_sinfo->bytes_may_use
      btrfs_check_data_free_space() will try to allcate new data chunk or call
      btrfs_start_delalloc_roots(), or commit current transaction in order to
      reserve some free space, obviously a lot of work. But indeed it's not
      necessary as long as decreasing bytes_may_use timely, we still have
      free space, decreasing 128M from bytes_may_use.
      
      To fix this issue, this patch chooses to update bytes_may_use for both
      data and metadata in btrfs_add_reserved_bytes(). For compress path, real
      extent length may not be equal to file content length, so introduce a
      ram_bytes argument for btrfs_reserve_extent(), find_free_extent() and
      btrfs_add_reserved_bytes(), it's becasue bytes_may_use is increased by
      file content length. Then compress path can update bytes_may_use
      correctly. Also now we can discard RESERVE_ALLOC_NO_ACCOUNT, RESERVE_ALLOC
      and RESERVE_FREE.
      
      As we know, usually EXTENT_DO_ACCOUNTING is used for error path. In
      run_delalloc_nocow(), for inode marked as NODATACOW or extent marked as
      PREALLOC, we also need to update bytes_may_use, but can not pass
      EXTENT_DO_ACCOUNTING, because it also clears metadata reservation, so
      here we introduce EXTENT_CLEAR_DATA_RESV flag to indicate btrfs_clear_bit_hook()
      to update btrfs_space_info's bytes_may_use.
      
      Meanwhile __btrfs_prealloc_file_range() will call
      btrfs_free_reserved_data_space() internally for both sucessful and failed
      path, btrfs_prealloc_file_range()'s callers does not need to call
      btrfs_free_reserved_data_space() any more.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      18513091
    • W
      btrfs: divide btrfs_update_reserved_bytes() into two functions · 4824f1f4
      Wang Xiaoguang 提交于
      This patch divides btrfs_update_reserved_bytes() into
      btrfs_add_reserved_bytes() and btrfs_free_reserved_bytes(), and
      next patch will extend btrfs_add_reserved_bytes()to fix some
      false ENOSPC error, please see later patch for detailed info.
      Signed-off-by: NWang Xiaoguang <wangxg.fnst@cn.fujitsu.com>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      4824f1f4
    • Q
      btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() · cb93b52c
      Qu Wenruo 提交于
      Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
      1. btrfs_qgroup_insert_dirty_extent_nolock()
         Almost the same with original code.
         For delayed_ref usage, which has delayed refs locked.
      
         Change the return value type to int, since caller never needs the
         pointer, but only needs to know if they need to free the allocated
         memory.
      
      2. btrfs_qgroup_insert_dirty_extent()
         The more encapsulated version.
      
         Will do the delayed_refs lock, memory allocation, quota enabled check
         and other things.
      
      The original design is to keep exported functions to minimal, but since
      more btrfs hacks exposed, like replacing path in balance, we need to
      record dirty extents manually, so we have to add such functions.
      
      Also, add comment for both functions, to info developers how to keep
      qgroup correct when doing hacks.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      cb93b52c