1. 30 6月, 2017 3 次提交
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
    • Q
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo 提交于
      Quite a lot of qgroup corruption happens due to wrong time of calling
      btrfs_qgroup_prepare_account_extents().
      
      Since the safest time is to call it just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make code cleaner and less bug prone.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1b8b94a
  2. 18 4月, 2017 3 次提交
  3. 17 2月, 2017 1 次提交
    • Q
      btrfs: qgroup: Move half of the qgroup accounting time out of commit trans · fb235dc0
      Qu Wenruo 提交于
      Just as Filipe pointed out, the most time consuming parts of qgroup are
      btrfs_qgroup_account_extents() and
      btrfs_qgroup_prepare_account_extents().
      Which both call btrfs_find_all_roots() to get old_roots and new_roots
      ulist.
      
      What makes things worse is, we're calling that expensive
      btrfs_find_all_roots() at transaction committing time with
      TRANS_STATE_COMMIT_DOING, which will blocks all incoming transaction.
      
      Such behavior is necessary for @new_roots search as current
      btrfs_find_all_roots() can't do it correctly so we do call it just
      before switch commit roots.
      
      However for @old_roots search, it's not necessary as such search is
      based on commit_root, so it will always be correct and we can move it
      out of transaction committing.
      
      This patch moves the @old_roots search part out of
      commit_transaction(), so in theory we can half the time qgroup time
      consumption at commit_transaction().
      
      But please note that, this won't speedup qgroup overall, the total time
      consumption is still the same, just reduce the performance stall.
      
      Cc: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fb235dc0
  4. 14 2月, 2017 1 次提交
    • J
      btrfs: allow unlink to exceed subvolume quota · 003d7c59
      Jeff Mahoney 提交于
      Once a qgroup limit is exceeded, it's impossible to restore normal
      operation to the subvolume without modifying the limit or removing
      the subvolume.  This is a surprising situation for many users used
      to the typical workflow with quotas on other file systems where it's
      possible to remove files until the used space is back under the limit.
      
      When we go to unlink a file and start the transaction, we'll hit
      the qgroup limit while trying to reserve space for the items we'll
      modify while removing the file.  We discussed last month how best
      to handle this situation and agreed that there is no perfect solution.
      The best principle-of-least-surprise solution is to handle it similarly
      to how we already handle ENOSPC when unlinking, which is to allow
      the operation to succeed with the expectation that it will ultimately
      release space under most circumstances.
      
      This patch modifies the transaction start path to select whether to
      honor the qgroups limits.  btrfs_start_transaction_fallback_global_rsv
      is the only caller that skips enforcement.  The reservation and tracking
      still happens normally -- it just skips the enforcement step.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      003d7c59
  5. 06 12月, 2016 1 次提交
  6. 30 11月, 2016 3 次提交
  7. 25 8月, 2016 2 次提交
    • Q
      btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() · cb93b52c
      Qu Wenruo 提交于
      Refactor btrfs_qgroup_insert_dirty_extent() function, to two functions:
      1. btrfs_qgroup_insert_dirty_extent_nolock()
         Almost the same with original code.
         For delayed_ref usage, which has delayed refs locked.
      
         Change the return value type to int, since caller never needs the
         pointer, but only needs to know if they need to free the allocated
         memory.
      
      2. btrfs_qgroup_insert_dirty_extent()
         The more encapsulated version.
      
         Will do the delayed_refs lock, memory allocation, quota enabled check
         and other things.
      
      The original design is to keep exported functions to minimal, but since
      more btrfs hacks exposed, like replacing path in balance, we need to
      record dirty extents manually, so we have to add such functions.
      
      Also, add comment for both functions, to info developers how to keep
      qgroup correct when doing hacks.
      
      Cc: Mark Fasheh <mfasheh@suse.de>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-and-Tested-by: NGoldwyn Rodrigues <rgoldwyn@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      cb93b52c
    • J
      btrfs: waiting on qgroup rescan should not always be interruptible · d06f23d6
      Jeff Mahoney 提交于
      We wait on qgroup rescan completion in three places: file system
      shutdown, the quota disable ioctl, and the rescan wait ioctl.  If the
      user sends a signal while we're waiting, we continue happily along.  This
      is expected behavior for the rescan wait ioctl.  It's racy in the shutdown
      path but mostly works due to other unrelated synchronization points.
      In the quota disable path, it Oopses the kernel pretty much immediately.
      
      Cc: <stable@vger.kernel.org> # v4.4+
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      d06f23d6
  8. 26 7月, 2016 1 次提交
  9. 22 10月, 2015 7 次提交
  10. 11 6月, 2015 5 次提交
  11. 13 4月, 2015 3 次提交
    • D
      btrfs: qgroup: do a reservation in a higher level. · e2d1f923
      Dongsheng Yang 提交于
      There are two problems in qgroup:
      
      a). The PAGE_CACHE is 4K, even when we are writing a data of 1K,
      qgroup will reserve a 4K size. It will cause the last 3K in a qgroup
      is not available to user.
      
      b). When user is writing a inline data, qgroup will not reserve it,
      it means this is a window we can exceed the limit of a qgroup.
      
      The main idea of this patch is reserving the data size of write_bytes
      rather than the reserve_bytes. It means qgroup will not care about
      the data size btrfs will reserve for user, but only care about the
      data size user is going to write. Then reserve it when user want to
      write and release it in transaction committed.
      
      In this way, qgroup can be released from the complex procedure in
      btrfs and only do the reserve when user want to write and account
      when the data is written in commit_transaction().
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      e2d1f923
    • D
      Btrfs: qgroup: Introduce a may_use to account space_info->bytes_may_use. · 31193213
      Dongsheng Yang 提交于
      Currently, for pre_alloc or delay_alloc, the bytes will be accounted
      in space_info by the three guys.
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      But on the other hand, in qgroup, there are only two counters to account the
      bytes, qgroup->reserved and qgroup->excl. And qg->reserved accounts
      bytes in space_info->bytes_may_use and qg->excl accounts bytes in
      space_info->used. So the bytes in space_info->reserved is not accounted
      in qgroup. If so, there is a window we can exceed the quota limit when
      bytes is in space_info->reserved.
      
      Example:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer     excl     max_rfer max_excl parent  child
      -------- ----     ----     -------- -------- ------  -----
      0/5      20987904 20987904 0        10485760 ---     ---
      
      qg->excl is 20987904 larger than max_excl 10485760.
      
      This patch introduce a new counter named may_use to qgroup, then
      there are three counters in qgroup to account bytes in space_info
      as below.
      space_info->bytes_may_use --- space_info->reserved --- space_info->used.
      qgroup->may_use           --- qgroup->reserved     --- qgroup->excl
      
      With this patch applied:
      	# btrfs quota enable /mnt
      	# btrfs qgroup limit -e 10M /mnt
      	# for((i=0;i<20;i++));do fallocate -l 1M /mnt/data$i; done
      fallocate: /mnt/data9: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data10: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data11: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data12: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data13: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data14: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data15: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data16: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data17: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data18: fallocate failed: Disk quota exceeded
      fallocate: /mnt/data19: fallocate failed: Disk quota exceeded
      	# sync
      	# btrfs qgroup show -pcre /mnt
      qgroupid rfer    excl    max_rfer max_excl parent  child
      -------- ----    ----    -------- -------- ------  -----
      0/5      9453568 9453568 0        10485760 ---     ---
      Reported-by: NCyril SCETBON <cyril.scetbon@free.fr>
      Signed-off-by: NDongsheng Yang <yangds.fnst@cn.fujitsu.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      31193213
    • D
  12. 15 8月, 2014 1 次提交
    • M
      btrfs: qgroup: account shared subtrees during snapshot delete · 1152651a
      Mark Fasheh 提交于
      During its tree walk, btrfs_drop_snapshot() will skip any shared
      subtrees it encounters. This is incorrect when we have qgroups
      turned on as those subtrees need to have their contents
      accounted. In particular, the case we're concerned with is when
      removing our snapshot root leaves the subtree with only one root
      reference.
      
      In those cases we need to find the last remaining root and add
      each extent in the subtree to the corresponding qgroup exclusive
      counts.
      
      This patch implements the shared subtree walk and a new qgroup
      operation, BTRFS_QGROUP_OPER_SUB_SUBTREE. When an operation of
      this type is encountered during qgroup accounting, we search for
      any root references to that extent and in the case that we find
      only one reference left, we go ahead and do the math on it's
      exclusive counts.
      Signed-off-by: NMark Fasheh <mfasheh@suse.de>
      Reviewed-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      1152651a
  13. 10 6月, 2014 1 次提交
    • J
      Btrfs: rework qgroup accounting · fcebe456
      Josef Bacik 提交于
      Currently qgroups account for space by intercepting delayed ref updates to fs
      trees.  It does this by adding sequence numbers to delayed ref updates so that
      it can figure out how the tree looked before the update so we can adjust the
      counters properly.  The problem with this is that it does not allow delayed refs
      to be merged, so if you say are defragging an extent with 5k snapshots pointing
      to it we will thrash the delayed ref lock because we need to go back and
      manually merge these things together.  Instead we want to process quota changes
      when we know they are going to happen, like when we first allocate an extent, we
      free a reference for an extent, we add new references etc.  This patch
      accomplishes this by only adding qgroup operations for real ref changes.  We
      only modify the sequence number when we need to lookup roots for bytenrs, this
      reduces the amount of churn on the sequence number and allows us to merge
      delayed refs as we add them most of the time.  This patch encompasses a bunch of
      architectural changes
      
      1) qgroup ref operations: instead of tracking qgroup operations through the
      delayed refs we simply add new ref operations whenever we notice that we need to
      when we've modified the refs themselves.
      
      2) tree mod seq:  we no longer have this separation of major/minor counters.
      this makes the sequence number stuff much more sane and we can remove some
      locking that was needed to protect the counter.
      
      3) delayed ref seq: we now read the tree mod seq number and use that as our
      sequence.  This means each new delayed ref doesn't have it's own unique sequence
      number, rather whenever we go to lookup backrefs we inc the sequence number so
      we can make sure to keep any new operations from screwing up our world view at
      that given point.  This allows us to merge delayed refs during runtime.
      
      With all of these changes the delayed ref stuff is a little saner and the qgroup
      accounting stuff no longer goes negative in some cases like it was before.
      Thanks,
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NChris Mason <clm@fb.com>
      fcebe456