1. 30 6月, 2017 16 次提交
    • J
      btrfs: Don't clear SGID when inheriting ACLs · b7f8a09f
      Jan Kara 提交于
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
      
      Fixes: 07393101
      CC: stable@vger.kernel.org
      CC: linux-btrfs@vger.kernel.org
      CC: David Sterba <dsterba@suse.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b7f8a09f
    • C
      btrfs: fix integer overflow in calc_reclaim_items_nr · 6374e57a
      Chris Mason 提交于
      Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
      v4.12-rc6.  This was because commit 70e7af24 made it possible for
      calc_reclaim_items_nr() to return a negative number.  It's not really a
      bug in that commit, it just didn't go far enough down the stack to find
      all the possible 64->32 bit overflows.
      
      This switches calc_reclaim_items_nr() to return a u64 and changes everyone
      that uses the results of that math to u64 as well.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6374e57a
    • D
      btrfs: scrub: fix target device intialization while setting up scrub context · ded56184
      David Sterba 提交于
      The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
      helper but wrongly sets up the target device. Incidentally there's a
      local variable with the same name as a parameter in the previous
      function, so this got caught during runtime as crash in test btrfs/027.
      Reported-by: NChris Mason <clm@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ded56184
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write... · a12b877b
      Qu Wenruo 提交于
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
      
      [BUG]
      Under the following case, we can underflow qgroup reserved space.
      
                  Task A                |            Task B
      ---------------------------------------------------------------
       Quota disabled                   |
       Buffered write                   |
       |- btrfs_check_data_free_space() |
       |  *NO* qgroup space is reserved |
       |  since quota is *DISABLED*     |
       |- All pages are copied to page  |
          cache                         |
                                        | Enable quota
                                        | Quota scan finished
                                        |
                                        | Sync_fs
                                        | |- run_delalloc_range
                                        | |- Write pages
                                        | |- btrfs_finish_ordered_io
                                        |    |- insert_reserved_file_extent
                                        |       |- btrfs_qgroup_release_data()
                                        |          Since no qgroup space is
                                                   reserved in Task A, we
                                                   underflow qgroup reserved
                                                   space
      This can be detected by fstest btrfs/104.
      
      [CAUSE]
      In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
      size of qgroup reserved_space in all cases.
      And btrfs_qgroup_release_data() will check if quotas are enabled.
      
      However in the above case, the buffered write happens before quota is
      enabled, so we don't have the reserved space for that range.
      
      [FIX]
      In insert_reserved_file_extent(), we tell qgroup to release the acctual
      byte number it released.
      In the above case, since we don't have the reserved space, we tell
      qgroups to release 0 byte, so the problem can be fixed.
      
      And thanks to the @reserved parameter introduced by the qgroup rework,
      and previous patch to return released bytes, the fix can be as small as
      10 lines.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a12b877b
    • Q
      btrfs: qgroup: Return actually freed bytes for qgroup release or free data · 7bc329c1
      Qu Wenruo 提交于
      btrfs_qgroup_release/free_data() only returns 0 or a negative error
      number (ENOMEM is the only possible error).
      
      This is normally good enough, but sometimes we need the exact byte
      count it freed/released.
      
      Change it to return actually released/freed bytenr number instead of 0
      for success.
      And slightly modify related extent_changeset structure, since in btrfs
      one no-hole data extent won't be larger than 128M, so "unsigned int"
      is large enough for the use case.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7bc329c1
    • Q
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo 提交于
      Quite a lot of qgroup corruption happens due to wrong time of calling
      btrfs_qgroup_prepare_account_extents().
      
      Since the safest time is to call it just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make code cleaner and less bug prone.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1b8b94a
    • Q
      btrfs: qgroup: Add quick exit for non-fs extents · 5edfd9fd
      Qu Wenruo 提交于
      Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents.
      
      The quick exit condition is:
      1) The extent belongs to a non-fs tree
         Only fs-tree extents can affect qgroup numbers and is the only case
         where extent can be shared between different trees.
      
         Although strictly speaking extent in data-reloc or tree-reloc tree
         can be shared, data/tree-reloc root won't appear in the result of
         btrfs_find_all_roots(), so we can ignore such case.
      
         So we can check the first root in old_roots/new_roots ulist.
         - if we find the 1st root is a not a fs/subvol root, then we can skip
           the extent
         - if we find the 1st root is a fs/subvol root, then we must continue
           calculation
      
      OR
      
      2) both 'nr_old_roots' and 'nr_new_roots' are 0
         This means either such extent got allocated then freed in current
         transaction or it's a new reloc tree extent, whose nr_new_roots is 0.
         Either way it won't affect qgroup accounting and can be skipped
         safely.
      
      Such quick exit can make trace output more quite and less confusing:
      (example with fs uuid and time stamp removed)
      
      Before:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1
      ------
      Extent tree block will trigger btrfs_qgroup_account_extent() trace point
      while no qgroup number is changed, as extent tree won't affect qgroup
      accounting.
      
      After:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      ------
      Now such unrelated extent won't trigger btrfs_qgroup_account_extent()
      trace point, making the trace less noisy.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5edfd9fd
    • O
      Btrfs: rework delayed ref total_bytes_pinned accounting · d7eae340
      Omar Sandoval 提交于
      The total_bytes_pinned counter is completely broken when accounting
      delayed refs:
      
      - If two drops for the same extent are merged, we will decrement
        total_bytes_pinned twice but only increment it once.
      - If an add is merged into a drop or vice versa, we will decrement the
        total_bytes_pinned counter but never increment it.
      - If multiple references to an extent are dropped, we will account it
        multiple times, potentially vastly over-estimating the number of bytes
        that will be freed by a commit and doing unnecessary work when we're
        close to ENOSPC.
      
      The last issue is relatively minor, but the first two make the
      total_bytes_pinned counter leak or underflow very often. These
      accounting issues were introduced in b150a4f1 ("Btrfs: use a percpu
      to keep track of possibly pinned bytes"), but they were papered over by
      zeroing out the counter on every commit until d288db5d ("Btrfs: fix
      race of using total_bytes_pinned").
      
      We need to make sure that an extent is accounted as pinned exactly once
      if and only if we will drop references to it when when the transaction
      is committed. Ideally we would only add to total_bytes_pinned when the
      *last* reference is dropped, but this information isn't readily
      available for data extents. Again, this over-estimation can lead to
      extra commits when we're close to ENOSPC, but it's not as bad as before.
      
      The fix implemented here is to increment total_bytes_pinned when the
      total refmod count for an extent goes negative and decrement it if the
      refmod count goes back to non-negative or after we've run all of the
      delayed refs for that extent.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7eae340
    • O
      Btrfs: return old and new total ref mods when adding delayed refs · 7be07912
      Omar Sandoval 提交于
      We need this to decide when to account pinned bytes.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7be07912
    • O
      Btrfs: always account pinned bytes when dropping a tree block ref · 0a16c7d7
      Omar Sandoval 提交于
      Currently, we only increment total_bytes_pinned in
      btrfs_free_tree_block() when dropping the last reference on the block.
      However, when the delayed ref is run later, we will decrement
      total_bytes_pinned regardless of whether it was the last reference or
      not. This causes the counter to underflow when the reference we dropped
      was not the last reference. Fix it by incrementing the counter
      unconditionally, which is what btrfs_free_extent() does. This makes
      total_bytes_pinned an overestimate when references to shared extents are
      dropped, but in the worst case this will just make us try to commit the
      transaction to try to free up space and find we didn't free enough.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0a16c7d7
    • O
      Btrfs: update total_bytes_pinned when pinning down extents · 4da8b76d
      Omar Sandoval 提交于
      The extents marked in pin_down_extent() will be unpinned later in
      unpin_extent_range(), which decrements total_bytes_pinned.
      pin_down_extent() must increment the counter to avoid underflowing it.
      Also adjust btrfs_free_tree_block() to avoid accounting for the same
      extent twice.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4da8b76d
    • O
      Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() · 55e8196a
      Omar Sandoval 提交于
      The value of flags is one of DATA/METADATA/SYSTEM, they must exist at
      when add_pinned_bytes is called.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ added changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      55e8196a
    • O
      Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 · 0d9f824d
      Omar Sandoval 提交于
      There are a few places where we pass in a negative num_bytes, so make it
      signed for clarity. Also move it up in the file since later patches will
      need it there.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d9f824d
    • D
      btrfs: fix validation of XATTR_ITEM dir items · 1164a9fb
      David Sterba 提交于
      The XATTR_ITEM is a type of a directory item so we use the common
      validator helper. Unlike other dir items, it can have data. The way the
      name len validation is currently implemented does not reflect that. We'd
      have to adjust by the data_len when comparing the read and item limits.
      
      However, this will not work for multi-item xattr dir items.
      
      Example from tree dump of generic/337:
      
              item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 3 name_len 11
                      name: user.foobar
                      data 123
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 6 name_len 13
                      name: user.WvG1c1Td
                      data qwerty
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 5 name_len 19
                      name: user.J3__T_Km3dVsW_
                      data hello
      
      At the point of btrfs_is_name_len_valid call we don't have access to the
      data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf,
      di) would always return 3, although we'd need to get 6 and 5 respectively to
      get the claculations right. (read_end + name_len + data_len vs item_end)
      
      We'd have to also pass data_len externally, which is not point of the
      name validation. The last check is supposed to test if there's at least
      one dir item space after the one we're processing. I don't think this is
      particularly useful, validation of the next item would catch that too.
      So the check is removed and we don't weaken the validation. Now tests
      btrfs/048, btrfs/053, generic/273 and generic/337 pass.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1164a9fb
  2. 22 6月, 2017 12 次提交
  3. 21 6月, 2017 4 次提交
    • F
      Btrfs: incremental send, fix invalid path for unlink commands · fdb13889
      Filipe Manana 提交于
      An incremental send can contain unlink operations with an invalid target
      path when we rename some directory inode A, then rename some file inode B
      to the old name of inode A and directory inode A is an ancestor of inode B
      in the parent snapshot (but not anymore in the send snapshot).
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |     |--- file1                                 (ino 259)
             |     |--- file3                                 (ino 261)
             |
             |--- dir3/                                       (ino 262)
                   |--- file22                                (ino 260)
                   |--- dir4/                                 (ino 263)
      
      Send snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |--- dir3                                        (ino 260)
             |--- file3/                                      (ino 262)
                   |--- dir4/                                 (ino 263)
                         |--- file11                          (ino 269)
                         |--- file33                          (ino 261)
      
      When attempting to apply the corresponding incremental send stream, an
      unlink operation contains an invalid path which makes the receiver fail.
      The following is verbose output of the btrfs receive command:
      
       receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
       utimes
       utimes dir1
       utimes dir1/dir2
       link dir1/dir3/dir4/file11 -> dir1/dir2/file1
       unlink dir1/dir2/file1
       utimes dir1/dir2
       truncate dir1/dir3/dir4/file11 size=0
       utimes dir1/dir3/dir4/file11
       rename dir1/dir3 -> o262-7-0
       link dir1/dir3 -> o262-7-0/file22
       unlink dir1/dir3/file22
       ERROR: unlink dir1/dir3/file22 failed. Not a directory
      
      The following steps happen during the computation of the incremental send
      stream the lead to this issue:
      
      1) Before we start processing the new and deleted references for inode
         260, we compute the full path of the deleted reference
         ("dir1/dir3/file22") and cache it in the list of deleted references
         for our inode.
      
      2) We then start processing the new references for inode 260, for which
         there is only one new, located at "dir1/dir3". When processing this
         new reference, we check that inode 262, which was not yet processed,
         collides with the new reference and because of that we orphanize
         inode 262 so its new full path becomes "o262-7-0".
      
      3) After the orphanization of inode 262, we create the new reference for
         inode 260 by issuing a link command with a target path of "dir1/dir3"
         and a source path of "o262-7-0/file22".
      
      4) We then start processing the deleted references for inode 260, for
         which there is only one with the base name of "file22", and issue
         an unlink operation containing the target path computed at step 1,
         which is wrong because that path no longer exists and should be
         replaced with "o262-7-0/file22".
      
      So fix this issue by recomputing the full path of deleted references if
      when we processed the new references for an inode we ended up orphanizing
      any other inode that is an ancestor of our inode in the parent snapshot.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fdb13889
    • F
      Btrfs: send, fix invalid path after renaming and linking file · 72c3668f
      Filipe Manana 提交于
      Currently an incremental snapshot can generate link operations which
      contain an invalid target path. Such case happens when in the send
      snapshot a file was renamed, a new hard link added for it and some
      other inode (with a lower number) got renamed to the former name of
      that file. Example:
      
      Parent snapshot
      
       .                  (ino 256)
       |
       |--- f1            (ino 257)
       |--- f2            (ino 258)
       |--- f3            (ino 259)
      
      Send snapshot
      
       .                  (ino 256)
       |
       |--- f2            (ino 257)
       |--- f3            (ino 258)
       |--- f4            (ino 259)
       |--- f5            (ino 258)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, inode 258 is orphanized (renamed to
         "o258-7-0"), because its current reference has the same name as the
         new reference for inode 257;
      
      2) When processing inode 258, we iterate over all its new references,
         which have the names "f3" and "f5". The first iteration sees name
         "f5" and renames the inode from its orphan name ("o258-7-0") to
         "f5", while the second iteration sees the name "f3" and, incorrectly,
         issues a link operation with a target name matching the orphan name,
         which no longer exists. The first iteration had reset the current
         valid path of the inode to "f5", but in the second iteration we lost
         it because we found another inode, with a higher number of 259, which
         has a reference named "f3" as well, so we orphanized inode 259 and
         recomputed the current valid path of inode 258 to its old orphan
         name because inode 259 could be an ancestor of inode 258 and therefore
         the current valid path could contain the pre-orphanization name of
         inode 259. However in this case inode 259 is not an ancestor of inode
         258 so the current valid path should not be recomputed.
         This makes the receiver fail with the following error:
      
         ERROR: link f3 -> o258-7-0 failed: No such file or directory
      
      So fix this by not recomputing the current valid path for an inode
      whenever we find a colliding reference from some not yet processed inode
      (inode number higher then the one currently being processed), unless
      that other inode is an ancestor of the one we are currently processing.
      
      A test case for fstests will follow soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      72c3668f
    • F
      Btrfs: fix invalid extent maps due to hole punching · 609805d8
      Filipe Manana 提交于
      While punching a hole in a range that is not aligned with the sector size
      (currently the same as the page size) we can end up leaving an extent map
      in memory with a length that is smaller then the sector size or with a
      start offset that is not aligned to the sector size. Both cases are not
      expected and can lead to problems. This issue is easily detected
      after the patch from commit a7e3b975 ("Btrfs: fix reported number of
      inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
      following for example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
        $ xfs_io -c "fpunch 60K 90K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
        $ umount /mnt
      
      After the unmount operation we can see several warnings emmitted due to
      underflows related to space reservation counters:
      
      [ 2837.443299] ------------[ cut here ]------------
      [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
      rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
      ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.462379] Call Trace:
      [ 2837.462379]  dump_stack+0x68/0x92
      [ 2837.462379]  __warn+0xc2/0xdd
      [ 2837.462379]  warn_slowpath_null+0x1d/0x1f
      [ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.462379]  destroy_inode+0x3d/0x55
      [ 2837.462379]  evict+0x177/0x17e
      [ 2837.462379]  dispose_list+0x50/0x71
      [ 2837.462379]  evict_inodes+0x132/0x141
      [ 2837.462379]  generic_shutdown_super+0x3f/0xeb
      [ 2837.462379]  kill_anon_super+0x12/0x1c
      [ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.462379]  deactivate_locked_super+0x30/0x68
      [ 2837.462379]  deactivate_super+0x36/0x39
      [ 2837.462379]  cleanup_mnt+0x58/0x76
      [ 2837.462379]  __cleanup_mnt+0x12/0x14
      [ 2837.462379]  task_work_run+0x77/0x9b
      [ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
      [ 2837.596256] ------------[ cut here ]------------
      [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.663359] Call Trace:
      [ 2837.663359]  dump_stack+0x68/0x92
      [ 2837.663359]  __warn+0xc2/0xdd
      [ 2837.663359]  warn_slowpath_null+0x1d/0x1f
      [ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.663359]  ? evict_inodes+0x132/0x141
      [ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.663359]  generic_shutdown_super+0x6a/0xeb
      [ 2837.663359]  kill_anon_super+0x12/0x1c
      [ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.663359]  deactivate_locked_super+0x30/0x68
      [ 2837.663359]  deactivate_super+0x36/0x39
      [ 2837.663359]  cleanup_mnt+0x58/0x76
      [ 2837.663359]  __cleanup_mnt+0x12/0x14
      [ 2837.663359]  task_work_run+0x77/0x9b
      [ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
      [ 2837.745595] ------------[ cut here ]------------
      [ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.758526] Call Trace:
      [ 2837.758925]  dump_stack+0x68/0x92
      [ 2837.759383]  __warn+0xc2/0xdd
      [ 2837.759383]  warn_slowpath_null+0x1d/0x1f
      [ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.759383]  ? evict_inodes+0x132/0x141
      [ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.759383]  generic_shutdown_super+0x6a/0xeb
      [ 2837.759383]  kill_anon_super+0x12/0x1c
      [ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.759383]  deactivate_locked_super+0x30/0x68
      [ 2837.759383]  deactivate_super+0x36/0x39
      [ 2837.759383]  cleanup_mnt+0x58/0x76
      [ 2837.759383]  __cleanup_mnt+0x12/0x14
      [ 2837.759383]  task_work_run+0x77/0x9b
      [ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
      [ 2837.778235] ------------[ cut here ]------------
      [ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.800118] Call Trace:
      [ 2837.800515]  dump_stack+0x68/0x92
      [ 2837.801015]  __warn+0xc2/0xdd
      [ 2837.801471]  warn_slowpath_null+0x1d/0x1f
      [ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.801698]  ? evict_inodes+0x132/0x141
      [ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.801698]  generic_shutdown_super+0x6a/0xeb
      [ 2837.801698]  kill_anon_super+0x12/0x1c
      [ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.801698]  deactivate_locked_super+0x30/0x68
      [ 2837.801698]  deactivate_super+0x36/0x39
      [ 2837.801698]  cleanup_mnt+0x58/0x76
      [ 2837.801698]  __cleanup_mnt+0x12/0x14
      [ 2837.801698]  task_work_run+0x77/0x9b
      [ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
      [ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
      [ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
      
      What happens in the above example is the following:
      
      1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
         is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
         This results in the creation of an extent map with a length of 2Kb
         starting at file offset 148Kb, through find_first_non_hole() ->
         btrfs_get_extent().
      
      2) The second write (first write after the hole punch operation), sets
         the range [50Kb, 152Kb[ to delalloc.
      
      3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
         map covering the range [148Kb, 150Kb[ and ends up calling
         set_extent_bit() for the same range, which results in splitting an
         existing extent state record, covering the range [148Kb, 152Kb[ into
         two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
         [150Kb, 152Kb[.
      
      4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
         btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
         range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
         callback being invoked against the two 2Kb extent state records that
         cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
         the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
         with a length argument of 2048 bytes. That function rounds up the length
         to a sector size aligned length, so it ends up considering a length of
         4096 bytes, and then calls calc_csum_metadata_size() which results in
         decrementing the inode's csum_bytes counter by 4096 bytes, so after
         it stays a value of 0 bytes. Then the same happens when
         btrfs_clear_bit_hook() is called against the second extent state that
         has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
         rounded up to 4096 and calc_csum_metadata_size() ends up being called
         to decrement 4096 bytes from the inode's csum_bytes counter, which
         at that time has a value of 0, leading to an underflow, which is
         exactly what triggers the first warning, at btrfs_destroy_inode().
         All the other warnings relate to several space accounting counters
         that underflow as well due to similar reasons.
      
      A similar case but where the hole punching operation creates an extent map
      with a start offset not aligned to the sector size is the following:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
        $ umount /mnt
      
      During the unmount operation we get similar traces for the same reasons as
      in the first example.
      
      So fix the hole punching operation to make sure it never creates extent
      maps with a length that is not aligned to the sector size nor with a start
      offset that is not aligned to the sector size, as this breaks all
      assumptions and it's a land mine.
      
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      609805d8
    • J
      btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items · cddf3b2c
      Jeff Mahoney 提交于
      On an uncontended system, we can end up hitting soft lockups while
      doing replace_path.  At the core, and frequently called is
      btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
      there.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cddf3b2c
  4. 20 6月, 2017 8 次提交