1. 07 7月, 2017 2 次提交
    • F
      Btrfs: incremental send, fix invalid memory access · 24e52b11
      Filipe Manana 提交于
      When doing an incremental send, while processing an extent that changed
      between the parent and send snapshots and that extent was an inline extent
      in the parent snapshot, it's possible to access a memory region beyond
      the end of leaf if the inline extent is very small and it is the first
      item in a leaf.
      
      An example scenario is described below.
      
      The send snapshot has the following leaf:
      
       leaf 33865728 items 33 free space 773 generation 46 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              (...)
              item 14 key (335 EXTENT_DATA 0) itemoff 3052 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 12791808 nr 4096
                      extent data offset 0 nr 4096 ram 4096
                      extent compression 0 (none)
              item 15 key (335 EXTENT_DATA 8192) itemoff 2999 itemsize 53
                      generation 36 type 1 (regular)
                      extent data disk byte 138170368 nr 225280
                      extent data offset 0 nr 225280 ram 225280
                      extent compression 0 (none)
              (...)
      
      And the parent snapshot has the following leaf:
      
       leaf 31272960 items 17 free space 17 generation 31 owner 5
       fs uuid ab7090d8-dafd-4fb9-9246-723b6d2e2fb7
       chunk uuid 2d16478c-c704-4ab9-b574-68bff2281b1f
              item 0 key (335 EXTENT_DATA 0) itemoff 3951 itemsize 44
                      generation 31 type 0 (inline)
                      inline extent data size 23 ram_bytes 613 compression 1 (zlib)
              (...)
      
      When computing the send stream, it is detected that the extent of inode
      335, at file offset 0, and at fs/btrfs/send.c:is_extent_unchanged() we
      grab the leaf from the parent snapshot and access the inline extent item.
      However, before jumping to the 'out' label, we access the 'offset' and
      'disk_bytenr' fields of the extent item, which should not be done for
      inline extents since the inlined data starts at the offset of the
      'disk_bytenr' field and can be very small. For example accessing the
      'offset' field of the file extent item results in the following trace:
      
      [  599.705368] general protection fault: 0000 [#1] PREEMPT SMP
      [  599.706296] Modules linked in: btrfs psmouse i2c_piix4 ppdev acpi_cpufreq serio_raw parport_pc i2c_core evdev tpm_tis tpm_tis_core sg pcspkr parport tpm button su$
      [  599.709340] CPU: 7 PID: 5283 Comm: btrfs Not tainted 4.10.0-rc8-btrfs-next-46+ #1
      [  599.709340] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [  599.709340] task: ffff88023eedd040 task.stack: ffffc90006658000
      [  599.709340] RIP: 0010:read_extent_buffer+0xdb/0xf4 [btrfs]
      [  599.709340] RSP: 0018:ffffc9000665ba00 EFLAGS: 00010286
      [  599.709340] RAX: db73880000000000 RBX: 0000000000000000 RCX: 0000000000000001
      [  599.709340] RDX: ffffc9000665ba60 RSI: db73880000000000 RDI: ffffc9000665ba5f
      [  599.709340] RBP: ffffc9000665ba30 R08: 0000000000000001 R09: ffff88020dc5e098
      [  599.709340] R10: 0000000000001000 R11: 0000160000000000 R12: 6db6db6db6db6db7
      [  599.709340] R13: ffff880000000000 R14: 0000000000000000 R15: ffff88020dc5e088
      [  599.709340] FS:  00007f519555a8c0(0000) GS:ffff88023f3c0000(0000) knlGS:0000000000000000
      [  599.709340] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  599.709340] CR2: 00007f1411afd000 CR3: 0000000235f8e000 CR4: 00000000000006e0
      [  599.709340] Call Trace:
      [  599.709340]  btrfs_get_token_64+0x93/0xce [btrfs]
      [  599.709340]  ? printk+0x48/0x50
      [  599.709340]  btrfs_get_64+0xb/0xd [btrfs]
      [  599.709340]  process_extent+0x3a1/0x1106 [btrfs]
      [  599.709340]  ? btree_read_extent_buffer_pages+0x5/0xef [btrfs]
      [  599.709340]  changed_cb+0xb03/0xb3d [btrfs]
      [  599.709340]  ? btrfs_get_token_32+0x7a/0xcc [btrfs]
      [  599.709340]  btrfs_compare_trees+0x432/0x53d [btrfs]
      [  599.709340]  ? process_extent+0x1106/0x1106 [btrfs]
      [  599.709340]  btrfs_ioctl_send+0x960/0xe26 [btrfs]
      [  599.709340]  btrfs_ioctl+0x181b/0x1fed [btrfs]
      [  599.709340]  ? trace_hardirqs_on_caller+0x150/0x1ac
      [  599.709340]  vfs_ioctl+0x21/0x38
      [  599.709340]  ? vfs_ioctl+0x21/0x38
      [  599.709340]  do_vfs_ioctl+0x611/0x645
      [  599.709340]  ? rcu_read_unlock+0x5b/0x5d
      [  599.709340]  ? __fget+0x6d/0x79
      [  599.709340]  SyS_ioctl+0x57/0x7b
      [  599.709340]  entry_SYSCALL_64_fastpath+0x18/0xad
      [  599.709340] RIP: 0033:0x7f51945eec47
      [  599.709340] RSP: 002b:00007ffc21c13e98 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
      [  599.709340] RAX: ffffffffffffffda RBX: ffffffff81096459 RCX: 00007f51945eec47
      [  599.709340] RDX: 00007ffc21c13f20 RSI: 0000000040489426 RDI: 0000000000000004
      [  599.709340] RBP: ffffc9000665bf98 R08: 00007f519450d700 R09: 00007f519450d700
      [  599.709340] R10: 00007f519450d9d0 R11: 0000000000000202 R12: 0000000000000046
      [  599.709340] R13: ffffc9000665bf78 R14: 0000000000000000 R15: 00007f5195574040
      [  599.709340]  ? trace_hardirqs_off_caller+0x43/0xb1
      [  599.709340] Code: 29 f0 49 39 d8 4c 0f 47 c3 49 03 81 58 01 00 00 44 89 c1 4c 01 c2 4c 29 c3 48 c1 f8 03 49 0f af c4 48 c1 e0 0c 4c 01 e8 48 01 c6 <f3> a4 31 f6 4$
      [  599.709340] RIP: read_extent_buffer+0xdb/0xf4 [btrfs] RSP: ffffc9000665ba00
      [  599.762057] ---[ end trace fe00d7af61b9f49e ]---
      
      This is because the 'offset' field starts at an offset of 37 bytes
      (offsetof(struct btrfs_file_extent_item, offset)), has a length of 8
      bytes and therefore attemping to read it causes a 1 byte access beyond
      the end of the leaf, as the first item's content in a leaf is located
      at the tail of the leaf, the item size is 44 bytes and the offset of
      that field plus its length (37 + 8 = 45) goes beyond the item's size
      by 1 byte.
      
      So fix this by accessing the 'offset' and 'disk_bytenr' fields after
      jumping to the 'out' label if we are processing an inline extent. We
      move the reading operation of the 'disk_bytenr' field too because we
      have the same problem as for the 'offset' field explained above when
      the inline data is less then 8 bytes. The access to the 'generation'
      field is also moved but just for the sake of grouping access to all
      the fields.
      
      Fixes: e1cbfd7b ("Btrfs: send, fix file hole not being preserved due to inline extent")
      Cc: <stable@vger.kernel.org>  # v4.12+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      24e52b11
    • F
      Btrfs: incremental send, fix invalid path for link commands · f5962781
      Filipe Manana 提交于
      In some scenarios an incremental send stream can contain link commands
      with an invalid target path. Such scenarios happen after moving some
      directory inode A, renaming a regular file inode B into the old name of
      inode A and finally creating a new hard link for inode B at directory
      inode A.
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
        |      |--- dir2/                                      (ino 258)
        |             |--- dir3/                               (ino 259)
        |                   |--- file1                         (ino 261)
        |                   |--- dir4/                         (ino 262)
        |
        |--- dir5/                                             (ino 260)
      
      Send snapshot:
      
        .                                                      (ino 256)
        |
        |--- dir1/                                             (ino 257)
               |--- dir2/                                      (ino 258)
               |      |--- dir3/                               (ino 259)
               |            |--- dir4                          (ino 261)
               |
               |--- dir6/                                      (ino 263)
                      |--- dir44/                              (ino 262)
                             |--- file11                       (ino 261)
                             |--- dir55/                       (ino 260)
      
      When attempting to apply the corresponding incremental send stream, a
      link command contains an invalid target path which makes the receiver
      fail. The following is the verbose output of the btrfs receive command:
      
        receiving snapshot mysnap2 uuid=90076fe6-5ba6-e64a-9321-9279670ed16b (...)
        utimes
        utimes dir1
        utimes dir1/dir2/dir3
        utimes
        rename dir1/dir2/dir3/dir4 -> o262-7-0
        link dir1/dir2/dir3/dir4 -> dir1/dir2/dir3/file1
        link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1
        ERROR: link dir1/dir2/dir3/dir4/file11 -> dir1/dir2/dir3/file1 failed: Not a directory
      
      The following steps happen during the computation of the incremental send
      stream the lead to this issue:
      
      1) When processing inode 261, we orphanize inode 262 due to a name/location
         collision with one of the new hard links for inode 261 (created in the
         second step below).
      
      2) We create one of the 2 new hard links for inode 261, the one whose
         location is at "dir1/dir2/dir3/dir4".
      
      3) We then attempt to create the other new hard link for inode 261, which
         has inode 262 as its parent directory. Because the path for this new
         hard link was computed before we started processing the new references
         (hard links), it reflects the old name/location of inode 262, that is,
         it does not account for the orphanization step that happened when
         we started processing the new references for inode 261, whence it is
         no longer valid, causing the receiver to fail.
      
      So fix this issue by recomputing the full path of new references if we
      ended up orphanizing other inodes which are directories.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      f5962781
  2. 30 6月, 2017 17 次提交
    • Q
      btrfs: Remove false alert when fiemap range is smaller than on-disk extent · 848c23b7
      Qu Wenruo 提交于
      Commit 4751832d ("btrfs: fiemap: Cache and merge fiemap extent before
      submit it to user") introduced a warning to catch unemitted cached
      fiemap extent.
      
      However such warning doesn't take the following case into consideration:
      
      0			4K			8K
      |<---- fiemap range --->|
      |<----------- On-disk extent ------------------>|
      
      In this case, the whole 0~8K is cached, and since it's larger than
      fiemap range, it break the fiemap extent emit loop.
      This leaves the fiemap extent cached but not emitted, and caught by the
      final fiemap extent sanity check, causing kernel warning.
      
      This patch removes the kernel warning and renames the sanity check to
      emit_last_fiemap_cache() since it's possible and valid to have cached
      fiemap extent.
      Reported-by: NDavid Sterba <dsterba@suse.cz>
      Reported-by: NAdam Borowski <kilobyte@angband.pl>
      Fixes: 4751832d ("btrfs: fiemap: Cache and merge fiemap extent ...")
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      848c23b7
    • J
      btrfs: Don't clear SGID when inheriting ACLs · b7f8a09f
      Jan Kara 提交于
      When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit
      set, DIR1 is expected to have SGID bit set (and owning group equal to
      the owning group of 'DIR0'). However when 'DIR0' also has some default
      ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on
      'DIR1' to get cleared if user is not member of the owning group.
      
      Fix the problem by moving posix_acl_update_mode() out of
      __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be
      called when inheriting ACLs which is what we want as it prevents SGID
      bit clearing and the mode has been properly set by posix_acl_create()
      anyway.
      
      Fixes: 07393101
      CC: stable@vger.kernel.org
      CC: linux-btrfs@vger.kernel.org
      CC: David Sterba <dsterba@suse.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b7f8a09f
    • C
      btrfs: fix integer overflow in calc_reclaim_items_nr · 6374e57a
      Chris Mason 提交于
      Dave Jones hit a WARN_ON(nr < 0) in btrfs_wait_ordered_roots() with
      v4.12-rc6.  This was because commit 70e7af24 made it possible for
      calc_reclaim_items_nr() to return a negative number.  It's not really a
      bug in that commit, it just didn't go far enough down the stack to find
      all the possible 64->32 bit overflows.
      
      This switches calc_reclaim_items_nr() to return a u64 and changes everyone
      that uses the results of that math to u64 as well.
      Reported-by: NDave Jones <davej@codemonkey.org.uk>
      Fixes: 70e7af24 ("Btrfs: fix delalloc accounting leak caused by u32 overflow")
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      6374e57a
    • D
      btrfs: scrub: fix target device intialization while setting up scrub context · ded56184
      David Sterba 提交于
      The commit "btrfs: scrub: inline helper scrub_setup_wr_ctx" inlined a
      helper but wrongly sets up the target device. Incidentally there's a
      local variable with the same name as a parameter in the previous
      function, so this got caught during runtime as crash in test btrfs/027.
      Reported-by: NChris Mason <clm@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ded56184
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow by only freeing reserved ranges · bc42bda2
      Qu Wenruo 提交于
      [BUG]
      For the following case, btrfs can underflow qgroup reserved space
      at an error path:
      (Page size 4K, function name without "btrfs_" prefix)
      
               Task A                  |             Task B
      ----------------------------------------------------------------------
      Buffered_write [0, 2K)           |
      |- check_data_free_space()       |
      |  |- qgroup_reserve_data()      |
      |     Range aligned to page      |
      |     range [0, 4K)          <<< |
      |     4K bytes reserved      <<< |
      |- copy pages to page cache      |
                                       | Buffered_write [2K, 4K)
                                       | |- check_data_free_space()
                                       | |  |- qgroup_reserved_data()
                                       | |     Range alinged to page
                                       | |     range [0, 4K)
                                       | |     Already reserved by A <<<
                                       | |     0 bytes reserved      <<<
                                       | |- delalloc_reserve_metadata()
                                       | |  And it *FAILED* (Maybe EQUOTA)
                                       | |- free_reserved_data_space()
                                            |- qgroup_free_data()
                                               Range aligned to page range
                                               [0, 4K)
                                               Freeing 4K
      (Special thanks to Chandan for the detailed report and analyse)
      
      [CAUSE]
      Above Task B is freeing reserved data range [0, 4K) which is actually
      reserved by Task A.
      
      And at writeback time, page dirty by Task A will go through writeback
      routine, which will free 4K reserved data space at file extent insert
      time, causing the qgroup underflow.
      
      [FIX]
      For btrfs_qgroup_free_data(), add @reserved parameter to only free
      data ranges reserved by previous btrfs_qgroup_reserve_data().
      So in above case, Task B will try to free 0 byte, so no underflow.
      Reported-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Tested-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      bc42bda2
    • Q
      btrfs: qgroup: Introduce extent changeset for qgroup reserve functions · 364ecf36
      Qu Wenruo 提交于
      Introduce a new parameter, struct extent_changeset for
      btrfs_qgroup_reserved_data() and its callers.
      
      Such extent_changeset was used in btrfs_qgroup_reserve_data() to record
      which range it reserved in current reserve, so it can free it in error
      paths.
      
      The reason we need to export it to callers is, at buffered write error
      path, without knowing what exactly which range we reserved in current
      allocation, we can free space which is not reserved by us.
      
      This will lead to qgroup reserved space underflow.
      Reviewed-by: NChandan Rajendra <chandan@linux.vnet.ibm.com>
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      364ecf36
    • Q
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write... · a12b877b
      Qu Wenruo 提交于
      btrfs: qgroup: Fix qgroup reserved space underflow caused by buffered write and quotas being enabled
      
      [BUG]
      Under the following case, we can underflow qgroup reserved space.
      
                  Task A                |            Task B
      ---------------------------------------------------------------
       Quota disabled                   |
       Buffered write                   |
       |- btrfs_check_data_free_space() |
       |  *NO* qgroup space is reserved |
       |  since quota is *DISABLED*     |
       |- All pages are copied to page  |
          cache                         |
                                        | Enable quota
                                        | Quota scan finished
                                        |
                                        | Sync_fs
                                        | |- run_delalloc_range
                                        | |- Write pages
                                        | |- btrfs_finish_ordered_io
                                        |    |- insert_reserved_file_extent
                                        |       |- btrfs_qgroup_release_data()
                                        |          Since no qgroup space is
                                                   reserved in Task A, we
                                                   underflow qgroup reserved
                                                   space
      This can be detected by fstest btrfs/104.
      
      [CAUSE]
      In insert_reserved_file_extent() we tell qgroup to release the @ram_bytes
      size of qgroup reserved_space in all cases.
      And btrfs_qgroup_release_data() will check if quotas are enabled.
      
      However in the above case, the buffered write happens before quota is
      enabled, so we don't have the reserved space for that range.
      
      [FIX]
      In insert_reserved_file_extent(), we tell qgroup to release the acctual
      byte number it released.
      In the above case, since we don't have the reserved space, we tell
      qgroups to release 0 byte, so the problem can be fixed.
      
      And thanks to the @reserved parameter introduced by the qgroup rework,
      and previous patch to return released bytes, the fix can be as small as
      10 lines.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog updates ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a12b877b
    • Q
      btrfs: qgroup: Return actually freed bytes for qgroup release or free data · 7bc329c1
      Qu Wenruo 提交于
      btrfs_qgroup_release/free_data() only returns 0 or a negative error
      number (ENOMEM is the only possible error).
      
      This is normally good enough, but sometimes we need the exact byte
      count it freed/released.
      
      Change it to return actually released/freed bytenr number instead of 0
      for success.
      And slightly modify related extent_changeset structure, since in btrfs
      one no-hole data extent won't be larger than 128M, so "unsigned int"
      is large enough for the use case.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7bc329c1
    • Q
      btrfs: qgroup: Cleanup btrfs_qgroup_prepare_account_extents function · d1b8b94a
      Qu Wenruo 提交于
      Quite a lot of qgroup corruption happens due to wrong time of calling
      btrfs_qgroup_prepare_account_extents().
      
      Since the safest time is to call it just before
      btrfs_qgroup_account_extents(), there is no need to separate these 2
      functions.
      
      Merging them will make code cleaner and less bug prone.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1b8b94a
    • Q
      btrfs: qgroup: Add quick exit for non-fs extents · 5edfd9fd
      Qu Wenruo 提交于
      Modify btrfs_qgroup_account_extent() to exit quicker for non-fs extents.
      
      The quick exit condition is:
      1) The extent belongs to a non-fs tree
         Only fs-tree extents can affect qgroup numbers and is the only case
         where extent can be shared between different trees.
      
         Although strictly speaking extent in data-reloc or tree-reloc tree
         can be shared, data/tree-reloc root won't appear in the result of
         btrfs_find_all_roots(), so we can ignore such case.
      
         So we can check the first root in old_roots/new_roots ulist.
         - if we find the 1st root is a not a fs/subvol root, then we can skip
           the extent
         - if we find the 1st root is a fs/subvol root, then we must continue
           calculation
      
      OR
      
      2) both 'nr_old_roots' and 'nr_new_roots' are 0
         This means either such extent got allocated then freed in current
         transaction or it's a new reloc tree extent, whose nr_new_roots is 0.
         Either way it won't affect qgroup accounting and can be skipped
         safely.
      
      Such quick exit can make trace output more quite and less confusing:
      (example with fs uuid and time stamp removed)
      
      Before:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      btrfs_qgroup_account_extent: bytenr=29556736 num_bytes=16384 nr_old_roots=0 nr_new_roots=1
      ------
      Extent tree block will trigger btrfs_qgroup_account_extent() trace point
      while no qgroup number is changed, as extent tree won't affect qgroup
      accounting.
      
      After:
      ------
      add_delayed_tree_ref: bytenr=29556736 num_bytes=16384 action=ADD_DELAYED_REF parent=0(-) ref_root=2(EXTENT_TREE) level=0 type=TREE_BLOCK_REF seq=0
      ------
      Now such unrelated extent won't trigger btrfs_qgroup_account_extent()
      trace point, making the trace less noisy.
      Signed-off-by: NQu Wenruo <quwenruo@cn.fujitsu.com>
      [ changelog and comment adjustments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5edfd9fd
    • O
      Btrfs: rework delayed ref total_bytes_pinned accounting · d7eae340
      Omar Sandoval 提交于
      The total_bytes_pinned counter is completely broken when accounting
      delayed refs:
      
      - If two drops for the same extent are merged, we will decrement
        total_bytes_pinned twice but only increment it once.
      - If an add is merged into a drop or vice versa, we will decrement the
        total_bytes_pinned counter but never increment it.
      - If multiple references to an extent are dropped, we will account it
        multiple times, potentially vastly over-estimating the number of bytes
        that will be freed by a commit and doing unnecessary work when we're
        close to ENOSPC.
      
      The last issue is relatively minor, but the first two make the
      total_bytes_pinned counter leak or underflow very often. These
      accounting issues were introduced in b150a4f1 ("Btrfs: use a percpu
      to keep track of possibly pinned bytes"), but they were papered over by
      zeroing out the counter on every commit until d288db5d ("Btrfs: fix
      race of using total_bytes_pinned").
      
      We need to make sure that an extent is accounted as pinned exactly once
      if and only if we will drop references to it when when the transaction
      is committed. Ideally we would only add to total_bytes_pinned when the
      *last* reference is dropped, but this information isn't readily
      available for data extents. Again, this over-estimation can lead to
      extra commits when we're close to ENOSPC, but it's not as bad as before.
      
      The fix implemented here is to increment total_bytes_pinned when the
      total refmod count for an extent goes negative and decrement it if the
      refmod count goes back to non-negative or after we've run all of the
      delayed refs for that extent.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d7eae340
    • O
      Btrfs: return old and new total ref mods when adding delayed refs · 7be07912
      Omar Sandoval 提交于
      We need this to decide when to account pinned bytes.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7be07912
    • O
      Btrfs: always account pinned bytes when dropping a tree block ref · 0a16c7d7
      Omar Sandoval 提交于
      Currently, we only increment total_bytes_pinned in
      btrfs_free_tree_block() when dropping the last reference on the block.
      However, when the delayed ref is run later, we will decrement
      total_bytes_pinned regardless of whether it was the last reference or
      not. This causes the counter to underflow when the reference we dropped
      was not the last reference. Fix it by incrementing the counter
      unconditionally, which is what btrfs_free_extent() does. This makes
      total_bytes_pinned an overestimate when references to shared extents are
      dropped, but in the worst case this will just make us try to commit the
      transaction to try to free up space and find we didn't free enough.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0a16c7d7
    • O
      Btrfs: update total_bytes_pinned when pinning down extents · 4da8b76d
      Omar Sandoval 提交于
      The extents marked in pin_down_extent() will be unpinned later in
      unpin_extent_range(), which decrements total_bytes_pinned.
      pin_down_extent() must increment the counter to avoid underflowing it.
      Also adjust btrfs_free_tree_block() to avoid accounting for the same
      extent twice.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4da8b76d
    • O
      Btrfs: make BUG_ON() in add_pinned_bytes() an ASSERT() · 55e8196a
      Omar Sandoval 提交于
      The value of flags is one of DATA/METADATA/SYSTEM, they must exist at
      when add_pinned_bytes is called.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ added changelog ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      55e8196a
    • O
      Btrfs: make add_pinned_bytes() take an s64 num_bytes instead of u64 · 0d9f824d
      Omar Sandoval 提交于
      There are a few places where we pass in a negative num_bytes, so make it
      signed for clarity. Also move it up in the file since later patches will
      need it there.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d9f824d
    • D
      btrfs: fix validation of XATTR_ITEM dir items · 1164a9fb
      David Sterba 提交于
      The XATTR_ITEM is a type of a directory item so we use the common
      validator helper. Unlike other dir items, it can have data. The way the
      name len validation is currently implemented does not reflect that. We'd
      have to adjust by the data_len when comparing the read and item limits.
      
      However, this will not work for multi-item xattr dir items.
      
      Example from tree dump of generic/337:
      
              item 7 key (257 XATTR_ITEM 751495445) itemoff 15667 itemsize 147
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 3 name_len 11
                      name: user.foobar
                      data 123
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 6 name_len 13
                      name: user.WvG1c1Td
                      data qwerty
                      location key (0 UNKNOWN.0 0) type XATTR
                      transid 8 data_len 5 name_len 19
                      name: user.J3__T_Km3dVsW_
                      data hello
      
      At the point of btrfs_is_name_len_valid call we don't have access to the
      data_len value of the 2nd and 3rd sub-item. So simple btrfs_dir_data_len(leaf,
      di) would always return 3, although we'd need to get 6 and 5 respectively to
      get the claculations right. (read_end + name_len + data_len vs item_end)
      
      We'd have to also pass data_len externally, which is not point of the
      name validation. The last check is supposed to test if there's at least
      one dir item space after the one we're processing. I don't think this is
      particularly useful, validation of the next item would catch that too.
      So the check is removed and we don't weaken the validation. Now tests
      btrfs/048, btrfs/053, generic/273 and generic/337 pass.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1164a9fb
  3. 22 6月, 2017 12 次提交
  4. 21 6月, 2017 4 次提交
    • F
      Btrfs: incremental send, fix invalid path for unlink commands · fdb13889
      Filipe Manana 提交于
      An incremental send can contain unlink operations with an invalid target
      path when we rename some directory inode A, then rename some file inode B
      to the old name of inode A and directory inode A is an ancestor of inode B
      in the parent snapshot (but not anymore in the send snapshot).
      
      Consider the following example scenario where this issue happens.
      
      Parent snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |     |--- file1                                 (ino 259)
             |     |--- file3                                 (ino 261)
             |
             |--- dir3/                                       (ino 262)
                   |--- file22                                (ino 260)
                   |--- dir4/                                 (ino 263)
      
      Send snapshot:
      
       .                                                      (ino 256)
       |
       |--- dir1/                                             (ino 257)
             |--- dir2/                                       (ino 258)
             |--- dir3                                        (ino 260)
             |--- file3/                                      (ino 262)
                   |--- dir4/                                 (ino 263)
                         |--- file11                          (ino 269)
                         |--- file33                          (ino 261)
      
      When attempting to apply the corresponding incremental send stream, an
      unlink operation contains an invalid path which makes the receiver fail.
      The following is verbose output of the btrfs receive command:
      
       receiving snapshot snap2 uuid=7d5450da-a573-e043-a451-ec85f4879f0f (...)
       utimes
       utimes dir1
       utimes dir1/dir2
       link dir1/dir3/dir4/file11 -> dir1/dir2/file1
       unlink dir1/dir2/file1
       utimes dir1/dir2
       truncate dir1/dir3/dir4/file11 size=0
       utimes dir1/dir3/dir4/file11
       rename dir1/dir3 -> o262-7-0
       link dir1/dir3 -> o262-7-0/file22
       unlink dir1/dir3/file22
       ERROR: unlink dir1/dir3/file22 failed. Not a directory
      
      The following steps happen during the computation of the incremental send
      stream the lead to this issue:
      
      1) Before we start processing the new and deleted references for inode
         260, we compute the full path of the deleted reference
         ("dir1/dir3/file22") and cache it in the list of deleted references
         for our inode.
      
      2) We then start processing the new references for inode 260, for which
         there is only one new, located at "dir1/dir3". When processing this
         new reference, we check that inode 262, which was not yet processed,
         collides with the new reference and because of that we orphanize
         inode 262 so its new full path becomes "o262-7-0".
      
      3) After the orphanization of inode 262, we create the new reference for
         inode 260 by issuing a link command with a target path of "dir1/dir3"
         and a source path of "o262-7-0/file22".
      
      4) We then start processing the deleted references for inode 260, for
         which there is only one with the base name of "file22", and issue
         an unlink operation containing the target path computed at step 1,
         which is wrong because that path no longer exists and should be
         replaced with "o262-7-0/file22".
      
      So fix this issue by recomputing the full path of deleted references if
      when we processed the new references for an inode we ended up orphanizing
      any other inode that is an ancestor of our inode in the parent snapshot.
      
      A test case for fstests follows soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      [ adjusted after prev patch removed fs_path::dir_path and dir_path_len ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fdb13889
    • F
      Btrfs: send, fix invalid path after renaming and linking file · 72c3668f
      Filipe Manana 提交于
      Currently an incremental snapshot can generate link operations which
      contain an invalid target path. Such case happens when in the send
      snapshot a file was renamed, a new hard link added for it and some
      other inode (with a lower number) got renamed to the former name of
      that file. Example:
      
      Parent snapshot
      
       .                  (ino 256)
       |
       |--- f1            (ino 257)
       |--- f2            (ino 258)
       |--- f3            (ino 259)
      
      Send snapshot
      
       .                  (ino 256)
       |
       |--- f2            (ino 257)
       |--- f3            (ino 258)
       |--- f4            (ino 259)
       |--- f5            (ino 258)
      
      The following steps happen when computing the incremental send stream:
      
      1) When processing inode 257, inode 258 is orphanized (renamed to
         "o258-7-0"), because its current reference has the same name as the
         new reference for inode 257;
      
      2) When processing inode 258, we iterate over all its new references,
         which have the names "f3" and "f5". The first iteration sees name
         "f5" and renames the inode from its orphan name ("o258-7-0") to
         "f5", while the second iteration sees the name "f3" and, incorrectly,
         issues a link operation with a target name matching the orphan name,
         which no longer exists. The first iteration had reset the current
         valid path of the inode to "f5", but in the second iteration we lost
         it because we found another inode, with a higher number of 259, which
         has a reference named "f3" as well, so we orphanized inode 259 and
         recomputed the current valid path of inode 258 to its old orphan
         name because inode 259 could be an ancestor of inode 258 and therefore
         the current valid path could contain the pre-orphanization name of
         inode 259. However in this case inode 259 is not an ancestor of inode
         258 so the current valid path should not be recomputed.
         This makes the receiver fail with the following error:
      
         ERROR: link f3 -> o258-7-0 failed: No such file or directory
      
      So fix this by not recomputing the current valid path for an inode
      whenever we find a colliding reference from some not yet processed inode
      (inode number higher then the one currently being processed), unless
      that other inode is an ancestor of the one we are currently processing.
      
      A test case for fstests will follow soon.
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      72c3668f
    • F
      Btrfs: fix invalid extent maps due to hole punching · 609805d8
      Filipe Manana 提交于
      While punching a hole in a range that is not aligned with the sector size
      (currently the same as the page size) we can end up leaving an extent map
      in memory with a length that is smaller then the sector size or with a
      start offset that is not aligned to the sector size. Both cases are not
      expected and can lead to problems. This issue is easily detected
      after the patch from commit a7e3b975 ("Btrfs: fix reported number of
      inode blocks"), introduced in kernel 4.12-rc1, in a scenario like the
      following for example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -c "pwrite -S 0xaa -b 100K 0 100K" /mnt/foo
        $ xfs_io -c "fpunch 60K 90K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xbb -b 100K 50K 100K" /mnt/foo
        $ xfs_io -c "pwrite -S 0xcc -b 50K 100K 50K" /mnt/foo
        $ umount /mnt
      
      After the unmount operation we can see several warnings emmitted due to
      underflows related to space reservation counters:
      
      [ 2837.443299] ------------[ cut here ]------------
      [ 2837.447395] WARNING: CPU: 8 PID: 2474 at fs/btrfs/inode.c:9444 btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.452108] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button se
      rio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_gene
      ric raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.458389] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.459754] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.462379] Call Trace:
      [ 2837.462379]  dump_stack+0x68/0x92
      [ 2837.462379]  __warn+0xc2/0xdd
      [ 2837.462379]  warn_slowpath_null+0x1d/0x1f
      [ 2837.462379]  btrfs_destroy_inode+0xe8/0x27e [btrfs]
      [ 2837.462379]  destroy_inode+0x3d/0x55
      [ 2837.462379]  evict+0x177/0x17e
      [ 2837.462379]  dispose_list+0x50/0x71
      [ 2837.462379]  evict_inodes+0x132/0x141
      [ 2837.462379]  generic_shutdown_super+0x3f/0xeb
      [ 2837.462379]  kill_anon_super+0x12/0x1c
      [ 2837.462379]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.462379]  deactivate_locked_super+0x30/0x68
      [ 2837.462379]  deactivate_super+0x36/0x39
      [ 2837.462379]  cleanup_mnt+0x58/0x76
      [ 2837.462379]  __cleanup_mnt+0x12/0x14
      [ 2837.462379]  task_work_run+0x77/0x9b
      [ 2837.462379]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.462379]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.462379]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.462379] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.462379] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.462379] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.462379] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.462379] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.462379] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.462379] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.519355] ---[ end trace e79345fe24b30b8d ]---
      [ 2837.596256] ------------[ cut here ]------------
      [ 2837.597625] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5699 btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.603547] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.659372] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.663359] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.663359] Call Trace:
      [ 2837.663359]  dump_stack+0x68/0x92
      [ 2837.663359]  __warn+0xc2/0xdd
      [ 2837.663359]  warn_slowpath_null+0x1d/0x1f
      [ 2837.663359]  btrfs_free_block_groups+0x246/0x3eb [btrfs]
      [ 2837.663359]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.663359]  ? evict_inodes+0x132/0x141
      [ 2837.663359]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.663359]  generic_shutdown_super+0x6a/0xeb
      [ 2837.663359]  kill_anon_super+0x12/0x1c
      [ 2837.663359]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.663359]  deactivate_locked_super+0x30/0x68
      [ 2837.663359]  deactivate_super+0x36/0x39
      [ 2837.663359]  cleanup_mnt+0x58/0x76
      [ 2837.663359]  __cleanup_mnt+0x12/0x14
      [ 2837.663359]  task_work_run+0x77/0x9b
      [ 2837.663359]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.663359]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.663359]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.663359] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.663359] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.663359] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.663359] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.663359] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.663359] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.663359] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.739445] ---[ end trace e79345fe24b30b8e ]---
      [ 2837.745595] ------------[ cut here ]------------
      [ 2837.746412] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:5700 btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.747955] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.755395] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.756769] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.758526] Call Trace:
      [ 2837.758925]  dump_stack+0x68/0x92
      [ 2837.759383]  __warn+0xc2/0xdd
      [ 2837.759383]  warn_slowpath_null+0x1d/0x1f
      [ 2837.759383]  btrfs_free_block_groups+0x261/0x3eb [btrfs]
      [ 2837.759383]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.759383]  ? evict_inodes+0x132/0x141
      [ 2837.759383]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.759383]  generic_shutdown_super+0x6a/0xeb
      [ 2837.759383]  kill_anon_super+0x12/0x1c
      [ 2837.759383]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.759383]  deactivate_locked_super+0x30/0x68
      [ 2837.759383]  deactivate_super+0x36/0x39
      [ 2837.759383]  cleanup_mnt+0x58/0x76
      [ 2837.759383]  __cleanup_mnt+0x12/0x14
      [ 2837.759383]  task_work_run+0x77/0x9b
      [ 2837.759383]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.759383]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.759383]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.759383] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.759383] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.759383] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.759383] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.759383] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.759383] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.759383] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.777063] ---[ end trace e79345fe24b30b8f ]---
      [ 2837.778235] ------------[ cut here ]------------
      [ 2837.778856] WARNING: CPU: 8 PID: 2474 at fs/btrfs/extent-tree.c:9825 btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.791385] Modules linked in: dm_flakey dm_mod ppdev parport_pc psmouse parport sg pcspkr acpi_cpufreq tpm_tis tpm_tis_core i2c_piix4 i2c_core evdev tpm button serio_raw sunrpc loop autofs4 ext4 crc16 jbd2 mbcache btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c crc32c_generic raid1 raid0 multipath linear md_mod sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix libata virtio_pci virtio_ring virtio e1000 scsi_mod floppy
      [ 2837.797711] CPU: 8 PID: 2474 Comm: umount Tainted: G        W       4.10.0-rc8-btrfs-next-43+ #1
      [ 2837.798594] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
      [ 2837.800118] Call Trace:
      [ 2837.800515]  dump_stack+0x68/0x92
      [ 2837.801015]  __warn+0xc2/0xdd
      [ 2837.801471]  warn_slowpath_null+0x1d/0x1f
      [ 2837.801698]  btrfs_free_block_groups+0x348/0x3eb [btrfs]
      [ 2837.801698]  close_ctree+0x1dd/0x2e1 [btrfs]
      [ 2837.801698]  ? evict_inodes+0x132/0x141
      [ 2837.801698]  btrfs_put_super+0x15/0x17 [btrfs]
      [ 2837.801698]  generic_shutdown_super+0x6a/0xeb
      [ 2837.801698]  kill_anon_super+0x12/0x1c
      [ 2837.801698]  btrfs_kill_super+0x16/0x21 [btrfs]
      [ 2837.801698]  deactivate_locked_super+0x30/0x68
      [ 2837.801698]  deactivate_super+0x36/0x39
      [ 2837.801698]  cleanup_mnt+0x58/0x76
      [ 2837.801698]  __cleanup_mnt+0x12/0x14
      [ 2837.801698]  task_work_run+0x77/0x9b
      [ 2837.801698]  prepare_exit_to_usermode+0x9d/0xc5
      [ 2837.801698]  syscall_return_slowpath+0x196/0x1b9
      [ 2837.801698]  entry_SYSCALL_64_fastpath+0xab/0xad
      [ 2837.801698] RIP: 0033:0x7f3ef3e6b9a7
      [ 2837.801698] RSP: 002b:00007ffdd0d8de58 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 2837.801698] RAX: 0000000000000000 RBX: 0000556f76a39060 RCX: 00007f3ef3e6b9a7
      [ 2837.801698] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000556f76a3f910
      [ 2837.801698] RBP: 0000556f76a3f910 R08: 0000556f76a3e670 R09: 0000000000000015
      [ 2837.801698] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007f3ef436ce64
      [ 2837.801698] R13: 0000000000000000 R14: 0000556f76a39240 R15: 00007ffdd0d8e0e0
      [ 2837.818441] ---[ end trace e79345fe24b30b90 ]---
      [ 2837.818991] BTRFS info (device sdc): space_info 1 has 7974912 free, is not full
      [ 2837.819830] BTRFS info (device sdc): space_info total=8388608, used=417792, pinned=0, reserved=0, may_use=18446744073709547520, readonly=0
      
      What happens in the above example is the following:
      
      1) When punching the hole, at btrfs_punch_hole(), the variable tail_len
         is set to 2048 (as tail_start is 148Kb + 1 and offset + len is 150Kb).
         This results in the creation of an extent map with a length of 2Kb
         starting at file offset 148Kb, through find_first_non_hole() ->
         btrfs_get_extent().
      
      2) The second write (first write after the hole punch operation), sets
         the range [50Kb, 152Kb[ to delalloc.
      
      3) The third write, at btrfs_find_new_delalloc_bytes(), sees the extent
         map covering the range [148Kb, 150Kb[ and ends up calling
         set_extent_bit() for the same range, which results in splitting an
         existing extent state record, covering the range [148Kb, 152Kb[ into
         two 2Kb extent state records, covering the ranges [148Kb, 150Kb[ and
         [150Kb, 152Kb[.
      
      4) Finally at lock_and_cleanup_extent_if_need(), immediately after calling
         btrfs_find_new_delalloc_bytes() we clear the delalloc bit from the
         range [100Kb, 152Kb[ which results in the btrfs_clear_bit_hook()
         callback being invoked against the two 2Kb extent state records that
         cover the ranges [148Kb, 150Kb[ and [150Kb, 152Kb[. When called against
         the first 2Kb extent state, it calls btrfs_delalloc_release_metadata()
         with a length argument of 2048 bytes. That function rounds up the length
         to a sector size aligned length, so it ends up considering a length of
         4096 bytes, and then calls calc_csum_metadata_size() which results in
         decrementing the inode's csum_bytes counter by 4096 bytes, so after
         it stays a value of 0 bytes. Then the same happens when
         btrfs_clear_bit_hook() is called against the second extent state that
         has a length of 2Kb, covering the range [150Kb, 152Kb[, the length is
         rounded up to 4096 and calc_csum_metadata_size() ends up being called
         to decrement 4096 bytes from the inode's csum_bytes counter, which
         at that time has a value of 0, leading to an underflow, which is
         exactly what triggers the first warning, at btrfs_destroy_inode().
         All the other warnings relate to several space accounting counters
         that underflow as well due to similar reasons.
      
      A similar case but where the hole punching operation creates an extent map
      with a start offset not aligned to the sector size is the following:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
        $ xfs_io -f -c "fpunch 695K 820K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xaa 1008K 307K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xbb -b 630K 1073K 630K" $SCRATCH_MNT/bar
        $ xfs_io -c "pwrite -S 0xcc -b 459K 1068K 459K" $SCRATCH_MNT/bar
        $ umount /mnt
      
      During the unmount operation we get similar traces for the same reasons as
      in the first example.
      
      So fix the hole punching operation to make sure it never creates extent
      maps with a length that is not aligned to the sector size nor with a start
      offset that is not aligned to the sector size, as this breaks all
      assumptions and it's a land mine.
      
      Fixes: d7781546 ("btrfs: Avoid trucating page or punching hole in a already existed hole.")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NLiu Bo <bo.li.liu@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      609805d8
    • J
      btrfs: add cond_resched to btrfs_qgroup_trace_leaf_items · cddf3b2c
      Jeff Mahoney 提交于
      On an uncontended system, we can end up hitting soft lockups while
      doing replace_path.  At the core, and frequently called is
      btrfs_qgroup_trace_leaf_items, so it makes sense to add a cond_resched
      there.
      Signed-off-by: NJeff Mahoney <jeffm@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      cddf3b2c
  5. 20 6月, 2017 5 次提交
    • N
      btrfs: Round down values which are written for total_bytes_size · 7dfb8be1
      Nikolay Borisov 提交于
      We got an internal report about a file system not wanting to mount
      following 99e3ecfc ("Btrfs: add more validation checks for
      superblock").
      
      BTRFS error (device sdb1): super_total_bytes 1000203816960 mismatch with
      fs_devices total_rw_bytes 1000203820544
      
      Subtracting the numbers we get a difference of less than a 4kb. Upon
      closer inspection it became apparent that mkfs actually rounds down the
      size of the device to a multiple of sector size. However, the same
      cannot be said for various functions which modify the total size and are
      called from btrfs_balance as well as when adding a new device. So this
      patch ensures that values being saved into on-disk data structures are
      always rounded down to a multiple of sectorsize.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7dfb8be1
    • N
      btrfs: Manually implement device_total_bytes getter/setter · eca152ed
      Nikolay Borisov 提交于
      The device->total_bytes member needs to always be rounded down to sectorsize
      so that it corresponds to the value of super->total_bytes. However, there are
      multiple places where the setter is fed a value which is not rounded which
      can cause a fs to be unmountable due to the check introduced in
      99e3ecfc ("Btrfs: add more validation checks for superblock"). This patch
      implements the getter/setter manually so that in a later patch I can add
      necessary code to catch offenders.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eca152ed
    • D
      btrfs: obsolete and remove mount option alloc_start · 0d0c71b3
      David Sterba 提交于
      The mount option alloc_start was used in the past for debugging and
      stressing the chunk allocator. Not meant to be used by users, so we're
      not breaking anybody's setup.
      
      There was some added complexity handling changes of the value and when
      it was not same as default. Such code has likely been untested and I
      think it's better to remove it.
      
      This patch kills all use of alloc_start, and by doing that also fixes
      a bug when alloc_size is set, potentially called from statfs:
      
      in btrfs_calc_avail_data_space, traversing the list in RCU, the RCU
      protection is temporarily dropped so btrfs_account_dev_extents_size can
      be called and then RCU is locked again! Doing that inside
      list_for_each_entry_rcu is just asking for trouble, but unlikely to be
      observed in practice.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0d0c71b3
    • D
      btrfs: move fs_info::fs_frozen to the flags · fac03c8d
      David Sterba 提交于
      We can keep the state among the other fs_info flags, there's no reason
      why fs_frozen would need to be separate.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fac03c8d
    • D
      btrfs: cleanup duplicate return value in insert_inline_extent · 79b4f4c6
      David Sterba 提交于
      The pattern when err is used for function exit and ret is used for
      return values of callees is not used here.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      79b4f4c6