1. 19 1月, 2019 3 次提交
    • J
      btrfs: wakeup cleaner thread when adding delayed iput · fd340d0f
      Josef Bacik 提交于
      The cleaner thread usually takes care of delayed iputs, with the
      exception of the btrfs_end_transaction_throttle path.  Delaying iputs
      means we are potentially delaying the eviction of an inode and it's
      respective space.  The cleaner thread only gets woken up every 30
      seconds, or when we require space.  If there are a lot of inodes that
      need to be deleted we could induce a serious amount of latency while we
      wait for these inodes to be evicted.  So instead wakeup the cleaner if
      it's not already awake to process any new delayed iputs we add to the
      list.  If we suddenly need space we will less likely be backed up
      behind a bunch of inodes that are waiting to be deleted, and we could
      possibly free space before we need to get into the flushing logic which
      will save us some latency.
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fd340d0f
    • J
      btrfs: wait on ordered extents on abort cleanup · 74d5d229
      Josef Bacik 提交于
      If we flip read-only before we initiate writeback on all dirty pages for
      ordered extents we've created then we'll have ordered extents left over
      on umount, which results in all sorts of bad things happening.  Fix this
      by making sure we wait on ordered extents if we have to do the aborted
      transaction cleanup stuff.
      
      generic/475 can produce this warning:
      
       [ 8531.177332] WARNING: CPU: 2 PID: 11997 at fs/btrfs/disk-io.c:3856 btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.183282] CPU: 2 PID: 11997 Comm: umount Tainted: G        W 5.0.0-rc1-default+ #394
       [ 8531.185164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
       [ 8531.187851] RIP: 0010:btrfs_free_fs_root+0x95/0xa0 [btrfs]
       [ 8531.193082] RSP: 0018:ffffb1ab86163d98 EFLAGS: 00010286
       [ 8531.194198] RAX: ffff9f3449494d18 RBX: ffff9f34a2695000 RCX:0000000000000000
       [ 8531.195629] RDX: 0000000000000002 RSI: 0000000000000001 RDI:0000000000000000
       [ 8531.197315] RBP: ffff9f344e930000 R08: 0000000000000001 R09:0000000000000000
       [ 8531.199095] R10: 0000000000000000 R11: ffff9f34494d4ff8 R12:ffffb1ab86163dc0
       [ 8531.200870] R13: ffff9f344e9300b0 R14: ffffb1ab86163db8 R15:0000000000000000
       [ 8531.202707] FS:  00007fc68e949fc0(0000) GS:ffff9f34bd800000(0000)knlGS:0000000000000000
       [ 8531.204851] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [ 8531.205942] CR2: 00007ffde8114dd8 CR3: 000000002dfbd000 CR4:00000000000006e0
       [ 8531.207516] Call Trace:
       [ 8531.208175]  btrfs_free_fs_roots+0xdb/0x170 [btrfs]
       [ 8531.210209]  ? wait_for_completion+0x5b/0x190
       [ 8531.211303]  close_ctree+0x157/0x350 [btrfs]
       [ 8531.212412]  generic_shutdown_super+0x64/0x100
       [ 8531.213485]  kill_anon_super+0x14/0x30
       [ 8531.214430]  btrfs_kill_super+0x12/0xa0 [btrfs]
       [ 8531.215539]  deactivate_locked_super+0x29/0x60
       [ 8531.216633]  cleanup_mnt+0x3b/0x70
       [ 8531.217497]  task_work_run+0x98/0xc0
       [ 8531.218397]  exit_to_usermode_loop+0x83/0x90
       [ 8531.219324]  do_syscall_64+0x15b/0x180
       [ 8531.220192]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [ 8531.221286] RIP: 0033:0x7fc68e5e4d07
       [ 8531.225621] RSP: 002b:00007ffde8116608 EFLAGS: 00000246 ORIG_RAX:00000000000000a6
       [ 8531.227512] RAX: 0000000000000000 RBX: 00005580c2175970 RCX:00007fc68e5e4d07
       [ 8531.229098] RDX: 0000000000000001 RSI: 0000000000000000 RDI:00005580c2175b80
       [ 8531.230730] RBP: 0000000000000000 R08: 00005580c2175ba0 R09:00007ffde8114e80
       [ 8531.232269] R10: 0000000000000000 R11: 0000000000000246 R12:00005580c2175b80
       [ 8531.233839] R13: 00007fc68eac61c4 R14: 00005580c2175a68 R15:0000000000000000
      
      Leaving a tree in the rb-tree:
      
      3853 void btrfs_free_fs_root(struct btrfs_root *root)
      3854 {
      3855         iput(root->ino_cache_inode);
      3856         WARN_ON(!RB_EMPTY_ROOT(&root->inode_tree));
      
      CC: stable@vger.kernel.org
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add stacktrace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      74d5d229
    • J
      btrfs: handle delayed ref head accounting cleanup in abort · 31890da0
      Josef Bacik 提交于
      We weren't doing any of the accounting cleanup when we aborted
      transactions.  Fix this by making cleanup_ref_head_accounting global and
      calling it from the abort code, this fixes the issue where our
      accounting was all wrong after the fs aborts.
      
      The test generic/475 on a 2G VM can trigger the problems eg.:
      
        [ 8502.136957] WARNING: CPU: 0 PID: 11064 at fs/btrfs/extent-tree.c:5986 btrfs_free_block_grou +ps+0x3dc/0x410 [btrfs]
        [ 8502.148372] CPU: 0 PID: 11064 Comm: umount Not tainted 5.0.0-rc1-default+ #394
        [ 8502.150807] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626 +cc-prebuilt.qemu-project.org 04/01/2014
        [ 8502.154317] RIP: 0010:btrfs_free_block_groups+0x3dc/0x410 [btrfs]
        [ 8502.160623] RSP: 0018:ffffb1ab84b93de8 EFLAGS: 00010206
        [ 8502.161906] RAX: 0000000001000000 RBX: ffff9f34b1756400 RCX: 0000000000000000
        [ 8502.163448] RDX: 0000000000000002 RSI: 0000000000000001 RDI: ffff9f34b1755400
        [ 8502.164906] RBP: ffff9f34b7e8c000 R08: 0000000000000001 R09: 0000000000000000
        [ 8502.166716] R10: 0000000000000000 R11: 0000000000000001 R12: ffff9f34b7e8c108
        [ 8502.168498] R13: ffff9f34b7e8c158 R14: 0000000000000000 R15: dead000000000100
        [ 8502.170296] FS:  00007fb1cf15ffc0(0000) GS:ffff9f34bd400000(0000) knlGS:0000000000000000
        [ 8502.172439] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [ 8502.173669] CR2: 00007fb1ced507b0 CR3: 000000002f7a6000 CR4: 00000000000006f0
        [ 8502.175094] Call Trace:
        [ 8502.175759]  close_ctree+0x17f/0x350 [btrfs]
        [ 8502.176721]  generic_shutdown_super+0x64/0x100
        [ 8502.177702]  kill_anon_super+0x14/0x30
        [ 8502.178607]  btrfs_kill_super+0x12/0xa0 [btrfs]
        [ 8502.179602]  deactivate_locked_super+0x29/0x60
        [ 8502.180595]  cleanup_mnt+0x3b/0x70
        [ 8502.181406]  task_work_run+0x98/0xc0
        [ 8502.182255]  exit_to_usermode_loop+0x83/0x90
        [ 8502.183113]  do_syscall_64+0x15b/0x180
        [ 8502.183919]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Corresponding to
      
        release_global_block_rsv() {
        ...
        WARN_ON(fs_info->delayed_refs_rsv.reserved > 0);
      
      CC: stable@vger.kernel.org
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      [ add log dump ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      31890da0
  2. 17 12月, 2018 12 次提交
    • A
      btrfs: Fix typos in comments and strings · 52042d8e
      Andrea Gelmini 提交于
      The typos accumulate over time so once in a while time they get fixed in
      a large patch.
      Signed-off-by: NAndrea Gelmini <andrea.gelmini@gelma.net>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      52042d8e
    • J
      btrfs: introduce delayed_refs_rsv · ba2c4d4e
      Josef Bacik 提交于
      Traditionally we've had voodoo in btrfs to account for the space that
      delayed refs may take up by having a global_block_rsv.  This works most
      of the time, except when it doesn't.  We've had issues reported and seen
      in production where sometimes the global reserve is exhausted during
      transaction commit before we can run all of our delayed refs, resulting
      in an aborted transaction.  Because of this voodoo we have equally
      dubious flushing semantics around throttling delayed refs which we often
      get wrong.
      
      So instead give them their own block_rsv.  This way we can always know
      exactly how much outstanding space we need for delayed refs.  This
      allows us to make sure we are constantly filling that reservation up
      with space, and allows us to put more precise pressure on the enospc
      system.  Instead of doing math to see if its a good time to throttle,
      the normal enospc code will be invoked if we have a lot of delayed refs
      pending, and they will be run via the normal flushing mechanism.
      
      For now the delayed_refs_rsv will hold the reservations for the delayed
      refs, the block group updates, and deleting csums.  We could have a
      separate rsv for the block group updates, but the csum deletion stuff is
      still handled via the delayed_refs so that will stay there.
      
      Historical background:
      
      The global reserve has grown to cover everything we don't reserve space
      explicitly for, and we've grown a lot of weird ad-hoc heuristics to know
      if we're running short on space and when it's time to force a commit.  A
      failure rate of 20-40 file systems when we run hundreds of thousands of
      them isn't super high, but cleaning up this code will make things less
      ugly and more predictible.
      
      Thus the delayed refs rsv.  We always know how many delayed refs we have
      outstanding, and although running them generates more we can use the
      global reserve for that spill over, which fits better into it's desired
      use than a full blown reservation.  This first approach is to simply
      take how many times we're reserving space for and multiply that by 2 in
      order to save enough space for the delayed refs that could be generated.
      This is a niave approach and will probably evolve, but for now it works.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Reviewed-by: David Sterba <dsterba@suse.com> # high-level review
      [ added background notes from the cover letter ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ba2c4d4e
    • D
      btrfs: dev-replace: remove custom read/write blocking scheme · 53176dde
      David Sterba 提交于
      After the rw semaphore has been added, the custom blocking using
      ::blocking_readers and ::read_lock_wq is redundant.
      
      The blocking logic in __btrfs_map_block is replaced by extending the
      time the semaphore is held, that has the same blocking effect on writes
      as the previous custom scheme that waited until ::blocking_readers was
      zero.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      53176dde
    • D
      btrfs: dev-replace: swich locking to rw semaphore · 129827e3
      David Sterba 提交于
      This is the first part of removing the custom locking and waiting scheme
      used for device replace. It was probably copied from extent buffer
      locking, but there's nothing that would require more than is provided by
      the common locking primitives.
      
      The rw spinlock protects waiting tasks counter in case of incompatible
      locks and the waitqueue. Same as rw semaphore.
      
      This patch only switches the locking primitive, for better
      bisectability.  There should be no functional change other than the
      overhead of the locking and potential sleeping instead of spinning when
      the lock is contended.
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      129827e3
    • J
      btrfs: document extent mapping assumptions in checksum · d2e174d5
      Johannes Thumshirn 提交于
      Document why map_private_extent_buffer() cannot return '1' (i.e. the map
      spans two pages) for the csum_tree_block() case.
      
      The current algorithm for detecting a page boundary crossing in
      map_private_extent_buffer() will return a '1' *IFF* the extent buffer's
      offset in the page + the offset passed in by csum_tree_block() and the
      minimal length passed in by csum_tree_block() - 1 are bigger than
      PAGE_SIZE.
      
      We always pass BTRFS_CSUM_SIZE (32) as offset and a minimal length of 32
      and the current extent buffer allocator always guarantees page aligned
      extends, so the above condition can't be true.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d2e174d5
    • N
      btrfs: Remove extent_io_ops::readpage_io_failed_hook · 78e62c02
      Nikolay Borisov 提交于
      For data inodes this hook does nothing but to return -EAGAIN which is
      used to signal to the endio routines that this bio belongs to a data
      inode. If this is the case the actual retrying is handled by
      bio_readpage_error. Alternatively, if this bio belongs to the btree
      inode then btree_io_failed_hook just does some cleanup and doesn't retry
      anything.
      
      This patch simplifies the code flow by eliminating
      readpage_io_failed_hook and instead open-coding btree_io_failed_hook in
      end_bio_extent_readpage. Also eliminate some needless checks since IO is
      always performed on either data inode or btree inode, both of which are
      guaranteed to have their extent_io_tree::ops set.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78e62c02
    • D
      btrfs: merge btrfs_submit_bio_done to its caller · 06ea01b1
      David Sterba 提交于
      There's one caller and its code is simple, we can open code it in
      run_one_async_done. The errors are passed through bio.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      06ea01b1
    • F
      Btrfs: allow clear_extent_dirty() to receive a cached extent state record · 0e6ec385
      Filipe Manana 提交于
      We can have a lot freed extents during the life span of transaction, so
      the red black tree that keeps track of the ranges of each freed extent
      (fs_info->freed_extents[]) can get quite big. When finishing a
      transaction commit we find each range, process it (discard the extents,
      unpin them) and then remove it from the red black tree.
      
      We can use an extent state record as a cache when searching for a range,
      so that when we clean the range we can use the cached extent state we
      passed to the search function instead of iterating the red black tree
      again. Doing things as fast as possible when finishing a transaction (in
      state TRANS_STATE_UNBLOCKED) is convenient as it reduces the time we
      block another task that wants to commit the next transaction.
      
      So change clear_extent_dirty() to allow an optional extent state record to
      be passed as an argument, which will be passed down to __clear_extent_bit.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      0e6ec385
    • N
      btrfs: Add handling for disk split-brain scenario during fsid change · fbc6feae
      Nikolay Borisov 提交于
      Even though fsid change without rewrite is a very quick operation it's
      still possible to experience a split-brain scenario if power loss occurs
      at the most inconvenient time. This patch handles the case where power
      failure occurs while the first transaction (the one setting
      CHANGING_FSID_V2) flag is being persisted on disk. This can cause the
      btrfs_fs_devices of this filesystem to be created by a device which:
      
       a) has the CHANGING_FSID_V2 flag set but its fsid value is intact
      
       b) or a device which doesn't have CHANGING_FSID_V2 flag set and its
          fsid value is intact
      
      This situation is trivially handled by the current find_fsid code since
      in both cases the devices are going to be treated like ordinary devices.
      Since btrfs is always mounted using the superblock of the latest
      device (the one with highest generation number), meaning it will have
      the CHANGING_FSID_V2 flag set, ensure it's being cleared on mount. On
      the first transaction commit following mount all disks will have it
      cleared.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fbc6feae
    • N
      btrfs: Remove fsid/metadata_fsid fields from btrfs_info · de37aa51
      Nikolay Borisov 提交于
      Currently btrfs_fs_info structure contains a copy of the
      fsid/metadata_uuid fields. Same values are also contained in the
      btrfs_fs_devices structure which fs_info has a reference to. Let's
      reduce duplication by removing the fields from fs_info and always refer
      to the ones in fs_devices. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      de37aa51
    • N
      btrfs: Introduce support for FSID change without metadata rewrite · 7239ff4b
      Nikolay Borisov 提交于
      This field is going to be used when the user wants to change the UUID
      of the filesystem without having to rewrite all metadata blocks. This
      field adds another level of indirection such that when the FSID is
      changed what really happens is the current UUID (the one with which the
      fs was created) is copied to the 'metadata_uuid' field in the superblock
      as well as a new incompat flag is set METADATA_UUID. When the kernel
      detects this flag is set it knows that the superblock in fact has 2
      UUIDs:
      
      1. Is the UUID which is user-visible, currently known as FSID.
      2. Metadata UUID - this is the UUID which is stamped into all on-disk
         datastructures belonging to this file system.
      
      When the new incompat flag is present device scanning checks whether
      both fsid/metadata_uuid of the scanned device match any of the
      registered filesystems. When the flag is not set then both UUIDs are
      equal and only the FSID is retained on disk, metadata_uuid is set only
      in-memory during mount.
      
      Additionally a new metadata_uuid field is also added to the fs_info
      struct. It's initialised either with the FSID in case METADATA_UUID
      incompat flag is not set or with the metdata_uuid of the superblock
      otherwise.
      
      This commit introduces the new fields as well as the new incompat flag
      and switches all users of the fsid to the new logic.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ minor updates in comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7239ff4b
    • O
      Btrfs: prevent ioctls from interfering with a swap file · eede2bf3
      Omar Sandoval 提交于
      A later patch will implement swap file support for Btrfs, but before we
      do that, we need to make sure that the various Btrfs ioctls cannot
      change a swap file.
      
      When a swap file is active, we must make sure that the extents of the
      file are not moved and that they don't become shared. That means that
      the following are not safe:
      
      - chattr +c (enable compression)
      - reflink
      - dedupe
      - snapshot
      - defrag
      
      Don't allow those to happen on an active swap file.
      
      Additionally, balance, resize, device remove, and device replace are
      also unsafe if they affect an active swapfile. Add a red-black tree of
      block groups and devices which contain an active swapfile. Relocation
      checks each block group against this tree and skips it or errors out for
      balance or resize, respectively. Device remove and device replace check
      the tree for the device they will operate on.
      
      Note that we don't have to worry about chattr -C (disable nocow), which
      we ignore for non-empty files, because an active swapfile must be
      non-empty and can't be truncated. We also don't have to worry about
      autodefrag because it's only done on COW files. Truncate and fallocate
      are already taken care of by the generic code. Device add doesn't do
      relocation so it's not an issue, either.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      eede2bf3
  3. 13 11月, 2018 1 次提交
    • N
      btrfs: Always try all copies when reading extent buffers · f8397d69
      Nikolay Borisov 提交于
      When a metadata read is served the endio routine btree_readpage_end_io_hook
      is called which eventually runs the tree-checker. If tree-checker fails
      to validate the read eb then it sets EXTENT_BUFFER_CORRUPT flag. This
      leads to btree_read_extent_buffer_pages wrongly assuming that all
      available copies of this extent buffer are wrong and failing prematurely.
      Fix this modify btree_read_extent_buffer_pages to read all copies of
      the data.
      
      This failure was exhibitted in xfstests btrfs/124 which would
      spuriously fail its balance operations. The reason was that when balance
      was run following re-introduction of the missing raid1 disk
      __btrfs_map_block would map the read request to stripe 0, which
      corresponded to devid 2 (the disk which is being removed in the test):
      
          item 2 key (FIRST_CHUNK_TREE CHUNK_ITEM 3553624064) itemoff 15975 itemsize 112
      	length 1073741824 owner 2 stripe_len 65536 type DATA|RAID1
      	io_align 65536 io_width 65536 sector_size 4096
      	num_stripes 2 sub_stripes 1
      		stripe 0 devid 2 offset 2156920832
      		dev_uuid 8466c350-ed0c-4c3b-b17d-6379b445d5c8
      		stripe 1 devid 1 offset 3553624064
      		dev_uuid 1265d8db-5596-477e-af03-df08eb38d2ca
      
      This caused read requests for a checksum item that to be routed to the
      stale disk which triggered the aforementioned logic involving
      EXTENT_BUFFER_CORRUPT flag. This then triggered cascading failures of
      the balance operation.
      
      Fixes: a826d6dc ("Btrfs: check items for correctness as we search")
      CC: stable@vger.kernel.org # 4.4+
      Suggested-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      f8397d69
  4. 08 11月, 2018 1 次提交
    • O
      Btrfs: fix missing delayed iputs on unmount · d6fd0ae2
      Omar Sandoval 提交于
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9b ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d6fd0ae2
  5. 06 11月, 2018 1 次提交
    • L
      btrfs: fix pinned underflow after transaction aborted · fcd5e742
      Lu Fengqi 提交于
      When running generic/475, we may get the following warning in dmesg:
      
      [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
      [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
      [ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
      [ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
      [ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
      [ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
      [ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
      [ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
      [ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
      [ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 6902.137836] Call Trace:
      [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
      [ 6902.140181]  ? kthread_stop+0x146/0x1f0
      [ 6902.141277]  generic_shutdown_super+0x6c/0x100
      [ 6902.142517]  kill_anon_super+0x14/0x30
      [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
      [ 6902.144790]  deactivate_locked_super+0x2f/0x70
      [ 6902.146014]  cleanup_mnt+0x3b/0x70
      [ 6902.147020]  task_work_run+0x9e/0xd0
      [ 6902.148036]  do_syscall_64+0x470/0x600
      [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 6902.151640] RIP: 0033:0x7f45077a6a7b
      [ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
      [ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
      [ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
      [ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
      [ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
      [ 6902.167553] irq event stamp: 0
      [ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 6902.176407] ---[ end trace 463138c2986b275c ]---
      [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
      [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
      
      In the above line there's "pinned=18446744073708158976" which is an
      unsigned u64 value of -1392640, an obvious underflow.
      
      When transaction_kthread is running cleanup_transaction(), another
      fsstress is running btrfs_commit_transaction(). The
      btrfs_finish_extent_commit() may get the same range as
      btrfs_destroy_pinned_extent() got, which causes the pinned underflow.
      
      Fixes: d4b450cd ("Btrfs: fix race between transaction commit and empty block group removal")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fcd5e742
  6. 15 10月, 2018 6 次提交
    • J
      btrfs: assert on non-empty delayed iputs · e187831e
      Josef Bacik 提交于
      I ran into an issue where there was some reference being held on an
      inode that I couldn't track.  This assert wasn't triggered, but it at
      least rules out we're doing something stupid.
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e187831e
    • D
      btrfs: dev-replace: move replace members out of fs_info · 7f8d236a
      David Sterba 提交于
      The replace_wait and bio_counter were mistakenly added to fs_info in
      commit c404e0dc ("Btrfs: fix use-after-free in the finishing
      procedure of the device replace"), but they logically belong to
      fs_info::dev_replace. Besides, bio_counter is a very generic name and is
      confusing in bare fs_info context.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7f8d236a
    • D
      btrfs: remove btrfs_dev_replace::read_locks · 3280f874
      David Sterba 提交于
      This member seems to be copied from the extent_buffer locking scheme and
      is at least used to assert that the read lock/unlock is properly nested.
      In some way. While the _inc/_dec are called inside the read lock
      section, the asserts are both inside and outside, so the ordering is not
      guaranteed and we can see read/inc/dec ordered in any way
      (theoretically).
      
      A missing call of btrfs_dev_replace_clear_lock_blocking could cause
      unexpected read_locks count, so this at least looks like a valid
      assertion, but this will become unnecessary with later updates.
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      3280f874
    • L
      Btrfs: delayed-refs: use rb_first_cached for ref_tree · e3d03965
      Liu Bo 提交于
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href->ref_tree need to get the first entry, this
      converts href->ref_tree to use rb_first_cached().
      
      For more details about the optimization see patch "Btrfs: delayed-refs:
      use rb_first_cached for href_root".
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e3d03965
    • L
      Btrfs: delayed-refs: use rb_first_cached for href_root · 5c9d028b
      Liu Bo 提交于
      rb_first_cached() trades an extra pointer "leftmost" for doing the same
      job as rb_first() but in O(1).
      
      Functions manipulating href_root need to get the first entry, this
      converts href_root to use rb_first_cached().
      
      This patch is first in the sequenct of similar updates to other rbtrees
      and this is analysis of the expected behaviour and improvements.
      
      There's a common pattern:
      
      while (node = rb_first) {
              entry = rb_entry(node)
              next = rb_next(node)
              rb_erase(node)
              cleanup(entry)
      }
      
      rb_first needs to traverse the tree up to logN depth, rb_erase can
      completely reshuffle the tree. With the caching we'll skip the traversal
      in rb_first.  That's a cached memory access vs looped pointer
      dereference trade-off that IMHO has a clear winner.
      
      Measurements show there's not much difference in a sample tree with
      10000 nodes: 4.5s / rb_first and 4.8s / rb_first_cached. Real effects of
      caching and pointer chasing are unpredictable though.
      
      Further optimzations can be done to avoid the expensive rb_erase step.
      In some cases it's ok to process the nodes in any order, so the tree can
      be traversed in post-order, not rebalancing the children nodes and just
      calling free. Care must be taken regarding the next node.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update changelog from mail discussions ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      5c9d028b
    • M
      btrfs: Remove 'objectid' member from struct btrfs_root · 4fd786e6
      Misono Tomohiro 提交于
      There are two members in struct btrfs_root which indicate root's
      objectid: objectid and root_key.objectid.
      
      They are both set to the same value in __setup_root():
      
        static void __setup_root(struct btrfs_root *root,
                                 struct btrfs_fs_info *fs_info,
                                 u64 objectid)
        {
          ...
          root->objectid = objectid;
          ...
          root->root_key.objectid = objecitd;
          ...
        }
      
      and not changed to other value after initialization.
      
      grep in btrfs directory shows both are used in many places:
        $ grep -rI "root->root_key.objectid" | wc -l
        133
        $ grep -rI "root->objectid" | wc -l
        55
       (4.17, inc. some noise)
      
      It is confusing to have two similar variable names and it seems
      that there is no rule about which should be used in a certain case.
      
      Since ->root_key itself is needed for tree reloc tree, let's remove
      'objecitd' member and unify code to use ->root_key.objectid in all places.
      Signed-off-by: NMisono Tomohiro <misono.tomohiro@jp.fujitsu.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4fd786e6
  7. 18 8月, 2018 1 次提交
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  8. 06 8月, 2018 15 次提交