1. 27 12月, 2019 1 次提交
    • L
      btrfs: skip file_extent generation check for free_space_inode in run_delalloc_nocow · e967a98f
      Lu Fengqi 提交于
      commit 27a7ff55 upstream.
      
      The test case btrfs/001 with inode_cache mount option will encounter the
      following warning:
      
        WARNING: CPU: 1 PID: 23700 at fs/btrfs/inode.c:956 cow_file_range.isra.19+0x32b/0x430 [btrfs]
        CPU: 1 PID: 23700 Comm: btrfs Kdump: loaded Tainted: G        W  O      4.20.0-rc4-custom+ #30
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
        RIP: 0010:cow_file_range.isra.19+0x32b/0x430 [btrfs]
        Call Trace:
         ? free_extent_buffer+0x46/0x90 [btrfs]
         run_delalloc_nocow+0x455/0x900 [btrfs]
         btrfs_run_delalloc_range+0x1a7/0x360 [btrfs]
         writepage_delalloc+0xf9/0x150 [btrfs]
         __extent_writepage+0x125/0x3e0 [btrfs]
         extent_write_cache_pages+0x1b6/0x3e0 [btrfs]
         ? __wake_up_common_lock+0x63/0xc0
         extent_writepages+0x50/0x80 [btrfs]
         do_writepages+0x41/0xd0
         ? __filemap_fdatawrite_range+0x9e/0xf0
         __filemap_fdatawrite_range+0xbe/0xf0
         btrfs_fdatawrite_range+0x1b/0x50 [btrfs]
         __btrfs_write_out_cache+0x42c/0x480 [btrfs]
         btrfs_write_out_ino_cache+0x84/0xd0 [btrfs]
         btrfs_save_ino_cache+0x551/0x660 [btrfs]
         commit_fs_roots+0xc5/0x190 [btrfs]
         btrfs_commit_transaction+0x2bf/0x8d0 [btrfs]
         btrfs_mksubvol+0x48d/0x4d0 [btrfs]
         btrfs_ioctl_snap_create_transid+0x170/0x180 [btrfs]
         btrfs_ioctl_snap_create_v2+0x124/0x180 [btrfs]
         btrfs_ioctl+0x123f/0x3030 [btrfs]
      
      The file extent generation of the free space inode is equal to the last
      snapshot of the file root, so the inode will be passed to cow_file_rage.
      But the inode was created and its extents were preallocated in
      btrfs_save_ino_cache, there are no cow copies on disk.
      
      The preallocated extent is not yet in the extent tree, and
      btrfs_cross_ref_exist will ignore the -ENOENT returned by
      check_committed_ref, so we can directly write the inode to the disk.
      
      Fixes: 78d4295b ("btrfs: lift some btrfs_cross_ref_exist checks in nocow path")
      CC: stable@vger.kernel.org # 4.18+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      e967a98f
  2. 21 11月, 2018 1 次提交
    • R
      Btrfs: fix cur_offset in the error case for nocow · db39065c
      Robbie Ko 提交于
      commit 506481b20e818db40b6198815904ecd2d6daee64 upstream.
      
      When the cow_file_range fails, the related resources are unlocked
      according to the range [start..end), so the unlock cannot be repeated in
      run_delalloc_nocow.
      
      In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
      not updated correctly, so move the cur_offset update before
      cow_file_range.
      
        kernel BUG at mm/page-writeback.c:2663!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
        Hardware name: Realtek_RTD1296 (DT)
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
        PC is at clear_page_dirty_for_io+0x1bc/0x1e8
        LR is at clear_page_dirty_for_io+0x14/0x1e8
        pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
        sp : ffffffc02e9af4f0
        Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
        Call trace:
        [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
        [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
        [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
        [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
        [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
        [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
        [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
        [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
        [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
        [<ffffffc00033d758>] do_writepages+0x30/0x88
        [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
        [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
        [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
        [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
        [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
        [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
        [<ffffffc0002b02f0>] worker_thread+0x130/0x500
        [<ffffffc0002b6344>] kthread+0x10c/0x110
        [<ffffffc000284590>] ret_from_fork+0x10/0x40
        Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      db39065c
  3. 14 11月, 2018 3 次提交
  4. 23 8月, 2018 1 次提交
    • F
      Btrfs: sync log after logging new name · d4682ba0
      Filipe Manana 提交于
      When we add a new name for an inode which was logged in the current
      transaction, we update the inode in the log so that its new name and
      ancestors are added to the log. However when we do this we do not persist
      the log, so the changes remain in memory only, and as a consequence, any
      ancestors that were created in the current transaction are updated such
      that future calls to btrfs_inode_in_log() return true. This leads to a
      subsequent fsync against such new ancestor directories returning
      immediately, without persisting the log, therefore after a power failure
      the new ancestor directories do not exist, despite fsync being called
      against them explicitly.
      
      Example:
      
        $ mkfs.btrfs -f /dev/sdb
        $ mount /dev/sdb /mnt
      
        $ mkdir /mnt/A
        $ mkdir /mnt/B
        $ mkdir /mnt/A/C
        $ touch /mnt/B/foo
        $ xfs_io -c "fsync" /mnt/B/foo
        $ ln /mnt/B/foo /mnt/A/C/foo
        $ xfs_io -c "fsync" /mnt/A
        <power failure>
      
      After the power failure, directory "A" does not exist, despite the explicit
      fsync on it.
      
      Instead of fixing this by changing the behaviour of the explicit fsync on
      directory "A" to persist the log instead of doing nothing, make the logging
      of the new file name (which happens when creating a hard link or renaming)
      persist the log. This approach not only is simpler, not requiring addition
      of new fields to the inode in memory structure, but also gives us the same
      behaviour as ext4, xfs and f2fs (possibly other filesystems too).
      
      A test case for fstests follows soon.
      
      Fixes: 12fcfd22 ("Btrfs: tree logging unlink/rename fixes")
      Reported-by: NVijay Chidambaram <vvijay03@gmail.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d4682ba0
  5. 18 8月, 2018 1 次提交
    • R
      Btrfs: fix unexpected failure of nocow buffered writes after snapshotting when low on space · 8ecebf4d
      Robbie Ko 提交于
      Commit e9894fd3 ("Btrfs: fix snapshot vs nocow writting") forced
      nocow writes to fallback to COW, during writeback, when a snapshot is
      created. This resulted in writes made before creating the snapshot to
      unexpectedly fail with ENOSPC during writeback when success (0) was
      returned to user space through the write system call.
      
      The steps leading to this problem are:
      
      1. When it's not possible to allocate data space for a write, the
         buffered write path checks if a NOCOW write is possible.  If it is,
         it will not reserve space and success (0) is returned to user space.
      
      2. Then when a snapshot is created, the root's will_be_snapshotted
         atomic is incremented and writeback is triggered for all inode's that
         belong to the root being snapshotted. Incrementing that atomic forces
         all previous writes to fallback to COW during writeback (running
         delalloc).
      
      3. This results in the writeback for the inodes to fail and therefore
         setting the ENOSPC error in their mappings, so that a subsequent
         fsync on them will report the error to user space. So it's not a
         completely silent data loss (since fsync will report ENOSPC) but it's
         a very unexpected and undesirable behaviour, because if a clean
         shutdown/unmount of the filesystem happens without previous calls to
         fsync, it is expected to have the data present in the files after
         mounting the filesystem again.
      
      So fix this by adding a new atomic named snapshot_force_cow to the
      root structure which prevents this behaviour and works the following way:
      
      1. It is incremented when we start to create a snapshot after triggering
         writeback and before waiting for writeback to finish.
      
      2. This new atomic is now what is used by writeback (running delalloc)
         to decide whether we need to fallback to COW or not. Because we
         incremented this new atomic after triggering writeback in the
         snapshot creation ioctl, we ensure that all buffered writes that
         happened before snapshot creation will succeed and not fallback to
         COW (which would make them fail with ENOSPC).
      
      3. The existing atomic, will_be_snapshotted, is kept because it is used
         to force new buffered writes, that start after we started
         snapshotting, to reserve data space even when NOCOW is possible.
         This makes these writes fail early with ENOSPC when there's no
         available space to allocate, preventing the unexpected behaviour of
         writeback later failing with ENOSPC due to a fallback to COW mode.
      
      Fixes: e9894fd3 ("Btrfs: fix snapshot vs nocow writting")
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      8ecebf4d
  6. 06 8月, 2018 18 次提交
  7. 04 8月, 2018 1 次提交
  8. 28 6月, 2018 1 次提交
    • C
      Btrfs: fix regression in btrfs_page_mkwrite() from vm_fault_t conversion · 717beb96
      Chris Mason 提交于
      The vm_fault_t conversion commit introduced a ret2 variable for tracking
      the integer return values from internal btrfs functions.  It was
      sometimes returning VM_FAULT_LOCKED for pages that were actually invalid
      and had been removed from the radix.  Something like this:
      
          ret2 = btrfs_delalloc_reserve_space() // returns zero on success
      
          lock_page(page)
          if (page->mapping != inode->i_mapping)
      	goto out_unlock;
      
      ...
      
      out_unlock:
          if (!ret2) {
      	    ...
      	    return VM_FAULT_LOCKED;
          }
      
      This ends up triggering this WARNING in btrfs_destroy_inode()
          WARN_ON(BTRFS_I(inode)->block_rsv.size);
      
      xfstests generic/095 was able to reliably reproduce the errors.
      
      Since out_unlock: is only used for errors, this fix moves it below the
      if (!ret2) check we use to return VM_FAULT_LOCKED for success.
      
      Fixes: a528a241 (btrfs: change return type of btrfs_page_mkwrite to vm_fault_t)
      Signed-off-by: NChris Mason <clm@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      717beb96
  9. 22 6月, 2018 1 次提交
    • F
      Btrfs: fix return value on rename exchange failure · c5b4a50b
      Filipe Manana 提交于
      If we failed during a rename exchange operation after starting/joining a
      transaction, we would end up replacing the return value, stored in the
      local 'ret' variable, with the return value from btrfs_end_transaction().
      So this could end up returning 0 (success) to user space despite the
      operation having failed and aborted the transaction, because if there are
      multiple tasks having a reference on the transaction at the time
      btrfs_end_transaction() is called by the rename exchange, that function
      returns 0 (otherwise it returns -EIO and not the original error value).
      So fix this by not overwriting the return value on error after getting
      a transaction handle.
      
      Fixes: cdd1fedf ("btrfs: add support for RENAME_EXCHANGE and RENAME_WHITEOUT")
      CC: stable@vger.kernel.org # 4.9+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5b4a50b
  10. 07 6月, 2018 1 次提交
  11. 06 6月, 2018 1 次提交
    • D
      vfs: change inode times to use struct timespec64 · 95582b00
      Deepa Dinamani 提交于
      struct timespec is not y2038 safe. Transition vfs to use
      y2038 safe struct timespec64 instead.
      
      The change was made with the help of the following cocinelle
      script. This catches about 80% of the changes.
      All the header file and logic changes are included in the
      first 5 rules. The rest are trivial substitutions.
      I avoid changing any of the function signatures or any other
      filesystem specific data structures to keep the patch simple
      for review.
      
      The script can be a little shorter by combining different cases.
      But, this version was sufficient for my usecase.
      
      virtual patch
      
      @ depends on patch @
      identifier now;
      @@
      - struct timespec
      + struct timespec64
        current_time ( ... )
        {
      - struct timespec now = current_kernel_time();
      + struct timespec64 now = current_kernel_time64();
        ...
      - return timespec_trunc(
      + return timespec64_trunc(
        ... );
        }
      
      @ depends on patch @
      identifier xtime;
      @@
       struct \( iattr \| inode \| kstat \) {
       ...
      -       struct timespec xtime;
      +       struct timespec64 xtime;
       ...
       }
      
      @ depends on patch @
      identifier t;
      @@
       struct inode_operations {
       ...
      int (*update_time) (...,
      -       struct timespec t,
      +       struct timespec64 t,
      ...);
       ...
       }
      
      @ depends on patch @
      identifier t;
      identifier fn_update_time =~ "update_time$";
      @@
       fn_update_time (...,
      - struct timespec *t,
      + struct timespec64 *t,
       ...) { ... }
      
      @ depends on patch @
      identifier t;
      @@
      lease_get_mtime( ... ,
      - struct timespec *t
      + struct timespec64 *t
        ) { ... }
      
      @te depends on patch forall@
      identifier ts;
      local idexpression struct inode *inode_node;
      identifier i_xtime =~ "^i_[acm]time$";
      identifier ia_xtime =~ "^ia_[acm]time$";
      identifier fn_update_time =~ "update_time$";
      identifier fn;
      expression e, E3;
      local idexpression struct inode *node1;
      local idexpression struct inode *node2;
      local idexpression struct iattr *attr1;
      local idexpression struct iattr *attr2;
      local idexpression struct iattr attr;
      identifier i_xtime1 =~ "^i_[acm]time$";
      identifier i_xtime2 =~ "^i_[acm]time$";
      identifier ia_xtime1 =~ "^ia_[acm]time$";
      identifier ia_xtime2 =~ "^ia_[acm]time$";
      @@
      (
      (
      - struct timespec ts;
      + struct timespec64 ts;
      |
      - struct timespec ts = current_time(inode_node);
      + struct timespec64 ts = current_time(inode_node);
      )
      
      <+... when != ts
      (
      - timespec_equal(&inode_node->i_xtime, &ts)
      + timespec64_equal(&inode_node->i_xtime, &ts)
      |
      - timespec_equal(&ts, &inode_node->i_xtime)
      + timespec64_equal(&ts, &inode_node->i_xtime)
      |
      - timespec_compare(&inode_node->i_xtime, &ts)
      + timespec64_compare(&inode_node->i_xtime, &ts)
      |
      - timespec_compare(&ts, &inode_node->i_xtime)
      + timespec64_compare(&ts, &inode_node->i_xtime)
      |
      ts = current_time(e)
      |
      fn_update_time(..., &ts,...)
      |
      inode_node->i_xtime = ts
      |
      node1->i_xtime = ts
      |
      ts = inode_node->i_xtime
      |
      <+... attr1->ia_xtime ...+> = ts
      |
      ts = attr1->ia_xtime
      |
      ts.tv_sec
      |
      ts.tv_nsec
      |
      btrfs_set_stack_timespec_sec(..., ts.tv_sec)
      |
      btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
      |
      - ts = timespec64_to_timespec(
      + ts =
      ...
      -)
      |
      - ts = ktime_to_timespec(
      + ts = ktime_to_timespec64(
      ...)
      |
      - ts = E3
      + ts = timespec_to_timespec64(E3)
      |
      - ktime_get_real_ts(&ts)
      + ktime_get_real_ts64(&ts)
      |
      fn(...,
      - ts
      + timespec64_to_timespec(ts)
      ,...)
      )
      ...+>
      (
      <... when != ts
      - return ts;
      + return timespec64_to_timespec(ts);
      ...>
      )
      |
      - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
      + timespec64_equal(&node1->i_xtime2, &node2->i_xtime2)
      |
      - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
      + timespec64_equal(&node1->i_xtime2, &attr2->ia_xtime2)
      |
      - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
      + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
      |
      node1->i_xtime1 =
      - timespec_trunc(attr1->ia_xtime1,
      + timespec64_trunc(attr1->ia_xtime1,
      ...)
      |
      - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
      + attr1->ia_xtime1 =  timespec64_trunc(attr2->ia_xtime2,
      ...)
      |
      - ktime_get_real_ts(&attr1->ia_xtime1)
      + ktime_get_real_ts64(&attr1->ia_xtime1)
      |
      - ktime_get_real_ts(&attr.ia_xtime1)
      + ktime_get_real_ts64(&attr.ia_xtime1)
      )
      
      @ depends on patch @
      struct inode *node;
      struct iattr *attr;
      identifier fn;
      identifier i_xtime =~ "^i_[acm]time$";
      identifier ia_xtime =~ "^ia_[acm]time$";
      expression e;
      @@
      (
      - fn(node->i_xtime);
      + fn(timespec64_to_timespec(node->i_xtime));
      |
       fn(...,
      - node->i_xtime);
      + timespec64_to_timespec(node->i_xtime));
      |
      - e = fn(attr->ia_xtime);
      + e = fn(timespec64_to_timespec(attr->ia_xtime));
      )
      
      @ depends on patch forall @
      struct inode *node;
      struct iattr *attr;
      identifier i_xtime =~ "^i_[acm]time$";
      identifier ia_xtime =~ "^ia_[acm]time$";
      identifier fn;
      @@
      {
      + struct timespec ts;
      <+...
      (
      + ts = timespec64_to_timespec(node->i_xtime);
      fn (...,
      - &node->i_xtime,
      + &ts,
      ...);
      |
      + ts = timespec64_to_timespec(attr->ia_xtime);
      fn (...,
      - &attr->ia_xtime,
      + &ts,
      ...);
      )
      ...+>
      }
      
      @ depends on patch forall @
      struct inode *node;
      struct iattr *attr;
      struct kstat *stat;
      identifier ia_xtime =~ "^ia_[acm]time$";
      identifier i_xtime =~ "^i_[acm]time$";
      identifier xtime =~ "^[acm]time$";
      identifier fn, ret;
      @@
      {
      + struct timespec ts;
      <+...
      (
      + ts = timespec64_to_timespec(node->i_xtime);
      ret = fn (...,
      - &node->i_xtime,
      + &ts,
      ...);
      |
      + ts = timespec64_to_timespec(node->i_xtime);
      ret = fn (...,
      - &node->i_xtime);
      + &ts);
      |
      + ts = timespec64_to_timespec(attr->ia_xtime);
      ret = fn (...,
      - &attr->ia_xtime,
      + &ts,
      ...);
      |
      + ts = timespec64_to_timespec(attr->ia_xtime);
      ret = fn (...,
      - &attr->ia_xtime);
      + &ts);
      |
      + ts = timespec64_to_timespec(stat->xtime);
      ret = fn (...,
      - &stat->xtime);
      + &ts);
      )
      ...+>
      }
      
      @ depends on patch @
      struct inode *node;
      struct inode *node2;
      identifier i_xtime1 =~ "^i_[acm]time$";
      identifier i_xtime2 =~ "^i_[acm]time$";
      identifier i_xtime3 =~ "^i_[acm]time$";
      struct iattr *attrp;
      struct iattr *attrp2;
      struct iattr attr ;
      identifier ia_xtime1 =~ "^ia_[acm]time$";
      identifier ia_xtime2 =~ "^ia_[acm]time$";
      struct kstat *stat;
      struct kstat stat1;
      struct timespec64 ts;
      identifier xtime =~ "^[acmb]time$";
      expression e;
      @@
      (
      ( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1  ;
      |
       node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
      |
       node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
      |
       node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
      |
       stat->xtime = node2->i_xtime1;
      |
       stat1.xtime = node2->i_xtime1;
      |
      ( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1  ;
      |
      ( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
      |
      - e = node->i_xtime1;
      + e = timespec64_to_timespec( node->i_xtime1 );
      |
      - e = attrp->ia_xtime1;
      + e = timespec64_to_timespec( attrp->ia_xtime1 );
      |
      node->i_xtime1 = current_time(...);
      |
       node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
      - e;
      + timespec_to_timespec64(e);
      |
       node->i_xtime1 = node->i_xtime3 =
      - e;
      + timespec_to_timespec64(e);
      |
      - node->i_xtime1 = e;
      + node->i_xtime1 = timespec_to_timespec64(e);
      )
      Signed-off-by: NDeepa Dinamani <deepa.kernel@gmail.com>
      Cc: <anton@tuxera.com>
      Cc: <balbi@kernel.org>
      Cc: <bfields@fieldses.org>
      Cc: <darrick.wong@oracle.com>
      Cc: <dhowells@redhat.com>
      Cc: <dsterba@suse.com>
      Cc: <dwmw2@infradead.org>
      Cc: <hch@lst.de>
      Cc: <hirofumi@mail.parknet.co.jp>
      Cc: <hubcap@omnibond.com>
      Cc: <jack@suse.com>
      Cc: <jaegeuk@kernel.org>
      Cc: <jaharkes@cs.cmu.edu>
      Cc: <jslaby@suse.com>
      Cc: <keescook@chromium.org>
      Cc: <mark@fasheh.com>
      Cc: <miklos@szeredi.hu>
      Cc: <nico@linaro.org>
      Cc: <reiserfs-devel@vger.kernel.org>
      Cc: <richard@nod.at>
      Cc: <sage@redhat.com>
      Cc: <sfrench@samba.org>
      Cc: <swhiteho@redhat.com>
      Cc: <tj@kernel.org>
      Cc: <trond.myklebust@primarydata.com>
      Cc: <tytso@mit.edu>
      Cc: <viro@zeniv.linux.org.uk>
      95582b00
  12. 31 5月, 2018 3 次提交
    • O
      Btrfs: clean up error handling in btrfs_truncate() · ad7e1a74
      Omar Sandoval 提交于
      btrfs_truncate() uses two variables for error handling, ret and err (if
      this sounds familiar, it's because btrfs_truncate_inode_items() did
      something similar). This is error prone, as was made evident by "Btrfs:
      fix error handling in btrfs_truncate()". We only have err because we
      don't want to mask an error if we call btrfs_update_inode() and
      btrfs_end_transaction(), so let's make that its own scoped return
      variable and use ret everywhere else.
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ad7e1a74
    • N
      btrfs: Factor out write portion of btrfs_get_blocks_direct · c5794e51
      Nikolay Borisov 提交于
      Now that the read side is extracted into its own function, do the same
      to the write side. This leaves btrfs_get_blocks_direct_write with the
      sole purpose of handling common locking required. Also flip the
      condition in btrfs_get_blocks_direct_write so that the write case
      comes first and we check for if (Create) rather than if (!create). This
      is purely subjective but I believe makes reading a bit more "linear".
      No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c5794e51
    • N
      btrfs: Factor out read portion of btrfs_get_blocks_direct · 1c8d0175
      Nikolay Borisov 提交于
      Currently this function handles both the READ and WRITE dio cases. This
      is facilitated by a bunch of 'if' statements, a goto short-circuit
      statement and a very perverse aliasing of "!created"(READ) case
      by setting lockstart = lockend and checking for lockstart < lockend for
      detecting the write. Let's simplify this mess by extracting the
      READ-only code into a separate __btrfs_get_block_direct_read function.
      This is only the first step, the next one will be to factor out the
      write side as well. The end goal will be to have the common locking/
      unlocking code in btrfs_get_blocks_direct and then it will call either
      the read|write subvariants. No functional changes.
      Signed-off-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      1c8d0175
  13. 30 5月, 2018 5 次提交
    • S
      btrfs: return error value if create_io_em failed in cow_file_range · 090a127a
      Su Yue 提交于
      In cow_file_range(), create_io_em() may fail, but its return value is
      not recorded.  Then return value may be 0 even it failed which is a
      wrong behavior.
      
      Let cow_file_range() return PTR_ERR(em) if create_io_em() failed.
      
      Fixes: 6f9994db ("Btrfs: create a helper to create em for IO")
      CC: stable@vger.kernel.org # 4.11+
      Signed-off-by: NSu Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      090a127a
    • G
      btrfs: drop unused parameter qgroup_reserved · c4c129db
      Gu JinXiang 提交于
      Since commit 7775c818 ("btrfs: remove unused parameter from
      btrfs_subvolume_release_metadata") parameter qgroup_reserved is not used
      by caller of function btrfs_subvolume_reserve_metadata.  So remove it.
      Signed-off-by: NGu JinXiang <gujx@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      c4c129db
    • E
      btrfs: balance dirty metadata pages in btrfs_finish_ordered_io · e73e81b6
      Ethan Lien 提交于
      [Problem description and how we fix it]
      We should balance dirty metadata pages at the end of
      btrfs_finish_ordered_io, since a small, unmergeable random write can
      potentially produce dirty metadata which is multiple times larger than
      the data itself. For example, a small, unmergeable 4KiB write may
      produce:
      
          16KiB dirty leaf (and possibly 16KiB dirty node) in subvolume tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in checksum tree
          16KiB dirty leaf (and possibly 16KiB dirty node) in extent tree
      
      Although we do call balance dirty pages in write side, but in the
      buffered write path, most metadata are dirtied only after we reach the
      dirty background limit (which by far only counts dirty data pages) and
      wakeup the flusher thread. If there are many small, unmergeable random
      writes spread in a large btree, we'll find a burst of dirty pages
      exceeds the dirty_bytes limit after we wakeup the flusher thread - which
      is not what we expect. In our machine, it caused out-of-memory problem
      since a page cannot be dropped if it is marked dirty.
      
      Someone may worry about we may sleep in btrfs_btree_balance_dirty_nodelay,
      but since we do btrfs_finish_ordered_io in a separate worker, it will not
      stop the flusher consuming dirty pages. Also, we use different worker for
      metadata writeback endio, sleep in btrfs_finish_ordered_io help us throttle
      the size of dirty metadata pages.
      
      [Reproduce steps]
      To reproduce the problem, we need to do 4KiB write randomly spread in a
      large btree. In our 2GiB RAM machine:
      
      1) Create 4 subvolumes.
      2) Run fio on each subvolume:
      
         [global]
         direct=0
         rw=randwrite
         ioengine=libaio
         bs=4k
         iodepth=16
         numjobs=1
         group_reporting
         size=128G
         runtime=1800
         norandommap
         time_based
         randrepeat=0
      
      3) Take snapshot on each subvolume and repeat fio on existing files.
      4) Repeat step (3) until we get large btrees.
         In our case, by observing btrfs_root_item->bytes_used, we have 2GiB of
         metadata in each subvolume tree and 12GiB of metadata in extent tree.
      5) Stop all fio, take snapshot again, and wait until all delayed work is
         completed.
      6) Start all fio. Few seconds later we hit OOM when the flusher starts
         to work.
      
      It can be reproduced even when using nocow write.
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add comment ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      e73e81b6
    • E
      btrfs: lift some btrfs_cross_ref_exist checks in nocow path · 78d4295b
      Ethan Lien 提交于
      In nocow path, we check if the extent is snapshotted in
      btrfs_cross_ref_exist(). We can do the similar check earlier and avoid
      unnecessary search into extent tree.
      
      A fio test on a Intel D-1531, 16GB RAM, SSD RAID-5 machine as follows:
      
      [global]
      group_reporting
      time_based
      thread=1
      ioengine=libaio
      bs=4k
      iodepth=32
      size=64G
      runtime=180
      numjobs=8
      rw=randwrite
      
      [file1]
      filename=/mnt/nocow/testfile
      
      IOPS result:   unpatched     patched
      
      1 fio round:     46670        46958
      snapshot
      2 fio round:     51826        54498
      3 fio round:     59767        61289
      
      After snapshot, the first fio get about 5% performance gain. As we
      continually write to the same file, all writes will resume to nocow mode
      and eventually we have no performance gain.
      Signed-off-by: NEthan Lien <ethanlien@synology.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ update comments ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      78d4295b
    • L
      btrfs: Remove fs_info argument from btrfs_uuid_tree_rem · d1957791
      Lu Fengqi 提交于
      This function always takes a transaction handle which contains a
      reference to the fs_info. Use that and remove the extra argument.
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      [ rename the function ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d1957791
  14. 29 5月, 2018 2 次提交