1. 29 May, 2018 4 commits
  2. 24 May, 2018 1 commit
    • O
      Btrfs: fix error handling in btrfs_truncate() · d5014738
      Omar Sandoval committed
      Jun Wu at Facebook reported that an internal service was seeing a return
      value of 1 from ftruncate() on Btrfs in some cases. This is coming from
      the NEED_TRUNCATE_BLOCK return value from btrfs_truncate_inode_items().
      
      btrfs_truncate() uses two variables for error handling, ret and err.
      When btrfs_truncate_inode_items() returns non-zero, we set err to the
      return value. However, NEED_TRUNCATE_BLOCK is not an error. Make sure we
      only set err if ret is an error (i.e., negative).
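
A minimal sketch of the rule described above, not the kernel code itself. The helper name record_error() is hypothetical, and NEED_TRUNCATE_BLOCK is defined here as 1 only because that is the value the report observed:

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the kernel's positive "not an error" status. The value 1
 * matches the return value reported above; the real definition lives in
 * the btrfs sources. */
#define NEED_TRUNCATE_BLOCK 1

/* Hypothetical helper: fold the result of one step (ret) into the error
 * that will eventually be returned to the caller (err). Only negative
 * values are errors; zero and positive status codes leave err untouched. */
int record_error(int err, int ret)
{
	assert(err <= 0);	/* err holds 0 or a negative errno */
	if (ret < 0)
		return ret;
	return err;
}
```

With this rule, record_error(0, NEED_TRUNCATE_BLOCK) stays 0, so a positive status can no longer leak out as the return value of ftruncate().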
      
      To reproduce the issue: mount a filesystem with -o compress-force=zstd
      and run the following program, which will see a return value of 1 from
      ftruncate():
      
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <unistd.h>

      int main(void) {
              char buf[256] = { 0 };
              int ret;
              int fd;
      
              fd = open("test", O_CREAT | O_WRONLY | O_TRUNC, 0666);
              if (fd == -1) {
                      perror("open");
                      return EXIT_FAILURE;
              }
      
              if (write(fd, buf, sizeof(buf)) != sizeof(buf)) {
                      perror("write");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              if (fsync(fd) == -1) {
                      perror("fsync");
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              ret = ftruncate(fd, 128);
              if (ret) {
                      printf("ftruncate() returned %d\n", ret);
                      close(fd);
                      return EXIT_FAILURE;
              }
      
              close(fd);
              return EXIT_SUCCESS;
      }
      
      Fixes: ddfae63c ("btrfs: move btrfs_truncate_block out of trans handle")
      CC: stable@vger.kernel.org # 4.15+
      Reported-by: Jun Wu <quark@fb.com>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  3. 17 May, 2018 1 commit
  4. 12 May, 2018 1 commit
    • A
      do d_instantiate/unlock_new_inode combinations safely · 1e2e547a
      Al Viro committed
      For anything NFS-exported we do _not_ want to unlock new inode
      before it has grown an alias; original set of fixes got the
      ordering right, but missed the nasty complication in case of
      lockdep being enabled - unlock_new_inode() does
      	lockdep_annotate_inode_mutex_key(inode)
      which can only be done before anyone gets a chance to touch
      ->i_mutex.  Unfortunately, flipping the order and doing
      unlock_new_inode() before d_instantiate() opens a window when
      mkdir can race with open-by-fhandle on a guessed fhandle, leading
      to multiple aliases for a directory inode and all the breakage
      that follows from that.
      
      	Correct solution: a new primitive (d_instantiate_new())
      combining these two in the right order - lockdep annotate, then
      d_instantiate(), then the rest of unlock_new_inode().  All
      combinations of d_instantiate() with unlock_new_inode() should
      be converted to that.
      
      Cc: stable@kernel.org	# 2.6.29 and later
      Tested-by: Mike Marshall <hubcap@omnibond.com>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  5. 19 April, 2018 1 commit
  6. 12 April, 2018 1 commit
  7. 31 March, 2018 9 commits
  8. 26 March, 2018 11 commits
    • S
      btrfs: adjust return values of btrfs_inode_by_name · 005d6712
      Su Yue committed
      Previously, btrfs_inode_by_name() returned 0, which left the caller to
      check the objectid of the location even if the location type was invalid.
      
      Let btrfs_inode_by_name() return -EUCLEAN if a corrupted location of a
      dir entry is found.  Removal of label out_err also simplifies the
      function.
      Signed-off-by: Su Yue <suy.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ drop unlikely ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove root argument from cow_file_range_inline · d02c0e20
      Nikolay Borisov committed
      This argument is always set to the root of the inode, which is also
      passed. So let's get a reference inside the function and simplify
      the arg list.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • F
      Btrfs: skip writeback of last page when truncating file to same size · 213e8c55
      Filipe Manana committed
      When we truncate a file to the same size and that size is not aligned
      with the sector size, we end up triggering writeback (and waiting for it
      to complete) of the last page. This is unnecessary as we can not have delayed
      allocation beyond the inode's i_size and the goal of truncating a file
      to its own size is to discard prealloc extents (allocated via the
      fallocate(2) system call). Besides the unnecessary IO start and wait, it
      also breaks the opportunity for larger contiguous extents on disk, as
      before the last dirty page there might be other dirty pages.
      
      This scenario is probably not very common in general, however it is
      common for btrfs receive implementations because currently the send
      stream always issues a truncate operation for each processed inode as
      the last operation for that inode (this truncate operation is not
      always needed and the send implementation will be addressed to avoid
      them).
      
      So improve this by not starting and waiting for writeback of the inode's
      last page when we are truncating to exactly the same size.
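
As a rough illustration of the decision described above, here is a simplified model; it is an assumption for illustration and does not mirror the actual btrfs_truncate() code. The function name needs_last_page_writeback() and its parameters are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical model: the last page only matters when the new
 * size is not sector aligned, and the change adds the "same size" shortcut
 * in front of that check. */
bool needs_last_page_writeback(uint64_t old_size, uint64_t new_size,
			       uint32_t sectorsize)
{
	assert(sectorsize != 0);
	if (new_size == old_size)
		return false;	/* truncate to the same size: nothing to flush */
	return new_size % sectorsize != 0;
}
```

For a 5000-byte file and a 4096-byte sector size, truncating to 5000 bytes skips the writeback in this model, while truncating to 3000 bytes does not.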
      
      The following script was used to quickly measure the time a receive
      operation takes:
      
       $ cat test_send.sh
       #!/bin/bash
      
       SRC_DEV=/dev/sdc
       DST_DEV=/dev/sdd
       SRC_MNT=/mnt/sdc
       DST_MNT=/mnt/sdd
      
       mkfs.btrfs -f $SRC_DEV >/dev/null
       mkfs.btrfs -f $DST_DEV >/dev/null
       mount $SRC_DEV $SRC_MNT
       mount $DST_DEV $DST_MNT
      
       echo "Creating source filesystem"
       for ((t = 0; t < 10; t++)); do
           (
               for ((i = 1; i <= 20000; i++)); do
                   xfs_io -f -c "pwrite -S 0xab 0 5000" \
                      $SRC_MNT/file_$i > /dev/null
               done
           ) &
           worker_pids[$t]=$!
       done
       wait ${worker_pids[@]}
      
       echo "Creating and sending snapshot"
       btrfs subvolume snapshot -r $SRC_MNT $SRC_MNT/snap1 >/dev/null
       /usr/bin/time -f "send took %e seconds"    \
           btrfs send -f $SRC_MNT/send_file $SRC_MNT/snap1
       /usr/bin/time -f "receive took %e seconds" \
           btrfs receive -f $SRC_MNT/send_file $DST_MNT
      
       umount $SRC_MNT
       umount $DST_MNT
      
      The results for 5 runs were the following:
      
      * Without this change
      
      average receive time was 26.49 seconds
      standard deviation of 2.53 seconds
      
      * With this change
      
      average receive time was 12.51 seconds
      standard deviation of 0.32 seconds
      Reported-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove redundant memory barriers around dio_private error status · de224b7c
      Nikolay Borisov committed
      Using any kind of memory barriers around atomic operations which have
      a return value is redundant, since those operations themselves are
      fully ordered. atomic_t.txt states:
      
          - RMW operations that have a return value are fully ordered;
      
          Fully ordered primitives are ordered against everything prior and
          everything subsequent. Therefore a fully ordered primitive is like
          having an smp_mb() before and an smp_mb() after the primitive.
      
      Given this let's replace the extra memory barriers with comments.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • D
      btrfs: add more __cold annotations · e67c718b
      David Sterba committed
      The __cold functions are placed in a special section, as they're
      expected to be called rarely. This could help i-cache prefetches or help
      the compiler decide which branches are more/less likely to be taken
      without any other annotations needed.
      
      Though we can't add more __exit annotations, it's still possible to add
      __cold (that's also added with __exit). That way the following function
      categories are tagged:
      
      - printf wrappers, error messages
      - exit helpers
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: fix unexpected cow in run_delalloc_nocow · 58113753
      Liu Bo committed
      Fstests generic/475 provides a way to fail metadata reads while
      checking if checksum exists for the inode inside run_delalloc_nocow(),
      and csum_exist_in_range() interprets error (-EIO) as inode having
      checksum and makes its caller enter the cow path.
      
      In case of free space inode, this ends up with a warning in
      cow_file_range().
      
      The same problem applies to btrfs_cross_ref_exist() since it may also
      read metadata in between.
      
      With this, run_delalloc_nocow() bails out when errors occur at the two
      places.
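
The three-way handling described above can be sketched as follows. This is a hedged model, not the kernel code: the enum and classify_lookup() are hypothetical names, and the convention assumed here is that a lookup returns a negative errno on failure, 0 for "not found" and a positive value for "found":

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical outcomes for one range handled on the nocow path. */
enum nocow_decision {
	NOCOW_WRITE,		/* nothing found: write in place */
	NOCOW_FALLBACK_COW,	/* checksum or reference found: use COW */
	NOCOW_BAIL_OUT		/* lookup failed: stop and report the error */
};

/* Classify a lookup result. The point of the fix is the first branch:
 * an error such as -EIO must not be mistaken for "found". */
enum nocow_decision classify_lookup(int ret)
{
	if (ret < 0)
		return NOCOW_BAIL_OUT;
	if (ret > 0)
		return NOCOW_FALLBACK_COW;
	return NOCOW_WRITE;
}
```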
      
      cc: <stable@vger.kernel.org> v2.6.28+
      Fixes: 17d217fe ("Btrfs: fix nodatasum handling in balancing code")
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove custom crc32c init code · 9678c543
      Nikolay Borisov committed
      The custom crc32 init code was introduced in
      14a958e6 ("Btrfs: fix btrfs boot when compiled as built-in") to
      enable using btrfs as a built-in. However, later as pointed out by
      60efa5eb ("Btrfs: use late_initcall instead of module_init") this
      wasn't enough and finally btrfs was switched to late_initcall which
      comes after the generic crc32c implementation is initialised. The
      latter commit superseded the former. Now that we don't have to
      maintain our own code let's just remove it and switch to using the
      generic implementation.
      
      Despite touching a lot of files the patch is really simple. Here is the gist of
      the changes:
      
      1. Select LIBCRC32C rather than the low-level modules.
      2. s/btrfs_crc32c/crc32c/g
      3. replace hash.h with linux/crc32c.h
      4. Move the btrfs namehash funcs to ctree.h and change the tree accordingly.
      
      I've tested this with btrfs being both a module and a built-in and xfstest
      doesn't complain.
      
      This does seem to fix the longstanding problem of not automatically
      selecting the crc32c module when btrfs is used. Possibly there is a
      workaround in dracut.
      
      The modinfo confirms that now all the module dependencies are there:
      
      before:
      depends:        zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
      
      after:
      depends:        libcrc32c,zstd_compress,zstd_decompress,raid6_pq,xor,zlib_deflate
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ add more info to changelog from mails ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • N
      btrfs: Remove btrfs_inode::delayed_iput_count · c1c3fac2
      Nikolay Borisov committed
      delayed_iput_count was supposed to be used to implement, well, delayed
      iput. The idea is that we keep accumulating the number of iputs we do
      until eventually the inode is deleted. Turns out we never really
      switched the delayed_iput_count from 0 to 1, hence all conditional
      code relying on the value of that member being different than 0 was
      never executed. This, as it turns out, didn't cause any problem due
      to the simple fact that the generic inode's i_count member was always
      used to count the number of iputs. So let's just remove the unused
      member and all unused code. This patch essentially provides no
      functional changes. While at it, also add proper documentation for
      btrfs_add_delayed_iput.
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ reformat comment ]
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: do not check inode's runtime flags under root->orphan_lock · 3d5addaf
      Liu Bo committed
      It's not necessary to hold ->orphan_lock when checking inode's runtime
      flags.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • A
      btrfs: use ASSERT to report logical error in cow_file_range() · 566b1760
      Anand Jain committed
      Use ASSERT to report logical error in cow_file_range(), also move it a
      bit closer to when the num_bytes is derived.
      
      The extent start could be (u64)-1 in some cases, the assert should catch
      that we do not accidentally pass it to cow_file_range.
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • A
      btrfs: cow_file_range() num_bytes and disk_num_bytes are same · 3752d22f
      Anand Jain committed
      This patch deletes the local variable disk_num_bytes, as its value
      is the same as num_bytes in the function cow_file_range().
      Signed-off-by: Anand Jain <anand.jain@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  9. 01 March, 2018 1 commit
  10. 02 February, 2018 3 commits
    • L
      Btrfs: fix use-after-free on root->orphan_block_rsv · 1a932ef4
      Liu Bo committed
      I got these from running generic/475,
      
      WARNING: CPU: 0 PID: 26384 at fs/btrfs/inode.c:3326 btrfs_orphan_commit_root+0x1ac/0x2b0 [btrfs]
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      IP: btrfs_block_rsv_release+0x1c/0x70 [btrfs]
      Call Trace:
        btrfs_orphan_release_metadata+0x9f/0x200 [btrfs]
        btrfs_orphan_del+0x10d/0x170 [btrfs]
        btrfs_setattr+0x500/0x640 [btrfs]
        notify_change+0x7ae/0x870
        do_truncate+0xca/0x130
        vfs_truncate+0x2ee/0x3d0
        do_sys_truncate+0xaf/0xf0
        SyS_truncate+0xe/0x10
        entry_SYSCALL_64_fastpath+0x1f/0x96
      
      The race is between btrfs_orphan_commit_root and btrfs_orphan_del,
              t1                                        t2
      btrfs_orphan_commit_root                     btrfs_orphan_del
         spin_lock
         check (&root->orphan_inodes)
         root->orphan_block_rsv = NULL;
         spin_unlock
                                                   atomic_dec(&root->orphan_inodes);
                                                   access root->orphan_block_rsv
      
      Accessing root->orphan_block_rsv must be done before decreasing
      root->orphan_inodes.
      
      cc: <stable@vger.kernel.org> v3.12+
      Fixes: 703c88e0 ("Btrfs: fix tracking of orphan inode count")
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly · e8f1bc14
      Liu Bo committed
      This regression is introduced in
      commit 3d48d981 ("btrfs: Handle uninitialised inode eviction").
      
      There are two problems,
      
      a) it is ->destroy_inode() that does the final free on inode, not
         ->evict_inode(),
      b) clear_inode() must be called before ->evict_inode() returns.
      
      This could end up hitting BUG_ON(inode->i_state != (I_FREEING | I_CLEAR));
      in evict() because I_CLEAR is set in clear_inode().
      
      Fixes: commit 3d48d981 ("btrfs: Handle uninitialised inode eviction")
      Cc: <stable@vger.kernel.org> # v4.7-rc6+
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: fix deadlock in run_delalloc_nocow · e8916699
      Liu Bo committed
      @cur_offset is not set back to what it should be (@cow_start) if
      btrfs_next_leaf() returns something wrong, and the range [cow_start,
      cur_offset) remains locked forever.
      
      cc: <stable@vger.kernel.org>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
  11. 29 January, 2018 3 commits
  12. 22 January, 2018 4 commits
    • N
      btrfs: Use IS_ALIGNED in btrfs_truncate_block instead of opencoding it · b03ebd99
      Nikolay Borisov committed
      No functional changes, just makes the code more readable
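
For readers unfamiliar with the idiom, the sketch below contrasts a named alignment check with its open-coded form. It is a user-space stand-in under stated assumptions: is_aligned() mimics the kernel's IS_ALIGNED() macro for uint64_t only, and is_aligned_opencoded() is a hypothetical example of the inline mask test, not a copy of the replaced code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Portable stand-in for the kernel's IS_ALIGNED() macro, restricted to
 * uint64_t. The alignment must be a power of two for the mask to work. */
bool is_aligned(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x & (align - 1)) == 0;
}

/* Open-coded form of the same test, as it might appear inline. */
bool is_aligned_opencoded(uint64_t offset, uint64_t blocksize)
{
	return (offset & (blocksize - 1)) == 0;
}
```

Both functions compute the same result for power-of-two block sizes; the named helper simply states the intent at the call site.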
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: move extent map specific code to extent_map.c · c04e61b5
      Liu Bo committed
      These helpers are extent map specific, move them to extent_map.c.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: add helper for em merge logic · 7b4df058
      Liu Bo committed
      This is preparatory work for the following extent map selftest, which
      runs tests against em merge logic.
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
    • L
      Btrfs: fix unexpected EEXIST from btrfs_get_extent · 18e83ac7
      Liu Bo committed
      This fixes a corner case that is caused by a race of dio write vs dio
      read/write.
      
      Here is how the race could happen.
      
      Suppose that no extent map has been loaded into memory yet.
      There is a file extent [0, 32K), two jobs are running concurrently
      against it, t1 is doing dio write to [8K, 32K) and t2 is doing dio
      read from [0, 4K) or [4K, 8K).
      
      t1 goes ahead of t2 and splits em [0, 32K) into em [0K, 8K) and [8K, 32K).
      
      ------------------------------------------------------
                   t1                                t2
            btrfs_get_blocks_direct()         btrfs_get_blocks_direct()
             -> btrfs_get_extent()              -> btrfs_get_extent()
                 -> lookup_extent_mapping()
                 -> add_extent_mapping()            -> lookup_extent_mapping()
                    # load [0, 32K)
             -> btrfs_new_extent_direct()
                 -> btrfs_drop_extent_cache()
                    # split [0, 32K) and
                    # drop [8K, 32K)
                 -> add_extent_mapping()
                    # add [8K, 32K)
                                                    -> add_extent_mapping()
                                                       # handle -EEXIST when adding
                                                       # [0, 32K)
      ------------------------------------------------------
      About how t2(dio read/write) runs into -EEXIST:
      
      a) add_extent_mapping() gets -EEXIST for adding em [0, 32k),
      
      b) search_extent_mapping() then returns [0, 8k) as the existing em,
         even though start == existing->start, em is [0, 32k) so that
         extent_map_end(em) > extent_map_end(existing), i.e. 32k > 8k,
      
      c) then it goes thru merge_extent_mapping() which tries to add a [8k, 8k)
         (with a length 0) and returns -EEXIST as [8k, 32k) is already in tree,
      
      d) so btrfs_get_extent() ends up returning -EEXIST to dio read/write,
         which is confusing applications.
      
      Here is a summary of all the possible situations:
      1) start < existing->start
      
                  +-----------+em+-----------+
      +--prev---+ |     +-------------+      |
      |         | |     |             |      |
      +---------+ +     +---+existing++      ++
                      +
                      |
                      +
                   start
      
      2) start == existing->start
      
            +------------em------------+
            |     +-------------+      |
            |     |             |      |
            +     +----existing-+      +
                  |
                  |
                  +
               start
      
      3) start > existing->start && start < (existing->start + existing->len)
      
            +------------em------------+
            |     +-------------+      |
            |     |             |      |
            +     +----existing-+      +
                     |
                     |
                     +
                   start
      
      4) start >= (existing->start + existing->len)
      
      +-----------+em+-----------+
      |     +-------------+      | +--next---+
      |     |             |      | |         |
      +     +---+existing++      + +---------+
                            +
                            |
                            +
                         start
      
      As we can see, it turns out that if start is within existing em (front
      inclusive), then the existing em should be returned as is, otherwise,
      we try our best to merge candidate em with sibling ems to form a
      larger em (in order to reduce the total number of em).
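
The "front inclusive" test from the conclusion above can be sketched as a small predicate. This is an illustrative model only: struct em_range and start_in_existing() are hypothetical names, and an extent map is reduced to a half-open byte range:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Minimal stand-in for an extent map: the range [start, start + len). */
struct em_range {
	uint64_t start;
	uint64_t len;
};

/* True when start falls inside existing, front inclusive. This covers
 * cases 2 and 3 above, where the existing em is returned as is; cases 1
 * and 4 yield false and take the merge path instead. */
bool start_in_existing(uint64_t start, const struct em_range *existing)
{
	return start >= existing->start &&
	       start < existing->start + existing->len;
}
```

In the race described earlier, the existing em is [0, 8K) and the read starts at 0 or 4K, so the predicate holds and the existing em would be returned rather than attempting a zero-length merge.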
      Reported-by: David Vallender <david.vallender@landmark.co.uk>
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>