1. 22 Jan, 2018 15 commits
  2. 03 Jan, 2018 2 commits
    • C
      btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes · ec35e48b
      Chris Mason authored
      refcounts have a generic implementation and an asm optimized one.  The
      generic version has extra debugging to make sure that once a refcount
      goes to zero, refcount_inc won't increase it.
      
      The btrfs delayed inode code wasn't expecting this, and we're tripping
      over the warnings when the generic refcounts are used.  We ended up with
      this race:
      
      Process A                                         Process B
                                                        btrfs_get_delayed_node()
                                                        spin_lock(root->inode_lock)
                                                        radix_tree_lookup()
      __btrfs_release_delayed_node()
      refcount_dec_and_test(&delayed_node->refs)
      our refcount is now zero
                                                        refcount_add(2) <---
                                                        warning here, refcount
                                                        unchanged
      
      spin_lock(root->inode_lock)
      radix_tree_delete()
      
      With the generic refcounts, we actually warn again when process B above
      tries to release his refcount because refcount_add() turned into a
      no-op.
      
      We saw this in production on older kernels without the asm optimized
      refcounts.
      
      The fix used here is to use refcount_inc_not_zero() to detect when the
      object is in the middle of being freed and return NULL.  This is almost
      always the right answer anyway, since we usually end up pitching the
      delayed_node if it didn't have fresh data in it.
      
      This also changes __btrfs_release_delayed_node() to remove the extra
      check for zero refcounts before radix tree deletion.
      btrfs_get_delayed_node() was the only path that was allowing refcounts
      to go from zero to one.
      
      Fixes: 6de5f18e ("btrfs: fix refcount_t usage when deleting btrfs_delayed_node")
      CC: <stable@vger.kernel.org> # 4.12+
      Signed-off-by: Chris Mason <clm@fb.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
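The refcount_inc_not_zero() pattern this commit relies on can be sketched in userspace C with C11 atomics. This is a minimal illustration, not the kernel implementation: `struct node`, `ref_inc_not_zero()`, `ref_dec_and_test()` and `get_node()` are hypothetical names, and the radix tree and spinlock are left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a refcounted delayed node. */
struct node {
    atomic_uint refs;
};

/* Take a reference only if the count is still non-zero. */
static bool ref_inc_not_zero(struct node *n)
{
    unsigned int old = atomic_load(&n->refs);

    while (old != 0) {
        /* On failure, old is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&n->refs, &old, old + 1))
            return true;
    }
    return false; /* count hit zero: the object is being freed */
}

/* Drop a reference; true means the caller dropped the last one. */
static bool ref_dec_and_test(struct node *n)
{
    return atomic_fetch_sub(&n->refs, 1) == 1;
}

/* Lookup helper: return NULL rather than resurrect a dying node. */
static struct node *get_node(struct node *found)
{
    if (found != NULL && ref_inc_not_zero(found))
        return found;
    return NULL;
}
```

Once the count has dropped to zero, `get_node()` returns NULL and leaves the count untouched, so a zero-to-one transition can no longer happen on the lookup path.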
    • N
      btrfs: Fix flush bio leak · beed9263
      Nikolay Borisov authored
      Commit e0ae9994 ("btrfs: preallocate device flush bio") reworked
      the way the flush bio is allocated and used. Concretely it allocates
      the bio in __alloc_device and then re-uses it multiple times with a
      very simple endio routine that just calls complete() without consuming
      a reference. Allocated bios by default come with a ref count of 1,
      which is then consumed by the endio routine (or not, in which case they
      should be bio_put by the caller). The way the implementation works now
      is that the flush bio has a refcount of 2 and we only ever bio_put it
      once, leaving it to hang indefinitely. Fix this by removing the extra
      bio_get in __alloc_device.
      
      Fixes: e0ae9994 ("btrfs: preallocate device flush bio")
      Signed-off-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
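The reference imbalance described in this commit can be illustrated with a small sketch. It assumes a hypothetical `struct obj` in place of a bio and marks the object as freed with a flag instead of releasing memory, so it is safe to run; none of these names come from the block layer.

```c
#include <stdbool.h>

/* Hypothetical refcounted object standing in for the flush bio. */
struct obj {
    int refs;
    bool freed;
};

/* Allocation hands the owner exactly one reference. */
static void obj_init(struct obj *o)
{
    o->refs = 1;
    o->freed = false;
}

static void obj_get(struct obj *o)
{
    o->refs++;
}

/* The object is released only when the last reference is dropped. */
static void obj_put(struct obj *o)
{
    if (--o->refs == 0)
        o->freed = true;
}

/*
 * One setup, any number of reuses whose completion handler does not
 * consume a reference, then a single put at teardown.
 * Returns true if the object was actually released.
 */
static bool lifecycle(bool extra_get)
{
    struct obj o;

    obj_init(&o);
    if (extra_get)
        obj_get(&o);   /* the extra reference nobody ever drops */
    obj_put(&o);       /* the only put the owner performs */
    return o.freed;
}
```

With the extra get the count only falls from 2 to 1 and the object is never released; without it the single put frees the object.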
  3. 07 Dec, 2017 5 commits
  4. 29 Nov, 2017 1 commit
    • F
      Btrfs: incremental send, fix wrong unlink path after renaming file · ea37d599
      Filipe Manana authored
      Under some circumstances, an incremental send operation can issue wrong
      paths for unlink commands related to files that have multiple hard links
      and some (or all) of those links were renamed between the parent and send
      snapshots. Consider the following example:
      
      Parent snapshot
      
       .                                                      (ino 256)
       |---- a/                                               (ino 257)
       |     |---- b/                                         (ino 259)
       |     |     |---- c/                                   (ino 260)
       |     |     |---- f2                                   (ino 261)
       |     |
       |     |---- f2l1                                       (ino 261)
       |
       |---- d/                                               (ino 262)
             |---- f1l1_2                                     (ino 258)
             |---- f2l2                                       (ino 261)
             |---- f1_2                                       (ino 258)
      
      Send snapshot
      
       .                                                      (ino 256)
       |---- a/                                               (ino 257)
       |     |---- f2l1/                                      (ino 263)
       |             |---- b2/                                (ino 259)
       |                   |---- c/                           (ino 260)
       |                   |     |---- d3                     (ino 262)
       |                   |           |---- f1l1_2           (ino 258)
       |                   |           |---- f2l2_2           (ino 261)
       |                   |           |---- f1_2             (ino 258)
       |                   |
       |                   |---- f2                           (ino 261)
       |                   |---- f1l2                         (ino 258)
       |
       |---- d                                                (ino 261)
      
      When computing the incremental send stream the following steps happen:
      
      1) When processing inode 261, a rename operation is issued that renames
         inode 262, which currently has a path of "d", to an orphan name of
         "o262-7-0". This is done because in the send snapshot, inode 261 has
         one of its hard links with a path of "d" as well.
      
      2) Two link operations are issued that create the new hard links for
         inode 261, whose names are "d" and "f2l2_2", at paths "/" and
         "o262-7-0/" respectively.
      
      3) Still while processing inode 261, unlink operations are issued to
         remove the old hard links of inode 261, with names "f2l1" and "f2l2",
         at paths "a/" and "d/". However path "d/" does not correspond anymore
         to the directory inode 262 but corresponds instead to a hard link of
         inode 261 (link command issued in the previous step). This makes the
         receiver fail with a ENOTDIR error when attempting the unlink
         operation.
      
      The problem happens because before sending the unlink operation, we failed
      to detect that inode 262 was one of the ancestors of inode 261 in the parent
      snapshot, and therefore we didn't recompute the path for inode 262 before
      issuing the unlink operation for the link named "f2l2" of inode 261. The
      detection failed because the function "is_ancestor()" only follows the
      first hard link it finds for an inode instead of all of its hard links
      (as it was originally created for being used with directories only, for
      which only one hard link exists). So fix this by making "is_ancestor()"
      follow all hard links of the input inode.
      
      A test case for fstests follows soon.
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
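The difference between following only the first hard link and following all of them can be sketched as below. This is a simplified model under stated assumptions, not the send code: `struct inode_rec`, `lookup()`, `dir_reaches()` and the `all_links` switch are hypothetical, the table encodes the parent snapshot from the example, and the order in which links are visited is chosen for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define ROOT_INO  256UL
#define MAX_LINKS 4

/* Hypothetical record: one parent directory inode per hard link. */
struct inode_rec {
    unsigned long ino;
    size_t nlinks;
    unsigned long parents[MAX_LINKS];
};

/* Parent snapshot from the example above (inode 258 omitted). */
static const struct inode_rec parent_snap[] = {
    { 257, 1, { 256 } },            /* a/             */
    { 259, 1, { 257 } },            /* a/b/           */
    { 260, 1, { 259 } },            /* a/b/c/         */
    { 262, 1, { 256 } },            /* d/             */
    { 261, 3, { 259, 257, 262 } },  /* f2, f2l1, f2l2 */
};
#define NRECS (sizeof(parent_snap) / sizeof(parent_snap[0]))

static const struct inode_rec *lookup(unsigned long ino)
{
    for (size_t i = 0; i < NRECS; i++)
        if (parent_snap[i].ino == ino)
            return &parent_snap[i];
    return NULL;
}

/* Walk from directory dir up to the root, looking for anc. */
static bool dir_reaches(unsigned long dir, unsigned long anc)
{
    while (dir != anc) {
        const struct inode_rec *r = lookup(dir);

        if (dir == ROOT_INO || r == NULL)
            return false;
        dir = r->parents[0];    /* a directory has exactly one link */
    }
    return true;
}

/* all_links == false models the old behaviour: first link only. */
static bool is_ancestor(unsigned long ino, unsigned long anc, bool all_links)
{
    const struct inode_rec *r = lookup(ino);
    size_t n;

    if (r == NULL)
        return false;
    n = all_links ? r->nlinks : 1;
    for (size_t i = 0; i < n; i++)
        if (dir_reaches(r->parents[i], anc))
            return true;
    return false;
}
```

In this model, checking only the first link of inode 261 walks a/b/ up to the root and never sees inode 262, while checking every link finds it through "d/f2l2".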
  5. 28 Nov, 2017 4 commits
    • Q
      btrfs: tree-checker: Fix false panic for sanity test · 69fc6cbb
      Qu Wenruo authored
      [BUG]
      If we run btrfs with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y, it will
      instantly cause a kernel panic like:
      
      ------
      ...
      assertion failed: 0, file: fs/btrfs/disk-io.c, line: 3853
      ...
      Call Trace:
       btrfs_mark_buffer_dirty+0x187/0x1f0 [btrfs]
       setup_items_for_insert+0x385/0x650 [btrfs]
       __btrfs_drop_extents+0x129a/0x1870 [btrfs]
      ...
      -----
      
      [Cause]
      Btrfs will call btrfs_check_leaf() in btrfs_mark_buffer_dirty() to check
      if the leaf is valid with CONFIG_BTRFS_FS_RUN_SANITY_TESTS=y.
      
      However, quite a few btrfs_mark_buffer_dirty() callers(*) only
      initialize their item pointers and leave the item data
      uninitialized.

      The tree-checker then reports the uninitialized data as an error,
      causing the panic.
      
      *: These callers include but are not limited to:
      setup_items_for_insert()
      btrfs_split_item()
      btrfs_expand_item()
      
      [Fix]
      Add a new parameter @check_item_data to btrfs_check_leaf().
      With @check_item_data set to false, the item data check is skipped and
      we fall back to the old btrfs_check_leaf() behavior.
      
      So we can still get early warning if we screw up item pointers, and
      avoid false panic.
      
      Cc: Filipe Manana <fdmanana@gmail.com>
      Reported-by: Lakshmipathi.G <lakshmipathi.g@gmail.com>
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
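The idea of a leaf check with an optional item data pass can be sketched as follows. This is a heavily simplified model and not the tree-checker: `struct item`, `LEAF_SIZE`, the `data_valid` field and the `-1` return value are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stddef.h>

#define LEAF_SIZE 64u   /* assumed size of the leaf data area */

/* Hypothetical simplified item: a byte range plus a payload state. */
struct item {
    unsigned int offset;
    unsigned int size;
    bool data_valid;    /* false models uninitialized item data */
};

/*
 * Item data is packed from the end of the leaf towards the front, so
 * each item must end exactly where the previous one starts.
 * Returns 0 if the leaf looks sane, -1 otherwise.
 */
static int check_leaf(const struct item *items, size_t n, bool check_item_data)
{
    unsigned int end = LEAF_SIZE;

    for (size_t i = 0; i < n; i++) {
        /* Item pointers are always checked. */
        if (items[i].offset + items[i].size != end)
            return -1;
        /* The payload is only inspected when the caller asks for it. */
        if (check_item_data && !items[i].data_valid)
            return -1;
        end = items[i].offset;
    }
    return 0;
}
```

A caller that has set up item pointers but not yet filled in the data passes `false` and still gets the pointer checks; a caller that has a fully populated leaf passes `true`.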
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds authored
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • L
      Btrfs: fix list_add corruption and soft lockups in fsync · ebb70442
      Liu Bo authored
      Xfstests btrfs/146 revealed this corruption,
      
      [   58.138831] Buffer I/O error on dev dm-0, logical block 2621424, async page read
      [   58.151233] BTRFS error (device sdf): bdev /dev/mapper/error-test errs: wr 1, rd 0, flush 0, corrupt 0, gen 0
      [   58.152403] list_add corruption. prev->next should be next (ffff88005e6775d8), but was ffffc9000189be88. (prev=ffffc9000189be88).
      [   58.153518] ------------[ cut here ]------------
      [   58.153892] WARNING: CPU: 1 PID: 1287 at lib/list_debug.c:31 __list_add_valid+0x169/0x1f0
      ...
      [   58.157379] RIP: 0010:__list_add_valid+0x169/0x1f0
      ...
      [   58.161956] Call Trace:
      [   58.162264]  btrfs_log_inode_parent+0x5bd/0xfb0 [btrfs]
      [   58.163583]  btrfs_log_dentry_safe+0x60/0x80 [btrfs]
      [   58.164003]  btrfs_sync_file+0x4c2/0x6f0 [btrfs]
      [   58.164393]  vfs_fsync_range+0x5f/0xd0
      [   58.164898]  do_fsync+0x5a/0x90
      [   58.165170]  SyS_fsync+0x10/0x20
      [   58.165395]  entry_SYSCALL_64_fastpath+0x1f/0xbe
      ...
      
      It turns out that we could record btrfs_log_ctx:io_err in
      log_one_extents when IO fails, but make log_one_extents() return '0'
      instead of -EIO, so the IO error is not acknowledged by the callers,
      i.e.  btrfs_log_inode_parent(), which would remove btrfs_log_ctx:list
      from list head 'root->log_ctxs'.  Since btrfs_log_ctx is allocated
      from stack memory, it'd get freed with an object alive on the
      list.  Then a future list_add will throw the above warning.
      
      This returns the correct error in the above case.
      
      Jeff also reported this while testing against his fsync error
      patch set[1].
      
      [1]: https://www.spinics.net/lists/linux-btrfs/msg65308.html
      "btrfs list corruption and soft lockups while testing writeback error handling"
      
      Fixes: 8407f553 ("Btrfs: fix data corruption after fast fsync and writeback error")
      Signed-off-by: Liu Bo <bo.li.liu@oracle.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
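The failure mode in this commit, an error recorded in the context but not returned, can be sketched with a minimal list. All names here (`fsync_sketch()`, `log_extent()`, `struct list_node`) are hypothetical. The context is passed in by the caller so the demo stays memory-safe, whereas in the commit it lives on the stack. The early return on `io_err` is an assumption about how the outer caller bails out, not something the commit message states.

```c
#include <errno.h>
#include <stdbool.h>

struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *h) { h->prev = h->next = h; }
static bool list_is_empty(const struct list_node *h) { return h->next == h; }

static void list_add_node(struct list_node *n, struct list_node *h)
{
    n->next = h->next;
    n->prev = h;
    h->next->prev = n;
    h->next = n;
}

static void list_del_node(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    n->prev = n->next = n;
}

struct log_ctx { int io_err; struct list_node list; };

/* Records the I/O error; returns it only when propagate is set. */
static int log_extent(struct log_ctx *ctx, bool io_failed, bool propagate)
{
    if (!io_failed)
        return 0;
    ctx->io_err = -EIO;
    return propagate ? -EIO : 0;   /* 0 here models the bug */
}

static int fsync_sketch(struct list_node *ctxs, struct log_ctx *ctx,
                        bool io_failed, bool propagate)
{
    int ret;

    ctx->io_err = 0;
    list_add_node(&ctx->list, ctxs);
    ret = log_extent(ctx, io_failed, propagate);
    if (ret) {
        list_del_node(&ctx->list);   /* error path unlinks the context */
        return ret;
    }
    if (ctx->io_err)
        return ctx->io_err;          /* assumed bail-out: ctx stays linked */
    list_del_node(&ctx->list);       /* stand-in for normal completion */
    return 0;
}
```

When the error is swallowed, the function still reports -EIO through `io_err` but leaves the context on the list, which is the stale entry a later list_add trips over; returning the error keeps the list clean.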
    • Q
      btrfs: Fix wild memory access in compression level parser · eae8d825
      Qu Wenruo authored
      [BUG]
      Kernel panic when mounting with "-o compress" mount option.
      KASAN will report like:
      ------
      ==================================================================
      BUG: KASAN: wild-memory-access in strncmp+0x31/0xc0
      Read of size 1 at addr d86735fce994f800 by task mount/662
      ...
      Call Trace:
       dump_stack+0xe3/0x175
       kasan_report+0x163/0x370
       __asan_load1+0x47/0x50
       strncmp+0x31/0xc0
       btrfs_compress_str2level+0x20/0x70 [btrfs]
       btrfs_parse_options+0xff4/0x1870 [btrfs]
       open_ctree+0x2679/0x49f0 [btrfs]
       btrfs_mount+0x1b7f/0x1d30 [btrfs]
       mount_fs+0x49/0x190
       vfs_kern_mount.part.29+0xba/0x280
       vfs_kern_mount+0x13/0x20
       btrfs_mount+0x31e/0x1d30 [btrfs]
       mount_fs+0x49/0x190
       vfs_kern_mount.part.29+0xba/0x280
       do_mount+0xaad/0x1a00
       SyS_mount+0x98/0xe0
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      ------
      
      [Cause]
      For the 'compress' and 'compress_force' options, the token doesn't expect
      any parameter, so args[0] contains uninitialized data.
      Accessing args[0] causes the above wild memory access.
      
      [Fix]
      For Opt_compress and Opt_compress_force, set compression level to
      the default.
      Signed-off-by: Qu Wenruo <wqu@suse.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      [ set the default in advance ]
      Signed-off-by: David Sterba <dsterba@suse.com>
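The fix amounts to never reading an option argument that was not supplied. A minimal sketch of such a parser follows. `compress_level()` is a hypothetical helper, the `zlib:N` syntax and the default level of 3 are assumptions for illustration, and a NULL pointer stands in for "the option carried no value".

```c
#include <stddef.h>
#include <string.h>

#define ZLIB_DEFAULT_LEVEL 3   /* assumed default for this sketch */

/*
 * Parse "zlib" or "zlib:N" with N in 1..9.
 * arg is NULL when the mount option had no "=value" part.
 * Returns the level, or 0 for a non-zlib algorithm.
 */
static int compress_level(const char *arg)
{
    if (arg == NULL)
        return ZLIB_DEFAULT_LEVEL;   /* bare "compress": use the default */
    if (strncmp(arg, "zlib", 4) != 0)
        return 0;
    /* Short-circuit evaluation keeps every index inside the string. */
    if (arg[4] == ':' && arg[5] >= '1' && arg[5] <= '9' && arg[6] == '\0')
        return arg[5] - '0';
    return ZLIB_DEFAULT_LEVEL;
}
```

The important part is the first check: the default is chosen before any byte of the argument is touched.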
  6. 27 Nov, 2017 1 commit
  7. 21 Nov, 2017 1 commit
    • J
      btrfs: clear space cache inode generation always · 8e138e0d
      Josef Bacik authored
      We discovered a box that had double allocations, and suspected the space
      cache may be to blame.  While auditing the write out path I noticed that
      if we've already set up the space cache we will just carry on.  This
      means that for any error we hit after cache_save_setup, before we
      actually write the cache out, we won't reset the inode generation, so
      whatever was already written will be considered correct, except it'll be
      stale.  Fix this by _always_ resetting the generation on the block group
      inode, this way we only ever have valid or invalid cache.
      
      With this patch I was no longer able to reproduce cache corruption with
      dm-log-writes and my bpf error injection tool.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
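The "always reset the generation" rule can be sketched as a small state model. The names `struct cache_inode`, `write_cache()` and `cache_is_valid()` are hypothetical, and the real write-out path has many more steps. The sketch only shows why clearing first leaves no window in which stale data looks current.

```c
#include <stdbool.h>

/* Hypothetical cache inode: generation 0 means "cache is invalid". */
struct cache_inode {
    unsigned long generation;
};

/*
 * Write-out sketch: invalidate unconditionally, then stamp the
 * generation only after the write has fully succeeded.
 * fail simulates an error anywhere between setup and write-out.
 */
static int write_cache(struct cache_inode *ci, unsigned long transid, bool fail)
{
    ci->generation = 0;          /* always reset first */
    if (fail)
        return -1;               /* old contents can no longer pass as valid */
    ci->generation = transid;
    return 0;
}

/* A loader trusts the cache only if the generation matches. */
static bool cache_is_valid(const struct cache_inode *ci, unsigned long expected)
{
    return ci->generation != 0 && ci->generation == expected;
}
```

After a failed write-out the cache is simply invalid, so a later load rebuilds it instead of trusting leftover contents.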
  8. 16 Nov, 2017 5 commits
  9. 15 Nov, 2017 5 commits
  10. 13 Nov, 2017 1 commit
    • D
      Pass mode to wait_on_atomic_t() action funcs and provide default actions · 5e4def20
      David Howells authored
      Make wait_on_atomic_t() pass the TASK_* mode onto its action function as an
      extra argument and make it 'unsigned int' throughout.
      
      Also, consolidate a bunch of identical action functions into a default
      function that can do the appropriate thing for the mode.
      
      Also, change the argument name in the bit_wait*() function declarations to
      reflect the fact that it's the mode and not the bit number.
      
      [Peter Z gives this a grudging ACK, but thinks that the whole atomic_t wait
      should be done differently, though he's not immediately sure as to how]
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      cc: Ingo Molnar <mingo@kernel.org>