1. 10 11月, 2018 1 次提交
    • V
      ext4: missing !bh check in ext4_xattr_inode_write() · eb6984fa
      Vasily Averin 提交于
      According to Ted Ts'o ext4_getblk() called in ext4_xattr_inode_write()
      should not return bh = NULL
      
      The only time that bh could be NULL, then, would be in the case of
      something really going wrong; a programming error elsewhere (perhaps a
      wild pointer dereference) or I/O error causing on-disk file system
      corruption (although that would be highly unlikely given that we had
      *just* allocated the blocks and so the metadata blocks in question
      probably would still be in the cache).
      
      Fixes: e50e5129 ("ext4: xattr-in-inode support")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.13
      eb6984fa
  2. 09 11月, 2018 3 次提交
  3. 08 11月, 2018 9 次提交
    • E
      mount: Prevent MNT_DETACH from disconnecting locked mounts · 9c8e0a1b
      Eric W. Biederman 提交于
      Timothy Baldwin <timbaldwin@fastmail.co.uk> wrote:
      > As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
      >
      >   Mounts that come as a single unit from more privileged mount are locked
      >   together and may not be separated in a less privileged mount namespace.
      >
      > However they can:
      >
      > 1. Create a mount namespace.
      > 2. In the mount namespace open a file descriptor to the parent of a mount point.
      > 3. Destroy the mount namespace.
      > 4. Use the file descriptor to look under the mount point.
      >
      > I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
      >
      > The setup:
      >
      > $ sudo sysctl kernel.unprivileged_userns_clone=1
      > kernel.unprivileged_userns_clone = 1
      > $ mkdir -p A/B/Secret
      > $ sudo mount -t tmpfs hide A/B
      >
      >
      > "Secret" is indeed hidden as expected:
      >
      > $ ls -lR A
      > A:
      > total 0
      > drwxrwxrwt 2 root root 40 Feb 12 21:08 B
      >
      > A/B:
      > total 0
      >
      >
      > The attack revealing "Secret":
      >
      > $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4<A"
      > /proc/self/fd/4/:
      > total 0
      > drwxr-xr-x 3 root root 60 Feb 12 21:08 B
      >
      > /proc/self/fd/4/B:
      > total 0
      > drwxr-xr-x 2 root root 40 Feb 12 21:08 Secret
      >
      > /proc/self/fd/4/B/Secret:
      > total 0
      
      I tracked this down to put_mnt_ns running passing UMOUNT_SYNC and
      disconnecting all of the mounts in a mount namespace.  Fix this by
      factoring drop_mounts out of drop_collected_mounts and passing
      0 instead of UMOUNT_SYNC.
      
      There are two possible behavior differences that result from this.
      - No longer setting UMOUNT_SYNC will no longer set MNT_SYNC_UMOUNT on
        the vfsmounts being unmounted.  This effects the lazy rcu walk by
        kicking the walk out of rcu mode and forcing it to be a non-lazy
        walk.
      - No longer disconnecting locked mounts will keep some mounts around
        longer as they stay because the are locked to other mounts.
      
      There are only two users of drop_collected mounts: audit_tree.c and
      put_mnt_ns.
      
      In audit_tree.c the mounts are private and there are no rcu lazy walks
      only calls to iterate_mounts. So the changes should have no effect
      except for a small timing effect as the connected mounts are disconnected.
      
      In put_mnt_ns there may be references from process outside the mount
      namespace to the mounts.  So the mounts remaining connected will
      be the bug fix that is needed.  That rcu walks are allowed to continue
      appears not to be a problem especially as the rcu walk change was about
      an implementation detail not about semantics.
      
      Cc: stable@vger.kernel.org
      Fixes: 5ff9d8a6 ("vfs: Lock in place mounts from more privileged users")
      Reported-by: NTimothy Baldwin <timbaldwin@fastmail.co.uk>
      Tested-by: NTimothy Baldwin <timbaldwin@fastmail.co.uk>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      9c8e0a1b
    • E
      mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts · df7342b2
      Eric W. Biederman 提交于
      Jonathan Calmels from NVIDIA reported that he's able to bypass the
      mount visibility security check in place in the Linux kernel by using
      a combination of the unbindable property along with the private mount
      propagation option to allow a unprivileged user to see a path which
      was purposefully hidden by the root user.
      
      Reproducer:
        # Hide a path to all users using a tmpfs
        root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
        root@castiana:~#
      
        # As an unprivileged user, unshare user namespace and mount namespace
        stgraber@castiana:~$ unshare -U -m -r
      
        # Confirm the path is still not accessible
        root@castiana:~# ls /sys/devices/
      
        # Make /sys recursively unbindable and private
        root@castiana:~# mount --make-runbindable /sys
        root@castiana:~# mount --make-private /sys
      
        # Recursively bind-mount the rest of /sys over to /mnnt
        root@castiana:~# mount --rbind /sys/ /mnt
      
        # Access our hidden /sys/device as an unprivileged user
        root@castiana:~# ls /mnt/devices/
        breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
        LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
        tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual
      
      Solve this by teaching copy_tree to fail if a mount turns out to be
      both unbindable and locked.
      
      Cc: stable@vger.kernel.org
      Fixes: 5ff9d8a6 ("vfs: Lock in place mounts from more privileged users")
      Reported-by: NJonathan Calmels <jcalmels@nvidia.com>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      df7342b2
    • E
      mount: Retest MNT_LOCKED in do_umount · 25d202ed
      Eric W. Biederman 提交于
      It was recently pointed out that the one instance of testing MNT_LOCKED
      outside of the namespace_sem is in ksys_umount.
      
      Fix that by adding a test inside of do_umount with namespace_sem and
      the mount_lock held.  As it helps to fail fails the existing test is
      maintained with an additional comment pointing out that it may be racy
      because the locks are not held.
      
      Cc: stable@vger.kernel.org
      Reported-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Fixes: 5ff9d8a6 ("vfs: Lock in place mounts from more privileged users")
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      25d202ed
    • V
      ext4: fix buffer leak in __ext4_read_dirblock() on error path · de59fae0
      Vasily Averin 提交于
      Fixes: dc6982ff ("ext4: refactor code to read directory blocks ...")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 3.9
      de59fae0
    • O
      Btrfs: fix missing delayed iputs on unmount · d6fd0ae2
      Omar Sandoval 提交于
      There's a race between close_ctree() and cleaner_kthread().
      close_ctree() sets btrfs_fs_closing(), and the cleaner stops when it
      sees it set, but this is racy; the cleaner might have already checked
      the bit and could be cleaning stuff. In particular, if it deletes unused
      block groups, it will create delayed iputs for the free space cache
      inodes. As of "btrfs: don't run delayed_iputs in commit", we're no
      longer running delayed iputs after a commit. Therefore, if the cleaner
      creates more delayed iputs after delayed iputs are run in
      btrfs_commit_super(), we will leak inodes on unmount and get a busy
      inode crash from the VFS.
      
      Fix it by parking the cleaner before we actually close anything. Then,
      any remaining delayed iputs will always be handled in
      btrfs_commit_super(). This also ensures that the commit in close_ctree()
      is really the last commit, so we can get rid of the commit in
      cleaner_kthread().
      
      The fstest/generic/475 followed by 476 can trigger a crash that
      manifests as a slab corruption caused by accessing the freed kthread
      structure by a wake up function. Sample trace:
      
      [ 5657.077612] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
      [ 5657.079432] PGD 1c57a067 P4D 1c57a067 PUD da10067 PMD 0
      [ 5657.080661] Oops: 0000 [#1] PREEMPT SMP
      [ 5657.081592] CPU: 1 PID: 5157 Comm: fsstress Tainted: G        W         4.19.0-rc8-default+ #323
      [ 5657.083703] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626cc-prebuilt.qemu-project.org 04/01/2014
      [ 5657.086577] RIP: 0010:shrink_page_list+0x2f9/0xe90
      [ 5657.091937] RSP: 0018:ffffb5c745c8f728 EFLAGS: 00010287
      [ 5657.092953] RAX: 0000000000000074 RBX: ffffb5c745c8f830 RCX: 0000000000000000
      [ 5657.094590] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff9a8747fdf3d0
      [ 5657.095987] RBP: ffffb5c745c8f9e0 R08: 0000000000000000 R09: 0000000000000000
      [ 5657.097159] R10: ffff9a8747fdf5e8 R11: 0000000000000000 R12: ffffb5c745c8f788
      [ 5657.098513] R13: ffff9a877f6ff2c0 R14: ffff9a877f6ff2c8 R15: dead000000000200
      [ 5657.099689] FS:  00007f948d853b80(0000) GS:ffff9a877d600000(0000) knlGS:0000000000000000
      [ 5657.101032] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 5657.101953] CR2: 00000000000000cc CR3: 00000000684bd000 CR4: 00000000000006e0
      [ 5657.103159] Call Trace:
      [ 5657.103776]  shrink_inactive_list+0x194/0x410
      [ 5657.104671]  shrink_node_memcg.constprop.84+0x39a/0x6a0
      [ 5657.105750]  shrink_node+0x62/0x1c0
      [ 5657.106529]  try_to_free_pages+0x1a4/0x500
      [ 5657.107408]  __alloc_pages_slowpath+0x2c9/0xb20
      [ 5657.108418]  __alloc_pages_nodemask+0x268/0x2b0
      [ 5657.109348]  kmalloc_large_node+0x37/0x90
      [ 5657.110205]  __kmalloc_node+0x236/0x310
      [ 5657.111014]  kvmalloc_node+0x3e/0x70
      
      Fixes: 30928e9b ("btrfs: don't run delayed_iputs in commit")
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      [ add trace ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      d6fd0ae2
    • V
      ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path · 53692ec0
      Vasily Averin 提交于
      Fixes: de05ca85 ("ext4: move call to ext4_error() into ...")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.17
      53692ec0
    • V
      ext4: fix buffer leak in ext4_xattr_move_to_block() on error path · 6bdc9977
      Vasily Averin 提交于
      Fixes: 3f2571c1 ("ext4: factor out xattr moving")
      Fixes: 6dd4ee7c ("ext4: Expand extra_inodes space per ...")
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.23
      6bdc9977
    • V
      ext4: release bs.bh before re-using in ext4_xattr_block_find() · 45ae932d
      Vasily Averin 提交于
      bs.bh was taken in previous ext4_xattr_block_find() call,
      it should be released before re-using
      
      Fixes: 7e01c8e5 ("ext3/4: fix uninitialized bs in ...")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.26
      45ae932d
    • V
      ext4: fix buffer leak in ext4_xattr_get_block() on error path · ecaaf408
      Vasily Averin 提交于
      Fixes: dec214d0 ("ext4: xattr inode deduplication")
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.13
      ecaaf408
  4. 07 11月, 2018 8 次提交
  5. 06 11月, 2018 11 次提交
    • D
      xfs: fix overflow in xfs_attr3_leaf_verify · 837514f7
      Dave Chinner 提交于
      generic/070 on 64k block size filesystems is failing with a verifier
      corruption on writeback or an attribute leaf block:
      
      [   94.973083] XFS (pmem0): Metadata corruption detected at xfs_attr3_leaf_verify+0x246/0x260, xfs_attr3_leaf block 0x811480
      [   94.975623] XFS (pmem0): Unmount and run xfs_repair
      [   94.976720] XFS (pmem0): First 128 bytes of corrupted metadata buffer:
      [   94.978270] 000000004b2e7b45: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      [   94.980268] 000000006b1db90b: 00 00 00 00 00 81 14 80 00 00 00 00 00 00 00 00  ................
      [   94.982251] 00000000433f2407: 22 7b 5c 82 2d 5c 47 4c bb 31 1c 37 fa a9 ce d6  "{\.-\GL.1.7....
      [   94.984157] 0000000010dc7dfb: 00 00 00 00 00 81 04 8a 00 0a 18 e8 dd 94 01 00  ................
      [   94.986215] 00000000d5a19229: 00 a0 dc f4 fe 98 01 68 f0 d8 07 e0 00 00 00 00  .......h........
      [   94.988171] 00000000521df36c: 0c 2d 32 e2 fe 20 01 00 0c 2d 58 65 fe 0c 01 00  .-2.. ...-Xe....
      [   94.990162] 000000008477ae06: 0c 2d 5b 66 fe 8c 01 00 0c 2d 71 35 fe 7c 01 00  .-[f.....-q5.|..
      [   94.992139] 00000000a4a6bca6: 0c 2d 72 37 fc d4 01 00 0c 2d d8 b8 f0 90 01 00  .-r7.....-......
      [   94.994789] XFS (pmem0): xfs_do_force_shutdown(0x8) called from line 1453 of file fs/xfs/xfs_buf.c. Return address = ffffffff815365f3
      
      This is failing this check:
      
                      end = ichdr.freemap[i].base + ichdr.freemap[i].size;
                      if (end < ichdr.freemap[i].base)
      >>>>>                   return __this_address;
                      if (end > mp->m_attr_geo->blksize)
                              return __this_address;
      
      And from the buffer output above, the freemap array is:
      
      	freemap[0].base = 0x00a0
      	freemap[0].size = 0xdcf4	end = 0xdd94
      	freemap[1].base = 0xfe98
      	freemap[1].size = 0x0168	end = 0x10000
      	freemap[2].base = 0xf0d8
      	freemap[2].size = 0x07e0	end = 0xf8b8
      
      These all look valid - the block size is 0x10000 and so from the
      last check in the above verifier fragment we know that the end
      of freemap[1] is valid. The problem is that end is declared as:
      
      	uint16_t	end;
      
      And (uint16_t)0x10000 = 0. So we have a verifier bug here, not a
      corruption. Fix the verifier to use uint32_t types for the check and
      hence avoid the overflow.
      
      Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=201577Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      837514f7
    • D
      xfs: print buffer offsets when dumping corrupt buffers · bdec055b
      Darrick J. Wong 提交于
      Use DUMP_PREFIX_OFFSET when printing hex dumps of corrupt buffers
      because modern Linux now prints a 32-bit hash of our 64-bit pointer when
      using DUMP_PREFIX_ADDRESS:
      
      00000000b4bb4297: 00 00 00 00 00 00 00 00 3b ee 00 00 00 00 00 00  ........;.......
      00000005ec77e26: 00 00 00 00 02 d0 5a 00 00 00 00 00 00 00 00 00  ......Z.........
      000000015938018: 21 98 e8 b4 fd de 4c 07 bc ea 3c e5 ae b4 7c 48  !.....L...<...|H
      
      This is totally worthless for a sequential dump since we probably only
      care about tracking the buffer offsets and afaik there's no way to
      recover the actual pointer from the hashed value.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      bdec055b
    • C
      xfs: Fix error code in 'xfs_ioc_getbmap()' · 132bf672
      Christophe JAILLET 提交于
      In this function, once 'buf' has been allocated, we unconditionally
      return 0.
      However, 'error' is set to some error codes in several error handling
      paths.
      Before commit 232b5194 ("xfs: simplify the xfs_getbmap interface")
      this was not an issue because all error paths were returning directly,
      but now that some cleanup at the end may be needed, we must propagate the
      error code.
      
      Fixes: 232b5194 ("xfs: simplify the xfs_getbmap interface")
      Signed-off-by: NChristophe JAILLET <christophe.jaillet@wanadoo.fr>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      132bf672
    • F
      Btrfs: fix data corruption due to cloning of eof block · ac765f83
      Filipe Manana 提交于
      We currently allow cloning a range from a file which includes the last
      block of the file even if the file's size is not aligned to the block
      size. This is fine and useful when the destination file has the same size,
      but when it does not and the range ends somewhere in the middle of the
      destination file, it leads to corruption because the bytes between the EOF
      and the end of the block have undefined data (when there is support for
      discard/trimming they have a value of 0x00).
      
      Example:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ export foo_size=$((256 * 1024 + 100))
       $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
      
       $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
      
       $ od -A d -t x1 /mnt/bar
       0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
       *
       0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
       0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       *
       0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       1048576
      
      The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
      (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
      of 0xb5.
      
      This is similar to the problem we had for deduplication that got recently
      fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files").
      
      Fix this by not allowing such operations to be performed and return the
      errno -EINVAL to user space. This is what XFS is doing as well at the VFS
      level. This change however now makes us return -EINVAL instead of
      -EOPNOTSUPP for cases where the source range maps to an inline extent and
      the destination range's end is smaller then the destination file's size,
      since the detection of inline extents is done during the actual process of
      dropping file extent items (at __btrfs_drop_extents()). Returning the
      -EINVAL error is done early on and solely based on the input parameters
      (offsets and length) and destination file's size. This makes us consistent
      with XFS and anyone else supporting cloning since this case is now checked
      at a higher level in the VFS and is where the -EINVAL will be returned
      from starting with kernel 4.20 (the VFS changed was introduced in 4.20-rc1
      by commit 07d19dc9 ("vfs: avoid problematic remapping requests into
      partial EOF block"). So this change is more geared towards stable kernels,
      as it's unlikely the new VFS checks get removed intentionally.
      
      A test case for fstests follows soon, as well as an update to filter
      existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      ac765f83
    • F
      Btrfs: fix infinite loop on inode eviction after deduplication of eof block · 11023d3f
      Filipe Manana 提交于
      If we attempt to deduplicate the last block of a file A into the middle of
      a file B, and file A's size is not a multiple of the block size, we end
      rounding the deduplication length to 0 bytes, to avoid the data corruption
      issue fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files"). However a length of zero will
      cause the insertion of an extent state with a start value greater (by 1)
      then the end value, leading to a corrupt extent state that will trigger a
      warning and cause chaos such as an infinite loop during inode eviction.
      Example trace:
      
       [96049.833585] ------------[ cut here ]------------
       [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1
       [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282
       [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0
       [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8
       [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78
       [96049.833790] FS:  00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000
       [96049.833791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0
       [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.833796] Call Trace:
       [96049.833809]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.833823]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.833831]  ? _raw_spin_unlock+0x24/0x30
       [96049.833841]  ? test_range_bit+0xdf/0x130 [btrfs]
       [96049.833853]  lock_extent_range+0x8e/0x150 [btrfs]
       [96049.833864]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
       [96049.833875]  btrfs_extent_same_range+0x14e/0x550 [btrfs]
       [96049.833885]  ? rcu_read_lock_sched_held+0x3f/0x70
       [96049.833890]  ? __kmalloc_node+0x2b0/0x2f0
       [96049.833899]  ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs]
       [96049.833909]  btrfs_dedupe_file_range+0x270/0x280 [btrfs]
       [96049.833916]  vfs_dedupe_file_range_one+0xd9/0xe0
       [96049.833919]  vfs_dedupe_file_range+0x131/0x1b0
       [96049.833924]  do_vfs_ioctl+0x272/0x6e0
       [96049.833927]  ? __fget+0x113/0x200
       [96049.833931]  ksys_ioctl+0x70/0x80
       [96049.833933]  __x64_sys_ioctl+0x16/0x20
       [96049.833937]  do_syscall_64+0x60/0x1b0
       [96049.833939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.833941] RIP: 0033:0x7f5c1478ddd7
       [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7
       [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004
       [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040
       [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000
       [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0
       [96049.833954] irq event stamp: 6196
       [96049.833956] hardirqs last  enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.833959] softirqs last  enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.833965] ---[ end trace db7b05f01b7fa10c ]---
       [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.935822] irq event stamp: 6584
       [96049.935823] hardirqs last  enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.935827] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.935829] ---[ end trace db7b05f01b7fa123 ]---
       [96049.935840] ------------[ cut here ]------------
       [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G        W         4.19.0-rc7-btrfs-next-39 #1
       [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282
       [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0
       [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8
       [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48
       [96049.936125] FS:  00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000
       [96049.936126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0
       [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.936131] Call Trace:
       [96049.936141]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.936154]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.936167]  btrfs_evict_inode+0x1e1/0x5a0 [btrfs]
       [96049.936172]  evict+0xbf/0x1c0
       [96049.936174]  dispose_list+0x51/0x80
       [96049.936176]  evict_inodes+0x193/0x1c0
       [96049.936180]  generic_shutdown_super+0x3f/0x110
       [96049.936182]  kill_anon_super+0xe/0x30
       [96049.936189]  btrfs_kill_super+0x13/0x100 [btrfs]
       [96049.936191]  deactivate_locked_super+0x3a/0x70
       [96049.936193]  cleanup_mnt+0x3b/0x80
       [96049.936195]  task_work_run+0x93/0xc0
       [96049.936198]  exit_to_usermode_loop+0xfa/0x100
       [96049.936201]  do_syscall_64+0x17f/0x1b0
       [96049.936202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.936204] RIP: 0033:0x7fe4b80cfb37
       [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37
       [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0
       [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015
       [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936216] irq event stamp: 6616
       [96049.936219] hardirqs last  enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.936222] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.936223] ---[ end trace db7b05f01b7fa124 ]---
      
      The second stack trace, from inode eviction, is repeated forever due to
      the infinite loop during eviction.
      
      This is the same type of problem fixed way back in 2015 by commit
      113e8283 ("Btrfs: fix inode eviction infinite loop after extent_same
      ioctl") and commit ccccf3d6 ("Btrfs: fix inode eviction infinite loop
      after cloning into it").
      
      So fix this by returning immediately if the deduplication range length
      gets rounded down to 0 bytes, as there is nothing that needs to be done in
      such case.
      
      Example reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar
      
       # Unmount the filesystem and mount it again so that we start without any
       # extent state records when we ask for the deduplication.
       $ umount /mnt
       $ mount /dev/sdb /mnt
      
       $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar
      
       # This unmount triggers the infinite loop.
       $ umount /mnt
      
      A test case for fstests will follow soon.
      
      Fixes: de02b9f6 ("Btrfs: fix data corruption when deduplicating between different files")
      CC: <stable@vger.kernel.org> # 4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      11023d3f
    • F
      Btrfs: fix deadlock on tree root leaf when finding free extent · 4222ea71
      Filipe Manana 提交于
      When we are writing out a free space cache, during the transaction commit
      phase, we can end up in a deadlock which results in a stack trace like the
      following:
      
       schedule+0x28/0x80
       btrfs_tree_read_lock+0x8e/0x120 [btrfs]
       ? finish_wait+0x80/0x80
       btrfs_read_lock_root_node+0x2f/0x40 [btrfs]
       btrfs_search_slot+0xf6/0x9f0 [btrfs]
       ? evict_refill_and_join+0xd0/0xd0 [btrfs]
       ? inode_insert5+0x119/0x190
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_iget+0x113/0x690 [btrfs]
       __lookup_free_space_inode+0xd8/0x150 [btrfs]
       lookup_free_space_inode+0x5b/0xb0 [btrfs]
       load_free_space_cache+0x7c/0x170 [btrfs]
       ? cache_block_group+0x72/0x3b0 [btrfs]
       cache_block_group+0x1b3/0x3b0 [btrfs]
       ? finish_wait+0x80/0x80
       find_free_extent+0x799/0x1010 [btrfs]
       btrfs_reserve_extent+0x9b/0x180 [btrfs]
       btrfs_alloc_tree_block+0x1b3/0x4f0 [btrfs]
       __btrfs_cow_block+0x11d/0x500 [btrfs]
       btrfs_cow_block+0xdc/0x180 [btrfs]
       btrfs_search_slot+0x3bd/0x9f0 [btrfs]
       btrfs_lookup_inode+0x3a/0xc0 [btrfs]
       ? kmem_cache_alloc+0x166/0x1d0
       btrfs_update_inode_item+0x46/0x100 [btrfs]
       cache_save_setup+0xe4/0x3a0 [btrfs]
       btrfs_start_dirty_block_groups+0x1be/0x480 [btrfs]
       btrfs_commit_transaction+0xcb/0x8b0 [btrfs]
      
      At cache_save_setup() we need to update the inode item of a block group's
      cache which is located in the tree root (fs_info->tree_root), which means
      that it may result in COWing a leaf from that tree. If that happens we
      need to find a free metadata extent and while looking for one, if we find
      a block group which was not cached yet we attempt to load its cache by
      calling cache_block_group(). However this function will try to load the
      inode of the free space cache, which requires finding the matching inode
      item in the tree root - if that inode item is located in the same leaf as
      the inode item of the space cache we are updating at cache_save_setup(),
      we end up in a deadlock, since we try to obtain a read lock on the same
      extent buffer that we previously write locked.
      
      So fix this by using the tree root's commit root when searching for a
      block group's free space cache inode item when we are attempting to load
      a free space cache. This is safe since block groups once loaded stay in
      memory forever, as well as their caches, so after they are first loaded
      we will never need to read their inode items again. For new block groups,
      once they are created they get their ->cached field set to
      BTRFS_CACHE_FINISHED meaning we will not need to read their inode item.
      Reported-by: NAndrew Nelson <andrew.s.nelson@gmail.com>
      Link: https://lore.kernel.org/linux-btrfs/CAPTELenq9x5KOWuQ+fa7h1r3nsJG8vyiTH8+ifjURc_duHh2Wg@mail.gmail.com/
      Fixes: 9d66e233 ("Btrfs: load free space cache if it exists")
      Tested-by: NAndrew Nelson <andrew.s.nelson@gmail.com>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      4222ea71
    • A
      btrfs: avoid link error with CONFIG_NO_AUTO_INLINE · 7e17916b
      Arnd Bergmann 提交于
      Note: this patch fixes a problem in a feature outside of btrfs ("kernel
      hacking: add a config option to disable compiler auto-inlining") and is
      applied ahead of time due to cross-subsystem dependencies.
      
      On 32-bit ARM with gcc-8, I see a link error with the addition of the
      CONFIG_NO_AUTO_INLINE option:
      
      fs/btrfs/super.o: In function `btrfs_statfs':
      super.c:(.text+0x67b8): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x67fc): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x6858): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x6920): undefined reference to `__aeabi_uldivmod'
      super.c:(.text+0x693c): undefined reference to `__aeabi_uldivmod'
      fs/btrfs/super.o:super.c:(.text+0x6958): more undefined references to `__aeabi_uldivmod' follow
      
      So far this is the only file that shows the behavior, so I'd propose
      to just work around it by marking the functions as 'static inline'
      that normally get inlined here.
      
      The reference to __aeabi_uldivmod comes from a div_u64() which has an
      optimization for a constant division that uses a straight '/' operator
      when the result should be known to the compiler. My interpretation is
      that as we turn off inlining, gcc still expects the result to be constant
      but fails to use that constant value.
      
      Link: https://lkml.kernel.org/r/20181103153941.1881966-1-arnd@arndb.deReviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NChangbin Du <changbin.du@gmail.com>
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      [ add the note ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      7e17916b
    • S
      btrfs: tree-checker: Fix misleading group system information · 761333f2
      Shaokun Zhang 提交于
      block_group_err shows the group system as a decimal value with a '0x'
      prefix, which is somewhat misleading.
      
      Fix it to print hexadecimal, as was intended.
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NNikolay Borisov <nborisov@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NShaokun Zhang <zhangshaokun@hisilicon.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      761333f2
    • F
      Btrfs: fix missing data checksums after a ranged fsync (msync) · 008c6753
      Filipe Manana 提交于
      Recently we got a massive simplification for fsync, where for the fast
      path we no longer log new extents while their respective ordered extents
      are still running.
      
      However that simplification introduced a subtle regression for the case
      where we use a ranged fsync (msync). Consider the following example:
      
                     CPU 0                                    CPU 1
      
                                                  mmap write to range [2Mb, 4Mb[
        mmap write to range [512Kb, 1Mb[
        msync range [512K, 1Mb[
          --> triggers fast fsync
              (BTRFS_INODE_NEEDS_FULL_SYNC
               not set)
          --> creates extent map A for this
              range and adds it to list of
              modified extents
          --> starts ordered extent A for
              this range
          --> waits for it to complete
      
                                                  writeback triggered for range
                                                  [2Mb, 4Mb[
                                                    --> create extent map B and
                                                        adds it to the list of
                                                        modified extents
                                                    --> creates ordered extent B
      
          --> start looking for and logging
              modified extents
          --> logs extent maps A and B
          --> finds checksums for extent A
              in the csum tree, but not for
              extent B
        fsync (msync) finishes
      
                                                    --> ordered extent B
                                                        finishes and its
                                                        checksums are added
                                                        to the csum tree
      
                                      <power cut>
      
      After replaying the log, we have the extent covering the range [2Mb, 4Mb[
      but do not have the data checksum items covering that file range.
      
      This happens because at the very beginning of an fsync (btrfs_sync_file())
      we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
      wait for any ordered extents in that range to complete before we start
      logging the extents. However if right before we start logging the extent
      in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
      such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
      or msync (btrfs_sync_file() starts writeback before acquiring the inode's
      lock), an ordered extent is created for that other range and a new extent
      map is created to represent that range and added to the inode's list of
      modified extents.
      
      That means that we will see that other extent in that list when collecting
      extents for logging (done at btrfs_log_changed_extents()) and log the
      extent before the respective ordered extent finishes - namely before the
      checksum items are added to the checksums tree, which is where
      log_extent_csums() looks for the checksums, therefore making us log an
      extent without logging its checksums. Before that massive simplification
      of fsync, this wasn't a problem because besides looking for checkums in
      the checksums tree, we also looked for them in any ordered extent still
      running.
      
      The consequence of data checksums missing for a file range is that users
      attempting to read the affected file range will get -EIO errors and dmesg
      reports the following:
      
       [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
       [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
      
      So fix this by skipping extents outside of our logging range at
      btrfs_log_changed_extents() and leaving them on the list of modified
      extents so that any subsequent ranged fsync may collect them if needed.
      Also, if we find a hole extent outside of the range still log it, just
      to prevent having gaps between extent items after replaying the log,
      otherwise fsck will complain when we are not using the NO_HOLES feature
      (fstest btrfs/056 triggers such case).
      
      Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      008c6753
    • L
      btrfs: fix pinned underflow after transaction aborted · fcd5e742
      Lu Fengqi 提交于
      When running generic/475, we may get the following warning in dmesg:
      
      [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
      [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
      [ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
      [ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
      [ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
      [ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
      [ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
      [ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
      [ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
      [ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 6902.137836] Call Trace:
      [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
      [ 6902.140181]  ? kthread_stop+0x146/0x1f0
      [ 6902.141277]  generic_shutdown_super+0x6c/0x100
      [ 6902.142517]  kill_anon_super+0x14/0x30
      [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
      [ 6902.144790]  deactivate_locked_super+0x2f/0x70
      [ 6902.146014]  cleanup_mnt+0x3b/0x70
      [ 6902.147020]  task_work_run+0x9e/0xd0
      [ 6902.148036]  do_syscall_64+0x470/0x600
      [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 6902.151640] RIP: 0033:0x7f45077a6a7b
      [ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
      [ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
      [ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
      [ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
      [ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
      [ 6902.167553] irq event stamp: 0
      [ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 6902.176407] ---[ end trace 463138c2986b275c ]---
      [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
      [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
      
      In the above line there's "pinned=18446744073708158976" which is an
      unsigned u64 value of -1392640, an obvious underflow.
      
      When transaction_kthread is running cleanup_transaction(), another
      fsstress is running btrfs_commit_transaction(). The
      btrfs_finish_extent_commit() may get the same range as
      btrfs_destroy_pinned_extent() got, which causes the pinned underflow.
      
      Fixes: d4b450cd ("Btrfs: fix race between transaction commit and empty block group removal")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NLu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      fcd5e742
    • R
      Btrfs: fix cur_offset in the error case for nocow · 506481b2
      Robbie Ko 提交于
      When the cow_file_range fails, the related resources are unlocked
      according to the range [start..end), so the unlock cannot be repeated in
      run_delalloc_nocow.
      
      In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
      not updated correctly, so move the cur_offset update before
      cow_file_range.
      
        kernel BUG at mm/page-writeback.c:2663!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
        Hardware name: Realtek_RTD1296 (DT)
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
        PC is at clear_page_dirty_for_io+0x1bc/0x1e8
        LR is at clear_page_dirty_for_io+0x14/0x1e8
        pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
        sp : ffffffc02e9af4f0
        Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
        Call trace:
        [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
        [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
        [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
        [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
        [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
        [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
        [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
        [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
        [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
        [<ffffffc00033d758>] do_writepages+0x30/0x88
        [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
        [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
        [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
        [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
        [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
        [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
        [<ffffffc0002b02f0>] worker_thread+0x130/0x500
        [<ffffffc0002b6344>] kthread+0x10c/0x110
        [<ffffffc000284590>] ret_from_fork+0x10/0x40
        Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NRobbie Ko <robbieko@synology.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      506481b2
  6. 04 11月, 2018 8 次提交