1. 21 Nov, 2018 40 commits
    • kdb: print real address of pointers instead of hashed addresses · 401182ae
      Committed by Christophe Leroy
      commit 568fb6f42ac6851320adaea25f8f1b94de14e40a upstream.
      
      Since commit ad67b74d ("printk: hash addresses printed with %p"),
      all pointers printed with %p are printed with hashed addresses
      instead of real addresses in order to avoid leaking addresses in
      dmesg and syslog. But this applies to kdb too, which is unfortunate:
      
          Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry
          kdb> ps
          15 sleeping system daemon (state M) processes suppressed,
          use 'ps A' to see all.
          Task Addr       Pid   Parent [*] cpu State Thread     Command
          0x(ptrval)      329      328  1    0   R  0x(ptrval) *sh
      
          0x(ptrval)        1        0  0    0   S  0x(ptrval)  init
          0x(ptrval)        3        2  0    0   D  0x(ptrval)  rcu_gp
          0x(ptrval)        4        2  0    0   D  0x(ptrval)  rcu_par_gp
          0x(ptrval)        5        2  0    0   D  0x(ptrval)  kworker/0:0
          0x(ptrval)        6        2  0    0   D  0x(ptrval)  kworker/0:0H
          0x(ptrval)        7        2  0    0   D  0x(ptrval)  kworker/u2:0
          0x(ptrval)        8        2  0    0   D  0x(ptrval)  mm_percpu_wq
          0x(ptrval)       10        2  0    0   D  0x(ptrval)  rcu_preempt
      
      The whole purpose of kdb is to debug, and for debugging real addresses
      need to be known. In addition, data displayed by kdb doesn't go into
      dmesg.
      
      This patch replaces all %p by %px in kdb in order to display real
      addresses.
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kdb: use correct pointer when 'btc' calls 'btt' · 47052af2
      Committed by Christophe Leroy
      commit dded2e159208a9edc21dd5c5f583afa28d378d39 upstream.
      
      On a powerpc 8xx, 'btc' fails as follows:
      
      Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
      kdb> btc
      btc: cpu status: Currently on cpu 0
      Available cpus: 0
      kdb_getarea: Bad address 0x0
      
      When booting the kernel with 'debug_boot_weak_hash', it fails as well:
      
      Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry
      kdb> btc
      btc: cpu status: Currently on cpu 0
      Available cpus: 0
      kdb_getarea: Bad address 0xba99ad80
      
      On other platforms, Oopses have been observed too, see
      https://github.com/linuxppc/linux/issues/139
      
      This is due to btc calling 'btt' with %p pointer as an argument.
      
      This patch replaces %p by %px to get the real pointer value as
      expected by 'btt'
      
      Fixes: ad67b74d ("printk: hash addresses printed with %p")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Christophe Leroy <christophe.leroy@c-s.fr>
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ARM: cpuidle: Don't register the driver when back-end init returns -ENXIO · 110e9082
      Committed by Ulf Hansson
      commit 763f191af51f127cf8e69cd361f50bf6180768a5 upstream.
      
      There's no point in registering the cpuidle driver for the current CPU when
      the initialization of the arch-specific back-end data fails by returning
      -ENXIO.
      
      Instead, let's re-order the sequence to its original flow, by first trying
      to initialize the back-end part and then acting accordingly on the returned
      error code. Additionally, let's print the error message regardless of which
      error code was returned.
      
      Fixes: a0d46a3d (ARM: cpuidle: Register per cpuidle device)
      Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
      Reviewed-by: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: 4.19+ <stable@vger.kernel.org> # v4.19+
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • uapi: fix linux/kfd_ioctl.h userspace compilation errors · 0d406e79
      Committed by Dmitry V. Levin
      commit aba118389a6fb2ad7958de0f37b5869852bd38cf upstream.
      
      Consistently use types provided by <linux/types.h> via <drm/drm.h>
      to fix the following linux/kfd_ioctl.h userspace compilation errors:
      
      /usr/include/linux/kfd_ioctl.h:250:2: error: unknown type name 'uint32_t'
        uint32_t reset_type;
      /usr/include/linux/kfd_ioctl.h:251:2: error: unknown type name 'uint32_t'
        uint32_t reset_cause;
      /usr/include/linux/kfd_ioctl.h:252:2: error: unknown type name 'uint32_t'
        uint32_t memory_lost;
      /usr/include/linux/kfd_ioctl.h:253:2: error: unknown type name 'uint32_t'
        uint32_t gpu_id;
      
      Fixes: 0c119aba ("drm/amd: Add kfd ioctl defines for hw_exception event")
      Cc: <stable@vger.kernel.org> # v4.19
      Signed-off-by: Dmitry V. Levin <ldv@altlinux.org>
      Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
      Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mnt: fix __detach_mounts infinite loop · 83eec9ad
      Committed by Benjamin Coddington
      commit 1e9c75fb9c47a75a9aec0cd17db5f6dc36b58e00 upstream.
      
      Since commit ff17fa56 ("d_invalidate(): unhash immediately")
      immediately unhashes the dentry, we'll never return the mountpoint in
      lookup_mountpoint(), which can lead to an unbreakable loop in
      d_invalidate().
      
      I have reports of NFS clients getting into this condition after the server
      removes an export of an existing mount created through follow_automount(),
      but I suspect there are various other ways to produce this problem if we
      hunt down users of d_invalidate().  For example, it is possible to get into
      this state by using XFS' d_invalidate() call in xfs_vn_unlink():
      
      truncate -s 100m img{1,2}
      
      mkfs.xfs -q -n version=ci img1
      mkfs.xfs -q -n version=ci img2
      
      mkdir -p /mnt/xfs
      mount img1 /mnt/xfs
      
      mkdir /mnt/xfs/sub1
      mount img2 /mnt/xfs/sub1
      
      cat > /mnt/xfs/sub1/foo &
      umount -l /mnt/xfs/sub1
      mount img2 /mnt/xfs/sub1
      
      mount --make-private /mnt/xfs
      
      mkdir /mnt/xfs/sub2
      mount --move /mnt/xfs/sub1 /mnt/xfs/sub2
      rmdir /mnt/xfs/sub1
      
      Fix this by moving the check for an unlinked dentry out of the
      detach_mounts() path.
      
      Fixes: ff17fa56 ("d_invalidate(): unhash immediately")
      Cc: stable@vger.kernel.org
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Benjamin Coddington <bcodding@redhat.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mount: Prevent MNT_DETACH from disconnecting locked mounts · a7861ef8
      Committed by Eric W. Biederman
      commit 9c8e0a1b683525464a2abe9fb4b54404a50ed2b4 upstream.
      
      Timothy Baldwin <timbaldwin@fastmail.co.uk> wrote:
      > As per mount_namespaces(7) unprivileged users should not be able to look under mount points:
      >
      >   Mounts that come as a single unit from more privileged mount are locked
      >   together and may not be separated in a less privileged mount namespace.
      >
      > However they can:
      >
      > 1. Create a mount namespace.
      > 2. In the mount namespace open a file descriptor to the parent of a mount point.
      > 3. Destroy the mount namespace.
      > 4. Use the file descriptor to look under the mount point.
      >
      > I have reproduced this with Linux 4.16.18 and Linux 4.18-rc8.
      >
      > The setup:
      >
      > $ sudo sysctl kernel.unprivileged_userns_clone=1
      > kernel.unprivileged_userns_clone = 1
      > $ mkdir -p A/B/Secret
      > $ sudo mount -t tmpfs hide A/B
      >
      >
      > "Secret" is indeed hidden as expected:
      >
      > $ ls -lR A
      > A:
      > total 0
      > drwxrwxrwt 2 root root 40 Feb 12 21:08 B
      >
      > A/B:
      > total 0
      >
      >
      > The attack revealing "Secret":
      >
      > $ unshare -Umr sh -c "exec unshare -m ls -lR /proc/self/fd/4/ 4<A"
      > /proc/self/fd/4/:
      > total 0
      > drwxr-xr-x 3 root root 60 Feb 12 21:08 B
      >
      > /proc/self/fd/4/B:
      > total 0
      > drwxr-xr-x 2 root root 40 Feb 12 21:08 Secret
      >
      > /proc/self/fd/4/B/Secret:
      > total 0
      
      I tracked this down to put_mnt_ns passing UMOUNT_SYNC and
      disconnecting all of the mounts in a mount namespace.  Fix this by
      factoring drop_mounts out of drop_collected_mounts and passing
      0 instead of UMOUNT_SYNC.
      
      There are two possible behavior differences that result from this.
      - No longer setting UMOUNT_SYNC will no longer set MNT_SYNC_UMOUNT on
        the vfsmounts being unmounted.  This affects the lazy rcu walk by
        kicking the walk out of rcu mode and forcing it to be a non-lazy
        walk.
      - No longer disconnecting locked mounts will keep some mounts around
        longer as they stay because they are locked to other mounts.
      
      There are only two users of drop_collected_mounts: audit_tree.c and
      put_mnt_ns.
      
      In audit_tree.c the mounts are private and there are no rcu lazy walks
      only calls to iterate_mounts. So the changes should have no effect
      except for a small timing effect as the connected mounts are disconnected.
      
      In put_mnt_ns there may be references from processes outside the mount
      namespace to the mounts.  So the mounts remaining connected will
      be the bug fix that is needed.  That rcu walks are allowed to continue
      appears not to be a problem especially as the rcu walk change was about
      an implementation detail not about semantics.
      
      Cc: stable@vger.kernel.org
      Fixes: 5ff9d8a6 ("vfs: Lock in place mounts from more privileged users")
      Reported-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
      Tested-by: Timothy Baldwin <timbaldwin@fastmail.co.uk>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mount: Don't allow copying MNT_UNBINDABLE|MNT_LOCKED mounts · 14e4bec1
      Committed by Eric W. Biederman
      commit df7342b240185d58d3d9665c0bbf0a0f5570ec29 upstream.
      
      Jonathan Calmels from NVIDIA reported that he's able to bypass the
      mount visibility security check in place in the Linux kernel by using
      a combination of the unbindable property along with the private mount
      propagation option to allow an unprivileged user to see a path which
      was purposefully hidden by the root user.
      
      Reproducer:
        # Hide a path to all users using a tmpfs
        root@castiana:~# mount -t tmpfs tmpfs /sys/devices/
        root@castiana:~#
      
        # As an unprivileged user, unshare user namespace and mount namespace
        stgraber@castiana:~$ unshare -U -m -r
      
        # Confirm the path is still not accessible
        root@castiana:~# ls /sys/devices/
      
        # Make /sys recursively unbindable and private
        root@castiana:~# mount --make-runbindable /sys
        root@castiana:~# mount --make-private /sys
      
        # Recursively bind-mount the rest of /sys over to /mnt
        root@castiana:~# mount --rbind /sys/ /mnt
      
        # Access our hidden /sys/devices as an unprivileged user
        root@castiana:~# ls /mnt/devices/
        breakpoint cpu cstate_core cstate_pkg i915 intel_pt isa kprobe
        LNXSYSTM:00 msr pci0000:00 platform pnp0 power software system
        tracepoint uncore_arb uncore_cbox_0 uncore_cbox_1 uprobe virtual
      
      Solve this by teaching copy_tree to fail if a mount turns out to be
      both unbindable and locked.
      
      Cc: stable@vger.kernel.org
      Fixes: 5ff9d8a6 ("vfs: Lock in place mounts from more privileged users")
      Reported-by: Jonathan Calmels <jcalmels@nvidia.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mount: Retest MNT_LOCKED in do_umount · 32224b87
      Committed by Eric W. Biederman
      commit 25d202ed820ee347edec0bf3bf553544556bf64b upstream.
      
      It was recently pointed out that the one instance of testing MNT_LOCKED
      outside of the namespace_sem is in ksys_umount.
      
      Fix that by adding a test inside of do_umount with namespace_sem and
      the mount_lock held.  As it helps to fail fast, the existing test is
      maintained with an additional comment pointing out that it may be racy
      because the locks are not held.
      
      Cc: stable@vger.kernel.org
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Fixes: 5ff9d8a6 ("vfs: Lock in place mounts from more privileged users")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix buffer leak in __ext4_read_dirblock() on error path · 4d01f031
      Committed by Vasily Averin
      commit de59fae0043f07de5d25e02ca360f7d57bfa5866 upstream.
      
      Fixes: dc6982ff ("ext4: refactor code to read directory blocks ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 3.9
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix buffer leak in ext4_expand_extra_isize_ea() on error path · b0f2b1fe
      Committed by Vasily Averin
      commit 53692ec074d00589c2cf1d6d17ca76ad0adce6ec upstream.
      
      Fixes: de05ca85 ("ext4: move call to ext4_error() into ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.17
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix buffer leak in ext4_xattr_move_to_block() on error path · 29ee4d62
      Committed by Vasily Averin
      commit 6bdc9977fcdedf47118d2caf7270a19f4b6d8a8f upstream.
      
      Fixes: 3f2571c1 ("ext4: factor out xattr moving")
      Fixes: 6dd4ee7c ("ext4: Expand extra_inodes space per ...")
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.23
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: release bs.bh before re-using in ext4_xattr_block_find() · 4648dcb2
      Committed by Vasily Averin
      commit 45ae932d246f721e6584430017176cbcadfde610 upstream.
      
      bs.bh was taken in the previous ext4_xattr_block_find() call;
      it should be released before being re-used.
      
      Fixes: 7e01c8e5 ("ext3/4: fix uninitialized bs in ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.26
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix buffer leak in ext4_xattr_get_block() on error path · 0f0d1c16
      Committed by Vasily Averin
      commit ecaaf408478b6fb4d9986f9b6652f3824e374f4c upstream.
      
      Fixes: dec214d0 ("ext4: xattr inode deduplication")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.13
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix possible leak of s_journal_flag_rwsem in error path · 0a992da5
      Committed by Vasily Averin
      commit af18e35bfd01e6d65a5e3ef84ffe8b252d1628c5 upstream.
      
      Fixes: c8585c6f ("ext4: fix races between changing inode journal ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.7
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix possible leak of sbi->s_group_desc_leak in error path · 0d339ced
      Committed by Theodore Ts'o
      commit 9e463084cdb22e0b56b2dfbc50461020409a5fd3 upstream.
      
      Fixes: bfe0a5f4 ("ext4: add more mount time checks of the superblock")
      Reported-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.18
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: avoid possible double brelse() in add_new_gdb() on error path · 64a3d537
      Committed by Theodore Ts'o
      commit 4f32c38b4662312dd3c5f113d8bdd459887fb773 upstream.
      
      Fixes: b4097142 ("ext4: add error checking to calls to ...")
      Reported-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.38
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix missing cleanup if ext4_alloc_flex_bg_array() fails while resizing · 110a1994
      Committed by Vasily Averin
      commit f348e2241fb73515d65b5d77dd9c174128a7fbf2 upstream.
      
      Fixes: 117fff10 ("ext4: grow the s_flex_groups array as needed ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 3.7
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: avoid buffer leak in ext4_orphan_add() after prior errors · 656b121b
      Committed by Vasily Averin
      commit feaf264ce7f8d54582e2f66eb82dd9dd124c94f3 upstream.
      
      Fixes: d745a8c2 ("ext4: reduce contention on s_orphan_lock")
      Fixes: 6e3617e5 ("ext4: Handle non empty on-disk orphan link")
      Cc: Dmitry Monakhov <dmonakhov@gmail.com>
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.34
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: avoid buffer leak on shutdown in ext4_mark_iloc_dirty() · d65b7d33
      Committed by Vasily Averin
      commit a6758309a005060b8297a538a457c88699cb2520 upstream.
      
      ext4_mark_iloc_dirty() callers expect that it releases iloc->bh
      even if it returns an error.
      
      Fixes: 0db1ff22 ("ext4: add shutdown bit and check for it")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.11
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: fix possible inode leak in the retry loop of ext4_resize_fs() · 36b1ba6a
      Committed by Vasily Averin
      commit db6aee62406d9fbb53315fcddd81f1dc271d49fa upstream.
      
      Fixes: 1c6bd717 ("ext4: convert file system to meta_bg if needed ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 3.7
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: missing !bh check in ext4_xattr_inode_write() · 4903c091
      Committed by Vasily Averin
      commit eb6984fa4ce2837dcb1f66720a600f31b0bb3739 upstream.
      
      According to Ted Ts'o, ext4_getblk() called in ext4_xattr_inode_write()
      should not return bh = NULL.
      
      The only time that bh could be NULL, then, would be in the case of
      something really going wrong; a programming error elsewhere (perhaps a
      wild pointer dereference) or I/O error causing on-disk file system
      corruption (although that would be highly unlikely given that we had
      *just* allocated the blocks and so the metadata blocks in question
      probably would still be in the cache).
      
      Fixes: e50e5129 ("ext4: xattr-in-inode support")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 4.13
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: avoid potential extra brelse in setup_new_flex_group_blocks() · 20dd2c4e
      Committed by Vasily Averin
      commit 9e4028935cca3f9ef9b6a90df9da6f1f94853536 upstream.
      
      Currently bh is set to NULL only during the first iteration of the for loop,
      and the pointer is not cleared after it is no longer used.
      Therefore rollback after errors can lead to an extra brelse(bh) call, which
      decrements the bh counter and later triggers an unexpected warning in __brelse().
      
      Patch moves brelse() calls in body of cycle to exclude requirement of
      brelse() call in rollback.
      
      Fixes: 33afdcc5 ("ext4: add a function which sets up group blocks ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 3.3+
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: add missing brelse() add_new_gdb_meta_bg()'s error path · 2aa79d31
      Committed by Vasily Averin
      commit 61a9c11e5e7a0dab5381afa5d9d4dd5ebf18f7a0 upstream.
      
      Fixes: 01f795f9 ("ext4: add online resizing support for meta_bg ...")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 3.7
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: add missing brelse() in set_flexbg_block_bitmap()'s error path · cd18d6e0
      Committed by Vasily Averin
      commit cea5794122125bf67559906a0762186cf417099c upstream.
      
      Fixes: 33afdcc5 ("ext4: add a function which sets up group blocks ...")
      Cc: stable@kernel.org # 3.3
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ext4: add missing brelse() update_backups()'s error path · f7b6459e
      Committed by Vasily Averin
      commit ea0abbb648452cdb6e1734b702b6330a7448fcf8 upstream.
      
      Fixes: ac27a0ec ("ext4: initial copy of files from ext3")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org # 2.6.19
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • clockevents/drivers/i8253: Add support for PIT shutdown quirk · ebbc6fce
      Committed by Michael Kelley
      commit 35b69a420bfb56b7b74cb635ea903db05e357bec upstream.
      
      Add support for platforms where pit_shutdown() doesn't work because of a
      quirk in the PIT emulation. On these platforms setting the counter register
      to zero causes the PIT to start running again, negating the shutdown.
      
      Provide a global variable that controls whether the counter register is
      zero'ed, which platform specific code can override.
      
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>
      Cc: "jgross@suse.com" <jgross@suse.com>
      Cc: "akataria@vmware.com" <akataria@vmware.com>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: vkuznets <vkuznets@redhat.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1541303219-11142-2-git-send-email-mikelley@microsoft.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • btrfs: tree-checker: Fix misleading group system information · f2589f9a
      Committed by Shaokun Zhang
      commit 761333f2f50ccc887aa9957ae829300262c0d15b upstream.
      
      block_group_err shows the group system as a decimal value with a '0x'
      prefix, which is somewhat misleading.
      
      Fix it to print hexadecimal, as was intended.
      
      Fixes: fce466ea ("btrfs: tree-checker: Verify block_group_item")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Reviewed-by: Qu Wenruo <wqu@suse.com>
      Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix data corruption due to cloning of eof block · ec6d90a4
      Committed by Filipe Manana
      commit ac765f83f1397646c11092a032d4f62c3d478b81 upstream.
      
      We currently allow cloning a range from a file which includes the last
      block of the file even if the file's size is not aligned to the block
      size. This is fine and useful when the destination file has the same size,
      but when it does not and the range ends somewhere in the middle of the
      destination file, it leads to corruption because the bytes between the EOF
      and the end of the block have undefined data (when there is support for
      discard/trimming they have a value of 0x00).
      
      Example:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ export foo_size=$((256 * 1024 + 100))
       $ xfs_io -f -c "pwrite -S 0x3c 0 $foo_size" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xb5 0 1M" /mnt/bar
      
       $ xfs_io -c "reflink /mnt/foo 0 512K $foo_size" /mnt/bar
      
       $ od -A d -t x1 /mnt/bar
       0000000 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       0524288 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c 3c
       *
       0786528 3c 3c 3c 3c 00 00 00 00 00 00 00 00 00 00 00 00
       0786544 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
       *
       0790528 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5 b5
       *
       1048576
      
      The bytes in the range from 786532 (512Kb + 256Kb + 100 bytes) to 790527
      (512Kb + 256Kb + 4Kb - 1) got corrupted, having now a value of 0x00 instead
      of 0xb5.
      
      This is similar to the problem we had for deduplication that got recently
      fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files").
      
      Fix this by not allowing such operations to be performed and return the
      errno -EINVAL to user space. This is what XFS is doing as well at the VFS
      level. This change however now makes us return -EINVAL instead of
      -EOPNOTSUPP for cases where the source range maps to an inline extent and
      the destination range's end is smaller than the destination file's size,
      since the detection of inline extents is done during the actual process of
      dropping file extent items (at __btrfs_drop_extents()). Returning the
      -EINVAL error is done early on and solely based on the input parameters
      (offsets and length) and destination file's size. This makes us consistent
      with XFS and anyone else supporting cloning since this case is now checked
      at a higher level in the VFS and is where the -EINVAL will be returned
      from starting with kernel 4.20 (the VFS change was introduced in 4.20-rc1
      by commit 07d19dc9fbe9 ("vfs: avoid problematic remapping requests into
      partial EOF block")). So this change is more geared towards stable kernels,
      as it's unlikely the new VFS checks get removed intentionally.
      
      A test case for fstests follows soon, as well as an update to filter
      existing tests that expect -EOPNOTSUPP to accept -EINVAL as well.
      
      CC: <stable@vger.kernel.org> # 4.4+
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • Btrfs: fix infinite loop on inode eviction after deduplication of eof block · bafd5b78
      Committed by Filipe Manana
      commit 11023d3f5fdf89bba5e1142127701ca6e6014587 upstream.
      
      If we attempt to deduplicate the last block of a file A into the middle of
      a file B, and file A's size is not a multiple of the block size, we end up
      rounding the deduplication length to 0 bytes, to avoid the data corruption
      issue fixed by commit de02b9f6 ("Btrfs: fix data corruption when
      deduplicating between different files"). However a length of zero will
      cause the insertion of an extent state with a start value greater (by 1)
      than the end value, leading to a corrupt extent state that will trigger a
      warning and cause chaos such as an infinite loop during inode eviction.
      Example trace:
      
       [96049.833585] ------------[ cut here ]------------
       [96049.833714] WARNING: CPU: 0 PID: 24448 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.833767] CPU: 0 PID: 24448 Comm: xfs_io Not tainted 4.19.0-rc7-btrfs-next-39 #1
       [96049.833768] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.833780] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.833783] RSP: 0018:ffffafd2c3707af0 EFLAGS: 00010282
       [96049.833785] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.833786] RDX: 0000000000000007 RSI: ffff99045c143230 RDI: ffff99047b2168a0
       [96049.833787] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.833787] R10: ffffafd2c3707ab8 R11: 0000000000000000 R12: ffff9903b93b12c8
       [96049.833788] R13: 000000000004e000 R14: ffffafd2c3707b80 R15: ffffafd2c3707b78
       [96049.833790] FS:  00007f5c14e7d700(0000) GS:ffff99047b200000(0000) knlGS:0000000000000000
       [96049.833791] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.833792] CR2: 00007f5c146abff8 CR3: 0000000115f4c004 CR4: 00000000003606f0
       [96049.833795] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.833796] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.833796] Call Trace:
       [96049.833809]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.833823]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.833831]  ? _raw_spin_unlock+0x24/0x30
       [96049.833841]  ? test_range_bit+0xdf/0x130 [btrfs]
       [96049.833853]  lock_extent_range+0x8e/0x150 [btrfs]
       [96049.833864]  btrfs_double_extent_lock+0x78/0xb0 [btrfs]
       [96049.833875]  btrfs_extent_same_range+0x14e/0x550 [btrfs]
       [96049.833885]  ? rcu_read_lock_sched_held+0x3f/0x70
       [96049.833890]  ? __kmalloc_node+0x2b0/0x2f0
       [96049.833899]  ? btrfs_dedupe_file_range+0x19a/0x280 [btrfs]
       [96049.833909]  btrfs_dedupe_file_range+0x270/0x280 [btrfs]
       [96049.833916]  vfs_dedupe_file_range_one+0xd9/0xe0
       [96049.833919]  vfs_dedupe_file_range+0x131/0x1b0
       [96049.833924]  do_vfs_ioctl+0x272/0x6e0
       [96049.833927]  ? __fget+0x113/0x200
       [96049.833931]  ksys_ioctl+0x70/0x80
       [96049.833933]  __x64_sys_ioctl+0x16/0x20
       [96049.833937]  do_syscall_64+0x60/0x1b0
       [96049.833939]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.833941] RIP: 0033:0x7f5c1478ddd7
       [96049.833943] RSP: 002b:00007ffe15b196a8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010
       [96049.833945] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f5c1478ddd7
       [96049.833946] RDX: 00005625ece322d0 RSI: 00000000c0189436 RDI: 0000000000000004
       [96049.833947] RBP: 0000000000000000 R08: 00007f5c14a46f48 R09: 0000000000000040
       [96049.833948] R10: 0000000000000541 R11: 0000000000000202 R12: 0000000000000000
       [96049.833949] R13: 0000000000000000 R14: 0000000000000004 R15: 00005625ece322d0
       [96049.833954] irq event stamp: 6196
       [96049.833956] hardirqs last  enabled at (6195): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.833958] hardirqs last disabled at (6196): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.833959] softirqs last  enabled at (6114): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.833964] softirqs last disabled at (6095): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.833965] ---[ end trace db7b05f01b7fa10c ]---
       [96049.935816] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.935822] irq event stamp: 6584
       [96049.935823] hardirqs last  enabled at (6583): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.935825] hardirqs last disabled at (6584): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.935827] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.935828] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.935829] ---[ end trace db7b05f01b7fa123 ]---
       [96049.935840] ------------[ cut here ]------------
       [96049.936065] WARNING: CPU: 1 PID: 24463 at fs/btrfs/extent_io.c:436 insert_state+0x101/0x120 [btrfs]
       [96049.936107] CPU: 1 PID: 24463 Comm: umount Tainted: G        W         4.19.0-rc7-btrfs-next-39 #1
       [96049.936108] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014
       [96049.936117] RIP: 0010:insert_state+0x101/0x120 [btrfs]
       [96049.936119] RSP: 0018:ffffafd2c3637bc0 EFLAGS: 00010282
       [96049.936120] RAX: 0000000000000000 RBX: 000000000004dfff RCX: 0000000000000006
       [96049.936121] RDX: 0000000000000007 RSI: ffff990445cf88e0 RDI: ffff99047b2968a0
       [96049.936122] RBP: ffff990457851cd0 R08: 0000000000000001 R09: 0000000000000000
       [96049.936123] R10: ffffafd2c3637b88 R11: 0000000000000000 R12: ffff9904574301e8
       [96049.936124] R13: 000000000004e000 R14: ffffafd2c3637c50 R15: ffffafd2c3637c48
       [96049.936125] FS:  00007fe4b87e72c0(0000) GS:ffff99047b280000(0000) knlGS:0000000000000000
       [96049.936126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       [96049.936128] CR2: 00005562e52618d8 CR3: 00000001151c8005 CR4: 00000000003606e0
       [96049.936129] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       [96049.936131] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       [96049.936131] Call Trace:
       [96049.936141]  __set_extent_bit+0x46c/0x6a0 [btrfs]
       [96049.936154]  lock_extent_bits+0x6b/0x210 [btrfs]
       [96049.936167]  btrfs_evict_inode+0x1e1/0x5a0 [btrfs]
       [96049.936172]  evict+0xbf/0x1c0
       [96049.936174]  dispose_list+0x51/0x80
       [96049.936176]  evict_inodes+0x193/0x1c0
       [96049.936180]  generic_shutdown_super+0x3f/0x110
       [96049.936182]  kill_anon_super+0xe/0x30
       [96049.936189]  btrfs_kill_super+0x13/0x100 [btrfs]
       [96049.936191]  deactivate_locked_super+0x3a/0x70
       [96049.936193]  cleanup_mnt+0x3b/0x80
       [96049.936195]  task_work_run+0x93/0xc0
       [96049.936198]  exit_to_usermode_loop+0xfa/0x100
       [96049.936201]  do_syscall_64+0x17f/0x1b0
       [96049.936202]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
       [96049.936204] RIP: 0033:0x7fe4b80cfb37
       [96049.936206] RSP: 002b:00007ffff092b688 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
       [96049.936207] RAX: 0000000000000000 RBX: 00005562e5259060 RCX: 00007fe4b80cfb37
       [96049.936208] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 00005562e525faa0
       [96049.936209] RBP: 00005562e525faa0 R08: 00005562e525f770 R09: 0000000000000015
       [96049.936210] R10: 00000000000006b4 R11: 0000000000000246 R12: 00007fe4b85d1e64
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936211] R13: 0000000000000000 R14: 00005562e5259240 R15: 00007ffff092b910
       [96049.936216] irq event stamp: 6616
       [96049.936219] hardirqs last  enabled at (6615): [<ffffffff91b00663>] console_unlock+0x503/0x640
       [96049.936219] hardirqs last disabled at (6616): [<ffffffff91a037dd>] trace_hardirqs_off_thunk+0x1a/0x1c
       [96049.936222] softirqs last  enabled at (6328): [<ffffffff92600370>] __do_softirq+0x370/0x421
       [96049.936222] softirqs last disabled at (6313): [<ffffffff91a8dd4d>] irq_exit+0xcd/0xe0
       [96049.936223] ---[ end trace db7b05f01b7fa124 ]---
      
      The second stack trace, from inode eviction, is repeated forever due to
      the infinite loop during eviction.
      
      This is the same type of problem fixed way back in 2015 by commit
      113e8283 ("Btrfs: fix inode eviction infinite loop after extent_same
      ioctl") and commit ccccf3d6 ("Btrfs: fix inode eviction infinite loop
      after cloning into it").
      
      So fix this by returning immediately if the deduplication range length
      gets rounded down to 0 bytes, as there is nothing that needs to be done in
      such case.
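
The early return can be pictured with a small user-space model. This is only a sketch and not the btrfs code: the 4096-byte block size, the plain round-down, and the helper name `dedupe_len_to_process` are assumptions made for illustration.

```python
BLOCK_SIZE = 4096  # assumed block size; the real value depends on the filesystem


def round_down(value, align):
    """Round value down to the nearest multiple of align."""
    return value - (value % align)


def dedupe_len_to_process(length, block_size=BLOCK_SIZE):
    """Hypothetical helper: block-aligned length; 0 means there is nothing to do."""
    return round_down(length, block_size)


for length in (100, 4096, 10000):
    aligned = dedupe_len_to_process(length)
    if aligned == 0:
        print(length, "-> rounded down to 0, return immediately")
    else:
        print(length, "-> deduplicate", aligned, "bytes")
```

The 100-byte request matches the length used in the reproducer below: under these assumptions it rounds down to 0, which is the case the fix short-circuits.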
      
      Example reproducer:
      
       $ mkfs.btrfs -f /dev/sdb
       $ mount /dev/sdb /mnt
      
       $ xfs_io -f -c "pwrite -S 0xe6 0 100" /mnt/foo
       $ xfs_io -f -c "pwrite -S 0xe6 0 1M" /mnt/bar
      
       # Unmount the filesystem and mount it again so that we start without any
       # extent state records when we ask for the deduplication.
       $ umount /mnt
       $ mount /dev/sdb /mnt
      
       $ xfs_io -c "dedupe /mnt/foo 0 500K 100" /mnt/bar
      
       # This unmount triggers the infinite loop.
       $ umount /mnt
      
      A test case for fstests will follow soon.
      
      Fixes: de02b9f6 ("Btrfs: fix data corruption when deduplicating between different files")
      CC: <stable@vger.kernel.org> # 4.19+
      Reviewed-by: Nikolay Borisov <nborisov@suse.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • R
      Btrfs: fix cur_offset in the error case for nocow · db39065c
      Robbie Ko committed
      commit 506481b20e818db40b6198815904ecd2d6daee64 upstream.
      
      When cow_file_range() fails, the related resources are unlocked
      according to the range [start..end), so the unlock must not be repeated
      in run_delalloc_nocow().
      
      In some cases (e.g. cur_offset <= end && cow_start != -1), cur_offset is
      not updated correctly, so move the cur_offset update before the call to
      cow_file_range().
      
        kernel BUG at mm/page-writeback.c:2663!
        Internal error: Oops - BUG: 0 [#1] SMP
        CPU: 3 PID: 31525 Comm: kworker/u8:7 Tainted: P O
        Hardware name: Realtek_RTD1296 (DT)
        Workqueue: writeback wb_workfn (flush-btrfs-1)
        task: ffffffc076db3380 ti: ffffffc02e9ac000 task.ti: ffffffc02e9ac000
        PC is at clear_page_dirty_for_io+0x1bc/0x1e8
        LR is at clear_page_dirty_for_io+0x14/0x1e8
        pc : [<ffffffc00033c91c>] lr : [<ffffffc00033c774>] pstate: 40000145
        sp : ffffffc02e9af4f0
        Process kworker/u8:7 (pid: 31525, stack limit = 0xffffffc02e9ac020)
        Call trace:
        [<ffffffc00033c91c>] clear_page_dirty_for_io+0x1bc/0x1e8
        [<ffffffbffc514674>] extent_clear_unlock_delalloc+0x1e4/0x210 [btrfs]
        [<ffffffbffc4fb168>] run_delalloc_nocow+0x3b8/0x948 [btrfs]
        [<ffffffbffc4fb948>] run_delalloc_range+0x250/0x3a8 [btrfs]
        [<ffffffbffc514c0c>] writepage_delalloc.isra.21+0xbc/0x1d8 [btrfs]
        [<ffffffbffc516048>] __extent_writepage+0xe8/0x248 [btrfs]
        [<ffffffbffc51630c>] extent_write_cache_pages.isra.17+0x164/0x378 [btrfs]
        [<ffffffbffc5185a8>] extent_writepages+0x48/0x68 [btrfs]
        [<ffffffbffc4f5828>] btrfs_writepages+0x20/0x30 [btrfs]
        [<ffffffc00033d758>] do_writepages+0x30/0x88
        [<ffffffc0003ba0f4>] __writeback_single_inode+0x34/0x198
        [<ffffffc0003ba6c4>] writeback_sb_inodes+0x184/0x3c0
        [<ffffffc0003ba96c>] __writeback_inodes_wb+0x6c/0xc0
        [<ffffffc0003bac20>] wb_writeback+0x1b8/0x1c0
        [<ffffffc0003bb0f0>] wb_workfn+0x150/0x250
        [<ffffffc0002b0014>] process_one_work+0x1dc/0x388
        [<ffffffc0002b02f0>] worker_thread+0x130/0x500
        [<ffffffc0002b6344>] kthread+0x10c/0x110
        [<ffffffc000284590>] ret_from_fork+0x10/0x40
        Code: d503201f a9025bb5 a90363b7 f90023b9 (d4210000)
      
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: Robbie Ko <robbieko@synology.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • F
      Btrfs: fix missing data checksums after a ranged fsync (msync) · fa625a48
      Filipe Manana committed
      commit 008c6753f7e070c77c70d708a6bf0255b4381763 upstream.
      
      Recently we got a massive simplification for fsync, where for the fast
      path we no longer log new extents while their respective ordered extents
      are still running.
      
      However that simplification introduced a subtle regression for the case
      where we use a ranged fsync (msync). Consider the following example:
      
                     CPU 0                                    CPU 1
      
                                                  mmap write to range [2Mb, 4Mb[
        mmap write to range [512Kb, 1Mb[
        msync range [512K, 1Mb[
          --> triggers fast fsync
              (BTRFS_INODE_NEEDS_FULL_SYNC
               not set)
          --> creates extent map A for this
              range and adds it to list of
              modified extents
          --> starts ordered extent A for
              this range
          --> waits for it to complete
      
                                                  writeback triggered for range
                                                  [2Mb, 4Mb[
                                                    --> create extent map B and
                                                        adds it to the list of
                                                        modified extents
                                                    --> creates ordered extent B
      
          --> start looking for and logging
              modified extents
          --> logs extent maps A and B
          --> finds checksums for extent A
              in the csum tree, but not for
              extent B
        fsync (msync) finishes
      
                                                    --> ordered extent B
                                                        finishes and its
                                                        checksums are added
                                                        to the csum tree
      
                                      <power cut>
      
      After replaying the log, we have the extent covering the range [2Mb, 4Mb[
      but do not have the data checksum items covering that file range.
      
      This happens because at the very beginning of an fsync (btrfs_sync_file())
      we start and wait for IO in the given range [512Kb, 1Mb[ and therefore
      wait for any ordered extents in that range to complete before we start
      logging the extents. However if right before we start logging the extent
      in our range [512Kb, 1Mb[, writeback is started for any other dirty range,
      such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync
      or msync (btrfs_sync_file() starts writeback before acquiring the inode's
      lock), an ordered extent is created for that other range and a new extent
      map is created to represent that range and added to the inode's list of
      modified extents.
      
      That means that we will see that other extent in that list when collecting
      extents for logging (done at btrfs_log_changed_extents()) and log the
      extent before the respective ordered extent finishes - namely before the
      checksum items are added to the checksums tree, which is where
      log_extent_csums() looks for the checksums, therefore making us log an
      extent without logging its checksums. Before that massive simplification
      of fsync, this wasn't a problem because besides looking for checksums in
      the checksums tree, we also looked for them in any ordered extent still
      running.
      
      The consequence of data checksums missing for a file range is that users
      attempting to read the affected file range will get -EIO errors and dmesg
      reports the following:
      
       [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344
       [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1
      
      So fix this by skipping extents outside of our logging range at
      btrfs_log_changed_extents() and leaving them on the list of modified
      extents so that any subsequent ranged fsync may collect them if needed.
      Also, if we find a hole extent outside of the range still log it, just
      to prevent having gaps between extent items after replaying the log,
      otherwise fsck will complain when we are not using the NO_HOLES feature
      (fstest btrfs/056 triggers such case).
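
The selection rule described above can be modeled in a few lines. This is a simplified sketch and not the code in btrfs_log_changed_extents(): the `Extent` tuple, the half-open byte ranges, the function name `split_for_logging`, and the hole placed at [1Mb, 2Mb[ are all assumptions made for illustration.

```python
from collections import namedtuple

# Byte ranges are half-open [start, end) in this model.
Extent = namedtuple("Extent", ["name", "start", "end", "is_hole"])

KIB = 1024
MIB = 1024 * KIB


def split_for_logging(modified, log_start, log_end):
    """Return (to_log, still_modified) for the logging range [log_start, log_end)."""
    to_log, still_modified = [], []
    for em in modified:
        overlaps = em.start < log_end and em.end > log_start
        if overlaps or em.is_hole:
            # In range, or a hole that is logged anyway to avoid gaps.
            to_log.append(em)
        else:
            # Out of range: leave it for a later (ranged) fsync.
            still_modified.append(em)
    return to_log, still_modified


modified = [
    Extent("A", 512 * KIB, 1 * MIB, False),
    Extent("hole", 1 * MIB, 2 * MIB, True),   # hypothetical hole extent
    Extent("B", 2 * MIB, 4 * MIB, False),
]
to_log, still_modified = split_for_logging(modified, 512 * KIB, 1 * MIB)
print([em.name for em in to_log])
print([em.name for em in still_modified])
```

In this model extent B, whose ordered extent may still be running, stays on the list of modified extents instead of being logged without its checksums.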
      
      Fixes: e7175a69 ("btrfs: remove the wait ordered logic in the log_one_extent path")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Filipe Manana <fdmanana@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • L
      btrfs: fix pinned underflow after transaction aborted · ec26ad25
      Lu Fengqi committed
      commit fcd5e74288f7d36991b1f0fb96b8c57079645e38 upstream.
      
      When running generic/475, we may get the following warning in dmesg:
      
      [ 6902.102154] WARNING: CPU: 3 PID: 18013 at fs/btrfs/extent-tree.c:9776 btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.109160] CPU: 3 PID: 18013 Comm: umount Tainted: G        W  O      4.19.0-rc8+ #8
      [ 6902.110971] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
      [ 6902.112857] RIP: 0010:btrfs_free_block_groups+0x2af/0x3b0 [btrfs]
      [ 6902.118921] RSP: 0018:ffffc9000459bdb0 EFLAGS: 00010286
      [ 6902.120315] RAX: ffff880175050bb0 RBX: ffff8801124a8000 RCX: 0000000000170007
      [ 6902.121969] RDX: 0000000000000002 RSI: 0000000000170007 RDI: ffffffff8125fb74
      [ 6902.123716] RBP: ffff880175055d10 R08: 0000000000000000 R09: 0000000000000000
      [ 6902.125417] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880175055d88
      [ 6902.127129] R13: ffff880175050bb0 R14: 0000000000000000 R15: dead000000000100
      [ 6902.129060] FS:  00007f4507223780(0000) GS:ffff88017ba00000(0000) knlGS:0000000000000000
      [ 6902.130996] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6902.132558] CR2: 00005623599cac78 CR3: 000000014b700001 CR4: 00000000003606e0
      [ 6902.134270] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [ 6902.135981] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [ 6902.137836] Call Trace:
      [ 6902.138939]  close_ctree+0x171/0x330 [btrfs]
      [ 6902.140181]  ? kthread_stop+0x146/0x1f0
      [ 6902.141277]  generic_shutdown_super+0x6c/0x100
      [ 6902.142517]  kill_anon_super+0x14/0x30
      [ 6902.143554]  btrfs_kill_super+0x13/0x100 [btrfs]
      [ 6902.144790]  deactivate_locked_super+0x2f/0x70
      [ 6902.146014]  cleanup_mnt+0x3b/0x70
      [ 6902.147020]  task_work_run+0x9e/0xd0
      [ 6902.148036]  do_syscall_64+0x470/0x600
      [ 6902.149142]  ? trace_hardirqs_off_thunk+0x1a/0x1c
      [ 6902.150375]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [ 6902.151640] RIP: 0033:0x7f45077a6a7b
      [ 6902.157324] RSP: 002b:00007ffd589f3e68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
      [ 6902.159187] RAX: 0000000000000000 RBX: 000055e8eec732b0 RCX: 00007f45077a6a7b
      [ 6902.160834] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 000055e8eec73490
      [ 6902.162526] RBP: 0000000000000000 R08: 000055e8eec734b0 R09: 00007ffd589f26c0
      [ 6902.164141] R10: 0000000000000000 R11: 0000000000000246 R12: 000055e8eec73490
      [ 6902.165815] R13: 00007f4507ac61a4 R14: 0000000000000000 R15: 00007ffd589f40d8
      [ 6902.167553] irq event stamp: 0
      [ 6902.168998] hardirqs last  enabled at (0): [<0000000000000000>]           (null)
      [ 6902.170731] hardirqs last disabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.172773] softirqs last  enabled at (0): [<ffffffff810cd810>] copy_process.part.55+0x3b0/0x1f00
      [ 6902.174671] softirqs last disabled at (0): [<0000000000000000>]           (null)
      [ 6902.176407] ---[ end trace 463138c2986b275c ]---
      [ 6902.177636] BTRFS info (device dm-3): space_info 4 has 273465344 free, is not full
      [ 6902.179453] BTRFS info (device dm-3): space_info total=276824064, used=4685824, pinned=18446744073708158976, reserved=0, may_use=0, readonly=65536
      
      In the above line there's "pinned=18446744073708158976" which is an
      unsigned u64 value of -1392640, an obvious underflow.
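
The arithmetic behind that reading can be checked directly. The sketch below only reinterprets the logged number as a 64-bit two's-complement value; the helper names `as_u64` and `as_s64` are made up for this example.

```python
U64_MODULUS = 1 << 64


def as_u64(value):
    """Wrap an integer the way an unsigned 64-bit counter would."""
    return value % U64_MODULUS


def as_s64(value):
    """Reinterpret an unsigned 64-bit value as a signed 64-bit value."""
    return value - U64_MODULUS if value >= (1 << 63) else value


logged_pinned = 18446744073708158976  # value from the space_info line above
print(as_s64(logged_pinned))
print(as_u64(-1392640) == logged_pinned)
```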
      
      While transaction_kthread is running cleanup_transaction(), an
      fsstress process may be running btrfs_commit_transaction()
      concurrently. btrfs_finish_extent_commit() may then get the same range
      that btrfs_destroy_pinned_extent() got, which causes the pinned
      underflow.
      
      Fixes: d4b450cd ("Btrfs: fix race between transaction commit and empty block group removal")
      CC: stable@vger.kernel.org # 4.4+
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Lu Fengqi <lufq.fnst@cn.fujitsu.com>
      Reviewed-by: David Sterba <dsterba@suse.com>
      Signed-off-by: David Sterba <dsterba@suse.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • M
      watchdog/core: Add missing prototypes for weak functions · cb7c993f
      Mathieu Malaterre committed
      commit 81bd415c91eb966118d773dddf254aebf3022411 upstream.
      
      The split out of the hard lockup detector exposed two new weak functions,
      but no prototypes for them, which triggers the build warning:
      
        kernel/watchdog.c:109:12: warning: no previous prototype for ‘watchdog_nmi_enable’ [-Wmissing-prototypes]
        kernel/watchdog.c:115:13: warning: no previous prototype for ‘watchdog_nmi_disable’ [-Wmissing-prototypes]
      
      Add the prototypes.
      
      Fixes: 73ce0511 ("kernel/watchdog.c: move hardlockup detector to separate file")
      Signed-off-by: Mathieu Malaterre <malat@debian.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Babu Moger <babu.moger@oracle.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20180606194232.17653-1-malat@debian.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • H
      arch/alpha, termios: implement BOTHER, IBSHIFT and termios2 · 139ca3da
      H. Peter Anvin (Intel) committed
      commit d0ffb805b729322626639336986bc83fc2e60871 upstream.
      
      Alpha has had c_ispeed and c_ospeed, but still set speeds in c_cflag
      using arbitrary flags. Because BOTHER is not defined, the general
      Linux code doesn't allow setting arbitrary baud rates, and because
      CBAUDEX == 0, we can have an array overrun of the baud_rate[] table in
      drivers/tty/tty_baudrate.c if (c_cflag & CBAUD) == 037.
      
      Resolve both problems by #defining BOTHER to 037 on Alpha.
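
The effect of defining BOTHER can be sketched as follows. This is a user-space model and not the kernel code: the function name `output_speed` is hypothetical, and the table is deliberately truncated to the first 16 entries (indices 0 to 15) for illustration.

```python
CBAUD = 0o37   # Alpha: baud-rate bits of c_cflag
BOTHER = 0o37  # value this patch defines on Alpha

# Truncated, illustrative table: index 0o15 (13) corresponds to 9600 baud.
BAUD_TABLE = [0, 50, 75, 110, 134, 150, 200, 300,
              600, 1200, 1800, 2400, 4800, 9600, 19200, 38400]


def output_speed(c_cflag, c_ospeed):
    """Model: BOTHER selects the arbitrary rate stored in c_ospeed."""
    cbaud = c_cflag & CBAUD
    if cbaud == BOTHER:
        return c_ospeed          # no table lookup for 037 any more
    return BAUD_TABLE[cbaud]     # indices above 15 are outside this truncated model


print(output_speed(0o15, 0))
print(output_speed(BOTHER, 250000))
```

With BOTHER defined, the value 037 is no longer used as a table index, and an arbitrary rate such as the 250000 in the example is taken from c_ospeed.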
      
      However, userspace still needs to know if setting BOTHER is actually
      safe given legacy kernels (does anyone actually care about that on
      Alpha anymore?), so enable the TCGETS2/TCSETS*2 ioctls on Alpha, even
      though they use the same structure. Define struct termios2 just for
      compatibility; it is the exact same structure as struct termios. In a
      future patchset, this will be cleaned up so the uapi headers are
      usable from libc.
      
      Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Eugene Syromiatnikov <esyr@redhat.com>
      Cc: <linux-alpha@vger.kernel.org>
      Cc: <linux-serial@vger.kernel.org>
      Cc: Johan Hovold <johan@kernel.org>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • H
      termios, tty/tty_baudrate.c: fix buffer overrun · 8851e11f
      H. Peter Anvin committed
      commit 991a25194097006ec1e0d2e0814ff920e59e3465 upstream.
      
      On architectures with CBAUDEX == 0 (Alpha and PowerPC), the code in
      tty_baudrate.c does not do any limit checking on the tty_baudrate[]
      array, and in fact a buffer overrun is possible on both architectures.
      Add a limit check to prevent that situation.
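
A minimal model of such a limit check is shown below. It is a sketch under stated assumptions, not the patch itself: the 31-entry table contents, the function names, and the choice to return 0 for an out-of-range index are assumptions of this example.

```python
# Assumed table contents: 31 entries, so valid indices are 0..30.
BAUD_TABLE = [
    0, 50, 75, 110, 134, 150, 200, 300, 600, 1200, 1800, 2400, 4800,
    9600, 19200, 38400, 57600, 115200, 230400, 460800, 500000, 576000,
    921600, 1000000, 1152000, 1500000, 2000000, 2500000, 3000000,
    3500000, 4000000,
]


def lookup_unchecked(cbaud):
    """No limit check: Python raises IndexError where C would read past the array."""
    return BAUD_TABLE[cbaud]


def lookup_checked(cbaud):
    """Limit check: an out-of-range index yields 0 instead of an overrun."""
    return BAUD_TABLE[cbaud] if cbaud < len(BAUD_TABLE) else 0


print(lookup_checked(0o15))
print(lookup_checked(0o37))
try:
    lookup_unchecked(0o37)
except IndexError:
    print("index 037 is past the end of the table")
```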
      
      This will be followed by a much bigger cleanup/simplification patch.
      
      Signed-off-by: H. Peter Anvin (Intel) <hpa@zytor.com>
      Requested-by: Cc: Johan Hovold <johan@kernel.org>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Eugene Syromiatnikov <esyr@redhat.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: stable <stable@vger.kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • M
      x86/hyper-v: Enable PIT shutdown quirk · 2deb55aa
      Michael Kelley committed
      commit 1de72c706488b7be664a601cf3843bd01e327e58 upstream.
      
      Hyper-V emulation of the PIT has a quirk such that the normal PIT shutdown
      path doesn't work, because clearing the counter register restarts the
      timer.
      
      Disable the counter clearing on PIT shutdown.
      
      Signed-off-by: Michael Kelley <mikelley@microsoft.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: "gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>
      Cc: "devel@linuxdriverproject.org" <devel@linuxdriverproject.org>
      Cc: "daniel.lezcano@linaro.org" <daniel.lezcano@linaro.org>
      Cc: "virtualization@lists.linux-foundation.org" <virtualization@lists.linux-foundation.org>
      Cc: "jgross@suse.com" <jgross@suse.com>
      Cc: "akataria@vmware.com" <akataria@vmware.com>
      Cc: "olaf@aepfle.de" <olaf@aepfle.de>
      Cc: "apw@canonical.com" <apw@canonical.com>
      Cc: vkuznets <vkuznets@redhat.com>
      Cc: "jasowang@redhat.com" <jasowang@redhat.com>
      Cc: "marcelo.cerri@canonical.com" <marcelo.cerri@canonical.com>
      Cc: KY Srinivasan <kys@microsoft.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/1541303219-11142-3-git-send-email-mikelley@microsoft.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • S
      x86/cpu/vmware: Do not trace vmware_sched_clock() · e73cb6a6
      Steven Rostedt (VMware) committed
      commit 15035388439f892017d38b05214d3cda6578af64 upstream.
      
      When running function tracing on a Linux guest running on VMware
      Workstation, the guest would crash. This is due to tracing of the
      sched_clock internal call of the VMware vmware_sched_clock(), which
      causes an infinite recursion within the tracing code (clock calls must
      not be traced).
      
      Make vmware_sched_clock() not traced by ftrace.
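
Why a traced clock recurses can be shown with a toy tracer. This is a loose user-space analogy and not ftrace: the `Tracer` class, `raw_clock`, and `work` are hypothetical names, and Python's recursion limit stands in for the guest crash.

```python
import itertools


class Tracer:
    """Toy tracer: records (timestamp, function name) for every traced call."""

    def __init__(self, clock=None):
        self.clock = clock
        self.events = []

    def trace(self, fn):
        def wrapper(*args):
            # Timestamping the event requires a call to the clock.
            self.events.append((self.clock(), fn.__name__))
            return fn(*args)
        return wrapper


_ticks = itertools.count()


def raw_clock():
    return next(_ticks)


def work():
    return "done"


# Clock excluded from tracing: the traced call completes.
good = Tracer(clock=raw_clock)
print(good.trace(work)())

# Clock itself traced: every timestamp needs another timestamp first.
bad = Tracer()
bad.clock = bad.trace(raw_clock)
try:
    bad.trace(work)()
except RecursionError:
    print("traced clock recursed until the stack ran out")
```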
      
      Fixes: 80e9a4f2 ("x86/vmware: Add paravirt sched clock")
      Reported-by: GwanYeong Kim <gy741.kim@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      CC: Alok Kataria <akataria@vmware.com>
      CC: GwanYeong Kim <gy741.kim@gmail.com>
      CC: "H. Peter Anvin" <hpa@zytor.com>
      CC: Ingo Molnar <mingo@kernel.org>
      Cc: stable@vger.kernel.org
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: virtualization@lists.linux-foundation.org
      CC: x86-ml <x86@kernel.org>
      Link: http://lkml.kernel.org/r/20181109152207.4d3e7d70@gandalf.local.home
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • J
      of, numa: Validate some distance map rules · 3cbdaf13
      John Garry committed
      commit 89c38422e072bb453e3045b8f1b962a344c3edea upstream.
      
      Currently the NUMA distance map parsing does not validate the distance
      table for the distance-matrix rules 1-2 in [1].
      
      The arch NUMA code may enforce some of these rules, but not all. Such
      is the case for the arm64 port, which does not enforce the rule that
      the distance between separate nodes cannot equal LOCAL_DISTANCE.
      
      The patch adds the following rules validation:
      - distance of node to self equals LOCAL_DISTANCE
      - distance of separate nodes > LOCAL_DISTANCE
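
The two rules listed above can be written as a single predicate. This is a sketch, not the code added by the patch: the function name `distance_is_valid` is hypothetical, and LOCAL_DISTANCE is set to 10, the kernel's value.

```python
LOCAL_DISTANCE = 10  # distance of a node to itself


def distance_is_valid(node_a, node_b, distance):
    """Check one distance-map entry against the two rules."""
    if node_a == node_b:
        return distance == LOCAL_DISTANCE
    return distance > LOCAL_DISTANCE


for entry in [(0, 0, 10), (0, 1, 20), (0, 1, 10), (1, 1, 15)]:
    print(entry, "valid" if distance_is_valid(*entry) else "invalid")
```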
      
      This change avoids a yet-unresolved crash reported in [2].
      
      A note on dealing with symmetrical distances between nodes:
      
      Validating symmetrical distances between nodes is difficult. If it were
      mandated in the bindings that every distance must be recorded in the
      table, then it would be easy. However, it isn't.
      
      In addition to this, it is also possible to record only the [b, a]
      distance (and not [a, b]). So, when processing the table for [b, a], we
      cannot treat a current [a, b] distance that differs from [b, a] as
      invalid, because the [a, b] distance may not be present in the table,
      in which case its current value is the default REMOTE_DISTANCE.
      
      As such, we maintain the policy that we overwrite distance [a, b] = [b, a]
      for b > a. This policy is different to kernel ACPI SLIT validation, which
      allows non-symmetrical distances (ACPI spec SLIT rules allow it). However,
      the distance debug message is dropped as it may be misleading (for a distance
      which is later overwritten).
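
The overwrite policy can be illustrated with a small table builder. This is a sketch under assumptions: `build_distance_map` is a hypothetical name, entries are processed in table order, and unset off-diagonal distances default to REMOTE_DISTANCE (20 in the kernel).

```python
LOCAL_DISTANCE = 10
REMOTE_DISTANCE = 20


def build_distance_map(num_nodes, entries):
    """Apply (a, b, distance) entries; mirror [a, b] into [b, a] when b > a."""
    dist = [[LOCAL_DISTANCE if a == b else REMOTE_DISTANCE
             for b in range(num_nodes)] for a in range(num_nodes)]
    for a, b, distance in entries:
        dist[a][b] = distance
        if b > a:
            dist[b][a] = distance  # overwrite the reverse direction
    return dist


print(build_distance_map(2, [(0, 1, 15)]))
print(build_distance_map(2, [(1, 0, 30)]))
print(build_distance_map(2, [(1, 0, 30), (0, 1, 15)]))
```

In the last call the [1, 0] distance of 30 is later overwritten by the [0, 1] entry, which is the situation in which a per-entry debug message would be misleading.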
      
      Some final notes on semantics:
      
      - It is implied that it is the responsibility of the arch NUMA code to
        reset the NUMA distance map for an error in distance map parsing.
      
      - It is the responsibility of the FW NUMA topology parsing (whether OF or
        ACPI) to enforce NUMA distance rules, and not arch NUMA code.
      
      [1] Documentation/devicetree/bindings/numa.txt
      [2] https://www.spinics.net/lists/arm-kernel/msg683304.html
      
      Cc: stable@vger.kernel.org # 4.7
      Signed-off-by: John Garry <john.garry@huawei.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Rob Herring <robh@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • A
      perf intel-pt: Insert callchain context into synthesized callchains · 73c660f3
      Adrian Hunter committed
      commit 242483068b4b9ad02f1653819b6e683577681e0e upstream.
      
      In the absence of a fallback, callchains must also encode the callchain
      context. Do that now that there is no fallback.
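
What "encoding the callchain context" means can be sketched as follows. This is a model and not the perf code: the function names are hypothetical, and treating every address at or above 1 << 63 as a kernel address is an assumption of this example. The PERF_CONTEXT_* values are the perf ABI markers (u64 casts of -128 and -512).

```python
PERF_CONTEXT_KERNEL = (1 << 64) - 128
PERF_CONTEXT_USER = (1 << 64) - 512
KERNEL_START = 1 << 63  # assumed boundary between user and kernel addresses


def callchain_context(ip):
    return PERF_CONTEXT_KERNEL if ip >= KERNEL_START else PERF_CONTEXT_USER


def synthesize_callchain(ips):
    """Insert a context marker whenever the context of the next ip changes."""
    chain, last_context = [], None
    for ip in ips:
        context = callchain_context(ip)
        if context != last_context:
            chain.append(context)
            last_context = context
        chain.append(ip)
    return chain


ips = [0xFFFFFFFF81000010, 0xFFFFFFFF81000020, 0x400500]
print([hex(entry) for entry in synthesize_callchain(ips)])
```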
      
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: stable@vger.kernel.org # 4.19
      Link: http://lkml.kernel.org/r/100ea2ec-ed14-b56d-d810-e0a6d2f4b069@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • A
      perf intel-pt/bts: Calculate cpumode for synthesized samples · f3de8640
      Adrian Hunter committed
      commit 5d4f0edaa3ac4f1844ed7c64cd2bae6f1912bac5 upstream.
      
      In the absence of a fallback, samples must provide a correct cpumode for
      the 'ip'. Do that now that there is no fallback.
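
Deriving a cpumode from the sample 'ip' can be modeled in a few lines. This is a sketch only: `cpumode_for_ip` is a hypothetical name, guest modes are ignored, and the 1 << 63 kernel boundary is an assumption. The PERF_RECORD_MISC_* values are the ones defined by the perf ABI.

```python
PERF_RECORD_MISC_KERNEL = 1
PERF_RECORD_MISC_USER = 2
KERNEL_START = 1 << 63  # assumed boundary between user and kernel addresses


def cpumode_for_ip(ip):
    """Pick the cpumode that matches the address space of the sample ip."""
    if ip >= KERNEL_START:
        return PERF_RECORD_MISC_KERNEL
    return PERF_RECORD_MISC_USER


print(cpumode_for_ip(0xFFFFFFFF81000010))
print(cpumode_for_ip(0x400500))
```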
      
      Signed-off-by: Adrian Hunter <adrian.hunter@intel.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Leo Yan <leo.yan@linaro.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: stable@vger.kernel.org # 4.19
      Link: http://lkml.kernel.org/r/20181031091043.23465-6-adrian.hunter@intel.com
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>