1. 31 5月, 2019 31 次提交
    • J
      btrfs: honor path->skip_locking in backref code · 9c0339dd
      Josef Bacik 提交于
      commit 38e3eebff643db725633657d1d87a3be019d1018 upstream.
      
      Qgroups will do the old roots lookup at delayed ref time, which could be
      while walking down the extent root while running a delayed ref.  This
      should be fine, except we specifically lock eb's in the backref walking
      code irrespective of path->skip_locking, which deadlocks the system.
      Fix up the backref code to honor path->skip_locking, nobody will be
      modifying the commit_root when we're searching so it's completely safe
      to do.
      
      This happens since fb235dc0 ("btrfs: qgroup: Move half of the qgroup
      accounting time out of commit trans"), kernel may lockup with quota
      enabled.
      
      There is one backref trace triggered by snapshot dropping along with
      write operation in the source subvolume.  The example can be reliably
      reproduced:
      
        btrfs-cleaner   D    0  4062      2 0x80000000
        Call Trace:
         schedule+0x32/0x90
         btrfs_tree_read_lock+0x93/0x130 [btrfs]
         find_parent_nodes+0x29b/0x1170 [btrfs]
         btrfs_find_all_roots_safe+0xa8/0x120 [btrfs]
         btrfs_find_all_roots+0x57/0x70 [btrfs]
         btrfs_qgroup_trace_extent_post+0x37/0x70 [btrfs]
         btrfs_qgroup_trace_leaf_items+0x10b/0x140 [btrfs]
         btrfs_qgroup_trace_subtree+0xc8/0xe0 [btrfs]
         do_walk_down+0x541/0x5e3 [btrfs]
         walk_down_tree+0xab/0xe7 [btrfs]
         btrfs_drop_snapshot+0x356/0x71a [btrfs]
         btrfs_clean_one_deleted_snapshot+0xb8/0xf0 [btrfs]
         cleaner_kthread+0x12b/0x160 [btrfs]
         kthread+0x112/0x130
         ret_from_fork+0x27/0x50
      
      When dropping snapshots with qgroup enabled, we will trigger backref
      walk.
      
      However such backref walk at that timing is pretty dangerous, as if one
      of the parent nodes get WRITE locked by other thread, we could cause a
      dead lock.
      
      For example:
      
                 FS 260     FS 261 (Dropped)
                  node A        node B
                 /      \      /      \
             node C      node D      node E
            /   \         /  \        /     \
        leaf F|leaf G|leaf H|leaf I|leaf J|leaf K
      
      The lock sequence would be:
      
            Thread A (cleaner)             |       Thread B (other writer)
      -----------------------------------------------------------------------
      write_lock(B)                        |
      write_lock(D)                        |
      ^^^ called by walk_down_tree()       |
                                           |       write_lock(A)
                                           |       write_lock(D) << Stall
      read_lock(H) << for backref walk     |
      read_lock(D) << lock owner is        |
                      the same thread A    |
                      so read lock is OK   |
      read_lock(A) << Stall                |
      
      So thread A hold write lock D, and needs read lock A to unlock.
      While thread B holds write lock A, while needs lock D to unlock.
      
      This will cause a deadlock.
      
      This is not only limited to snapshot dropping case.  As the backref
      walk, even only happens on commit trees, is breaking the normal top-down
      locking order, makes it deadlock prone.
      
      Fixes: fb235dc0 ("btrfs: qgroup: Move half of the qgroup accounting time out of commit trans")
      CC: stable@vger.kernel.org # 4.14+
      Reported-and-tested-by: NDavid Sterba <dsterba@suse.com>
      Reported-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NQu Wenruo <wqu@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      [ rebase to latest branch and fix lock assert bug in btrfs/007 ]
      [ backport to linux-4.19.y branch, solve minor conflicts ]
      Signed-off-by: NQu Wenruo <wqu@suse.com>
      [ copy logs and deadlock analysis from Qu's patch ]
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c0339dd
    • W
      arm64: errata: Add workaround for Cortex-A76 erratum #1463225 · 2eefb4a3
      Will Deacon 提交于
      commit 969f5ea627570e91c9d54403287ee3ed657f58fe upstream.
      
      Revisions of the Cortex-A76 CPU prior to r4p0 are affected by an erratum
      that can prevent interrupts from being taken when single-stepping.
      
      This patch implements a software workaround to prevent userspace from
      effectively being able to disable interrupts.
      
      Cc: <stable@vger.kernel.org>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      2eefb4a3
    • A
      brcmfmac: add subtype check for event handling in data path · 8783c412
      Arend van Spriel 提交于
      commit a4176ec356c73a46c07c181c6d04039fafa34a9f upstream.
      
      For USB there is no separate channel being used to pass events
      from firmware to the host driver and as such are passed over the
      data path. In order to detect mock event messages an additional
      check is needed on event subtype. This check is added conditionally
      using unlikely() keyword.
      Reviewed-by: NHante Meuleman <hante.meuleman@broadcom.com>
      Reviewed-by: NPieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
      Reviewed-by: NFranky Lin <franky.lin@broadcom.com>
      Signed-off-by: NArend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: NKalle Valo <kvalo@codeaurora.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      8783c412
    • A
      brcmfmac: assure SSID length from firmware is limited · cc240e05
      Arend van Spriel 提交于
      commit 1b5e2423164b3670e8bc9174e4762d297990deff upstream.
      
      The SSID length as received from firmware should not exceed
      IEEE80211_MAX_SSID_LEN as that would result in heap overflow.
      Reviewed-by: NHante Meuleman <hante.meuleman@broadcom.com>
      Reviewed-by: NPieter-Paul Giesberts <pieter-paul.giesberts@broadcom.com>
      Reviewed-by: NFranky Lin <franky.lin@broadcom.com>
      Signed-off-by: NArend van Spriel <arend.vanspriel@broadcom.com>
      Signed-off-by: NKalle Valo <kvalo@codeaurora.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc240e05
    • D
      bpf: add bpf_jit_limit knob to restrict unpriv allocations · 43caa29c
      Daniel Borkmann 提交于
      commit ede95a63b5e84ddeea6b0c473b36ab8bfd8c6ce3 upstream.
      
      Rick reported that the BPF JIT could potentially fill the entire module
      space with BPF programs from unprivileged users which would prevent later
      attempts to load normal kernel modules or privileged BPF programs, for
      example. If JIT was enabled but unsuccessful to generate the image, then
      before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      we would always fall back to the BPF interpreter. Nowadays in the case
      where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort
      with a failure since the BPF interpreter was compiled out.
      
      Add a global limit and enforce it for unprivileged users such that in case
      of BPF interpreter compiled out we fail once the limit has been reached
      or we fall back to BPF interpreter earlier w/o using module mem if latter
      was compiled in. In a next step, fair share among unprivileged users can
      be resolved in particular for the case where we would fail hard once limit
      is reached.
      
      Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Co-Developed-by: NRick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: LKML <linux-kernel@vger.kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Cc: Ben Hutchings <ben.hutchings@codethink.co.uk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43caa29c
    • O
      NFSv4.1 fix incorrect return value in copy_file_range · cc1afc10
      Olga Kornievskaia 提交于
      commit 0769663b4f580566ef6cdf366f3073dbe8022c39 upstream.
      
      According to the NFSv4.2 spec if the input and output file is the
      same file, operation should fail with EINVAL. However, linux
      copy_file_range() system call has no such restrictions. Therefore,
      in such case let's return EOPNOTSUPP and allow VFS to fallback
      to doing do_splice_direct(). Also when copy_file_range is called
      on an NFSv4.0 or 4.1 mount (ie., a server that doesn't support
      COPY functionality), we also need to return EOPNOTSUPP and
      fallback to a regular copy.
      
      Fixes xfstest generic/075, generic/091, generic/112, generic/263
      for all NFSv4.x versions.
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
      Cc: Yu Xu <xuyu@linux.alibaba.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      cc1afc10
    • O
      NFSv4.2 fix unnecessary retry in nfs4_copy_file_range · e1eed692
      Olga Kornievskaia 提交于
      commit 45ac486ecf2dc998e25cf32f0cabf2deaad875be upstream.
      
      Currently nfs42_proc_copy_file_range() can not return EAGAIN.
      
      Fixes: e4648aa4 ("NFS recover from destination server reboot for copies")
      Signed-off-by: NOlga Kornievskaia <kolga@netapp.com>
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Cc: Yu Xu <xuyu@linux.alibaba.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      e1eed692
    • S
      fbdev: fix divide error in fb_var_to_videomode · 0bad28e9
      Shile Zhang 提交于
      commit cf84807f6dd0be5214378e66460cfc9187f532f9 upstream.
      
      To fix following divide-by-zero error found by Syzkaller:
      
        divide error: 0000 [#1] SMP PTI
        CPU: 7 PID: 8447 Comm: test Kdump: loaded Not tainted 4.19.24-8.al7.x86_64 #1
        Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
        RIP: 0010:fb_var_to_videomode+0xae/0xc0
        Code: 04 44 03 46 78 03 4e 7c 44 03 46 68 03 4e 70 89 ce d1 ee 69 c0 e8 03 00 00 f6 c2 01 0f 45 ce 83 e2 02 8d 34 09 0f 45 ce 31 d2 <41> f7 f0 31 d2 f7 f1 89 47 08 f3 c3 66 0f 1f 44 00 00 0f 1f 44 00
        RSP: 0018:ffffb7e189347bf0 EFLAGS: 00010246
        RAX: 00000000e1692410 RBX: ffffb7e189347d60 RCX: 0000000000000000
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffb7e189347c10
        RBP: ffff99972a091c00 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000100
        R13: 0000000000010000 R14: 00007ffd66baf6d0 R15: 0000000000000000
        FS:  00007f2054d11740(0000) GS:ffff99972fbc0000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 00007f205481fd20 CR3: 00000004288a0001 CR4: 00000000001606a0
        Call Trace:
         fb_set_var+0x257/0x390
         ? lookup_fast+0xbb/0x2b0
         ? fb_open+0xc0/0x140
         ? chrdev_open+0xa6/0x1a0
         do_fb_ioctl+0x445/0x5a0
         do_vfs_ioctl+0x92/0x5f0
         ? __alloc_fd+0x3d/0x160
         ksys_ioctl+0x60/0x90
         __x64_sys_ioctl+0x16/0x20
         do_syscall_64+0x5b/0x190
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7f20548258d7
        Code: 44 00 00 48 8b 05 b9 15 2d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 15 2d 00 f7 d8 64 89 01 48
      
      It can be triggered easily with following test code:
      
        #include <linux/fb.h>
        #include <fcntl.h>
        #include <sys/ioctl.h>
        int main(void)
        {
                struct fb_var_screeninfo var = {.activate = 0x100, .pixclock = 60};
                int fd = open("/dev/fb0", O_RDWR);
                if (fd < 0)
                        return 1;
      
                if (ioctl(fd, FBIOPUT_VSCREENINFO, &var))
                        return 1;
      
                return 0;
        }
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Fredrik Noring <noring@nocrew.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Reviewed-by: NMukesh Ojha <mojha@codeaurora.org>
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      0bad28e9
    • D
      udlfb: fix some inconsistent NULL checking · b8304d91
      Dan Carpenter 提交于
      commit c143a559b073aeea688b9bb7c5b46f3cf322d569 upstream.
      
      In the current kernel, then kzalloc() can't fail for small allocations,
      but if it did fail then we would have a NULL dereference in the error
      handling.  Also in dlfb_usb_disconnect() if "info" were NULL then it
      would cause an Oops inside the unregister_framebuffer() function but it
      can't be NULL so let's remove that check.
      
      Fixes: 68a958a915ca ("udlfb: handle unplug properly")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Cc: Bernie Thompson <bernie@plugable.com>
      Cc: Mikulas Patocka <mpatocka@redhat.com>
      Cc: Alexey Khoroshilov <khoroshilov@ispras.ru>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Wen Yang <wen.yang99@zte.com.cn>
      [b.zolnierkie: added "Fixes:" tag]
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b8304d91
    • T
      btrfs: sysfs: don't leak memory when failing add fsid · 94e1f966
      Tobin C. Harding 提交于
      commit e32773357d5cc271b1d23550b3ed026eb5c2a468 upstream.
      
      A failed call to kobject_init_and_add() must be followed by a call to
      kobject_put().  Currently in the error path when adding fs_devices we
      are missing this call.  This could be fixed by calling
      btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
      by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
      Here we choose the second option because it prevents the slightly
      unusual error path handling requirements of kobject from leaking out
      into btrfs functions.
      
      Add a call to kobject_put() in the error path of kobject_add_and_init().
      This causes the release method to be called if kobject_init_and_add()
      fails.  open_tree() is the function that calls btrfs_sysfs_add_fsid()
      and the error code in this function is already written with the
      assumption that the release method is called during the error path of
      open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
      fail_fsdev_sysfs label).
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      94e1f966
    • T
      btrfs: sysfs: Fix error path kobject memory leak · 946ad2ec
      Tobin C. Harding 提交于
      commit 450ff8348808a89cc27436771aa05c2b90c0eef1 upstream.
      
      If a call to kobject_init_and_add() fails we must call kobject_put()
      otherwise we leak memory.
      
      Calling kobject_put() when kobject_init_and_add() fails drops the
      refcount back to 0 and calls the ktype release method (which in turn
      calls the percpu destroy and kfree).
      
      Add call to kobject_put() in the error path of call to
      kobject_init_and_add().
      
      Cc: stable@vger.kernel.org # v4.4+
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NTobin C. Harding <tobin@kernel.org>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      946ad2ec
    • F
      Btrfs: fix race between ranged fsync and writeback of adjacent ranges · 92f907d7
      Filipe Manana 提交于
      commit 0c713cbab6200b0ab6473b50435e450a6e1de85d upstream.
      
      When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
      inode) that happens to be ranged, which happens during a msync() or writes
      for files opened with O_SYNC for example, we can end up with a corrupt log,
      due to different file extent items representing ranges that overlap with
      each other, or hit some assertion failures.
      
      When doing a ranged fsync we only flush delalloc and wait for ordered
      exents within that range. If while we are logging items from our inode
      ordered extents for adjacent ranges complete, we end up in a race that can
      make us insert the file extent items that overlap with others we logged
      previously and the assertion failures.
      
      For example, if tree-log.c:copy_items() receives a leaf that has the
      following file extents items, all with a length of 4K and therefore there
      is an implicit hole in the range 68K to 72K - 1:
      
        (257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
      
      It copies them to the log tree. However due to the need to detect implicit
      holes, it may release the path, in order to look at the previous leaf to
      detect an implicit hole, and then later it will search again in the tree
      for the first file extent item key, with the goal of locking again the
      leaf (which might have changed due to concurrent changes to other inodes).
      
      However when it locks again the leaf containing the first key, the key
      corresponding to the extent at offset 72K may not be there anymore since
      there is an ordered extent for that range that is finishing (that is,
      somewhere in the middle of btrfs_finish_ordered_io()), and it just
      removed the file extent item but has not yet replaced it with a new file
      extent item, so the part of copy_items() that does hole detection will
      decide that there is a hole in the range starting from 68K to 76K - 1,
      and therefore insert a file extent item to represent that hole, having
      a key offset of 68K. After that we now have a log tree with 2 different
      extent items that have overlapping ranges:
      
       1) The file extent item copied before copy_items() released the path,
          which has a key offset of 72K and a length of 4K, representing the
          file range 72K to 76K - 1.
      
       2) And a file extent item representing a hole that has a key offset of
          68K and a length of 8K, representing the range 68K to 76K - 1. This
          item was inserted after releasing the path, and overlaps with the
          extent item inserted before.
      
      The overlapping extent items can cause all sorts of unpredictable and
      incorrect behaviour, either when replayed or if a fast (non full) fsync
      happens later, which can trigger a BUG_ON() when calling
      btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
      trace like the following:
      
        [61666.783269] ------------[ cut here ]------------
        [61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
        [61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
        (...)
        [61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
        [61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
        [61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
        [61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
        [61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
        [61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
        [61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
        [61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
        [61666.786253] FS:  00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
        [61666.786253] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
        [61666.786253] Call Trace:
        [61666.786253]  __btrfs_drop_extents+0x5e3/0xaad [btrfs]
        [61666.786253]  ? time_hardirqs_on+0x9/0x14
        [61666.786253]  btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
        [61666.786253]  ? release_extent_buffer+0x38/0xb4 [btrfs]
        [61666.786253]  btrfs_log_inode+0xb6e/0xcdc [btrfs]
        [61666.786253]  ? lock_acquire+0x131/0x1c5
        [61666.786253]  ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
        [61666.786253]  btrfs_log_inode_parent+0x223/0x659 [btrfs]
        [61666.786253]  ? arch_local_irq_save+0x9/0xc
        [61666.786253]  ? lockref_get_not_zero+0x2c/0x34
        [61666.786253]  ? rcu_read_unlock+0x3e/0x5d
        [61666.786253]  btrfs_log_dentry_safe+0x60/0x7b [btrfs]
        [61666.786253]  btrfs_sync_file+0x317/0x42c [btrfs]
        [61666.786253]  vfs_fsync_range+0x8c/0x9e
        [61666.786253]  SyS_msync+0x13c/0x1c9
        [61666.786253]  entry_SYSCALL_64_fastpath+0x18/0xad
      
      A sample of a corrupt log tree leaf with overlapping extents I got from
      running btrfs/072:
      
            item 14 key (295 108 200704) itemoff 2599 itemsize 53
                    extent data disk bytenr 0 nr 0
                    extent data offset 0 nr 458752 ram 458752
            item 15 key (295 108 659456) itemoff 2546 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 606208 nr 163840 ram 770048
            item 16 key (295 108 663552) itemoff 2493 itemsize 53
                    extent data disk bytenr 4343541760 nr 770048
                    extent data offset 610304 nr 155648 ram 770048
            item 17 key (295 108 819200) itemoff 2440 itemsize 53
                    extent data disk bytenr 4334788608 nr 4096
                    extent data offset 0 nr 4096 ram 4096
      
      The file extent item at offset 659456 (item 15) ends at offset 823296
      (659456 + 163840) while the next file extent item (item 16) starts at
      offset 663552.
      
      Another different problem that the race can trigger is a failure in the
      assertions at tree-log.c:copy_items(), which expect that the first file
      extent item key we found before releasing the path exists after we have
      released path and that the last key we found before releasing the path
      also exists after releasing the path:
      
        $ cat -n fs/btrfs/tree-log.c
        4080          if (need_find_last_extent) {
        4081                  /* btrfs_prev_leaf could return 1 without releasing the path */
        4082                  btrfs_release_path(src_path);
        4083                  ret = btrfs_search_slot(NULL, inode->root, &first_key,
        4084                                  src_path, 0, 0);
        4085                  if (ret < 0)
        4086                          return ret;
        4087                  ASSERT(ret == 0);
        (...)
        4103                  if (i >= btrfs_header_nritems(src_path->nodes[0])) {
        4104                          ret = btrfs_next_leaf(inode->root, src_path);
        4105                          if (ret < 0)
        4106                                  return ret;
        4107                          ASSERT(ret == 0);
        4108                          src = src_path->nodes[0];
        4109                          i = 0;
        4110                          need_find_last_extent = true;
        4111                  }
        (...)
      
      The second assertion implicitly expects that the last key before the path
      release still exists, because the surrounding while loop only stops after
      we have found that key. When this assertion fails it produces a stack like
      this:
      
        [139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
        [139590.037406] ------------[ cut here ]------------
        [139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
        [139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
        [139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G        W         5.0.0-btrfs-next-46 #1
        (...)
        [139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
        (...)
        [139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
        [139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
        [139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
        [139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
        [139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
        [139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
        [139590.042501] FS:  00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
        [139590.042847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
        [139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [139590.044250] Call Trace:
        [139590.044631]  copy_items+0xa3f/0x1000 [btrfs]
        [139590.045009]  ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
        [139590.045396]  btrfs_log_inode+0x7b3/0xd70 [btrfs]
        [139590.045773]  btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
        [139590.046143]  ? do_raw_spin_unlock+0x49/0xc0
        [139590.046510]  btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
        [139590.046872]  btrfs_sync_file+0x3b6/0x440 [btrfs]
        [139590.047243]  btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
        [139590.047592]  __vfs_write+0x129/0x1c0
        [139590.047932]  vfs_write+0xc2/0x1b0
        [139590.048270]  ksys_write+0x55/0xc0
        [139590.048608]  do_syscall_64+0x60/0x1b0
        [139590.048946]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
        [139590.049287] RIP: 0033:0x7f2efc4be190
        (...)
        [139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
        [139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
        [139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
        [139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
        [139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
        [139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
        (...)
        [139590.055128] ---[ end trace 193f35d0215cdeeb ]---
      
      So fix this race between a full ranged fsync and writeback of adjacent
      ranges by flushing all delalloc and waiting for all ordered extents to
      complete before logging the inode. This is the simplest way to solve the
      problem because currently the full fsync path does not deal with ranges
      at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
      to look at adjacent ranges for hole detection. For use cases of ranged
      fsyncs this can make a few fsyncs slower but on the other hand it can
      make some following fsyncs to other ranges do less work or no need to do
      anything at all. A full fsync is rare anyway and happens only once after
      loading/creating an inode and once after less common operations such as a
      shrinking truncate.
      
      This is an issue that exists for a long time, and was often triggered by
      generic/127, because it does mmap'ed writes and msync (which triggers a
      ranged fsync). Adding support for the tree checker to detect overlapping
      extents (next patch in the series) and trigger a WARN() when such cases
      are found, and then calling btrfs_check_leaf_full() at the end of
      btrfs_insert_file_extent() made the issue much easier to detect. Running
      btrfs/072 with that change to the tree checker and making fsstress open
      files always with O_SYNC made it much easier to trigger the issue (as
      triggering it with generic/127 is very rare).
      
      CC: stable@vger.kernel.org # 3.16+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      92f907d7
    • F
      Btrfs: avoid fallback to transaction commit during fsync of files with holes · 4f9a774d
      Filipe Manana 提交于
      commit ebb929060aeb162417b4c1307e63daee47b208d9 upstream.
      
      When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
      file that has holes and has file extent items spanning two or more leafs,
      we can end up falling to back to a full transaction commit due to a logic
      bug that leads to failure to insert a duplicate file extent item that is
      meant to represent a hole between the last file extent item of a leaf and
      the first file extent item in the next leaf. The failure (EEXIST error)
      leads to a transaction commit (as most errors when logging an inode do).
      
      For example, we have the two following leafs:
      
      Leaf N:
      
        -----------------------------------------------
        | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
        -----------------------------------------------
        The file extent item at the end of leaf N has a length of 4Kb,
        representing the file range from 64K to 68K - 1.
      
      Leaf N + 1:
      
        -----------------------------------------------
        | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
        -----------------------------------------------
        The file extent item at the first slot of leaf N + 1 has a length of
        4Kb too, representing the file range from 72K to 76K - 1.
      
      During the full fsync path, when we are at tree-log.c:copy_items() with
      leaf N as a parameter, after processing the last file extent item, that
      represents the extent at offset 64K, we take a look at the first file
      extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
      between the two extents, and therefore we insert a file extent item
      representing that hole, starting at file offset 68K and ending at offset
      72K - 1. However we don't update the value of *last_extent, which is used
      to represent the end offset (plus 1, non-inclusive end) of the last file
      extent item inserted in the log, so it stays with a value of 68K and not
      with a value of 72K.
      
      Then, when copy_items() is called for leaf N + 1, because the value of
      *last_extent is smaller then the offset of the first extent item in the
      leaf (68K < 72K), we look at the last file extent item in the previous
      leaf (leaf N) and see it there's a 4K gap between it and our first file
      extent item (again, 68K < 72K), so we decide to insert a file extent item
      representing the hole, starting at file offset 68K and ending at offset
      72K - 1, this insertion will fail with -EEXIST being returned from
      btrfs_insert_file_extent() because we already inserted a file extent item
      representing a hole for this offset (68K) in the previous call to
      copy_items(), when processing leaf N.
      
      The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
      which falls back to a full transaction commit.
      
      Fix this by adjusting *last_extent after inserting a hole when we had to
      look at the next leaf.
      
      Fixes: 4ee3fad3 ("Btrfs: fix fsync after hole punching when using no-holes feature")
      Cc: stable@vger.kernel.org # 4.14+
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      4f9a774d
    • F
      Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path · 7ec747c8
      Filipe Manana 提交于
      commit 72bd2323ec87722c115a5906bc6a1b31d11e8f54 upstream.
      
      Currently when we fail to COW a path at btrfs_update_root() we end up
      always aborting the transaction. However all the current callers of
      btrfs_update_root() are able to deal with errors returned from it, many do
      end up aborting the transaction themselves (directly or not, such as the
      transaction commit path), other BUG_ON() or just gracefully cancel whatever
      they were doing.
      
      When syncing the fsync log, we call btrfs_update_root() through
      tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
      sync code does not abort the transaction, instead it gracefully handles
      the error and returns -EAGAIN to the fsync handler, so that it falls back
      to a transaction commit. Any other error different from -ENOSPC, makes the
      log sync code abort the transaction.
      
      So remove the transaction abort from btrfs_update_log() when we fail to
      COW a path to update the root item, so that if an -ENOSPC failure happens
      we avoid aborting the current transaction and have a chance of the fsync
      succeeding after falling back to a transaction commit.
      
      Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
      Fixes: 79787eaa ("btrfs: replace many BUG_ONs with proper error handling")
      Cc: stable@vger.kernel.org # 4.4+
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      7ec747c8
    • J
      btrfs: don't double unlock on error in btrfs_punch_hole · ce21e658
      Josef Bacik 提交于
      commit 8fca955057b9c58467d1b231e43f19c4cf26ae8c upstream.
      
      If we have an error writing out a delalloc range in
      btrfs_punch_hole_lock_range we'll unlock the inode and then goto
      out_only_mutex, where we will again unlock the inode.  This is bad,
      don't do this.
      
      Fixes: f27451f2 ("Btrfs: add support for fallocate's zero range operation")
      CC: stable@vger.kernel.org # 4.19+
      Reviewed-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ce21e658
    • A
      gfs2: Fix sign extension bug in gfs2_update_stats · fdc78eed
      Andreas Gruenbacher 提交于
      commit 5a5ec83d6ac974b12085cd99b196795f14079037 upstream.
      
      Commit 4d207133 changed the types of the statistic values in struct
      gfs2_lkstats from s64 to u64.  Because of that, what should be a signed
      value in gfs2_update_stats turned into an unsigned value.  When shifted
      right, we end up with a large positive value instead of a small negative
      value, which results in an incorrect variance estimate.
      
      Fixes: 4d207133 ("gfs2: Make statistics unsigned, suitable for use with do_div()")
      Signed-off-by: NAndreas Gruenbacher <agruenba@redhat.com>
      Cc: stable@vger.kernel.org # v4.4+
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fdc78eed
    • C
      arm64/iommu: handle non-remapped addresses in ->mmap and ->get_sgtable · 53cd8ae3
      Christoph Hellwig 提交于
      commit a98d9ae937d256ed679a935fc82d9deaa710d98e upstream.
      
      DMA allocations that can't sleep may return non-remapped addresses, but
      we do not properly handle them in the mmap and get_sgtable methods.
      Resolve non-vmalloc addresses using virt_to_page to handle this corner
      case.
      
      Cc: <stable@vger.kernel.org>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      53cd8ae3
    • A
      arm64/kernel: kaslr: reduce module randomization range to 2 GB · 9c15fff2
      Ard Biesheuvel 提交于
      commit b2eed9b58811283d00fa861944cb75797d4e52a7 upstream.
      
      The following commit
      
        7290d580 ("module: use relative references for __ksymtab entries")
      
      updated the ksymtab handling of some KASLR capable architectures
      so that ksymtab entries are emitted as pairs of 32-bit relative
      references. This reduces the size of the entries, but more
      importantly, it gets rid of statically assigned absolute
      addresses, which require fixing up at boot time if the kernel
      is self relocating (which takes a 24 byte RELA entry for each
      member of the ksymtab struct).
      
      Since ksymtab entries are always part of the same module as the
      symbol they export, it was assumed at the time that a 32-bit
      relative reference is always sufficient to capture the offset
      between a ksymtab entry and its target symbol.
      
      Unfortunately, this is not always true: in the case of per-CPU
      variables, a per-CPU variable's base address (which usually differs
      from the actual address of any of its per-CPU copies) is allocated
      in the vicinity of the ..data.percpu section in the core kernel
      (i.e., in the per-CPU reserved region which follows the section
      containing the core kernel's statically allocated per-CPU variables).
      
      Since we randomize the module space over a 4 GB window covering
      the core kernel (based on the -/+ 4 GB range of an ADRP/ADD pair),
      we may end up putting the core kernel out of the -/+ 2 GB range of
      32-bit relative references of module ksymtab entries that refer to
      per-CPU variables.
      
      So reduce the module randomization range a bit further. We lose
      1 bit of randomization this way, but this is something we can
      tolerate.
      
      Cc: <stable@vger.kernel.org> # v4.19+
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9c15fff2
    • D
      libnvdimm/pmem: Bypass CONFIG_HARDENED_USERCOPY overhead · ee6d3eb3
      Dan Williams 提交于
      commit 52f476a323f9efc959be1c890d0cdcf12e1582e0 upstream.
      
      Jeff discovered that performance improves from ~375K iops to ~519K iops
      on a simple psync-write fio workload when moving the location of 'struct
      page' from the default PMEM location to DRAM. This result is surprising
      because the expectation is that 'struct page' for dax is only needed for
      third party references to dax mappings. For example, a dax-mapped buffer
      passed to another system call for direct-I/O requires 'struct page' for
      sending the request down the driver stack and pinning the page. There is
      no usage of 'struct page' for first party access to a file via
      read(2)/write(2) and friends.
      
      However, this "no page needed" expectation is violated by
      CONFIG_HARDENED_USERCOPY and the check_copy_size() performed in
      copy_from_iter_full_nocache() and copy_to_iter_mcsafe(). The
      check_heap_object() helper routine assumes the buffer is backed by a
      slab allocator (DRAM) page and applies some checks.  Those checks are
      invalid, dax pages do not originate from the slab, and redundant,
      dax_iomap_actor() has already validated that the I/O is within bounds.
      Specifically that routine validates that the logical file offset is
      within bounds of the file, then it does a sector-to-pfn translation
      which validates that the physical mapping is within bounds of the block
      device.
      
      Bypass additional hardened usercopy overhead and call the 'no check'
      versions of the copy_{to,from}_iter operations directly.
      
      Fixes: 0aed55af ("x86, uaccess: introduce copy_from_iter_flushcache...")
      Cc: <stable@vger.kernel.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Matthew Wilcox <willy@infradead.org>
      Reported-and-tested-by: NJeff Smits <jeff.smits@intel.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ee6d3eb3
    • S
      kvm: svm/avic: fix off-by-one in checking host APIC ID · 709a9305
      Suthikulpanit, Suravee 提交于
      commit c9bcd3e3335d0a29d89fabd2c385e1b989e6f1b0 upstream.
      
      Current logic does not allow VCPU to be loaded onto CPU with
      APIC ID 255. This should be allowed since the host physical APIC ID
      field in the AVIC Physical APIC table entry is an 8-bit value,
      and APIC ID 255 is valid in system with x2APIC enabled.
      Instead, do not allow VCPU load if the host APIC ID cannot be
      represented by an 8-bit value.
      
      Also, use the more appropriate AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK
      instead of AVIC_MAX_PHYSICAL_ID_COUNT.
      Signed-off-by: NSuravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      709a9305
    • T
      mmc: sdhci-iproc: Set NO_HISPD bit to fix HS50 data hold time problem · 5b69ceee
      Trac Hoang 提交于
      commit ec0970e0a1b2c807c908d459641a9f9a1be3e130 upstream.
      
      The iproc host eMMC/SD controller hold time does not meet the
      specification in the HS50 mode.  This problem can be mitigated
      by disabling the HISPD bit; thus forcing the controller output
      data to be driven on the falling clock edges rather than the
      rising clock edges.
      
      Stable tag (v4.12+) chosen to assist stable kernel maintainers so that
      the change does not produce merge conflicts backporting to older kernel
      versions. In reality, the timing bug existed since the driver was first
      introduced but there is no need for this driver to be supported in kernel
      versions that old.
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NTrac Hoang <trac.hoang@broadcom.com>
      Signed-off-by: NScott Branden <scott.branden@broadcom.com>
      Acked-by: NAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5b69ceee
    • T
      mmc: sdhci-iproc: cygnus: Set NO_HISPD bit to fix HS50 data hold time problem · 227e0153
      Trac Hoang 提交于
      commit b7dfa695afc40d5396ed84b9f25aa3754de23e39 upstream.
      
      The iproc host eMMC/SD controller hold time does not meet the
      specification in the HS50 mode. This problem can be mitigated
      by disabling the HISPD bit; thus forcing the controller output
      data to be driven on the falling clock edges rather than the
      rising clock edges.
      
      This change applies only to the Cygnus platform.
      
      Stable tag (v4.12+) chosen to assist stable kernel maintainers so that
      the change does not produce merge conflicts backporting to older kernel
      versions. In reality, the timing bug existed since the driver was first
      introduced but there is no need for this driver to be supported in kernel
      versions that old.
      
      Cc: stable@vger.kernel.org # v4.12+
      Signed-off-by: NTrac Hoang <trac.hoang@broadcom.com>
      Signed-off-by: NScott Branden <scott.branden@broadcom.com>
      Acked-by: NAdrian Hunter <adrian.hunter@intel.com>
      Signed-off-by: NUlf Hansson <ulf.hansson@linaro.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      227e0153
    • D
      crypto: vmx - CTR: always increment IV as quadword · 792d65fc
      Daniel Axtens 提交于
      commit 009b30ac7444c17fae34c4f435ebce8e8e2b3250 upstream.
      
      The kernel self-tests picked up an issue with CTR mode:
      alg: skcipher: p8_aes_ctr encryption test failed (wrong result) on test vector 3, cfg="uneven misaligned splits, may sleep"
      
      Test vector 3 has an IV of FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFD, so
      after 3 increments it should wrap around to 0.
      
      In the aesp8-ppc code from OpenSSL, there are two paths that
      increment IVs: the bulk (8 at a time) path, and the individual
      path which is used when there are fewer than 8 AES blocks to
      process.
      
      In the bulk path, the IV is incremented with vadduqm: "Vector
      Add Unsigned Quadword Modulo", which does 128-bit addition.
      
      In the individual path, however, the IV is incremented with
      vadduwm: "Vector Add Unsigned Word Modulo", which instead
      does 4 32-bit additions. Thus the IV would instead become
      FFFFFFFFFFFFFFFFFFFFFFFF00000000, throwing off the result.
      
      Use vadduqm.
      
      This was probably a typo originally, what with q and w being
      adjacent. It is a pretty narrow edge case: I am really
      impressed by the quality of the kernel self-tests!
      
      Fixes: 5c380d62 ("crypto: vmx - Add support for VMS instructions by ASM")
      Cc: stable@vger.kernel.org
      Signed-off-by: NDaniel Axtens <dja@axtens.net>
      Acked-by: NNayna Jain <nayna@linux.ibm.com>
      Tested-by: NNayna Jain <nayna@linux.ibm.com>
      Signed-off-by: NHerbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      792d65fc
    • M
      Revert "scsi: sd: Keep disk read-only when re-reading partition" · 136b8cef
      Martin K. Petersen 提交于
      commit 8acf608e602f6ec38b7cc37b04c80f1ce9a1a6cc upstream.
      
      This reverts commit 20bd1d02.
      
      This patch introduced regressions for devices that come online in
      read-only state and subsequently switch to read-write.
      
      Given how the partition code is currently implemented it is not
      possible to persist the read-only flag across a device revalidate
      call. This may need to get addressed in the future since it is common
      for user applications to proactively call BLKRRPART.
      
      Reverting this commit will re-introduce a regression where a
      device-initiated revalidate event will cause the admin state to be
      forgotten. A separate patch will address this issue.
      
      Fixes: 20bd1d02 ("scsi: sd: Keep disk read-only when re-reading partition")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      136b8cef
    • A
      sbitmap: fix improper use of smp_mb__before_atomic() · ac7480a5
      Andrea Parri 提交于
      commit a0934fd2b1208458e55fc4b48f55889809fce666 upstream.
      
      This barrier only applies to the read-modify-write operations; in
      particular, it does not apply to the atomic_set() primitive.
      
      Replace the barrier with an smp_mb().
      
      Fixes: 6c0ca7ae ("sbitmap: fix wakeup hang after sbq resize")
      Cc: stable@vger.kernel.org
      Reported-by: N"Paul E. McKenney" <paulmck@linux.ibm.com>
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      ac7480a5
    • A
      bio: fix improper use of smp_mb__before_atomic() · b78255d6
      Andrea Parri 提交于
      commit f381c6a4bd0ae0fde2d6340f1b9bb0f58d915de6 upstream.
      
      This barrier only applies to the read-modify-write operations; in
      particular, it does not apply to the atomic_set() primitive.
      
      Replace the barrier with an smp_mb().
      
      Fixes: dac56212 ("bio: skip atomic inc/dec of ->bi_cnt for most use cases")
      Cc: stable@vger.kernel.org
      Reported-by: N"Paul E. McKenney" <paulmck@linux.ibm.com>
      Reported-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b78255d6
    • P
      KVM: x86: fix return value for reserved EFER · 432ec4fa
      Paolo Bonzini 提交于
      commit 66f61c92889ff3ca365161fb29dd36d6354682ba upstream.
      
      Commit 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for
      host-initiated writes", 2019-04-02) introduced a "return false" in a
      function returning int, and anyway set_efer has a "nonzero on error"
      conventon so it should be returning 1.
      Reported-by: NPavel Machek <pavel@denx.de>
      Fixes: 11988499e62b ("KVM: x86: Skip EFER vs. guest CPUID checks for host-initiated writes")
      Cc: Sean Christopherson <sean.j.christopherson@intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      432ec4fa
    • D
      f2fs: Fix use of number of devices · 70d33cce
      Damien Le Moal 提交于
      commit 0916878da355650d7e77104a7ac0fa1784eca852 upstream.
      
      For a single device mount using a zoned block device, the zone
      information for the device is stored in the sbi->devs single entry
      array and sbi->s_ndevs is set to 1. This differs from a single device
      mount using a regular block device which does not allocate sbi->devs
      and sets sbi->s_ndevs to 0.
      
      However, sbi->s_devs == 0 condition is used throughout the code to
      differentiate a single device mount from a multi-device mount where
      sbi->s_ndevs is always larger than 1. This results in problems with
      single zoned block device volumes as these are treated as multi-device
      mounts but do not have the start_blk and end_blk information set. One
      of the problem observed is skipping of zone discard issuing resulting in
      write commands being issued to full zones or unaligned to a zone write
      pointer.
      
      Fix this problem by simply treating the cases sbi->s_ndevs == 0 (single
      regular block device mount) and sbi->s_ndevs == 1 (single zoned block
      device mount) in the same manner. This is done by introducing the
      helper function f2fs_is_multi_device() and using this helper in place
      of direct tests of sbi->s_ndevs value, improving code readability.
      
      Fixes: 7bb3a371 ("f2fs: Fix zoned block device support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Reviewed-by: NChao Yu <yuchao0@huawei.com>
      Signed-off-by: NJaegeuk Kim <jaegeuk@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      70d33cce
    • J
      ext4: wait for outstanding dio during truncate in nojournal mode · 5220582c
      Jan Kara 提交于
      commit 82a25b027ca48d7ef197295846b352345853dfa8 upstream.
      
      We didn't wait for outstanding direct IO during truncate in nojournal
      mode (as we skip orphan handling in that case). This can lead to fs
      corruption or stale data exposure if truncate ends up freeing blocks
      and these get reallocated before direct IO finishes. Fix the condition
      determining whether the wait is necessary.
      
      CC: stable@vger.kernel.org
      Fixes: 1c9114f9 ("ext4: serialize unlocked dio reads with truncate")
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5220582c
    • J
      ext4: do not delete unlinked inode from orphan list on failed truncate · 71e430fd
      Jan Kara 提交于
      commit ee0ed02ca93ef1ecf8963ad96638795d55af2c14 upstream.
      
      It is possible that unlinked inode enters ext4_setattr() (e.g. if
      somebody calls ftruncate(2) on unlinked but still open file). In such
      case we should not delete the inode from the orphan list if truncate
      fails. Note that this is mostly a theoretical concern as filesystem is
      corrupted if we reach this path anyway but let's be consistent in our
      orphan handling.
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Cc: stable@kernel.org
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      71e430fd
    • S
      x86: Hide the int3_emulate_call/jmp functions from UML · 1d84eb87
      Steven Rostedt (VMware) 提交于
      commit 693713cbdb3a4bda5a8a678c31f06560bbb14657 upstream.
      
      User Mode Linux does not have access to the ip or sp fields of the pt_regs,
      and accessing them causes UML to fail to build. Hide the int3_emulate_jmp()
      and int3_emulate_call() instructions from UML, as it doesn't need them
      anyway.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      1d84eb87
  2. 26 5月, 2019 9 次提交
    • G
      Linux 4.19.46 · 8b2fc005
      Greg Kroah-Hartman 提交于
      8b2fc005
    • Y
      fbdev: sm712fb: fix memory frequency by avoiding a switch/case fallthrough · fcac7169
      Yifeng Li 提交于
      commit 9dc20113988b9a75ea6b3abd68dc45e2d73ccdab upstream.
      
      A fallthrough in switch/case was introduced in f627caf55b8e ("fbdev:
      sm712fb: fix crashes and garbled display during DPMS modesetting"),
      due to my copy-paste error, which would cause the memory clock frequency
      for SM720 to be programmed to SM712.
      
      Since it only reprograms the clock to a different frequency, it's only
      a benign issue without visible side-effect, so it also evaded Sudip
      Mukherjee's code review and regression tests. scripts/checkpatch.pl
      also failed to discover the issue, possibly due to nested switch
      statements.
      
      This issue was found by Stephen Rothwell by building linux-next with
      -Wimplicit-fallthrough.
      Reported-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Fixes: f627caf55b8e ("fbdev: sm712fb: fix crashes and garbled display during DPMS modesetting")
      Signed-off-by: NYifeng Li <tomli@tomli.me>
      Cc: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
      Cc: "Gustavo A. R. Silva" <gustavo@embeddedor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Signed-off-by: NBartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      fcac7169
    • D
      bpf, lru: avoid messing with eviction heuristics upon syscall lookup · 107e215c
      Daniel Borkmann 提交于
      commit 50b045a8c0ccf44f76640ac3eea8d80ca53979a3 upstream.
      
      One of the biggest issues we face right now with picking LRU map over
      regular hash table is that a map walk out of user space, for example,
      to just dump the existing entries or to remove certain ones, will
      completely mess up LRU eviction heuristics and wrong entries such
      as just created ones will get evicted instead. The reason for this
      is that we mark an entry as "in use" via bpf_lru_node_set_ref() from
      system call lookup side as well. Thus upon walk, all entries are
      being marked, so information of actual least recently used ones
      are "lost".
      
      In case of Cilium where it can be used (besides others) as a BPF
      based connection tracker, this current behavior causes disruption
      upon control plane changes that need to walk the map from user space
      to evict certain entries. Discussion result from bpfconf [0] was that
      we should simply just remove marking from system call side as no
      good use case could be found where it's actually needed there.
      Therefore this patch removes marking for regular LRU and per-CPU
      flavor. If there ever should be a need in future, the behavior could
      be selected via map creation flag, but due to mentioned reason we
      avoid this here.
      
        [0] http://vger.kernel.org/bpfconf.html
      
      Fixes: 29ba732a ("bpf: Add BPF_MAP_TYPE_LRU_HASH")
      Fixes: 8f844938 ("bpf: Add BPF_MAP_TYPE_LRU_PERCPU_HASH")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      107e215c
    • D
      bpf: add map_lookup_elem_sys_only for lookups from syscall side · 2bb3c547
      Daniel Borkmann 提交于
      commit c6110222c6f49ea68169f353565eb865488a8619 upstream.
      
      Add a callback map_lookup_elem_sys_only() that map implementations
      could use over map_lookup_elem() from system call side in case the
      map implementation needs to handle the latter differently than from
      the BPF data path. If map_lookup_elem_sys_only() is set, this will
      be preferred pick for map lookups out of user space. This hook is
      used in a follow-up fix for LRU map, but once development window
      opens, we can convert other map types from map_lookup_elem() (here,
      the one called upon BPF_MAP_LOOKUP_ELEM cmd is meant) over to use
      the callback to simplify and clean up the latter.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      2bb3c547
    • C
      bpf: relax inode permission check for retrieving bpf program · 3ded3aaa
      Chenbo Feng 提交于
      commit e547ff3f803e779a3898f1f48447b29f43c54085 upstream.
      
      For iptable module to load a bpf program from a pinned location, it
      only retrieve a loaded program and cannot change the program content so
      requiring a write permission for it might not be necessary.
      Also when adding or removing an unrelated iptable rule, it might need to
      flush and reload the xt_bpf related rules as well and triggers the inode
      permission check. It might be better to remove the write premission
      check for the inode so we won't need to grant write access to all the
      processes that flush and restore iptables rules.
      Signed-off-by: NChenbo Feng <fengc@google.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3ded3aaa
    • G
      Revert "selftests/bpf: skip verifier tests for unsupported program types" · c33563e9
      Greg Kroah-Hartman 提交于
      This reverts commit 118d38a3 which is
      commit 8184d44c9a577a2f1842ed6cc844bfd4a9981d8e upstream.
      
      Tommi reports that this patch breaks the build, it's not really needed
      so let's revert it.
      Reported-by: NTommi Rantala <tommi.t.rantala@nokia.com>
      Cc: Stanislav Fomichev <sdf@google.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      c33563e9
    • J
      driver core: Postpone DMA tear-down until after devres release for probe failure · 90110ffd
      John Garry 提交于
      commit 0b777eee88d712256ba8232a9429edb17c4f9ceb upstream.
      
      In commit 376991db4b64 ("driver core: Postpone DMA tear-down until after
      devres release"), we changed the ordering of tearing down the device DMA
      ops and releasing all the device's resources; this was because the DMA ops
      should be maintained until we release the device's managed DMA memories.
      
      However, we have seen another crash on an arm64 system when a
      device driver probe fails:
      
        hisi_sas_v3_hw 0000:74:02.0: Adding to iommu group 2
        scsi host1: hisi_sas_v3_hw
        BUG: Bad page state in process swapper/0  pfn:313f5
        page:ffff7e0000c4fd40 count:1 mapcount:0
        mapping:0000000000000000 index:0x0
        flags: 0xfffe00000001000(reserved)
        raw: 0fffe00000001000 ffff7e0000c4fd48 ffff7e0000c4fd48
      0000000000000000
        raw: 0000000000000000 0000000000000000 00000001ffffffff
      0000000000000000
        page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
        bad because of flags: 0x1000(reserved)
        Modules linked in:
        CPU: 49 PID: 1 Comm: swapper/0 Not tainted
      5.1.0-rc1-43081-g22d97fd-dirty #1433
        Hardware name: Huawei D06/D06, BIOS Hisilicon D06 UEFI
      RC0 - V1.12.01 01/29/2019
        Call trace:
        dump_backtrace+0x0/0x118
        show_stack+0x14/0x1c
        dump_stack+0xa4/0xc8
        bad_page+0xe4/0x13c
        free_pages_check_bad+0x4c/0xc0
        __free_pages_ok+0x30c/0x340
        __free_pages+0x30/0x44
        __dma_direct_free_pages+0x30/0x38
        dma_direct_free+0x24/0x38
        dma_free_attrs+0x9c/0xd8
        dmam_release+0x20/0x28
        release_nodes+0x17c/0x220
        devres_release_all+0x34/0x54
        really_probe+0xc4/0x2c8
        driver_probe_device+0x58/0xfc
        device_driver_attach+0x68/0x70
        __driver_attach+0x94/0xdc
        bus_for_each_dev+0x5c/0xb4
        driver_attach+0x20/0x28
        bus_add_driver+0x14c/0x200
        driver_register+0x6c/0x124
        __pci_register_driver+0x48/0x50
        sas_v3_pci_driver_init+0x20/0x28
        do_one_initcall+0x40/0x25c
        kernel_init_freeable+0x2b8/0x3c0
        kernel_init+0x10/0x100
        ret_from_fork+0x10/0x18
        Disabling lock debugging due to kernel taint
        BUG: Bad page state in process swapper/0  pfn:313f6
        page:ffff7e0000c4fd80 count:1 mapcount:0
      mapping:0000000000000000 index:0x0
      [   89.322983] flags: 0xfffe00000001000(reserved)
        raw: 0fffe00000001000 ffff7e0000c4fd88 ffff7e0000c4fd88
      0000000000000000
        raw: 0000000000000000 0000000000000000 00000001ffffffff
      0000000000000000
      
      The crash occurs for the same reason.
      
      In this case, on the really_probe() failure path, we are still clearing
      the DMA ops prior to releasing the device's managed memories.
      
      This patch fixes this issue by reordering the DMA ops teardown and the
      call to devres_release_all() on the failure path.
      Reported-by: NXiang Chen <chenxiang66@hisilicon.com>
      Tested-by: NXiang Chen <chenxiang66@hisilicon.com>
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      [jpg: backport to 4.19.x and earlier]
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      90110ffd
    • N
      md/raid: raid5 preserve the writeback action after the parity check · 43090805
      Nigel Croxon 提交于
      commit b2176a1dfb518d870ee073445d27055fea64dfb8 upstream.
      
      The problem is that any 'uptodate' vs 'disks' check is not precise
      in this path. Put a "WARN_ON(!test_bit(R5_UPTODATE, &dev->flags)" on the
      device that might try to kick off writes and then skip the action.
      Better to prevent the raid driver from taking unexpected action *and* keep
      the system alive vs killing the machine with BUG_ON.
      
      Note: fixed warning reported by kbuild test robot <lkp@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NNigel Croxon <ncroxon@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      43090805
    • S
      Revert "Don't jump to compute_result state from check_result state" · 3d25b7f5
      Song Liu 提交于
      commit a25d8c327bb41742dbd59f8c545f59f3b9c39983 upstream.
      
      This reverts commit 4f4fd7c5798bbdd5a03a60f6269cf1177fbd11ef.
      
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Nigel Croxon <ncroxon@redhat.com>
      Cc: Xiao Ni <xni@redhat.com>
      Signed-off-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      3d25b7f5