1. 14 Feb, 2023 2 commits
  2. 13 Feb, 2023 1 commit
  3. 09 Feb, 2023 37 commits
    • cifs: do not include page data when checking signature · 38116603
      Enzo Matsumiya committed
      stable inclusion
      from stable-v4.19.271
      commit 19f0577dd34b250e1595f8dd577d9c2b6c1dc85d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v4.19.271&id=19f0577dd34b250e1595f8dd577d9c2b6c1dc85d
      
      --------------------------------
      
      commit 30b2b219 upstream.
      
      On async reads, page data is allocated before sending.  When the
      response is received but it has no data to fill (e.g.
      STATUS_END_OF_FILE), __calc_signature() will still include the pages in
      its computation, leading to an invalid signature check.
      
      This patch fixes this by not setting the async read smb_rqst page data
      (zeroed by default) if its got_bytes is 0.
      
      This can be reproduced/verified with xfstests generic/465.
      
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Enzo Matsumiya <ematsumiya@suse.de>
      Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Conflict:
        fs/cifs/smb2pdu.c
      Signed-off-by: Li Lingfeng <lilingfeng3@huawei.com>
      Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • SUNRPC: Don't leak netobj memory when gss_read_proxy_verf() fails · 40124251
      Chuck Lever committed
      stable inclusion
      from stable-v4.19.270
      commit 76f2497a2faa6a4e91efb94a7f55705b403273fd
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit da522b5f upstream.
      
      Fixes: 030d794b ("SUNRPC: Use gssproxy upcall for server RPCGSS authentication.")
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Conflicts:
      	net/sunrpc/auth_gss/svcauth_gss.c
      Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
      Reviewed-by: Liu Jian <liujian56@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • net: stream: purge sk_error_queue in sk_stream_kill_queues() · 8512d066
      Eric Dumazet committed
      stable inclusion
      from stable-v4.19.270
      commit 6f00bd0402a1e3d2d556afba57c045bd7931e4d3
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6f00bd0402a1e3d2d556afba57c045bd7931e4d3
      
      --------------------------------
      
      [ Upstream commit e0c8bccd ]
      
      Changheon Lee reported TCP socket leaks, with a nice repro.
      
      It seems we leak TCP sockets with the following sequence:
      
      1) SOF_TIMESTAMPING_TX_ACK is enabled on the socket.
      
         Each ACK will cook an skb put in error queue, from __skb_tstamp_tx().
         __skb_tstamp_tx() is using skb_clone(), unless
         SOF_TIMESTAMPING_OPT_TSONLY was also requested.
      
      2) If the application is also using MSG_ZEROCOPY, then we put in the
         error queue cloned skbs that had a struct ubuf_info attached to them.
      
         Whenever a struct ubuf_info is allocated, sock_zerocopy_alloc()
         does a sock_hold().
      
         As long as the cloned skbs are still in sk_error_queue,
         socket refcount is kept elevated.
      
      3) Application closes the socket, while error queue is not empty.
      
      Since tcp_close() no longer purges the socket error queue,
      we might end up with a TCP socket with at least one skb in
      error queue keeping the socket alive forever.
      
      This bug can be (ab)used to consume all kernel memory
      and freeze the host.
      
      We need to purge the error queue, with proper synchronization
      against concurrent writers.
      
      Fixes: 24bcbe1c ("net: stream: don't purge sk_error_queue in sk_stream_kill_queues()")
      Reported-by: Changheon Lee <darklight2357@icloud.com>
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Liu Jian <liujian56@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • net: stream: don't purge sk_error_queue in sk_stream_kill_queues() · d47b2f0c
      Jakub Kicinski committed
      stable inclusion
      from stable-v4.19.218
      commit 8b8b3d738e450d2c2ccdc75f0ab5a951746c2a96
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8b8b3d738e450d2c2ccdc75f0ab5a951746c2a96
      
      --------------------------------
      
      [ Upstream commit 24bcbe1c ]
      
      sk_stream_kill_queues() can be called on close when there are
      still outstanding skbs to transmit. Those skbs may try to queue
      notifications to the error queue (e.g. timestamps).
      If sk_stream_kill_queues() purges the queue without taking
      its lock the queue may get corrupted, and skbs leaked.
      
      This shows up as a warning about an rmem leak:
      
      WARNING: CPU: 24 PID: 0 at net/ipv4/af_inet.c:154 inet_sock_destruct+0x...
      
      The leak is always a multiple of 0x300 bytes (the value is in
      %rax on my builds, so RAX: 0000000000000300). 0x300 is truesize of
      an empty sk_buff. Indeed if we dump the socket state at the time
      of the warning the sk_error_queue is often (but not always)
      corrupted. The ->next pointer points back at the list head,
      but not the ->prev pointer. Indeed we can find the leaked skb
      by scanning the kernel memory for something that looks like
      an skb with ->sk = socket in question, and ->truesize = 0x300.
      The contents of ->cb[] of the skb confirms the suspicion that
      it is indeed a timestamp notification (as generated in
      __skb_complete_tx_timestamp()).
      
      Removing purging of sk_error_queue should be okay, since
      inet_sock_destruct() does it again once all socket refs
      are gone. Eric suggests this may cause sockets that go
      thru disconnect() to maintain notifications from the
      previous incarnations of the socket, but that should be
      okay since the race was there anyway, and disconnect()
      is not exactly dependable.
      
      Thanks to Jonathan Lemon and Omar Sandoval for help at various
      stages of tracing the issue.
      
      Fixes: cb9eff09 ("net: new user space API for time stamping of incoming and outgoing packets")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Reviewed-by: Liu Jian <liujian56@huawei.com>
      Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • ext4: fix deadlock due to mbcache entry corruption · 34b18eb0
      Jan Kara committed
      stable inclusion
      from stable-v4.19.270
      commit efaa0ca678f56d47316a08030b2515678cebbc50
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit a44e84a9 ]
      
      When manipulating xattr blocks, we can deadlock infinitely looping
      inside ext4_xattr_block_set() where we constantly keep finding xattr
      block for reuse in mbcache but we are unable to reuse it because its
      reference count is too big. This happens because cache entry for the
      xattr block is marked as reusable (e_reusable set) although its
      reference count is too big. When this inconsistency happens, this
      inconsistent state is kept indefinitely and so ext4_xattr_block_set()
      keeps retrying indefinitely.
      
      The inconsistent state is caused by non-atomic update of e_reusable bit.
      e_reusable is part of a bitfield and e_reusable update can race with
      update of e_referenced bit in the same bitfield resulting in loss of one
      of the updates. Fix the problem by using atomic bitops instead.
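
A minimal user-space sketch of the idea, assuming C11 atomics as a stand-in for the kernel's atomic bitops. The flag names, the struct, and the helper functions are hypothetical and do not reproduce the real mbcache definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bits; the real mbcache entry layout may differ. */
enum { ENTRY_REFERENCED = 1u << 0, ENTRY_REUSABLE = 1u << 1 };

struct cache_entry {
	atomic_uint flags; /* both flags share this single word */
};

/*
 * Each helper performs one atomic read-modify-write on the shared word,
 * so a concurrent update of the other bit cannot be lost.
 */
static void entry_set_flag(struct cache_entry *e, unsigned int bit)
{
	assert(e);
	atomic_fetch_or(&e->flags, bit);
}

static void entry_clear_flag(struct cache_entry *e, unsigned int bit)
{
	assert(e);
	atomic_fetch_and(&e->flags, ~bit);
}

static bool entry_test_flag(struct cache_entry *e, unsigned int bit)
{
	assert(e);
	return (atomic_load(&e->flags) & bit) != 0;
}
```

With a plain C bitfield, setting one flag compiles to a load, a modify, and a store of the whole containing word, which is where a racing update of the neighbouring flag can be overwritten.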
      
      This bug has been around for many years, but it became *much* easier
      to hit after commit 65f8b800 ("ext4: fix race when reusing xattr
      blocks").
      
      Cc: stable@vger.kernel.org
      Fixes: 6048c64b ("mbcache: add reusable flag to cache entries")
      Fixes: 65f8b800 ("ext4: fix race when reusing xattr blocks")
      Reported-and-tested-by: Jeremi Piotrowski <jpiotrowski@linux.microsoft.com>
      Reported-by: Thilo Fromm <t-lo@linux.microsoft.com>
      Link: https://lore.kernel.org/r/c77bf00f-4618-7149-56f1-b8d1664b9d07@linux.microsoft.com/
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20221123193950.16758-1-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • mbcache: automatically delete entries from cache on freeing · 5ae65557
      Jan Kara committed
      stable inclusion
      from stable-v4.19.270
      commit 61dc6cdfc85000e305a58553d41036716b427a0d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 307af6c8 ]
      
      Use the fact that entries with elevated refcount are not removed from
      the hash and just move removal of the entry from the hash to the entry
      freeing time. When doing this we also change the generic code to hold
      one reference to the cache entry, not two of them, which makes code
      somewhat more obvious.

      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220712105436.32204-10-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Stable-dep-of: a44e84a9 ("ext4: fix deadlock due to mbcache entry corruption")
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • mm/khugepaged: invoke MMU notifiers in shmem/file collapse paths · 4cd9f02e
      Jann Horn committed
      stable inclusion
      from stable-v4.19.270
      commit ff2a1a6f869650aec99e9d070b5ab625bfbc5bc3
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit f268f6cf upstream.
      
      Any codepath that zaps page table entries must invoke MMU notifiers to
      ensure that secondary MMUs (like KVM) don't keep accessing pages which
      aren't mapped anymore.  Secondary MMUs don't hold their own references to
      pages that are mirrored over, so failing to notify them can lead to page
      use-after-free.
      
      I'm marking this as addressing an issue introduced in commit f3f0e1d2
      ("khugepaged: add support of collapse for tmpfs/shmem pages"), but most of
      the security impact of this only came in commit 27e1f827 ("khugepaged:
      enable collapse pmd for pte-mapped THP"), which actually omitted flushes
      for the removal of present PTEs, not just for the removal of empty page
      tables.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-3-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-3-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-3-jannh@google.com
      Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
      Signed-off-by: Jann Horn <jannh@google.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [manual backport: this code was refactored from two copies into a common
      helper between 5.15 and 6.0;
      pmd collapse for PTE-mapped THP was only added in 5.4;
      MMU notifier API changed between 4.19 and 5.4]
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • mm/khugepaged: fix GUP-fast interaction by sending IPI · 5fc2a808
      Jann Horn committed
      stable inclusion
      from stable-v4.19.270
      commit f0700ae26832550ce497a568789c3fceeb44d753
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 2ba99c5e upstream.
      
      Since commit 70cbc3cc ("mm: gup: fix the fast GUP race against THP
      collapse"), the lockless_pages_from_mm() fastpath rechecks the pmd_t to
      ensure that the page table was not removed by khugepaged in between.
      
      However, lockless_pages_from_mm() still requires that the page table is
      not concurrently freed.  Fix it by sending IPIs (if the architecture uses
      semi-RCU-style page table freeing) before freeing/reusing page tables.
      
      Link: https://lkml.kernel.org/r/20221129154730.2274278-2-jannh@google.com
      Link: https://lkml.kernel.org/r/20221128180252.1684965-2-jannh@google.com
      Link: https://lkml.kernel.org/r/20221125213714.4115729-2-jannh@google.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Jann Horn <jannh@google.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      [manual backport: two of the three places in khugepaged that can free
      ptes were refactored into a common helper between 5.15 and 6.0;
      TLB flushing was refactored between 5.4 and 5.10;
      TLB flushing was refactored between 4.19 and 5.4;
      pmd collapse for PTE-mapped THP was only added in 5.4;
      ugly hack needed in <=4.19 for s390 and arm]
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Conflicts:
      	mm/memory.c
      	mm/mmu_gather.c
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • mm: gup: fix the fast GUP race against THP collapse · ca1eb2d4
      Yang Shi committed
      stable inclusion
      from stable-v5.10.148
      commit 377c60dd32d3289788bdb3d8840382f79d42139b
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 70cbc3cc upstream.
      
      Since general RCU GUP fast was introduced in commit 2667f50e ("mm:
      introduce a general RCU get_user_pages_fast()"), a TLB flush is no longer
      sufficient to handle concurrent GUP-fast in all cases; it only handles
      traditional IPI-based GUP-fast correctly.  On architectures that send an
      IPI broadcast on TLB flush, it works as expected.  But on the
      architectures that do not use IPI to broadcast TLB flush, it may have the
      below race:
      
         CPU A                                          CPU B
      THP collapse                                     fast GUP
                                                    gup_pmd_range() <-- see valid pmd
                                                        gup_pte_range() <-- work on pte
      pmdp_collapse_flush() <-- clear pmd and flush
      __collapse_huge_page_isolate()
          check page pinned <-- before GUP bump refcount
                                                            pin the page
                                                            check PTE <-- no change
      __collapse_huge_page_copy()
          copy data to huge page
          ptep_clear()
      install huge pmd for the huge page
                                                            return the stale page
      discard the stale page
      
      The race can be fixed by checking whether PMD is changed or not after
      taking the page pin in fast GUP, just like what it does for PTE.  If the
      PMD is changed it means there may be parallel THP collapse, so GUP should
      back off.
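
The re-check can be sketched in user space as follows. This is only an illustration under stated assumptions: the struct, its fields, and pin_page_fast() are hypothetical stand-ins, and C11 atomics replace the kernel's page-table and refcount primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for a PMD entry and a page pin count. */
struct walk_state {
	atomic_ulong pmd;      /* current PMD value; a collapse changes it */
	atomic_int   page_ref; /* pin count of the target page */
};

/*
 * Pin the page, then confirm the PMD still holds the value sampled
 * before the pin. If it changed, drop the pin and report failure so
 * the caller can back off.
 */
static bool pin_page_fast(struct walk_state *w, unsigned long pmd_snapshot)
{
	assert(w);
	atomic_fetch_add(&w->page_ref, 1);          /* take the pin */
	if (atomic_load(&w->pmd) != pmd_snapshot) { /* re-check after the pin */
		atomic_fetch_sub(&w->page_ref, 1);  /* undo the pin, back off */
		return false;
	}
	return true;
}
```

The order matters: the pin is taken first and the PMD is compared afterwards, so a collapse that cleared the PMD before the pin became visible is always noticed.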
      
      Also update the stale comment about serializing against fast GUP in
      khugepaged.
      
      Link: https://lkml.kernel.org/r/20220907180144.555485-1-shy828301@gmail.com
      Fixes: 2667f50e ("mm: introduce a general RCU get_user_pages_fast()")
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Peter Xu <peterx@redhat.com>
      Signed-off-by: Yang Shi <shy828301@gmail.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Conflicts:
      	mm/gup.c
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • prlimit: do_prlimit needs to have a speculation check · b3701e78
      Greg Kroah-Hartman committed
      stable inclusion
      from stable-v4.19.271
      commit d3ee91e50a6b3c5a45398e3dcb912a8a264f575c
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 73979060 upstream.
      
      do_prlimit() adds the user-controlled resource value to a pointer that
      will subsequently be dereferenced.  In order to help prevent this
      codepath from being used as a spectre "gadget" a barrier needs to be
      added after checking the range.
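
A rough user-space sketch of the clamp-after-check pattern. In the kernel this job is done by helpers such as array_index_nospec(). The names index_mask(), lookup_limit(), and NLIMITS below are hypothetical, and plain C like this gives no real guarantee against compiler transformations or speculation; it only shows the arithmetic.

```c
#include <assert.h>
#include <limits.h>

#define NLIMITS 16UL /* hypothetical table size standing in for RLIM_NLIMITS */

/*
 * All-ones when index < size, zero otherwise, computed without a branch.
 * Assumes size <= LONG_MAX and an arithmetic right shift of negative
 * values (the behaviour of GCC and Clang).
 */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	assert(size > 0 && size <= (unsigned long)LONG_MAX);
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (sizeof(long) * CHAR_BIT - 1));
}

/* Bounds check first, then clamp the index before it reaches the array. */
static long lookup_limit(const long *table, unsigned long resource)
{
	assert(table);
	if (resource >= NLIMITS)
		return -1;
	resource &= index_mask(resource, NLIMITS);
	return table[resource];
}
```

Because the mask is derived from data rather than from the branch outcome, an out-of-range index is forced to zero even on a path where the bounds check was mispredicted.
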
      Reported-by: Jordy Zomer <jordyzomer@google.com>
      Tested-by: Jordy Zomer <jordyzomer@google.com>
      Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • arm64: cmpxchg_double*: hazard against entire exchange variable · 89d288a0
      Mark Rutland committed
      stable inclusion
      from stable-v4.19.270
      commit 6ad3636bd8419b29dc85fadb3e50caa8f91cbc79
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 031af500 ]
      
      The inline assembly for arm64's cmpxchg_double*() implementations use a
      +Q constraint to hazard against other accesses to the memory location
      being exchanged. However, the pointer passed to the constraint is a
      pointer to unsigned long, and thus the hazard only applies to the first
      8 bytes of the location.
      
      GCC can take advantage of this, assuming that other portions of the
      location are unchanged, leading to a number of potential problems.
      
      This is similar to what we fixed back in commit:
      
        fee960be ("arm64: xchg: hazard against entire exchange variable")
      
      ... but we forgot to adjust cmpxchg_double*() similarly at the same
      time.
      
      The same problem applies, as demonstrated with the following test:
      
      | struct big {
      |         u64 lo, hi;
      | } __aligned(128);
      |
      | unsigned long foo(struct big *b)
      | {
      |         u64 hi_old, hi_new;
      |
      |         hi_old = b->hi;
      |         cmpxchg_double_local(&b->lo, &b->hi, 0x12, 0x34, 0x56, 0x78);
      |         hi_new = b->hi;
      |
      |         return hi_old ^ hi_new;
      | }
      
      ... which GCC 12.1.0 compiles as:
      
      | 0000000000000000 <foo>:
      |    0:   d503233f        paciasp
      |    4:   aa0003e4        mov     x4, x0
      |    8:   1400000e        b       40 <foo+0x40>
      |    c:   d2800240        mov     x0, #0x12                       // #18
      |   10:   d2800681        mov     x1, #0x34                       // #52
      |   14:   aa0003e5        mov     x5, x0
      |   18:   aa0103e6        mov     x6, x1
      |   1c:   d2800ac2        mov     x2, #0x56                       // #86
      |   20:   d2800f03        mov     x3, #0x78                       // #120
      |   24:   48207c82        casp    x0, x1, x2, x3, [x4]
      |   28:   ca050000        eor     x0, x0, x5
      |   2c:   ca060021        eor     x1, x1, x6
      |   30:   aa010000        orr     x0, x0, x1
      |   34:   d2800000        mov     x0, #0x0                        // #0    <--- BANG
      |   38:   d50323bf        autiasp
      |   3c:   d65f03c0        ret
      |   40:   d2800240        mov     x0, #0x12                       // #18
      |   44:   d2800681        mov     x1, #0x34                       // #52
      |   48:   d2800ac2        mov     x2, #0x56                       // #86
      |   4c:   d2800f03        mov     x3, #0x78                       // #120
      |   50:   f9800091        prfm    pstl1strm, [x4]
      |   54:   c87f1885        ldxp    x5, x6, [x4]
      |   58:   ca0000a5        eor     x5, x5, x0
      |   5c:   ca0100c6        eor     x6, x6, x1
      |   60:   aa0600a6        orr     x6, x5, x6
      |   64:   b5000066        cbnz    x6, 70 <foo+0x70>
      |   68:   c8250c82        stxp    w5, x2, x3, [x4]
      |   6c:   35ffff45        cbnz    w5, 54 <foo+0x54>
      |   70:   d2800000        mov     x0, #0x0                        // #0     <--- BANG
      |   74:   d50323bf        autiasp
      |   78:   d65f03c0        ret
      
      Notice that at the lines with "BANG" comments, GCC has assumed that the
      higher 8 bytes are unchanged by the cmpxchg_double() call, and that
      `hi_old ^ hi_new` can be reduced to a constant zero, for both LSE and
      LL/SC versions of cmpxchg_double().
      
      This patch fixes the issue by passing a pointer to __uint128_t into the
      +Q constraint, ensuring that the compiler hazards against the entire 16
      bytes being modified.
      
      With this change, GCC 12.1.0 compiles the above test as:
      
      | 0000000000000000 <foo>:
      |    0:   f9400407        ldr     x7, [x0, #8]
      |    4:   d503233f        paciasp
      |    8:   aa0003e4        mov     x4, x0
      |    c:   1400000f        b       48 <foo+0x48>
      |   10:   d2800240        mov     x0, #0x12                       // #18
      |   14:   d2800681        mov     x1, #0x34                       // #52
      |   18:   aa0003e5        mov     x5, x0
      |   1c:   aa0103e6        mov     x6, x1
      |   20:   d2800ac2        mov     x2, #0x56                       // #86
      |   24:   d2800f03        mov     x3, #0x78                       // #120
      |   28:   48207c82        casp    x0, x1, x2, x3, [x4]
      |   2c:   ca050000        eor     x0, x0, x5
      |   30:   ca060021        eor     x1, x1, x6
      |   34:   aa010000        orr     x0, x0, x1
      |   38:   f9400480        ldr     x0, [x4, #8]
      |   3c:   d50323bf        autiasp
      |   40:   ca0000e0        eor     x0, x7, x0
      |   44:   d65f03c0        ret
      |   48:   d2800240        mov     x0, #0x12                       // #18
      |   4c:   d2800681        mov     x1, #0x34                       // #52
      |   50:   d2800ac2        mov     x2, #0x56                       // #86
      |   54:   d2800f03        mov     x3, #0x78                       // #120
      |   58:   f9800091        prfm    pstl1strm, [x4]
      |   5c:   c87f1885        ldxp    x5, x6, [x4]
      |   60:   ca0000a5        eor     x5, x5, x0
      |   64:   ca0100c6        eor     x6, x6, x1
      |   68:   aa0600a6        orr     x6, x5, x6
      |   6c:   b5000066        cbnz    x6, 78 <foo+0x78>
      |   70:   c8250c82        stxp    w5, x2, x3, [x4]
      |   74:   35ffff45        cbnz    w5, 5c <foo+0x5c>
      |   78:   f9400480        ldr     x0, [x4, #8]
      |   7c:   d50323bf        autiasp
      |   80:   ca0000e0        eor     x0, x7, x0
      |   84:   d65f03c0        ret
      
      ... sampling the high 8 bytes before and after the cmpxchg, and
      performing an EOR, as we'd expect.
      
      For backporting, I've tested this atop linux-4.9.y with GCC 5.5.0. Note
      that linux-4.9.y is the oldest currently supported stable release, and
      mandates GCC 5.1+. Unfortunately I couldn't get a GCC 5.1 binary to run
      on my machines due to library incompatibilities.
      
      I've also used a standalone test to check that we can use a __uint128_t
      pointer in a +Q constraint at least as far back as GCC 4.8.5 and LLVM
      3.9.1.
      
      Fixes: 5284e1b4 ("arm64: xchg: Implement cmpxchg_double")
      Fixes: e9a4b795 ("arm64: cmpxchg_dbl: patch in lse instructions when supported by the CPU")
      Reported-by: Boqun Feng <boqun.feng@gmail.com>
      Link: https://lore.kernel.org/lkml/Y6DEfQXymYVgL3oJ@boqun-archlinux/
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/lkml/Y6GXoO4qmH9OIZ5Q@hirez.programming.kicks-ass.net/
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: stable@vger.kernel.org
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Will Deacon <will@kernel.org>
      Link: https://lore.kernel.org/r/20230104151626.3262137-1-mark.rutland@arm.com
      Signed-off-by: Will Deacon <will@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • net/ulp: prevent ULP without clone op from entering the LISTEN status · aeb315d8
      Paolo Abeni committed
      stable inclusion
      from stable-v4.19.270
      commit 755193f2523ce5157c2f844a4b6d16b95593f830
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 2c02d41d upstream.
      
      When an ULP-enabled socket enters the LISTEN status, the listener ULP data
      pointer is copied inside the child/accepted sockets by sk_clone_lock().
      
      The relevant ULP can take care of de-duplicating the context pointer via
      the clone() operation, but only MPTCP and SMC implement such op.
      
      Other ULPs may end up with a double-free at socket disposal time.
      
      We can't simply clear the ULP data at clone time, as TLS replaces the
      socket ops with custom ones assuming a valid TLS ULP context is
      available.
      
      Instead completely prevent clone-less ULP sockets from entering the
      LISTEN status.
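
The rule can be sketched in user space like this. The structs, demo_clone(), and sketch_listen() are hypothetical and only mirror the shape of the check; the -EINVAL return value is an assumption for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a ULP operations table. */
struct ulp_ops {
	const char *name;
	void (*clone)(void *child_ctx); /* NULL: context cannot be duplicated */
};

/* Hypothetical stand-in for a socket. */
struct sock_sketch {
	const struct ulp_ops *ulp; /* NULL when no ULP is attached */
	int listening;
};

/* Example clone op that does nothing; only its presence matters here. */
static void demo_clone(void *child_ctx)
{
	(void)child_ctx;
}

/* Refuse LISTEN when a ULP is attached but offers no clone op. */
static int sketch_listen(struct sock_sketch *sk)
{
	assert(sk);
	if (sk->ulp && !sk->ulp->clone)
		return -EINVAL;
	sk->listening = 1;
	return 0;
}
```

Rejecting the transition up front means no child socket is ever created with a shared context pointer, so there is nothing to double-free later.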
      
      Fixes: 734942cc ("tcp: ULP infrastructure")
      Reported-by: slipper <slipper.alive@gmail.com>
      Signed-off-by: Paolo Abeni <pabeni@redhat.com>
      Link: https://lore.kernel.org/r/4b80c3d1dbe3d0ab072f80450c202d9bc88b4b03.1672740602.git.pabeni@redhat.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • driver core: Fix bus_type.match() error handling in __driver_attach() · 7798e957
      Isaac J. Manjarres committed
      stable inclusion
      from stable-v4.19.270
      commit 728c23ee14f01858632625556e51c2d1db4a414e
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 27c0d217 upstream.
      
      When a driver registers with a bus, it will attempt to match with every
      device on the bus through the __driver_attach() function. Currently, if
      the bus_type.match() function encounters an error that is not
      -EPROBE_DEFER, __driver_attach() will return a negative error code, which
      causes the driver registration logic to stop trying to match with the
      remaining devices on the bus.
      
      This behavior is not correct; a failure while matching a driver to a
      device does not mean that the driver won't be able to match and bind
      with other devices on the bus. Update the logic in __driver_attach()
      to reflect this.
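
The intended behavior can be sketched in user space as below. The callback type, demo_match(), and attach_all() are hypothetical, and devices are reduced to plain integers. EPROBE_DEFER is defined locally because it is a kernel-internal error code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517 /* kernel-internal value, absent from user-space errno.h */
#endif

typedef int (*match_fn)(int dev);

/* Hypothetical match callback: >0 match, 0 no match, <0 error. */
static int demo_match(int dev)
{
	if (dev < 0)
		return -EINVAL;
	if (dev == 100)
		return -EPROBE_DEFER;
	return dev != 0;
}

/* Walk every device; an error on one device never stops the walk. */
static int attach_all(const int *devs, size_t n, match_fn match, int *deferred)
{
	int bound = 0;

	assert(match && deferred);
	for (size_t i = 0; i < n; i++) {
		int ret = match(devs[i]);

		if (ret == -EPROBE_DEFER)
			(*deferred)++; /* remember the device for a later retry */
		else if (ret > 0)
			bound++;       /* matched: the driver would be probed here */
		/* ret == 0 (no match) or another error: go on to the next device */
	}
	return bound;
}
```

For the device list {1, -5, 2, 100, 0, 3}, the error on -5 and the deferral on 100 do not prevent 2 and 3 from being matched.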
      
      Fixes: 656b8035 ("ARM: 8524/1: driver cohandle -EPROBE_DEFER from bus_type.match()")
      Cc: stable@vger.kernel.org
      Cc: Saravana Kannan <saravanak@google.com>
      Signed-off-by: Isaac J. Manjarres <isaacmanjarres@google.com>
      Link: https://lore.kernel.org/r/20220921001414.4046492-1-isaacmanjarres@google.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • md: fix a crash in mempool_free · 10fddc40
      Mikulas Patocka committed
      stable inclusion
      from stable-v4.19.270
      commit b5be563b4356b3089b3245d024cae3f248ba7090
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 341097ee upstream.
      
      There's a crash in mempool_free when running the lvm test
      shell/lvchange-rebuild-raid.sh.
      
      The reason for the crash is this:
      * super_written calls atomic_dec_and_test(&mddev->pending_writes) and
        wake_up(&mddev->sb_wait). Then it calls rdev_dec_pending(rdev, mddev)
        and bio_put(bio).
      * so, the process that waited on sb_wait and that is woken up is racing
        with bio_put(bio).
      * if the process wins the race, it calls bioset_exit before bio_put(bio)
        is executed.
      * bio_put(bio) attempts to free a bio into a destroyed bio set - causing
        a crash in mempool_free.
      
      We fix this bug by moving bio_put before atomic_dec_and_test.
      
      We also move rdev_dec_pending before atomic_dec_and_test as suggested by
      Neil Brown.
      
      The function md_end_flush has a similar bug - we must call bio_put before
      we decrement the number of in-progress bios.
      
       BUG: kernel NULL pointer dereference, address: 0000000000000000
       #PF: supervisor write access in kernel mode
       #PF: error_code(0x0002) - not-present page
       PGD 11557f0067 P4D 11557f0067 PUD 0
       Oops: 0002 [#1] PREEMPT SMP
       CPU: 0 PID: 73 Comm: kworker/0:1 Not tainted 6.1.0-rc3 #5
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
       Workqueue: kdelayd flush_expired_bios [dm_delay]
       RIP: 0010:mempool_free+0x47/0x80
       Code: 48 89 ef 5b 5d ff e0 f3 c3 48 89 f7 e8 32 45 3f 00 48 63 53 08 48 89 c6 3b 53 04 7d 2d 48 8b 43 10 8d 4a 01 48 89 df 89 4b 08 <48> 89 2c d0 e8 b0 45 3f 00 48 8d 7b 30 5b 5d 31 c9 ba 01 00 00 00
       RSP: 0018:ffff88910036bda8 EFLAGS: 00010093
       RAX: 0000000000000000 RBX: ffff8891037b65d8 RCX: 0000000000000001
       RDX: 0000000000000000 RSI: 0000000000000202 RDI: ffff8891037b65d8
       RBP: ffff8891447ba240 R08: 0000000000012908 R09: 00000000003d0900
       R10: 0000000000000000 R11: 0000000000173544 R12: ffff889101a14000
       R13: ffff8891562ac300 R14: ffff889102b41440 R15: ffffe8ffffa00d05
       FS:  0000000000000000(0000) GS:ffff88942fa00000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 0000000000000000 CR3: 0000001102e99000 CR4: 00000000000006b0
       Call Trace:
        <TASK>
        clone_endio+0xf4/0x1c0 [dm_mod]
        clone_endio+0xf4/0x1c0 [dm_mod]
        __submit_bio+0x76/0x120
        submit_bio_noacct_nocheck+0xb6/0x2a0
        flush_expired_bios+0x28/0x2f [dm_delay]
        process_one_work+0x1b4/0x300
        worker_thread+0x45/0x3e0
        ? rescuer_thread+0x380/0x380
        kthread+0xc2/0x100
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
       Modules linked in: brd dm_delay dm_raid dm_mod af_packet uvesafb cfbfillrect cfbimgblt cn cfbcopyarea fb font fbdev tun autofs4 binfmt_misc configfs ipv6 virtio_rng virtio_balloon rng_core virtio_net pcspkr net_failover failover qemu_fw_cfg button mousedev raid10 raid456 libcrc32c async_raid6_recov async_memcpy async_pq raid6_pq async_xor xor async_tx raid1 raid0 md_mod sd_mod t10_pi crc64_rocksoft crc64 virtio_scsi scsi_mod evdev psmouse bsg scsi_common [last unloaded: brd]
       CR2: 0000000000000000
       ---[ end trace 0000000000000000 ]---

      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • J
      bpf: pull before calling skb_postpull_rcsum() · 1bd6a4c1
      Jakub Kicinski committed
      stable inclusion
      from stable-v4.19.270
      commit 31f7a52168c67e70a521d7acb8b0c8b6c95e7abd
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 54c3f1a8 ]
      
      Anand hit a BUG() when pulling off headers on egress to a SW tunnel.
      We get to skb_checksum_help() with an invalid checksum offset
      (commit d7ea0d9d ("net: remove two BUG() from skb_checksum_help()")
      converted those BUGs to WARN_ONs()).
      He points out oddness in how skb_postpull_rcsum() gets used.
      Indeed looks like we should pull before "postpull", otherwise
      the CHECKSUM_PARTIAL fixup from skb_postpull_rcsum() will not
      be able to do its job:
      
      	if (skb->ip_summed == CHECKSUM_PARTIAL &&
      	    skb_checksum_start_offset(skb) < 0)
      		skb->ip_summed = CHECKSUM_NONE;
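
      Why the order matters can be sketched with a heavily simplified packet
      model. Everything below is hypothetical (struct pkt, pull(),
      postpull_fixup() and the offsets are not kernel APIs); it only mirrors
      the CHECKSUM_PARTIAL condition quoted above.

```c
#include <assert.h>

enum csum_state { CSUM_NONE, CSUM_PARTIAL };

/* Hypothetical packet: offsets are counted from the start of the buffer. */
struct pkt {
    int data_off;     /* where the packet data currently starts */
    int csum_start;   /* where checksumming is supposed to start */
    enum csum_state ip_summed;
};

/* Checksum start relative to the current data pointer. */
static int csum_start_offset(const struct pkt *p)
{
    return p->csum_start - p->data_off;
}

/* Models pulling len bytes of headers off the front. */
static void pull(struct pkt *p, int len)
{
    p->data_off += len;
}

/* Models the CHECKSUM_PARTIAL fixup quoted in the commit message. */
static void postpull_fixup(struct pkt *p)
{
    if (p->ip_summed == CSUM_PARTIAL && csum_start_offset(p) < 0)
        p->ip_summed = CSUM_NONE;
}

/* Old order: fixup runs while the offset still looks valid. */
static struct pkt strip_fixup_first(struct pkt p, int len)
{
    postpull_fixup(&p);
    pull(&p, len);
    return p;
}

/* New order: pull first so the fixup sees the final offset. */
static struct pkt strip_pull_first(struct pkt p, int len)
{
    pull(&p, len);
    postpull_fixup(&p);
    return p;
}
```

      With csum_start inside the pulled headers, the old order leaves the
      packet marked CSUM_PARTIAL with a negative start offset, while the new
      order downgrades it to CSUM_NONE.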

      Reported-by: Anand Parthasarathy <anpartha@meta.com>
      Fixes: 6578171a ("bpf: add bpf_skb_change_proto helper")
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Acked-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20221220004701.402165-1-kuba@kernel.org
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • M
      SUNRPC: ensure the matching upcall is in-flight upon downcall · 3839e583
      minoura makoto committed
      stable inclusion
      from stable-v4.19.270
      commit 4916a52341b7c0ab016c213b11d0104d7f54a2c6
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit b18cba09 ]
      
      Commit 9130b8db ("SUNRPC: allow for upcalls for the same uid
      but different gss service") introduced the `auth` argument to
      __gss_find_upcall(), but in gss_pipe_downcall() it was left as NULL
      since it (and auth->service) was not (yet) determined.
      
      When multiple upcalls with the same uid and different service are
      ongoing, it could happen that __gss_find_upcall(), which returns the
      first match found in the pipe->in_downcall list, could not find the
      correct gss_msg corresponding to the downcall we are looking for.
      Moreover, it might return a msg which is not sent to rpc.gssd yet.
      
      We could see a mount.nfs process hung in D state when multiple mount.nfs
      processes are executed in parallel.  The call trace below is from the
      CentOS 7.9 kernel-3.10.0-1160.24.1.el7.x86_64, but we observed the same
      hang with elrepo kernel-ml-6.0.7-1.el7.
      
      PID: 71258  TASK: ffff91ebd4be0000  CPU: 36  COMMAND: "mount.nfs"
       #0 [ffff9203ca3234f8] __schedule at ffffffffa3b8899f
       #1 [ffff9203ca323580] schedule at ffffffffa3b88eb9
       #2 [ffff9203ca323590] gss_cred_init at ffffffffc0355818 [auth_rpcgss]
       #3 [ffff9203ca323658] rpcauth_lookup_credcache at ffffffffc0421ebc [sunrpc]
       #4 [ffff9203ca3236d8] gss_lookup_cred at ffffffffc0353633 [auth_rpcgss]
       #5 [ffff9203ca3236e8] rpcauth_lookupcred at ffffffffc0421581 [sunrpc]
       #6 [ffff9203ca323740] rpcauth_refreshcred at ffffffffc04223d3 [sunrpc]
       #7 [ffff9203ca3237a0] call_refresh at ffffffffc04103dc [sunrpc]
       #8 [ffff9203ca3237b8] __rpc_execute at ffffffffc041e1c9 [sunrpc]
       #9 [ffff9203ca323820] rpc_execute at ffffffffc0420a48 [sunrpc]
      
      The scenario is like this. Let's say there are two upcalls for
      services A and B, A -> B in pipe->in_downcall, B -> A in pipe->pipe.
      
      When rpc.gssd reads pipe to get the upcall msg corresponding to
      service B from pipe->pipe and then writes the response, in
      gss_pipe_downcall the msg corresponding to service A will be picked
      because only uid is used to find the msg and it is before the one for
      B in pipe->in_downcall.  And the process waiting for the msg
      corresponding to service A will be woken up.
      
      Actual scheduling of that process might be after rpc.gssd processes the
      next msg.  In rpc_pipe_generic_upcall it clears msg->errno (for A).
      The process is scheduled to see gss_msg->ctx == NULL and
      gss_msg->msg.errno == 0, therefore it cannot break the loop in
      gss_create_upcall and is never woken up after that.
      
      This patch adds a simple check to ensure that a msg which is not
      sent to rpc.gssd yet is not chosen as the matching upcall upon
      receiving a downcall.
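
      The effect of the added check can be sketched as follows. This is a
      minimal model, not the SUNRPC code: struct upcall and its in_flight
      flag are hypothetical names standing in for "this msg has already been
      read by rpc.gssd", and the array models pipe->in_downcall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a queued GSS upcall message. */
struct upcall {
    unsigned int uid;
    char service;     /* 'A' or 'B' in the scenario above */
    bool in_flight;   /* true once the daemon has read the message */
};

/* Old behaviour: the first entry with a matching uid wins. */
static struct upcall *find_by_uid(struct upcall *q, size_t n, unsigned int uid)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].uid == uid)
            return &q[i];
    return NULL;
}

/* With the extra check: skip messages the daemon has not read yet. */
static struct upcall *find_in_flight(struct upcall *q, size_t n,
                                     unsigned int uid)
{
    for (size_t i = 0; i < n; i++)
        if (q[i].uid == uid && q[i].in_flight)
            return &q[i];
    return NULL;
}
```

      With A queued before B but only B read by the daemon, the uid-only
      lookup returns A, whereas the in-flight lookup returns B.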

      Signed-off-by: minoura makoto <minoura@valinux.co.jp>
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
      Tested-by: Hiroshi Shimamoto <h-shimamoto@nec.com>
      Cc: Trond Myklebust <trondmy@hammerspace.com>
      Fixes: 9130b8db ("SUNRPC: allow for upcalls for same uid but different gss service")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • Z
      ovl: Use ovl mounter's fsuid and fsgid in ovl_link() · ef39593e
      Zhang Tianci committed
      stable inclusion
      from stable-v4.19.270
      commit 936a357a97c710b95fe9164d5e9aca9f156a0dc1
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 5b0db512 upstream.
      
      There is a wrong case of link() on overlay:
        $ mkdir /lower /fuse /merge
        $ mount -t fuse /fuse
        $ mkdir /fuse/upper /fuse/work
        $ mount -t overlay /merge -o lowerdir=/lower,upperdir=/fuse/upper,\
          workdir=work
        $ touch /merge/file
        $ chown bin.bin /merge/file // the file's caller becomes "bin"
        $ ln /merge/file /merge/lnkfile
      
      Then we get an error (EACCES): the fuse daemon sees that the caller of
      link() is "bin" and denies the request.
      
      In the changing history of ovl_link(), there are two key commits:
      
      The first is commit bb0d2b8a ("ovl: fix sgid on directory") which
      overrides the cred's fsuid/fsgid using the new inode. The new inode's
      owner is initialized by inode_init_owner(), and inode->fsuid is
      assigned to the current user. So the override fsuid becomes the
      current user. We know link() is actually modifying the directory, so
      the caller must have the MAY_WRITE permission on the directory. The
      current caller should have this permission, so it is acceptable
      to use the caller's fsuid.
      
      The second is commit 51f7e52d ("ovl: share inode for hard link")
      which removed the inode creation in ovl_link(). This commit moved
      inode_init_owner() into ovl_create_object(), so ovl_link() just
      gives the old inode to ovl_create_or_link(). Then the override fsuid
      becomes the old inode's fsuid, neither the caller nor the overlay's
      mounter! So this is incorrect.
      
      Fix this bug by using ovl mounter's fsuid/fsgid to do underlying
      fs's link().
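
      The choice of override credentials described above can be sketched as
      a tiny model. struct ids, struct link_request and both helpers are
      hypothetical and only name the three candidate identities; they are
      not overlayfs functions.

```c
#include <assert.h>

/* Hypothetical pair of filesystem ids. */
struct ids {
    unsigned int fsuid;
    unsigned int fsgid;
};

/* The three identities involved in a link() on the overlay. */
struct link_request {
    struct ids caller;      /* task calling link() */
    struct ids old_inode;   /* owner of the existing file */
    struct ids mounter;     /* task that mounted the overlay */
};

/* Behaviour described as incorrect: ids come from the old inode. */
static struct ids override_before_fix(const struct link_request *r)
{
    return r->old_inode;
}

/* Behaviour after the fix: ids come from the overlay's mounter. */
static struct ids override_after_fix(const struct link_request *r)
{
    return r->mounter;
}
```

      For a file chowned to an unprivileged owner on an overlay mounted by
      root, the first helper hands the owner's ids to the underlying fs,
      while the second hands over the mounter's ids.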
      
      Link: https://lore.kernel.org/all/20220817102952.xnvesg3a7rbv576x@wittgenstein/T
      Link: https://lore.kernel.org/lkml/20220825130552.29587-1-zhangtianci.1997@bytedance.com/t
      Signed-off-by: Zhang Tianci <zhangtianci.1997@bytedance.com>
      Signed-off-by: Jiachen Zhang <zhangjiachen.jaycee@bytedance.com>
      Reviewed-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Fixes: 51f7e52d ("ovl: share inode for hard link")
      Cc: <stable@vger.kernel.org> # v4.8
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • C
      pnode: terminate at peers of source · 302deb4b
      Christian Brauner committed
      stable inclusion
      from stable-v4.19.270
      commit 7f57df69de7f05302fad584eb8e3f34de39e0311
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 11933cf1 upstream.
      
      The propagate_mnt() function handles mount propagation when creating
      mounts and propagates the source mount tree @source_mnt to all
      applicable nodes of the destination propagation mount tree headed by
      @dest_mnt.
      
      Unfortunately it contains a bug where it fails to terminate at peers of
      @source_mnt when looking up copies of the source mount that become
      masters for copies of the source mount tree mounted on top of slaves in
      the destination propagation tree causing a NULL dereference.
      
      Once the mechanics of the bug are understood it's easy to trigger.
      Because of unprivileged user namespaces it is available to unprivileged
      users.
      
      While fixing this bug we've gotten confused multiple times due to
      unclear terminology or missing concepts. So let's start this with some
      clarifications:
      
      * The terms "master" or "peer" denote a shared mount. A shared mount
        belongs to a peer group.
      
      * A peer group is a set of shared mounts that propagate to each other.
        They are identified by a peer group id. The peer group id is available
        in @shared_mnt->mnt_group_id.
        Shared mounts within the same peer group have the same peer group id.
        The peers in a peer group can be reached via @shared_mnt->mnt_share.
      
      * The terms "slave mount" or "dependent mount" denote a mount that
        receives propagation from a peer in a peer group. IOW, shared mounts
        may have slave mounts and slave mounts have shared mounts as their
        master. Slave mounts of a given peer in a peer group are listed on
        that peer's slave list available at @shared_mnt->mnt_slave_list.
      
      * The term "master mount" denotes a mount in a peer group. IOW, it
        denotes a shared mount or a peer mount in a peer group. The term
        "master mount" - or "master" for short - is mostly used when talking
        in the context of slave mounts that receive propagation from a master
        mount. A master mount of a slave identifies the closest peer group a
        slave mount receives propagation from. The master mount of a slave can
        be identified via @slave_mount->mnt_master. Different slaves may point
        to different masters in the same peer group.
      
      * Multiple peers in a peer group can have non-empty ->mnt_slave_lists.
        Non-empty ->mnt_slave_lists of peers don't intersect. Consequently, to
        ensure all slave mounts of a peer group are visited the
        ->mnt_slave_lists of all peers in a peer group have to be walked.
      
      * Slave mounts point to a peer in the closest peer group they receive
        propagation from via @slave_mnt->mnt_master (see above). Together with
        these peers they form a propagation group (see below). The closest
        peer group can thus be identified through the peer group id
        @slave_mnt->mnt_master->mnt_group_id of the peer/master that a slave
        mount receives propagation from.
      
      * A shared-slave mount is a slave mount to a peer group pg1 while also
        a peer in another peer group pg2. IOW, a peer group may receive
        propagation from another peer group.
      
        If a peer group pg1 is a slave to another peer group pg2 then all
        peers in peer group pg1 point to the same peer in peer group pg2 via
        ->mnt_master. IOW, all peers in peer group pg1 appear on the same
        ->mnt_slave_list. IOW, they cannot be slaves to different peer groups.
      
      * A pure slave mount is a slave mount that is a slave to a peer group
        but is not a peer in another peer group.
      
      * A propagation group denotes the set of mounts consisting of a single
        peer group pg1 and all slave mounts and shared-slave mounts that point
        to a peer in that peer group via ->mnt_master. IOW, all slave mounts
        such that @slave_mnt->mnt_master->mnt_group_id is equal to
        @shared_mnt->mnt_group_id.
      
        The concept of a propagation group makes it easier to talk about a
        single propagation level in a propagation tree.
      
        For example, in propagate_mnt() the immediate peers of @dest_mnt and
        all slaves of @dest_mnt's peer group form a propagation group propg1.
        So a shared-slave mount that is a slave in propg1 and that is a peer
        in another peer group pg2 forms another propagation group propg2
        together with all slaves that point to that shared-slave mount in
        their ->mnt_master.
      
      * A propagation tree refers to all mounts that receive propagation
        starting from a specific shared mount.
      
        For example, for propagate_mnt() @dest_mnt is the start of a
        propagation tree. The propagation tree encompasses all mounts that
        receive propagation from @dest_mnt's peer group down to the leaves.
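
      The terminology above can be restated as predicates over a
      stripped-down mount. This is a sketch only: struct mnt keeps just the
      two fields the definitions rely on, and the helper names other than
      peers() are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down mount with only the fields named above. */
struct mnt {
    int mnt_group_id;         /* 0 means "not shared" */
    struct mnt *mnt_master;   /* NULL means "not a slave" */
};

/* Two mounts are peers if they share a non-zero peer group id. */
static bool peers(const struct mnt *a, const struct mnt *b)
{
    return a->mnt_group_id != 0 && a->mnt_group_id == b->mnt_group_id;
}

static bool is_shared(const struct mnt *m)
{
    return m->mnt_group_id != 0;
}

static bool is_slave(const struct mnt *m)
{
    return m->mnt_master != NULL;
}

/* Slave to one peer group while also a peer in another. */
static bool is_shared_slave(const struct mnt *m)
{
    return is_slave(m) && is_shared(m);
}

/* Slave to a peer group without being a peer anywhere. */
static bool is_pure_slave(const struct mnt *m)
{
    return is_slave(m) && !is_shared(m);
}

/*
 * m is in the propagation group of shared if it is a peer of shared or
 * its master is a peer of shared.
 */
static bool in_propagation_group(const struct mnt *m, const struct mnt *shared)
{
    if (peers(m, shared))
        return true;
    return m->mnt_master != NULL && peers(m->mnt_master, shared);
}
```

      Note that a slave of a shared-slave belongs to the shared-slave's
      propagation group, not to the one above it.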
      
      With that out of the way let's get to the actual algorithm.
      
      We know that @dest_mnt is guaranteed to be a pure shared mount or a
      shared-slave mount. This is guaranteed by a check in
      attach_recursive_mnt(). So propagate_mnt() will first propagate the
      source mount tree to all peers in @dest_mnt's peer group:
      
      for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
              ret = propagate_one(n);
              if (ret)
                     goto out;
      }
      
      Notice that the peer propagation loop of propagate_mnt() doesn't
      propagate @dest_mnt itself. @dest_mnt is mounted directly in
      attach_recursive_mnt() after we propagated to the destination
      propagation tree.
      
      The mount that will be mounted on top of @dest_mnt is @source_mnt. This
      copy was created earlier even before we entered attach_recursive_mnt()
      and doesn't concern us a lot here.
      
      It's just important to notice that when propagate_mnt() is called
      @source_mnt will not yet have been mounted on top of @dest_mnt. Thus,
      @source_mnt->mnt_parent will either still point to @source_mnt or - in
      the case @source_mnt is moved and thus already attached - still to its
      former parent.
      
      For each peer @m in @dest_mnt's peer group propagate_one() will create a
      new copy of the source mount tree and mount that copy @child on @m such
      that @child->mnt_parent points to @m after propagate_one() returns.
      
      propagate_one() will stash the last destination propagation node @m in
      @last_dest and the last copy it created for the source mount tree in
      @last_source.
      
      Hence, if we call into propagate_one() again for the next destination
      propagation node @m, @last_dest will point to the previous destination
      propagation node and @last_source will point to the previous copy of the
      source mount tree that was mounted on @last_dest.
      
      Each new copy of the source mount tree is created from the previous copy
      of the source mount tree. This will become important later.
      
      The peer loop in propagate_mnt() is straightforward. We iterate through
      the peers copying and updating @last_source and @last_dest as we go
      through them and mount each copy of the source mount tree @child on a
      peer @m in @dest_mnt's peer group.
      
      After propagate_mnt() handled the peers in @dest_mnt's peer group
      propagate_mnt() will propagate the source mount tree down the
      propagation tree that @dest_mnt's peer group propagates to:
      
      for (m = next_group(dest_mnt, dest_mnt); m;
                      m = next_group(m, dest_mnt)) {
              /* everything in that slave group */
              n = m;
              do {
                      ret = propagate_one(n);
                      if (ret)
                              goto out;
                      n = next_peer(n);
              } while (n != m);
      }
      
      The next_group() helper will recursively walk the destination
      propagation tree, descending into each propagation group of the
      propagation tree.
      
      The important part is that it takes care to propagate the source mount
      tree to all peers in the peer group of a propagation group before it
      propagates to the slaves to those peers in the propagation group. IOW,
      it creates and mounts copies of the source mount tree that become
      masters before it creates and mounts copies of the source mount tree
      that become slaves to these masters.
      
      It is important to remember that propagating the source mount tree to
      each mount @m in the destination propagation tree simply means that we
      create and mount new copies @child of the source mount tree on @m such
      that @child->mnt_parent points to @m.
      
      Since we know that each node @m in the destination propagation tree
      headed by @dest_mnt's peer group will be overmounted with a copy of the
      source mount tree and since we know that the propagation properties of
      each copy of the source mount tree we create and mount at @m will mostly
      mirror the propagation properties of @m, we can use that information to
      create and mount the copies of the source mount tree that become masters
      before their slaves.
      
      The easy case is always when @m and @last_dest are peers in a peer group
      of a given propagation group. In that case we know that we can simply
      copy @last_source without having to figure out what the master for the
      new copy @child of the source mount tree needs to be as we've done that
      in a previous call to propagate_one().
      
      The hard case is when we're dealing with a slave mount or a shared-slave
      mount @m in a destination propagation group that we need to create and
      mount a copy of the source mount tree on.
      
      For each propagation group in the destination propagation tree we
      propagate the source mount tree to we want to make sure that the copies
      @child of the source mount tree we create and mount on slaves @m pick an
      earlier copy of the source mount tree that we mounted on a master @m of
      the destination propagation group as their master. This is a mouthful
      but as far as we can tell that's the core of it all.
      
      But, if we keep track of the masters in the destination propagation tree
      @m we can use the information to find the correct master for each copy
      of the source mount tree we create and mount at the slaves in the
      destination propagation tree @m.
      
      Let's walk through the base case as that's still fairly easy to grasp.
      
      If we're dealing with the first slave in the propagation group that
      @dest_mnt is in then we haven't yet marked any masters in the
      destination propagation tree.
      
      We know the master for the first slave to @dest_mnt's peer group is
      simply @dest_mnt. So we expect this algorithm to yield a copy of the
      source mount tree that was mounted on a peer in @dest_mnt's peer group
      as the master for the copy of the source mount tree we want to mount at
      the first slave @m:
      
      for (n = m; ; n = p) {
              p = n->mnt_master;
              if (p == dest_master || IS_MNT_MARKED(p))
                      break;
      }
      
      For the first slave we walk the destination propagation tree all the way
      up to a peer in @dest_mnt's peer group. IOW, the propagation hierarchy
      can be walked by walking up the @mnt->mnt_master hierarchy of the
      destination propagation tree @m. We will ultimately find a peer in
      @dest_mnt's peer group and thus ultimately @dest_mnt->mnt_master.
      
      Btw, here the assumption we listed at the beginning becomes important.
      Namely, that peers in a peer group pg1 that are slaves in another peer
      group pg2 appear on the same ->mnt_slave_list. IOW, all slaves who are
      peers in peer group pg1 point to the same peer in peer group pg2 via
      their ->mnt_master. Otherwise the termination condition in the code
      above would be wrong and next_group() would be broken too.
      
      So the first iteration sets:
      
      n = m;
      p = n->mnt_master;
      
      such that @p now points to a peer or @dest_mnt itself. We walk up one
      more level since we don't have any marked mounts. So we end up with:
      
      n = dest_mnt;
      p = dest_mnt->mnt_master;
      
      If @dest_mnt's peer group is not slave to another peer group then @p is
      now NULL. If @dest_mnt's peer group is a slave to another peer group
      then @p now points to @dest_mnt->mnt_master, which is a master
      outside the propagation tree we're dealing with.
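
      The walk up the ->mnt_master chain can be tried out in isolation. The
      sketch below reuses the loop quoted above on a hypothetical struct mnt
      that carries only a master pointer and a marked flag; it assumes @m
      lies inside the propagation tree so the walk always reaches either
      @dest_master or a marked mount.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mount with only what the walk needs. */
struct mnt {
    struct mnt *mnt_master;
    bool marked;   /* models IS_MNT_MARKED() */
};

/* Result of the walk: the slave @n and its closest master @p. */
struct walk {
    struct mnt *n;
    struct mnt *p;
};

/*
 * Walks up from m until the master is either dest_master or marked.
 * Precondition: m is below dest_master's level in the tree.
 */
static struct walk find_closest_master(struct mnt *m, struct mnt *dest_master)
{
    struct walk w;

    for (w.n = m; ; w.n = w.p) {
        w.p = w.n->mnt_master;
        if (w.p == dest_master || w.p->marked)
            break;
    }
    return w;
}
```

      For the first slave of an unmarked @dest_mnt with no master of its own
      this yields n == dest_mnt and p == NULL. Once @dest_mnt is marked, the
      same call stops one level earlier with n == the slave and
      p == dest_mnt.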
      
      Now we need to figure out the master for the copy of the source mount
      tree we're about to create and mount on the first slave of @dest_mnt's
      peer group:
      
      do {
              struct mount *parent = last_source->mnt_parent;
              if (last_source == first_source)
                      break;
              done = parent->mnt_master == p;
              if (done && peers(n, parent))
                      break;
              last_source = last_source->mnt_master;
      } while (!done);
      
      We know that @last_source->mnt_parent points to @last_dest and
      @last_dest is the last peer in @dest_mnt's peer group we propagated to
      in the peer loop in propagate_mnt().
      
      Consequently, @last_source is the last copy we created and mounted on that
      last peer in @dest_mnt's peer group. So @last_source is the master we
      want to pick.
      
      We know that @last_source->mnt_parent->mnt_master points to
      @last_dest->mnt_master. We also know that @last_dest->mnt_master is
      either NULL or points to a master outside of the destination propagation
      tree and so does @p. Hence:
      
      done = parent->mnt_master == p;
      
      is trivially true in the base condition.
      
      We also know that for the first slave mount of @dest_mnt's peer group
      that @last_dest either points to @dest_mnt itself because it was
      initialized to:
      
      last_dest = dest_mnt;
      
      at the beginning of propagate_mnt() or it will point to a peer of
      @dest_mnt in its peer group. In both cases it is guaranteed that on the
      first iteration @n and @parent are peers (Please note the check for
      peers here as that's important.):
      
      if (done && peers(n, parent))
              break;
      
      So, as we expected, we select @last_source, which refers to the last
      copy of the source mount tree we mounted on the last peer in @dest_mnt's
      peer group, as the master of the first slave in @dest_mnt's peer group.
      The rest is taken care of by clone_mnt(last_source, ...). We'll skip
      over that part otherwise this becomes a blogpost.
      
      At the end of propagate_mnt() we now mark @m->mnt_master as the first
      master in the destination propagation tree that is distinct from
      @dest_mnt->mnt_master. IOW, we mark @dest_mnt itself as a master.
      
      By marking @dest_mnt or one of its peers we are able to easily find it
      again when we later look up masters for the copies of the source mount
      tree that we mount on slaves @m of @dest_mnt's peer group. This, in
      turn, allows us to find the master we selected for the copies of the
      source mount tree we mounted on masters in the destination propagation
      tree again.
      
      The important part is to realize that the code makes use of the fact
      that the last copy of the source mount tree stashed in @last_source was
      mounted on top of the previous destination propagation node @last_dest.
      What this means is that @last_source allows us to walk the destination
      propagation hierarchy the same way each destination propagation node @m
      does.
      
      If we take @last_source, which is the copy of @source_mnt we have
      mounted on @last_dest in the previous iteration of propagate_one(), then
      we know @last_source->mnt_parent points to @last_dest but we also know
      that as we walk through the destination propagation tree that
      @last_source->mnt_master will point to an earlier copy of the source
      mount tree we mounted on an earlier destination propagation node @m.
      
      IOW, @last_source->mnt_parent will be our hook into the destination
      propagation tree and each consecutive @last_source->mnt_master will lead
      us to an earlier propagation node @m via
      @last_source->mnt_master->mnt_parent.
      
      Hence, by walking up @last_source->mnt_master, each of which is mounted
      on a node that is a master @m in the destination propagation tree we can
      also walk up the destination propagation hierarchy.
      
      So, for each new destination propagation node @m we use the previous
      copy of @last_source and the fact it's mounted on the previous
      propagation node @last_dest via @last_source->mnt_master->mnt_parent to
      determine what the master of the new copy of @last_source needs to be.
      
      The goal is to find the _closest_ master that the new copy of the source
      mount tree we are about to create and mount on a slave @m in the
      destination propagation tree needs to pick. IOW, we want to find a
      suitable master in the propagation group.
      
      As the propagation structure of the source mount propagation tree we
      create mirrors the propagation structure of the destination propagation
      tree we can find @m's closest master - i.e., a marked master - which is
      a peer in the closest peer group that @m receives propagation from. We
      store that closest master of @m in @p as before and record the slave to
      that master in @n.
      
      We then search for this master @p via @last_source by walking up the
      master hierarchy starting from the last copy of the source mount tree
      stored in @last_source that we created and mounted on the previous
      destination propagation node @m.
      
      We will try to find the master by walking @last_source->mnt_master and
      by comparing @last_source->mnt_master->mnt_parent->mnt_master to @p. If
      we find @p then we can figure out what earlier copy of the source mount
      tree needs to be the master for the new copy of the source mount tree
      we're about to create and mount at the current destination propagation
      node @m.
      
      If @last_source->mnt_master->mnt_parent and @n are peers then we know
      that the closest master they receive propagation from is
      @last_source->mnt_master->mnt_parent->mnt_master. If not then the
      closest immediate peer group that they receive propagation from must be
      one level higher up.
      
      This builds on the earlier clarification at the beginning that all peers
      in a peer group which are slaves of other peer groups all point to the
      same ->mnt_master, i.e., appear on the same ->mnt_slave_list, of the
      closest peer group that they receive propagation from.
      
      However, terminating the walk has corner cases.
      
      If the closest marked master for a given destination node @m cannot be
      found by walking up the master hierarchy via @last_source->mnt_master
      then we need to terminate the walk when we encounter @source_mnt again.
      
      This isn't an arbitrary termination. It simply means that the new copy
      of the source mount tree we're about to create has a copy of the source
      mount tree we created and mounted on a peer in @dest_mnt's peer group as
      its master. IOW, @source_mnt is the peer in the closest peer group that
      the new copy of the source mount tree receives propagation from.
      
      We absolutely have to stop at @source_mnt because @last_source->mnt_master
      either points outside the propagation hierarchy we're dealing with or it
      is NULL because @source_mnt isn't a shared-slave.
      
      So continuing the walk past @source_mnt would cause a NULL dereference
      via @last_source->mnt_master->mnt_parent. And so we have to stop the
      walk when we encounter @source_mnt again.
      
      One scenario where this can happen is when we first handled a series of
      slaves of @dest_mnt's peer group and then encounter peers in a new peer
      group that is a slave to @dest_mnt's peer group. We handle them and then
      we encounter another slave mount to @dest_mnt that is a pure slave to
      @dest_mnt's peer group. That pure slave will have a peer in @dest_mnt's
      peer group as its master. Consequently, the new copy of the source mount
      tree will need to have @source_mnt as its master. So we walk the
      propagation hierarchy all the way up to @source_mnt based on
      @last_source->mnt_master.
      
      So terminate on @source_mnt, easy peasy. Except that the check misses
      something that the rest of the algorithm already handles.
      
      If @dest_mnt has peers in its peer group the peer loop in
      propagate_mnt():
      
      for (n = next_peer(dest_mnt); n != dest_mnt; n = next_peer(n)) {
              ret = propagate_one(n);
              if (ret)
                      goto out;
      }
      
      will consecutively update @last_source with each previous copy of the
      source mount tree we created and mounted at the previous peer in
      @dest_mnt's peer group. So after that loop terminates @last_source will
      point to whatever copy of the source mount tree was created and mounted
      on the last peer in @dest_mnt's peer group.
      
      Furthermore, if there is even a single additional peer in @dest_mnt's
      peer group then @last_source will __not__ point to @source_mnt anymore.
      Because, as we mentioned above, @dest_mnt isn't even handled in this
      loop but directly in attach_recursive_mnt(). So it can't even accidentally
      come last in that peer loop.
      
      So the first time we handle a slave mount @m of @dest_mnt's peer group
      the copy of the source mount tree we create will make the __last copy of
      the source mount tree we created and mounted on the last peer in
      @dest_mnt's peer group the master of the new copy of the source mount
      tree we create and mount on the first slave of @dest_mnt's peer group__.
      
      But this means that the termination condition that checks for
      @source_mnt is wrong. The @source_mnt cannot be found anymore by
      propagate_one(). Instead it will find the last copy of the source mount
      tree we created and mounted for the last peer of @dest_mnt's peer group
      again. And that is a peer of @source_mnt not @source_mnt itself.
      
      IOW, we fail to terminate the loop correctly and ultimately dereference
      @last_source->mnt_master->mnt_parent. When @source_mnt's peer group
      isn't slave to another peer group then @last_source->mnt_master is NULL
      causing the splat below.
      
      For example, assume @dest_mnt is a pure shared mount and has three peers
      in its peer group:
      
      ===================================================================================
                                               mount-id   mount-parent-id   peer-group-id
      ===================================================================================
      (@dest_mnt) mnt_master[216]              309        297               shared:216
          \
           (@source_mnt) mnt_master[218]:      609        609               shared:218
      
      (1) mnt_master[216]:                     607        605               shared:216
          \
           (P1) mnt_master[218]:               624        607               shared:218
      
      (2) mnt_master[216]:                     576        574               shared:216
          \
           (P2) mnt_master[218]:               625        576               shared:218
      
      (3) mnt_master[216]:                     545        543               shared:216
          \
           (P3) mnt_master[218]:               626        545               shared:218
      
      After this sequence has been processed @last_source will point to (P3),
      the copy generated for the third peer in @dest_mnt's peer group we
      handled. So the copy of the source mount tree (P4) we create and mount
      on the first slave of @dest_mnt's peer group:
      
      ===================================================================================
                                               mount-id   mount-parent-id   peer-group-id
      ===================================================================================
          mnt_master[216]                      309        297               shared:216
         /
        /
      (S0) mnt_slave                           483        481               master:216
        \
         \    (P3) mnt_master[218]             626        545               shared:218
          \  /
           \/
          (P4) mnt_slave                       627        483               master:218
      
      will pick the last copy of the source mount tree (P3) as master, not (S0).
      
      When walking the propagation hierarchy via @last_source's master
      hierarchy we encounter (P3) but not (S0), i.e., @source_mnt.
      
      We can fix this in multiple ways:
      
      (1) By setting @last_source to @source_mnt after we processed the peers
          in @dest_mnt's peer group right after the peer loop in
          propagate_mnt().
      
      (2) By changing the termination condition that relies on finding exactly
          @source_mnt to finding a peer of @source_mnt.
      
      (3) By only moving @last_source when we actually venture into a new peer
          group or some clever variant thereof.
      
      The first two options are minimally invasive and what we want as a fix.
      The third option is more intrusive but something we'd like to explore in
      the near future.
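
As a rough illustration of option (2), here is a stripped-down userspace model. The struct, walk_exact() and walk_peers() are hypothetical stand-ins and not the kernel's code; only the peer-group id and ->mnt_master are modelled. It shows why an exact-match check walks off the top of the master hierarchy when @last_source is a peer of @source_mnt, while a peer check stops in time.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for the kernel's struct mount. */
struct mount {
	int mnt_group_id;		/* 0 means "not shared" */
	struct mount *mnt_master;	/* NULL if not a slave */
};

/* Two mounts are peers if they share a non-zero peer group id. */
static bool peers(const struct mount *a, const struct mount *b)
{
	return a->mnt_group_id != 0 && a->mnt_group_id == b->mnt_group_id;
}

/*
 * Old condition: stop only on @source_mnt itself. Returns NULL when the
 * walk runs off the top of the master hierarchy, which is the point where
 * the kernel would have dereferenced a NULL ->mnt_master.
 */
static const struct mount *walk_exact(const struct mount *last_source,
				      const struct mount *source_mnt)
{
	while (last_source && last_source != source_mnt)
		last_source = last_source->mnt_master;
	return last_source;
}

/* Option (2): stop on any peer of @source_mnt. */
static const struct mount *walk_peers(const struct mount *last_source,
				      const struct mount *source_mnt)
{
	while (last_source && !peers(last_source, source_mnt))
		last_source = last_source->mnt_master;
	return last_source;
}
```

With @source_mnt and (P3) both in shared:218 and neither having a master, walk_exact() started at (P3) ends at NULL, whereas walk_peers() stops at (P3).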
      
      This passes all LTP tests and specifically the mount propagation
      testsuite part of it. It also holds up against all known reproducers of
      this issue.
      
      Final words.
      First, this is a clever but __worryingly__ underdocumented algorithm.
      There isn't a single detailed comment to be found in next_group(),
      propagate_one() or anywhere else in that file for that matter. This has
      been a giant pain to understand and work through and a bug like this is
      insanely difficult to fix without a detailed understanding of what's
      happening. Let's not talk about the amount of time that was sunk into
      fixing this.
      
      Second, all the cool kids with access to
      unshare --mount --user --map-root --propagation=unchanged
      are going to have a lot of fun. IOW, this is triggerable by unprivileged
      users while namespace_lock() is held.
      
      [  115.848393] BUG: kernel NULL pointer dereference, address: 0000000000000010
      [  115.848967] #PF: supervisor read access in kernel mode
      [  115.849386] #PF: error_code(0x0000) - not-present page
      [  115.849803] PGD 0 P4D 0
      [  115.850012] Oops: 0000 [#1] PREEMPT SMP PTI
      [  115.850354] CPU: 0 PID: 15591 Comm: mount Not tainted 6.1.0-rc7 #3
      [  115.850851] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS
      VirtualBox 12/01/2006
      [  115.851510] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
      [  115.851924] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
      49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
      00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
      02 4d
      [  115.853441] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
      [  115.853865] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
      [  115.854458] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
      [  115.855044] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
      [  115.855693] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
      [  115.856304] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
      [  115.856859] FS:  00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
      knlGS:0000000000000000
      [  115.857531] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  115.858006] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
      [  115.858598] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  115.859393] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  115.860099] Call Trace:
      [  115.860358]  <TASK>
      [  115.860535]  propagate_mnt+0x14d/0x190
      [  115.860848]  attach_recursive_mnt+0x274/0x3e0
      [  115.861212]  path_mount+0x8c8/0xa60
      [  115.861503]  __x64_sys_mount+0xf6/0x140
      [  115.861819]  do_syscall_64+0x5b/0x80
      [  115.862117]  ? do_faccessat+0x123/0x250
      [  115.862435]  ? syscall_exit_to_user_mode+0x17/0x40
      [  115.862826]  ? do_syscall_64+0x67/0x80
      [  115.863133]  ? syscall_exit_to_user_mode+0x17/0x40
      [  115.863527]  ? do_syscall_64+0x67/0x80
      [  115.863835]  ? do_syscall_64+0x67/0x80
      [  115.864144]  ? do_syscall_64+0x67/0x80
      [  115.864452]  ? exc_page_fault+0x70/0x170
      [  115.864775]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
      [  115.865187] RIP: 0033:0x7f92c92b0ebe
      [  115.865480] Code: 48 8b 0d 75 4f 0c 00 f7 d8 64 89 01 48 83 c8 ff
      c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 a5 00 00
      00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 42 4f 0c 00 f7 d8 64 89
      01 48
      [  115.866984] RSP: 002b:00007fff000aa728 EFLAGS: 00000246 ORIG_RAX:
      00000000000000a5
      [  115.867607] RAX: ffffffffffffffda RBX: 000055a77888d6b0 RCX: 00007f92c92b0ebe
      [  115.868240] RDX: 000055a77888d8e0 RSI: 000055a77888e6e0 RDI: 000055a77888e620
      [  115.868823] RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000001
      [  115.869403] R10: 0000000000001000 R11: 0000000000000246 R12: 000055a77888e620
      [  115.869994] R13: 000055a77888d8e0 R14: 00000000ffffffff R15: 00007f92c93e4076
      [  115.870581]  </TASK>
      [  115.870763] Modules linked in: nft_fib_inet nft_fib_ipv4
      nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6
      nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6
      nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink qrtr snd_intel8x0
      sunrpc snd_ac97_codec ac97_bus snd_pcm snd_timer intel_rapl_msr
      intel_rapl_common snd vboxguest intel_powerclamp video rapl joydev
      soundcore i2c_piix4 wmi fuse zram xfs vmwgfx crct10dif_pclmul
      crc32_pclmul crc32c_intel polyval_clmulni polyval_generic
      drm_ttm_helper ttm e1000 ghash_clmulni_intel serio_raw ata_generic
      pata_acpi scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_multipath
      [  115.875288] CR2: 0000000000000010
      [  115.875641] ---[ end trace 0000000000000000 ]---
      [  115.876135] RIP: 0010:propagate_one.part.0+0x7f/0x1a0
      [  115.876551] Code: 75 eb 4c 8b 05 c2 25 37 02 4c 89 ca 48 8b 4a 10
      49 39 d0 74 1e 48 3b 81 e0 00 00 00 74 26 48 8b 92 e0 00 00 00 be 01
      00 00 00 <48> 8b 4a 10 49 39 d0 75 e2 40 84 f6 74 38 4c 89 05 84 25 37
      02 4d
      [  115.878086] RSP: 0018:ffffb8d5443d7d50 EFLAGS: 00010282
      [  115.878511] RAX: ffff8e4d87c41c80 RBX: ffff8e4d88ded780 RCX: ffff8e4da4333a00
      [  115.879128] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8e4d88ded780
      [  115.879715] RBP: ffff8e4d88ded780 R08: ffff8e4da4338000 R09: ffff8e4da43388c0
      [  115.880359] R10: 0000000000000002 R11: ffffb8d540158000 R12: ffffb8d5443d7da8
      [  115.880962] R13: ffff8e4d88ded780 R14: 0000000000000000 R15: 0000000000000000
      [  115.881548] FS:  00007f92c90c9800(0000) GS:ffff8e4dfdc00000(0000)
      knlGS:0000000000000000
      [  115.882234] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  115.882713] CR2: 0000000000000010 CR3: 0000000022f4c002 CR4: 00000000000706f0
      [  115.883314] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  115.883966] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      
      Fixes: f2ebb3a9 ("smarter propagate_mnt()")
      Fixes: 5ec0811d ("propogate_mnt: Handle the first propogated copy being a slave")
      Cc: <stable@vger.kernel.org>
      Reported-by: Ditang Chen <ditang.c@gmail.com>
      Signed-off-by: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • V
      cifs: Fix uninitialized memory read for smb311 posix symlink create · c35e430f
      Volker Lendecke committed
      stable inclusion
      from stable-v4.19.270
      commit 707682dbab5b61d6b7a95b05491b476510aeeb64
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit a152d05a upstream.
      
      If smb311 posix is enabled, we send the intended mode for file
      creation in the posix create context. Instead of using what's there on
      the stack, create the mfsymlink file with 0644.
      
      Fixes: ce558b0e ("smb3: Add posix create context for smb3.11 posix mounts")
      Cc: stable@vger.kernel.org
      Signed-off-by: Volker Lendecke <vl@samba.org>
      Reviewed-by: Tom Talpey <tom@talpey.com>
      Reviewed-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • W
      device_cgroup: Roll back to original exceptions after copy failure · 29a747a1
      Wang Weiyang committed
      stable inclusion
      from stable-v4.19.270
      commit 697e55b94162721cfdfa7acd1be09427d2c47c80
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit e68bfbd3 upstream.
      
      When adding the 'a *:* rwm' entry to devcgroup A's whitelist, A's
      exceptions are first cleaned and A's behavior is changed to
      DEVCG_DEFAULT_ALLOW. Then the parent's exceptions are copied to A's
      whitelist. If the copy fails, the function just returns, leaving A
      granting permissions to all devices. A may then grant more
      permissions than its parent.
      
      Backup A's whitelist and recover original exceptions after copy
      failure.
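
The backup-and-restore idea can be sketched in userspace as follows. This is only a model under assumptions: struct devcg_model, copy_exceptions() and allow_all() are hypothetical names, the exception list is a fixed array, and the copy failure is injected through a parameter.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of a device cgroup: a default policy plus exceptions. */
struct devcg_model {
	bool default_allow;
	int nr;			/* number of valid entries in ex[] */
	int ex[4];
};

/* Copy the exception list; @fail injects an allocation failure. */
static int copy_exceptions(struct devcg_model *dst,
			   const struct devcg_model *src, bool fail)
{
	if (fail)
		return -ENOMEM;
	dst->nr = src->nr;
	memcpy(dst->ex, src->ex, sizeof(dst->ex));
	return 0;
}

/* Switch @cg to "allow all", inheriting the parent's exceptions. */
static int allow_all(struct devcg_model *cg, const struct devcg_model *parent,
		     bool copy_fails)
{
	struct devcg_model backup = *cg;	/* keep the original state */
	int rc;

	cg->nr = 0;				/* clean the old exceptions */
	rc = copy_exceptions(cg, parent, copy_fails);
	if (rc) {
		*cg = backup;			/* roll back on failure */
		return rc;
	}
	cg->default_allow = true;		/* change behavior only on success */
	return 0;
}
```

In this model the default behavior is flipped only after the copy succeeded, so a failed copy leaves the group exactly as it was.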
      
      Cc: stable@vger.kernel.org
      Fixes: 4cef7299 ("device_cgroup: add proper checking when changing default behavior")
      Signed-off-by: Wang Weiyang <wangweiyang2@huawei.com>
      Reviewed-by: Aristeu Rozanski <aris@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • S
      PCI/sysfs: Fix double free in error path · 5112aa6b
      Sascha Hauer committed
      stable inclusion
      from stable-v4.19.270
      commit 17e1b1800ce07d88219e7bff6b23dd35aa751681
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit aa382ffa upstream.
      
      When pci_create_attr() fails, pci_remove_resource_files() is called which
      will iterate over the res_attr[_wc] arrays and free every non-NULL entry.
      To avoid a double free here set the array entry only after it's clear we
      successfully initialized it.
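
A minimal userspace sketch of the "publish only after success" pattern, assuming a simplified array and hypothetical helpers (create_attr(), remove_resource_files(), free_count); the real code deals with sysfs binary attributes rather than plain allocations.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

struct attr_model { int id; };

static struct attr_model *res_attr[2];	/* models the res_attr[] array */
static int free_count;			/* counts frees done by the cleanup */

/* Cleanup path: frees every entry that was published in the array. */
static void remove_resource_files(void)
{
	for (int i = 0; i < 2; i++) {
		if (res_attr[i]) {
			free(res_attr[i]);
			res_attr[i] = NULL;
			free_count++;
		}
	}
}

/* @init_ok injects a failure of the initialization step. */
static int create_attr(int num, bool init_ok)
{
	struct attr_model *a = calloc(1, sizeof(*a));

	if (!a)
		return -ENOMEM;
	if (!init_ok) {
		free(a);	/* freed here, and never stored in res_attr[] */
		return -EIO;
	}
	res_attr[num] = a;	/* publish only once initialization succeeded */
	return 0;
}
```

Because the failed entry never reaches the array, the cleanup loop cannot free it a second time.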
      
      Fixes: b562ec8f ("PCI: Don't leak memory if sysfs_create_bin_file() fails")
      Link: https://lore.kernel.org/r/20221007070735.GX986@pengutronix.de/
      Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • M
      PCI: Fix pci_device_is_present() for VFs by checking PF · b142a0a7
      Michael S. Tsirkin committed
      stable inclusion
      from stable-v4.19.270
      commit 643d77fda08d06f863af35e80a7e517ea61d9629
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 98b04dd0 upstream.
      
      pci_device_is_present() previously didn't work for VFs because it reads the
      Vendor and Device ID, which are 0xffff for VFs, which looks like they
      aren't present.  Check the PF instead.
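
The idea can be modelled in a few lines of userspace C. The struct and both helpers below are hypothetical stand-ins, and the vendor ID 0x1af4 is only an example value; the model captures just the "ask the PF when the device is a VF" step.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a PCI function. */
struct pci_dev_model {
	bool is_virtfn;			/* true for an SR-IOV VF */
	struct pci_dev_model *physfn;	/* owning PF, valid for VFs only */
	uint16_t vendor_id;		/* what a config-space read returns */
};

/* Return the PF for a VF, or the device itself otherwise. */
static struct pci_dev_model *physfn_of(struct pci_dev_model *dev)
{
	return dev->is_virtfn ? dev->physfn : dev;
}

/* A device counts as present if its (physical) function answers reads. */
static bool device_is_present(struct pci_dev_model *dev)
{
	dev = physfn_of(dev);
	return dev->vendor_id != 0xffff;
}
```

A VF whose own Vendor ID reads 0xffff is then reported as present as long as its PF still responds.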
      
      Wei Gong reported that if virtio I/O is in progress when the driver is
      unbound or "0" is written to /sys/.../sriov_numvfs, the virtio I/O
      operation hangs, which may result in output like this:
      
        task:bash state:D stack:    0 pid: 1773 ppid:  1241 flags:0x00004002
        Call Trace:
         schedule+0x4f/0xc0
         blk_mq_freeze_queue_wait+0x69/0xa0
         blk_mq_freeze_queue+0x1b/0x20
         blk_cleanup_queue+0x3d/0xd0
         virtblk_remove+0x3c/0xb0 [virtio_blk]
         virtio_dev_remove+0x4b/0x80
         ...
         device_unregister+0x1b/0x60
         unregister_virtio_device+0x18/0x30
         virtio_pci_remove+0x41/0x80
         pci_device_remove+0x3e/0xb0
      
      This happened because pci_device_is_present(VF) returned "false" in
      virtio_pci_remove(), so it called virtio_break_device().  The broken vq
      meant that vring_interrupt() skipped the vq.callback() that would have
      completed the virtio I/O operation via virtblk_done().
      
      [bhelgaas: commit log, simplify to always use pci_physfn(), add stable tag]
      Link: https://lore.kernel.org/r/20221026060912.173250-1-mst@redhat.com
      Reported-by: Wei Gong <gongwei833x@gmail.com>
      Tested-by: Wei Gong <gongwei833x@gmail.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • D
      ipmi: fix use after free in _ipmi_destroy_user() · 2f5f822e
      Dan Carpenter committed
      stable inclusion
      from stable-v4.19.270
      commit 35ad87bfe330f7ef6a19f772223c63296d643172
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit a92ce570 upstream.
      
      The intf_free() function frees the "intf" pointer so we cannot
      dereference it again on the next line.
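
The usual cure for this kind of bug is to read whatever is still needed before the free. A small userspace sketch, where struct intf_model, module_put_model() and destroy_user() are hypothetical names and the owner is reduced to an integer:

```c
#include <stdlib.h>

/* Hypothetical model of the interface object; only the owner is kept. */
struct intf_model {
	int owner;
};

static int last_owner_put = -1;	/* records which owner was released */

/* Stand-in for dropping the module reference of @owner. */
static void module_put_model(int owner)
{
	last_owner_put = owner;
}

static void destroy_user(struct intf_model *intf)
{
	int owner = intf->owner;	/* read the field while intf is valid */

	free(intf);			/* models the final put that frees intf */
	module_put_model(owner);	/* uses the saved copy, not intf->owner */
}
```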
      
      Fixes: cbb79863 ("ipmi: Don't allow device module unload when in use")
      Signed-off-by: Dan Carpenter <error27@gmail.com>
      Message-Id: <Y3M8xa1drZv4CToE@kili>
      Cc: <stable@vger.kernel.org> # 5.5+
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • H
      ima: Fix a potential NULL pointer access in ima_restore_measurement_list · c6ad76f1
      Huaxin Lu committed
      stable inclusion
      from stable-v4.19.270
      commit c3572fb4002fdd36ebb9e707f8c397a0e2830c9e
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit 11220db4 upstream.
      
      In restore_template_fmt, when kstrdup fails, a non-NULL value will still be
      returned, which causes a NULL pointer access in template_desc_init_fields.
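
A hedged userspace sketch of the pattern: if duplicating the format string fails, free the descriptor and return NULL so the caller never sees a half-initialized object. restore_fmt(), the dup_fn parameter and both dup helpers are hypothetical and exist only so the failure can be injected.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of a template descriptor. */
struct template_desc_model {
	char *fmt;
};

typedef char *(*dup_fn)(const char *);

/* Stand-in for a string duplication that fails. */
static char *failing_dup(const char *s)
{
	(void)s;
	return NULL;
}

/* Stand-in for a string duplication that succeeds. */
static char *working_dup(const char *s)
{
	size_t n = strlen(s) + 1;
	char *p = malloc(n);

	if (p)
		memcpy(p, s, n);
	return p;
}

static struct template_desc_model *restore_fmt(const char *name, dup_fn dup)
{
	struct template_desc_model *desc = calloc(1, sizeof(*desc));

	if (!desc)
		return NULL;
	desc->fmt = dup(name);
	if (!desc->fmt) {
		free(desc);	/* do not hand out a descriptor without fmt */
		return NULL;
	}
	return desc;
}
```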
      
      Fixes: c7d09367 ("ima: support restoring multiple template formats")
      Cc: stable@kernel.org
      Co-developed-by: Jiaming Li <lijiaming30@huawei.com>
      Signed-off-by: Jiaming Li <lijiaming30@huawei.com>
      Signed-off-by: Huaxin Lu <luhuaxin1@huawei.com>
      Reviewed-by: Stefan Berger <stefanb@linux.ibm.com>
      Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • Z
      ipmi: fix long wait in unload when IPMI disconnect · 6084ee2d
      Zhang Yuchen committed
      stable inclusion
      from stable-v4.19.270
      commit f99cb54d8ec6ba564ffc72354d9e1e6103fad887
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      commit f6f1234d upstream.
      
      When fixing the problem mentioned in PATCH1, we also found
      the following problem:
      
      If IPMI is disconnected while a message is being sent, unloading the
      driver will be stuck for a long time.
      
      The main problem is that uninstalling the driver waits for curr_msg to
      be sent or HOSED. After stopping tasklet, the only place to trigger the
      timeout mechanism is the circular poll in shutdown_smi.
      
      The poll function delays 10us and calls smi_event_handler(smi_info,10).
      smi_event_handler() deducts 10us from kcs->ibf_timeout.
      
      But the poll func is followed by schedule_timeout_uninterruptible(1).
      The time consumed here is not counted in kcs->ibf_timeout.
      
      So when 10us is deducted from kcs->ibf_timeout, at least 1 jiffy has
      actually passed. The waiting time has increased by more than a
      hundredfold.
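
The "more than a hundredfold" figure can be checked with a little arithmetic. The helper below is a hypothetical illustration, and the HZ values mentioned afterwards (1000 and 250) are assumed example configurations rather than something stated above.

```c
/*
 * Lower bound on how much longer the wait becomes when each loop
 * iteration really takes (one jiffy + poll_us) but only poll_us is
 * charged against the timeout.
 */
static long slowdown_factor(long hz, long poll_us)
{
	long jiffy_us = 1000000L / hz;	/* one jiffy in microseconds */

	/* real time per iteration divided by the time that gets accounted */
	return (jiffy_us + poll_us) / poll_us;
}
```

With a 10us poll this gives a factor of 101 for HZ=1000 and 401 for HZ=250, and since the sleep lasts at least one jiffy the real slowdown can only be larger.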
      
      Now, instead of calling poll(), call smi_event_handler() directly and
      calculate the elapsed time.
      
      For verification, you can use eBPF directly to check
      kcs->ibf_timeout on each call to kcs_event() when IPMI is disconnected.
      It decrements at the normal rate before unloading; the decrement rate
      becomes very slow once unloading begins.
      
        $ bpftrace -e 'kprobe:kcs_event {printf("kcs->ibftimeout : %d\n",
            *(arg0+584));}'
      
      Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>
      Message-Id: <20221007092617.87597-3-zhangyuchen.lcr@bytedance.com>
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • W
      binfmt: Fix error return code in load_elf_fdpic_binary() · 3c37b9ff
      Wang Yufen committed
      stable inclusion
      from stable-v4.19.270
      commit 72bd0b5cdbcbe31d6644960cdbcbc33d1b4b658d
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit e7f703ff ]
      
      Fix to return a negative error code from create_elf_fdpic_tables()
      instead of 0.
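
The bug class is the common one where a helper's result is tested but never stored, so the error path returns a stale 0. A minimal sketch with hypothetical names (create_tables(), load_binary()) and an injected failure:

```c
#include <errno.h>

/* Stand-in for the table setup step; @fail injects an error. */
static int create_tables(int fail)
{
	return fail ? -ENOMEM : 0;
}

static int load_binary(int fail)
{
	int retval;

	/* Store the result so the error path returns it instead of 0. */
	retval = create_tables(fail);
	if (retval < 0)
		goto error;

	return 0;

error:
	return retval;
}
```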
      
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Cc: stable@vger.kernel.org
      Signed-off-by: Wang Yufen <wangyufen@huawei.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/1669945261-30271-1-git-send-email-wangyufen@huawei.com
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • Y
      chardev: fix error handling in cdev_device_add() · d493c90b
      Yang Yingliang committed
      stable inclusion
      from stable-v4.19.270
      commit 34d17b39bceef25e4cf9805cd59250ae05d0a139
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 11fa7fef ]
      
      While doing fault injection test, I got the following report:
      
      ------------[ cut here ]------------
      kobject: '(null)' (0000000039956980): is not initialized, yet kobject_put() is being called.
      WARNING: CPU: 3 PID: 6306 at kobject_put+0x23d/0x4e0
      CPU: 3 PID: 6306 Comm: 283 Tainted: G        W          6.1.0-rc2-00005-g307c1086d7c9 #1253
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
      RIP: 0010:kobject_put+0x23d/0x4e0
      Call Trace:
       <TASK>
       cdev_device_add+0x15e/0x1b0
       __iio_device_register+0x13b4/0x1af0 [industrialio]
       __devm_iio_device_register+0x22/0x90 [industrialio]
       max517_probe+0x3d8/0x6b4 [max517]
       i2c_device_probe+0xa81/0xc00
      
      When a fault is injected into device_add() and it returns an error while
      dev->devt is not set, cdev_add() was never called, so cdev_del() is not
      needed. Fix this by checking dev->devt in the error path.
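
A sketch of the symmetric add/undo logic in userspace. The struct, the two counters and cdev_device_add_model() are hypothetical, and the device_add() result is passed in so a failure can be injected.

```c
/* Hypothetical model of a device; devt == 0 means "no char dev number". */
struct dev_model {
	unsigned int devt;
};

static int cdev_add_calls;	/* how often the cdev was registered */
static int cdev_del_calls;	/* how often the registration was undone */

static int cdev_device_add_model(struct dev_model *dev, int device_add_rc)
{
	int rc;

	if (dev->devt)
		cdev_add_calls++;	/* models cdev_add() */

	rc = device_add_rc;		/* models the result of device_add() */
	if (rc && dev->devt)
		cdev_del_calls++;	/* undo only what was actually done */

	return rc;
}
```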
      
      Fixes: 233ed09d ("chardev: add helper function to register char devs with a struct device")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Link: https://lore.kernel.org/r/20221202030237.520280-1-yangyingliang@huawei.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • S
      mrp: introduce active flags to prevent UAF when applicant uninit · f4db5752
      schspa committed
      stable inclusion
      from stable-v4.19.270
      commit 78d48bc41f7726113c9f114268d3ab11212814da
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit ab037780 ]
      
      The caller of del_timer_sync() must prevent the timer from being
      restarted. Without this synchronization, there is a small probability
      that the cancellation will not be successful.
      
      syzbot reported the following crash:
      ==================================================================
      BUG: KASAN: use-after-free in hlist_add_head include/linux/list.h:929 [inline]
      BUG: KASAN: use-after-free in enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
      Write at addr f9ff000024df6058 by task syz-fuzzer/2256
      Pointer tag: [f9], memory tag: [fe]
      
      CPU: 1 PID: 2256 Comm: syz-fuzzer Not tainted 6.1.0-rc5-syzkaller-00008-ge01d50cb #0
      Hardware name: linux,dummy-virt (DT)
      Call trace:
       dump_backtrace.part.0+0xe0/0xf0 arch/arm64/kernel/stacktrace.c:156
       dump_backtrace arch/arm64/kernel/stacktrace.c:162 [inline]
       show_stack+0x18/0x40 arch/arm64/kernel/stacktrace.c:163
       __dump_stack lib/dump_stack.c:88 [inline]
       dump_stack_lvl+0x68/0x84 lib/dump_stack.c:106
       print_address_description mm/kasan/report.c:284 [inline]
       print_report+0x1a8/0x4a0 mm/kasan/report.c:395
       kasan_report+0x94/0xb4 mm/kasan/report.c:495
       __do_kernel_fault+0x164/0x1e0 arch/arm64/mm/fault.c:320
       do_bad_area arch/arm64/mm/fault.c:473 [inline]
       do_tag_check_fault+0x78/0x8c arch/arm64/mm/fault.c:749
       do_mem_abort+0x44/0x94 arch/arm64/mm/fault.c:825
       el1_abort+0x40/0x60 arch/arm64/kernel/entry-common.c:367
       el1h_64_sync_handler+0xd8/0xe4 arch/arm64/kernel/entry-common.c:427
       el1h_64_sync+0x64/0x68 arch/arm64/kernel/entry.S:576
       hlist_add_head include/linux/list.h:929 [inline]
       enqueue_timer+0x18/0xa4 kernel/time/timer.c:605
       mod_timer+0x14/0x20 kernel/time/timer.c:1161
       mrp_periodic_timer_arm net/802/mrp.c:614 [inline]
       mrp_periodic_timer+0xa0/0xc0 net/802/mrp.c:627
       call_timer_fn.constprop.0+0x24/0x80 kernel/time/timer.c:1474
       expire_timers+0x98/0xc4 kernel/time/timer.c:1519
      
      To fix it, we can introduce a new active flag to make sure the timer will
      not restart.
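
The flag idea, reduced to a single-threaded userspace model. All names here are hypothetical, and the locking that a real implementation would need around the flag is deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical model of the applicant and its self-rearming timer. */
struct applicant_model {
	bool active;	/* cleared before the timer is cancelled */
	int armed;	/* counts how often the timer was (re)armed */
};

static void timer_arm(struct applicant_model *app)
{
	app->armed++;
}

/* Timer callback: does its periodic work, then re-arms itself. */
static void periodic_timer(struct applicant_model *app)
{
	if (app->active)	/* never re-arm once teardown has started */
		timer_arm(app);
}

static void applicant_uninit(struct applicant_model *app)
{
	app->active = false;
	/* the synchronous timer cancellation would follow here */
}
```

Once the flag is cleared, a callback that is still running can no longer put the timer back on the queue behind the cancellation.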
      
      Reported-by: syzbot+6fd64001c20aa99e34a4@syzkaller.appspotmail.com
      Signed-off-by: Schspa Shi <schspa@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • S
      bpf: make sure skb->len != 0 when redirecting to a tunneling device · d14c4e41
      Stanislav Fomichev committed
      stable inclusion
      from stable-v4.19.270
      commit e6a63203e5a90a39392fa1a7ffc60f5e9baf642a
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 07ec7b50 ]
      
      syzkaller managed to trigger another case where skb->len == 0
      when we enter __dev_queue_xmit:
      
      WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 skb_assert_len include/linux/skbuff.h:2576 [inline]
      WARNING: CPU: 0 PID: 2470 at include/linux/skbuff.h:2576 __dev_queue_xmit+0x2069/0x35e0 net/core/dev.c:4295
      
      Call Trace:
       dev_queue_xmit+0x17/0x20 net/core/dev.c:4406
       __bpf_tx_skb net/core/filter.c:2115 [inline]
       __bpf_redirect_no_mac net/core/filter.c:2140 [inline]
       __bpf_redirect+0x5fb/0xda0 net/core/filter.c:2163
       ____bpf_clone_redirect net/core/filter.c:2447 [inline]
       bpf_clone_redirect+0x247/0x390 net/core/filter.c:2419
       bpf_prog_48159a89cb4a9a16+0x59/0x5e
       bpf_dispatcher_nop_func include/linux/bpf.h:897 [inline]
       __bpf_prog_run include/linux/filter.h:596 [inline]
       bpf_prog_run include/linux/filter.h:603 [inline]
       bpf_test_run+0x46c/0x890 net/bpf/test_run.c:402
       bpf_prog_test_run_skb+0xbdc/0x14c0 net/bpf/test_run.c:1170
       bpf_prog_test_run+0x345/0x3c0 kernel/bpf/syscall.c:3648
       __sys_bpf+0x43a/0x6c0 kernel/bpf/syscall.c:5005
       __do_sys_bpf kernel/bpf/syscall.c:5091 [inline]
       __se_sys_bpf kernel/bpf/syscall.c:5089 [inline]
       __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5089
       do_syscall_64+0x54/0x70 arch/x86/entry/common.c:48
       entry_SYSCALL_64_after_hwframe+0x61/0xc6
      
      The reproducer doesn't really reproduce outside of syzkaller
      environment, so I'm taking a guess here. It looks like we
      do generate correct ETH_HLEN-sized packet, but we redirect
      the packet to the tunneling device. Before we do so, we
      __skb_pull l2 header and arrive again at skb->len == 0.
      Doesn't seem like we can do anything better than having
      an explicit check after __skb_pull?
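
A sketch of what such an explicit check might look like, as a userspace model. struct skb_model and redirect_no_mac() are hypothetical, the pull is reduced to a subtraction, and the error values are assumptions.

```c
#include <errno.h>

#define ETH_HLEN 14	/* Ethernet header length in bytes */

/* Hypothetical stand-in for struct sk_buff; only the length is modelled. */
struct skb_model {
	unsigned int len;
};

static int redirect_no_mac(struct skb_model *skb, unsigned int mlen)
{
	if (mlen > skb->len)
		return -EINVAL;	/* model-only guard: cannot pull that much */

	skb->len -= mlen;	/* models pulling the L2 header */
	if (skb->len == 0)
		return -ERANGE;	/* nothing left to transmit (assumed errno) */

	return 0;
}
```

An ETH_HLEN-sized packet is rejected after the pull, while a longer one goes through with its length reduced by the header size.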
      
      Cc: Eric Dumazet <edumazet@google.com>
      Reported-by: syzbot+f635e86ec3fa0a37e019@syzkaller.appspotmail.com
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Link: https://lore.kernel.org/r/20221027225537.353077-1-sdf@google.com
      Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
    • Z
      ipmi: fix memleak when unload ipmi driver · ea64e5af
      Zhang Yuchen committed
      stable inclusion
      from stable-v4.19.270
      commit acc6579bea6a20e472eca3264203dd5854ca9b4e
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 36992eb6 ]
      
      After the IPMI disconnect problem, memory usage kept rising and we tried
      to unload the driver to free the memory. However, only part of the
      memory is recovered after the driver is unloaded. Using eBPF to hook
      the free functions, we find that neither ipmi_user nor ipmi_smi_msg is
      freed; only ipmi_recv_msg is freed.
      
      We find that the deliver_smi_err_response call in clean_smi_msgs does
      the destroy processing on each message from the xmit_msg queue without
      checking the return value and freeing ipmi_smi_msg.
      
      deliver_smi_err_response is called only at this location, so adding
      the free handling there does not affect any other caller.
      
      To verify, try using eBPF to trace the free functions.
      
        $ bpftrace -e 'kretprobe:ipmi_alloc_recv_msg {printf("alloc rcv %p\n", retval);}
            kprobe:free_recv_msg {printf("free recv %p\n", arg0)}
            kretprobe:ipmi_alloc_smi_msg {printf("alloc smi %p\n", retval);}
            kprobe:free_smi_msg {printf("free smi  %p\n", arg0)}'
      
      Signed-off-by: Zhang Yuchen <zhangyuchen.lcr@bytedance.com>
      Message-Id: <20221007092617.87597-4-zhangyuchen.lcr@bytedance.com>
      [Fixed the comment above handle_one_recv_msg().]
      Signed-off-by: Corey Minyard <cminyard@mvista.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      ea64e5af
    • R
      ACPICA: Fix error code path in acpi_ds_call_control_method() · e9109859
      Committed by Rafael J. Wysocki
      stable inclusion
      from stable-v4.19.270
      commit 2deb42c4f9776e59bee247c14af9c5e8c05ca9a6
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 404ec604 ]
      
      A use-after-free in acpi_ps_parse_aml() after a failing invocation of
      acpi_ds_call_control_method() is reported by KASAN [1] and code
      inspection reveals that next_walk_state pushed to the thread by
      acpi_ds_create_walk_state() is freed on errors, but it is not popped
      from the thread beforehand.  Thus acpi_ds_get_current_walk_state()
      called by acpi_ps_parse_aml() subsequently returns it as the new
      walk state which is incorrect.
      
      To address this, make acpi_ds_call_control_method() call
      acpi_ds_pop_walk_state() to pop next_walk_state from the thread before
      returning an error.
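      
      The pop-before-free idea can be sketched with a toy per-thread stack.
      This is only an illustration under assumed names (toy_walk_state,
      toy_thread, toy_call_method); it does not reproduce the ACPICA data
      structures or function signatures.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical walk state linked into a per-thread stack. */
struct toy_walk_state {
    struct toy_walk_state *next;
    int id;
};

struct toy_thread {
    struct toy_walk_state *top; /* current walk state */
};

static void toy_push(struct toy_thread *t, struct toy_walk_state *ws)
{
    ws->next = t->top;
    t->top = ws;
}

static struct toy_walk_state *toy_pop(struct toy_thread *t)
{
    struct toy_walk_state *ws = t->top;

    if (ws) {
        t->top = ws->next;
        ws->next = NULL;
    }
    return ws;
}

/*
 * Create and push a new walk state, then simulate a failure when
 * `fail` is nonzero. On failure the state is popped before it is
 * freed, so the thread never keeps a pointer to freed memory.
 */
static int toy_call_method(struct toy_thread *t, int id, int fail)
{
    struct toy_walk_state *next = calloc(1, sizeof(*next));

    if (!next)
        return -1;
    next->id = id;
    toy_push(t, next);

    if (fail) {
        struct toy_walk_state *popped = toy_pop(t);

        assert(popped == next);
        free(popped);
        return -1;
    }
    return 0;
}
```

      After a failed toy_call_method(), the thread's top entry is the same
      state it was before the call, which is the property the fix restores.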
      
      Link: https://lore.kernel.org/linux-acpi/20221019073443.248215-1-chenzhongjin@huawei.com/ # [1]
      Reported-by: Chen Zhongjin <chenzhongjin@huawei.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Chen Zhongjin <chenzhongjin@huawei.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      e9109859
    • S
      skbuff: Account for tail adjustment during pull operations · 8b13cf17
      Committed by Subash Abhinov Kasiviswanathan
      stable inclusion
      from stable-v4.19.270
      commit 2d59f0ca153e9573ec4f140988c0ccca0eb4181b
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 2d7afdcb ]
      
      Extending the tail can have some unexpected side effects if a program uses
      a helper like BPF_FUNC_skb_pull_data to read partial content beyond the
      head skb headlen when all the skbs in the gso frag_list are linear with no
      head_frag -
      
        kernel BUG at net/core/skbuff.c:4219!
        pc : skb_segment+0xcf4/0xd2c
        lr : skb_segment+0x63c/0xd2c
        Call trace:
         skb_segment+0xcf4/0xd2c
         __udp_gso_segment+0xa4/0x544
         udp4_ufo_fragment+0x184/0x1c0
         inet_gso_segment+0x16c/0x3a4
         skb_mac_gso_segment+0xd4/0x1b0
         __skb_gso_segment+0xcc/0x12c
         udp_rcv_segment+0x54/0x16c
         udp_queue_rcv_skb+0x78/0x144
         udp_unicast_rcv_skb+0x8c/0xa4
         __udp4_lib_rcv+0x490/0x68c
         udp_rcv+0x20/0x30
         ip_protocol_deliver_rcu+0x1b0/0x33c
         ip_local_deliver+0xd8/0x1f0
         ip_rcv+0x98/0x1a4
         deliver_ptype_list_skb+0x98/0x1ec
         __netif_receive_skb_core+0x978/0xc60
      
      Fix this by marking these skbs as GSO_DODGY so segmentation can handle
      the tail updates accordingly.
      
      Fixes: 3dcbdb13 ("net: gso: Fix skb_segment splat when splitting gso_size mangled skb having linear-headed frag_list")
      Signed-off-by: Sean Tranchetti <quic_stranche@quicinc.com>
      Signed-off-by: Subash Abhinov Kasiviswanathan <quic_subashab@quicinc.com>
      Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
      Link: https://lore.kernel.org/r/1671084718-24796-1-git-send-email-quic_subashab@quicinc.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      8b13cf17
    • D
      serial: pl011: Do not clear RX FIFO & RX interrupt in unthrottle. · 6c4ef1ea
      Committed by delisun
      stable inclusion
      from stable-v4.19.270
      commit 3a25c7891d717db137354476e0bb6eb34ad5f2d3
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 032d5a71 ]
      
      Clearing the RX FIFO will cause data loss.
      Copy the pl011_enable_interrupts implementation, and remove the clear
      interrupt and FIFO part of the code.
      
      Fixes: 211565b1 ("serial: pl011: UPSTAT_AUTORTS requires .throttle/unthrottle")
      Signed-off-by: delisun <delisun@pateo.com.cn>
      Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
      Link: https://lore.kernel.org/r/20221110020108.7700-1-delisun@pateo.com.cn
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      6c4ef1ea
    • J
      serial: amba-pl011: avoid SBSA UART accessing DMACR register · 7bac3bea
      Committed by Jiamei Xie
      stable inclusion
      from stable-v4.19.270
      commit 78d837ce20517e0c1ff3ebe08ad64636e02c2e48
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 94cdb9f3 ]
      
      Chapter "B Generic UART" in "ARM Server Base System Architecture" [1]
      documentation describes a generic UART interface. Such generic UART
      does not support DMA. In current code, sbsa_uart_pops and
      amba_pl011_pops share the same stop_rx operation, which will invoke
      pl011_dma_rx_stop, leading to an access of the DMACR register. This
      commit adds a using_rx_dma check in pl011_dma_rx_stop to avoid the
      access to the DMACR register for SBSA UARTs, which do not support DMA.
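      
      A minimal sketch of such a guard, using a toy port structure rather
      than the driver's real one: toy_uart, TOY_RXDMAE and toy_dma_rx_stop
      are hypothetical names, and the access counter exists only to make
      the effect observable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_RXDMAE 0x1u /* hypothetical "RX DMA enable" bit */

/* Toy model of a UART port; not the driver's real structure. */
struct toy_uart {
    bool using_rx_dma;   /* false on a port without DMA support */
    unsigned int dmacr;  /* stand-in for the DMA control register */
    int dmacr_accesses;  /* how often the register was touched */
};

/* Stop RX DMA, but only touch the register if RX DMA is in use. */
static void toy_dma_rx_stop(struct toy_uart *u)
{
    assert(u != NULL);

    if (!u->using_rx_dma)
        return; /* port without DMA: leave the register alone */

    u->dmacr &= ~TOY_RXDMAE;
    u->dmacr_accesses++;
}
```

      On a port with using_rx_dma unset, the function returns before any
      register access, so an emulator that faults on unimplemented
      registers is never triggered by this path.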
      
      When the kernel enables the DMA engine with "CONFIG_DMA_ENGINE=y", the
      Linux SBSA PL011 driver will access the PL011 DMACR register in some
      functions. For most real SBSA PL011 hardware implementations, the DMACR
      write will be ignored, so these DMACR operations do not cause obvious
      problems. But for some virtual SBSA PL011 hardware, like the Xen
      virtual SBSA PL011 (vpl011) device, the behaviour might be different.
      Xen vpl011 emulation will inject a data abort into the guest when the
      guest accesses an unimplemented UART register. As Xen vpl011 is SBSA
      compatible, it does not implement the DMACR register. So when the Linux
      SBSA PL011 driver accesses the DMACR register, it gets an unhandled data
      abort fault and the application gets a segmentation fault:
      Unhandled fault at 0xffffffc00944d048
      Mem abort info:
        ESR = 0x96000000
        EC = 0x25: DABT (current EL), IL = 32 bits
        SET = 0, FnV = 0
        EA = 0, S1PTW = 0
        FSC = 0x00: ttbr address size fault
      Data abort info:
        ISV = 0, ISS = 0x00000000
        CM = 0, WnR = 0
      swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000020e2e000
      [ffffffc00944d048] pgd=100000003ffff803, p4d=100000003ffff803, pud=100000003ffff803, pmd=100000003fffa803, pte=006800009c090f13
      Internal error: ttbr address size fault: 96000000 [#1] PREEMPT SMP
      ...
      Call trace:
       pl011_stop_rx+0x70/0x80
       tty_port_shutdown+0x7c/0xb4
       tty_port_close+0x60/0xcc
       uart_close+0x34/0x8c
       tty_release+0x144/0x4c0
       __fput+0x78/0x220
       ____fput+0x1c/0x30
       task_work_run+0x88/0xc0
       do_notify_resume+0x8d0/0x123c
       el0_svc+0xa8/0xc0
       el0t_64_sync_handler+0xa4/0x130
       el0t_64_sync+0x1a0/0x1a4
      Code: b9000083 b901f001 794038a0 8b000042 (b9000041)
      ---[ end trace 83dd93df15c3216f ]---
      note: bootlogd[132] exited with preempt_count 1
      /etc/rcS.d/S07bootlogd: line 47: 132 Segmentation fault start-stop-daemon
      
      This has been discussed in the Xen community, and we think it should be
      fixed in Linux. See [2] for more information.
      
      [1] https://developer.arm.com/documentation/den0094/c/?lang=en
      [2] https://lists.xenproject.org/archives/html/xen-devel/2022-11/msg00543.html
      
      Fixes: 0dd1e247 (drivers: PL011: add support for the ARM SBSA generic UART)
      Signed-off-by: Jiamei Xie <jiamei.xie@arm.com>
      Reviewed-by: Andre Przywara <andre.przywara@arm.com>
      Link: https://lore.kernel.org/r/20221117103237.86856-1-jiamei.xie@arm.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      7bac3bea
    • Y
      class: fix possible memory leak in __class_register() · 45fdebf6
      Committed by Yang Yingliang
      stable inclusion
      from stable-v4.19.270
      commit 3bb9c92c27624ad076419a70f2b1a30cd1f8bbbd
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 8c3e8a6b ]
      
      If class_add_groups() returns an error, 'cp->subsys' needs to be
      unregistered and 'cp' needs to be freed.
      
      We cannot call kset_unregister() here, because 'cls' would be freed
      in the callback function class_release() and is also freed in the
      caller's error path, which would cause a double free.
      
      So fix this by calling kobject_del() and kfree_const(name) to
      clean up the kobject. In addition, call kfree() to free 'cp'.
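      
      The ownership rule behind this fix can be sketched in a toy model:
      on failure, the registration function releases only what it
      allocated itself and leaves the caller-owned object alone. The names
      below (toy_class, toy_private, toy_class_register) are hypothetical
      and the model ignores kobjects entirely.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct toy_private {
    char *name; /* private copy of the class name */
};

/* Owned by the caller, who frees it on its own error path. */
struct toy_class {
    const char *name;
    struct toy_private *p;
};

static int allocs, frees; /* bookkeeping to make leaks visible */

static void *toy_alloc(size_t n)
{
    void *p = malloc(n);

    if (p)
        allocs++;
    return p;
}

static void toy_free(void *p)
{
    if (p) {
        frees++;
        free(p);
    }
}

/* `groups_fail` simulates an error from the "add groups" step. */
static int toy_class_register(struct toy_class *cls, int groups_fail)
{
    struct toy_private *cp;

    assert(cls != NULL && cls->name != NULL);

    cp = toy_alloc(sizeof(*cp));
    if (!cp)
        return -1;
    cp->name = toy_alloc(strlen(cls->name) + 1);
    if (!cp->name) {
        toy_free(cp);
        return -1;
    }
    strcpy(cp->name, cls->name);
    cls->p = cp;

    if (groups_fail) {
        /* Free the name copy and cp; never free cls here. */
        toy_free(cp->name);
        toy_free(cp);
        cls->p = NULL;
        return -1;
    }
    return 0;
}
```

      After a failed registration the allocation and free counters match,
      and 'cls' is still valid for the caller to release exactly once.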
      
      Fault injection test can trigger this:
      
      unreferenced object 0xffff888102fa8190 (size 8):
        comm "modprobe", pid 502, jiffies 4294906074 (age 49.296s)
        hex dump (first 8 bytes):
          70 6b 74 63 64 76 64 00                          pktcdvd.
        backtrace:
          [<00000000e7c7703d>] __kmalloc_track_caller+0x1ae/0x320
          [<000000005e4d70bc>] kstrdup+0x3a/0x70
          [<00000000c2e5e85a>] kstrdup_const+0x68/0x80
          [<000000000049a8c7>] kvasprintf_const+0x10b/0x190
          [<0000000029123163>] kobject_set_name_vargs+0x56/0x150
          [<00000000747219c9>] kobject_set_name+0xab/0xe0
          [<0000000005f1ea4e>] __class_register+0x15c/0x49a
      
      unreferenced object 0xffff888037274000 (size 1024):
        comm "modprobe", pid 502, jiffies 4294906074 (age 49.296s)
        hex dump (first 32 bytes):
          00 40 27 37 80 88 ff ff 00 40 27 37 80 88 ff ff  .@'7.....@'7....
          00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .....N..........
        backtrace:
          [<00000000151f9600>] kmem_cache_alloc_trace+0x17c/0x2f0
          [<00000000ecf3dd95>] __class_register+0x86/0x49a
      
      Fixes: ced6473e ("driver core: class: add class_groups support")
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Link: https://lore.kernel.org/r/20221026082803.3458760-1-yangyingliang@huawei.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      45fdebf6
    • Z
      crypto: tcrypt - Fix multibuffer skcipher speed test mem leak · bf0e65e2
      Committed by Zhang Yiqun
      stable inclusion
      from stable-v4.19.270
      commit e4ec2042899536b5a8f714b6eda4443d717f41bf
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit 1aa33fc8 ]
      
      Previously, the data for the mb-skcipher test was allocated
      twice, so the first allocated memory area was never freed,
      which causes a memory leak. This patch removes one of the
      allocations to fix the leak.
      
      Fixes: e161c593 ("crypto: tcrypt - add multibuf skcipher...")
      Signed-off-by: Zhang Yiqun <zhangyiqun@phytium.com.cn>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      bf0e65e2
    • Y
      blktrace: Fix output non-blktrace event when blk_classic option enabled · 41ee72be
      Committed by Yang Jihong
      stable inclusion
      from stable-v4.19.270
      commit 7349d943eaa189cbc13e02dfd3871c868253cf95
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6DPF8
      CVE: NA
      
      --------------------------------
      
      [ Upstream commit f596da3e ]
      
      When the blk_classic option is enabled, non-blktrace events must be
      filtered out. Otherwise, events of other types are output in the blktrace
      classic format, which is unexpected.
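      
      The filtering rule can be sketched as a small predicate in front of
      the classic formatter. The enum values and the function name
      toy_classic_print are hypothetical; the sketch only shows that
      events of other types are reported as unhandled instead of being
      formatted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

enum toy_event_type { TOY_EVENT_BLK, TOY_EVENT_OTHER };
enum toy_print_result { TOY_HANDLED, TOY_UNHANDLED };

/*
 * Format an event in the "classic" style only when the classic option
 * is on AND the event is a block-trace event. Anything else is left
 * for another formatter and the buffer is not modified.
 */
static enum toy_print_result toy_classic_print(enum toy_event_type type,
                                               bool classic,
                                               char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (!classic || type != TOY_EVENT_BLK)
        return TOY_UNHANDLED;

    snprintf(buf, len, "blk classic event");
    return TOY_HANDLED;
}
```

      Without the type check, an event of another type would reach the
      snprintf() call and be printed in the classic layout, which is the
      behaviour the commit removes.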
      
      The problem can be triggered in the following ways:
      
        # echo 1 > /sys/kernel/debug/tracing/options/blk_classic
        # echo 1 > /sys/kernel/debug/tracing/events/enable
        # echo blk > /sys/kernel/debug/tracing/current_tracer
        # cat /sys/kernel/debug/tracing/trace_pipe
      
      Fixes: c71a8961 ("blktrace: add ftrace plugin")
      Signed-off-by: Yang Jihong <yangjihong1@huawei.com>
      Link: https://lore.kernel.org/r/20221122040410.85113-1-yangjihong1@huawei.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
      41ee72be