1. 27 Jun, 2023 2 commits
  2. 26 Jun, 2023 1 commit
  3. 19 Jun, 2023 1 commit
  4. 09 Jun, 2023 5 commits
  5. 16 May, 2023 2 commits
  6. 10 May, 2023 1 commit
    • writeback, cgroup: fix null-ptr-deref write in bdi_split_work_to_wbs · 5703fb4e
      Baokun Li committed
      mainline inclusion
      from mainline-v6.3-rc8
      commit 1ba1199e
      category: bugfix
      bugzilla: 188601, https://gitee.com/openeuler/kernel/issues/I6TNTC
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1ba1199ec5747f475538c0d25a32804e5ba1dfde
      
      --------------------------------
      
      KASAN reports a null-ptr-deref:
      ==================================================================
      BUG: KASAN: null-ptr-deref in bdi_split_work_to_wbs+0x5c5/0x7b0
      Write of size 8 at addr 0000000000000000 by task sync/943
      CPU: 5 PID: 943 Comm: sync Tainted: 6.3.0-rc5-next-20230406-dirty #461
      Call Trace:
       <TASK>
       dump_stack_lvl+0x7f/0xc0
       print_report+0x2ba/0x340
       kasan_report+0xc4/0x120
       kasan_check_range+0x1b7/0x2e0
       __kasan_check_write+0x24/0x40
       bdi_split_work_to_wbs+0x5c5/0x7b0
       sync_inodes_sb+0x195/0x630
       sync_inodes_one_sb+0x3a/0x50
       iterate_supers+0x106/0x1b0
       ksys_sync+0x98/0x160
      [...]
      ==================================================================
      
      The race that causes the above issue is as follows:
      
                 cpu1                     cpu2
      -------------------------|-------------------------
      inode_switch_wbs
       INIT_WORK(&isw->work, inode_switch_wbs_work_fn)
       queue_rcu_work(isw_wq, &isw->work)
       // queue_work async
        inode_switch_wbs_work_fn
         wb_put_many(old_wb, nr_switched)
          percpu_ref_put_many
           ref->data->release(ref)
           cgwb_release
            queue_work(cgwb_release_wq, &wb->release_work)
            // queue_work async
             &wb->release_work
             cgwb_release_workfn
                                  ksys_sync
                                   iterate_supers
                                    sync_inodes_one_sb
                                     sync_inodes_sb
                                      bdi_split_work_to_wbs
                                       kmalloc(sizeof(*work), GFP_ATOMIC)
                                       // alloc memory failed
              percpu_ref_exit
               ref->data = NULL
               kfree(data)
                                       wb_get(wb)
                                        percpu_ref_get(&wb->refcnt)
                                         percpu_ref_get_many(ref, 1)
                                          atomic_long_add(nr, &ref->data->count)
                                           atomic64_add(i, v)
                                           // trigger null-ptr-deref
      
      bdi_split_work_to_wbs() traverses &bdi->wb_list to split work into all
      wbs.  If the allocation of new work fails, the on-stack fallback will be
      used and the reference count of the current wb is increased afterwards.
      If cgroup writeback membership switches occur before getting the reference
      count and the current wb is released as old_wb, then calling wb_get() or
      wb_put() will trigger the null pointer dereference above.
      
      This issue was introduced in v4.3-rc7 (see fix tag1).  Both
      sync_inodes_sb() and __writeback_inodes_sb_nr() calls to
      bdi_split_work_to_wbs() can trigger this issue.  For scenarios called via
      sync_inodes_sb(), commit 7fc5854f ("writeback: synchronize sync(2)
      against cgroup writeback membership switches") originally reduced the
      likelihood of the issue by adding wb_switch_rwsem.  However, a change in
      v5.14-rc1 (see fix tag2) removed "inode_io_list_del_locked(inode, old_wb)"
      from inode_switch_wbs_work_fn(), so wb->state still contains
      WB_has_dirty_io.  As a result, old_wb is not skipped when traversing wbs
      in bdi_split_work_to_wbs(), and the issue becomes easily reproducible again.
      
      To solve this problem, percpu_ref_exit() is called under RCU protection to
      avoid race between cgwb_release_workfn() and bdi_split_work_to_wbs().
      Moreover, replace wb_get() with wb_tryget() in bdi_split_work_to_wbs(),
      and skip the current wb if wb_tryget() fails because the wb has already
      been shut down.
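
      The difference between an unconditional get and a tryget can be
      illustrated with a minimal userspace sketch.  It is not the kernel's
      percpu_ref implementation: struct toy_ref, toy_tryget(), toy_put() and
      toy_split_work() are hypothetical names, and a plain C11 atomic counter
      stands in for the per-CPU reference count.  The RCU side of the fix
      (deferring percpu_ref_exit()) is not modeled here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted object such as a wb. */
struct toy_ref {
    atomic_long count;   /* 0 means the object has already been shut down */
};

/* Conditional get: takes a reference only while count is still non-zero. */
static bool toy_tryget(struct toy_ref *r)
{
    long old = atomic_load(&r->count);

    while (old != 0) {
        /* On failure, 'old' is reloaded with the current value. */
        if (atomic_compare_exchange_weak(&r->count, &old, old + 1))
            return true;
    }
    return false;
}

/* Drop a reference previously taken with toy_tryget(). */
static void toy_put(struct toy_ref *r)
{
    long old = atomic_fetch_sub(&r->count, 1);

    assert(old > 0);   /* a put without a matching get would be a bug */
}

/*
 * Walk all entries and "queue work" only for those that are still alive.
 * Entries whose count already dropped to zero are skipped instead of
 * being touched, which is the behavior the fix introduces.
 * Returns the number of entries that were processed.
 */
static int toy_split_work(struct toy_ref *wbs, int n)
{
    int queued = 0;

    for (int i = 0; i < n; i++) {
        if (!toy_tryget(&wbs[i]))
            continue;          /* already shut down: skip this entry */
        queued++;              /* placeholder for the real work */
        toy_put(&wbs[i]);
    }
    return queued;
}
```

      In this model, an entry whose count is already zero is never
      incremented again, so a released object is left untouched.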
      
      Link: https://lkml.kernel.org/r/20230410130826.1492525-1-libaokun1@huawei.com
      Fixes: b817525a ("writeback: bdi_writeback iteration must not skip dying ones")
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Christian Brauner <brauner@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Hou Tao <houtao1@huawei.com>
      Cc: yangerkun <yangerkun@huawei.com>
      Cc: Zhang Yi <yi.zhang@huawei.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      
      Conflicts:
      	mm/backing-dev.c
      Signed-off-by: Baokun Li <libaokun1@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Reviewed-by: Yang Erkun <yangerkun@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  7. 29 Mar, 2023 5 commits
  8. 22 Mar, 2023 3 commits
    • mm: optimize do_wp_page() for fresh pages in local LRU pagevecs · 060210a9
      David Hildenbrand committed
      mainline inclusion
      from mainline-v5.18-rc1
      commit d4c47097
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NK0S
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d4c470970d45c863fafc757521a82be2f80b1232
      
      --------------------------------
      
      For example, if a page just got swapped in via a read fault, the LRU
      pagevecs might still hold a reference to the page.  If we trigger a write
      fault on such a page, the additional reference from the LRU pagevecs will
      prohibit reusing the page.
      
      Let's conditionally drain the local LRU pagevecs when we stumble over a
      !PageLRU() page.  We cannot easily drain remote LRU pagevecs and it might
      not be desirable performance-wise.  Consequently, this will only avoid
      copying in some cases.
      
      Add a simple "page_count(page) > 3" check first but keep the
      "page_count(page) > 1 + PageSwapCache(page)" check in place, as we want to
      minimize cases where we remove a page from the swapcache but won't be able
      to reuse it, for example, because another process has it mapped R/O, to
      not affect reclaim.
      
      We cannot easily handle the following cases and we will always have to
      copy:
      
      (1) The page is referenced in the LRU pagevecs of other CPUs. We really
          would have to drain the LRU pagevecs of all CPUs -- most probably
          copying is much cheaper.
      
      (2) The page is already PageLRU() but is getting moved between LRU
          lists, for example, for activation (e.g., mark_page_accessed()),
          deactivation (MADV_COLD), or lazyfree (MADV_FREE). We'd have to
          drain mostly unconditionally, which might be bad performance-wise.
          Most probably this won't happen too often in practice.
      
      Note that there are other reasons why an anon page might temporarily not
      be PageLRU(): for example, compaction and migration have to isolate LRU
      pages from the LRU lists first (isolate_lru_page()), moving them to
      temporary local lists and clearing PageLRU() and holding an additional
      reference on the page.  In that case, we'll always copy.
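
      The sequence of checks described above can be sketched as a small
      userspace model.  It is only an illustration under stated assumptions:
      struct toy_page and the toy_*() helpers are hypothetical, page locking
      and KSM handling are left out, and the reading of the "> 3" bound as
      "one mapping reference, one swapcache reference, one local pagevec
      reference" is an interpretation, not something the message states.

```c
#include <stdbool.h>

/* Toy stand-in for struct page; every field is hypothetical. */
struct toy_page {
    int  mapping_refs;      /* references held by page table mappings */
    bool in_swapcache;      /* the swapcache holds one reference */
    bool swapcache_pinned;  /* under writeback, or extra swap entries exist */
    bool on_lru;            /* models PageLRU() */
    bool in_local_pagevec;  /* a local LRU pagevec holds one reference */
    int  other_refs;        /* remote pagevecs, isolation, and so on */
};

/* Models page_count(): the sum of all references in this toy model. */
static int toy_page_count(const struct toy_page *p)
{
    return p->mapping_refs + (p->in_swapcache ? 1 : 0) +
           (p->in_local_pagevec ? 1 : 0) + p->other_refs;
}

/* Models draining the local pagevecs: the page moves onto the LRU. */
static void toy_lru_add_drain(struct toy_page *p)
{
    if (p->in_local_pagevec) {
        p->in_local_pagevec = false;
        p->on_lru = true;
    }
}

/* Returns true if the page may be reused, false if it has to be copied. */
static bool toy_can_reuse(struct toy_page *p)
{
    /* Cheap early check: too many references, no hope of reuse. */
    if (toy_page_count(p) > 3)
        return false;

    /* Only the local pagevecs can be drained; remote ones stay as-is. */
    if (!p->on_lru)
        toy_lru_add_drain(p);

    /* Avoid freeing the swapcache if reuse would still be impossible. */
    if (toy_page_count(p) > 1 + (p->in_swapcache ? 1 : 0))
        return false;

    /* Dropping the swapcache reference fails while it is pinned. */
    if (p->in_swapcache && !p->swapcache_pinned)
        p->in_swapcache = false;

    return toy_page_count(p) == 1;
}
```

      In this model, a freshly swapped-in page that sits in the local
      pagevec is reused after the drain, while a reference held elsewhere
      (other_refs) or a swapcache entry under writeback still forces a copy.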
      
      This change seems to be fairly effective with the reproducer [1] shared by
      Nadav, as long as writeback is done synchronously, for example, using
      zram.  However, with asynchronous writeback, we'll usually fail to free
      the swapcache because the page is still under writeback: something we
      cannot easily optimize for, and maybe it's not really relevant in
      practice.
      
      [1] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jann Horn <jannh@google.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Liang Zhang <zhangliang5@huawei.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • mm: optimize do_wp_page() for exclusive pages in the swapcache · 4c942f5f
      David Hildenbrand committed
      mainline inclusion
      from mainline-v5.18-rc1
      commit 53a05ad9
      category: bugfix
      bugzilla: https://gitee.com/src-openeuler/kernel/issues/I6NK0S
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=53a05ad9f21d858d24f76d12b3e990405f2036d1
      
      --------------------------------
      
      Patch series "mm: COW fixes part 1: fix the COW security issue for THP and swap", v3.
      
      This series attempts to optimize and streamline the COW logic for ordinary
      anon pages and THP anon pages, fixing two remaining instances of
      CVE-2020-29374 in do_swap_page() and do_huge_pmd_wp_page(): information
      can leak from a parent process to a child process via anonymous pages
      shared during fork().
      
      This issue, including other related COW issues, has been summarized in [2]:
      
       "1. Observing Memory Modifications of Private Pages From A Child Process
      
        Long story short: process-private memory might not be as private as you
        think once you fork(): successive modifications of private memory
        regions in the parent process can still be observed by the child
        process, for example, by smart use of vmsplice()+munmap().
      
        The core problem is that pinning pages readable in a child process, such
        as done via the vmsplice system call, can result in a child process
        observing memory modifications done in the parent process the child is
        not supposed to observe. [1] contains an excellent summary and [2]
        contains further details. This issue was assigned CVE-2020-29374 [9].
      
        For this to trigger, it's required to use a fork() without subsequent
        exec(), for example, as used under Android zygote. Without further
        details about an application that forks less-privileged child processes,
        one cannot really say what's actually affected and what's not -- see the
        details section at the end of this mail for a short sshd/openssh analysis.
      
        While commit 17839856 ("gup: document and work around "COW can break
        either way" issue") fixed this issue and resulted in other problems
        (e.g., ptrace on pmem), commit 09854ba9 ("mm: do_wp_page()
        simplification") re-introduced part of the problem unfortunately.
      
        The original reproducer can be modified quite easily to use THP [3] and
        make the issue appear again on upstream kernels. I modified it to use
        hugetlb [4] and it triggers as well. The problem is certainly less
        severe with hugetlb than with THP; it merely highlights that we still
        have plenty of open holes we should be closing/fixing.
      
        Regarding vmsplice(), the only known workaround is to disallow the
        vmsplice() system call ... or disable THP and hugetlb. But who knows
        what else is affected (RDMA? O_DIRECT?) to achieve the same goal -- in
        the end, it's a more generic issue"
      
      This security issue was first reported by Jann Horn on 27 May 2020 and it
      currently affects anonymous pages during swapin, anonymous THP and hugetlb.
      This series tackles anonymous pages during swapin and anonymous THP:
      
       - do_swap_page() for handling COW on PTEs during swapin directly
      
       - do_huge_pmd_wp_page() for handling COW on PMD-mapped THP during write
         faults
      
      With this series, we'll apply the same COW logic we have in do_wp_page()
      to all swappable anon pages: don't reuse (map writable) the page in
      case there are additional references (page_count() != 1). All users of
      reuse_swap_page() are removed, and consequently reuse_swap_page() is
      removed.
      
      In general, we're struggling with the following COW-related issues:
      
      (1) "missed COW": we fail to copy on write and reuse the page (map it
          writable) although we must copy because there are pending references
          from another process to this page. The result is a security issue.
      
      (2) "wrong COW": we copy on write although we wouldn't have to and
          shouldn't: if there are valid GUP references, they will become out
          of sync with the pages mapped into the page table. We fail to detect
          that such a page can be reused safely, especially if never more than
          a single process mapped the page. The result is an intra process
          memory corruption.
      
      (3) "unnecessary COW": we copy on write although we wouldn't have to:
          performance degradation and temporarily increased swap+memory
          consumption can be the result.
      
      While this series fixes (1) for swappable anon pages, it tries to reduce
      reported cases of (3) first as well and as easily as possible to limit the
      impact when streamlining.  The individual patches try to describe in
      which cases we will run into (3).
      
      This series certainly makes (2) worse for THP, because a THP will now
      get PTE-mapped on write faults if there are additional references, even
      if there was only ever a single process involved: once PTE-mapped, we'll
      copy each and every subpage and won't reuse any subpage as long as the
      underlying compound page wasn't split.
      
      I'm working on an approach to fix (2) and improve (3): PageAnonExclusive
      to mark anon pages that are exclusive to a single process, allow GUP
      pins only on such exclusive pages, and allow turning exclusive pages
      shared (clearing PageAnonExclusive) only if there are no GUP pins.  Anon
      pages with PageAnonExclusive set never have to be copied during write
      faults, but eventually during fork() if they cannot be turned shared.
      The improved reuse logic in this series will essentially also be the
      logic to reset PageAnonExclusive.  This work will certainly take a
      while, but I'm planning on sharing details before having code fully
      ready.
      
      cleanups related to reuse_swap_page().
      
      Notes:
      * For now, I'll leave hugetlb code untouched: "unnecessary COW" might
        easily break existing setups because hugetlb pages are a scarce resource
        and we could just end up having to crash the application when we run out
        of hugetlb pages. We have to be very careful and the security aspect with
        hugetlb is most certainly less relevant than for unprivileged anon pages.
      * Instead of lru_add_drain() we might actually just drain the lru_add list
        or even just remove the single page of interest from the lru_add list.
        This would require a new helper function, and could be added if the
        conditional lru_add_drain() turn out to be a problem.
      * I extended the test case already included in [1] to also test for the
        newly found do_swap_page() case. I'll send that out separately once/if
        this part was merged.
      
      [1] https://lkml.kernel.org/r/20211217113049.23850-1-david@redhat.com
      [2] https://lore.kernel.org/r/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com
      
      This patch (of 9):
      
      Liang Zhang reported [1] that the current COW logic in do_wp_page() is
      sub-optimal when it comes to swap+read fault+write fault of anonymous
      pages that have a single user, visible via a performance degradation in
      the redis benchmark.  Something similar was previously reported [2] by
      Nadav with a simple reproducer.
      
      After we put an anon page into the swapcache and unmapped it from a single
      process, that process might read that page again and refault it read-only.
      If that process then writes to that page, the process is actually the
      exclusive user of the page.  However, the COW logic in do_wp_page() won't
      be able to reuse it due to the additional reference from the swapcache.
      
      Let's optimize for pages that have been added to the swapcache but only
      have an exclusive user.  Try removing the swapcache reference if there is
      hope that we're the exclusive user.
      
      We will fail removing the swapcache reference in two scenarios:
      (1) There are additional swap entries referencing the page: copying
          instead of reusing is the right thing to do.
      (2) The page is under writeback: theoretically we might be able to reuse
          in some cases, however, we cannot remove the additional reference
          and will have to copy.
      
      Note that we'll only try removing the page from the swapcache when it's
      highly likely that we'll be the exclusive owner after removing the page
      from the swapcache.  As we're about to map that page writable and redirty
      it, that should not affect reclaim but is rather the right thing to do.
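
      The swapcache handling and its two failure scenarios can be sketched
      with a small userspace model.  This is an illustration only, not the
      kernel code: struct swap_toy_page, swap_toy_try_free() and
      swap_toy_reuse_on_write() are hypothetical names, and page locking is
      not modeled.

```c
#include <stdbool.h>

/* Toy stand-in for an anon page; every field is hypothetical. */
struct swap_toy_page {
    int  refcount;            /* models page_count() */
    bool in_swapcache;        /* the swapcache accounts for one reference */
    int  extra_swap_entries;  /* other swap entries referencing the page */
    bool under_writeback;     /* writeback to swap is still in flight */
};

/*
 * Models dropping the swapcache reference.  It fails in the two
 * scenarios listed above: additional swap entries, or writeback.
 */
static bool swap_toy_try_free(struct swap_toy_page *p)
{
    if (!p->in_swapcache)
        return false;
    if (p->extra_swap_entries > 0 || p->under_writeback)
        return false;
    p->in_swapcache = false;
    p->refcount--;
    return true;
}

/* Returns true if a write fault may reuse the page, false if it must copy. */
static bool swap_toy_reuse_on_write(struct swap_toy_page *p)
{
    /* Only bother with the swapcache if exclusivity is likely afterwards. */
    if (p->refcount > 1 + (p->in_swapcache ? 1 : 0))
        return false;

    if (p->in_swapcache)
        swap_toy_try_free(p);

    /* Reuse only when the faulting mapping holds the sole reference. */
    return p->refcount == 1;
}
```

      In this model, a page with one mapping reference plus the swapcache
      reference is reused once the swapcache reference is dropped, whereas a
      page under writeback or with extra swap entries is copied.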
      
      Further, we might have additional references from the LRU pagevecs, which
      will force us to copy instead of being able to reuse.  We'll try handling
      such references for some scenarios next.  Concurrent writeback cannot be
      handled easily and we'll always have to copy.
      
      While at it, remove the superfluous page_mapcount() check: it's
      implicitly covered by the page_count() for ordinary anon pages.
      
      [1] https://lkml.kernel.org/r/20220113140318.11117-1-zhangliang5@huawei.com
      [2] https://lkml.kernel.org/r/0480D692-D9B2-429A-9A88-9BBA1331AC3A@gmail.com
      
      Link: https://lkml.kernel.org/r/20220131162940.210846-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Liang Zhang <zhangliang5@huawei.com>
      Reported-by: Nadav Amit <nadav.amit@gmail.com>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Roman Gushchin <roman.gushchin@linux.dev>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
    • mm/vmalloc: huge vmalloc backing pages should be split rather than compound · 0242e899
      Nicholas Piggin committed
      mainline inclusion
      from mainline-v5.18-rc4
      commit 3b8000ae
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I6LD0S
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3b8000ae185cb068adbda5f966a3835053c85fd4
      
      --------------------------------
      
      Huge vmalloc higher-order backing pages were allocated with __GFP_COMP
      in order to allow the sub-pages to be refcounted by callers such as
      "remap_vmalloc_page [sic]" (remap_vmalloc_range).
      
      However a similar problem exists for other struct page fields callers
      use, for example fb_deferred_io_fault() takes a vmalloc'ed page and
      not only refcounts it but uses ->lru, ->mapping, ->index.
      
      This is not compatible with compound sub-pages, and can cause bad page
      state issues like
      
        BUG: Bad page state in process swapper/0  pfn:00743
        page:(____ptrval____) refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x743
        flags: 0x7ffff000000000(node=0|zone=0|lastcpupid=0x7ffff)
        raw: 007ffff000000000 c00c00000001d0c8 c00c00000001d0c8 0000000000000000
        raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
        page dumped because: corrupted mapping in tail page
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.18.0-rc3-00082-gfc6fff4a7ce1-dirty #2810
        Call Trace:
          dump_stack_lvl+0x74/0xa8 (unreliable)
          bad_page+0x12c/0x170
          free_tail_pages_check+0xe8/0x190
          free_pcp_prepare+0x31c/0x4e0
          free_unref_page+0x40/0x1b0
          __vunmap+0x1d8/0x420
          ...
      
      The correct approach is to use split high-order pages for the huge
      vmalloc backing. These allow callers to treat them in exactly the same
      way as individually-allocated order-0 pages.
      
      Link: https://lore.kernel.org/all/14444103-d51b-0fb3-ee63-c3f182f0b546@molgen.mpg.de/
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Cc: Paul Menzel <pmenzel@molgen.mpg.de>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Rick Edgecombe <rick.p.edgecombe@intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      conflicts:
      	mm/vmalloc.c
      Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
  9. 22 Feb, 2023 1 commit
  10. 08 Feb, 2023 2 commits
  11. 31 Jan, 2023 2 commits
  12. 18 Jan, 2023 7 commits
  13. 04 Jan, 2023 2 commits
  14. 13 Dec, 2022 3 commits
  15. 07 Dec, 2022 2 commits
  16. 29 Nov, 2022 1 commit