- 19 July 2022, 20 commits
-
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- The user doesn't care about the start address of the dvpp range; what matters is that the virtual space tagged as DVPP lies within a 16G range. So we can safely drop the dvpp address space during the merging process as long as it is empty. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- Currently the dvpp range is global for each device. This is unreasonable after the reconstruction that made the DVPP mappings private to each process or group. This patch allows the dvpp range to be configured for each process. The dvpp range for each dvpp mapping can only be configured once, just as in the old version. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- When SPG_NOD_DVPP is specified to sp_group_add_task, we don't create a DVPP mapping for the newly created sp_group, and the new group cannot allocate DVPP memory. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- 1. Add a list to sp_mapping to record all the sp_groups attached to it. 2. Initialize the sp_mapping for the local_group when it is created, so when we add a task to a group we must merge the dvpp mapping of the local group. 3. Two groups can be merged if and only if at least one of them is empty; the empty mapping is then dropped and the other mapping is attached to both groups. This requires traversing all the groups attached to the mapping. 4. A mapping is considered empty when no spa is allocated from it and its address space is the default. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- A few structures must have been created by the time a process enters the sharepool subsystem, whether by allocating sharepool memory, being added to an spg, doing k2u, and so on. Currently we create those structures just before we actually need them. For example, we find or create an sp_spa_stat after a successful memory allocation and before updating the statistical structure. The creation of a new structure may fail due to OOM, and we should then reclaim the memory allocated and revert all the previous steps; or we simply forget to do that and a potential memory leak occurs. This design makes it confusing to tell when a structure is actually created, and we always worry about potential memory leaks when changing the code around it. A better solution is to initialize all those structures at the same time, when a process joins the sharepool subsystem. In the future, we will remove the unnecessary statistical structures. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- There are two types of memory allocated from the sharepool: passthrough memory for DVPP and shared memory. Currently, we branch to different routines depending on the memory type, both during the allocation and the free process. Since we have already created a local group for passthrough memory, with just one more step we can drop the redundant branches in the allocation and free processes and in all the fallback paths taken when an error occurs. Here is the content of this patch: 1. Add every process to its local group when initializing its group_master. 2. Avoid returning the local group in find_sp_group_id_by_pid(). 3. Delete the redundant branches in the allocation and free processes. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- When we destroy a vma, we first find the spa based on vma->vm_start, during which we must hold the sp_area_lock. Since we store the spa in the vma, we can get the spa directly. Don't worry about whether the spa exists or whether it is about to be freed, since we increased the refcount of the spa when it was mapped into a vma. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- The management of the address space is adjusted, and the statistical data processing of the shared pool needs to be adapted accordingly. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Signed-off-by: Zhang Jian <zhangjian210@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- The address space of the DVPP is managed by group. When releasing shared pool memory, we need to find the corresponding address space based on the ID. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- The DVPP address space is per process or per sharing group. During sp_free and unshare, we need to know which address space the current address belongs to. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- Separately manage the normal and dvpp address spaces of the sp_group, and set the normal and dvpp address spaces of the corresponding groups when adding a group, in sp_alloc, and in k2task. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- struct sp_mapping is used to manage the address space of a shared pool. During the initialization of the shared pool, normal address spaces are created to allocate the memory of the current shared pool. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- The single-group mode has no application scenario, so the related branches are deleted. The boot option "enable_sp_multi_group_mode" no longer takes effect. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yuan Can
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------------ create_spg_node() may fail and return a NULL pointer, and in the out_drop_spg_node path the NULL pointer would be dereferenced in free_spg_node(). Signed-off-by: Yuan Can <yuancan@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- Add the missing initialization of kc.sp_flag in sp_make_share_kva_to_spg(); otherwise a random value would be used in sp_remap_kva_to_vma(). Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhou Guanghui
ascend inclusion category: Bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA -------------------------------- When the driver uses shared pool memory to share memory with user space, user space is not allowed to operate on this area. This prevents users from damaging sensitive data. When sp_alloc and k2u request private memory, read-only memory can be requested. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Guo Mengqi
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ----------------------------------- In sp_mmap(), if the offset is computed as va - MMAP_BASE/DVPP_BASE, a normal sp_alloc pgoff may end up with the same value as a DVPP pgoff, causing DVPP and sp_alloc mappings to unexpectedly overlap in the same part of the file. To fix the problem, pass the VA value as the mmap offset, since in this scenario VA values within one task's address space are unique. Signed-off-by: Guo Mengqi <guomengqi3@huawei.com> Reviewed-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
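The arithmetic below is a minimal user-space sketch of the collision described above; the base addresses and offsets are hypothetical, chosen only to illustrate why per-range offsets can collide while VA-derived offsets cannot.

#include <stdio.h>

int main(void)
{
	/* Hypothetical base addresses, for illustration only. */
	unsigned long mmap_base = 0x100000000UL;   /* normal sp_alloc range */
	unsigned long dvpp_base = 0x200000000UL;   /* DVPP range */

	unsigned long normal_va = mmap_base + 0x10000;
	unsigned long dvpp_va   = dvpp_base + 0x10000;

	/* Old scheme: offset relative to each range's own base -> identical pgoff. */
	printf("old pgoff: normal=%#lx dvpp=%#lx\n",
	       (normal_va - mmap_base) >> 12, (dvpp_va - dvpp_base) >> 12);

	/* Fixed scheme: offset derived from the VA itself -> unique within the task. */
	printf("new pgoff: normal=%#lx dvpp=%#lx\n",
	       normal_va >> 12, dvpp_va >> 12);
	return 0;
}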
-
Submitted by Wang Wensheng
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- Use device_id to select the correct dvpp vspace range when the SP_DVPP flag is specified. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Guo Mengqi
ascend inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA ------------------------------------------------- Fix the following situation: when the last process in a group exits and a second process tries to add itself to this group, the second process may get an invalid spg. However, the group's use_count has been increased by 1, which causes the first process to fail to free the group when it exits. The second process then calls sp_group_drop --> free_sp_group, causing the rwsem to be requested twice. Signed-off-by: Guo Mengqi <guomengqi3@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wang Wensheng
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I5DS9S CVE: NA -------------------------------------------------- This is not used for THP, but the user page table is laid out just like THP. The user allocates hugepages via a special driver, and the resulting vma is not marked with VM_HUGETLB. This commit allows sharing those vmas to the kernel. Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 18 July 2022, 4 commits
-
-
Submitted by Peter Xu
stable inclusion from stable-v5.10.111 commit f089471d1b754cdd386f081f6c62eec414e8e188 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f089471d1b754cdd386f081f6c62eec414e8e188 -------------------------------- commit 5abfd71d upstream. Patch series "mm: Rework zap ptes on swap entries", v5. Patch 1 should fix a long standing bug for zap_pte_range() on zap_details usage. The risk is we could have some swap entries skipped while we should have zapped them. Migration entries are not the major concern because file backed memory always zap in the pattern that "first time without page lock, then re-zap with page lock" hence the 2nd zap will always make sure all migration entries are already recovered. However there can be issues with real swap entries got skipped errornoously. There's a reproducer provided in commit message of patch 1 for that. Patch 2-4 are cleanups that are based on patch 1. After the whole patchset applied, we should have a very clean view of zap_pte_range(). Only patch 1 needs to be backported to stable if necessary. This patch (of 4): The "details" pointer shouldn't be the token to decide whether we should skip swap entries. For example, when the callers specified details->zap_mapping==NULL, it means the user wants to zap all the pages (including COWed pages), then we need to look into swap entries because there can be private COWed pages that was swapped out. Skipping some swap entries when details is non-NULL may lead to wrongly leaving some of the swap entries while we should have zapped them. A reproducer of the problem: Reviewed-by: NWei Li <liwei391@huawei.com> ===8<=== #define _GNU_SOURCE /* See feature_test_macros(7) */ #include <stdio.h> #include <assert.h> #include <unistd.h> #include <sys/mman.h> #include <sys/types.h> int page_size; int shmem_fd; char *buffer; void main(void) { int ret; char val; page_size = getpagesize(); shmem_fd = memfd_create("test", 0); assert(shmem_fd >= 0); ret = ftruncate(shmem_fd, page_size * 2); assert(ret == 0); buffer = mmap(NULL, page_size * 2, PROT_READ | PROT_WRITE, MAP_PRIVATE, shmem_fd, 0); assert(buffer != MAP_FAILED); /* Write private page, swap it out */ buffer[page_size] = 1; madvise(buffer, page_size * 2, MADV_PAGEOUT); /* This should drop private buffer[page_size] already */ ret = ftruncate(shmem_fd, page_size); assert(ret == 0); /* Recover the size */ ret = ftruncate(shmem_fd, page_size * 2); assert(ret == 0); /* Re-read the data, it should be all zero */ val = buffer[page_size]; if (val == 0) printf("Good\n"); else printf("BUG\n"); } ===8<=== We don't need to touch up the pmd path, because pmd never had a issue with swap entries. For example, shmem pmd migration will always be split into pte level, and same to swapping on anonymous. Add another helper should_zap_cows() so that we can also check whether we should zap private mappings when there's no page pointer specified. This patch drops that trick, so we handle swap ptes coherently. Meanwhile we should do the same check upon migration entry, hwpoison entry and genuine swap entries too. To be explicit, we should still remember to keep the private entries if even_cows==false, and always zap them when even_cows==true. The issue seems to exist starting from the initial commit of git. 
[peterx@redhat.com: comment tweaks] Link: https://lkml.kernel.org/r/20220217060746.71256-2-peterx@redhat.com Link: https://lkml.kernel.org/r/20220217060746.71256-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220216094810.60572-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20220216094810.60572-2-peterx@redhat.com Fixes: 1da177e4 ("Linux-2.6.12-rc2") Signed-off-by: NPeter Xu <peterx@redhat.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Cc: David Hildenbrand <david@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Alistair Popple <apopple@nvidia.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: "Kirill A . Shutemov" <kirill@shutemov.name> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Miaohe Lin
stable inclusion from stable-v5.10.111 commit f7e183b0a7136b6dc9c7b9b2a85a608a8feba894 category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f7e183b0a7136b6dc9c7b9b2a85a608a8feba894 -------------------------------- commit 4ad09955 upstream. If mpol_new is allocated but not used in restart loop, mpol_new will be freed via mpol_put before returning to the caller. But refcnt is not initialized yet, so mpol_put could not do the right things and might leak the unused mpol_new. This would happen if mempolicy was updated on the shared shmem file while the sp->lock has been dropped during the memory allocation. This issue could be triggered easily with the below code snippet if there are many processes doing the below work at the same time: shmid = shmget((key_t)5566, 1024 * PAGE_SIZE, 0666|IPC_CREAT); shm = shmat(shmid, 0, 0); loop many times { mbind(shm, 1024 * PAGE_SIZE, MPOL_LOCAL, mask, maxnode, 0); mbind(shm + 128 * PAGE_SIZE, 128 * PAGE_SIZE, MPOL_DEFAULT, mask, maxnode, 0); } Link: https://lkml.kernel.org/r/20220329111416.27954-1-linmiaohe@huawei.com Fixes: 42288fe3 ("mm: mempolicy: Convert shared_policy mutex to spinlock") Signed-off-by: NMiaohe Lin <linmiaohe@huawei.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: <stable@vger.kernel.org> [3.8] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com> Reviewed-by: NWei Li <liwei391@huawei.com>
-
Submitted by Paolo Bonzini
stable inclusion from stable-v5.10.111 commit 7d659cb1763ff17d1c6ee082fa6feb4267c7a30b category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7d659cb1763ff17d1c6ee082fa6feb4267c7a30b -------------------------------- commit 01e67e04 upstream. If an mremap() syscall with old_size=0 ends up in move_page_tables(), it will call invalidate_range_start()/invalidate_range_end() unnecessarily, i.e. with an empty range. This causes a WARN in KVM's mmu_notifier. In the past, empty ranges have been diagnosed to be off-by-one bugs, hence the WARNing. Given the low (so far) number of unique reports, the benefits of detecting more buggy callers seem to outweigh the cost of having to fix cases such as this one, where userspace is doing something silly. In this particular case, an early return from move_page_tables() is enough to fix the issue. Link: https://lkml.kernel.org/r/20220329173155.172439-1-pbonzini@redhat.com Reported-by: syzbot+6bde52d89cfdf9f61425@syzkaller.appspotmail.com Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com> Reviewed-by: NWei Li <liwei391@huawei.com>
-
stable inclusion from stable-v5.10.111 commit 3c3fbfa6dddb022e91be18902c3785a82aa4867d category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=3c3fbfa6dddb022e91be18902c3785a82aa4867d -------------------------------- commit 6c8e2a25 upstream. Problem: Reviewed-by: NWei Li <liwei391@huawei.com> ======= Userspace might read the zero-page instead of actual data from a direct IO read on a block device if the buffers have been called madvise(MADV_FREE) on earlier (this is discussed below) due to a race between page reclaim on MADV_FREE and blkdev direct IO read. - Race condition: ============== During page reclaim, the MADV_FREE page check in try_to_unmap_one() checks if the page is not dirty, then discards its rmap PTE(s) (vs. remap back if the page is dirty). However, after try_to_unmap_one() returns to shrink_page_list(), it might keep the page _anyway_ if page_ref_freeze() fails (it expects exactly _one_ page reference, from the isolation for page reclaim). Well, blkdev_direct_IO() gets references for all pages, and on READ operations it only sets them dirty _later_. So, if MADV_FREE'd pages (i.e., not dirty) are used as buffers for direct IO read from block devices, and page reclaim happens during __blkdev_direct_IO[_simple]() exactly AFTER bio_iov_iter_get_pages() returns, but BEFORE the pages are set dirty, the situation happens. The direct IO read eventually completes. Now, when userspace reads the buffers, the PTE is no longer there and the page fault handler do_anonymous_page() services that with the zero-page, NOT the data! A synthetic reproducer is provided. - Page faults: =========== If page reclaim happens BEFORE bio_iov_iter_get_pages() the issue doesn't happen, because that faults-in all pages as writeable, so do_anonymous_page() sets up a new page/rmap/PTE, and that is used by direct IO. The userspace reads don't fault as the PTE is there (thus zero-page is not used/setup). But if page reclaim happens AFTER it / BEFORE setting pages dirty, the PTE is no longer there; the subsequent page faults can't help: The data-read from the block device probably won't generate faults due to DMA (no MMU) but even in the case it wouldn't use DMA, that happens on different virtual addresses (not user-mapped addresses) because `struct bio_vec` stores `struct page` to figure addresses out (which are different from user-mapped addresses) for the read. Thus userspace reads (to user-mapped addresses) still fault, then do_anonymous_page() gets another `struct page` that would address/ map to other memory than the `struct page` used by `struct bio_vec` for the read. (The original `struct page` is not available, since it wasn't freed, as page_ref_freeze() failed due to more page refs. And even if it were available, its data cannot be trusted anymore.) Solution: ======== One solution is to check for the expected page reference count in try_to_unmap_one(). There should be one reference from the isolation (that is also checked in shrink_page_list() with page_ref_freeze()) plus one or more references from page mapping(s) (put in discard: label). Further references mean that rmap/PTE cannot be unmapped/nuked. (Note: there might be more than one reference from mapping due to fork()/clone() without CLONE_VM, which use the same `struct page` for references, until the copy-on-write page gets copied.) 
So, additional page references (e.g., from direct IO read) now prevent the rmap/PTE from being unmapped/dropped; similarly to the page is not freed per shrink_page_list()/page_ref_freeze()). - Races and Barriers: ================== The new check in try_to_unmap_one() should be safe in races with bio_iov_iter_get_pages() in get_user_pages() fast and slow paths, as it's done under the PTE lock. The fast path doesn't take the lock, but it checks if the PTE has changed and if so, it drops the reference and leaves the page for the slow path (which does take that lock). The fast path requires synchronization w/ full memory barrier: it writes the page reference count first then it reads the PTE later, while try_to_unmap() writes PTE first then it reads page refcount. And a second barrier is needed, as the page dirty flag should not be read before the page reference count (as in __remove_mapping()). (This can be a load memory barrier only; no writes are involved.) Call stack/comments: - try_to_unmap_one() - page_vma_mapped_walk() - map_pte() # see pte_offset_map_lock(): pte_offset_map() spin_lock() - ptep_get_and_clear() # write PTE - smp_mb() # (new barrier) GUP fast path - page_ref_count() # (new check) read refcount - page_vma_mapped_walk_done() # see pte_unmap_unlock(): pte_unmap() spin_unlock() - bio_iov_iter_get_pages() - __bio_iov_iter_get_pages() - iov_iter_get_pages() - get_user_pages_fast() - internal_get_user_pages_fast() # fast path - lockless_pages_from_mm() - gup_{pgd,p4d,pud,pmd,pte}_range() ptep = pte_offset_map() # not _lock() pte = ptep_get_lockless(ptep) page = pte_page(pte) try_grab_compound_head(page) # inc refcount # (RMW/barrier # on success) if (pte_val(pte) != pte_val(*ptep)) # read PTE put_compound_head(page) # dec refcount # go slow path # slow path - __gup_longterm_unlocked() - get_user_pages_unlocked() - __get_user_pages_locked() - __get_user_pages() - follow_{page,p4d,pud,pmd}_mask() - follow_page_pte() ptep = pte_offset_map_lock() pte = *ptep page = vm_normal_page(pte) try_grab_page(page) # inc refcount pte_unmap_unlock() - Huge Pages: ========== Regarding transparent hugepages, that logic shouldn't change, as MADV_FREE (aka lazyfree) pages are PageAnon() && !PageSwapBacked() (madvise_free_pte_range() -> mark_page_lazyfree() -> lru_lazyfree_fn()) thus should reach shrink_page_list() -> split_huge_page_to_list() before try_to_unmap[_one](), so it deals with normal pages only. (And in case unlikely/TTU_SPLIT_HUGE_PMD/split_huge_pmd_address() happens, which should not or be rare, the page refcount should be greater than mapcount: the head page is referenced by tail pages. That also prevents checking the head `page` then incorrectly call page_remove_rmap(subpage) for a tail page, that isn't even in the shrink_page_list()'s page_list (an effect of split huge pmd/pmvw), as it might happen today in this unlikely scenario.) MADV_FREE'd buffers: =================== So, back to the "if MADV_FREE pages are used as buffers" note. The case is arguable, and subject to multiple interpretations. The madvise(2) manual page on the MADV_FREE advice value says: 1) 'After a successful MADV_FREE ... data will be lost when the kernel frees the pages.' 2) 'the free operation will be canceled if the caller writes into the page' / 'subsequent writes ... will succeed and then [the] kernel cannot free those dirtied pages' 3) 'If there is no subsequent write, the kernel can free the pages at any time.' Thoughts, questions, considerations... 
respectively: 1) Since the kernel didn't actually free the page (page_ref_freeze() failed), should the data not have been lost? (on userspace read.) 2) Should writes performed by the direct IO read be able to cancel the free operation? - Should the direct IO read be considered as 'the caller' too, as it's been requested by 'the caller'? - Should the bio technique to dirty pages on return to userspace (bio_check_pages_dirty() is called/used by __blkdev_direct_IO()) be considered in another/special way here? 3) Should an upcoming write from a previously requested direct IO read be considered as a subsequent write, so the kernel should not free the pages? (as it's known at the time of page reclaim.) And lastly: Technically, the last point would seem a reasonable consideration and balance, as the madvise(2) manual page apparently (and fairly) seem to assume that 'writes' are memory access from the userspace process (not explicitly considering writes from the kernel or its corner cases; again, fairly).. plus the kernel fix implementation for the corner case of the largely 'non-atomic write' encompassed by a direct IO read operation, is relatively simple; and it helps. Reproducer: ========== @ test.c (simplified, but works) #define _GNU_SOURCE #include <fcntl.h> #include <stdio.h> #include <unistd.h> #include <sys/mman.h> int main() { int fd, i; char *buf; fd = open(DEV, O_RDONLY | O_DIRECT); buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) buf[i] = 1; // init to non-zero madvise(buf, BUF_SIZE, MADV_FREE); read(fd, buf, BUF_SIZE); for (i = 0; i < BUF_SIZE; i += PAGE_SIZE) printf("%p: 0x%x\n", &buf[i], buf[i]); return 0; } @ block/fops.c (formerly fs/block_dev.c) +#include <linux/swap.h> ... ... __blkdev_direct_IO[_simple](...) { ... + if (!strcmp(current->comm, "good")) + shrink_all_memory(ULONG_MAX); + ret = bio_iov_iter_get_pages(...); + + if (!strcmp(current->comm, "bad")) + shrink_all_memory(ULONG_MAX); ... } @ shell # NUM_PAGES=4 # PAGE_SIZE=$(getconf PAGE_SIZE) # yes | dd of=test.img bs=${PAGE_SIZE} count=${NUM_PAGES} # DEV=$(losetup -f --show test.img) # gcc -DDEV=\"$DEV\" \ -DBUF_SIZE=$((PAGE_SIZE * NUM_PAGES)) \ -DPAGE_SIZE=${PAGE_SIZE} \ test.c -o test # od -tx1 $DEV 0000000 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a 79 0a * 0040000 # mv test good # ./good 0x7f7c10418000: 0x79 0x7f7c10419000: 0x79 0x7f7c1041a000: 0x79 0x7f7c1041b000: 0x79 # mv good bad # ./bad 0x7fa1b8050000: 0x0 0x7fa1b8051000: 0x0 0x7fa1b8052000: 0x0 0x7fa1b8053000: 0x0 Note: the issue is consistent on v5.17-rc3, but it's intermittent with the support of MADV_FREE on v4.5 (60%-70% error; needs swap). [wrap do_direct_IO() in do_blockdev_direct_IO() @ fs/direct-io.c]. 
- v5.17-rc3: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x0 # free | grep Swap Swap: 0 0 0 - v4.5: # for i in {1..1000}; do ./good; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 # mv good bad # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 2702 0x0 1298 0x79 # swapoff -av swapoff /swap # for i in {1..1000}; do ./bad; done \ | cut -d: -f2 | sort | uniq -c 4000 0x79 Ceph/TCMalloc: ============= For documentation purposes, the use case driving the analysis/fix is Ceph on Ubuntu 18.04, as the TCMalloc library there still uses MADV_FREE to release unused memory to the system from the mmap'ed page heap (might be committed back/used again; it's not munmap'ed.) - PageHeap::DecommitSpan() -> TCMalloc_SystemRelease() -> madvise() - PageHeap::CommitSpan() -> TCMalloc_SystemCommit() -> do nothing. Note: TCMalloc switched back to MADV_DONTNEED a few commits after the release in Ubuntu 18.04 (google-perftools/gperftools 2.5), so the issue just 'disappeared' on Ceph on later Ubuntu releases but is still present in the kernel, and can be hit by other use cases. The observed issue seems to be the old Ceph bug #22464 [1], where checksum mismatches are observed (and instrumentation with buffer dumps shows zero-pages read from mmap'ed/MADV_FREE'd page ranges). The issue in Ceph was reasonably deemed a kernel bug (comment #50) and mostly worked around with a retry mechanism, but other parts of Ceph could still hit that (rocksdb). Anyway, it's less likely to be hit again as TCMalloc switched out of MADV_FREE by default. (Some kernel versions/reports from the Ceph bug, and relation with the MADV_FREE introduction/changes; TCMalloc versions not checked.) - 4.4 good - 4.5 (madv_free: introduction) - 4.9 bad - 4.10 good? 
maybe a swapless system - 4.12 (madv_free: no longer free instantly on swapless systems) - 4.13 bad [1] https://tracker.ceph.com/issues/22464 Thanks: ====== Several people contributed to analysis/discussions/tests/reproducers in the first stages when drilling down on ceph/tcmalloc/linux kernel: - Dan Hill - Dan Streetman - Dongdong Tao - Gavin Guo - Gerald Yang - Heitor Alves de Siqueira - Ioanna Alifieraki - Jay Vosburgh - Matthew Ruffell - Ponnuvel Palaniyappan Reviews, suggestions, corrections, comments: - Minchan Kim - Yu Zhao - Huang, Ying - John Hubbard - Christoph Hellwig [mfo@canonical.com: v4] Link: https://lkml.kernel.org/r/20220209202659.183418-1-mfo@canonical.comLink: https://lkml.kernel.org/r/20220131230255.789059-1-mfo@canonical.com Fixes: 802a3a92 ("mm: reclaim MADV_FREE pages") Signed-off-by: NMauricio Faria de Oliveira <mfo@canonical.com> Reviewed-by: N"Huang, Ying" <ying.huang@intel.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Yu Zhao <yuzhao@google.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Dan Hill <daniel.hill@canonical.com> Cc: Dan Streetman <dan.streetman@canonical.com> Cc: Dongdong Tao <dongdong.tao@canonical.com> Cc: Gavin Guo <gavin.guo@canonical.com> Cc: Gerald Yang <gerald.yang@canonical.com> Cc: Heitor Alves de Siqueira <halves@canonical.com> Cc: Ioanna Alifieraki <ioanna-maria.alifieraki@canonical.com> Cc: Jay Vosburgh <jay.vosburgh@canonical.com> Cc: Matthew Ruffell <matthew.ruffell@canonical.com> Cc: Ponnuvel Palaniyappan <ponnuvel.palaniyappan@canonical.com> Cc: <stable@vger.kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> [mfo: backport: replace folio/test_flag with page/flag equivalents; real Fixes: 854e9ed0 ("mm: support madvise(MADV_FREE)") in v4.] Signed-off-by: NMauricio Faria de Oliveira <mfo@canonical.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
- 13 July 2022, 5 commits
-
-
Submitted by ZhaoLong Wang
hulk inclusion category: bugfix bugzilla: 187184, https://gitee.com/openeuler/kernel/issues/I5G9EK CVE: NA -------------------------------- An undefined-behaviour issue has not been completely fixed since commit a70846a97e7b ("tmpfs: fix undefined-behaviour in shmem_reconfigure()"). That commit added a check to shmem_reconfigure() in the remount path to avoid the UBSAN problem, but no check was added to the mount path, which causes inconsistent results between mount and remount. The operations to reproduce the problem in user mode are as follows: If nr_blocks is set to 0x8000000000000000, the mount succeeds. # mount tmpfs /dev/shm/ -t tmpfs -o nr_blocks=0x8000000000000000 However, when -o remount is used, the mount fails because of the check in shmem_reconfigure(): # mount tmpfs /dev/shm/ -t tmpfs -o remount,nr_blocks=0x8000000000000000 mount: /dev/shm: mount point not mounted or bad option. Therefore, add the check to the shmem_parse_one() function and remove the check from shmem_reconfigure() to avoid this problem. Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Luo Meng
mainline inclusion from mainline-v5.19-rc3 commit d14f5efa category: bugfix bugzilla: 186471 https://gitee.com/openeuler/kernel/issues/I5DRKR CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d14f5efadd846dbde561bd734318de6a9e6b26e6 ------------------------------------------------------------------------------- When shmem_reconfigure() calls __percpu_counter_compare(), the second parameter is unsigned long long. But in the definition of __percpu_counter_compare(), the second parameter is s64. So when __percpu_counter_compare() executes abs(count - rhs), UBSAN shows the following warning: ================================================================================ UBSAN: Undefined behaviour in lib/percpu_counter.c:209:6 signed integer overflow: 0 - -9223372036854775808 cannot be represented in type 'long long int' CPU: 1 PID: 9636 Comm: syz-executor.2 Tainted: G ---------r- - 4.18.0 #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack home/install/linux-rh-3-10/lib/dump_stack.c:77 [inline] dump_stack+0x125/0x1ae home/install/linux-rh-3-10/lib/dump_stack.c:117 ubsan_epilogue+0xe/0x81 home/install/linux-rh-3-10/lib/ubsan.c:159 handle_overflow+0x19d/0x1ec home/install/linux-rh-3-10/lib/ubsan.c:190 __percpu_counter_compare+0x124/0x140 home/install/linux-rh-3-10/lib/percpu_counter.c:209 percpu_counter_compare home/install/linux-rh-3-10/./include/linux/percpu_counter.h:50 [inline] shmem_remount_fs+0x1ce/0x6b0 home/install/linux-rh-3-10/mm/shmem.c:3530 do_remount_sb+0x11b/0x530 home/install/linux-rh-3-10/fs/super.c:888 do_remount home/install/linux-rh-3-10/fs/namespace.c:2344 [inline] do_mount+0xf8d/0x26b0 home/install/linux-rh-3-10/fs/namespace.c:2844 ksys_mount+0xad/0x120 home/install/linux-rh-3-10/fs/namespace.c:3075 __do_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3089 [inline] __se_sys_mount home/install/linux-rh-3-10/fs/namespace.c:3086 [inline] __x64_sys_mount+0xbf/0x160 home/install/linux-rh-3-10/fs/namespace.c:3086 do_syscall_64+0xca/0x5c0 home/install/linux-rh-3-10/arch/x86/entry/common.c:298 entry_SYSCALL_64_after_hwframe+0x6a/0xdf RIP: 0033:0x46b5e9 Code: 5d db fa ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b db fa ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f54d5f22c68 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5 RAX: ffffffffffffffda RBX: 000000000077bf60 RCX: 000000000046b5e9 RDX: 0000000000000000 RSI: 0000000020000000 RDI: 0000000000000000 RBP: 000000000077bf60 R08: 0000000020000140 R09: 0000000000000000 R10: 00000000026740a4 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffd1fb1592f R14: 00007f54d5f239c0 R15: 000000000077bf6c ================================================================================ [akpm@linux-foundation.org: tweak error message text] Link: https://lkml.kernel.org/r/20220513025225.2678727-1-luomeng12@huawei.comSigned-off-by: NLuo Meng <luomeng12@huawei.com> Cc: Hugh Dickins <hughd@google.com> Cc: Yu Kuai <yukuai3@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/shmem.c Signed-off-by: NZhaoLong Wang <wangzhaolong1@huawei.com> Reviewed-by: NZhang Yi <yi.zhang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
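As a small user-space illustration of the type mismatch described above (not kernel code; the variable names are made up for the example), casting the oversized nr_blocks value to a signed 64-bit type yields INT64_MIN, so the difference computed inside __percpu_counter_compare() cannot be represented in s64:

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
	unsigned long long nr_blocks = 0x8000000000000000ULL; /* value from the mount option */
	int64_t rhs = (int64_t)nr_blocks;                      /* wraps to INT64_MIN */
	int64_t count = 0;                                     /* e.g. blocks currently used */

	printf("rhs as s64: %" PRId64 "\n", rhs);
	/* abs(count - rhs) would be INT64_MAX + 1, which does not fit in s64,
	 * hence the signed-integer-overflow report from UBSAN. */
	(void)count;
	return 0;
}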
-
Submitted by Liu Shixin
maillist inclusion category: bugfix bugzilla: 186821, https://gitee.com/openeuler/kernel/issues/I5G69G Reference: https://lore.kernel.org/all/20220707020938.2122198-1-liushixin2@huawei.com/ -------------------------------- Release refcount after xas_set to fix UAF which may cause panic like this: page:ffffea000491fa40 refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1247e9 head:ffffea000491fa00 order:3 compound_mapcount:0 compound_pincount:0 memcg:ffff888104f91091 flags: 0x2fffff80010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff) ... page dumped because: VM_BUG_ON_PAGE(PageTail(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:632! invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN CPU: 1 PID: 7642 Comm: sh Not tainted 5.15.51-dirty #26 ... Call Trace: <TASK> __invalidate_mapping_pages+0xe7/0x540 drop_pagecache_sb+0x159/0x320 iterate_supers+0x120/0x240 drop_caches_sysctl_handler+0xaa/0xe0 proc_sys_call_handler+0x2b4/0x480 new_sync_write+0x3d6/0x5c0 vfs_write+0x446/0x7a0 ksys_write+0x105/0x210 do_syscall_64+0x35/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f52b5733130 ... This problem has been fixed on mainline by patch 6b24ca4a ("mm: Use multi-index entries in the page cache") since it deletes the related code. Fixes: 5c211ba2 ("mm: add and use find_lock_entries") Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Acked-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Conflicts: mm/filemap.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Amir Goldstein
mainline inclusion from mainline-5.13-rc1 commit 59cda49e category: feature bugzilla: 187162, https://gitee.com/openeuler/kernel/issues/I5FYKV CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=59cda49ecf6c9a32fae4942420701b6e087204f6 ------------------------------------------------- Since kernel v5.1, fanotify_init(2) supports the flag FAN_REPORT_FID for identifying objects using a file handle and fsid in events. fanotify_mark(2) fails with -ENODEV when trying to set a mark on filesystems that report a null f_fsid in statfs(2). Use the digest of the uuid as f_fsid for tmpfs to uniquely identify tmpfs objects as best as possible and allow setting an fanotify mark that reports events with file handles on tmpfs. Link: https://lore.kernel.org/r/20210322173944.449469-3-amir73il@gmail.com Acked-by: Hugh Dickins <hughd@google.com> Reviewed-by: Christian Brauner <christian.brauner@ubuntu.com> Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Baokun Li <libaokun1@huawei.com> Reviewed-by: chengzhihao <chengzhihao1@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
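A minimal sketch of the user-visible effect, assuming /dev/shm is a tmpfs mount and the process has the required privileges: with this change, a FAN_REPORT_FID notification group can place a mark there instead of getting -ENODEV.

#include <fcntl.h>
#include <stdio.h>
#include <sys/fanotify.h>

int main(void)
{
	/* FAN_REPORT_FID makes events identify objects by file handle + fsid. */
	int fd = fanotify_init(FAN_CLASS_NOTIF | FAN_REPORT_FID, 0);
	if (fd < 0) {
		perror("fanotify_init");
		return 1;
	}

	/* Before this change, this failed with ENODEV on tmpfs (null f_fsid). */
	if (fanotify_mark(fd, FAN_MARK_ADD, FAN_OPEN, AT_FDCWD, "/dev/shm") < 0) {
		perror("fanotify_mark");
		return 1;
	}

	printf("mark on tmpfs with FAN_REPORT_FID succeeded\n");
	return 0;
}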
-
Submitted by Yu Kuai
hulk inclusion category: bugfix bugzilla: 186896, https://gitee.com/src-openeuler/kernel/issues/I5FRAP CVE: NA -------------------------------- In filemap_read(), the first page is not marked accessed if the previous page is the same as the current page: if (iocb->ki_pos >> PAGE_SHIFT != ra->prev_pos >> PAGE_SHIFT) folio_mark_accessed(fbatch.folios[0]); However, 'prev_pos' is set to 'ki_pos + copied' during the last read, which means 'prev_pos' can be equal to 'ki_pos' in this read, so the previous page can be miscalculated. Fix the problem by setting 'prev_pos' to the start offset of the last read, so that 'prev_pos >> PAGE_SHIFT' is the previous page as expected. Fixes: 06c04442 ("mm/filemap.c: generic_file_buffered_read() now uses find_get_pages_contig") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
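A small user-space sketch of the bookkeeping described above (page size and offsets are illustrative only): recording the end of the last read as prev_pos makes the page comparison think nothing changed, so the first page is wrongly skipped; recording the start of the last read restores the intended behaviour.

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	long long last_start = 0, copied = 4096;   /* last read: one full page from offset 0 */
	long long ki_pos = last_start + copied;    /* this read starts right after it */

	long long prev_pos_buggy = last_start + copied;  /* end of last read (old behaviour) */
	long long prev_pos_fixed = last_start;           /* start of last read (fixed)       */

	/* filemap_read() only marks the first page accessed when the page differs. */
	printf("buggy: mark accessed? %d\n",
	       (ki_pos >> PAGE_SHIFT) != (prev_pos_buggy >> PAGE_SHIFT)); /* 0: wrongly skipped */
	printf("fixed: mark accessed? %d\n",
	       (ki_pos >> PAGE_SHIFT) != (prev_pos_fixed >> PAGE_SHIFT)); /* 1: marked */
	return 0;
}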
-
- 06 July 2022, 11 commits
-
-
Submitted by Hyeonggon Yoo
mainline inclusion from mainline-v5.18-rc7 commit 2839b099 category: bugfix bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=2839b0999c20c9f6bf353849c69370e121e2fa1a -------------------------------- When kfence fails to initialize kfence pool, it frees the pool. But it does not reset memcg_data and PG_slab flag. Below is a BUG because of this. Let's fix it by resetting memcg_data and PG_slab flag before free. [ 0.089149] BUG: Bad page state in process swapper/0 pfn:3d8e06 [ 0.089149] page:ffffea46cf638180 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3d8e06 [ 0.089150] memcg:ffffffff94a475d1 [ 0.089150] flags: 0x17ffffc0000200(slab|node=0|zone=2|lastcpupid=0x1fffff) [ 0.089151] raw: 0017ffffc0000200 ffffea46cf638188 ffffea46cf638188 0000000000000000 [ 0.089152] raw: 0000000000000000 0000000000000000 00000000ffffffff ffffffff94a475d1 [ 0.089152] page dumped because: page still charged to cgroup [ 0.089153] Modules linked in: [ 0.089153] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G B W 5.18.0-rc1+ #965 [ 0.089154] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 0.089154] Call Trace: [ 0.089155] <TASK> [ 0.089155] dump_stack_lvl+0x49/0x5f [ 0.089157] dump_stack+0x10/0x12 [ 0.089158] bad_page.cold+0x63/0x94 [ 0.089159] check_free_page_bad+0x66/0x70 [ 0.089160] __free_pages_ok+0x423/0x530 [ 0.089161] __free_pages_core+0x8e/0xa0 [ 0.089162] memblock_free_pages+0x10/0x12 [ 0.089164] memblock_free_late+0x8f/0xb9 [ 0.089165] kfence_init+0x68/0x92 [ 0.089166] start_kernel+0x789/0x992 [ 0.089167] x86_64_start_reservations+0x24/0x26 [ 0.089168] x86_64_start_kernel+0xa9/0xaf [ 0.089170] secondary_startup_64_no_verify+0xd5/0xdb [ 0.089171] </TASK> Link: https://lkml.kernel.org/r/YnPG3pQrqfcgOlVa@hyeyoo Fixes: 0ce20dd8 ("mm: add Kernel Electric-Fence infrastructure") Fixes: 8f0b3649 ("mm: kfence: fix objcgs vector allocation") Signed-off-by: NHyeonggon Yoo <42.hyeyoo@gmail.com> Reviewed-by: NMarco Elver <elver@google.com> Reviewed-by: NMuchun Song <songmuchun@bytedance.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/kfence/core.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Muchun Song
mainline inclusion from mainline-v5.18-rc1 commit 8f0b3649 category: bugfix bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=8f0b36497303487d5a32c75789c77859cc2ee895 -------------------------------- If a kfence object is allocated to be used for an objects vector, that slot of the pool ends up being occupied permanently, since the vector is never freed. The solutions could be (1) freeing the vector when the kfence object is freed or (2) allocating all vectors statically. Since the memory consumption of object vectors is low, it is better to choose (2) to fix the issue, and it also reduces the overhead of allocating vectors in the future. Link: https://lkml.kernel.org/r/20220328132843.16624-1-songmuchun@bytedance.com Fixes: d3fb45f3 ("mm, kfence: insert KFENCE hooks for SLAB") Signed-off-by: Muchun Song <songmuchun@bytedance.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/kfence/core.c Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Jackie Liu
mainline inclusion from mainline-v5.19-rc1 commit 83d7d04f category: bugfix bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=83d7d04f9d2ef354858b2a8444aee38e41ec1699 -------------------------------- Print a message on kfence state changes, so that the changes show up in dmesg and can be recorded by syslog. Also, set kfence_enabled to false only when needed. Link: https://lkml.kernel.org/r/20220518073105.3160335-1-liu.yun@linux.dev Signed-off-by: Jackie Liu <liuyun01@kylinos.cn> Co-developed-by: Marco Elver <elver@google.com> Signed-off-by: Marco Elver <elver@google.com> Reviewed-by: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Conflicts: mm/kfence/core.c Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by huangshaobo
mainline inclusion from mainline-v5.19-rc1 commit 3c81b3bb category: bugfix bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c81b3bb0a33e2b555edb8d7eb99a7ae4f17d8bb -------------------------------- Out-of-bounds accesses that aren't caught by a guard page will result in corruption of canary memory. In pathological cases, where an object has certain alignment requirements, an out-of-bounds access might never be caught by the guard page. Such corruptions, however, are only detected on kfree() normally. If the bug causes the kernel to panic before kfree(), KFENCE has no opportunity to report the issue. Such corruptions may also indicate failing memory or other faults. To provide some more information in such cases, add the option to check canary bytes on panic. This might help narrow the search for the panic cause; but, due to only having the allocation stack trace, such reports are difficult to use to diagnose an issue alone. In most cases, such reports are inactionable, and is therefore an opt-in feature (disabled by default). [akpm@linux-foundation.org: add __read_mostly, per Marco] Link: https://lkml.kernel.org/r/20220425022456.44300-1-huangshaobo6@huawei.comSigned-off-by: Nhuangshaobo <huangshaobo6@huawei.com> Suggested-by: Nchenzefeng <chenzefeng2@huawei.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Xiaoming Ni <nixiaoming@huawei.com> Cc: Wangbing <wangbing6@huawei.com> Cc: Jubin Zhong <zhongjubin@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Conflicts: mm/kfence/core.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Peng Liu
mainline inclusion from mainline-v5.18-rc1 commit 3cb1c962 category: bugfix bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3cb1c9620eeeb67c614c0732a35861b0b1efdc53 -------------------------------- When CONFIG_KFENCE_NUM_OBJECTS is set to a big number, kfence kunit-test-case test_gfpzero will eat up nearly all the CPU's resources and rcu_stall is reported as the following log which is cut from a physical server. rcu: INFO: rcu_sched self-detected stall on CPU rcu: 68-....: (14422 ticks this GP) idle=6ce/1/0x4000000000000002 softirq=592/592 fqs=7500 (t=15004 jiffies g=10677 q=20019) Task dump for CPU 68: task:kunit_try_catch state:R running task stack: 0 pid: 9728 ppid: 2 flags:0x0000020a Call trace: dump_backtrace+0x0/0x1e4 show_stack+0x20/0x2c sched_show_task+0x148/0x170 ... rcu_sched_clock_irq+0x70/0x180 update_process_times+0x68/0xb0 tick_sched_handle+0x38/0x74 ... gic_handle_irq+0x78/0x2c0 el1_irq+0xb8/0x140 kfree+0xd8/0x53c test_alloc+0x264/0x310 [kfence_test] test_gfpzero+0xf4/0x840 [kfence_test] kunit_try_run_case+0x48/0x20c kunit_generic_run_threadfn_adapter+0x28/0x34 kthread+0x108/0x13c ret_from_fork+0x10/0x18 To avoid rcu_stall and unacceptable latency, a schedule point is added to test_gfpzero. Link: https://lkml.kernel.org/r/20220309083753.1561921-4-liupeng256@huawei.comSigned-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NMarco Elver <elver@google.com> Tested-by: NBrendan Higgins <brendanhiggins@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Wang Kefeng <wangkefeng.wang@huawei.com> Cc: Daniel Latypov <dlatypov@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Peng Liu
mainline inclusion from mainline-v5.18-rc1 commit adf50545 category: bugfix bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=adf505457032c11b79b5a7c277c62ff5d61b17c2 -------------------------------- Patch series "kunit: fix a UAF bug and do some optimization", v2. This series is to fix UAF (use after free) when running kfence test case test_gfpzero, which is time costly. This UAF bug can be easily triggered by setting CONFIG_KFENCE_NUM_OBJECTS = 65535. Furthermore, some optimization for kunit tests has been done. This patch (of 3): Kunit will create a new thread to run an actual test case, and the main process will wait for the completion of the actual test thread until overtime. The variable "struct kunit test" has local property in function kunit_try_catch_run, and will be used in the test case thread. Task kunit_try_catch_run will free "struct kunit test" when kunit runs overtime, but the actual test case is still run and an UAF bug will be triggered. The above problem has been both observed in a physical machine and qemu platform when running kfence kunit tests. The problem can be triggered when setting CONFIG_KFENCE_NUM_OBJECTS = 65535. Under this setting, the test case test_gfpzero will cost hours and kunit will run to overtime. The follows show the panic log. BUG: unable to handle page fault for address: ffffffff82d882e9 Call Trace: kunit_log_append+0x58/0xd0 ... test_alloc.constprop.0.cold+0x6b/0x8a [kfence_test] test_gfpzero.cold+0x61/0x8ab [kfence_test] kunit_try_run_case+0x4c/0x70 kunit_generic_run_threadfn_adapter+0x11/0x20 kthread+0x166/0x190 ret_from_fork+0x22/0x30 Kernel panic - not syncing: Fatal exception Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 To solve this problem, the test case thread should be stopped when the kunit frame runs overtime. The stop signal will send in function kunit_try_catch_run, and test_gfpzero will handle it. Link: https://lkml.kernel.org/r/20220309083753.1561921-1-liupeng256@huawei.com Link: https://lkml.kernel.org/r/20220309083753.1561921-2-liupeng256@huawei.comSigned-off-by: NPeng Liu <liupeng256@huawei.com> Reviewed-by: NMarco Elver <elver@google.com> Reviewed-by: NBrendan Higgins <brendanhiggins@google.com> Tested-by: NBrendan Higgins <brendanhiggins@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Wang Kefeng <wangkefeng.wang@huawei.com> Cc: Daniel Latypov <dlatypov@google.com> Cc: David Gow <davidgow@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/kfence/kfence_test.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 -------------------------------- KFENCE requires the linear map to be mapped at page granularity, and this must be done very early. To save page-table memory, arm64 maps only the pages in the KFENCE pool itself at page granularity, so the kfence pool cannot be allocated by the buddy system. For flexibility, extend sample_interval to control whether enabling kfence after system startup (re-enabling) is supported. Once sample_interval is set to -1 on arm64, memory for the kfence pool is allocated from early memory whether or not KFENCE is enabled. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Liu Shixin
hulk inclusion category: feature bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 -------------------------------- Since re-enabling KFENCE is supported, make it compatible with dynamically configured objects. Signed-off-by: Liu Shixin <liushixin2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Tianchen Ding
mainline inclusion from mainline-v5.18-rc1 commit b33f778b category: feature bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b33f778bba5ef3f76fe6708c611346c1ea03acd4 -------------------------------- Allow enabling KFENCE after system startup by allocating its pool via the page allocator. This provides the flexibility to enable KFENCE even if it wasn't enabled at boot time. Link: https://lkml.kernel.org/r/20220307074516.6920-3-dtcccc@linux.alibaba.comSigned-off-by: NTianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: NMarco Elver <elver@google.com> Tested-by: NPeng Liu <liupeng256@huawei.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: mm/kfence/core.c Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Tianchen Ding
mainline inclusion from mainline-v5.18-rc1 commit 698361bc category: feature bugzilla: 187071, https://gitee.com/openeuler/kernel/issues/I5DLA7 CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=698361bca2d59fd29d46c757163854454df477f1 -------------------------------- Patch series "provide the flexibility to enable KFENCE", v3. If CONFIG_CONTIG_ALLOC is not supported, we fallback to try alloc_pages_exact(). Allocating pages in this way has limits about MAX_ORDER (default 11). So we will not support allocating kfence pool after system startup with a large KFENCE_NUM_OBJECTS. When handling failures in kfence_init_pool_late(), we pair free_pages_exact() to alloc_pages_exact() for compatibility consideration, though it actually does the same as free_contig_range(). This patch (of 2): If once KFENCE is disabled by: echo 0 > /sys/module/kfence/parameters/sample_interval KFENCE could never be re-enabled until next rebooting. Allow re-enabling it by writing a positive num to sample_interval. Link: https://lkml.kernel.org/r/20220307074516.6920-1-dtcccc@linux.alibaba.com Link: https://lkml.kernel.org/r/20220307074516.6920-2-dtcccc@linux.alibaba.comSigned-off-by: NTianchen Ding <dtcccc@linux.alibaba.com> Reviewed-by: NMarco Elver <elver@google.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
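For illustration, re-enabling after the disable shown above just means writing a positive number back to the same parameter file (equivalent to echo 100 > /sys/module/kfence/parameters/sample_interval). The small program below is one way to do that; it assumes root privileges and that KFENCE was built in.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	const char *path = "/sys/module/kfence/parameters/sample_interval";
	const char *val = "100";   /* any positive interval re-enables KFENCE */
	int fd = open(path, O_WRONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (write(fd, val, strlen(val)) < 0) {
		perror("write");
		close(fd);
		return 1;
	}
	close(fd);
	return 0;
}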
-
Submitted by Oscar Salvador
mainline inclusion from mainline-v5.11-rc1 commit 32409cba category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I5E2IG CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=32409cba3f66810626c1c15b728c31968d6bfa92 -------------------------------- The memory_failure and soft_offline_path paths now drain pcplists by calling get_hwpoison_page. memory_failure flags the page as HWPoison beforehand, so that page can no longer go into a pcplist, and soft_offline_page only flags a page as HWPoison if 1) we took the page off a buddy freelist, 2) the page was in-use and we migrated it, or 3) it was a clean pagecache page. Because of that, a page can no longer be poisoned while sitting in a pcplist. Link: https://lkml.kernel.org/r/20201013144447.6706-5-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Ma Wupeng <mawupeng1@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-