- 14 7月, 2021 40 次提交
-
-
由 Kent Overstreet 提交于
mainline inclusion from mainline-v5.11-rc1 commit 06c04442 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I40B5X CVE: NA ------------------------------------------------- Convert generic_file_buffered_read() to get pages to read from in batches, and then copy data to userspace from many pages at once - in particular, we now don't touch any cachelines that might be contended while we're in the loop to copy data to userspace. This is is a performance improvement on workloads that do buffered reads with large blocksizes, and a very large performance improvement if that file is also being accessed concurrently by different threads. On smaller reads (512 bytes), there's a very small performance improvement (1%, within the margin of error). akpm: kernel test robot found a 32% speedup on one test: https://lkml.kernel.org/r/20201030081456.GY31092@shao2-debian Link: https://lkml.kernel.org/r/20201025212949.602194-3-kent.overstreet@gmail.comSigned-off-by: NKent Overstreet <kent.overstreet@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: kernel test robot <rong.a.chen@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Kent Overstreet 提交于
mainline inclusion from mainline-v5.11-rc1 commit 723ef24b category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I40B5X CVE: NA ------------------------------------------------- Patch series "generic_file_buffered_read() improvements", v2. generic_file_buffered_read() has turned into a real monstrosity to work with. And it's a major performance improvement, for both small random and large sequential reads. On my test box, 4k buffered random reads go from ~150k to ~250k iops, and the improvements to big sequential reads are even bigger. This incorporates the fix for IOCB_WAITQ handling that Jens just posted as well, also factors out lock_page_for_iocb() to improve handling of the various iocb flags. This patch (of 2): This is prep work for changing generic_file_buffered_read() to use find_get_pages_contig() to batch up all the pagecache lookups. This patch should be functionally identical to the existing code and changes as little as of the flow control as possible. More refactoring could be done, this patch is intended to be relatively minimal. Link: https://lkml.kernel.org/r/20201025212949.602194-1-kent.overstreet@gmail.com Link: https://lkml.kernel.org/r/20201025212949.602194-2-kent.overstreet@gmail.comSigned-off-by: NKent Overstreet <kent.overstreet@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NNanyong Sun <sunnanyong@huawei.com> Reviewed-by: NTong Tiangen <tongtiangen@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit feac00aa category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- mremap HAVE_MOVE_PMD/PUD optimization time comparison for 1GB region: 1GB mremap - Source PTE-aligned, Destination PTE-aligned mremap time: 2292772ns 1GB mremap - Source PMD-aligned, Destination PMD-aligned mremap time: 1158928ns 1GB mremap - Source PUD-aligned, Destination PUD-aligned mremap time: 63886ns Link: https://lkml.kernel.org/r/20210616045735.374532-4-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit cec6515a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- flush_tlb_range is special in that we don't specify the page size used for the translation. Hence when flushing TLB we flush the translation cache for all possible page sizes. The kernel also uses the same interface when moving page tables around. Such a move requires us to flush the page walk cache. Instead of adding another interface to force page walk cache flush, update flush_tlb_range to flush page walk cache if the range flushed is more than the PMD range. A page table move will always involve an invalidate range more than PMD_SIZE. Running microbenchmark with mprotect and parallel memory access didn't show any observable performance impact. Link: https://lkml.kernel.org/r/20210616045735.374532-3-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit 3bbda69c category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- Patch series "Speedup mremap on ppc64", v8. This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires the platform to support updating higher-level page tables without updating page table entries. This also needs to invalidate the Page Walk Cache on architecture supporting the same. This patch (of 3): Architectures like ppc64 support faster mremap only with radix translation. Hence allow a runtime check w.r.t support for fast mremap. Link: https://lkml.kernel.org/r/20210616045735.374532-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20210616045735.374532-2-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Nicholas Piggin 提交于
mainline inclusion from mainline-v5.12-rc1 commit 26418b36 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- The logic to decide what kind of TLB flush is required (local, global, or IPI) is spread multiple times over the several kinds of TLB flushes. Move it all into a single function which may issue IPIs if necessary, and also returns a flush type that is to be used. Signed-off-by: NNicholas Piggin <npiggin@gmail.com> Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20201217134731.488135-3-npiggin@gmail.comSigned-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit 97113eb3 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- To avoid a race between rmap walk and mremap, mremap does take_rmap_locks(). The lock was taken to ensure that rmap walk don't miss a page table entry due to PTE moves via move_pagetables(). The kernel does further optimization of this lock such that if we are going to find the newly added vma after the old vma, the rmap lock is not taken. This is because rmap walk would find the vmas in the same order and if we don't find the page table attached to older vma we would find it with the new vma which we would iterate later. As explained in commit eb66ae03 ("mremap: properly flush TLB before releasing the page") mremap is special in that it doesn't take ownership of the page. The optimized version for PUD/PMD aligned mremap also doesn't hold the ptl lock. This can result in stale TLB entries as show below. This patch updates the rmap locking requirement in mremap to handle the race condition explained below with optimized mremap:: Optmized PMD move CPU 1 CPU 2 CPU 3 mremap(old_addr, new_addr) page_shrinker/try_to_unmap_one mmap_write_lock_killable() addr = old_addr lock(pte_ptl) lock(pmd_ptl) pmd = *old_pmd pmd_clear(old_pmd) flush_tlb_range(old_addr) *new_pmd = pmd *new_addr = 10; and fills TLB with new addr and old pfn unlock(pmd_ptl) ptep_clear_flush() old pfn is free. Stale TLB entry Optimized PUD move also suffers from a similar race. Both the above race condition can be fixed if we force mremap path to take rmap lock. Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com Fixes: 2c91bd4a ("mm: speed up mremap by 20x on large regions") Fixes: c49dd340 ("mm: speedup mremap on 1GB or larger regions") Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Acked-by: NHugh Dickins <hughd@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit 0881ace2 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- pmd/pud_populate is the right interface to be used to set the respective page table entries. Some architectures like ppc64 do assume that set_pmd/pud_at can only be used to set a hugepage PTE. Since we are not setting up a hugepage PTE here, use the pmd/pud_populate interface. Link: https://lkml.kernel.org/r/20210616045239.370802-6-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit d6655dff category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- With two level page table don't enable move_normal_pud. Link: https://lkml.kernel.org/r/20210616045239.370802-5-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit 7d846db7 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD entries. Add a helper to move huge PUD entries on mremap(). This will be used by a later patch to optimize mremap of PUD_SIZE aligned level 4 PTE mapped address This also make sure we support mremap on huge PUD entries even with CONFIG_HAVE_MOVE_PUD disabled. [aneesh.kumar@linux.ibm.com: fix build failure with clang-10] Link: https://lore.kernel.org/lkml/YMuOSnJsL9qkxweY@archlinux-ax161 Link: https://lkml.kernel.org/r/20210619134310.89098-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20210616045239.370802-4-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit a9cc9c34 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- With a large mmap map size, we can overlap with the text area and using MAP_FIXED results in unmapping that area. Switch to MAP_FIXED_NOREPLACE and handle the EEXIST error. Link: https://lkml.kernel.org/r/20210616045239.370802-3-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: NKalesh Singh <kaleshsingh@google.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit f27a5c93 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- Patch series "mrermap fixes", v2. This patch (of 6): Instead of hardcoding 4K page size fetch it using sysconf(). For the performance measurements test still assume 2M and 1G are hugepage sizes. Link: https://lkml.kernel.org/r/20210616045239.370802-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20210616045239.370802-2-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Reviewed-by: NKalesh Singh <kaleshsingh@google.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Hugh Dickins <hughd@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit dc4875f0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- No functional change in this patch. [aneesh.kumar@linux.ibm.com: m68k build error reported by kernel robot] Link: https://lkml.kernel.org/r/87tulxnb2v.fsf@linux.ibm.com Link: https://lkml.kernel.org/r/20210615110859.320299-2-aneesh.kumar@linux.ibm.com Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Aneesh Kumar K.V 提交于
mainline inclusion from mainline-v5.14-rc1 commit 9cf6fa24 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- No functional change in this patch. [aneesh.kumar@linux.ibm.com: fix] Link: https://lkml.kernel.org/r/87wnqtnb60.fsf@linux.ibm.com [sfr@canb.auug.org.au: another fix] Link: https://lkml.kernel.org/r/20210619134410.89559-1-aneesh.kumar@linux.ibm.com Link: https://lkml.kernel.org/r/20210615110859.320299-1-aneesh.kumar@linux.ibm.com Link: https://lore.kernel.org/linuxppc-dev/CAHk-=wi+J+iodze9FtjM3Zi4j4OeS+qqbKxME9QN4roxPEXH9Q@mail.gmail.com/Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Hugh Dickins <hughd@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Kalesh Singh 提交于
mainline inclusion from mainline-v5.11-rc2 commit e05986ee category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- When `next < old_addr`, `next - old_addr` arithmetic underflows causing `extent` to be incorrect. Make `extent` the smaller of `next - old_addr` or `old_end - old_addr`. Link: https://lkml.kernel.org/r/20201219170433.2418867-1-kaleshsingh@google.com Fixes: c49dd340 ("mm: speedup mremap on 1GB or larger regions") Signed-off-by: NKalesh Singh <kaleshsingh@google.com> Reported-by: NGuenter Roeck <linux@roeck-us.net> Tested-by: NGuenter Roeck <linux@roeck-us.net> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Helge Deller <deller@gmx.de> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Kalesh Singh 提交于
mainline inclusion from mainline-v5.11-rc1 commit f5308c89 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source and destination addresses are PUD-aligned. With HAVE_MOVE_PUD enabled it can be inferred that there is approximately a 19x improvement in performance on arm64. (See data below). ------- Test Results --------- The following results were obtained using a 5.4 kernel, by remapping a PUD-aligned, 1GB sized region to a PUD-aligned destination. The results from 10 iterations of the test are given below: Total mremap times for 1GB data on arm64. All times are in nanoseconds. Control HAVE_MOVE_PUD 1247761 74271 1219896 46771 1094792 59687 1227760 48385 1043698 76666 1101771 50365 1159896 52500 1143594 75261 1025833 61354 1078125 48697 1134312.6 59395.7 <-- Mean time in nanoseconds A 1GB mremap completion time drops from ~1.1 milliseconds to ~59 microseconds on arm64. (~19x speed up). Link: https://lkml.kernel.org/r/20201014005320.2233162-5-kaleshsingh@google.comSigned-off-by: NKalesh Singh <kaleshsingh@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Kalesh Singh 提交于
mainline inclusion from mainline-v5.11-rc1 commit be37c98d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- HAVE_MOVE_PUD enables remapping pages at the PUD level if both the source and destination addresses are PUD-aligned. With HAVE_MOVE_PUD enabled it can be inferred that there is approximately a 13x improvement in performance on x86. (See data below). ------- Test Results --------- The following results were obtained using a 5.4 kernel, by remapping a PUD-aligned, 1GB sized region to a PUD-aligned destination. The results from 10 iterations of the test are given below: Total mremap times for 1GB data on x86. All times are in nanoseconds. Control HAVE_MOVE_PUD 180394 15089 235728 14056 238931 25741 187330 13838 241742 14187 177925 14778 182758 14728 160872 14418 205813 15107 245722 13998 205721.5 15594 <-- Mean time in nanoseconds A 1GB mremap completion time drops from ~205 microseconds to ~15 microseconds on x86. (~13x speed up). Link: https://lkml.kernel.org/r/20201014005320.2233162-6-kaleshsingh@google.comSigned-off-by: NKalesh Singh <kaleshsingh@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: NIngo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: Jia He <justin.he@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Kalesh Singh 提交于
mainline inclusion from mainline-v5.11-rc1 commit c49dd340 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- Android needs to move large memory regions for garbage collection. The GC requires moving physical pages of multi-gigabyte heap using mremap. During this move, the application threads have to be paused for correctness. It is critical to keep this pause as short as possible to avoid jitters during user interaction. Optimize mremap for >= 1GB-sized regions by moving at the PUD/PGD level if the source and destination addresses are PUD-aligned. For CONFIG_PGTABLE_LEVELS == 3, moving at the PUD level in effect moves PGD entries, since the PUD entry is “folded back” onto the PGD entry. Add HAVE_MOVE_PUD so that architectures where moving at the PUD level isn't supported/tested can turn this off by not selecting the config. Link: https://lkml.kernel.org/r/20201014005320.2233162-4-kaleshsingh@google.comSigned-off-by: NKalesh Singh <kaleshsingh@google.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Nkernel test robot <lkp@intel.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Brian Geffon <bgeffon@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Gavin Shan <gshan@redhat.com> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jia He <justin.he@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kees Cook <keescook@chromium.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Mina Almasry <almasrymina@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: SeongJae Park <sjpark@amazon.de> Cc: Shuah Khan <shuah@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Kalesh Singh 提交于
mainline inclusion from mainline-v5.11-rc1 commit 7df66625 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFUI CVE: NA ------------------------------------------------- Patch series "Speed up mremap on large regions", v4. mremap time can be optimized by moving entries at the PMD/PUD level if the source and destination addresses are PMD/PUD-aligned and PMD/PUD-sized. Enable moving at the PMD and PUD levels on arm64 and x86. Other architectures where this type of move is supported and known to be safe can also opt-in to these optimizations by enabling HAVE_MOVE_PMD and HAVE_MOVE_PUD. Observed Performance Improvements for remapping a PUD-aligned 1GB-sized region on x86 and arm64: - HAVE_MOVE_PMD is already enabled on x86 : N/A - Enabling HAVE_MOVE_PUD on x86 : ~13x speed up - Enabling HAVE_MOVE_PMD on arm64 : ~ 8x speed up - Enabling HAVE_MOVE_PUD on arm64 : ~19x speed up Altogether, HAVE_MOVE_PMD and HAVE_MOVE_PUD give a total of ~150x speed up on arm64. This patch (of 4): Test mremap on regions of various sizes and alignments and validate data after remapping. Also provide total time for remapping the region which is useful for performance comparison of the mremap optimizations that move pages at the PMD/PUD levels if HAVE_MOVE_PMD and/or HAVE_MOVE_PUD are enabled. Link: https://lkml.kernel.org/r/20201014005320.2233162-1-kaleshsingh@google.com Link: https://lkml.kernel.org/r/20201014005320.2233162-2-kaleshsingh@google.comSigned-off-by: NKalesh Singh <kaleshsingh@google.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Cc: Shuah Khan <shuah@kernel.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Minchan Kim <minchan@google.com> Cc: Lokesh Gidra <lokeshgidra@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Kees Cook <keescook@chromium.org> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Sami Tolvanen <samitolvanen@google.com> Cc: Masahiro Yamada <masahiroy@kernel.org> Cc: Krzysztof Kozlowski <krzk@kernel.org> Cc: Frederic Weisbecker <frederic@kernel.org> Cc: Hassan Naveed <hnaveed@wavecomp.com> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: Jia He <justin.he@arm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Sandipan Das <sandipan@linux.ibm.com> Cc: Zi Yan <ziy@nvidia.com> Cc: Mina Almasry <almasrymina@google.com> Cc: Ralph Campbell <rcampbell@nvidia.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Brian Geffon <bgeffon@google.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: SeongJae Park <sjpark@amazon.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NLiu Shixin <liushixin2@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Mel Gorman 提交于
mainline inclusion from mainline-5.11-rc1 commit 23e6082a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I40C8N CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=23e6082a522e32232f7377540b4d42d8304253b8 --------------------------- At fork time currently, a local node can be allowed to fill completely and allow the periodic load balancer to fix the problem. This can be problematic in cases where a task creates lots of threads that idle until woken as part of a worker poll causing a memory bandwidth problem. However, a "real" workload suffers badly from this behaviour. The workload in question is mostly NUMA aware but spawns large numbers of threads that act as a worker pool that can be called from anywhere. These need to spread early to get reasonable behaviour. This patch limits how much a local node can fill before spilling over to another node and it will not be a universal win. Specifically, very short-lived workloads that fit within a NUMA node would prefer the memory bandwidth. As I cannot describe the "real" workload, the best proxy measure I found for illustration was a page fault microbenchmark. It's not representative of the workload but demonstrates the hazard of the current behaviour. pft timings 5.10.0-rc2 5.10.0-rc2 imbalancefloat-v2 forkspread-v2 Amean elapsed-1 46.37 ( 0.00%) 46.05 * 0.69%* Amean elapsed-4 12.43 ( 0.00%) 12.49 * -0.47%* Amean elapsed-7 7.61 ( 0.00%) 7.55 * 0.81%* Amean elapsed-12 4.79 ( 0.00%) 4.80 ( -0.17%) Amean elapsed-21 3.13 ( 0.00%) 2.89 * 7.74%* Amean elapsed-30 3.65 ( 0.00%) 2.27 * 37.62%* Amean elapsed-48 3.08 ( 0.00%) 2.13 * 30.69%* Amean elapsed-79 2.00 ( 0.00%) 1.90 * 4.95%* Amean elapsed-80 2.00 ( 0.00%) 1.90 * 4.70%* This is showing the time to fault regions belonging to threads. The target machine has 80 logical CPUs and two nodes. Note the ~30% gain when the machine is approximately the point where one node becomes fully utilised. The slower results are borderline noise. Kernel building shows similar benefits around the same balance point. Generally performance was either neutral or better in the tests conducted. The main consideration with this patch is the point where fork stops spreading a task so some workloads may benefit from different balance points but it would be a risky tuning parameter. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.netSigned-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Mel Gorman 提交于
mainline inclusion from mainline-5.11-rc1 commit 7d2b5dd0 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I40C8N CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7d2b5dd0bcc48095651f1b85f751eef610b3e034 --------------------------- Currently, an imbalance is only allowed when a destination node is almost completely idle. This solved one basic class of problems and was the cautious approach. This patch revisits the possibility that NUMA nodes can be imbalanced until 25% of the CPUs are occupied. The reasoning behind 25% is somewhat superficial -- it's half the cores when HT is enabled. At higher utilisations, balancing should continue as normal and keep things even until scheduler domains are fully busy or over utilised. Note that this is not expected to be a universal win. Any benchmark that prefers spreading as wide as possible with limited communication will favour the old behaviour as there is more memory bandwidth. Workloads that communicate heavily in pairs such as netperf or tbench benefit. For the tests I ran, the vast majority of workloads saw a benefit so it seems to be a worthwhile trade-off. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-4-mgorman@techsingularity.netSigned-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Mel Gorman 提交于
mainline inclusion from mainline-5.11-rc1 commit 5c339005 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I40C8N CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5c339005f854fa75aa46078ad640919425658b3e --------------------------- In find_idlest_group(), the load imbalance is only relevant when the group is either overloaded or fully busy but it is calculated unconditionally. This patch moves the imbalance calculation to the context it is required. Technically, it is a micro-optimisation but really the benefit is avoiding confusing one type of imbalance with another depending on the group_type in the next patch. No functional change. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-3-mgorman@techsingularity.netSigned-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Mel Gorman 提交于
mainline inclusion from mainline-v5.11-rc1 commit abeae76a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I40C8N CVE: NA Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=abeae76a47005aa3f07c9be12d8076365622e25c --------------------------- This is simply a preparation patch to make the following patches easier to read. No functional change. Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org> Link: https://lkml.kernel.org/r/20201120090630.3286-2-mgorman@techsingularity.netSigned-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Jiang Biao 提交于
mainline inclusion from mainline-5.12-rc1 commit fbcc8183 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZW3B CVE: NA ------------------------------------------------- Many 100us+ latencies have been deteceted in vmstat_shepherd() on CPX platform which has 208 logic cpus. And vmstat_shepherd is queued every second, which could make the case worse. Add schedule point in vmstat_shepherd() to erase the latency. Link: https://lkml.kernel.org/r/20210111035526.1511-1-benbjiang@tencent.comSigned-off-by: NJiang Biao <benbjiang@tencent.com> Reported-by: NBin Lai <robinlai@tencent.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Vlastimil Babka 提交于
mainline inclusion from mainline-5.12-rc1 commit d930ff03 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZYRQ CVE: NA ------------------------------------------------- In deactivate_slab() we currently move all but one objects on the cpu freelist to the page freelist one by one using the costly cmpxchg_double() operation. Then we unfreeze the page while moving the last object on page freelist, with a final cmpxchg_double(). This can be optimized to avoid the cmpxchg_double() per object. Just count the objects on cpu freelist (to adjust page->inuse properly) and also remember the last object in the chain. Then splice page->freelist to the last object and effectively add the whole cpu freelist to page->freelist while unfreezing the page, with a single cmpxchg_double(). Link: https://lkml.kernel.org/r/20210115183543.15097-1-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NJann Horn <jannh@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: chenwandun<chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Muchun Song 提交于
mainline inclusion from mainline-5.12-rc1 commit f3344adf category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVOM CVE: NA ------------------------------------------------- The vmstat threshold is 32 (MEMCG_CHARGE_BATCH), Actually the threshold can be as big as MEMCG_CHARGE_BATCH * PAGE_SIZE. It still fits into s32. So introduce struct batched_lruvec_stat to optimize memory usage. The size of struct lruvec_stat is 304 bytes on 64 bit systems. As it is a per-cpu structure. So with this patch, we can save 304 / 2 * ncpu bytes per-memcg per-node where ncpu is the number of the possible CPU. If there are c memory cgroup (include dying cgroup) and n NUMA node in the system. Finally, we can save (152 * ncpu * c * n) bytes. [akpm@linux-foundation.org: fix typo in comment] Link: https://lkml.kernel.org/r/20201210042121.39665-1-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Chris Down <chris@chrisdown.name> Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: chenwandun<chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Yafang Shao 提交于
mainline inclusion from mainline-5.13-rc1 commit c244297a category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZZ8S CVE: NA ------------------------------------------------- Currently the pGp only shows the names of page flags, rather than the full information including section, node, zone, last cpupid and kasan tag. While it is not easy to parse these information manually because there're so many flavors. Let's interpret them in pGp as well. To be compitable with the existed format of pGp, the new introduced ones also use '|' as the separator, then the user tools parsing pGp won't need to make change, suggested by Matthew. The new information is tracked onto the end of the existed one. On example of the output in mm/slub.c as follows, - Before the patch, [ 6343.396602] Slab 0x000000004382e02b objects=33 used=3 fp=0x000000009ae06ffc flags=0x17ffffc0010200(slab|head) - After the patch, [ 8448.272530] Slab 0x0000000090797883 objects=33 used=3 fp=0x00000000790f1c26 flags=0x17ffffc0010200(slab|head|node=0|zone=2|lastcpupid=0x1fffff) The documentation and test cases are also updated. The output of the test cases as follows, [68599.816764] test_printf: loaded. [68599.819068] test_printf: all 388 tests passed [68599.830367] test_printf: unloaded. [lkp@intel.com: reported issues in the prev version in test_printf.c] Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Joe Perches <joe@perches.com> Cc: Miaohe Lin <linmiaohe@huawei.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Petr Mladek <pmladek@suse.com> Cc: kernel test robot <lkp@intel.com> Reviewed-by: NSergey Senozhatsky <senozhatsky@chromium.org> Reviewed-by: NPetr Mladek <pmladek@suse.com> Signed-off-by: NPetr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210319101246.73513-4-laoar.shao@gmail.comSigned-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChenWandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Yafang Shao 提交于
mainline inclusion from mainline-5.13-rc1 commit 96b94abc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZZ8S CVE: NA ------------------------------------------------- It is strange to combine "pr_err" with "INFO", so let's remove the prefix completely. This patch is motivated by David's comment[1]. - before the patch [ 8846.517809] INFO: Slab 0x00000000f42a2c60 objects=33 used=3 fp=0x0000000060d32ca8 flags=0x17ffffc0010200(slab|head) - after the patch [ 6343.396602] Slab 0x000000004382e02b objects=33 used=3 fp=0x000000009ae06ffc flags=0x17ffffc0010200(slab|head) [1] https://lore.kernel.org/linux-mm/b9c0f2b6-e9b0-0c36-ebdd-2bc684c5a762@redhat.com/#tSuggested-by: NVlastimil Babka <vbabka@suse.cz> Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Acked-by: NVlastimil Babka <vbabka@suse.cz> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Matthew Wilcox <willy@infradead.org> Reviewed-by: NSergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: NPetr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210319101246.73513-3-laoar.shao@gmail.comSigned-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChenWandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Yafang Shao 提交于
mainline inclusion from mainline-5.13-rc1 commit 4a8ef190 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZZ8S CVE: NA ------------------------------------------------- As pGp has been already introduced in printk, we'd better use it to make the output human readable. Before this change, the output is, [ 6155.716018] INFO: Slab 0x000000004027dd4f objects=33 used=3 fp=0x000000008cd1579c flags=0x17ffffc0010200 While after this change, the output is, [ 8846.517809] INFO: Slab 0x00000000f42a2c60 objects=33 used=3 fp=0x0000000060d32ca8 flags=0x17ffffc0010200(slab|head) Signed-off-by: NYafang Shao <laoar.shao@gmail.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NVlastimil Babka <vbabka@suse.cz> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NChristoph Lameter <cl@linux.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: NSergey Senozhatsky <senozhatsky@chromium.org> Signed-off-by: NPetr Mladek <pmladek@suse.com> Link: https://lore.kernel.org/r/20210319101246.73513-2-laoar.shao@gmail.comSigned-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChenWandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Joao Martins 提交于
mainline inclusion from mainline-5.13-rc1 commit 1d4b0166 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I408MI CVE: NA ------------------------------------------------- Use the newly added unpin_user_page_range_dirty_lock() for more quickly unpinning a consecutive range of pages represented as compound pages. This will also calculate number of pages to unpin (for the tail pages which matching head page) and thus batch the refcount update. Running a test program which calls memory range reg/unreg on a region 1G in size and measures cost of both operations together (in a guest using rxe) with THP and hugetlbfs: Before: 590 rounds in 5.003 sec: 8480.335 usec / round 6898 rounds in 60.001 sec: 8698.367 usec / round After: 2688 rounds in 5.002 sec: 1860.786 usec / round 32517 rounds in 60.001 sec: 1845.225 usec / round Link: https://lkml.kernel.org/r/20210212130843.13865-5-joao.m.martins@oracle.comSigned-off-by: NJoao Martins <joao.m.martins@oracle.com> Acked-by: NJason Gunthorpe <jgg@nvidia.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Doug Ledford <dledford@redhat.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Joao Martins 提交于
mainline inclusion from mainline-5.13-rc1 commit 458a4f78 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I408MI CVE: NA ------------------------------------------------- Add an unpin_user_page_range_dirty_lock() API which takes a starting page and how many consecutive pages we want to unpin and optionally dirty. To that end, define another iterator for_each_compound_range() that operates in page ranges as opposed to page array. For users (like RDMA mr_dereg) where each sg represents a contiguous set of pages, we're able to more efficiently unpin pages without having to supply an array of pages much of what happens today with unpin_user_pages(). Link: https://lkml.kernel.org/r/20210212130843.13865-4-joao.m.martins@oracle.comSuggested-by: NJason Gunthorpe <jgg@nvidia.com> Signed-off-by: NJoao Martins <joao.m.martins@oracle.com> Reviewed-by: NJason Gunthorpe <jgg@nvidia.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Doug Ledford <dledford@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Joao Martins 提交于
mainline inclusion from mainline-5.13-rc1 commit 31b912de category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I408MI CVE: NA ------------------------------------------------- Rather than decrementing the head page refcount one by one, we walk the page array and checking which belong to the same compound_head. Later on we decrement the calculated amount of references in a single write to the head page. To that end switch to for_each_compound_head() does most of the work. set_page_dirty() needs no adjustment as it's a nop for non-dirty head pages and it doesn't operate on tail pages. This considerably improves unpinning of pages with THP and hugetlbfs: - THP gup_test -t -m 16384 -r 10 [-L|-a] -S -n 512 -w PIN_LONGTERM_BENCHMARK (put values): ~87.6k us -> ~23.2k us - 16G with 1G huge page size gup_test -f /mnt/huge/file -m 16384 -r 10 [-L|-a] -S -n 512 -w PIN_LONGTERM_BENCHMARK: (put values): ~87.6k us -> ~27.5k us Link: https://lkml.kernel.org/r/20210212130843.13865-3-joao.m.martins@oracle.comSigned-off-by: NJoao Martins <joao.m.martins@oracle.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NJason Gunthorpe <jgg@nvidia.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Doug Ledford <dledford@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Joao Martins 提交于
mainline inclusion from mainline-5.13-rc1 commit 8745d7f6 category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I408MI CVE: NA ------------------------------------------------- Patch series "mm/gup: page unpining improvements", v4. This series improves page unpinning, with an eye on improving MR deregistration for big swaths of memory (which is bound by the page unpining), particularly: 1) Decrement the head page by @ntails and thus reducing a lot the number of atomic operations per compound page. This is done by comparing individual tail pages heads, and counting number of consecutive tails on which they match heads and based on that update head page refcount. Should have a visible improvement in all page (un)pinners which use compound pages 2) Introducing a new API for unpinning page ranges (to avoid the trick in the previous item and be based on math), and use that in RDMA ib_mem_release (used for mr deregistration). Performance improvements: unpin_user_pages() for hugetlbfs and THP improves ~3x (through gup_test) and RDMA MR dereg improves ~4.5x with the new API. See patches 2 and 4 for those. This patch (of 4): Add a helper that iterates over head pages in a list of pages. It essentially counts the tails until the next page to process has a different head that the current. This is going to be used by unpin_user_pages() family of functions, to batch the head page refcount updates once for all passed consecutive tail pages. Link: https://lkml.kernel.org/r/20210212130843.13865-1-joao.m.martins@oracle.com Link: https://lkml.kernel.org/r/20210212130843.13865-2-joao.m.martins@oracle.comSigned-off-by: NJoao Martins <joao.m.martins@oracle.com> Suggested-by: NJason Gunthorpe <jgg@nvidia.com> Reviewed-by: NJohn Hubbard <jhubbard@nvidia.com> Reviewed-by: NJason Gunthorpe <jgg@nvidia.com> Cc: Doug Ledford <dledford@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Vlastimil Babka 提交于
mainline inclusion from mainline-5.12-rc1 commit 59450bbc category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZY8M CVE: NA ------------------------------------------------- SLAB has been using get/put_online_cpus() around creating, destroying and shrinking kmem caches since 95402b38 ("cpu-hotplug: replace per-subsystem mutexes with get_online_cpus()") in 2008, which is supposed to be replacing a private mutex (cache_chain_mutex, called slab_mutex today) with system-wide mechanism, but in case of SLAB it's in fact used in addition to the existing mutex, without explanation why. SLUB appears to have avoided the cpu hotplug lock initially, but gained it due to common code unification, such as 20cea968 ("mm, sl[aou]b: Move kmem_cache_create mutex handling to common code"). Regardless of the history, checking if the hotplug lock is actually needed today suggests that it's not, and therefore it's better to avoid this system-wide lock and the ordering this imposes wrt other locks (such as slab_mutex). Specifically, in SLAB we have for_each_online_cpu() in do_tune_cpucache() protected by slab_mutex, and cpu hotplug callbacks that also take the slab_mutex, which is also taken by the common slab function that currently also take the hotplug lock. Thus the slab_mutex protection should be sufficient. Also per-cpu array caches are allocated for each possible cpu, so not affected by their online/offline state. In SLUB we have for_each_online_cpu() in functions that show statistics and are already unprotected today, as racing with hotplug is not harmful. Otherwise SLUB relies on percpu allocator. The slub_cpu_dead() hotplug callback takes the slab_mutex. To sum up, this patch removes get/put_online_cpus() calls from slab as it should be safe without further adjustments. Link: https://lkml.kernel.org/r/20210113131634.3671-4-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Vlastimil Babka 提交于
mainline inclusion from mainline-5.12-rc1 commit 7e1fa93d category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZY8M CVE: NA ------------------------------------------------- Since commit 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}") we are taking memory hotplug lock for SLAB and SLUB when creating, destroying or shrinking a cache. It is quite a heavy lock and it's best to avoid it if possible, as we had several issues with lockdep complaining about ordering in the past, see e.g. e4f8e513 ("mm/slub: fix a deadlock in show_slab_objects()"). The problem scenario in 03afc0e2 (solved by the memory hotplug lock) can be summarized as follows: while there's slab_mutex synchronizing new kmem cache creation and SLUB's MEM_GOING_ONLINE callback slab_mem_going_online_callback(), we may miss creation of kmem_cache_node for the hotplugged node in the new kmem cache, because the hotplug callback doesn't yet see the new cache, and cache creation in init_kmem_cache_nodes() only inits kmem_cache_node for nodes in the N_NORMAL_MEMORY nodemask, which however may not yet include the new node, as that happens only later after the MEM_GOING_ONLINE callback. Instead of using get/put_online_mems(), the problem can be solved by SLUB maintaining its own nodemask of nodes for which it has allocated the per-node kmem_cache_node structures. This nodemask would generally mirror the N_NORMAL_MEMORY nodemask, but would be updated only in under SLUB's control in its memory hotplug callbacks under the slab_mutex. This patch adds such nodemask and its handling. Commit 03afc0e2 mentiones "issues like [the one above]", but there don't appear to be further issues. All the paths (shared for SLAB and SLUB) taking the memory hotplug locks are also taking the slab_mutex, except kmem_cache_shrink() where 03afc0e2 replaced slab_mutex with get/put_online_mems(). We however cannot simply restore slab_mutex in kmem_cache_shrink(), as SLUB can enters the function from a write to sysfs 'shrink' file, thus holding kernfs lock, and in kmem_cache_create() the kernfs lock is nested within slab_mutex. But on closer inspection we don't actually need to protect kmem_cache_shrink() from hotplug callbacks: While SLUB's __kmem_cache_shrink() does for_each_kmem_cache_node(), missing a new node added in parallel hotplug is not fatal, and parallel hotremove does not free kmem_cache_node's anymore after the previous patch, so use-after free cannot happen. The per-node shrinking itself is protected by n->list_lock. Same is true for SLAB, and SLOB is no-op. SLAB also doesn't need the memory hotplug locking, which it only gained by 03afc0e2 through the shared paths in slab_common.c. Its memory hotplug callbacks are also protected by slab_mutex against races with these paths. The problem of SLUB relying on N_NORMAL_MEMORY doesn't apply to SLAB, as its setup_kmem_cache_nodes relies on N_ONLINE, and the new node is already set there during the MEM_GOING_ONLINE callback, so no special care is needed for SLAB. As such, this patch removes all get/put_online_mems() usage by the slab subsystem. Link: https://lkml.kernel.org/r/20210113131634.3671-3-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: David Hildenbrand <david@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Qian Cai <cai@redhat.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Vlastimil Babka 提交于
mainline inclusion from mainline-5.12-rc1 commit 666716fd category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZY8M CVE: NA ------------------------------------------------- Patch series "mm, slab, slub: remove cpu and memory hotplug locks". Some related work caused me to look at how we use get/put_mems_online() and get/put_online_cpus() during kmem cache creation/descruction/shrinking, and realize that it should be actually safe to remove all of that with rather small effort (as e.g. Michal Hocko suspected in some of the past discussions already). This has the benefit to avoid rather heavy locks that have caused locking order issues already in the past. So this is the result, Patches 2 and 3 remove memory hotplug and cpu hotplug locking, respectively. Patch 1 is due to realization that in fact some races exist despite the locks (even if not removed), but the most sane solution is not to introduce more of them, but rather accept some wasted memory in scenarios that should be rare anyway (full memory hot remove), as we do the same in other contexts already. This patch (of 3): Commit e4f8e513 ("mm/slub: fix a deadlock in show_slab_objects()") has fixed a problematic locking order by removing the memory hotplug lock get/put_online_mems() from show_slab_objects(). During the discussion, it was argued [1] that this is OK, because existing slabs on the node would prevent a hotremove to proceed. That's true, but per-node kmem_cache_node structures are not necessarily allocated on the same node and may exist even without actual slab pages on the same node. Any path that uses get_node() directly or via for_each_kmem_cache_node() (such as show_slab_objects()) can race with freeing of kmem_cache_node even with the !NULL check, resulting in use-after-free. To that end, commit e4f8e513 argues in a comment that: * We don't really need mem_hotplug_lock (to hold off * slab_mem_going_offline_callback) here because slab's memory hot * unplug code doesn't destroy the kmem_cache->node[] data. While it's true that slab_mem_going_offline_callback() doesn't free the kmem_cache_node, the later callback slab_mem_offline_callback() actually does, so the race and use-after-free exists. Not just for show_slab_objects() after commit e4f8e513, but also many other places that are not under slab_mutex. And adding slab_mutex locking or other synchronization to SLUB paths such as get_any_partial() would be bad for performance and error-prone. The easiest solution is therefore to make the abovementioned comment true and stop freeing the kmem_cache_node structures, accepting some wasted memory in the full memory node removal scenario. Analogically we also don't free hotremoved pgdat as mentioned in [1], nor the similar per-node structures in SLAB. Importantly this approach will not block the hotremove, as generally such nodes should be movable in order to succeed hotremove in the first place, and thus the GFP_KERNEL allocated kmem_cache_node will come from elsewhere. [1] https://lore.kernel.org/linux-mm/20190924151147.GB23050@dhcp22.suse.cz/ Link: https://lkml.kernel.org/r/20210113131634.3671-1-vbabka@suse.cz Link: https://lkml.kernel.org/r/20210113131634.3671-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz> Cc: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: David Rientjes <rientjes@google.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Qian Cai <cai@redhat.com> Cc: David Hildenbrand <david@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NChengyang Fan <cy.fan@huawei.com> Reviewed-by: NChen Wandun <chenwandun@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Zhang Qiao 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- When freeing a taskgroup, we will free cfs rqs of the group, even if cfs rqs have been throttled, otherwise it will cause a Use-After-Free Bug. Therefore before freeing a taskgroup, we should unthrottle all cfs rqs belonging to the taskgroup. Signed-off-by: NZhang Qiao <zhangqiao22@huawei.com> Signed-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Zheng Zucheng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- If online tasks occupy 100% CPU resources, offline tasks can't be scheduled since offline tasks are throttled, as a result, offline task can't timely respond after receiving SIGKILL signal. Signed-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Zhang Qiao 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- In cpu hotplug case, when a cpu go to offline, we should unthrottle cfs_rq which be throttled on this cpu, so they will be migrated to online cpu. Signed-off-by: NZhang Qiao <zhangqiao22@huawei.com> Signed-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-
由 Zheng Zucheng 提交于
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZX4D CVE: NA -------------------------------- Enable CONFIG_QOS_SCHED to support qos scheduler. Signed-off-by: NZheng Zucheng <zhengzucheng@huawei.com> Reviewed-by: NChen Hui <judy.chenhui@huawei.com> Signed-off-by: NZheng Zengkai <zhengzengkai@huawei.com>
-