- 02 12月, 2019 40 次提交
-
-
由 Minchan Kim 提交于
If a block device supports rw_page operation, it doesn't submit bios so the annotation in submit_bio() for refault stall doesn't work. It happens with zram in android, especially swap read path which could consume CPU cycle for decompress. It is also a problem for zswap which uses frontswap. Annotate swap_readpage() to account the synchronous IO overhead to prevent underreport memory pressure. [akpm@linux-foundation.org: add comment, per Johannes] Link: http://lkml.kernel.org/r/20191010152134.38545-1-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Seth Jennings <sjenning@redhat.com> Cc: Dan Streetman <ddstreet@ieee.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Randy Dunlap 提交于
End a Kconfig help text sentence with a period (aka full stop). Link: http://lkml.kernel.org/r/c17f2c75-dc2a-42a4-2229-bb6b489addf2@infradead.orgSigned-off-by: NRandy Dunlap <rdunlap@infradead.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Krzysztof Kozlowski 提交于
Adjust indentation from spaces to tab (+optional two spaces) as in coding style with command like: $ sed -e 's/^ / /' -i */Kconfig Link: http://lkml.kernel.org/r/1574306437-28837-1-git-send-email-krzk@kernel.orgSigned-off-by: NKrzysztof Kozlowski <krzk@kernel.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Jiri Kosina <trivial@kernel.org> Cc: Masahiro Yamada <yamada.masahiro@socionext.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Souptick Joarder 提交于
__online_page_set_limits() is a dummy function - remove it and all callers. Link: http://lkml.kernel.org/r/8e1bc9d3b492f6bde16e95ebc1dee11d6aefabd7.1567889743.git.jrdr.linux@gmail.com Link: http://lkml.kernel.org/r/854db2cf8145d9635249c95584d9a91fd774a229.1567889743.git.jrdr.linux@gmail.com Link: http://lkml.kernel.org/r/9afe6c5a18158f3884a6b302ac2c772f3da49ccc.1567889743.git.jrdr.linux@gmail.comSigned-off-by: NSouptick Joarder <jrdr.linux@gmail.com> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Juergen Gross <jgross@suse.com> Cc: "Kirill A. Shutemov" <kirill@shutemov.name> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
There are several places emphasise the effect of __SetPageUptodate(), while the comment seems to have a typo in two places. Link: http://lkml.kernel.org/r/20190926023705.7226-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chen Jun 提交于
In 64bit system. sb->s_maxbytes of shmem filesystem is MAX_LFS_FILESIZE, which equal LLONG_MAX. If offset > LLONG_MAX - PAGE_SIZE, offset + len < LLONG_MAX in shmem_fallocate, which will pass the checking in vfs_fallocate. /* Check for wrap through zero too */ if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0)) return -EFBIG; loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate causes a overflow. Syzkaller reports a overflow problem in mm/shmem: UBSAN: Undefined behaviour in mm/shmem.c:2014:10 signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int' CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1 Hardware name: linux, dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238 __dump_stack lib/dump_stack.c:15 [inline] ubsan_epilogue+0x18/0x70 lib/ubsan.c:164 handle_overflow+0x158/0x1b0 lib/ubsan.c:195 shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104 vfs_fallocate+0x238/0x428 fs/open.c:312 SYSC_fallocate fs/open.c:335 [inline] SyS_fallocate+0x54/0xc8 fs/open.c:239 The highest bit of unmap_start will be appended with sign bit 1 (overflow) when calculate shmem_falloc.start: shmem_falloc.start = unmap_start >> PAGE_SHIFT. Fix it by casting the type of unmap_start to u64, when right shifted. This bug is found in LTS Linux 4.1. It also seems to exist in mainline. Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.comSigned-off-by: NChen Jun <chenjun102@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Cc: Qian Cai <cai@lca.pw> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
The shmem_writepage() uses GFP_ATOMIC to allocate swap cache. GFP_ATOMIC used to mean __GFP_HIGH, but now it means __GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM. However, shmem_writepage() should write out to swap only in response to memory pressure, so __GFP_KSWAPD_RECLAIM looks useless since the caller may be kswapd itself or in direct reclaim already. In addition, XArray node allocations from PF_MEMALLOC contexts could completely exhaust the page allocator, __GFP_NOMEMALLOC stops emergency reserves from being allocated. Here just copy the gfp flags used by add_to_swap(). Hugh: "a cleanup to make the two calls look the same when they don't need to be different (whereas the call from __read_swap_cache_async() rightly uses a lower priority gfp)". Link: http://lkml.kernel.org/r/1572991351-86061-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Colin Ian King 提交于
Don't populate the array 'values' on the stack but instead make it static const. Makes the object code smaller by 111 bytes. Before: text data bss dec hex filename 108612 11169 512 120293 1d5e5 mm/shmem.o After: text data bss dec hex filename 108437 11233 512 120182 1d576 mm/shmem.o (gcc version 9.2.1, amd64) Link: http://lkml.kernel.org/r/20190906143012.28698-1-colin.king@canonical.comSigned-off-by: NColin Ian King <colin.king@canonical.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
When doing UFFDIO_COPY, it is necessary to find the correct destination vma and make sure fault range is in it. Since there are two places need to do the same task, just wrap those common check into an inlined function. Link: http://lkml.kernel.org/r/20190927070032.2129-3-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
These warning here is to make sure address(dst_addr) and length(len - copied) are huge page size aligned. While this is ensured by: dst_start and len is huge page size aligned dst_addr equals to dst_start and increase huge page size each time copied increase huge page size each time This means these warnings will never be triggered. Link: http://lkml.kernel.org/r/20190927070032.2129-2-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
In __mcopy_atomic_hugetlb() we use two variables to deal with huge page size: vma_hpagesize and huge_page_size. Since they are the same, it is not necessary to use two different mechanism. This patch makes it consistent by all using vma_hpagesize. Link: http://lkml.kernel.org/r/20190927070032.2129-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
Improve readability, no functional change. Link: http://lkml.kernel.org/r/20191118032857.22683-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yunfeng Ye 提交于
page_size() is supported after the commit a50b854e ("mm: introduce page_size()"). Use page_size() in madvise_inject_error() for readability. [akpm@linux-foundation.org: use ulong for `size', per David] Link: http://lkml.kernel.org/r/29dce60c-38d6-0220-f292-e298f0c78c4d@huawei.comSigned-off-by: NYunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michal Hocko <mhocko@suse.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Hu Shiyuan <hushiyuan@huawei.com> Cc: Feilong Lin <linfeilong@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
Case 1/6, 2/7 and 3/8 have the same pattern and we handle them in the same logic. Rearrange the comment to make it a little easy for audience to understand. Link: http://lkml.kernel.org/r/20191030012445.16944-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Darrick J. Wong <darrick.wong@oracle.com> Cc: Steve Capper <steve.capper@arm.com> Cc: Michel Lespinasse <walken@google.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Yangtao Li <tiny.windzz@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file operation rather than DEFINE_SIMPLE_ATTRIBUTE. Link: http://lkml.kernel.org/r/1572403660-44718-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
In auto NUMA balancing page table scanning, if the pte_protnone() is true, the PTE needs not to be changed because it's in target state already. So other checking on corresponding struct page is unnecessary too. So, if we check pte_protnone() firstly for each PTE, we can avoid unnecessary struct page accessing, so that reduce the cache footprint of NUMA balancing page table scanning. In the performance test of pmbench memory accessing benchmark with 80:20 read/write ratio and normal access address distribution on a 2 socket Intel server with Optance DC Persistent Memory, perf profiling shows that the autonuma page table scanning time reduces from 1.23% to 0.97% (that is, reduced 21%) with the patch. Link: http://lkml.kernel.org/r/20191101075727.26683-3-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
When zone_watermark_ok() is called in migrate_balanced_pgdat() to check migration target node, the parameter classzone_idx (for requested zone) is specified as 0 (ZONE_DMA). But when allocating memory for autonuma in alloc_misplaced_dst_page(), the requested zone from GFP flags is ZONE_MOVABLE. That is, the requested zone is different. The size of lowmem_reserve for the different requested zone is different. And this may cause some issues. For example, in the zoneinfo of a test machine as below, Node 0, zone DMA32 pages free 61592 min 29 low 454 high 879 spanned 1044480 present 442306 managed 425921 protection: (0, 0, 62457, 62457, 62457) The free page number of ZONE_DMA32 is greater than "high watermark + lowmem_reserve[ZONE_DMA]", but less than "high watermark + lowmem_reserve[ZONE_MOVABLE]". And because __alloc_pages_node() in alloc_misplaced_dst_page() requests ZONE_MOVABLE, the zone_watermark_ok() on ZONE_DMA32 in migrate_balanced_pgdat() may always return true. So, autonuma may not stop even when memory pressure in node 0 is heavy. To fix the issue, ZONE_MOVABLE is used as parameter to call zone_watermark_ok() in migrate_balanced_pgdat(). This makes it same as requested zone in alloc_misplaced_dst_page(). So that migrate_balanced_pgdat() returns false when memory pressure is heavy. Link: http://lkml.kernel.org/r/20191101075727.26683-2-ying.huang@intel.comSigned-off-by: N"Huang, Ying" <ying.huang@intel.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
It is more clear to use DEFINE_DEBUGFS_ATTRIBUTE to define debugfs file operation rather than DEFINE_SIMPLE_ATTRIBUTE. Link: http://lkml.kernel.org/r/1572348687-9951-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Yue Hu <huyue2@yulong.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yunfeng Ye 提交于
kzalloc() is used for cma bitmap allocation in cma_activate_area(), switch to bitmap_zalloc() for clarity. Link: http://lkml.kernel.org/r/895d4627-f115-c77a-d454-c0a196116426@huawei.comSigned-off-by: NYunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Yue Hu <huyue2@yulong.com> Cc: Peng Fan <peng.fan@nxp.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Ryohei Suzuki <ryh.szk.cmnty@gmail.com> Cc: Andrey Konovalov <andreyknvl@google.com> Cc: Doug Berger <opendmb@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Song Liu 提交于
For non-shmem file THPs, khugepaged only collapses read only .text mapping (VM_DENYWRITE). These pages should not be dirty except the case where the file hasn't been flushed since first write. Call filemap_flush() in collapse_file() to accelerate the write back in such cases. Link: http://lkml.kernel.org/r/20191106060930.2571389-3-songliubraving@fb.comSigned-off-by: NSong Liu <songliubraving@fb.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Hugh Dickins <hughd@google.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Adding fully unmapped pages into deferred split queue is not productive: these pages are about to be freed or they are pinned and cannot be split anyway. Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yang Shi 提交于
When doing migration if the freed page is met, we just return without migrating it since it is pointless to migrate a freed page. But, the current code allocates target page unconditionally before handling freed page, if the page is freed, the newly allocated will be just freed. It doesn't make too much sense and is just a waste of time although migrating freed page is rare. So, handle freed page at the before that to avoid unnecessary page allocation and free. Link: http://lkml.kernel.org/r/1573755869-106954-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 zhong jiang 提交于
split_huge_pages_fops is used for debugfs file. hence, it is more clear to use DEFINE_DEBUGFS_ATTRIBUTE. Link: http://lkml.kernel.org/r/1572347674-8111-1-git-send-email-zhongjiang@huawei.comSigned-off-by: Nzhong jiang <zhongjiang@huawei.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhigang Lu 提交于
When mmapping an existing hugetlbfs file with MAP_POPULATE, we find it is very time consuming. For example, mmapping a 128GB file takes about 50 milliseconds. Sampling with perfevent shows it spends 99% time in the same_page loop in follow_hugetlb_page(). samples: 205 of event 'cycles', Event count (approx.): 136686374 - 99.04% test_mmap_huget [kernel.kallsyms] [k] follow_hugetlb_page follow_hugetlb_page __get_user_pages __mlock_vma_pages_range __mm_populate vm_mmap_pgoff sys_mmap_pgoff sys_mmap system_call_fastpath __mmap64 follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for each hugepage, we run into the same_page loop for pages_per_huge_page() times, but doing nothing. With this change, it takes less then 1 millisecond to mmap a 128GB file in hugetlbfs. Link: http://lkml.kernel.org/r/1567581712-5992-1-git-send-email-totty.lu@gmail.comSigned-off-by: NZhigang Lu <tonnylu@tencent.com> Reviewed-by: NHaozhong Zhang <hzhongzhang@tencent.com> Reviewed-by: NZongming Zhang <knightzhang@tencent.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Acked-by: NMatthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wei Yang 提交于
The first parameter hstate in function hugetlb_fault_mutex_hash() is not used anymore. This patch removes it. [akpm@linux-foundation.org: various build fixes] [cai@lca.pw: fix a GCC compilation warning] Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.comSigned-off-by: NWei Yang <richardw.yang@linux.intel.com> Signed-off-by: NQian Cai <cai@lca.pw> Suggested-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mina Almasry 提交于
Remove duplicated code between region_chg and region_add, and refactor it into a common function, add_reservation_in_range. This is mostly done because there is a follow up change in another series that disables region coalescing in region_add, and I want to make that change in one place only. It should improve maintainability anyway on its own. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190919200428.188797-3-almasrymina@google.comSigned-off-by: NMina Almasry <almasrymina@google.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mina Almasry 提交于
Current behavior is that region_chg provides both a cache entry in resv->region_cache, AND a placeholder entry in resv->regions. region_add first tries to use the placeholder, and if it finds that the placeholder has been deleted by a racing region_del call, it uses the cache entry. This behavior is completely unnecessary and is removed in this patch for a couple of reasons: 1. region_add needs to either find a cached file_region entry in resv->region_cache, or find an entry in resv->regions to expand. It does not need both. 2. region_chg adding a placeholder entry in resv->regions opens up a possible race with region_del, where region_chg adds a placeholder region in resv->regions, and this region is deleted by a racing call to region_del during region_chg execution or before region_add is called. Removing the race makes the code easier to reason about and maintain. In addition, a follow up patch in another series that disables region coalescing, which would be further complicated if the race with region_del exists. Link: http://lkml.kernel.org/r/20190919200428.188797-2-almasrymina@google.comSigned-off-by: NMina Almasry <almasrymina@google.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Waiman Long 提交于
A customer with large SMP systems (up to 16 sockets) with application that uses large amount of static hugepages (~500-1500GB) are experiencing random multisecond delays. These delays were caused by the long time it took to scan the VMA interval tree with mmap_sem held. The sharing of huge PMD does not require changes to the i_mmap at all. Therefore, we can just take the read lock and let other threads searching for the right VMA share it in parallel. Once the right VMA is found, either the PMD lock (2M huge page for x86-64) or the mm->page_table_lock will be acquired to perform the actual PMD sharing. Lock contention, if present, will happen in the spinlock. That is much better than contention in the rwsem where the time needed to scan the the interval tree is indeterminate. With this patch applied, the customer is seeing significant performance improvement over the unpatched kernel. Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com> Suggested-by: NMike Kravetz <mike.kravetz@oracle.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mike Kravetz 提交于
A new clang diagnostic (-Wsizeof-array-div) warns about the calculation to determine the number of u32's in an array of unsigned longs. Suppress warning by adding parentheses. While looking at the above issue, noticed that the 'address' parameter to hugetlb_fault_mutex_hash is no longer used. So, remove it from the definition and all callers. No functional change. Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.comSigned-off-by: NMike Kravetz <mike.kravetz@oracle.com> Reported-by: NNathan Chancellor <natechancellor@gmail.com> Reviewed-by: NNathan Chancellor <natechancellor@gmail.com> Reviewed-by: NDavidlohr Bueso <dbueso@suse.de> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Ilie Halip <ilie.halip@gmail.com> Cc: David Bolvansky <david.bolvansky@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Yunfeng Ye 提交于
sparse_buffer_init() use memblock_alloc_try_nid_raw() to allocate memory for page management structure, if memory allocation fails from specified node, it will fall back to allocate from other nodes. Normally, the page management structure will not exceed 2% of the total memory, but a large continuous block of allocation is needed. In most cases, memory allocation from the specified node will succeed, but a node memory become highly fragmented will fail. we expect to allocate memory base section rather than by allocating a large block of memory from other NUMA nodes Add memblock_alloc_exact_nid_raw() for this situation, which allocate boot memory block on the exact node. If a large contiguous block memory allocate fail in sparse_buffer_init(), it will fall back to allocate small block memory base section. Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.comSigned-off-by: NYunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cao jin 提交于
Change "max_addr" to "end" for less confusion in memblock_alloc_range_nid comments. Link: http://lkml.kernel.org/r/20191113051822.3296-1-ruansy.fnst@cn.fujitsu.comSigned-off-by: NCao jin <caoj.fnst@cn.fujitsu.com> Signed-off-by: NShiyang Ruan <ruansy.fnst@cn.fujitsu.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cao jin 提交于
fix typos for: elaboarte -> elaborate architecure -> architecture compltes -> completes And, convert the markup :c:func:`foo` to foo() as kernel documentation toolchain can recognize foo() as a function. Link: http://lkml.kernel.org/r/20190912123127.8694-1-caoj.fnst@cn.fujitsu.comSigned-off-by: NCao jin <caoj.fnst@cn.fujitsu.com> Suggested-by: NMike Rapoport <rppt@linux.ibm.com> Reviewed-by: NMike Rapoport <rppt@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Li Xinhai 提交于
mbind() is required to report EFAULT if range, specified by addr and len, contains unmapped holes. In current implementation, below rules are applied for this checking: 1: Unmapped holes at any part of the specified range should be reported as EFAULT if mbind() for none MPOL_DEFAULT cases; 2: Unmapped holes at any part of the specified range should be ignored (do not reprot EFAULT) if mbind() for MPOL_DEFAULT case; 3: The whole range in an unmapped hole should be reported as EFAULT; Note that rule 2 does not fullfill the mbind() API definition, but since that behavior has existed for long days (the internal flag MPOL_MF_DISCONTIG_OK is for this purpose), this patch does not plan to change it. In current code, application observed inconsistent behavior on rule 1 and rule 2 respectively. That inconsistency is fixed as below details. Cases of rule 1: - Hole at head side of range. Current code reprot EFAULT, no change by this patch. [ vma ][ hole ][ vma ] [ range ] - Hole at middle of range. Current code report EFAULT, no change by this patch. [ vma ][ hole ][ vma ] [ range ] - Hole at tail side of range. Current code do not report EFAULT, this patch fixes it. [ vma ][ hole ][ vma ] [ range ] Cases of rule 2: - Hole at head side of range. Current code reports EFAULT, this patch fixes it. [ vma ][ hole ][ vma ] [ range ] - Hole at middle of range. Current code does not report EFAULT, no change by this patch. [ vma ][ hole ][ vma] [ range ] - Hole at tail side of range. Current code does not report EFAULT, no change by this patch. [ vma ][ hole ][ vma] [ range ] This patch has no changes to rule 3. The unmapped hole checking can also be handled by using .pte_hole(), instead of .test_walk(). But .pte_hole() is called for holes inside and outside vma, which causes more cost, so this patch keeps the original design with .test_walk(). Link: http://lkml.kernel.org/r/1573218104-11021-3-git-send-email-lixinhai.lxh@gmail.com Fixes: 6f4576e3 ("mempolicy: apply page table walker on queue_pages_range()") Signed-off-by: NLi Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: linux-man <linux-man@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Li Xinhai 提交于
Patch series "mm: Fix checking unmapped holes for mbind", v4. This patchset fix checking unmapped holes for mbind(). First patch makes sure the vma been correctly tracked in .test_walk(), so each time when .test_walk() is called, the neighborhood of two vma is correct. Current problem is that the !vma_migratable() check could cause return immediately without update tracking to vma. Second patch fix the inconsistent report of EFAULT when mbind() is called for MPOL_DEFAULT and non MPOL_DEFAULT cases, so application do not need to have workaround code to handle this special behavior. Currently there are two problems, one is that the .test_walk() can not know there is hole at tail side of range, because .test_walk() only call for vma not for hole. The other one is that mbind_range() checks for hole at head side of range but do not consider the MPOL_MF_DISCONTIG_OK flag as done in .test_walk(). This patch (of 2): Checking unmapped hole and updating the previous vma must be handled first, otherwise the unmapped hole could be calculated from a wrong previous vma. Several commits were relevant to this error: - commit 6f4576e3 ("mempolicy: apply page table walker on queue_pages_range()") This commit was correct, the VM_PFNMAP check was after updating previous vma - commit 48684a65 ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)") This commit added VM_PFNMAP check before updating previous vma. Then, there were two VM_PFNMAP check did same thing twice. - commit acda0c33 ("mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_page s_range()") This commit tried to fix the duplicated VM_PFNMAP check, but it wrongly removed the one which was after updating vma. Link: http://lkml.kernel.org/r/1573218104-11021-2-git-send-email-lixinhai.lxh@gmail.com Fixes: acda0c33 (mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range()) Signed-off-by: NLi Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: linux-man <linux-man@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vitaly Wool 提交于
For each page scheduled for compaction (e. g. by z3fold_free()), try to apply inter-page compaction before running the traditional/ existing intra-page compaction. That means, if the page has only one buddy, we treat that buddy as a new object that we aim to place into an existing z3fold page. If such a page is found, that object is transferred and the old page is freed completely. The transferred object is named "foreign" and treated slightly differently thereafter. Namely, we increase "foreign handle" counter for the new page. Pages with non-zero "foreign handle" count become unmovable. This patch implements "foreign handle" detection when a handle is freed to decrement the foreign handle counter accordingly, so a page may as well become movable again as the time goes by. As a result, we almost always have exactly 3 objects per page and significantly better average compression ratio. [cai@lca.pw: fix -Wunused-but-set-variable warnings] Link: http://lkml.kernel.org/r/1570542062-29144-1-git-send-email-cai@lca.pw [vitalywool@gmail.com: avoid subtle race when freeing slots] Link: http://lkml.kernel.org/r/20191127152118.6314b99074b0626d4c5a8835@gmail.com [vitalywool@gmail.com: compact objects more accurately] Link: http://lkml.kernel.org/r/20191127152216.6ad33745a21ba71c53606acb@gmail.com [vitalywool@gmail.com: protect handle reads] Link: http://lkml.kernel.org/r/20191127152345.8059852f60947686674d726d@gmail.com Link: http://lkml.kernel.org/r/20191006041457.24113-1-vitalywool@gmail.comSigned-off-by: NVitaly Wool <vitaly.vul@sony.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xianting Tian 提交于
Fix the typo "resheduled" -> "rescheduled" in comment Link: http://lkml.kernel.org/r/1573486327-9591-1-git-send-email-xianting_tian@126.comSigned-off-by: NXianting Tian <xianting_tian@126.com> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
We split the LRU lists into inactive and an active parts to maximize workingset protection while allowing just enough inactive cache space to faciltate readahead and writeback for one-off file accesses (e.g. a linear scan through a file, or logging); or just enough inactive anon to maintain recent reference information when reclaim needs to swap. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, inactive:active size decisions are done on a per-cgroup level. As a result, we'll reclaim a cgroup's workingset when it doesn't have cold pages, even when one of its siblings has plenty of it that should be reclaimed first. For example: workload A has 50M worth of hot cache but doesn't do any one-off file accesses; meanwhile, parallel workload B scans files and rarely accesses the same page twice. If these workloads were to run in an uncgrouped system, A would be protected from the high rate of cache faults from B. But if they were put in parallel cgroups for memory accounting purposes, B's fast cache fault rate would push out the hot cache pages of A. This is unexpected and undesirable - the "scan resistance" of the page cache is broken. This patch moves inactive:active size balancing decisions to the root of reclaim - the same level where the LRU order is established. It does this by looking at the recursive size of the inactive and the active file sets of the cgroup subtree at the beginning of the reclaim cycle, and then making a decision - scan or skip active pages - that applies throughout the entire run and to every cgroup involved. With that in place, in the test above, the VM will recognize that there are plenty of inactive pages in the combined cache set of workloads A and B and prefer the one-off cache in B over the hot pages in A. The scan resistance of the cache is restored. [cai@lca.pw: fix some -Wenum-conversion warnings] Link: http://lkml.kernel.org/r/1573848697-29262-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20191107205334.158354-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NSuren Baghdasaryan <surenb@google.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NSuren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Patch series "mm: fix page aging across multiple cgroups". When applications are put into unconfigured cgroups for memory accounting purposes, the cgrouping itself should not change the behavior of the page reclaim code. We expect the VM to reclaim the coldest pages in the system. But right now the VM can reclaim hot pages in one cgroup while there is eligible cold cache in others. This is because one part of the reclaim algorithm isn't truly cgroup hierarchy aware: the inactive/active list balancing. That is the part that is supposed to protect hot cache data from one-off streaming IO. The recursive cgroup reclaim scheme will scan and rotate the physical LRU lists of each eligible cgroup at the same rate in a round-robin fashion, thereby establishing a relative order among the pages of all those cgroups. However, the inactive/active balancing decisions are made locally within each cgroup, so when a cgroup is running low on cold pages, its hot pages will get reclaimed - even when sibling cgroups have plenty of cold cache eligible in the same reclaim run. For example: [root@ham ~]# head -n1 /proc/meminfo MemTotal: 1016336 kB [root@ham ~]# ./reclaimtest2.sh Establishing 50M active files in cgroup A... Hot pages cached: 12800/12800 workingset-a Linearly scanning through 18G of file data in cgroup B: real 0m4.269s user 0m0.051s sys 0m4.182s Hot pages cached: 134/12800 workingset-a The streaming IO in B, which doesn't benefit from caching at all, pushes out most of the workingset in A. Solution This series fixes the problem by elevating inactive/active balancing decisions to the toplevel of the reclaim run. This is either a cgroup that hit its limit, or straight-up global reclaim if there is physical memory pressure. From there, it takes a recursive view of the cgroup subtree to decide whether page deactivation is necessary. In the test above, the VM will then recognize that cgroup B has plenty of eligible cold cache, and that the hot pages in A can be spared: [root@ham ~]# ./reclaimtest2.sh Establishing 50M active files in cgroup A... Hot pages cached: 12800/12800 workingset-a Linearly scanning through 18G of file data in cgroup B: real 0m4.244s user 0m0.064s sys 0m4.177s Hot pages cached: 12800/12800 workingset-a Implementation Whether active pages can be deactivated or not is influenced by two factors: the inactive list dropping below a minimum size relative to the active list, and the occurence of refaults. This patch series first moves refault detection to the reclaim root, then enforces the minimum inactive size based on a recursive view of the cgroup tree's LRUs. History Note that this actually never worked correctly in Linux cgroups. In the past it worked for global reclaim and leaf limit reclaim only (we used to have two physical LRU linkages per page), but it never worked for intermediate limit reclaim over multiple leaf cgroups. We're noticing this now because 1) we're putting everything into cgroups for accounting, not just the things we want to control and 2) we're moving away from leaf limits that invoke reclaim on individual cgroups, toward large tree reclaim, triggered by high-level limits, or physical memory pressure that is influenced by local protections such as memory.low and memory.min instead. This patch (of 3): When file pages are lower than the watermark on a node, we try to force scan anonymous pages to counter-act the balancing algorithms preference for new file pages when they are likely thrashing. This is a node-level decision, but it's currently made each time we look at an lruvec. This is unnecessarily expensive and also a layering violation that makes the code harder to understand. Clean this up by making the check once per node and setting a flag in the scan_control. Link: http://lkml.kernel.org/r/20191107205334.158354-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NShakeel Butt <shakeelb@google.com> Reviewed-by: NSuren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The current writeback congestion tracking has separate flags for kswapd reclaim (node level) and cgroup limit reclaim (memcg-node level). This is unnecessarily complicated: the lruvec is an existing abstraction layer for that node-memcg intersection. Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the reclaim root level, which is either the NUMA node for global reclaim, or the cgroup-node intersection for cgroup reclaim. Link: http://lkml.kernel.org/r/20191022144803.302233-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRoman Gushchin <guro@fb.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-