- 13 11月, 2013 40 次提交
-
-
由 Mel Gorman 提交于
Commit 0255d491 ("mm: Account for a THP NUMA hinting update as one PTE update") was added to account for the number of PTE updates when marking pages prot_numa. task_numa_work was using the old return value to track how much address space had been updated. Altering the return value causes the scanner to do more work than it is configured or documented to in a single unit of work. This patch reverts that commit and accounts for the number of THP updates separately in vmstat. It is up to the administrator to interpret the pair of values correctly. This is a straight-forward operation and likely to only be of interest when actively debugging NUMA balancing problems. The impact of this patch is that the NUMA PTE scanner will scan slower when THP is enabled and workloads may converge slower as a result. On the flip size system CPU usage should be lower than recent tests reported. This is an illustrative example of a short single JVM specjbb test specjbb 3.12.0 3.12.0 vanilla acctupdates TPut 1 26143.00 ( 0.00%) 25747.00 ( -1.51%) TPut 7 185257.00 ( 0.00%) 183202.00 ( -1.11%) TPut 13 329760.00 ( 0.00%) 346577.00 ( 5.10%) TPut 19 442502.00 ( 0.00%) 460146.00 ( 3.99%) TPut 25 540634.00 ( 0.00%) 549053.00 ( 1.56%) TPut 31 512098.00 ( 0.00%) 519611.00 ( 1.47%) TPut 37 461276.00 ( 0.00%) 474973.00 ( 2.97%) TPut 43 403089.00 ( 0.00%) 414172.00 ( 2.75%) 3.12.0 3.12.0 vanillaacctupdates User 5169.64 5184.14 System 100.45 80.02 Elapsed 252.75 251.85 Performance is similar but note the reduction in system CPU time. While this showed a performance gain, it will not be universal but at least it'll be behaving as documented. The vmstats are obviously different but here is an obvious interpretation of them from mmtests. 3.12.0 3.12.0 vanillaacctupdates NUMA page range updates 1408326 11043064 NUMA huge PMD updates 0 21040 NUMA PTE updates 1408326 291624 "NUMA page range updates" == nr_pte_updates and is the value returned to the NUMA pte scanner. NUMA huge PMD updates were the number of THP updates which in combination can be used to calculate how many ptes were updated from userspace. Signed-off-by: NMel Gorman <mgorman@suse.de> Reported-by: NAlex Thorlton <athorlton@sgi.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jerome Marchand 提交于
The same calculation is currently done in three differents places. Factor that code so future changes has to be made at only one place. [akpm@linux-foundation.org: uninline vm_commit_limit()] Signed-off-by: NJerome Marchand <jmarchan@redhat.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhi Yong Wu 提交于
Signed-off-by: NZhi Yong Wu <wuzhy@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Weijie Yang 提交于
The refcount routine was not fit the kernel get/put semantic exactly, There were too many judgement statements on refcount and it could be minus. This patch does the following: - move refcount judgement to zswap_entry_put() to hide resource free function. - add a new function zswap_entry_find_get(), so that callers can use easily in the following pattern: zswap_entry_find_get .../* do something */ zswap_entry_put - to eliminate compile error, move some functions declaration This patch is based on Minchan Kim <minchan@kernel.org> 's idea and suggestion. Signed-off-by: NWeijie Yang <weijie.yang@samsung.com> Cc: Seth Jennings <sjennings@variantweb.net> Acked-by: NMinchan Kim <minchan@kernel.org> Cc: Bob Liu <bob.liu@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Weijie Yang 提交于
Consider the following scenario: thread 0: reclaim entry x (get refcount, but not call zswap_get_swap_cache_page) thread 1: call zswap_frontswap_invalidate_page to invalidate entry x. finished, entry x and its zbud is not freed as its refcount != 0 now, the swap_map[x] = 0 thread 0: now call zswap_get_swap_cache_page swapcache_prepare return -ENOENT because entry x is not used any more zswap_get_swap_cache_page return ZSWAP_SWAPCACHE_NOMEM zswap_writeback_entry do nothing except put refcount Now, the memory of zswap_entry x and its zpage leak. Modify: - check the refcount in fail path, free memory if it is not referenced. - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM as the fail path can be not only caused by nomem but also by invalidate. Signed-off-by: NWeijie Yang <weijie.yang@samsung.com> Reviewed-by: NBob Liu <bob.liu@oracle.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Acked-by: NSeth Jennings <sjenning@linux.vnet.ibm.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qiang Huang 提交于
Signed-off-by: NQiang Huang <h.huangqiang@huawei.com> Reviewed-by: NPekka Enberg <penberg@kernel.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qiang Huang 提交于
We can't see the relationship with memcg from the parameters, so the name with memcg_idx would be more reasonable. Signed-off-by: NQiang Huang <h.huangqiang@huawei.com> Reviewed-by: NPekka Enberg <penberg@kernel.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Qiang Huang 提交于
Signed-off-by: NQiang Huang <h.huangqiang@huawei.com> Reviewed-by: NPekka Enberg <penberg@kernel.org> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Glauber Costa <glommer@parallels.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Akira Takeuchi 提交于
This patch fixes the problem that get_unmapped_area() can return illegal address and result in failing mmap(2) etc. In case that the address higher than PAGE_SIZE is set to /proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be returned by get_unmapped_area(), even if you do not pass any virtual address hint (i.e. the second argument). This is because the current get_unmapped_area() code does not take into account mmap_min_addr. This leads to two actual problems as follows: 1. mmap(2) can fail with EPERM on the process without CAP_SYS_RAWIO, although any illegal parameter is not passed. 2. The bottom-up search path after the top-down search might not work in arch_get_unmapped_area_topdown(). Note: The first and third chunk of my patch, which changes "len" check, are for more precise check using mmap_min_addr, and not for solving the above problem. [How to reproduce] --- test.c ------------------------------------------------- #include <stdio.h> #include <unistd.h> #include <sys/mman.h> #include <sys/errno.h> int main(int argc, char *argv[]) { void *ret = NULL, *last_map; size_t pagesize = sysconf(_SC_PAGESIZE); do { last_map = ret; ret = mmap(0, pagesize, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); // printf("ret=%p\n", ret); } while (ret != MAP_FAILED); if (errno != ENOMEM) { printf("ERR: unexpected errno: %d (last map=%p)\n", errno, last_map); } return 0; } --------------------------------------------------------------- $ gcc -m32 -o test test.c $ sudo sysctl -w vm.mmap_min_addr=65536 vm.mmap_min_addr = 65536 $ ./test (run as non-priviledge user) ERR: unexpected errno: 1 (last map=0x10000) Signed-off-by: NAkira Takeuchi <takeuchi.akr@jp.panasonic.com> Signed-off-by: NKiyoshi Owada <owada.kiyoshi@jp.panasonic.com> Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
When __rmqueue_fallback() doesn't find a free block with the required size it splits a larger page and puts the rest of the page onto the free list. But it has one serious mistake. When putting back, __rmqueue_fallback() always use start_migratetype if type is not CMA. However, __rmqueue_fallback() is only called when all of the start_migratetype queue is empty. That said, __rmqueue_fallback always puts back memory to the wrong queue except try_to_steal_freepages() changed pageblock type (i.e. requested size is smaller than half of page block). The end result is that the antifragmentation framework increases fragmenation instead of decreasing it. Mel's original anti fragmentation does the right thing. But commit 47118af0 ("mm: mmzone: MIGRATE_CMA migration type added") broke it. This patch restores sane and old behavior. It also removes an incorrect comment which was introduced by commit fef903ef ("mm/page_alloc.c: restructure free-page stealing code and fix a bug"). Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
In general, every tracepoint should be zero overhead if it is disabled. However, trace_mm_page_alloc_extfrag() is one of exception. It evaluate "new_type == start_migratetype" even if tracepoint is disabled. However, the code can be moved into tracepoint's TP_fast_assign() and TP_fast_assign exist exactly such purpose. This patch does it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites the argument to MIGRATE_UNMOVABLE and we lost these attribute. The problem was introduced by commit 49255c61 ("page allocator: move check for disabled anti-fragmentation out of fastpath"). So a 4 year old issue may mean that nobody uses page_group_by_mobility_disabled. But anyway, this patch fixes the problem. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Damien Ramonda 提交于
The kernel's readahead algorithm sometimes interprets random read accesses as sequential and triggers unnecessary data prefecthing from storage device (impacting random read average latency). In order to identify sequential cache read misses, the readahead algorithm intends to check whether offset - previous offset == 1 (trivial sequential reads) or offset - previous offset == 0 (sequential reads not aligned on page boundary): if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL) The current offset is stored in the "offset" variable of type "pgoff_t" (unsigned long), while previous offset is stored in "ra->prev_pos" of type "loff_t" (long long). Therefore, operands of the if statement are implicitly converted to type long long. Consequently, when previous offset > current offset (which happens on random pattern), the if condition is true and access is wrongly interpeted as sequential. An unnecessary data prefetching is triggered, impacting the average random read latency. Storing the previous offset value in a "pgoff_t" variable (unsigned long) fixes the sequential read detection logic. Signed-off-by: NDamien Ramonda <damien.ramonda@intel.com> Reviewed-by: NFengguang Wu <fengguang.wu@intel.com> Acked-by: NPierre Tardy <pierre.tardy@intel.com> Acked-by: NDavid Cohen <david.a.cohen@linux.intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daeseok Youn 提交于
Signed-off-by: NDaeseok Youn <daeseok.youn@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Toshi Kani 提交于
vmstat_cpuup_callback() is a CPU notifier callback, which marks N_CPU to a node at CPU online event. However, it does not update this N_CPU info at CPU offline event. Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the node is put into offline, i.e. the node no longer has any online CPU. Signed-off-by: NToshi Kani <toshi.kani@hp.com> Acked-by: NChristoph Lameter <cl@linux.com> Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Tested-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Toshi Kani 提交于
After a system booted, N_CPU is not set to any node as has_cpu shows an empty line. # cat /sys/devices/system/node/has_cpu (show-empty-line) setup_vmstat() registers its CPU notifier callback, vmstat_cpuup_callback(), which marks N_CPU to a node when a CPU is put into online. However, setup_vmstat() is called after all CPUs are launched in the boot sequence. Changed setup_vmstat() to mark N_CPU to the nodes with online CPUs at boot, which is consistent with other operations in vmstat_cpuup_callback(), i.e. start_cpu_timer() and refresh_zone_stat_thresholds(). Also added get_online_cpus() to protect the for_each_online_cpu() loop. Signed-off-by: NToshi Kani <toshi.kani@hp.com> Acked-by: NChristoph Lameter <cl@linux.com> Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Tested-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
The hot-Pluggable field in SRAT specifies which memory is hotpluggable. As we mentioned before, if hotpluggable memory is used by the kernel, it cannot be hot-removed. So memory hotplug users may want to set all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it. Memory hotplug users may also set a node as movable node, which has ZONE_MOVABLE only, so that the whole node can be hot-removed. But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the kernel cannot use memory in movable nodes. This will cause NUMA performance down. And other users may be unhappy. So we need a way to allow users to enable and disable this functionality. In this patch, we introduce movable_node boot option to allow users to choose to not to consume hotpluggable memory at early boot time and later we can set it as ZONE_MOVABLE. To achieve this, the movable_node boot option will control the memblock allocation direction. That said, after memblock is ready, before SRAT is parsed, we should allocate memory near the kernel image as we explained in the previous patches. So if movable_node boot option is set, the kernel does the following: 1. After memblock is ready, make memblock allocate memory bottom up. 2. After SRAT is parsed, make memblock behave as default, allocate memory top down. Users can specify "movable_node" in kernel commandline to enable this functionality. For those who don't use memory hotplug or who don't want to lose their NUMA performance, just don't specify anything. The kernel will work as before. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Suggested-by: NKamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Suggested-by: NIngo Molnar <mingo@kernel.org> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NToshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
The Linux kernel cannot migrate pages used by the kernel. As a result, kernel pages cannot be hot-removed. So we cannot allocate hotpluggable memory for the kernel. ACPI SRAT (System Resource Affinity Table) contains the memory hotplug info. But before SRAT is parsed, memblock has already started to allocate memory for the kernel. So we need to prevent memblock from doing this. In a memory hotplug system, any numa node the kernel resides in should be unhotpluggable. And for a modern server, each node could have at least 16GB memory. So memory around the kernel image is highly likely unhotpluggable. So the basic idea is: Allocate memory from the end of the kernel image and to the higher memory. Since memory allocation before SRAT is parsed won't be too much, it could highly likely be in the same node with kernel image. The current memblock can only allocate memory top-down. So this patch introduces a new bottom-up allocation mode to allocate memory bottom-up. And later when we use this allocation direction to allocate memory, we will limit the start address above the kernel. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: NToshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Tejun Heo <tj@kernel.org> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tang Chen 提交于
[Problem] The current Linux cannot migrate pages used by the kernel because of the kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET. When the pa is changed, we cannot simply update the pagetable and keep the va unmodified. So the kernel pages are not migratable. There are also some other issues will cause the kernel pages not migratable. For example, the physical address may be cached somewhere and will be used. It is not to update all the caches. When doing memory hotplug in Linux, we first migrate all the pages in one memory device somewhere else, and then remove the device. But if pages are used by the kernel, they are not migratable. As a result, memory used by the kernel cannot be hot-removed. Modifying the kernel direct mapping mechanism is too difficult to do. And it may cause the kernel performance down and unstable. So we use the following way to do memory hotplug. [What we are doing] In Linux, memory in one numa node is divided into several zones. One of the zones is ZONE_MOVABLE, which the kernel won't use. In order to implement memory hotplug in Linux, we are going to arrange all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use these memory. To do this, we need ACPI's help. In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The memory affinities in SRAT record every memory range in the system, and also, flags specifying if the memory range is hotpluggable. (Please refer to ACPI spec 5.0 5.2.16) With the help of SRAT, we have to do the following two things to achieve our goal: 1. When doing memory hot-add, allow the users arranging hotpluggable as ZONE_MOVABLE. (This has been done by the MOVABLE_NODE functionality in Linux.) 2. when the system is booting, prevent bootmem allocator from allocating hotpluggable memory for the kernel before the memory initialization finishes. The problem 2 is the key problem we are going to solve. But before solving it, we need some preparation. Please see below. [Preparation] Bootloader has to load the kernel image into memory. And this memory must be unhotpluggable. We cannot prevent this anyway. So in a memory hotplug system, we can assume any node the kernel resides in is not hotpluggable. Before SRAT is parsed, we don't know which memory ranges are hotpluggable. But memblock has already started to work. In the current kernel, memblock allocates the following memory before SRAT is parsed: setup_arch() |->memblock_x86_fill() /* memblock is ready */ |...... |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */ |->reserve_real_mode() /* allocate memory under 1MB */ |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */ |->dma_contiguous_reserve() /* specified by user, should be low */ |->setup_log_buf() /* specified by user, several mega bytes */ |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */ |->acpi_initrd_override() /* several mega bytes */ |->reserve_crashkernel() /* could be large, should reorder */ |...... |->initmem_init() /* Parse SRAT */ According to Tejun's advice, before SRAT is parsed, we should try our best to allocate memory near the kernel image. Since the whole node the kernel resides in won't be hotpluggable, and for a modern server, a node may have at least 16GB memory, allocating several mega bytes memory around the kernel image won't cross to hotpluggable memory. [About this patchset] So this patchset is the preparation for the problem 2 that we want to solve. It does the following: 1. Make memblock be able to allocate memory bottom up. 1) Keep all the memblock APIs' prototype unmodified. 2) When the direction is bottom up, keep the start address greater than the end of kernel image. 2. Improve init_mem_mapping() to support allocate page tables in bottom up direction. 3. Introduce "movable_node" boot option to enable and disable this functionality. This patch (of 6): Create a new function __memblock_find_range_top_down to factor out of top-down allocation from memblock_find_in_range_node. This is a preparation because we will introduce a new bottom-up allocation mode in the following patch. Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NToshi Kani <toshi.kani@hp.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Thomas Renninger <trenn@suse.de> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Nazarewicz <mina86@mina86.com> Cc: Minchan Kim <minchan@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Heiko Carstens 提交于
This is more or less the generic variant of commit 41aacc1e ("x86 get_unmapped_area: Access mmap_legacy_base through mm_struct member"). So effectively architectures which use an own arch_pick_mmap_layout() implementation but call the generic arch_get_unmapped_area() now can also randomize their mmap_base. All architectures which have an own arch_pick_mmap_layout() and call the generic arch_get_unmapped_area() (arm64, s390, tile) currently set mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic arch_pick_mmap_layout() function. So this change is a no-op currently. Signed-off-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Cc: Radu Caragea <sinaelgl@gmail.com> Cc: Michel Lespinasse <walken@google.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Weijie Yang 提交于
Add SetPageReclaim() before __swap_writepage() so that page can be moved to the tail of the inactive list, which can avoid unnecessary page scanning as this page was reclaimed by swap subsystem before. Signed-off-by: NWeijie Yang <weijie.yang@samsung.com> Reviewed-by: NBob Liu <bob.liu@oracle.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Acked-by: NSeth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhang Yanfei 提交于
Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Krzysztof Kozlowski 提交于
During swapoff the frontswap_map was NULL-ified before calling frontswap_invalidate_area(). However the frontswap_invalidate_area() exits early if frontswap_map is NULL. Invalidate was never called during swapoff. This patch moves frontswap_map_set() in swapoff just after calling frontswap_invalidate_area() so outside of locks (swap_lock and swap_info_struct->lock). This shouldn't be a problem as during swapon the frontswap_map_set() is called also outside of any locks. Signed-off-by: NKrzysztof Kozlowski <k.kozlowski@samsung.com> Reviewed-by: NSeth Jennings <sjenning@linux.vnet.ibm.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Shaohua Li <shli@fusionio.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Seth Jennings 提交于
Signed-off-by: NSeth Jennings <sjenning@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Catalin Marinas 提交于
Commit 248ac0e1 ("mm/vmalloc: remove guard page from between vmap blocks") had the side effect of making vmap_area.va_end member point to the next vmap_area.va_start. This was creating an artificial reference to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks. This patch marks the vmap_area containing pointers explicitly and reduces the min ref_count to 2 as vm_struct still contains a reference to the vmalloc'ed object. The kmemleak add_scan_area() function has been improved to allow a SIZE_MAX argument covering the rest of the object (for simpler calling sites). Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhang Yanfei 提交于
We pass the number of pages which hold page structs of a memory section to free_map_bootmem(). This is right when !CONFIG_SPARSEMEM_VMEMMAP but wrong when CONFIG_SPARSEMEM_VMEMMAP. When CONFIG_SPARSEMEM_VMEMMAP, we should pass the number of pages of a memory section to free_map_bootmem. So the fix is removing the nr_pages parameter. When CONFIG_SPARSEMEM_VMEMMAP, we directly use the prefined marco PAGES_PER_SECTION in free_map_bootmem. When !CONFIG_SPARSEMEM_VMEMMAP, we calculate page numbers needed to hold the page structs for a memory section and use the value in free_map_bootmem(). This was found by reading the code. And I have no machine that support memory hot-remove to test the bug now. Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Zhang Yanfei 提交于
For below functions, - sparse_add_one_section() - kmalloc_section_memmap() - __kmalloc_section_memmap() - __kfree_section_memmap() they are always invoked to operate on one memory section, so it is redundant to always pass a nr_pages parameter, which is the page numbers in one section. So we can directly use predefined macro PAGES_PER_SECTION instead of passing the parameter. Signed-off-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Wen Congyang <wency@cn.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Toshi Kani <toshi.kani@hp.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ying Han 提交于
The memory.numa_stat file was not hierarchical. Memory charged to the children was not shown in parent's numa_stat. This change adds the "hierarchical_" stats to the existing stats. The new hierarchical stats include the sum of all children's values in addition to the value of the memcg. Tested: Create cgroup a, a/b and run workload under b. The values of b are included in the "hierarchical_*" under a. $ cd /sys/fs/cgroup $ echo 1 > memory.use_hierarchy $ mkdir a a/b Run workload in a/b: $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) & The hierarchical_ fields in parent (a) show use of workload in a/b: $ cat a/memory.numa_stat total=0 N0=0 N1=0 N2=0 N3=0 file=0 N0=0 N1=0 N2=0 N3=0 anon=0 N0=0 N1=0 N2=0 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 $ cat a/b/memory.numa_stat total=908 N0=552 N1=317 N2=39 N3=0 file=850 N0=549 N1=301 N2=0 N3=0 anon=58 N0=3 N1=16 N2=39 N3=0 unevictable=0 N0=0 N1=0 N2=0 N3=0 hierarchical_total=908 N0=552 N1=317 N2=39 N3=0 hierarchical_file=850 N0=549 N1=301 N2=0 N3=0 hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0 hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0 Signed-off-by: NYing Han <yinghan@google.com> Signed-off-by: NGreg Thelen <gthelen@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Greg Thelen 提交于
Refactor mem_control_numa_stat_show() to use a new stats structure for smaller and simpler code. This consolidates nearly identical code. text data bss dec hex filename 8,137,679 1,703,496 1,896,448 11,737,623 b31a17 vmlinux.before 8,136,911 1,703,496 1,896,448 11,736,855 b31717 vmlinux.after Signed-off-by: NGreg Thelen <gthelen@google.com> Signed-off-by: NYing Han <yinghan@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jianguo Wu 提交于
Use more appropriate NUMA_NO_NODE instead of -1 Signed-off-by: NJianguo Wu <wujianguo@huawei.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
Khugepaged will scan/free HPAGE_PMD_NR normal pages and replace with a hugepage which is allocated from the node of the first scanned normal page, but this policy is too rough and may end with unexpected result to upper users. The problem is the original page-balancing among all nodes will be broken after hugepaged started. Thinking about the case if the first scanned normal page is allocated from node A, most of other scanned normal pages are allocated from node B or C.. But hugepaged will always allocate hugepage from node A which will cause extra memory pressure on node A which is not the situation before khugepaged started. This patch try to fix this problem by making khugepaged allocate hugepage from the node which have max record of scaned normal pages hit, so that the effect to original page-balancing can be minimized. The other problem is if normal scanned pages are equally allocated from Node A,B and C, after khugepaged started Node A will still suffer extra memory pressure. Andrew Davidoff reported a related issue several days ago. He wanted his application interleaving among all nodes and "numactl --interleave=all ./test" was used to run the testcase, but the result wasn't not as expected. cat /proc/2814/numa_maps: 7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098 The end result showed that most pages are from Node3 instead of interleave among node0-3 which was unreasonable. This patch also fix this issue by allocating hugepage round robin from all nodes have the same record, after this patch the result was as expected: 7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722 The simple testcase is like this: int main() { char *p; int i; int j; for (i=0; i < 200; i++) { p = (char *)malloc(1048576); printf("malloc done\n"); if (p == 0) { printf("Out of memory\n"); return 1; } for (j=0; j < 1048576; j++) { p[j] = 'A'; } printf("touched memory\n"); sleep(1); } printf("enter sleep\n"); while(1) { sleep(100); } } [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()] Reported-by: NAndrew Davidoff <davidoff@qedmf.net> Tested-by: NAndrew Davidoff <davidoff@qedmf.net> Signed-off-by: NBob Liu <bob.liu@oracle.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
Move alloc_hugepage() to a better place, no need for a seperate #ifndef CONFIG_NUMA Signed-off-by: NBob Liu <bob.liu@oracle.com> Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andrew Davidoff <davidoff@qedmf.net> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wanpeng Li 提交于
Don't warn twice in __vmalloc_area_node and __vmalloc_node_range if __vmalloc_area_node allocation failure. This patch reverts commit 46c001a2 ("mm/vmalloc.c: emit the failure message before return"). Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wanpeng Li 提交于
The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e00 ("mm: avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to avoid accessing the pages field with unallocated page when show_numa_info() is called. This patch moves the check just before show_numa_info in order that some messages still can be dumped via /proc/vmallocinfo. This patch reverts commit d157a558 ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show instead of show_numa_info"); Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wanpeng Li 提交于
There is a race window between vmap_area tear down and show vmap_area information. A B remove_vm_area spin_lock(&vmap_area_lock); va->vm = NULL; va->flags &= ~VM_VM_AREA; spin_unlock(&vmap_area_lock); spin_lock(&vmap_area_lock); if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEZING)) return 0; if (!(va->flags & VM_VM_AREA)) { seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n", (void *)va->va_start, (void *)va->va_end, va->va_end - va->va_start); return 0; } free_unmap_vmap_area(va); flush_cache_vunmap free_unmap_vmap_area_noflush unmap_vmap_area free_vmap_area_noflush va->flags |= VM_LAZY_FREE The assumption !VM_VM_AREA represents vm_map_ram allocation is introduced by d4033afd ("mm, vmalloc: iterate vmap_area_list, instead of vmlist, in vmallocinfo()"). However, !VM_VM_AREA also represents vmap_area is being tear down in race window mentioned above. This patch fix it by don't dump any information for !VM_VM_AREA case and also remove (VM_LAZY_FREE | VM_LAZY_FREEING) check since they are not possible for !VM_VM_AREA case. Suggested-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wanpeng Li 提交于
The caller address has already been set in set_vmalloc_vm(), there's no need to set it again in __vmalloc_area_node. Reviewed-by: NZhang Yanfei <zhangyanfei@cn.fujitsu.com> Signed-off-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mitsuo Hayasaka <mitsuo.hayasaka.hu@hitachi.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
mpol_to_str() should not fail. Currently, it either fails because the string buffer is too small or because a string hasn't been defined for a mempolicy mode. If a new mempolicy mode is introduced and no string is defined for it, just warn and return "unknown". If the buffer is too small, just truncate the string and return, the same behavior as snprintf(). This also fixes a bug where there was no NULL-byte termination when doing *p++ = '=' and *p++ ':' and maxlen has been reached. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Chen Gang <gang.chen@asianux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Dave Jones <davej@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Chen Gong pointed out that set/unset_migratetype_isolate() was done in different functions in mm/memory-failure.c, which makes the code less readable/maintainable. So this patch does it in soft_offline_page(). With this patch, we get to hold lock_memory_hotplug() longer but it's not a problem because races between memory hotplug and soft offline are very rare. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reviewed-by: NChen, Gong <gong.chen@linux.intel.com> Acked-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Toshi Kani 提交于
cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call mem_online_node() to put its node online if offlined and then call build_all_zonelists() to initialize the zone list. These steps are specific to memory hotplug, and should be managed in mm/memory_hotplug.c. lock_memory_hotplug() should also be held for the whole steps. For this reason, this patch replaces mem_online_node() with try_online_node(), which performs the whole steps with lock_memory_hotplug() held. try_online_node() is named after try_offline_node() as they have similar purpose. There is no functional change in this patch. Signed-off-by: NToshi Kani <toshi.kani@hp.com> Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Robin Holt 提交于
On large memory machines it can take a few minutes to get through free_all_bootmem(). Currently, when free_all_bootmem() calls __free_pages_memory(), the number of contiguous pages that __free_pages_memory() passes to the buddy allocator is limited to BITS_PER_LONG. BITS_PER_LONG was originally chosen to keep things similar to mm/nobootmem.c. But it is more efficient to limit it to MAX_ORDER. base new change 8TB 202s 172s 30s 16TB 401s 351s 50s That is around 1%-3% improvement on total boot time. This patch was spun off from the boot time rfc Robin and I had been working on. Signed-off-by: NRobin Holt <robin.m.holt@gmail.com> Signed-off-by: NNathan Zimmer <nzimmer@sgi.com> Cc: Robin Holt <robinmholt@linux.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Mike Travis <travis@sgi.com> Cc: Yinghai Lu <yinghai@kernel.org> Cc: Mel Gorman <mgorman@suse.de> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-