- 07 5月, 2012 2 次提交
-
-
由 Linus Torvalds 提交于
The VM accounting makes no sense at this level, and half of the callers didn't ever actually use the end result. The only time we want to unaccount the memory is when we actually remove the vma, so do the accounting at that point instead. This simplifies the interfaces (no need to pass down that silly page counter to functions that really don't care), and also makes it much more obvious what is actually going on: we do vm_[un]acct_memory() when adding or removing the vma, not on random page walking. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
None of the callers want to pass in 'zap_details', and it doesn't even make sense for the case of actually unmapping vma's. So remove the argument, and clean up the interface. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 4月, 2012 6 次提交
-
-
由 Sasha Levin 提交于
Commit 3268c63e ("mm: fix move/migrate_pages() race on task struct") has added an odd construct where 'mm' is checked for being NULL, and if it is, it would get dereferenced anyways by mput()ing it. Signed-off-by: NSasha Levin <levinsasha928@gmail.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
Commit 3268c63e ("mm: fix move/migrate_pages() race on task struct") has added an odd construct where 'mm' is checked for being NULL, and if it is, it would get dereferenced anyways by mput()ing it. This would lead to the following NULL ptr deref and BUG() when calling migrate_pages() with a pid that has no mm struct: [25904.193704] BUG: unable to handle kernel NULL pointer dereference at 0000000000000050 [25904.194235] IP: [<ffffffff810b0de7>] mmput+0x27/0xf0 [25904.194235] PGD 773e6067 PUD 77da0067 PMD 0 [25904.194235] Oops: 0002 [#1] PREEMPT SMP [25904.194235] CPU 2 [25904.194235] Pid: 31608, comm: trinity Tainted: G W 3.4.0-rc2-next-20120412-sasha #69 [25904.194235] RIP: 0010:[<ffffffff810b0de7>] [<ffffffff810b0de7>] mmput+0x27/0xf0 [25904.194235] RSP: 0018:ffff880077d49e08 EFLAGS: 00010202 [25904.194235] RAX: 0000000000000286 RBX: 0000000000000000 RCX: 0000000000000000 [25904.194235] RDX: ffff880075ef8000 RSI: 000000000000023d RDI: 0000000000000286 [25904.194235] RBP: ffff880077d49e18 R08: 0000000000000001 R09: 0000000000000001 [25904.194235] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 [25904.194235] R13: 00000000ffffffea R14: ffff880034287740 R15: ffff8800218d3010 [25904.194235] FS: 00007fc8b244c700(0000) GS:ffff880029800000(0000) knlGS:0000000000000000 [25904.194235] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [25904.194235] CR2: 0000000000000050 CR3: 00000000767c6000 CR4: 00000000000406e0 [25904.194235] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [25904.194235] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [25904.194235] Process trinity (pid: 31608, threadinfo ffff880077d48000, task ffff880075ef8000) [25904.194235] Stack: [25904.194235] ffff8800342876c0 0000000000000000 ffff880077d49f78 ffffffff811b8020 [25904.194235] ffffffff811b7d91 ffff880075ef8000 ffff88002256d200 0000000000000000 [25904.194235] 00000000000003ff 0000000000000000 0000000000000000 0000000000000000 [25904.194235] Call Trace: [25904.194235] [<ffffffff811b8020>] sys_migrate_pages+0x340/0x3a0 [25904.194235] [<ffffffff811b7d91>] ? sys_migrate_pages+0xb1/0x3a0 [25904.194235] [<ffffffff8266cbb9>] system_call_fastpath+0x16/0x1b [25904.194235] Code: c9 c3 66 90 55 31 d2 48 89 e5 be 3d 02 00 00 48 83 ec 10 48 89 1c 24 4c 89 64 24 08 48 89 fb 48 c7 c7 cf 0e e1 82 e8 69 18 03 00 <f0> ff 4b 50 0f 94 c0 84 c0 0f 84 aa 00 00 00 48 89 df e8 72 f1 [25904.194235] RIP [<ffffffff810b0de7>] mmput+0x27/0xf0 [25904.194235] RSP <ffff880077d49e08> [25904.194235] CR2: 0000000000000050 [25904.348999] ---[ end trace a307b3ed40206b4b ]--- Signed-off-by: NSasha Levin <levinsasha928@gmail.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ying Han 提交于
The "pgsteal" stat is confusing because it counts both direct reclaim as well as background reclaim. However, we have "kswapd_steal" which also counts background reclaim value. This patch fixes it and also makes it match the existng "pgscan_" stats. Test: pgsteal_kswapd_dma32 447623 pgsteal_kswapd_normal 42272677 pgsteal_kswapd_movable 0 pgsteal_direct_dma32 2801 pgsteal_direct_normal 44353270 pgsteal_direct_movable 0 Signed-off-by: NYing Han <yinghan@google.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Dan Magenheimer <dan.magenheimer@oracle.com> Reviewed-by: NMinchan Kim <minchan@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Fix a gcc warning (and bug?) introduced in cc9a6c87 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") Local variable "page" can be uninitialized if the nodemask from vma policy does not intersects with nodemask from cpuset. Even if it doesn't happens it is better to initialize this variable explicitly than to introduce a kernel oops in a weird corner case. mm/hugetlb.c: In function `alloc_huge_page': mm/hugetlb.c:1135:5: warning: `page' may be used uninitialized in this function Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: NMel Gorman <mgorman@suse.de> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
None of the callsites actually need the page_cgroup descriptor themselves, so just pass the page and do the look up in there. We already had two bugs (6568d4a9 'mm: memcg: update the correct soft limit tree during migration' and 'memcg: fix Bad page state after replace_page_cache') where the passed page and pc were not referring to the same page frame. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Miller 提交于
The comments above __alloc_bootmem_node() claim that the code will first try the allocation using 'goal' and if that fails it will try again but with the 'goal' requirement dropped. Unfortunately, this is not what the code does, so fix it to do so. This is important for nobootmem conversions to architectures such as sparc where MAX_DMA_ADDRESS is infinity. On such architectures all of the allocations done by generic spots, such as the sparse-vmemmap implementation, will pass in: __pa(MAX_DMA_ADDRESS) as the goal, and with the limit given as "-1" this will always fail unless we add the appropriate fallback logic here. Signed-off-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NYinghai Lu <yinghai@kernel.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 4月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
Mel reports a BUG_ON(slot == NULL) in radix_tree_tag_set() on s390 3.0.13: called from __set_page_dirty_nobuffers() when page_remove_rmap() tries to transfer dirty flag from s390 storage key to struct page and radix_tree. That would be because of reclaim's shrink_page_list() calling add_to_swap() on this page at the same time: first PageSwapCache is set (causing page_mapping(page) to appear as &swapper_space), then page->private set, then tree_lock taken, then page inserted into radix_tree - so there's an interval before taking the lock when the radix_tree slot is empty. We could fix this by moving __add_to_swap_cache()'s spin_lock_irq up before the SetPageSwapCache. But a better fix is simply to do what's five years overdue: Ken Chen introduced __set_page_dirty_no_writeback() (if !PageDirty TestSetPageDirty) for tmpfs to skip all the radix_tree overhead, and swap is just the same - it ignores the radix_tree tag, and does not participate in dirty page accounting, so should be using __set_page_dirty_no_writeback() too. s390 testing now confirms that this does indeed fix the problem. Reported-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Ken Chen <kenchen@google.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 4月, 2012 5 次提交
-
-
由 Al Viro 提交于
it's always current->mm Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Linus Torvalds 提交于
This continues the theme started with vm_brk() and vm_munmap(): vm_mmap() does the same thing as do_mmap(), but additionally does the required VM locking. This uninlines (and rewrites it to be clearer) do_mmap(), which sadly duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have to export our internal do_mmap_pgoff() function. Some day we hopefully don't have to export do_mmap() either, if all modular users can become the simpler vm_mmap() instead. We're actually very close to that already, with the notable exception of the (broken) use in i810, and a couple of stragglers in binfmt_elf. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
Like the vm_brk() function, this is the same as "do_munmap()", except it does the VM locking for the caller. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
It does the same thing as "do_brk()", except it handles the VM locking too. It turns out that all external callers want that anyway, so we can make do_brk() static to just mm/mmap.c while at it. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Tejun Heo 提交于
Commit 24aa0788 ("memblock, x86: Replace memblock_x86_reserve/ free_range() with generic ones") replaced x86 specific memblock operations with the generic ones; unfortunately, it lost zero length operation handling in the process making the kernel panic if somebody tries to reserve zero length area. There isn't much to be gained by being cranky to zero length operations and panicking is almost the worst response. Drop the BUG_ON() in memblock_reserve() and update memblock_add_region/isolate_range() so that all zero length operations are handled as noops. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: stable@vger.kernel.org Reported-by: NValere Monseur <valere.monseur@ymail.com> Bisected-by: NJoseph Freeman <jfree143dev@gmail.com> Tested-by: NJoseph Freeman <jfree143dev@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=43098Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 4月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
My 9ce70c02 "memcg: fix deadlock by inverting lrucare nesting" put a nasty little bug into v3.3's version of mem_cgroup_replace_page_cache(), sometimes used for FUSE. Replacing __mem_cgroup_commit_charge_lrucare() by __mem_cgroup_commit_charge(), I used the "pc" pointer set up earlier: but it's for oldpage, and needs now to be for newpage. Once oldpage was freed, its PageCgroupUsed bit (cleared above but set again here) caused "Bad page state" messages - and perhaps worse, being missed from newpage. (I didn't find this by using FUSE, but in reusing the function for tmpfs.) Signed-off-by: NHugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org [v3.3 only] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 4月, 2012 4 次提交
-
-
由 Ying Han 提交于
This reverts commit c38446cc. Before the commit, the code makes senses to me but not after the commit. The "nr_reclaimed" is the number of pages reclaimed by scanning through the memcg's lru lists. The "nr_to_reclaim" is the target value for the whole function. For example, we like to early break the reclaim if reclaimed 32 pages under direct reclaim (not DEF_PRIORITY). After the reverted commit, the target "nr_to_reclaim" is decremented each time by "nr_reclaimed" but we still use it to compare the "nr_reclaimed". It just doesn't make sense to me... Signed-off-by: NYing Han <yinghan@google.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hillf Danton <dhillf@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Chris Metcalf 提交于
The race is as follows: Suppose a multi-threaded task forks a new process (on cpu A), thus bumping up the ref count on all the pages. While the fork is occurring (and thus we have marked all the PTEs as read-only), another thread in the original process (on cpu B) tries to write to a huge page, taking an access violation from the write-protect and calling hugetlb_cow(). Now, suppose the fork() fails. It will undo the COW and decrement the ref count on the pages, so the ref count on the huge page drops back to 1. Meanwhile hugetlb_cow() also decrements the ref count by one on the original page, since the original address space doesn't need it any more, having copied a new page to replace the original page. This leaves the ref count at zero, and when we call unlock_page(), we panic. fork on CPU A fault on CPU B ============= ============== ... down_write(&parent->mmap_sem); down_write_nested(&child->mmap_sem); ... while duplicating vmas if error break; ... up_write(&child->mmap_sem); up_write(&parent->mmap_sem); ... down_read(&parent->mmap_sem); ... lock_page(page); handle COW page_mapcount(old_page) == 2 alloc and prepare new_page ... handle error page_remove_rmap(page); put_page(page); ... fold new_page into pte page_remove_rmap(page); put_page(page); ... oops ==> unlock_page(page); up_read(&parent->mmap_sem); The solution is to take an extra reference to the page while we are holding the lock on it. Signed-off-by: NChris Metcalf <cmetcalf@tilera.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Glauber Costa 提交于
We should use the accessor res_counter_read_u64 for that. Although a purely cosmetic change is sometimes better delayed, to avoid conflicting with other people's work, we are starting to have people touching this code as well, and reproducing the open code behavior because that's the standard =) Time to fix it, then. Signed-off-by: NGlauber Costa <glommer@parallels.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
action != CPU_DEAD || action != CPU_DEAD_FROZEN is always true. Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMichal Hocko <mhocko@suse.cz> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 3月, 2012 7 次提交
-
-
由 Konstantin Khlebnikov 提交于
Replace radix_tree_gang_lookup_slot() and radix_tree_gang_lookup_tag_slot() in page-cache lookup functions with brand-new radix-tree direct iterating. This avoids the double-scanning and pointer copying. Iterator don't stop after nr_pages page-get fails in a row, it continue lookup till the radix-tree end. Thus we can safely remove these restart conditions. Unfortunately, old implementation didn't forbid nr_pages == 0, this corner case does not fit into new code, so the patch adds an extra check at the beginning. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Tested-by: NHugh Dickins <hughd@google.com> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gilad Ben-Yossef 提交于
Calculate a cpumask of CPUs with per-cpu pages in any zone and only send an IPI requesting CPUs to drain these pages to the buddy allocator if they actually have pages when asked to flush. This patch saves 85%+ of IPIs asking to drain per-cpu pages in case of severe memory pressure that leads to OOM since in these cases multiple, possibly concurrent, allocation requests end up in the direct reclaim code path so when the per-cpu pages end up reclaimed on first allocation failure for most of the proceeding allocation attempts until the memory pressure is off (possibly via the OOM killer) there are no per-cpu pages on most CPUs (and there can easily be hundreds of them). This also has the side effect of shortening the average latency of direct reclaim by 1 or more order of magnitude since waiting for all the CPUs to ACK the IPI takes a long time. Tested by running "hackbench 400" on a 8 CPU x86 VM and observing the difference between the number of direct reclaim attempts that end up in drain_all_pages() and those were more then 1/2 of the online CPU had any per-cpu page in them, using the vmstat counters introduced in the next patch in the series and using proc/interrupts. In the test sceanrio, this was seen to save around 3600 global IPIs after trigerring an OOM on a concurrent workload: $ cat /proc/vmstat | tail -n 2 pcp_global_drain 0 pcp_global_ipi_saved 0 $ cat /proc/interrupts | grep CAL CAL: 1 2 1 2 2 2 2 2 Function call interrupts $ hackbench 400 [OOM messages snipped] $ cat /proc/vmstat | tail -n 2 pcp_global_drain 3647 pcp_global_ipi_saved 3642 $ cat /proc/interrupts | grep CAL CAL: 6 13 6 3 3 3 1 2 7 Function call interrupts Please note that if the global drain is removed from the direct reclaim path as a patch from Mel Gorman currently suggests this should be replaced with an on_each_cpu_cond invocation. Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NChristoph Lameter <cl@linux.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Pekka Enberg <penberg@kernel.org> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Acked-by: NMichal Nazarewicz <mina86@mina86.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Gilad Ben-Yossef 提交于
flush_all() is called for each kmem_cache_destroy(). So every cache being destroyed dynamically ends up sending an IPI to each CPU in the system, regardless if the cache has ever been used there. For example, if you close the Infinband ipath driver char device file, the close file ops calls kmem_cache_destroy(). So running some infiniband config tool on one a single CPU dedicated to system tasks might interrupt the rest of the 127 CPUs dedicated to some CPU intensive or latency sensitive task. I suspect there is a good chance that every line in the output of "git grep kmem_cache_destroy linux/ | grep '\->'" has a similar scenario. This patch attempts to rectify this issue by sending an IPI to flush the per cpu objects back to the free lists only to CPUs that seem to have such objects. The check which CPU to IPI is racy but we don't care since asking a CPU without per cpu objects to flush does no damage and as far as I can tell the flush_all by itself is racy against allocs on remote CPUs anyway, so if you required the flush_all to be determinstic, you had to arrange for locking regardless. Without this patch the following artificial test case: $ cd /sys/kernel/slab $ for DIR in *; do cat $DIR/alloc_calls > /dev/null; done produces 166 IPIs on an cpuset isolated CPU. With it it produces none. The code path of memory allocation failure for CPUMASK_OFFSTACK=y config was tested using fault injection framework. Signed-off-by: NGilad Ben-Yossef <gilad@benyossef.com> Acked-by: NChristoph Lameter <cl@linux.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Pekka Enberg <penberg@kernel.org> Cc: Matt Mackall <mpm@selenic.com> Cc: Sasha Levin <levinsasha928@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Avi Kivity <avi@redhat.com> Cc: Michal Nazarewicz <mina86@mina86.org> Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com> Cc: Milton Miller <miltonm@bga.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Most system calls taking flags first check that the flags passed in are valid, and that helps userspace to detect when new flags are supported. But swapon never did so: start checking now, to help if we ever want to support more swap_flags in future. It's difficult to get stray bits set in an int, and swapon is not widely used, so this is most unlikely to break any userspace; but we can just revert if it turns out to do so. Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
The size of coredump files is limited by RLIMIT_CORE, however, allocating large amounts of memory results in three negative consequences: - the coredumping process may be chosen for oom kill and quickly deplete all memory reserves in oom conditions preventing further progress from being made or tasks from exiting, - the coredumping process may cause other processes to be oom killed without fault of their own as the result of a SIGSEGV, for example, in the coredumping process, or - the coredumping process may result in a livelock while writing to the dump file if it needs memory to allocate while other threads are in the exit path waiting on the coredumper to complete. This is fixed by implying __GFP_NORETRY in the page allocator for coredumping processes when reclaim has failed so the allocations fail and the process continues to exit. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
pmd_trans_unstable() should be called before pmd_offset_map() in the locations where the mmap_sem is held for reading. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Ulrich Obergfell <uobergfe@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mark Salter <msalter@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range but forgetting to unmap pages first (ocfs2 remembers). This is not really a bug, since races already require truncate_inode_page() to handle that case once the page is locked; but it can be very inefficient if the file being punched happens to be mapped into many vmas. Provide a drop-in replacement truncate_pagecache_range() which does the unmapping pass first, handling the awkward mismatch between arguments to truncate_inode_pages_range() and arguments to unmap_mapping_range(). Note that holepunching does not unmap privately COWed pages in the range: POSIX requires that we do so when truncating, but it's hard to justify, difficult to implement without an i_size cutoff, and no filesystem is attempting to implement it. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Ben Myers <bpm@sgi.com> Cc: Alex Elder <elder@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Cc: Dave Chinner <david@fromorbit.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 3月, 2012 1 次提交
-
-
由 Rik van Riel 提交于
We should only test compaction_suitable if the kernel is built with CONFIG_COMPACTION, otherwise the stub compaction_suitable function will always return COMPACT_SKIPPED and send kswapd into an infinite loop. Reported-by: NAnton Blanchard <anton@samba.org> Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 3月, 2012 4 次提交
-
-
由 Jason Baron 提交于
Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit for 'VM_NODUMP' flag. The idea is is to add a new madvise() flag: MADV_DONTDUMP, which can be set by applications to specifically request memory regions which should not dump core. The specific application I have in mind is qemu: we can add a flag there that wouldn't dump all of guest memory when qemu dumps core. This flag might also be useful for security sensitive apps that want to absolutely make sure that parts of memory are not dumped. To clear the flag use: MADV_DODUMP. [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland] [akpm@linux-foundation.org: fix up the architectures which broke] Signed-off-by: NJason Baron <jbaron@redhat.com> Acked-by: NRoland McGrath <roland@hack.frob.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Avi Kivity <avi@redhat.com> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Matt Turner <mattst88@gmail.com> Cc: "James E.J. Bottomley" <jejb@parisc-linux.org> Cc: Helge Deller <deller@gmx.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jason Baron 提交于
The motivation for this patchset was that I was looking at a way for a qemu-kvm process, to exclude the guest memory from its core dump, which can be quite large. There are already a number of filter flags in /proc/<pid>/coredump_filter, however, these allow one to specify 'types' of kernel memory, not specific address ranges (which is needed in this case). Since there are no more vma flags available, the first patch eliminates the need for the 'VM_ALWAYSDUMP' flag. The flag is used internally by the kernel to mark vdso and vsyscall pages. However, it is simple enough to check if a vma covers a vdso or vsyscall page without the need for this flag. The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new 'VM_NODUMP' flag, which can be set by userspace using new madvise flags: 'MADV_DONTDUMP', and unset via 'MADV_DODUMP'. The core dump filters continue to work the same as before unless 'MADV_DONTDUMP' is set on the region. The qemu code which implements this features is at: http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch In my testing the qemu core dump shrunk from 383MB -> 13MB with this patch. I also believe that the 'MADV_DONTDUMP' flag might be useful for security sensitive apps, which might want to select which areas are dumped. This patch: The VM_ALWAYSDUMP flag is currently used by the coredump code to indicate that a vma is part of a vsyscall or vdso section. However, we can determine if a vma is in one these sections by checking it against the gate_vma and checking for a non-NULL return value from arch_vma_name(). Thus, freeing a valuable vma bit. Signed-off-by: NJason Baron <jbaron@redhat.com> Acked-by: NRoland McGrath <roland@hack.frob.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead of force_sig(SIGKILL). With the recent changes we do not need force_ to kill the CLONE_NEWPID tasks. And this is more correct. force_sig() can race with the exiting thread even if oom_kill_task() checks p->mm != NULL, while do_send_sig_info(group => true) kille the whole process. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Tejun Heo <tj@kernel.org> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hillf Danton 提交于
Fix code duplication in __unmap_hugepage_range(), such as pte_page() and huge_pte_none(). Signed-off-by: NHillf Danton <dhillf@gmail.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 3月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
Adjusting cc715d99 "mm: vmscan: forcibly scan highmem if there are too many buffer_heads pinning highmem" for -stable reveals that it was slightly wrong once on top of fe2c2a10 "vmscan: reclaim at order 0 when compaction is enabled", which specifically adds testorder for the zone_watermark_ok_safe() test. Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 3月, 2012 8 次提交
-
-
由 Artem Bityutskiy 提交于
Export 'dirty_writeback_interval' to make it visible to file-systems. We are going to push superblock management down to file-systems and get rid of the 'sync_supers' kernel thread completly. Signed-off-by: NArtem Bityutskiy <artem.bityutskiy@linux.intel.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: N"Theodore Ts'o" <tytso@mit.edu>
-
由 Naoya Horiguchi 提交于
Currently we can't do task migration among memory cgroups without THP split, which means processes heavily using THP experience large overhead in task migration. This patch introduces the code for moving charge of THP and makes THP more valuable. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NHillf Danton <dhillf@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
These macros will be used in a later patch, where all usages are expected to be optimized away without #ifdef CONFIG_TRANSPARENT_HUGEPAGE. But to detect unexpected usages, we convert the existing BUG() to BUILD_BUG(). [akpm@linux-foundation.org: fix build in mm/pgtable-generic.c] Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NHillf Danton <dhillf@gmail.com> Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
- Replace lengthy function name is_target_pte_for_mc() with a shorter one in order to avoid ugly line breaks. - explicitly use MC_TARGET_* instead of simply using integers. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Hillf Danton <dhillf@gmail.com> Cc: David Rientjes <rientjes@google.com> Acked-by: NHillf Danton <dhillf@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jeff Liu 提交于
Signed-off-by: NJie Liu <jeff.liu@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Anton Vorontsov 提交于
In the following code: if (type == _MEM) thresholds = &memcg->thresholds; else if (type == _MEMSWAP) thresholds = &memcg->memsw_thresholds; else BUG(); BUG_ON(!thresholds); The BUG_ON() seems redundant. Signed-off-by: NAnton Vorontsov <anton.vorontsov@linaro.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
A grammatical fix. Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
mem_cgroup_begin_update_page_stat() should be very fast because it's called very frequently. Now, it needs to look up page_cgroup and its memcg....this is slow. This patch adds a global variable to check "any memcg is moving or not". With this, the caller doesn't need to visit page_cgroup and memcg. Here is a test result. A test program makes page faults onto a file, MAP_SHARED and makes each page's page_mapcount(page) > 1, and free the range by madvise() and page fault again. This program causes 26214400 times of page fault onto a file(size was 1G.) and shows shows the cost of mem_cgroup_begin_update_page_stat(). Before this patch for mem_cgroup_begin_update_page_stat() [kamezawa@bluextal test]$ time ./mmap 1G real 0m21.765s user 0m5.999s sys 0m15.434s 27.46% mmap mmap [.] reader 21.15% mmap [kernel.kallsyms] [k] page_fault 9.17% mmap [kernel.kallsyms] [k] filemap_fault 2.96% mmap [kernel.kallsyms] [k] __do_fault 2.83% mmap [kernel.kallsyms] [k] __mem_cgroup_begin_update_page_stat After this patch [root@bluextal test]# time ./mmap 1G real 0m21.373s user 0m6.113s sys 0m15.016s In usual path, calls to __mem_cgroup_begin_update_page_stat() goes away. Note: we may be able to remove this optimization in future if we can get pointer to memcg directly from struct page. [akpm@linux-foundation.org: don't return a void] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NGreg Thelen <gthelen@google.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@suse.cz> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-