- 02 July 2011, 9 commits
-
-
Submitted by Christoph Lameter

We will be calling free_debug_processing with interrupts disabled in some cases when the later patches are applied. Some of the functions called by free_debug_processing expect interrupts to be off.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

Locking slabs is no longer necessary if the arch supports cmpxchg operations and if no debugging features are used on a slab. If the arch does not support cmpxchg then we fall back to using the slab lock to do a cmpxchg-like operation.

The patch also changes the lock order: slab locks are subsumed under the node lock now. With that approach slab trylocking is no longer necessary.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

Rework the allocation paths so that updates of the page freelist, frozen state and number of objects use cmpxchg_double_slab().

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

We need more information about the slab for the cmpxchg implementation.

Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

The allocator fastpath rework changes the usage of the list_lock. Remove the list_lock handling from the helper functions that currently hide it and move it into the critical sections themselves. This in turn simplifies the support functions (no __ variant needed anymore) and simplifies the lock handling on bootstrap.

Inline add_partial since it becomes pretty simple.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

Add a function that operates on the second doubleword in the page struct and manipulates the object counters, the freelist and the frozen attribute.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
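As a rough illustration, a minimal sketch of what such a helper can look like, assuming a double-word cmpxchg is available with a per-slab lock fallback; the names, config symbol and locking helpers are placeholders rather than the exact slub code:

    static inline bool cmpxchg_double_slab(struct page *page,
                    void *freelist_old, unsigned long counters_old,
                    void *freelist_new, unsigned long counters_new)
    {
    #ifdef CONFIG_CMPXCHG_DOUBLE
            /* freelist and counters sit in adjacent words of struct page */
            return cmpxchg_double(&page->freelist, &page->counters,
                                  freelist_old, counters_old,
                                  freelist_new, counters_new);
    #else
            bool ok = false;

            slab_lock(page);        /* bit spinlock fallback */
            if (page->freelist == freelist_old &&
                page->counters == counters_old) {
                    page->freelist = freelist_new;
                    page->counters = counters_new;
                    ok = true;
            }
            slab_unlock(page);
            return ok;
    #endif
    }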
-
Submitted by Christoph Lameter

This is necessary because the frozen bit has to be handled in the same cmpxchg_double with the freelist and the counters.

Signed-off-by: Christoph Lameter <cl@linux.com>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

Do not use a page flag for the frozen bit. It needs to be part of the state that is handled with cmpxchg_double(), so use a bit in the counter struct in the page struct for that purpose.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
-
Submitted by Christoph Lameter

Do the irq handling in allocate_slab() instead of __slab_alloc(). __slab_alloc() is already cluttered and allocate_slab() is already fiddling around with gfp flags.

v6->v7: Only increment ORDER_FALLBACK if we get a page during fallback.

Signed-off-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Pekka Enberg <penberg@kernel.org>
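A minimal sketch of the resulting shape of allocate_slab(), with the irq handling done locally; the helper name alloc_slab_page() and the exact flag check are assumptions for illustration, not the actual slub code:

    static struct page *allocate_slab(struct kmem_cache *s, gfp_t flags, int node)
    {
            struct page *page;

            if (flags & __GFP_WAIT)
                    local_irq_enable();     /* the allocation may sleep */

            page = alloc_slab_page(s, flags, node);

            if (flags & __GFP_WAIT)
                    local_irq_disable();    /* caller expects irqs off again */

            return page;
    }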
-
- 28 June 2011, 7 commits
-
-
Submitted by KAMEZAWA Hiroyuki

Commit d149e3b2 ("memcg: add the soft_limit reclaim in global direct reclaim") adds a softlimit hook to shrink_zones(). By this, soft limit reclaim is called as

  try_to_free_pages()
    do_try_to_free_pages()
      shrink_zones()
        mem_cgroup_soft_limit_reclaim()

so direct reclaim is now memcg softlimit-hint aware. But the memory cgroup's "limit" path can also call the softlimit shrinker:

  try_to_free_mem_cgroup_pages()
    do_try_to_free_pages()
      shrink_zones()
        mem_cgroup_soft_limit_reclaim()

This will cause a global reclaim when a memcg hits its limit, which is a bug. soft_limit_reclaim() should only be called when scanning_global_lru(sc) == true.

The commit also adds a variable "total_scanned" for counting softlimit-scanned pages, but it is not really a "total". This patch removes the variable and updates sc->nr_scanned instead. This will affect shrink_slab()'s scan condition but, since the global LRU is scanned by softlimit, I think this change makes sense.

TODO: avoid too much scanning of a zone when softlimit did enough work.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Ying Han <yinghan@google.com>
Cc: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
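A sketch of the intended call site, assuming the helper names from the message: the soft-limit shrinker runs only for the global LRU scan, and its work is folded into sc->nr_scanned rather than a separate "total_scanned".

    static void soft_limit_reclaim_zone(struct zone *zone, struct scan_control *sc)
    {
            unsigned long nr_soft_scanned = 0;
            unsigned long nr_soft_reclaimed;

            if (!scanning_global_lru(sc))
                    return;         /* memcg limit reclaim: skip the global hook */

            nr_soft_reclaimed = mem_cgroup_soft_limit_reclaim(zone, sc->order,
                                                              sc->gfp_mask,
                                                              &nr_soft_scanned);
            sc->nr_reclaimed += nr_soft_reclaimed;
            sc->nr_scanned += nr_soft_scanned;
    }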
-
Submitted by Jan Kara

Under heavy memory and filesystem load, users observe the assertion mapping->nrpages == 0 in end_writeback() trigger. This can be caused by page reclaim reclaiming the last page from a mapping in the following race:

  CPU0                                  CPU1
  ...
  shrink_page_list()
    __remove_mapping()
      __delete_from_page_cache()
        radix_tree_delete()
                                        evict_inode()
                                          truncate_inode_pages()
                                            truncate_inode_pages_range()
                                              pagevec_lookup() - finds nothing
                                          end_writeback()
                                            mapping->nrpages != 0 -> BUG
        page->mapping = NULL
        mapping->nrpages--

Fix the problem by doing a reliable check of mapping->nrpages under mapping->tree_lock in end_writeback().

Analyzed by Jay <jinshan.xiong@whamcloud.com>, lost in LKML, and dug out by Miklos Szeredi <mszeredi@suse.de>.

Cc: Jay <jinshan.xiong@whamcloud.com>
Cc: Miklos Szeredi <mszeredi@suse.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
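A sketch of what the reliable check can look like (the helper wrapper is an assumption; the point is that nrpages is tested with tree_lock held, so it cannot race with __delete_from_page_cache()):

    static void assert_mapping_empty(struct address_space *mapping)
    {
            /*
             * Cycle tree_lock: reclaim may still be inside
             * __delete_from_page_cache() removing the very last page,
             * so the final nrpages check must be done under the lock.
             */
            spin_lock_irq(&mapping->tree_lock);
            BUG_ON(mapping->nrpages);
            spin_unlock_irq(&mapping->tree_lock);
    }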
-
Submitted by Peter Zijlstra

We cannot take a mutex while holding a spinlock, so flip the order and fix the locking documentation.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Hugh Dickins

Although it is used (by i915) on nothing but tmpfs, read_cache_page_gfp() is unsuited to tmpfs, because it inserts a page into pagecache before calling the filesystem's ->readpage: tmpfs may have pages in swapcache which only it knows how to locate and switch to filecache.

At present tmpfs provides a ->readpage method, and copes with this by copying pages; but soon we can simplify it by removing its ->readpage. Provide shmem_read_mapping_page_gfp() now, ready for that transition.

Export shmem_read_mapping_page_gfp() and add it to the list in shmem_fs.h, with shmem_read_mapping_page() inline for the common mapping_gfp case.

(shmem_read_mapping_page_gfp or shmem_read_cache_page_gfp? Generally the read_mapping_page functions use the mapping's ->readpage, and the read_cache_page functions use the supplied filler, so I think read_cache_page_gfp was slightly misnamed.)

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
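The inline wrapper for the common mapping_gfp case described above can be sketched roughly like this (declared in shmem_fs.h next to the exported shmem_read_mapping_page_gfp(); the exact layout is assumed):

    static inline struct page *shmem_read_mapping_page(
                                    struct address_space *mapping, pgoff_t index)
    {
            /* Use the mapping's default gfp mask for the common case. */
            return shmem_read_mapping_page_gfp(mapping, index,
                                               mapping_gfp_mask(mapping));
    }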
-
Submitted by Hugh Dickins

2.6.35's new truncate convention gave tmpfs the opportunity to control its file truncation, no longer enforced from outside by vmtruncate(). We shall want to build upon that, to handle pagecache and swap together.

Slightly redefine the ->truncate_range interface: let it now be called between the unmap_mapping_range()s, with the filesystem responsible for doing the truncate_inode_pages_range() from it - just as the filesystem is nowadays responsible for doing that from its ->setattr.

Let's rename shmem_notify_change() to shmem_setattr(). Instead of calling the generic truncate_setsize(), bring that code in so we can call shmem_truncate_range() - which will later be updated to perform its own variant of truncate_inode_pages_range().

Remove the punch_hole unmap_mapping_range() from shmem_truncate_range(): now that the COW's unmap_mapping_range() comes after ->truncate_range, there is no need to call it a third time.

Export shmem_truncate_range() and add it to the list in shmem_fs.h, so that i915_gem_object_truncate() can call it explicitly in future; get this patch in first, then update drm/i915 once this is available (until then, i915 will just be doing the truncate_inode_pages() twice).

Though introduced five years ago, no other filesystem is implementing ->truncate_range, and its only other user is madvise(,,MADV_REMOVE): we expect to convert it to fallocate(,FALLOC_FL_PUNCH_HOLE,,) shortly, whereupon ->truncate_range can be removed from inode_operations - shmem_truncate_range() will help i915 across that transition too.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Hugh Dickins

Before adding any more global entry points into shmem.c, gather such prototypes into shmem_fs.h. Remove mm's own declarations from swap.h, but for now leave the ones in mm.h: because shmem_file_setup() and shmem_zero_setup() are called from various places, and we should not force other subsystems to update immediately.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Hugh Dickins

You would expect to find vmtruncate_range() next to vmtruncate() in mm/truncate.c: move it there.

Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 June 2011, 2 commits
-
-
Submitted by David Rientjes

Commit 959ecc48 ("mm/memory_hotplug.c: fix building of node hotplug zonelist") does not protect the build_all_zonelists() call with zonelists_mutex as needed. This can lead to races in constructing zonelist ordering if a concurrent build is underway. Protecting this with lock_memory_hotplug() is insufficient since zonelists can be rebuilt through sysfs as well.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
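A minimal sketch of the fix: the zonelist rebuild done from the hotplug path is serialized with zonelists_mutex, the same lock that protects the sysfs-triggered rebuild (the wrapper function here is only illustrative):

    static void rebuild_zonelists_serialized(void)
    {
            /* Serialize against any concurrent rebuild (hotplug or sysfs). */
            mutex_lock(&zonelists_mutex);
            build_all_zonelists(NULL);
            mutex_unlock(&zonelists_mutex);
    }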
-
Submitted by David Rientjes

The error handling in mem_online_node() is incorrect: hotadd_new_pgdat() returns NULL if the new pgdat could not be allocated and a pointer to it otherwise. mem_online_node() should fail if hotadd_new_pgdat() fails, not the inverse.

This fixes an issue where memoryless nodes are not onlined and their sysfs interface is not registered when their first cpu is brought up. The bug was introduced by commit cf23422b ("cpu/mem hotplug: enable CPUs online before local memory online"), i.e. v2.6.35.

Signed-off-by: David Rientjes <rientjes@google.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
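A sketch of the corrected error handling (the surrounding function body is assumed for illustration): hotadd_new_pgdat() returning NULL is the failure case, so mem_online_node() must bail out there.

    static int mem_online_node(int nid)
    {
            pg_data_t *pgdat;
            int ret = 0;

            lock_memory_hotplug();
            pgdat = hotadd_new_pgdat(nid, 0);
            if (!pgdat) {                   /* allocation failed: report it */
                    ret = -ENOMEM;
                    goto out;
            }
            node_set_online(nid);
            ret = register_one_node(nid);
            BUG_ON(ret);
    out:
            unlock_memory_hotplug();
            return ret;
    }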
-
- 18 June 2011, 3 commits
-
-
Submitted by Linus Torvalds

Hugh Dickins points out that lockdep (correctly) spots a potential deadlock on the anon_vma lock, because we now do a GFP_KERNEL allocation of anon_vma_chain while doing anon_vma_clone(). The problem is that page reclaim will want to take the anon_vma lock of any anonymous pages that it will try to reclaim.

So re-organize the code in anon_vma_clone() slightly: first do just a GFP_NOWAIT allocation, which will usually work fine. But if that fails, let's just drop the lock and re-do the allocation, now with GFP_KERNEL.

End result: not only do we avoid the locking problem, this also ends up getting better concurrency in case the allocation does need to block. Tim Chen reports that with all these anon_vma locking tweaks, we're now almost back up to the spinlock performance.

Reported-and-tested-by: Hugh Dickins <hughd@google.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
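A rough sketch of the allocation strategy (the helper names and the way the dropped lock is signalled back to the caller are assumptions): try an atomic allocation while the root anon_vma lock may be held, and only if that fails drop the lock and retry with GFP_KERNEL.

    static struct anon_vma_chain *avc_alloc_for_clone(struct anon_vma **root)
    {
            struct anon_vma_chain *avc;

            avc = anon_vma_chain_alloc(GFP_NOWAIT | __GFP_NOWARN);
            if (unlikely(!avc)) {
                    unlock_anon_vma_root(*root);    /* drop the lock ...   */
                    *root = NULL;                   /* caller must relock  */
                    avc = anon_vma_chain_alloc(GFP_KERNEL); /* ... and block */
            }
            return avc;
    }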
-
Submitted by Peter Zijlstra

This matches the anon_vma_clone() case, and uses the same lock helper functions. Because of the need to potentially release the anon_vma's, it's a bit more complex, though.

We traverse the 'vma->anon_vma_chain' in two phases: the first loop gets the anon_vma lock (with the helper function that only takes the lock once for the whole loop), and removes any entries that don't need any more processing. The second phase just traverses the remaining list entries (without holding the anon_vma lock), and does any actual freeing of the anon_vma's that is required.

Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Tested-by: Hugh Dickins <hughd@google.com>
Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Linus Torvalds

In anon_vma_clone() we traverse the vma->anon_vma_chain of the source vma, locking the anon_vma for each entry. But they are all going to have the same root entry, which means that we're locking and unlocking the same lock over and over again. That is expensive in locked operations, but can get _really_ expensive when that root entry sees any kind of lock contention.

In fact, Tim Chen reports a big performance regression due to this: when we switched to use a mutex instead of a spinlock, the contention case gets much worse.

So to alleviate this all, this commit creates a small helper function (lock_anon_vma_root()) that can be used to take the lock just once rather than taking and releasing it over and over again.

We still have the same "take the lock and release it" behavior in the exit path (in unlink_anon_vmas()), but that one is a bit harder to fix since we're actually freeing the anon_vma entries as we go, and that will touch the lock too.

Reported-and-tested-by: Tim Chen <tim.c.chen@linux.intel.com>
Tested-by: Hugh Dickins <hughd@google.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
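A sketch of the helper pair described above (close to the idea, though the exact kernel code may differ): the shared root lock is taken once per traversal and only retaken if a chain entry turns out to have a different root.

    static inline struct anon_vma *lock_anon_vma_root(struct anon_vma *root,
                                                      struct anon_vma *anon_vma)
    {
            struct anon_vma *new_root = anon_vma->root;

            if (new_root != root) {
                    if (WARN_ON_ONCE(root))
                            mutex_unlock(&root->mutex);
                    root = new_root;
                    mutex_lock(&root->mutex);
            }
            return root;
    }

    static inline void unlock_anon_vma_root(struct anon_vma *root)
    {
            if (root)
                    mutex_unlock(&root->mutex);
    }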
-
- 17 June 2011, 1 commit
-
-
Submitted by Andrea Arcangeli

Swapcache will reach the below code path in migrate_page_move_mapping; swapcache is accounted as NR_FILE_PAGES but it's not accounted as NR_SHMEM.

Hugh pointed out we must use PageSwapCache instead of comparing mapping to &swapper_space, to avoid a build failure with CONFIG_SWAP=n.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 June 2011, 18 commits
-
-
Submitted by Linus Torvalds

We have some users of this function that date back to before the vma list was doubly linked, and are just silly. These days, you can find the previous vma by just following the vma->vm_prev pointer. In some cases you don't need any find_vma() lookup at all, and in other cases you're better off with the regular "find_vma()" that uses the vma cache front-end lookup.

Some "find_vma_prev()" users are still valid, though. For example, in the case of a stack that grows up, it can be the case that we don't find any 'vma' at all (because we're looking up an address that is past the last vma), and that the stack that we want to grow is the 'prev' vma. But that kind of special case aside, we generally should prefer to use 'find_vma()'.

Noticed due to a totally unrelated POWER memory corruption bug that just happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we using that function here?".

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
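For illustration, the simpler pattern the message recommends looks roughly like this (a hypothetical helper, not actual kernel code); note that it deliberately does not handle the stack-grows-up case where no vma is found but a 'prev' is still wanted:

    static struct vm_area_struct *find_vma_and_prev(struct mm_struct *mm,
                                                    unsigned long addr,
                                                    struct vm_area_struct **pprev)
    {
            struct vm_area_struct *vma = find_vma(mm, addr);

            /* The vma list is doubly linked: the predecessor is one hop away. */
            *pprev = vma ? vma->vm_prev : NULL;
            return vma;
    }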
-
Submitted by Hugh Dickins

Andrea Righi reported a case where an exiting task can race against ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742), easily triggering a NULL pointer dereference in ksmd.

  ksm_scan.mm_slot == &ksm_mm_head with only one registered mm

  CPU 1 (__ksm_exit)              CPU 2 (scan_get_next_rmap_item)
                                  list_empty() is false
  lock                            slot == &ksm_mm_head
  list_del(slot->mm_list)
  (list now empty)
  unlock
                                  lock
                                  slot = list_entry(slot->mm_list.next)
                                  (list is empty, so slot is still ksm_mm_head)
                                  unlock
                                  slot->mm == NULL ... Oops

Close this race by revalidating that the new slot is not simply the list head again.

Andrea's test case:

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <sys/mman.h>

  #define BUFSIZE getpagesize()

  int main(int argc, char **argv)
  {
          void *ptr;

          if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) {
                  perror("posix_memalign");
                  exit(1);
          }
          if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) {
                  perror("madvise");
                  exit(1);
          }
          *(char *)NULL = 0;

          return 0;
  }

Reported-by: Andrea Righi <andrea@betterlinux.com>
Tested-by: Andrea Righi <andrea@betterlinux.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Mel Gorman

Asynchronous compaction is used when promoting to huge pages. This is all very nice, but if a number of processes are compacting memory, a large number of pages can be isolated. An "asynchronous" process can stall for long periods of time as a result, with one user reporting that firefox can stall for 10s of seconds. This patch aborts asynchronous compaction if too many pages are isolated, as it's better to fail a hugepage promotion than stall a process.

[minchan.kim@gmail.com: return COMPACT_PARTIAL for abort]
Reported-and-tested-by: Ury Stankevich <urykhy@gmail.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Andrea Arcangeli

It is unsafe to run page_count during the physical pfn scan because compound_head could trip on a dangling pointer when reading page->first_page if the compound page is being freed by another CPU.

[mgorman@suse.de: split out patch]
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Mel Gorman

Compaction works with two scanners, a migration and a free scanner. When the scanners cross over, migration within the zone is complete. The location of the scanner is recorded on each cycle to avoid excessive scanning.

When a zone is small and mostly reserved, it's very easy for the migration scanner to be close to the end of the zone. Then the following situation can occur:

  o the migration scanner isolates some pages near the end of the zone
  o the free scanner starts at the end of the zone but finds that the
    migration scanner is already there
  o the free scanner gets reinitialised for the next cycle as
    cc->migrate_pfn + pageblock_nr_pages, moving the free scanner into
    the next zone
  o the migration scanner moves into the next zone

When this happens, NR_ISOLATED accounting goes haywire because some of the accounting happens against the wrong zone. One zone's counter remains positive while the other goes negative even though the overall global count is accurate. This was reported on X86-32 with !SMP because !SMP allows the negative counters to be visible; the bug should theoretically be possible elsewhere as well.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Shaohua Li

fragmentation_index() returns -1000 when the allocation might succeed. This doesn't match the comment and code in compaction_suitable(). I think compaction_suitable() should return COMPACT_PARTIAL in the -1000 case, because the allocation could then succeed depending on watermarks.

The impact of this is that compaction starts and compact_finished() is called, which rechecks the watermarks and the free lists. It should have the same result, in that compaction should not start, but it is more expensive.

Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
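A sketch of the adjusted logic, wrapped in a standalone helper here purely for illustration: a fragmentation index of -1000 means the allocation may already succeed, so COMPACT_PARTIAL is reported instead of starting a compaction cycle.

    static unsigned long check_fragmentation_index(struct zone *zone, int order,
                                                   unsigned long watermark)
    {
            int fragindex = fragmentation_index(zone, order);

            /* Too little relevant free memory: compaction not worthwhile. */
            if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
                    return COMPACT_SKIPPED;

            /* -1000: the allocation might succeed depending on watermarks. */
            if (fragindex == -1000 &&
                zone_watermark_ok(zone, order, watermark, 0, 0))
                    return COMPACT_PARTIAL;

            return COMPACT_CONTINUE;
    }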
-
Submitted by Minchan Kim

Pages isolated for migration are accounted with the vmstat counters NR_ISOLATE_[ANON|FILE]. Callers of migrate_pages() are expected to increment these counters when pages are isolated from the LRU. Once the pages have been migrated, they are put back on the LRU or freed and the isolated count is decremented.

Memory failure is not properly accounting for pages it isolates, causing the NR_ISOLATED counters to be negative. On SMP builds, this goes unnoticed as negative counters are treated as 0 due to expected per-cpu drift. On UP builds, the counter is treated by too_many_isolated() as a large value, causing processes to enter D state during page reclaim or compaction. This patch accounts for pages isolated by memory failure correctly.

[mel@csn.ul.ie: rewrote changelog]
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Cc: Andi Kleen <andi@firstfloor.org>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
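A sketch of the accounting that other migrate_pages() callers already do and that memory failure was missing (the wrapper name is illustrative, not the actual memory-failure code):

    static int isolate_and_account(struct page *page)
    {
            if (isolate_lru_page(page))
                    return -EBUSY;

            /* Mirror the bookkeeping done by other migrate_pages() callers. */
            inc_zone_page_state(page, NR_ISOLATED_ANON +
                                      page_is_file_cache(page));
            return 0;
    }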
-
Submitted by KAMEZAWA Hiroyuki

Based on Michal Hocko's comment: we are not draining per-cpu cached charges during soft limit reclaim because background reclaim doesn't care about charges. It tries to free some memory, and charges will not give it any. Cached charges might influence only the selection of the biggest soft limit offender, but as the call is done only after the selection has already been made, it makes no change.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by KAMEZAWA Hiroyuki

For performance, the memory cgroup caches some "charge" from res_counter into a per-cpu cache. This works well, but because it's a cache it needs to be flushed in some cases. Typical cases are:

  1. when someone hits the limit.
  2. when rmdir() is called and the charges need to drop to 0.

But "1" has a problem. Recently, with large SMP machines, we see many kworker runs because of flushing memcg's cache. The bad thing in the implementation is that the drain code is called even if a cpu's cache holds charges for a memcg unrelated to the one that hit its limit.

This patch:
  A) checks whether the per-cpu cache contains useful data or not.
  B) checks that no other asynchronous per-cpu drain is running.
  C) doesn't call the local cpu callback.

(*) This patch avoids changing the calling condition for the hard limit.

When I run "cat 1Gfile > /dev/null" under a 300M-limit memcg:

[Before]
  13767 kamezawa  20   0 98.6m  424  416 D 10.0  0.0  0:00.61 cat
     58 root      20   0     0    0    0 S  0.6  0.0  0:00.09 kworker/2:1
     60 root      20   0     0    0    0 S  0.6  0.0  0:00.08 kworker/4:1
      4 root      20   0     0    0    0 S  0.3  0.0  0:00.02 kworker/0:0
     57 root      20   0     0    0    0 S  0.3  0.0  0:00.05 kworker/1:1
     61 root      20   0     0    0    0 S  0.3  0.0  0:00.05 kworker/5:1
     62 root      20   0     0    0    0 S  0.3  0.0  0:00.05 kworker/6:1
     63 root      20   0     0    0    0 S  0.3  0.0  0:00.05 kworker/7:1

[After]
   2676 root      20   0 98.6m  416  416 D  9.3  0.0  0:00.87 cat
   2626 kamezawa  20   0 15192 1312  920 R  0.3  0.0  0:00.28 top
      1 root      20   0 19384 1496 1204 S  0.0  0.0  0:00.66 init
      2 root      20   0     0    0    0 S  0.0  0.0  0:00.00 kthreadd
      3 root      20   0     0    0    0 S  0.0  0.0  0:00.00 ksoftirqd/0
      4 root      20   0     0    0    0 S  0.0  0.0  0:00.00 kworker/0:0

[akpm@linux-foundation.org: make percpu_charge_mutex static, tweak comments]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Tested-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by KAMEZAWA Hiroyuki

Hierarchical reclaim doesn't swap out if the memsw and resource limits are the same (memsw_is_minimum == true) because we would hit the mem+swap limit anyway (during hard limit reclaim). If it comes to the soft limit we shouldn't consider memsw_is_minimum at all, because it doesn't make much sense: either the soft limit is below the hard limit and then we cannot hit the mem+swap limit, or the direct reclaim takes precedence.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Michal Hocko <mhocko@suse.cz>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by KAMEZAWA Hiroyuki

Commit 21a3c964 ("memcg: allocate memory cgroup structures in local nodes") makes page_cgroup allocation NUMA aware. But that caused a problem, https://bugzilla.kernel.org/show_bug.cgi?id=36192.

The problem was getting a NID from invalid struct pages, which were not initialized because they were out-of-node, i.e. outside [node_start_pfn, node_end_pfn).

Now, with sparsemem, page_cgroup_init scans pfns from 0 to max_pfn. But this may scan a pfn which is not on any node and can access a memmap which is not initialized. This makes page_cgroup_init() for SPARSEMEM node aware and removes the code that gets the nid from page->flags. (Then, we'll always use a valid NID.)

[akpm@linux-foundation.org: try to fix up comments]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by KAMEZAWA Hiroyuki

Commit 406eb0c9 ("memcg: add memory.numastat api for numa statistics") adds a memory.numa_stat file for the memory cgroup. But the file permissions are wrong:

  [kamezawa@bluextal linux-2.6]$ ls -l /cgroup/memory/A/memory.numa_stat
  ---------- 1 root root 0 Jun  9 18:36 /cgroup/memory/A/memory.numa_stat

This patch fixes the permission to:

  [root@bluextal kamezawa]# ls -l /cgroup/memory/A/memory.numa_stat
  -r--r--r-- 1 root root 0 Jun 10 16:49 /cgroup/memory/A/memory.numa_stat

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Ying Han <yinghan@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Rafael Aquini

When 1GB hugepages are allocated on a system, free(1) reports less available memory than what is really installed in the box. Also, if the total size of hugepages allocated on a system is over half of the total memory size, CommitLimit becomes a negative number.

The problem is that gigantic hugepages (order > MAX_ORDER) can only be allocated at boot with bootmem, thus their frames are not accounted to 'totalram_pages'. However, they are accounted to hugetlb_total_pages(). What turns CommitLimit into a negative number is this calculation, in fs/proc/meminfo.c:

  allowed = ((totalram_pages - hugetlb_total_pages()) *
             sysctl_overcommit_ratio / 100) + total_swap_pages;

A similar calculation occurs in __vm_enough_memory() in mm/mmap.c. Also, every vm statistic which depends on 'totalram_pages' will render confusing values, as if the system were 'missing' some part of its memory.

Impact of this bug: when gigantic hugepages are allocated and sysctl_overcommit_memory == OVERCOMMIT_NEVER, __vm_enough_memory() goes through the mentioned 'allowed' calculation and might end up mistakenly returning -ENOMEM, thus forcing the system to start reclaiming pages earlier than it usually would, and this could cause a detrimental impact on overall system performance, depending on the workload. Besides the aforementioned scenario, I can only think of this causing annoyances with memory reports from /proc/meminfo and free(1).

[akpm@linux-foundation.org: standardize comment layout]
Reported-by: Russ Anderson <rja@sgi.com>
Signed-off-by: Rafael Aquini <aquini@linux.com>
Acked-by: Russ Anderson <rja@sgi.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by KAMEZAWA Hiroyuki

During memory hotplug we refresh zonelists when we online a page in a new zone. It means that the node's zonelist is not initialized until pages are onlined. So, for example, the "nid" passed by a MEM_GOING_ONLINE notifier will point to NODE_DATA(nid) which has no zone fallback list. Moreover, if we hot-add cpu-only nodes, alloc_pages() will do no fallback.

This patch makes a zonelist when a new pgdat is available.

Note: in production, at Fujitsu, memory should be onlined before cpu, and our servers didn't have any memory-less nodes and had no problems. But recent changes in MEM_GOING_ONLINE+page_cgroup will access a not-yet-initialized zonelist of the node. Anyway, there are memory-less nodes and we need some care.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by Michal Hocko

Commit 56de7263 ("mm: compaction: direct compact when a high-order allocation fails") introduced a check for cc->order == -1 in compact_finished. We should continue compacting in that case because the request came from userspace and there is no particular order to compact for. A similar check was added by 82478fb7 ("mm: compaction: prevent division-by-zero during user-requested compaction") for compaction_suitable.

The check is, however, done after zone_watermark_ok, which uses order as a right-hand argument for shifts. Not only is the watermark check pointless if we can break out without it, it also uses 1 << -1, which is not well defined (at least by the C standard). Let's move the -1 check above zone_watermark_ok.

[minchan.kim@gmail.com: caught compaction_suitable]
Signed-off-by: Michal Hocko <mhocko@suse.cz>
Cc: Mel Gorman <mgorman@suse.de>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com>
Acked-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
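A sketch of the reordered check (shape assumed, wrapped in a helper for illustration): the userspace "compact everything" request is handled before any order-based watermark arithmetic, so 1 << -1 is never evaluated.

    static unsigned long order_and_watermark_check(struct zone *zone, int order,
                                                   unsigned long watermark)
    {
            /* order == -1: user-requested compaction, no target order. */
            if (order == -1)
                    return COMPACT_CONTINUE;

            /* Only now is it safe to use order in watermark arithmetic. */
            if (!zone_watermark_ok(zone, order, watermark, 0, 0))
                    return COMPACT_SKIPPED;

            return COMPACT_CONTINUE;
    }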
-
Submitted by Steven Rostedt

Running a ktest.pl test, I hit the following bug on x86_32:

  ------------[ cut here ]------------
  WARNING: at arch/x86/mm/highmem_32.c:81 __kunmap_atomic+0x64/0xc1()
  Hardware name:
  Modules linked in:
  Pid: 93, comm: sh Not tainted 2.6.39-test+ #1
  Call Trace:
   [<c04450da>] warn_slowpath_common+0x7c/0x91
   [<c042f5df>] ? __kunmap_atomic+0x64/0xc1
   [<c042f5df>] ? __kunmap_atomic+0x64/0xc1
   [<c0445111>] warn_slowpath_null+0x22/0x24
   [<c042f5df>] __kunmap_atomic+0x64/0xc1
   [<c04d4a22>] unmap_vmas+0x43a/0x4e0
   [<c04d9065>] exit_mmap+0x91/0xd2
   [<c0443057>] mmput+0x43/0xad
   [<c0448358>] exit_mm+0x111/0x119
   [<c044855f>] do_exit+0x1ff/0x5fa
   [<c0454ea2>] ? set_current_blocked+0x3c/0x40
   [<c0454f24>] ? sigprocmask+0x7e/0x8e
   [<c0448b55>] do_group_exit+0x65/0x88
   [<c0448b90>] sys_exit_group+0x18/0x1c
   [<c0c3915f>] sysenter_do_call+0x12/0x38
  ---[ end trace 8055f74ea3c0eb62 ]---

Running a ktest.pl git bisect found the culprit: commit e303297e ("mm: extended batches for generic mmu_gather"). But although this was the commit triggering the bug, it was not the one originally responsible for the bug. That was commit d16dfc55 ("mm: mmu_gather rework").

The code in zap_pte_range() has something that looks like the following:

  pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
  do {
          [...]
  } while (pte++, addr += PAGE_SIZE, addr != end);
  pte_unmap_unlock(pte - 1, ptl);

The pte starts off pointing at the first element in the page table directory that was returned by pte_offset_map_lock(). When it's done with the page, pte will be pointing to anything between the next entry and the first entry of the next page, inclusive. By doing pte - 1, this puts the pte back onto the original page, which is all that pte_unmap_unlock() needs.

In most archs (64 bit), this is not an issue as the pte is ignored in pte_unmap_unlock(). But on 32 bit archs, where things may be kmapped, it is essential that the pte passed to pte_unmap_unlock() resides on the same page that was given by pte_offset_map_lock().

The problem came in d16dfc55 ("mm: mmu_gather rework") where it introduced a "break;" from the while loop. This alone did not seem to easily trigger the bug. But the modifications made by e303297e caused that "break;" to be hit on the first iteration, before the pte++. The pte not being incremented will now cause pte_unmap_unlock(pte - 1) to point to the previous page. This will cause the wrong page to be unmapped, and also trigger the warning above.

The simple solution is to just save the pointer given by pte_offset_map_lock() and use it in the unlock.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Hugh Dickins <hughd@google.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
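The fix, sketched against the quoted loop in the same abbreviated style (illustrative, not the exact hunk): keep the pointer returned by pte_offset_map_lock() and unlock with that, instead of relying on "pte - 1" after a loop that may break early.

  start_pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
  pte = start_pte;
  do {
          [...]   /* may "break;" before pte is incremented */
  } while (pte++, addr += PAGE_SIZE, addr != end);
  pte_unmap_unlock(start_pte, ptl);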
-
Submitted by KOSAKI Motohiro

While testing the memcg-aware swap token, I observed that the swap token was often grabbed by an intermittently running process (e.g. init, auditd) which then never released the token.

Why? Some processes (e.g. init, auditd, audispd) wake up when a process exits, and the swap token can be grabbed by the first process to page in after the exiting process leaves the token without an owner. Thus such intermittently running processes often get the token.

Currently, the swap token priority is only decreased in the page fault path. So if a process sleeps immediately after grabbing the swap token, its token priority is never decreased. That's obviously undesirable.

This patch implements very poor (and lightweight) priority aging. It only affects the above corner case and doesn't change swap-tendency workload performance (e.g. multi-process qsbench load).

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Submitted by KOSAKI Motohiro

This is useful for observing swap token activity.

Example output:

  zsh-1845    [000]  598.962716: update_swap_token_priority: mm=ffff88015eaf7700 old_prio=1 new_prio=0
  memtoy-1830 [001]  602.033900: update_swap_token_priority: mm=ffff880037a45880 old_prio=947 new_prio=949
  memtoy-1830 [000]  602.041509: update_swap_token_priority: mm=ffff880037a45880 old_prio=949 new_prio=951
  memtoy-1830 [000]  602.051959: update_swap_token_priority: mm=ffff880037a45880 old_prio=951 new_prio=953
  memtoy-1830 [000]  602.052188: update_swap_token_priority: mm=ffff880037a45880 old_prio=953 new_prio=955
  memtoy-1830 [001]  602.427184: put_swap_token: token_mm=ffff880037a45880
  zsh-1789    [000]  602.427281: replace_swap_token: old_token_mm= (null) old_prio=0 new_token_mm=ffff88015eaf7018 new_prio=2
  zsh-1789    [001]  602.433456: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=2 new_prio=4
  zsh-1789    [000]  602.437613: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=4 new_prio=6
  zsh-1789    [000]  602.443924: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=6 new_prio=8
  zsh-1789    [000]  602.451873: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=8 new_prio=10
  zsh-1789    [001]  602.462639: update_swap_token_priority: mm=ffff88015eaf7018 old_prio=10 new_prio=12

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-