- 13 4月, 2010 4 次提交
-
-
由 Linus Torvalds 提交于
Otherwise we might be mapping in a page in a new mapping, but that page (through the swapcache) would later be mapped into an old mapping too. The page->mapping must be the case that works for everybody, not just the mapping that happened to page it in first. Here's the scenario: - page gets allocated/mapped by process A. Let's call the anon_vma we associate the page with 'A' to keep it easy to track. - Process A forks, creating process B. The anon_vma in B is 'B', and has a chain that looks like 'B' -> 'A'. Everything is fine. - Swapping happens. The page (with mapping pointing to 'A') gets swapped out (perhaps not to disk - it's enough to assume that it's just not mapped any more, and lives entirely in the swap-cache) - Process B pages it in, which goes like this: do_swap_page -> page = lookup_swap_cache(entry); ... set_pte_at(mm, address, page_table, pte); page_add_anon_rmap(page, vma, address); And think about what happens here! In particular, what happens is that this will now be the "first" mapping of that page, so page_add_anon_rmap() used to do if (first) __page_set_anon_rmap(page, vma, address); and notice what anon_vma it will use? It will use the anon_vma for process B! What happens then? Trivial: process 'A' also pages it in (nothing happens, it's not the first mapping), and then process 'B' execve's or exits or unmaps, making anon_vma B go away. End result: process A has a page that points to anon_vma B, but anon_vma B does not exist any more. This can go on forever. Forget about RCU grace periods, forget about locking, forget anything like that. The bug is simply that page->mapping points to an anon_vma that was correct at one point, but was _not_ the one that was shared by all users of that possible mapping. Changing it to always use the deepest anon_vma in the anonvma chain gets us to the safest model. This can be improved in certain cases: if we know the page is private to just this particular mapping (for example, it's a new page, or it is the only swapcache entry), we could pick the top (most specific) anon_vma. But that's a future optimization. Make it _work_ reliably first. Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "What do you know, I think you fixed it!" ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
We want to walk the chain in reverse order when cloning it, so that the order of the result chain will be the same as the order in the source chain. When we add entries to the chain, they go at the head of the chain, so we want to add the source head last. Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "No, it still oopses" ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
When we move the boundaries between two vma's due to things like mprotect, we need to make sure that the anon_vma of the pages that got moved from one vma to another gets properly copied around. And that was not always the case, in this rather hard-to-follow code sequence. Clarify the code, and fix it so that it copies the anon_vma from the right source. Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "Yeah, not so much this one either" ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Linus Torvalds 提交于
This changes the anon_vma reuse case to require that we only reuse simple anon_vma's - ie the case when the vma only has a single anon_vma associated with it. This means that a reuse of an anon_vma from an adjacent vma will always guarantee that both vma's are associated not only with the same anon_vma, they will also have the same anon_vma chain (of just a single entry in this case). And since anon_vma re-use was the only case where the same anon_vma might be associated with different chains of anon_vma's, we now have the case that every vma that shares the same anon_vma will always also have the same chain. That makes it much easier to think about merging vma's that share the same anon_vma's: you can always just drop the other anon_vma chain in anon_vma_merge() since you know that they are always identical. This also splits up the function to validate the anon_vma re-use, and adds a lot of commentary about the possible races. Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Tested-by: Borislav Petkov <bp@alien8.de> [ "That didn't fix it" ] Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 4月, 2010 2 次提交
-
-
由 Pekka Enberg 提交于
As suggested by Linus, fix up kmem_ptr_validate() to handle non-kernel pointers more graciously. The patch changes kmem_ptr_validate() to use the newly introduced kern_ptr_validate() helper to check that a pointer is a valid kernel pointer before we attempt to convert it into a 'struct page'. Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi> Acked-by: NChristoph Lameter <cl@linux-foundation.org> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Pekka Enberg 提交于
As suggested by Linus, introduce a kern_ptr_validate() helper that does some sanity checks to make sure a pointer is a valid kernel pointer. This is a preparational step for fixing SLUB kmem_ptr_validate(). Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Matt Mackall <mpm@selenic.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 4月, 2010 5 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
Presently, memcg's FILE_MAPPED accounting has following race with move_account (happens at rmdir()). increment page->mapcount (rmap.c) mem_cgroup_update_file_mapped() move_account() lock_page_cgroup() check page_mapped() if page_mapped(page)>1 { FILE_MAPPED -1 from old memcg FILE_MAPPED +1 to old memcg } ..... overwrite pc->mem_cgroup unlock_page_cgroup() lock_page_cgroup() FILE_MAPPED + 1 to pc->mem_cgroup unlock_page_cgroup() Then, old memcg (-1 file mapped) new memcg (+2 file mapped) This happens because move_account see page_mapped() which is not guarded by lock_page_cgroup(). This patch adds FILE_MAPPED flag to page_cgroup and move account information based on it. Now, all checks are synchronous with lock_page_cgroup(). Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NBalbir Singh <balbir@in.ibm.com> Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Andrea Righi <arighi@develer.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
When we look into pagemap using page-types with option -p, the value of pfn for hugepages looks wrong (see below.) This is because pte was evaluated only once for one vma although it should be updated for each hugepage. This patch fixes it. $ page-types -p 3277 -Nl -b huge voffset offset len flags 7f21e8a00 11e400 1 ___U___________H_G________________ 7f21e8a01 11e401 1ff ________________TG________________ ^^^ 7f21e8c00 11e400 1 ___U___________H_G________________ 7f21e8c01 11e401 1ff ________________TG________________ ^^^ One hugepage contains 1 head page and 511 tail pages in x86_64 and each two lines represent each hugepage. Voffset and offset mean virtual address and physical address in the page unit, respectively. The different hugepages should not have the same offset value. With this patch applied: $ page-types -p 3386 -Nl -b huge voffset offset len flags 7fec7a600 112c00 1 ___UD__________H_G________________ 7fec7a601 112c01 1ff ________________TG________________ ^^^ 7fec7a800 113200 1 ___UD__________H_G________________ 7fec7a801 113201 1ff ________________TG________________ ^^^ OK More info: - This patch modifies walk_page_range()'s hugepage walker. But the change only affects pagemap_read(), which is the only caller of hugepage callback. - Without this patch, hugetlb_entry() callback is called per vma, that doesn't match the natural expectation from its name. - With this patch, hugetlb_entry() is called per hugepte entry and the callback can become much simpler. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NMatt Mackall <mpm@selenic.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Shaohua Li reported his tmpfs streaming I/O test can lead to make oom. The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are 6 copies of kernel source and the test does kbuild for each copy. His investigation shows the test has a lot of rotated anon pages and quite few file pages, so get_scan_ratio calculates percent[0] (i.e. scanning percent for anon) to be zero. Actually the percent[0] shoule be a big value, but our calculation round it to zero. Although before commit 84b18490 ("vmscan: get_scan_ratio() cleanup") , we have the same problem too. But the old logic can rescue percent[0]==0 case only when priority==0. It had hided the real issue. I didn't think merely streaming io can makes percent[0]==0 && priority==0 situation. but I was wrong. So, definitely we have to fix such tmpfs streaming io issue. but anyway I revert the regression commit at first. This reverts commit 84b18490. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: NShaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wu Fengguang 提交于
btrfs relocate_file_extent_cluster() calls us with NULL filp: [ 4005.426805] BUG: unable to handle kernel NULL pointer dereference at 00000021 [ 4005.426818] IP: [<c109a130>] page_cache_sync_readahead+0x18/0x3e Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Cc: Yan Zheng <yanzheng@21cn.com> Reported-by: NKirill A. Shutemov <kirill@shutemov.name> Tested-by: NKirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
- We weren't zeroing p->rss_stat[] at fork() - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel threads and was oopsing. - Make __sync_task_rss_stat() static, too. Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648 [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)] Reported-by: NTroels Liebe Bentsen <tlb@rapanden.dk> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> "Michael S. Tsirkin" <mst@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 06 4月, 2010 1 次提交
-
-
由 Rik van Riel 提交于
Fix a memory leak in anon_vma_fork(), where we fail to tear down the anon_vmas attached to the new VMA in case setting up the new anon_vma fails. This bug also has the potential to leave behind anon_vma_chain structs with pointers to invalid memory. Reported-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 02 4月, 2010 3 次提交
-
-
由 Anton Blanchard 提交于
I hit this when we had a bug in IDR for a few days. Basically sysfs would fail to create new inodes since it uses an IDR and therefore class_create would fail. While we are unlikely to see this fail we may as well handle it instead of oopsing. Signed-off-by: NAnton Blanchard <anton@samba.org> Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
由 Yinghai Lu 提交于
When 32bit numa is used, free_all_bootmem() will still only go over with node id 0. If node 0 doesn't have RAM installed, the lowest populated node becomes low RAM. This one fixes BOOTMEM path by iterating over the bdata_list. -v3: add more comments, and fix bootmem path too. -v4: seperate from one big patch Signed-off-by: NYinghai Lu <yinghai@kernel.org> LKML-Reference: <4BB416D7.6090203@kernel.org> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
由 Yinghai Lu 提交于
On one system without RAM on node0, got following boot dump with a 32 bit NUMA kernel: early_node_map[4] active PFN ranges 1: 0x00000010 -> 0x00000099 1: 0x00000100 -> 0x0007da00 1: 0x0007e800 -> 0x0007ffa0 1: 0x0007ffae -> 0x0007ffb0 ... Subtract (29 early reservations) #000 [0000001000 - 0000002000] #001 [0000089000 - 000008f000] #002 [0000091000 - 0000093500] ... #027 [007cbfef40 - 007e800000] #028 [007e9ca000 - 007ff95000] (0 free memory ranges) Initializing HighMem for node 0 (00000000:00000000) Initializing HighMem for node 1 (00000000:00000000) Memory: 0k/2096832k available (6662k kernel code, 2096300k reserved, 4829k data, 484k init, 0k highmem) ... Checking if this processor honours the WP bit even in supervisor mode...Ok. swapper: page allocation failure. order:0, mode:0x0 Pid: 0, comm: swapper Not tainted 2.6.34-rc3-tip-03818-g4b1ea6c-dirty #35 Call Trace: [<4087a5dc>] ? printk+0xf/0x11 [<40286728>] __alloc_pages_nodemask+0x417/0x487 [<402a9ce1>] new_slab+0xe2/0x1fe [<402aa5b2>] kmem_cache_open+0x185/0x358 [<402abbc0>] T.954+0x1c/0x60 [<40d52a29>] kmem_cache_init+0x24/0x113 [<40d39738>] start_kernel+0x166/0x2e4 [<40d3940e>] ? unknown_bootoption+0x0/0x18e [<40d390ce>] i386_start_kernel+0xce/0xd5 Mem-Info: Node 1 DMA per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 Node 1 Normal per-cpu: CPU 0: hi: 0, btch: 1 usd: 0 active_anon:0 inactive_anon:0 isolated_anon:0 active_file:0 inactive_file:0 isolated_file:0 unevictable:0 dirty:0 writeback:0 unstable:0 free:0 slab_reclaimable:0 slab_unreclaimable:0 mapped:0 shmem:0 pagetables:0 bounce:0 When 32bit NUMA is used, free_all_bootmem() will still only go over with node id 0. If node 0 doesn't have RAM installed, We need to go with node1 because early_node_map still use 1 for all ranges, and ram from node1 become low ram. Use MAX_NUMNODES like 64-bit NUMA does. Note: BOOTMEM path has the same problem. this bug exist before We have NO_BOOTMEM support. -v3: add more comments, and fix bootmem path too. -v4: seperate bootmem path fix Signed-off-by: NYinghai Lu <yinghai@kernel.org> LKML-Reference: <4BB41689.9090502@kernel.org> Signed-off-by: NH. Peter Anvin <hpa@zytor.com>
-
- 30 3月, 2010 3 次提交
-
-
由 Tejun Heo 提交于
percpu.h has always been including slab.h to get k[mz]alloc/free() for UP inline implementation. percpu.h being used by very low level headers including module.h and sched.h, this meant that a lot files unintentionally got slab.h inclusion. Lee Schermerhorn was trying to make topology.h use percpu.h and got bitten by this implicit inclusion. The right thing to do is break this ultimately unnecessary dependency. The previous patch added explicit inclusion of either gfp.h or slab.h to the source files using them. This patch updates percpu.h such that slab.h is no longer included from percpu.h. Signed-off-by: NTejun Heo <tj@kernel.org> Reviewed-by: NChristoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
-
由 Randy Dunlap 提交于
mm/kmemcheck.c:69: error: dereferencing pointer to incomplete type mm/kmemcheck.c:69: error: 'SLAB_NOTRACK' undeclared (first use in this function) mm/kmemcheck.c:82: error: dereferencing pointer to incomplete type mm/kmemcheck.c:94: error: dereferencing pointer to incomplete type mm/kmemcheck.c:94: error: dereferencing pointer to incomplete type mm/kmemcheck.c:94: error: 'SLAB_DESTROY_BY_RCU' undeclared (first use in this function) Signed-off-by: NRandy Dunlap <randy.dunlap@oracle.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Tejun Heo 提交于
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: NTejun Heo <tj@kernel.org> Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
-
- 29 3月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
lockdep has custom code to check whether a pointer belongs to static percpu area which is somewhat broken. Implement proper is_kernel/module_percpu_address() and replace the custom code. On UP, percpu variables are regular static variables and can't be distinguished from them. Always return %false on UP. Signed-off-by: NTejun Heo <tj@kernel.org> Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Ingo Molnar <mingo@redhat.com>
-
- 26 3月, 2010 2 次提交
-
-
由 David Howells 提交于
Fix __get_user_pages() to make it pin the last page on a buffer that doesn't begin at the start of a page, but is a multiple of PAGE_SIZE in size. The problem is that __get_user_pages() advances the pointer too much when it iterates to the next page if the page it's currently looking at isn't used from the first byte. This can cause the end of a short VMA to be reached prematurely, resulting in the last page being lost. Signed-off-by: NSteven J. Magnani <steve@digidescorp.com> Signed-off-by: NDavid Howells <dhowells@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Howells 提交于
Revert the following patch: commit c08c6e1f Author: Steven J. Magnani <steve@digidescorp.com> Date: Fri Mar 5 13:42:24 2010 -0800 nommu: get_user_pages(): pin last page on non-page-aligned start As it assumes that the mappings begin at the start of pages - something that isn't necessarily true on NOMMU systems. On NOMMU systems, it is possible for a mapping to only occupy part of the page, and not necessarily touch either end of it; in fact it's also possible for multiple non-overlapping mappings to coexist on one page (consider direct mappings of ROMFS files, for example). Signed-off-by: NDavid Howells <dhowells@redhat.com> Acked-by: NSteven J. Magnani <steve@digidescorp.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 3月, 2010 11 次提交
-
-
由 Lee Schermerhorn 提交于
Discovered while testing other mempolicy changes: get_mempolicy() does not handle static/relative mode flags correctly. Return the value that the user specified so that it can be restored via set_mempolicy() if desired. Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michael S. Tsirkin 提交于
In 2.6.34-rc1, removing vhost_net module causes an oops in sync_mm_rss (called from do_exit) when workqueue is destroyed. This does not happen on net-next, or with vhost on top of to 2.6.33. The issue seems to be introduced by 34e55232 ("mm: avoid false sharing of mm_counter) which added sync_mm_rss() that is passed task->mm, and dereferences it without checking. If task is a kernel thread, mm might be NULL. I think this might also happen e.g. with aio. This patch fixes the oops by calling sync_mm_rss when task->mm is set to NULL. I also added BUG_ON to detect any other cases where counters get incremented while mm is NULL. The oops I observed looks like this: BUG: unable to handle kernel NULL pointer dereference at 00000000000002a8 IP: [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f PGD 0 Oops: 0002 [#1] SMP last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map CPU 2 Modules linked in: vhost_net(-) tun bridge stp sunrpc ipv6 cpufreq_ondemand acpi_cpufreq freq_table kvm_intel kvm i5000_edac edac_core rtc_cmos bnx2 button i2c_i801 i2c_core rtc_core e1000e sg joydev ide_cd_mod serio_raw pcspkr rtc_lib cdrom virtio_net virtio_blk virtio_pci virtio_ring virtio af_packet e1000 shpchp aacraid uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode] Pid: 2046, comm: vhost Not tainted 2.6.34-rc1-vhost #25 System Planar/IBM System x3550 -[7978B3G]- RIP: 0010:[<ffffffff810b436d>] [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f RSP: 0018:ffff8802379b7e60 EFLAGS: 00010202 RAX: 0000000000000008 RBX: ffff88023f2390c0 RCX: 0000000000000000 RDX: ffff88023f2396b0 RSI: 0000000000000000 RDI: ffff88023f2390c0 RBP: ffff8802379b7e60 R08: 0000000000000000 R09: 0000000000000000 R10: ffff88023aecfbc0 R11: 0000000000013240 R12: 0000000000000000 R13: ffffffff81051a6c R14: ffffe8ffffc0f540 R15: 0000000000000000 FS: 0000000000000000(0000) GS:ffff880001e80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000002a8 CR3: 000000023af23000 CR4: 00000000000406e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process vhost (pid: 2046, threadinfo ffff8802379b6000, task ffff88023f2390c0) Stack: ffff8802379b7ee0 ffffffff81040687 ffffe8ffffc0f558 ffffffffa00a3e2d <0> 0000000000000000 ffff88023f2390c0 ffffffff81055817 ffff8802379b7e98 <0> ffff8802379b7e98 0000000100000286 ffff8802379b7ee0 ffff88023ad47d78 Call Trace: [<ffffffff81040687>] do_exit+0x147/0x6c4 [<ffffffffa00a3e2d>] ? handle_rx_net+0x0/0x17 [vhost_net] [<ffffffff81055817>] ? autoremove_wake_function+0x0/0x39 [<ffffffff81051a6c>] ? worker_thread+0x0/0x229 [<ffffffff810553c9>] kthreadd+0x0/0xf2 [<ffffffff810038d4>] kernel_thread_helper+0x4/0x10 [<ffffffff81055342>] ? kthread+0x0/0x87 [<ffffffff810038d0>] ? kernel_thread_helper+0x0/0x10 Code: 00 8b 87 6c 02 00 00 85 c0 74 14 48 98 f0 48 01 86 a0 02 00 00 c7 87 6c 02 00 00 00 00 00 00 8b 87 70 02 00 00 85 c0 74 14 48 98 <f0> 48 01 86 a8 02 00 00 c7 87 70 02 00 00 00 00 00 00 8b 87 74 RIP [<ffffffff810b436d>] sync_mm_rss+0x33/0x6f RSP <ffff8802379b7e60> CR2: 00000000000002a8 ---[ end trace 41603ba922beddd2 ]--- Fixing recursive fault but reboot is needed! (note: handle_rx_net is a work item using workqueue in question). sync_mm_rss+0x33/0x6f gave me a hint. I also tried reverting 34e55232 and the oops goes away. The module in question calls use_mm and later unuse_mm from a kernel thread. It is when this kernel thread is destroyed that the crash happens. Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Andrea Arcangeli <aarcange@redhat.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
mpol_parse_str() made lots 'err' variable related bug. Because it is ugly and reviewing unfriendly. This patch simplifies it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
commit 71fe804b (mempolicy: use struct mempolicy pointer in shmem_sb_info) added mpol=local mount option. but its feature is broken since it was born. because such code always return 1 (i.e. mount failure). This patch fixes it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Currently, following mount operation cause mount error. % mount -t tmpfs -ompol=bind:0 none /tmp Because commit 71fe804b (mempolicy: use struct mempolicy pointer in shmem_sb_info) corrupted MPOL_BIND parse code. This patch restore the needed one. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ravikiran G Thirumalai 提交于
Fix an 'oops' when a tmpfs mount point is mounted with the mpol=default mempolicy. Upon remounting a tmpfs mount point with 'mpol=default' option, the mount code crashed with a null pointer dereference. The initial problem report was on 2.6.27, but the problem exists in mainline 2.6.34-rc as well. On examining the code, we see that mpol_new returns NULL if default mempolicy was requested. This 'NULL' mempolicy is accessed to store the node mask resulting in oops. The following patch fixes it. Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Robin Holt 提交于
ksm.c's write_protect_page implements a lockless means of verifying a page does not have any users of the page which are not accounted for via other kernel tracking means. It does this by removing the writable pte with TLB flushes, checking the page_count against the total known users, and then using set_pte_at_notify to make it a read-only entry. An unneeded mmu_notifier callout is made in the case where the known users does not match the page_count. In that event, we are inserting the identical pte and there is no need for the set_pte_at_notify, but rather the simpler set_pte_at suffices. Signed-off-by: NRobin Holt <holt@sgi.com> Acked-by: NIzik Eidus <ieidus@redhat.com> Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Chris Wright <chrisw@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Howells 提交于
Fix an incorrect comment in the do_mmap_shared_file(). If a mapping is requested MAP_SHARED, then a private copy cannot be made and still provide correct semantics. Signed-off-by: NDavid Howells <dhowells@redhat.com> Reported-by: NDave Hudson <uclinux@blueteddy.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dan Carpenter 提交于
There was a potential null deref introduced in c62b1a3b ("memcg: use generic percpu instead of private implementation"). Signed-off-by: NDan Carpenter <error27@gmail.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daisuke Nishimura 提交于
In commit 02491447 ("memcg: move charges of anonymous swap"), I tried to disable move charge feature in no mmu case by enclosing all the related functions with "#ifdef CONFIG_MMU", but the commit places these ifdefs in wrong place. (it seems that it's mangled while handling some fixes...) This patch fixes it up. Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jiri Kosina 提交于
Commit 08677214 ("x86: Make 64 bit use early_res instead of bootmem before slab") introduced early_res replacement for bootmem, but left code in __free_pages_memory() which dumps all the ranges that are beeing freed, without any additional information, causing some noise in dmesg during bootup. Just remove printing of the ranges, that doesn't provide anything useful anyway. While at it, remove other commented-out KERN_DEBUG messages in the NO_BOOTMEM code as well. Signed-off-by: NJiri Kosina <jkosina@suse.cz> Found-OK-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Yinghai Lu <yinghai@kernel.org> LKML-Reference: <alpine.LNX.2.00.1003220931360.18642@pobox.suse.cz> Signed-off-by: NIngo Molnar <mingo@elte.hu>
-
- 18 3月, 2010 1 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
swap_cgroup uses 2bytes data and uses cmpxchg in a new operation. 2byte cmpxchg/xchg is not available on some archs. This patch replaces cmpxchg/xchg with operations under lock. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reported-by: Sachin Sant <sachinp@in.ibm.com> wrote: Acked-by: NBalbir Singh <balbir@in.ibm.com> Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 3月, 2010 7 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
In current page-fault code, handle_mm_fault() -> ... -> mem_cgroup_charge() -> map page or handle error. -> check return code. If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if it's caused by memcg, OOM should have been already invoked. Then, I added a patch: a636b327. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory from being invoked in near future. But Nishimura-san reported that check by jiffies is not enough when the system is terribly heavy. This patch changes memcg's oom logic as. * If memcg causes OOM-kill, continue to retry. * remove jiffies check which is used now. * add memcg-oom-lock which works like perzone oom lock. * If current is killed(as a process), bypass charge. Something more sophisticated can be added but this pactch does fundamental things. TODO: - add oom notifier - add permemcg disable-oom-kill flag and freezer at oom. - more chances for wake up oom waiter (when changing memory limit etc..) Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Events should be removed after rmdir of cgroup directory, but before destroying subsystem state objects. Let's take reference to cgroup directory dentry to do that. Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hioryu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Acked-by: NLi Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: Dan Malek <dan@embeddedalley.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Presently, if panic_on_oom=2, the whole system panics even if the oom happend in some special situation (as cpuset, mempolicy....). Then, panic_on_oom=2 means painc_on_oom_always. Now, memcg doesn't check panic_on_oom flag. This patch adds a check. BTW, how it's useful ? kdump+panic_on_oom=2 is the last tool to investigate what happens in oom-ed system. When a task is killed, the sysytem recovers and there will be few hint to know what happnes. In mission critical system, oom should never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by knowing precise information via snapshot. TODO: - For memcg, it's for isolate system's memory usage, oom-notiifer and freeze_at_oom (or rest_at_oom) should be implemented. Then, management daemon can do similar jobs (as kdump) or taking snapshot per cgroup. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Memcg has 2 eventcountes which counts "the same" event. Just usages are different from each other. This patch tries to reduce event counter. Now logic uses "only increment, no reset" counter and masks for each checks. Softlimit chesk was done per 1000 evetns. So, the similar check can be done by !(new_counter & 0x3ff). Threshold check was done per 100 events. So, the similar check can be done by (!new_counter & 0x7f) ALL event checks are done right after EVENT percpu counter is updated. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Presently, move_task does "batched" precharge. Because res_counter or css's refcnt are not-scalable jobs for memcg, try_charge_().. tend to be done in batched manner if allowed. Now, softlimit and threshold check their event counter in try_charge, but the charge is not a per-page event. And event counter is not updated at charge(). Moreover, precharge doesn't pass "page" to try_charge() and softlimit tree will be never updated until uncharge() causes an event." So the best place to check the event counter is commit_charge(). This is per-page event by its nature. This patch move checks to there. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
When per-cpu counter for memcg was implemneted, dynamic percpu allocator was not very good. But now, we have good one and useful macros. This patch replaces memcg's private percpu counter implementation with generic dynamic percpu allocator. The benefits are - We can remove private implementation. - The counters will be NUMA-aware. (Current one is not...) - This patch makes sizeof struct mem_cgroup smaller. Then, struct mem_cgroup may be fit in page size on small config. - About basic performance aspects, see below. [Before] # size mm/memcontrol.o text data bss dec hex filename 24373 2528 4132 31033 7939 mm/memcontrol.o [page-fault-throuput test on 8cpu/SMP in root cgroup] # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8 Performance counter stats for './multi-fault-fork 8' (5 runs): 45878618 page-faults ( +- 0.110% ) 602635826 cache-misses ( +- 0.105% ) 61.005373262 seconds time elapsed ( +- 0.004% ) Then cache-miss/page fault = 13.14 [After] #size mm/memcontrol.o text data bss dec hex filename 23913 2528 4132 30573 776d mm/memcontrol.o # /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8 Performance counter stats for './multi-fault-fork 8' (5 runs): 48179400 page-faults ( +- 0.271% ) 588628407 cache-misses ( +- 0.136% ) 61.004615021 seconds time elapsed ( +- 0.004% ) Then cache-miss/page fault = 12.22 Text size is reduced. This performance improvement is not big and will be invisible in real world applications. But this result shows this patch has some good effect even on (small) SMP. Here is a test program I used. 1. fork() processes on each cpus. 2. do page fault repeatedly on each process. 3. after 60secs, kill all childredn and exit. (3 is necessary for getting stable data, this is improvement from previous one.) #define _GNU_SOURCE #include <stdio.h> #include <sched.h> #include <sys/mman.h> #include <sys/types.h> #include <sys/stat.h> #include <fcntl.h> #include <signal.h> #include <stdlib.h> /* * For avoiding contention in page table lock, FAULT area is * sparse. If FAULT_LENGTH is too large for your cpus, decrease it. */ #define FAULT_LENGTH (2 * 1024 * 1024) #define PAGE_SIZE 4096 #define MAXNUM (128) void alarm_handler(int sig) { } void *worker(int cpu, int ppid) { void *start, *end; char *c; cpu_set_t set; int i; CPU_ZERO(&set); CPU_SET(cpu, &set); sched_setaffinity(0, sizeof(set), &set); start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, 0, 0); if (start == MAP_FAILED) { perror("mmap"); exit(1); } end = start + FAULT_LENGTH; pause(); //fprintf(stderr, "run%d", cpu); while (1) { for (c = (char*)start; (void *)c < end; c += PAGE_SIZE) *c = 0; madvise(start, FAULT_LENGTH, MADV_DONTNEED); } return NULL; } int main(int argc, char *argv[]) { int num, i, ret, pid, status; int pids[MAXNUM]; if (argc < 2) return 0; setpgid(0, 0); signal(SIGALRM, alarm_handler); num = atoi(argv[1]); pid = getpid(); for (i = 0; i < num; ++i) { ret = fork(); if (!ret) { worker(i, pid); exit(0); } pids[i] = ret; } sleep(1); kill(-pid, SIGALRM); sleep(60); for (i = 0; i < num; i++) kill(pids[i], SIGKILL); for (i = 0; i < num; i++) waitpid(pids[i], &status, 0); return 0; } Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/ Signed-off-by: NKirill A. Shutemov <kirill@shutemov.name> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-