- 09 1月, 2009 10 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
For accounting swap, we need a record per swap entry, at least. This patch adds following function. - swap_cgroup_swapon() .... called from swapon - swap_cgroup_swapoff() ... called at the end of swapoff. - swap_cgroup_record() .... record information of swap entry. - swap_cgroup_lookup() .... lookup information of swap entry. This patch just implements "how to record information". No actual method for limit the usage of swap. These routine uses flat table to record and lookup. "wise" lookup system like radix-tree requires requires memory allocation at new records but swap-out is usually called under memory shortage (or memcg hits limit.) So, I used static allocation. (maybe dynamic allocation is not very hard but it adds additional memory allocation in memory shortage path.) Note1: In this, we use pointer to record information and this means 8bytes per swap entry. I think we can reduce this when we create "id of cgroup" in the range of 0-65535 or 0-255. Reported-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Reported-by: NHugh Dickins <hugh@veritas.com> Reported-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Reported-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Config and control variable for mem+swap controller. This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP (memory resource controller swap extension.) For accounting swap, it's obvious that we have to use additional memory to remember "who uses swap". This adds more overhead. So, it's better to offer "choice" to users. This patch adds 2 choices. This patch adds 2 parameters to enable swap extension or not. - CONFIG - boot option Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
SwapCache support for memory resource controller (memcg) Before mem+swap controller, memcg itself should handle SwapCache in proper way. This is cut-out from it. In current memcg, SwapCache is just leaked and the user can create tons of SwapCache. This is a leak of account and should be handled. SwapCache accounting is done as following. charge (anon) - charged when it's mapped. (because of readahead, charge at add_to_swap_cache() is not sane) uncharge (anon) - uncharged when it's dropped from swapcache and fully unmapped. means it's not uncharged at unmap. Note: delete from swap cache at swap-in is done after rmap information is established. charge (shmem) - charged at swap-in. this prevents charge at add_to_page_cache(). uncharge (shmem) - uncharged when it's dropped from swapcache and not on shmem's radix-tree. at migration, check against 'old page' is modified to handle shmem. Comparing to the old version discussed (and caused troubles), we have advantages of - PCG_USED bit. - simple migrating handling. So, situation is much easier than several months ago, maybe. [hugh@veritas.com: memcg: handle swap caches build fix] Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
By memcg-move-all-accounts-to-parent-at-rmdir.patch, there is no leak of memory usage and force_empty is removed. This patch adds "force_empty" again, in reasonable manner. memory.force_empty file works when #echo 0 (or some) > memory.force_empty and have following function. 1. only works when there are no task in this cgroup. 2. free all page under this cgroup as much as possible. 3. page which cannot be freed will be moved up to parent. 4. Then, memcg will be empty after above echo returns. This is much better behavior than old "force_empty" which just forget all accounts. This patch also check signal_pending() and above "echo" can be stopped by "Ctrl-C". [akpm@linux-foundation.org: cleanup] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Blunck 提交于
As Jan Blunck <jblunck@suse.de> pointed out, allocating per-cpu stat for memcg to the size of NR_CPUS is not good. This patch changes mem_cgroup's cpustat allocation not based on NR_CPUS but based on nr_cpu_ids. Reviewed-by: NLi Zefan <lizf@cn.fujitsu.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
This patch provides a function to move account information of a page between mem_cgroups and rewrite force_empty to make use of this. This moving of page_cgroup is done under - lru_lock of source/destination mem_cgroup is held. - lock_page_cgroup() is held. Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should confirm pc->mem_cgroup is still valid or not. Typical code can be following. (while page is not under lock_page()) mem = pc->mem_cgroup; mz = page_cgroup_zoneinfo(pc) spin_lock_irqsave(&mz->lru_lock); if (pc->mem_cgroup == mem) ...../* some list handling */ spin_unlock_irqrestore(&mz->lru_lock); Of course, better way is lock_page_cgroup(pc); .... unlock_page_cgroup(pc); But you should confirm the nest of lock and avoid deadlock. If you treats page_cgroup from mem_cgroup's LRU under mz->lru_lock, you don't have to worry about what pc->mem_cgroup points to. moved pages are added to head of lru, not to tail. Expected users of this routine is: - force_empty (rmdir) - moving tasks between cgroup (for moving account information.) - hierarchy (maybe useful.) force_empty(rmdir) uses this move_account and move pages to its parent. This "move" will not cause OOM (I added "oom" parameter to try_charge().) If the parent is busy (not enough memory), force_empty calls try_to_free_page() and reduce usage. Purpose of this behavior is - Fix "forget all" behavior of force_empty and avoid leak of accounting. - By "moving first, free if necessary", keep pages on memory as much as possible. Adding a switch to change behavior of force_empty to - free first, move if necessary - free all, if there is mlocked/busy pages, return -EBUSY. is under consideration. (I'll add if someone requtests.) This patch also removes memory.force_empty file, a brutal debug-only interface. Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Tested-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Paul Menage <menage@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
In init_section_page_cgroup() the section a given pfn belongs to is calculated at the top of the function and, despite the fact that the pfn/section correspondence does not change, it is recalculated further down the same function. By computing this just once and reusing that value we save some bytes in the object file and do not waste CPU cycles. Signed-off-by: NFernando Luis Vazquez Cao <fernando@oss.ntt.co.jp> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Now, management of "charge" under page migration is done under following manner. (Assume migrate page contents from oldpage to newpage) before - "newpage" is charged before migration. at success. - "oldpage" is uncharged at somewhere(unmap, radix-tree-replace) at failure - "newpage" is uncharged. - "oldpage" is charged if necessary (*1) But (*1) is not reliable....because of GFP_ATOMIC. This patch tries to change behavior as following by charge/commit/cancel ops. before - charge PAGE_SIZE (no target page) success - commit charge against "newpage". failure - commit charge against "oldpage". (PCG_USED bit works effectively to avoid double-counting) - if "oldpage" is obsolete, cancel charge of PAGE_SIZE. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
Fix misuse of gfp_kernel. Now, most of callers of mem_cgroup_charge_xxx functions uses GFP_KERNEL. I think that this is from the fact that page_cgroup *was* dynamically allocated. But now, we allocate all page_cgroup at boot. And mem_cgroup_try_to_free_pages() reclaim memory from GFP_HIGHUSER_MOVABLE + specified GFP_RECLAIM_MASK. * This is because we just want to reduce memory usage. "Where we should reclaim from ?" is not a problem in memcg. This patch modifies gfp masks to be GFP_HIGUSER_MOVABLE if possible. Note: This patch is not for fixing behavior but for showing sane information in source code. Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KAMEZAWA Hiroyuki 提交于
There is a small race in do_swap_page(). When the page swapped-in is charged, the mapcount can be greater than 0. But, at the same time some process (shares it ) call unmap and make mapcount 1->0 and the page is uncharged. CPUA CPUB mapcount == 1. (1) charge if mapcount==0 zap_pte_range() (2) mapcount 1 => 0. (3) uncharge(). (success) (4) set page's rmap() mapcount 0=>1 Then, this swap page's account is leaked. For fixing this, I added a new interface. - charge account to res_counter by PAGE_SIZE and try to free pages if necessary. - commit register page_cgroup and add to LRU if necessary. - cancel uncharge PAGE_SIZE because of do_swap_page failure. CPUA (1) charge (always) (2) set page's rmap (mapcount > 0) (3) commit charge was necessary or not after set_pte(). This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting. Usual mem_cgroup_charge_common() does charge -> commit at a time. And this patch also adds following function to clarify all charges. - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge() called against newly allocated anon pages. - mem_cgroup_charge_migrate_fixup() called only from remove_migration_ptes(). we'll have to rewrite this later.(this patch just keeps old behavior) This function will be removed by additional patch to make migration clearer. Good for clarifying "what we do" Then, we have 4 following charge points. - newpage - swap-in - add-to-cache. - migration. [akpm@linux-foundation.org: add missing inline directives to stubs] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 1月, 2009 30 次提交
-
-
由 Geert Uytterhoeven 提交于
commit 8308c54d ("generic: redefine resource_size_t as phys_addr_t") made CONFIG_RESOURCES_64BIT obsolete, but didn't remove it. Remove it. Signed-off-by: NGeert Uytterhoeven <geert@linux-m68k.org> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Cyrill Gorcunov 提交于
At this point we already know that 'addr' is not NULL so get rid of redundant 'if'. Probably gcc eliminate it by optimization pass. [akpm@linux-foundation.org: use __weak, too] Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org> Reviewed-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Wassim Dagash reported following kswapd infinite loop problem. kswapd runs in some infinite loop trying to swap until order 10 of zone highmem is OK.... kswapd will continue to try to balance order 10 of zone highmem forever (or until someone release a very large chunk of highmem). For non order-0 allocations, the system may never be balanced due to fragmentation but kswapd should not infinitely loop as a result. Instead, recheck all watermarks at order-0 as they are the most important. If watermarks are ok, kswapd will go back to sleep. [akpm@linux-foundation.org: fix comment] Reported-by: Nwassim dagash <wassim.dagash@gmail.com> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Moving the request details print-out before the sanity checks that might panic() enables us to analyse invalid requests without having access to the line information of the stack dump. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
When dup_mmap() ooms we can end up with mm->mmap == NULL. The error path does mmput() and unmap_vmas() gets a NULL vma which it dereferences. In exit_mmap() there is nothing to do at all for this case, we can cancel the callpath right there. [akpm@linux-foundation.org: add sorely-needed comment] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reported-by: NAkinobu Mita <akinobu.mita@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
page_queue_congested() was introduced in 2002, but it was never used Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
No architectures use CONFIG_OUT_OF_LINE_PFN_TO_PAGE - it can be removed. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
xacct_add_tsk() relies on do_exit()->update_hiwater_xxx() and uses mm->hiwater_xxx directly, this leads to 2 problems: - taskstats_user_cmd() can call fill_pid()->xacct_add_tsk() at any moment before the task exits, so we should check the current values of rss/vm anyway. - do_exit()->update_hiwater_xxx() calls are racy. An exiting thread can be preempted right before mm->hiwater_xxx = new_val, and another thread can use A_LOT of memory and exit in between. When the first thread resumes it can be the last thread in the thread group, in that case we report the wrong hiwater_xxx values which do not take A_LOT into account. Introduce get_mm_hiwater_rss() and get_mm_hiwater_vm() helpers and change xacct_add_tsk() to use them. The first helper will also be used by rusage->ru_maxrss accounting. Kill do_exit()->update_hiwater_xxx() calls. Unless we are going to decrease rss/vm there is no point to update mm->hiwater_xxx, and nobody can look at this mm_struct when exit_mmap() actually unmaps the memory. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NHugh Dickins <hugh@veritas.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Frustratingly, gfp_t is really divided into two classes of flags. One are the context dependent ones (can we sleep? can we enter filesystem? block subsystem? should we use some extra reserves, etc.). The other ones are the type of memory required and depend on how the algorithm is implemented rather than the point at which the memory is allocated (highmem? dma memory? etc). Some of the functions which allocate a page and add it to page cache take a gfp_t, but sometimes those functions or their callers aren't really doing the right thing: when allocating pagecache page, the memory type should be mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type should be kernel mapped (not highmem) memory. The gfp_t argument should only really be needed for context dependent options. This patch doesn't really solve that tangle in a nice way, but it does attempt to fix a couple of bugs. - find_or_create_page changes its radix-tree allocation to only include the main context dependent flags in order so the pagecache page may be allocated from arbitrary types of memory without affecting the radix-tree. In practice, slab allocations don't come from highmem anyway, and radix-tree only uses slab allocations. So there isn't a practical change (unless some fs uses GFP_DMA for pages). - grab_cache_page_nowait() is changed to allocate radix-tree nodes with GFP_NOFS, because it is not supposed to reenter the filesystem. This bug could cause lock recursion if a filesystem is not expecting the function to reenter the fs (as-per documentation). Filesystems should be careful about exactly what semantics they want and what they get when fiddling with gfp_t masks to allocate pagecache. One should be as liberal as possible with the type of memory that can be used, and same for the the context specific flags. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs. Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting. This fixes the immediate DM deadlocks when doing direct IO reads to block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases. Signed-off-by: NNick Piggin <npiggin@suse.de> Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 ZhenwenXu 提交于
Fix a little of the coding style in mm/mmap.c [akpm@linux-foundation.org: cleanup] Signed-off-by: NZhenwenXu <helight.xu@gmail.com> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Matt Mackall 提交于
tiny-shmem shares most of its 130 lines of code with shmem and tends to break when particular bits of shmem get modified. Unifying saves code and makes keeping these two in sync much easier. before: 14367 392 24 14783 39bf mm/shmem.o 396 72 8 476 1dc mm/tiny-shmem.o after: 14367 392 24 14783 39bf mm/shmem.o 412 72 8 492 1ec mm/shmem.o tiny Signed-off-by: NMatt Mackall <mpm@selenic.com> Acked-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Ying Han 提交于
The initial implementation of checking TIF_MEMDIE covers the cases of OOM killing. If the process has been OOM killed, the TIF_MEMDIE is set and it return immediately. This patch includes: 1. add the case that the SIGKILL is sent by user processes. The process can try to get_user_pages() unlimited memory even if a user process has sent a SIGKILL to it(maybe a monitor find the process exceed its memory limit and try to kill it). In the old implementation, the SIGKILL won't be handled until the get_user_pages() returns. 2. change the return value to be ERESTARTSYS. It makes no sense to return ENOMEM if the get_user_pages returned by getting a SIGKILL signal. Considering the general convention for a system call interrupted by a signal is ERESTARTNOSYS, so the current return value is consistant to that. Lee: An unfortunate side effect of "make-get_user_pages-interruptible" is that it prevents a SIGKILL'd task from munlock-ing pages that it had mlocked, resulting in freeing of mlocked pages. Freeing of mlocked pages, in itself, is not so bad. We just count them now--altho' I had hoped to remove this stat and add PG_MLOCKED to the free pages flags check. However, consider pages in shared libraries mapped by more than one task that a task mlocked--e.g., via mlockall(). If the task that mlocked the pages exits via SIGKILL, these pages would be left mlocked and unevictable. Proposed fix: Add another GUP flag to ignore sigkill when calling get_user_pages from munlock()--similar to Kosaki Motohiro's 'IGNORE_VMA_PERMISSIONS flag for the same purpose. We are not actually allocating memory in this case, which "make-get_user_pages-interruptible" intends to avoid. We're just munlocking pages that are already resident and mapped, and we're reusing get_user_pages() to access those pages. ?? Maybe we should combine 'IGNORE_VMA_PERMISSIONS and '_IGNORE_SIGKILL into a single flag: GUP_FLAGS_MUNLOCK ??? [Lee.Schermerhorn@hp.com: ignore sigkill in get_user_pages during munlock] Signed-off-by: NPaul Menage <menage@google.com> Signed-off-by: NYing Han <yinghan@google.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NPekka Enberg <penberg@cs.helsinki.fi> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rohit Seth <rohitseth@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
These three statements manipulate local variables and do not need the lock coverage. Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
bad_page() and rmap Eeek messages have said KERN_EMERG for a few years, which I've followed in print_bad_pte(). These are serious system errors, on a par with BUGs, but they're not quite emergencies, and we do our best to carry on: say KERN_ALERT "BUG: " like the x86 oops does. And remove the "Trying to fix it up, but a reboot is needed" line: it's not untrue, but I hope the KERN_ALERT "BUG: " conveys as much. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
print_bad_pte() and bad_page() might each need ratelimiting - especially for their dump_stacks, almost never of interest, yet not quite dispensible. Correlating corruption across neighbouring entries can be very helpful, so allow a burst of 60 reports before keeping quiet for the remainder of that minute (or allow a steady drip of one report per second). Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Remove page_remove_rmap()'s vma arg, which was only for the Eeek message. And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's page_dup_rmap(): we're trying to be more resilient about that than BUGs. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Complete zap_pte_range()'s coverage of bad pagetable entries by calling print_bad_pte() on a pte_file in a linear vma and on a bad swap entry. That needs free_swap_and_cache() to tell it, which will also have shown one of those "swap_free" errors (but with much less information). Similar checks in fork's copy_one_pte()? No, that would be more noisy than helpful: we'll see them when parent and child exec or exit. Where do_nonlinear_fault() calls print_bad_pte(): omit !VM_CAN_NONLINEAR case, that could only be a bug in sys_remap_file_pages(), not a bad pte. VM_FAULT_OOM rather than VM_FAULT_SIGBUS? Well, okay, that is consistent with what happens if do_swap_page() operates a bad swap entry; but don't we have patches to be more careful about killing when VM_FAULT_OOM? Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
print_bad_pte() is so far being called only when zap_pte_range() finds negative page_mapcount, or there's a fault on a pte_file where it does not belong. That's weak coverage when we suspect pagetable corruption. Originally, it was called when vm_normal_page() found an invalid pfn: but pfn_valid is expensive on some architectures and configurations, so 2.6.24 put that under CONFIG_DEBUG_VM (which doesn't help in the field), then 2.6.26 replaced it by a VM_BUG_ON (likewise). Reinstate the print_bad_pte() in vm_normal_page(), but use a cheaper test than pfn_valid(): memmap_init_zone() (used in bootup and hotplug) keep a __read_mostly note of the highest_memmap_pfn, vm_normal_page() then check pfn against that. We could call this pfn_plausible() or pfn_sane(), but I doubt we'll need it elsewhere: of course it's not reliable, but gives much stronger pagetable validation on many boxes. Also use print_bad_pte() when the pte_special bit is found outside a VM_PFNMAP or VM_MIXEDMAP area, instead of VM_BUG_ON. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Now that bad pages are kept out of circulation, there is no need for the infamous page_remove_rmap() BUG() - once that page is freed, its negative mapcount will issue a "Bad page state" message and the page won't be freed. Removing the BUG() allows more info, on subsequent pages, to be gathered. We do have more info about the page at this point than bad_page() can know - notably, what the pmd is, which might pinpoint something like low 64kB corruption - but page_remove_rmap() isn't given the address to find that. In practice, there is only one call to page_remove_rmap() which has ever reported anything, that from zap_pte_range() (usually on exit, sometimes on munmap). It has all the info, so remove page_remove_rmap()'s "Eeek" message and leave it all to zap_pte_range(). mm/memory.c already has a hardly used print_bad_pte() function, showing some of the appropriate info: extend it to show what we want for the rmap case: pte info, page info (when there is a page) and vma info to compare. zap_pte_range() already knows the pmd, but print_bad_pte() is easier to use if it works that out for itself. Some of this info is also shown in bad_page()'s "Bad page state" message. Keep them separate, but adjust them to match each other as far as possible. Say "Bad page map" in print_bad_pte(), and add a TAINT_BAD_PAGE there too. print_bad_pte() show current->comm unconditionally (though it should get repeated in the usually irrelevant stack trace): sorry, I misled Nick Piggin to make it conditional on vm_mm == current->mm, but current->mm is already NULL in the exit case. Usually current->comm is good, though exceptionally it may not be that of the mm (when "swapoff" for example). Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Until now the bad_page() checkers have special-cased PageReserved, keeping those pages out of circulation thereafter. Now extend the special case to all: we want to keep ANY page with bad state out of circulation - the "free" page may well be in use by something. Leave the bad state of those pages untouched, for examination by debuggers; except for PageBuddy - leaving that set would risk bringing the page back. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Simplify the PAGE_FLAGS checking and clearing when freeing and allocating a page: check the same flags as before when freeing, clear ALL the flags (unless PageReserved) when freeing, check ALL flags off when allocating. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
zone_is_near_oom() is unused. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
The vmscan bail out patch move nr_reclaimed variable to struct scan_control. Unfortunately, indirect access can easily happen cache miss. if heavy memory pressure happend, that's ok. cache miss already plenty. it is not observable. but, if memory pressure is lite, performance degression is obserbable. I compared following three pattern (it was mesured 10 times each) hackbench 125 process 3000 hackbench 130 process 3000 hackbench 135 process 3000 2.6.28-rc6 bail-out 125 130 135 125 130 135 ============================================================== 71.866 75.86 81.274 93.414 73.254 193.382 74.145 78.295 77.27 74.897 75.021 80.17 70.305 77.643 75.855 70.134 77.571 79.896 74.288 73.986 75.955 77.222 78.48 80.619 72.029 79.947 78.312 75.128 82.172 79.708 71.499 77.615 77.042 74.177 76.532 77.306 76.188 74.471 83.562 73.839 72.43 79.833 73.236 75.606 78.743 76.001 76.557 82.726 69.427 77.271 76.691 76.236 79.371 103.189 72.473 76.978 80.643 69.128 78.932 75.736 avg 72.545 76.767 78.534 76.017 77.03 93.256 std 1.89 1.71 2.41 6.29 2.79 34.16 min 69.427 73.986 75.855 69.128 72.43 75.736 max 76.188 79.947 83.562 93.414 82.172 193.382 about 4-5% degression. Then, this patch introduces a temporary local variable. result: 2.6.28-rc6 this patch num 125 130 135 125 130 135 ============================================================== 71.866 75.86 81.274 67.302 68.269 77.161 74.145 78.295 77.27 72.616 72.712 79.06 70.305 77.643 75.855 72.475 75.712 77.735 74.288 73.986 75.955 69.229 73.062 78.814 72.029 79.947 78.312 71.551 74.392 78.564 71.499 77.615 77.042 69.227 74.31 78.837 76.188 74.471 83.562 70.759 75.256 76.6 73.236 75.606 78.743 69.966 76.001 78.464 69.427 77.271 76.691 69.068 75.218 80.321 72.473 76.978 80.643 72.057 77.151 79.068 avg 72.545 76.767 78.534 70.425 74.2083 78.462 std 1.89 1.71 2.41 1.66 2.34 1.00 min 69.427 73.986 75.855 67.302 68.269 76.6 max 76.188 79.947 83.562 72.616 77.151 80.321 OK. the degression is disappeared. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
When the VM is under pressure, it can happen that several direct reclaim processes are in the pageout code simultaneously. It also happens that the reclaiming processes run into mostly referenced, mapped and dirty pages in the first round. This results in multiple direct reclaim processes having a lower pageout priority, which corresponds to a higher target of pages to scan. This in turn can result in each direct reclaim process freeing many pages. Together, they can end up freeing way too many pages. This kicks useful data out of memory (in some cases more than half of all memory is swapped out). It also impacts performance by keeping tasks stuck in the pageout code for too long. A 30% improvement in hackbench has been observed with this patch. The fix is relatively simple: in shrink_zone() we can check how many pages we have already freed, direct reclaim tasks break out of the scanning loop if they have already freed enough pages and have reached a lower priority level. We do not break out of shrink_zone() when priority == DEF_PRIORITY, to ensure that equal pressure is applied to every zone in the common case. However, in order to do this we do need to know how many pages we already freed, so move nr_reclaimed into scan_control. akpm: a historical interlude... We tried this in 2004: :commit e468e46a9bea3297011d5918663ce6d19094cf87 :Author: akpm <akpm> :Date: Thu Jun 24 15:53:52 2004 +0000 : :[PATCH] vmscan.c: dont reclaim too many pages : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly hit : a large number of reclaimable pages on the LRU. : Change things so we bale out when SWAP_CLUSTER_MAX pages have been reclaimed. And we reverted it in 2006: :commit 210fe530 :Author: Andrew Morton <akpm@osdl.org> :Date: Fri Jan 6 00:11:14 2006 -0800 : : [PATCH] vmscan: balancing fix : : Revert a patch which went into 2.6.8-rc1. The changelog for that patch was: : : The shrink_zone() logic can, under some circumstances, cause far too many : pages to be reclaimed. Say, we're scanning at high priority and suddenly : hit a large number of reclaimable pages on the LRU. : : Change things so we bale out when SWAP_CLUSTER_MAX pages have been : reclaimed. : : Problem is, this change caused significant imbalance in inter-zone scan : balancing by truncating scans of larger zones. : : Suppose, for example, ZONE_HIGHMEM is 10x the size of ZONE_NORMAL. The zone : balancing algorithm would require that if we're scanning 100 pages of : ZONE_HIGHMEM, we should scan 10 pages of ZONE_NORMAL. But this logic will : cause the scanning of ZONE_HIGHMEM to bale out after only 32 pages are : reclaimed. Thus effectively causing smaller zones to be scanned relatively : harder than large ones. : : Now I need to remember what the workload was which caused me to write this : patch originally, then fix it up in a different way... And we haven't demonstrated that whatever problem caused that reversion is not being reintroduced by this change in 2008. Signed-off-by: NRik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hannes Eder 提交于
Fix the following sparse warnings: mm/hugetlb.c:375:3: warning: returning void-valued expression mm/hugetlb.c:408:3: warning: returning void-valued expression Signed-off-by: NHannes Eder <hannes@hanneseder.net> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Remove the srandom32((u32)get_seconds()) from non-rotational swapon: there's been a coincidental discussion of earlier randomization, assume that goes ahead, let swapon be a client rather than stirring for itself. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Donjun Shin <djshin90@gmail.com> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Joern Engel <joern@logfs.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Change pgoff_t nr_blocks in discard_swap() and discard_swap_cluster() to sector_t: given the constraints on swap offsets (in particular, the 5 bits of swap type accommodated in the same unsigned long), pgoff_t was actually safe as is, but it certainly looked worrying when shifted left. [akpm@linux-foundation.org: fix shift overflow] Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Though attempting to find free clusters (Andrea), swap allocation has always restarted its searches from the beginning of the swap area (sct), to reduce seek times between swap pages, by not scattering them all over the partition. But on a solidstate swap device, seeks are cheap, and block remapping to level the wear may be limited by zones: in that case it's better to cycle around the whole partition. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Swap allocation has always started from the beginning of the swap area; but if we're dealing with a solidstate swap device which can only remap blocks within limited zones, that would sooner wear out the first zone. Therefore sys_swapon() test whether blk_queue is non-rotational, and if so randomize the cluster_next starting position for allocation. If blk_queue is nonrot, note SWP_SOLIDSTATE for later use, and report it with an "SS" at the right end of the kernel's "Adding ... swap" message (so that if it's both nonrot and discardable, "SSD" will be shown there). Perhaps something should be shown in /proc/swaps (swapon -s), but we have to be more cautious before making any addition to that format. Signed-off-by: NHugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Joern Engel <joern@logfs.org> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Donjun Shin <djshin90@gmail.com> Cc: Tejun Heo <teheo@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-