- 24 Mar, 2011 1 commit
-
-
Committed by Daisuke Nishimura
When a page is allocated or freed, check whether memcg still considers it in use (i.e. charged). The check can be useful when debugging, and similar checks existed before commit 52d4b9ac (memcg: allocate all page_cgroup at boot).

The check adds some overhead to allocating and freeing memory, so it is enabled only when CONFIG_DEBUG_VM is set.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 Mar, 2011 2 commits
-
-
Committed by Minchan Kim
rotate_reclaimable_page() moves pages that have just been written out, and that the VM wanted to reclaim, to the end of the inactive list. That way the VM finds those pages first the next time it needs to free memory.

This patch applies the same rule in memcg. It can help prevent unnecessary eviction of a memcg's working pages.

Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Miklos Szeredi
This function basically does:

    remove_from_page_cache(old);
    page_cache_release(old);
    add_to_page_cache_locked(new);

Except it does this atomically, so there is no possibility for the "add" to fail because of a race.

If memory cgroups are enabled, the memory cgroup charge is also moved from the old page to the new one.

This function is currently used by fuse to move pages into the page cache on read, instead of copying the page contents.

[minchan.kim@gmail.com: add freepage() hook to replace_page_cache_page()]
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Minchan Kim <minchan.kim@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 21 Jan, 2011 1 commit
-
-
Committed by KAMEZAWA Hiroyuki
Currently, under THP:

at charge:
  - the PageCgroupUsed bit is set on every page_cgroup of a hugepage, i.e. on 512 pages.
at uncharge:
  - the PageCgroupUsed bit is cleared only on the head page.

So some pages are left with the "Used" bit set.

This patch fixes that by setting the Used bit only on the head page. Used bits for tail pages are set at split time if necessary.

This patch adds this lock order: compound_lock() -> page_cgroup_move_lock().

[akpm@linux-foundation.org: fix warning]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
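
The leak described in the THP commit above can be illustrated with a toy model. This is only a sketch, not kernel code: the list of booleans stands in for the per-page PageCgroupUsed bits, and the names `charge`, `uncharge` and `leaked_after_cycle` are hypothetical.

```python
HPAGE_NR = 512  # base pages per hugepage, the figure quoted in the commit message


def charge(used, head_only):
    """Set the Used flag at charge time.

    head_only=False models the old behaviour (flag set on all 512 pages),
    head_only=True models the patched behaviour (flag set on the head page only).
    """
    count = 1 if head_only else len(used)
    for i in range(count):
        used[i] = True


def uncharge(used):
    """Uncharge clears the flag on the head page (index 0) only."""
    used[0] = False


def leaked_after_cycle(head_only):
    """Return how many pages still carry the Used flag after charge + uncharge."""
    used = [False] * HPAGE_NR
    charge(used, head_only)
    uncharge(used)
    return sum(used)


print(leaked_after_cycle(head_only=False))  # old behaviour: 511 stale flags
print(leaked_after_cycle(head_only=True))   # patched behaviour: 0 stale flags
```

The model leaves out splitting, where the commit says tail-page Used bits are set if necessary.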
-
- 14 Jan, 2011 2 commits
-
-
Committed by Daisuke Nishimura
In the current implementation, mem_cgroup_end_migration() decides whether the page migration has succeeded by checking "oldpage->mapping".

But if we are trying to migrate a shmem swapcache page, its page->mapping is NULL from the beginning, so the check is invalid. As a result, mem_cgroup_end_migration() assumes the migration has succeeded even when it has not, so "newpage" would be freed without being uncharged.

This patch fixes it by passing the result of the page migration to mem_cgroup_end_migration().

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Greg Thelen
Replace usage of the mem_cgroup_update_file_mapped() memcg statistic update routine with two new routines:

* mem_cgroup_inc_page_stat()
* mem_cgroup_dec_page_stat()

As before, only the file_mapped statistic is managed. However, these more general interfaces make it easier to add new statistics. New statistics are added with memcg dirty page accounting.

Signed-off-by: Greg Thelen <gthelen@google.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Minchan Kim <minchan.kim@gmail.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 11 Aug, 2010 1 commit
-
-
Committed by KOSAKI Motohiro
mem_cgroup_soft_limit_reclaim() takes zone, nid and zid arguments, but nid and zid can be calculated from zone. So remove them.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Nishimura Daisuke <d-nishimura@mtf.biglobe.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 10 Aug, 2010 2 commits
-
-
Committed by David Rientjes
This is a complete rewrite of the oom killer's badness() heuristic, which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace.

Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task.

The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500, the task consumes approximately 50% of allowable memory resident in RAM or in swap space.

The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively.

Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task.

Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatibility. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller.

/proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal.

Signed-off-by: David Rientjes <rientjes@google.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
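
The scoring rule described in the badness() commit above can be approximated in a few lines. This is a minimal sketch under stated assumptions, not the kernel's badness() implementation: the function name, the integer arithmetic, modelling root's 3% as a 30-point discount, and the final clamp to 0..1000 are all assumptions made for illustration.

```python
def badness_score(rss_pages, swap_pages, allowable_pages,
                  oom_score_adj=0, is_root=False):
    """Toy badness score on the 0..1000 scale described in the commit message.

    rss_pages + swap_pages is the task's footprint; allowable_pages is the
    "allowable" memory (system, mempolicy nodes, cpuset mems or memcg limit).
    """
    # Proportion of allowable memory in use, scaled to 0..1000.
    points = (rss_pages + swap_pages) * 1000 // allowable_pages
    # Assumption: root's "3% extra memory" is modelled as a 30-point discount.
    if is_root:
        points -= 30
    # oom_score_adj (-1000..+1000) is added directly into the score.
    points += oom_score_adj
    # Assumption: the result is clamped to the documented 0..1000 range.
    return max(0, min(1000, points))


# A task using half of the allowable memory scores 500.
print(badness_score(400, 100, 1000))
# The same task with oom_score_adj=-500 has that 50% discounted.
print(badness_score(400, 100, 1000, oom_score_adj=-500))
# The same task owned by root scores 3% lower.
print(badness_score(400, 100, 1000, is_root=True))
```

Because the score is relative to `allowable_pages`, the same task can score differently depending on whether the oom condition is system-wide or confined to a cpuset, mempolicy or memory controller.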
-
Committed by KOSAKI Motohiro
zone->prev_priority has been unused since 2.6.28, so it can be removed safely. This also reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority could be integrated again and would be useful, but four (or more) attempts have not produced good performance numbers. Thus I am giving up on that approach.

The rest of this changelog is notes on prev_priority: why it existed in the first place and why it might not be necessary any more. This information is based heavily on discussions between Andrew Morton, Rik van Riel and Kosaki Motohiro, who is heavily quoted.

Historically prev_priority was important because it determined when the VM would start unmapping PTE pages, i.e. there were no balances of note within the VM, Anon vs File and Mapped vs Unmapped. Without prev_priority, there is a potential risk of unnecessarily increasing minor faults, as a large amount of read activity on use-once pages could push mapped pages to the end of the LRU, where they get unmapped.

There is no proof this is still a problem, but currently it is not considered to be. Active files are not deactivated if the active file list is smaller than the inactive list, reducing the likelihood that file-mapped pages are pushed off the LRU, and referenced executable pages are kept on the active list to avoid them getting pushed out by read activity.

Even if it is a problem, prev_priority wouldn't work nowadays. First of all, current vmscan still has a lot of UP-centric code, which exposes weaknesses on machines with some dozens of CPUs. I think we need more and more improvement.

The problem is that current vmscan mixes up per-system pressure, per-zone pressure and per-task pressure a bit. For example, prev_priority tries to boost the priority of other concurrent reclaim. But if the other task has a mempolicy restriction, this is not only unnecessary but also causes wrongly large latency and excessive reclaim. Per-task priority plus the prev_priority adjustment emulates per-system pressure, but it has two issues: 1) the emulation is too rough and brutal, and 2) we need per-zone pressure, not per-system pressure.

Another example: currently DEF_PRIORITY is 12. That means the LRU rotates about 2 cycles (1/4096 + 1/2048 + 1/1024 + .. + 1) before the OOM killer is invoked. But if 10,000 threads enter DEF_PRIORITY reclaim at the same time, the system has higher memory pressure than priority==0 (1/4096*10,000 > 2). prev_priority can't solve such a multithreaded workload issue. In other words, the prev_priority concept assumes the system doesn't have lots of threads.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Michael Rubin <mrubin@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
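
The DEF_PRIORITY arithmetic in the prev_priority commit above can be checked directly. The sketch below assumes, as the commit's own sum implies, that a reclaim pass at priority p scans 1/2^p of the LRU; the helper name `scanned_fraction` is hypothetical.

```python
from fractions import Fraction

DEF_PRIORITY = 12  # value quoted in the commit message


def scanned_fraction(priority):
    """Fraction of the LRU scanned by one reclaim pass at the given priority."""
    return Fraction(1, 2 ** priority)


# One task walking priorities 12, 11, ..., 0 before the OOM killer is invoked.
single_task_total = sum(scanned_fraction(p) for p in range(DEF_PRIORITY, -1, -1))

# 10,000 threads each doing a single pass at DEF_PRIORITY at the same time.
threads = 10_000
concurrent_total = threads * scanned_fraction(DEF_PRIORITY)

print(float(single_task_total))  # just under 2 LRU cycles
print(float(concurrent_total))   # about 2.44 LRU cycles
```

The single-task sum is 2 - 1/4096, while 10,000 concurrent lowest-pressure passes already exceed it, which is the commit's point that a per-task priority cannot represent system-wide pressure.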
-
- 28 May, 2010 1 commit
-
-
The per-memcg FILE_MAPPED count of migrated file cache is not updated properly, because our hook in page_add_file_rmap() can't know which memcg FILE_MAPPED should be counted against.

Basically, this patch fixes that bug, but it includes some big changes to clean up other messes.

Currently, when a mapped file page is migrated, events happen in the following sequence:

1. allocate a new page.
2. get the memcg of the old page.
3. charge against the new page before migration. At this point there is no change to the new page's page_cgroup and no commit for the charge. (IOW, the PCG_USED bit is not set.)
4. page migration replaces radix-tree, old-page and new-page.
5. page migration remaps the new page if the old page was mapped.
6. Here, the new page is unlocked.
7. memcg commits the charge for newpage and marks the new page's page_cgroup as PCG_USED.

Because "commit" happens after page-remap, we cannot count FILE_MAPPED at "5": we should avoid trusting page_cgroup->mem_cgroup if the PCG_USED bit is unset. (Note: memcg's LRU removal code does that, but LRU-isolation logic is used to help it. When we overwrite page_cgroup->mem_cgroup, page_cgroup is not on the LRU or page_cgroup->mem_cgroup is NULL.)

We can lose file_mapped accounting information at 5 because FILE_MAPPED is updated only when mapcount changes 0->1. So we should catch it.

BTW, historically, the above implementation comes from migration failure of anonymous pages. Because we charge both the old page and the new page with mapcount=0, we can't catch
- the page is really freed before remap.
- migration fails but it's freed before remap
or .....corner cases.

The new migration sequence with memcg is:

1. allocate a new page.
2. mark PageCgroupMigration on the old page.
3. charge against the new page onto the old page's memcg. (here, the new page's pc is marked as PageCgroupUsed.)
4. page migration replaces radix-tree, page table, etc...
5. At remapping, the new page's page_cgroup is now marked as "USED". We can catch the 0->1 event and FILE_MAPPED will be properly updated. And we can catch the SWAPOUT event after unlocking this, and freeing this page by unmap() can be caught.
7. Clear PageCgroupMigration of the old page.

So, FILE_MAPPED will be correctly updated.

Then, what is the MIGRATION flag for? Without it, at migration failure, we may have to charge the old page again because it may be fully unmapped. "charge" means that we have to dive into memory reclaim or something complicated. So it's better to avoid charging it again. Before this patch, __commit_charge() was working for both the old and the new page and fixed up everything. But this technique has some race conditions around FILE_MAPPED and SWAPOUT etc... Now the kernel uses the MIGRATION flag and doesn't uncharge the old page until the end of migration.

I hope this change will make memcg's page migration much simpler. This page migration has caused several troubles, so it is worth adding a flag for simplification.

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Reported-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
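
The ordering problem in the migration commit above comes down to whether the page_cgroup is already marked USED when the remap raises mapcount from 0 to 1. The toy model below is a deliberate simplification with hypothetical names (`PageCgroup`, `migrate`); it ignores locking, failure paths and the MIGRATION flag itself.

```python
class PageCgroup:
    """Stand-in for a page_cgroup; only the USED bit is modelled."""

    def __init__(self):
        self.used = False


def migrate(commit_before_remap):
    """Return the FILE_MAPPED count recorded for the new page.

    commit_before_remap=False models the old sequence (charge committed after
    remap); True models the new sequence (page_cgroup marked USED at charge).
    """
    file_mapped = 0
    new_pc = PageCgroup()

    if commit_before_remap:
        new_pc.used = True  # new sequence, step 3: pc marked as PageCgroupUsed

    # Remap: mapcount goes 0 -> 1. The accounting hook only trusts a
    # page_cgroup whose USED bit is set, so otherwise the event is lost.
    if new_pc.used:
        file_mapped += 1

    if not commit_before_remap:
        new_pc.used = True  # old sequence, step 7: committed too late to count

    return file_mapped


print(migrate(commit_before_remap=False))  # old sequence loses the 0->1 event
print(migrate(commit_before_remap=True))   # new sequence counts it
```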
-
- 25 May, 2010 1 commit
-
-
Committed by Johannes Weiner
For now, we have global isolation vs. memory control group isolation. Do not allow the reclaim entry function to set an arbitrary page isolation callback; we do not need that flexibility.

And since we already pass around the group descriptor for the memory control group isolation case, just use it to decide which one of the two isolator functions to use.

The decisions can be merged into nearby branches, so there is no extra cost. In fact, we save the indirect calls.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 Mar, 2010 1 commit
-
-
Committed by KAMEZAWA Hiroyuki
In the current page-fault code,

    handle_mm_fault()
      -> ...
      -> mem_cgroup_charge()
      -> map page or handle error.
    -> check return code.

If the page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is called. But if the failure is caused by memcg, OOM should already have been invoked.

So I added a patch: a636b327. That patch records last_oom_jiffies for memcg's sub-hierarchy and prevents page_fault_out_of_memory() from being invoked in the near future.

But Nishimura-san reported that a check by jiffies is not enough when the system is under very heavy load.

This patch changes memcg's oom logic as follows:
* If memcg causes an OOM-kill, continue to retry.
* Remove the jiffies check that is used now.
* Add a memcg-oom-lock which works like the per-zone oom lock.
* If current is killed (as a process), bypass the charge.

Something more sophisticated can be added, but this patch does the fundamental things.

TODO:
- add oom notifier
- add per-memcg disable-oom-kill flag and freezer at oom.
- more chances to wake up the oom waiter (when changing memory limit etc..)

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Tested-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 16 Dec, 2009 4 commits
-
-
Committed by KAMEZAWA Hiroyuki
In the global VM, FILE_MAPPED is used, but memcg uses MAPPED_FILE. This makes grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED.

Also, in the global VM, mapped shared memory is accounted into FILE_MAPPED, but memcg doesn't do so. Fix it.

Note: page_is_file_cache() just checks SwapBacked or not, so we need to check PageAnon.

Cc: Balbir Singh <balbir@in.ibm.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
In a massively parallel environment, res_counter can be a performance bottleneck. One strong technique to reduce lock contention is to reduce the number of calls by coalescing several calls into one.

Considering the charge/uncharge characteristics:
- charge is done one by one via demand paging.
- uncharge is done
  - in chunks at munmap, truncate, exit, execve...
  - one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges to improve scalability at unmap/truncation.

This patch coalesces uncharges. To avoid scattering memcg's structures across functions under /mm, this patch adds memcg batch-uncharge information to the task. One reason for per-task batching is to make use of the caller's context information. We do batched (delayed) uncharge when truncation/unmap occurs, but do direct uncharge when uncharge is called by memory reclaim (vmscan.c).

The degree of coalescing depends on the caller:
- at invalidate/truncate... pagevec size
- at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.) So we will not coalesce too much.

On an x86-64 8-CPU server, I tested the overhead of memcg at page fault by running a program which does map/fault/unmap in a loop, running one task per CPU via taskset and looking at the sum of the number of page faults in 60 secs.

[without memcg config]
  40156968 page-faults # 0.085 M/sec ( +- 0.046% )
  27.67 cache-miss/faults
[root cgroup]
  36659599 page-faults # 0.077 M/sec ( +- 0.247% )
  31.58 miss/faults
[in a child cgroup]
  18444157 page-faults # 0.039 M/sec ( +- 0.133% )
  69.96 miss/faults
[child with this patch]
  27133719 page-faults # 0.057 M/sec ( +- 0.155% )
  47.16 miss/faults

We can see some improvement. (The root cgroup isn't affected by this patch.) Another patch for "charge" will follow this one, and the numbers above will improve further.

Changelog (since 2009/10/02):
- renamed fields of memcg_batch (pages to bytes, memsw to memsw_bytes)
- some clean up and commentary/description updates.
- added initialization code to copy_process(). (possible bug fix)

Changelog (old):
- fixed !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.
- unified patch for callers
- added comments.
- made ->do_batch a bool.
- removed css_get() et al. We don't need it.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
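
The coalescing idea in the batched-uncharge commit above can be sketched as follows. This is an illustration under assumptions, not the kernel's memcg_batch code: `ResCounter`, `UnchargeBatch`, the 4096-byte page size and the 16-page run are hypothetical, and `lock_acquisitions` simply counts how often the shared counter would be locked.

```python
PAGE_SIZE = 4096  # assumed page size for the illustration


class ResCounter:
    """Shared counter; every uncharge() call stands for one lock round trip."""

    def __init__(self, usage):
        self.usage = usage
        self.lock_acquisitions = 0

    def uncharge(self, nbytes):
        self.lock_acquisitions += 1
        self.usage -= nbytes


class UnchargeBatch:
    """Per-task batch: accumulate bytes locally, flush once at the end."""

    def __init__(self, counter):
        self.counter = counter
        self.active = False
        self.pending_bytes = 0

    def start(self):
        self.active = True
        self.pending_bytes = 0

    def uncharge_page(self):
        if self.active:
            self.pending_bytes += PAGE_SIZE   # delayed: no shared lock taken
        else:
            self.counter.uncharge(PAGE_SIZE)  # direct, e.g. from reclaim

    def end(self):
        if self.pending_bytes:
            self.counter.uncharge(self.pending_bytes)  # one coalesced call
        self.active = False
        self.pending_bytes = 0


NPAGES = 16  # hypothetical size of one unmap/truncate run

direct = ResCounter(NPAGES * PAGE_SIZE)
batch = UnchargeBatch(direct)
for _ in range(NPAGES):
    batch.uncharge_page()          # batching not started: direct uncharge

batched = ResCounter(NPAGES * PAGE_SIZE)
batch = UnchargeBatch(batched)
batch.start()
for _ in range(NPAGES):
    batch.uncharge_page()
batch.end()

print(direct.lock_acquisitions, batched.lock_acquisitions)
```

Both counters end at zero usage, but the batched path touches the shared counter once instead of once per page, which is where the reduced contention would come from.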
-
Committed by Wu Fengguang
So that an outside user can free the reference count grabbed by try_get_mem_cgroup_from_page().

CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
-
Committed by Wu Fengguang
So that the hwpoison injector can get the mem_cgroup for an arbitrary page and thus know whether it is owned by some mem_cgroup task(s).

[AK: Merged with latest git tree]
CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Hugh Dickins <hugh.dickins@tiscali.co.uk>
CC: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
-
- 24 Sep, 2009 1 commit
-
-
Committed by Balbir Singh
Implement reclaim from groups over their soft limit

Permit reclaim from memory cgroups on contention (via the direct reclaim path).

Memory cgroup soft limit reclaim finds the group that exceeds its soft limit by the largest number of pages, reclaims pages from it, and then reinserts the cgroup into its correct place in the rbtree.

Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long loops in case all swap is turned off. The code has been refactored and the loop check (loop < 2) has been enhanced for soft limits. For soft limits, we try to do more targeted reclaim. Instead of bailing out after two loops, the routine now reclaims memory proportional to the size by which the soft limit is exceeded. The proportion has been empirically determined.

[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
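
The victim-selection loop described in the soft limit commit above can be sketched like this. It is a simplified model with hypothetical names: a plain dictionary re-scanned on every iteration stands in for the rbtree reinsertion, and a fixed `chunk` replaces the empirically determined proportion mentioned in the commit.

```python
def excess(group):
    """Pages by which a group exceeds its soft limit (may be <= 0)."""
    return group["usage"] - group["soft_limit"]


def soft_limit_reclaim(groups, pages_needed, chunk):
    """Reclaim up to pages_needed pages from the groups furthest over their soft limit.

    Each iteration picks the group with the largest excess, reclaims at most
    `chunk` pages from it, and then re-evaluates the ordering.
    Returns the number of pages reclaimed.
    """
    reclaimed = 0
    while reclaimed < pages_needed:
        victim = max(groups.values(), key=excess)
        if excess(victim) <= 0:
            break  # nobody is over its soft limit any more
        take = min(chunk, excess(victim), pages_needed - reclaimed)
        victim["usage"] -= take
        reclaimed += take
    return reclaimed


groups = {
    "a": {"usage": 100, "soft_limit": 50},  # 50 pages over
    "b": {"usage": 80, "soft_limit": 60},   # 20 pages over
}
got = soft_limit_reclaim(groups, pages_needed=30, chunk=10)
print(got, groups["a"]["usage"], groups["b"]["usage"])
```

In this run all 30 pages come from group "a", because it stays the largest offender throughout; group "b" is touched only once "a" no longer has the largest excess.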
-
- 19 Jun, 2009 1 commit
-
-
Committed by Balbir Singh
Add file RSS tracking per memory cgroup

We currently don't track file RSS; the RSS we report is actually anon RSS. All the file-mapped pages come in through the page cache and get accounted there. This patch adds support for accounting file RSS pages. It should

1. Help improve the metrics reported by the memory resource controller
2. Form the basis for a future shared memory accounting heuristic that has been proposed by Kamezawa.

Unfortunately, we cannot rename the existing "rss" keyword used in memory.stat to "anon_rss". We do, however, add "mapped_file" data and hope to educate the end user through documentation.

[hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.cn>
Cc: Paul Menage <menage@google.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 17 Jun, 2009 1 commit
-
-
Committed by Rik van Riel
When the file LRU lists are dominated by streaming IO pages, evict those pages first, before considering evicting other pages.

This should be safe from deadlocks or performance problems because only three things can happen to an inactive file page:

1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted

The pages freed in this way can either be reused for streaming IO, or allocated for something else. If the pages are used for streaming IO, this pageout pattern continues. Otherwise, we will fall back to the normal pageout pattern.

Signed-off-by: Rik van Riel <riel@redhat.com>
Reported-by: Elladan <elladan@eskimo.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 03 May, 2009 1 commit
-
-
Committed by Daisuke Nishimura
The current mem_cgroup_shrink_usage() has two problems.

1. It doesn't call mem_cgroup_out_of_memory() and doesn't update last_oom_jiffies, so pagefault_out_of_memory() invokes the global OOM.
2. Considering hierarchy, shrinking has to be done from mem_over_limit, not from the memcg which the page would be charged to.

mem_cgroup_try_charge_swapin() does all of these things properly, so we use it and call cancel_charge_swapin when it succeeds.

The name "shrink_usage" is not appropriate for this behavior, so we change it too.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.cn>
Cc: Paul Menage <menage@google.com>
Cc: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 22 Apr, 2009 1 commit
-
-
Committed by KAMEZAWA Hiroyuki
mm->owner should be accessed with rcu_dereference().

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 03 Apr, 2009 3 commits
-
-
Committed by KOSAKI Motohiro
Commit 4f98a2fe (vmscan: split LRU lists into anon & file sets) removed mem_cgroup_reclaim_imbalance(), but there are some leftovers in memcontrol.h.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Currently, mem_cgroup_calc_mapped_ratio() is not used at all. It can be removed, as KAMEZAWA-san suggested.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Balbir Singh
Add RSS and swap to OOM output from memcg

Display memcg values like failcnt, usage and limit when an OOM occurs due to memcg.

Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki, Daisuke Nishimura and KOSAKI Motohiro for review.

Sample output
-------------

Task in /a/x killed as a result of limit of /a
memory: usage 1048576kB, limit 1048576kB, failcnt 4183
memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

[akpm@linux-foundation.org: compilation fix]
[akpm@linux-foundation.org: fix kerneldoc and whitespace]
[akpm@linux-foundation.org: add printk facility level]
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 09 Jan, 2009 16 commits
-
-
Committed by KAMEZAWA Hiroyuki
Currently, you can see the following even when swap accounting is enabled.

1. Create groups 01 and 02.
2. Allocate a "file" on tmpfs from a task under 01.
3. Swap out the "file" (by memory pressure).
4. Read the "file" from a task in group 02.
5. The charge of the "file" is moved to group 02.

This is not ideal behavior. It happens because SwapCache which was loaded by read-ahead is not taken into account.

This is a patch to fix shmem's swapcache behavior.
- remove mem_cgroup_cache_charge_swapin().
- add a SwapCache handler routine to mem_cgroup_cache_charge(). With this, shmem's file cache is charged at add_to_page_cache() with GFP_NOWAIT.
- pass the page of swapcache to shrink_mem_cgroup.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Daisuke Nishimura
After the previous patch, mem_cgroup_try_charge is not used by anyone, so we can remove it.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Currently, memcg's inactive_ratio is calculated when the limit is set, because page_alloc.c does so and the current implementation is a straightforward port.

However, memcg recently introduced the hierarchy feature. Under hierarchy restrictions, the memory limit is decided not only by the current cgroup's memory.limit_in_bytes but also by the parent's limit and the siblings' memory usage.

The optimal inactive_ratio therefore changes frequently, so calculating it every time is better.

Tested-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Now, get_scan_ratio() returns the correct value even under memcg reclaim, so mem_cgroup_calc_reclaim() can be removed.

As a result, memcg reclaim now has the same anon/file reclaim balancing capability as global reclaim.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Introduce the mem_cgroup_per_zone::reclaim_stat member and its statistics-collecting function.

Now, get_scan_ratio() can calculate the correct value on memcg reclaim.

[hugh@veritas.com: avoid reclaim_stat oops when disabled]
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
Introduce mem_cgroup_zone_nr_pages(). It is called by the zone_nr_pages() helper function.

This patch does not change any behavior.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KOSAKI Motohiro
inactive_anon_is_low() is a key component of active/inactive anon balancing on reclaim. However, the current inactive_anon_is_low() function only considers global reclaim.

Therefore, we need the following ugly scan_global_lru() condition.

	if (lru == LRU_ACTIVE_ANON &&
	    (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
		shrink_active_list(nr_to_scan, zone, sc, priority, file);
		return 0;

This causes memcg reclaim to always deactivate pages when shrink_list() is called.

Introducing mem_cgroup_inactive_anon_is_low() improves active/inactive anon balancing for memcg.

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: "Pekka Enberg" <penberg@cs.helsinki.fi>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
My patch memcg-fix-gfp_mask-of-callers-of-charge.patch changed the gfp_mask of callers of charge to GFP_HIGHUSER_MOVABLE to show what will happen at memory reclaim. But in recent discussion it was NACKed because it sounds ugly.

This patch reverts it and adds some cleanup to the gfp_mask of callers of charge. There is no behavior change, but it needs review before generating HUNKs deep in the queue.

This patch also adds an explanation of the meaning of the gfp_mask passed to charge functions in memcontrol.h.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Hugh Dickins <hugh@veritas.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
Current mmotm has a new OOM function, pagefault_out_of_memory(). It was added to select a bad process rather than killing current.

When a memcg hits its limit and calls OOM at page fault, this handler is called and system-wide OOM handling happens (which means the kernel panics if panic_on_oom is true).

To avoid overkill, check the memcg's recent behavior before starting system-wide OOM.

This patch also fixes the code to guarantee "don't account against a process with TIF_MEMDIE". This is necessary for smooth OOM.

[akpm@linux-foundation.org: build fix]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Hirokazu Takahashi <taka@valinux.co.jp>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Lai Jiangshan
mm_match_cgroup() calls cgroup_subsys_state().

We must use rcu_read_lock() to protect cgroup_subsys_state().

Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
Cc: Paul Menage <menage@google.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Hirokazu Takahashi
We check whether mem_cgroup is disabled by testing mem_cgroup_subsys.disabled. I think it now has more references than expected.

Replacing

	if (mem_cgroup_subsys.disabled)

with

	if (mem_cgroup_disabled())

gives the code a better look, I think.

[kamezawa.hiroyu@jp.fujitsu.com: fix typo]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
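A minimal sketch of such a wrapper, compiled outside the kernel, could look like this. `struct cgroup_subsys_sketch` is a stand-in for the real subsystem descriptor and `charge_page_sketch()` is a hypothetical caller; only the names `mem_cgroup_subsys`, `disabled` and `mem_cgroup_disabled()` come from the message above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the kernel's cgroup subsystem descriptor (illustration only). */
struct cgroup_subsys_sketch {
	int disabled;
};

static struct cgroup_subsys_sketch mem_cgroup_subsys;

/* Wrapper so that callers no longer reach into the structure directly. */
static inline bool mem_cgroup_disabled(void)
{
	return mem_cgroup_subsys.disabled != 0;
}

/* Hypothetical caller: charging is a no-op when the controller is off. */
static int charge_page_sketch(unsigned long *usage)
{
	assert(usage != NULL);
	if (mem_cgroup_disabled())
		return 0;
	*usage += 1;
	return 0;
}
```

Hiding the field behind one function keeps every call site identical and leaves a single place to change if the way the controller is disabled ever changes.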
-
Committed by KAMEZAWA Hiroyuki
A big patch for changing memcg's LRU semantics.

Now,
- page_cgroup is linked to mem_cgroup's own LRU (per zone).
- The LRU of page_cgroup is not synchronous with the global LRU.
- page and page_cgroup are one-to-one and statically allocated.
- To find which LRU a page_cgroup is on, you have to check pc->mem_cgroup as
	lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);
- SwapCache is handled.

And, when we handle the LRU list of page_cgroup, we do the following.

	pc = lookup_page_cgroup(page);
	lock_page_cgroup(pc); .....................(1)
	mz = page_cgroup_zoneinfo(pc);
	spin_lock(&mz->lru_lock);
	.....add to LRU
	spin_unlock(&mz->lru_lock);
	unlock_page_cgroup(pc);

But (1) is a spin_lock and we have to be afraid of deadlock with zone->lru_lock. So, trylock() is used at (1), now. Without (1), we can't trust that "mz" is correct.

This is a trial to remove this dirty nesting of locks. This patch changes mz->lru_lock to be zone->lru_lock. Then, the above sequence will be written as

	spin_lock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU
	mem_cgroup_add/remove/etc_lru() {
		pc = lookup_page_cgroup(page);
		mz = page_cgroup_zoneinfo(pc);
		if (PageCgroupUsed(pc)) {
			....add to LRU
		}
	}
	spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

This is much simpler.

(*) We're safe even if we don't take lock_page_cgroup(pc), because:
1. pc->mem_cgroup can be modified only
   - at charge.
   - at account_move().
2. At charge, the PCG_USED bit is not set before pc->mem_cgroup is fixed.
3. At account_move(), the page is isolated and not on the LRU.

Pros.
- Easy to maintain.
- memcg can make use of the laziness of pagevec.
- We don't have to duplicate the LRU/Active/Unevictable bits in page_cgroup.
- The LRU status of memcg will be synchronized with the global LRU's.
- The number of locks is reduced.
- account_move() is simplified very much.

Cons.
- May increase the cost of LRU rotation. (No impact if memcg is not configured.)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
This patch implements a per-cgroup limit for usage of memory+swap. Although SwapCache exists, double counting of swap-cache and swap-entry is avoided.

The Mem+Swap controller works as follows.
- memory usage is limited by memory.limit_in_bytes.
- memory + swap usage is limited by memory.memsw_limit_in_bytes.

This has the following benefits.
- A user can limit the total resource usage of mem+swap.

  Without this, because the memory resource controller doesn't take care of swap usage, a process can exhaust all the swap (by a memory leak). We can avoid this case.

  Also, swap is a shared resource, but it cannot be reclaimed (go back to memory) until it's used. This characteristic can cause trouble when the memory is divided into parts by cpuset or memcg. Assume group A and group B. After some applications execute, the system can be:

	Group A -- very large free memory space but occupies 99% of swap.
	Group B -- under memory shortage but cannot use swap...it's nearly full.

  The ability to set an appropriate swap limit for each group is required.

Maybe someone wonders "why not swap but mem+swap?"

- The global LRU (kswapd) can swap out arbitrary pages. Swap-out means moving the account from memory to swap; there is no change in the usage of mem+swap.

  In other words, when we want to limit the usage of swap without affecting the global LRU, a mem+swap limit is better than just limiting swap.

Accounting target information is stored in swap_cgroup, which is a per-swap-entry record.

Charge is done as follows.
  map
	- charge page and memsw.

  unmap
	- uncharge page/memsw if not SwapCache.

  swap-out (__delete_from_swap_cache)
	- uncharge page
	- record mem_cgroup information to swap_cgroup.

  swap-in (do_swap_page)
	- charged as page and memsw.
	  The record in swap_cgroup is cleared.
	  memsw accounting is decremented.

  swap-free (swap_free())
	- if the swap entry is freed, memsw is uncharged by PAGE_SIZE.

There are people who work in never-swap environments and consider swap something bad. For such people, this mem+swap controller extension is just an overhead. This overhead is avoided by a config or boot option. (See Kconfig. The detail is not in this patch.)

TODO:
- Maybe more optimization can be done in the swap-in path (but not very safe). For now we just do simple accounting at this stage.

[nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
[hugh@veritas.com: memswap controller core swapcache fixes]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
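The charge rules listed in this message can be traced with a small two-counter model. This is only a sketch of the bookkeeping: it ignores SwapCache, swap_cgroup records, reclaim and locking, and every name in it (`struct memsw_model`, `model_map()` and so on, plus the 4 KiB `SKETCH_PAGE_SIZE`) is invented for the illustration.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption for this sketch */

/* Hypothetical model of one group's two counters, in bytes. */
struct memsw_model {
	unsigned long mem, mem_limit;		/* memory usage and its limit */
	unsigned long memsw, memsw_limit;	/* memory+swap usage and its limit */
};

/* map: charge both counters; fail if either limit would be exceeded. */
static int model_map(struct memsw_model *g)
{
	if (g->mem + SKETCH_PAGE_SIZE > g->mem_limit ||
	    g->memsw + SKETCH_PAGE_SIZE > g->memsw_limit)
		return -1;
	g->mem += SKETCH_PAGE_SIZE;
	g->memsw += SKETCH_PAGE_SIZE;
	return 0;
}

/* unmap of a page that is not SwapCache: uncharge both counters. */
static void model_unmap(struct memsw_model *g)
{
	assert(g->mem >= SKETCH_PAGE_SIZE && g->memsw >= SKETCH_PAGE_SIZE);
	g->mem -= SKETCH_PAGE_SIZE;
	g->memsw -= SKETCH_PAGE_SIZE;
}

/* swap-out: the page leaves memory; mem+swap usage does not change. */
static void model_swap_out(struct memsw_model *g)
{
	assert(g->mem >= SKETCH_PAGE_SIZE);
	g->mem -= SKETCH_PAGE_SIZE;
}

/*
 * swap-in: charge as page and memsw, then drop the memsw charge that
 * the swap entry was holding, so memsw ends up where it started.
 */
static int model_swap_in(struct memsw_model *g)
{
	if (model_map(g) < 0)
		return -1;
	g->memsw -= SKETCH_PAGE_SIZE;
	return 0;
}

/* swap-free: the swap entry is released without being swapped in. */
static void model_swap_free(struct memsw_model *g)
{
	assert(g->memsw >= SKETCH_PAGE_SIZE);
	g->memsw -= SKETCH_PAGE_SIZE;
}
```

In this model a swap-out lowers `mem` but leaves `memsw` untouched, which is the property the message relies on when it argues for limiting mem+swap rather than swap alone.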
-
Committed by KAMEZAWA Hiroyuki
Config and control variable for the mem+swap controller.

This patch adds CONFIG_CGROUP_MEM_RES_CTLR_SWAP (memory resource controller swap extension).

For accounting swap, it's obvious that we have to use additional memory to remember "who uses swap". This adds more overhead. So, it's better to offer a "choice" to users.

This patch adds 2 parameters to enable the swap extension or not.
- CONFIG
- boot option

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
Now, management of "charge" under page migration is done in the following manner. (Assume page contents are migrated from oldpage to newpage.)

 before
  - "newpage" is charged before migration.
 at success
  - "oldpage" is uncharged at somewhere (unmap, radix-tree-replace).
 at failure
  - "newpage" is uncharged.
  - "oldpage" is charged if necessary (*1)

But (*1) is not reliable because of GFP_ATOMIC.

This patch tries to change the behavior as follows by charge/commit/cancel ops.

 before
  - charge PAGE_SIZE (no target page)
 success
  - commit charge against "newpage".
 failure
  - commit charge against "oldpage".
    (The PCG_USED bit works effectively to avoid double-counting.)
  - if "oldpage" is obsolete, cancel the charge of PAGE_SIZE.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by KAMEZAWA Hiroyuki
There is a small race in do_swap_page(). When the swapped-in page is charged, the mapcount can be greater than 0. But, at the same time, some process that shares it can call unmap, making mapcount go 1->0, and the page is uncharged.

	CPUA				CPUB
	mapcount == 1.
   (1) charge if mapcount==0		zap_pte_range()
					(2) mapcount 1 => 0.
					(3) uncharge(). (success)
   (4) set page's rmap()
       mapcount 0=>1

Then, this swap page's account is leaked.

To fix this, I added a new interface.
  - charge
    account to res_counter by PAGE_SIZE and try to free pages if necessary.
  - commit
    register page_cgroup and add to LRU if necessary.
  - cancel
    uncharge PAGE_SIZE because of do_swap_page failure.

	CPUA
   (1) charge (always)
   (2) set page's rmap (mapcount > 0)
   (3) commit: decide after set_pte() whether the charge was necessary.

This protocol uses the PCG_USED bit on page_cgroup to avoid over-accounting. The usual mem_cgroup_charge_common() does charge -> commit at one time.

This patch also adds the following functions to clarify all charges.

  - mem_cgroup_newpage_charge() ....replacement for mem_cgroup_charge()
	called against newly allocated anon pages.

  - mem_cgroup_charge_migrate_fixup()
	called only from remove_migration_ptes().
	We'll have to rewrite this later. (This patch just keeps the old behavior.)
	This function will be removed by an additional patch to make migration clearer.

Good for clarifying "what we do".

Then, we have the following 4 charge points.
  - newpage
  - swap-in
  - add-to-cache.
  - migration.

[akpm@linux-foundation.org: add missing inline directives to stubs]
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
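The charge/commit/cancel protocol can be sketched as a single-threaded userspace model. It is not the kernel interface: locking, reclaim and LRU handling are left out, and `struct res_counter_sketch`, `struct page_cgroup_sketch` and the three `sketch_*` functions are hypothetical names. The `used` flag plays the role that the message assigns to the PCG_USED bit.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumption for this sketch */

/* Hypothetical stand-ins for the counter and the per-page record. */
struct res_counter_sketch {
	unsigned long usage, limit;
};

struct page_cgroup_sketch {
	bool used;	/* models the PCG_USED bit */
};

/* charge: always account PAGE_SIZE up front; no target page is named yet. */
static int sketch_try_charge(struct res_counter_sketch *rc)
{
	if (rc->usage + SKETCH_PAGE_SIZE > rc->limit)
		return -1;	/* the real code would try to free pages first */
	rc->usage += SKETCH_PAGE_SIZE;
	return 0;
}

/* cancel: give the up-front charge back, e.g. after a failed swap-in. */
static void sketch_cancel_charge(struct res_counter_sketch *rc)
{
	assert(rc->usage >= SKETCH_PAGE_SIZE);
	rc->usage -= SKETCH_PAGE_SIZE;
}

/*
 * commit: bind the charge to a page. If the page is already accounted,
 * the up-front charge was not needed and is returned, so the page is
 * never counted twice.
 */
static void sketch_commit_charge(struct res_counter_sketch *rc,
				 struct page_cgroup_sketch *pc)
{
	if (pc->used) {
		sketch_cancel_charge(rc);
		return;
	}
	pc->used = true;
}
```

Charging unconditionally and deciding at commit time removes the "charge only if mapcount == 0" check, which is the step that raced with the concurrent unmap in the diagram above.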
-