- 10 8月, 2010 2 次提交
-
-
由 KOSAKI Motohiro 提交于
shrink_zones() need relatively long time and lru_pages can change dramatically during shrink_zones(). So lru_pages should be recalculated for each priority. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Swap token don't works when zone reclaim is enabled since it was born. Because __zone_reclaim() always call disable_swap_token() unconditionally. This kill swap token feature completely. As far as I know, nobody want to that. Remove it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 21 7月, 2010 1 次提交
-
-
由 Nick Piggin 提交于
We need lock_page_nosync() here because we have no reference to the mapping when taking the page lock. Signed-off-by: NNick Piggin <npiggin@suse.de> Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 19 7月, 2010 1 次提交
-
-
由 Dave Chinner 提交于
The current shrinker implementation requires the registered callback to have global state to work from. This makes it difficult to shrink caches that are not global (e.g. per-filesystem caches). Pass the shrinker structure to the callback so that users can embed the shrinker structure in the context the shrinker needs to operate on and get back to it in the callback via container_of(). Signed-off-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de>
-
- 05 6月, 2010 1 次提交
-
-
由 KOSAKI Motohiro 提交于
Greg Thelen reported recent Johannes's stack diet patch makes kernel hang. His test is following. mount -t cgroup none /cgroups -o memory mkdir /cgroups/cg1 echo $$ > /cgroups/cg1/tasks dd bs=1024 count=1024 if=/dev/null of=/data/foo echo $$ > /cgroups/tasks echo 1 > /cgroups/cg1/memory.force_empty Actually, This OOM hard to try logic have been corrupted since following two years old patch. commit a41f24ea Author: Nishanth Aravamudan <nacc@us.ibm.com> Date: Tue Apr 29 00:58:25 2008 -0700 page allocator: smarter retry of costly-order allocations Original intention was "return success if the system have shrinkable zones though priority==0 reclaim was failure". But the above patch changed to "return nr_reclaimed if .....". Oh, That forgot nr_reclaimed may be 0 if priority==0 reclaim failure. And Johannes's patch 0aeb2339 ("vmscan: remove all_unreclaimable scan control") made it more corrupt. Originally, priority==0 reclaim failure on memcg return 0, but this patch changed to return 1. It totally confused memcg. This patch fixes it completely. Reported-by: NGreg Thelen <gthelen@google.com> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: NGreg Thelen <gthelen@google.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 5月, 2010 6 次提交
-
-
由 Johannes Weiner 提交于
For now, we have global isolation vs. memory control group isolation, do not allow the reclaim entry function to set an arbitrary page isolation callback, we do not need that flexibility. And since we already pass around the group descriptor for the memory control group isolation case, just use it to decide which one of the two isolator functions to use. The decisions can be merged into nearby branches, so no extra cost there. In fact, we save the indirect calls. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
This scan control is abused to communicate a return value from shrink_zones(). Write this idiomatically and remove the knob. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
If vmscan is under lumpy reclaim mode, it have to ignore referenced bit for making contenious free pages. but current page_check_references() doesn't. Fix it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Shaohua Li 提交于
get_scan_ratio() calculates percentage and if the percentage is < 1%, it will round percentage down to 0% and cause we completely ignore scanning anon/file pages to reclaim memory even the total anon/file pages are very big. To avoid underflow, we don't use percentage, instead we directly calculate how many pages should be scaned. In this way, we should get several scanned pages for < 1% percent. This has some benefits: 1. increase our calculation precision 2. making our scan more smoothly. Without this, if percent[x] is underflow, shrink_zone() doesn't scan any pages and suddenly it scans all pages when priority is zero. With this, even priority isn't zero, shrink_zone() gets chance to scan some pages. Note, this patch doesn't really change logics, but just increase precision. For system with a lot of memory, this might slightly changes behavior. For example, in a sequential file read workload, without the patch, we don't swap any anon pages. With it, if anon memory size is bigger than 16G, we will see one anon page swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap is assumed to be 1024 which is common in this workload. So the impact sounds not a big deal. Signed-off-by: NShaohua Li <shaohua.li@intel.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Currently, vmscan.c defines the isolation modes for __isolate_lru_page(). Memory compaction needs access to these modes for isolating pages for migration. This patch exports them. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NChristoph Lameter <cl@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miao Xie 提交于
Before applying this patch, cpuset updates task->mems_allowed and mempolicy by setting all new bits in the nodemask first, and clearing all old unallowed bits later. But in the way, the allocator may find that there is no node to alloc memory. The reason is that cpuset rebinds the task's mempolicy, it cleans the nodes which the allocater can alloc pages on, for example: (mpol: mempolicy) task1 task1's mpol task2 alloc page 1 alloc on node0? NO 1 1 change mems from 1 to 0 1 rebind task1's mpol 0-1 set new bits 0 clear disallowed bits alloc on node1? NO 0 ... can't alloc page goto oom This patch fixes this problem by expanding the nodes range first(set newly allowed bits) and shrink it lazily(clear newly disallowed bits). So we use a variable to tell the write-side task that read-side task is reading nodemask, and the write-side task clears newly disallowed nodes after read-side task ends the current memory allocation. [akpm@linux-foundation.org: fix spello] Signed-off-by: NMiao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Paul Menage <menage@google.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Ravikiran Thirumalai <kiran@scalex86.org> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Andi Kleen <andi@firstfloor.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 07 4月, 2010 1 次提交
-
-
由 KOSAKI Motohiro 提交于
Shaohua Li reported his tmpfs streaming I/O test can lead to make oom. The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are 6 copies of kernel source and the test does kbuild for each copy. His investigation shows the test has a lot of rotated anon pages and quite few file pages, so get_scan_ratio calculates percent[0] (i.e. scanning percent for anon) to be zero. Actually the percent[0] shoule be a big value, but our calculation round it to zero. Although before commit 84b18490 ("vmscan: get_scan_ratio() cleanup") , we have the same problem too. But the old logic can rescue percent[0]==0 case only when priority==0. It had hided the real issue. I didn't think merely streaming io can makes percent[0]==0 && priority==0 situation. but I was wrong. So, definitely we have to fix such tmpfs streaming io issue. but anyway I revert the regression commit at first. This reverts commit 84b18490. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reported-by: NShaohua Li <shaohua.li@intel.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 30 3月, 2010 1 次提交
-
-
由 Tejun Heo 提交于
include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: NTejun Heo <tj@kernel.org> Guess-its-ok-by: NChristoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
-
- 07 3月, 2010 7 次提交
-
-
由 Johannes Weiner 提交于
The VM currently assumes that an inactive, mapped and referenced file page is in use and promotes it to the active list. However, every mapped file page starts out like this and thus a problem arises when workloads create a stream of such pages that are used only for a short time. By flooding the active list with those pages, the VM quickly gets into trouble finding eligible reclaim canditates. The result is long allocation latencies and eviction of the wrong pages. This patch reuses the PG_referenced page flag (used for unmapped file pages) to implement a usage detection that scales with the speed of LRU list cycling (i.e. memory pressure). If the scanner encounters those pages, the flag is set and the page cycled again on the inactive list. Only if it returns with another page table reference it is activated. Otherwise it is reclaimed as 'not recently used cache'. This effectively changes the minimum lifetime of a used-once mapped file page from a full memory cycle to an inactive list cycle, which allows it to occur in linear streams without affecting the stable working set of the system. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
page_mapping_inuse() is a historic predicate function for pages that are about to be reclaimed or deactivated. According to it, a page is in use when it is mapped into page tables OR part of swap cache OR backing an mmapped file. This function is used in combination with page_referenced(), which checks for young bits in ptes and the page descriptor itself for the PG_referenced bit. Thus, checking for unmapped swap cache pages is meaningless as PG_referenced is not set for anonymous pages and unmapped pages do not have young ptes. The test makes no difference. Protecting file pages that are not by themselves mapped but are part of a mapped file is also a historic leftover for short-lived things like the exec() code in libc. However, the VM now does reference accounting and activation of pages at unmap time and thus the special treatment on reclaim is obsolete. This patch drops page_mapping_inuse() and switches the two callsites to use page_mapped() directly. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
The used-once mapped file page detection patchset. It is meant to help workloads with large amounts of shortly used file mappings, like rtorrent hashing a file or git when dealing with loose objects (git gc on a bigger site?). Right now, the VM activates referenced mapped file pages on first encounter on the inactive list and it takes a full memory cycle to reclaim them again. When those pages dominate memory, the system no longer has a meaningful notion of 'working set' and is required to give up the active list to make reclaim progress. Obviously, this results in rather bad scanning latencies and the wrong pages being reclaimed. This patch makes the VM be more careful about activating mapped file pages in the first place. The minimum granted lifetime without another memory access becomes an inactive list cycle instead of the full memory cycle, which is more natural given the mentioned loads. This test resembles a hashing rtorrent process. Sequentially, 32MB chunks of a file are mapped into memory, hashed (sha1) and unmapped again. While this happens, every 5 seconds a process is launched and its execution time taken: python2.4 -c 'import pydoc' old: max=2.31s mean=1.26s (0.34) new: max=1.25s mean=0.32s (0.32) find /etc -type f old: max=2.52s mean=1.44s (0.43) new: max=1.92s mean=0.12s (0.17) vim -c ':quit' old: max=6.14s mean=4.03s (0.49) new: max=3.48s mean=2.41s (0.25) mplayer --help old: max=8.08s mean=5.74s (1.02) new: max=3.79s mean=1.32s (0.81) overall hash time (stdev): old: time=1192.30 (12.85) thruput=25.78mb/s (0.27) new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%) I also tested kernbench with regular IO streaming in the background to see whether the delayed activation of frequently used mapped file pages had a negative impact on performance in the presence of pressure on the inactive list. The patch made no significant difference in timing, neither for kernbench nor for the streaming IO throughput. The first patch submission raised concerns about the cost of the extra faults for actually activated pages on machines that have no hardware support for young page table entries. I created an artificial worst case scenario on an ARM machine with around 300MHz and 64MB of memory to figure out the dimensions involved. The test would mmap a file of 20MB, then 1. touch all its pages to fault them in 2. force one full scan cycle on the inactive file LRU -- old: mapping pages activated -- new: mapping pages inactive 3. touch the mapping pages again -- old and new: fault exceptions to set the young bits 4. force another full scan cycle on the inactive file LRU 5. touch the mapping pages one last time -- new: fault exceptions to set the young bits The test showed an overall increase of 6% in time over 100 iterations of the above (old: ~212sec, new: ~225sec). 13 secs total overhead / (100 * 5k pages), ignoring the execution time of the test itself, makes for about 25us overhead for every page that gets actually activated. Note: 1. File mapping the size of one third of main memory, _completely_ in active use across memory pressure - i.e., most pages referenced within one LRU cycle. This should be rare to non-existant, especially on such embedded setups. 2. Many huge activation batches. Those batches only occur when the working set fluctuates. If it changes completely between every full LRU cycle, you have problematic reclaim overhead anyway. 3. Access of activated pages at maximum speed: sequential loads from every single page without doing anything in between. In reality, the extra faults will get distributed between actual operations on the data. So even if a workload manages to get the VM into the situation of activating a third of memory in one go on such a setup, it will take 2.2 seconds instead 2.1 without the patch. Comparing the numbers (and my user-experience over several months), I think this change is an overall improvement to the VM. Patch 1 is only refactoring to break up that ugly compound conditional in shrink_page_list() and make it easy to document and add new checks in a readable fashion. Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not strictly related to #3, but it was in the original submission and is a net simplification, so I kept it. Patch 3 implements used-once detection of mapped file pages. This patch: Moving the big conditional into its own predicate function makes the code a bit easier to read and allows for better commenting on the checks one-by-one. This is just cleaning up, no semantics should have been changed. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: OSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
commit e815af95 ("change all_unreclaimable zone member to flags") changed all_unreclaimable member to bit flag. But it had an undesireble side effect. free_one_page() is one of most hot path in linux kernel and increasing atomic ops in it can reduce kernel performance a bit. Thus, this patch revert such commit partially. at least all_unreclaimable shouldn't share memory word with other zone flags. [akpm@linux-foundation.org: fix patch interaction] Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Huang Shijie <shijie8@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Commit cf40bd16 ("lockdep: annotate reclaim context") introduced reclaim context annotation. But it didn't annotate zone reclaim. This patch do it. The point is, commit cf40bd16 annotate __alloc_pages_direct_reclaim but zone-reclaim doesn't use __alloc_pages_direct_reclaim. current call graph is __alloc_pages_nodemask get_page_from_freelist zone_reclaim() __alloc_pages_slowpath __alloc_pages_direct_reclaim try_to_free_pages Actually, if zone_reclaim_mode=1, VM never call __alloc_pages_direct_reclaim in usual VM pressure. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Acked-by: NNick Piggin <npiggin@suse.de> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
The get_scan_ratio() should have all scan-ratio related calculations. Thus, this patch move some calculation into get_scan_ratio. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Minchan Kim 提交于
Kswapd checks that zone has sufficient pages free via zone_watermark_ok(). If any zone doesn't have enough pages, we set all_zones_ok to zero. !all_zone_ok makes kswapd retry rather than sleeping. I think the watermark check before shrink_zone() is pointless. Only after kswapd has tried to shrink the zone is the check meaningful. Move the check to after the call to shrink_zone(). [akpm@linux-foundation.org: fix comment, layout] Signed-off-by: NMinchan Kim <minchan.kim@gmail.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 1月, 2010 1 次提交
-
-
由 KOSAKI Motohiro 提交于
Commit f50de2d3 (vmscan: have kswapd sleep for a short interval and double check it should be asleep) can cause kswapd to enter an infinite loop if running on a single-CPU system. If all zones are unreclaimble, sleeping_prematurely return 1 and kswapd will call balance_pgdat() again. but it's totally meaningless, balance_pgdat() doesn't anything against unreclaimable zone! Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Reported-by: NWill Newton <will.newton@gmail.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Reviewed-by: NRik van Riel <riel@redhat.com> Tested-by: NWill Newton <will.newton@gmail.com> Reviewed-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 12月, 2009 11 次提交
-
-
由 Huang Shijie 提交于
Simplify the code for shrink_inactive_list(). Signed-off-by: NHuang Shijie <shijie8@gmail.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
In AIM7 runs, recent kernels start swapping out anonymous pages well before they should. This is due to shrink_list falling through to shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we really wanted to do is pre-age some anonymous pages to give them extra time to be referenced while on the inactive list. The obvious fix is to make sure that shrink_list does not fall through to scanning/reclaiming inactive pages when we called it to scan one of the active lists. This change should be safe because the loop in shrink_zone ensures that we will still shrink the anon and file inactive lists whenever we should. [kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()] Reported-by: NLarry Woodman <lwoodman@redhat.com> Signed-off-by: NRik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Tomasz Chmielewski <mangoo@wpkg.org> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Fix small inconsistent of ">" and ">=". Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Now, All caller of reclaim use swap_cluster_max as SWAP_CLUSTER_MAX. Then, we can remove it perfectly. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
In old days, we didn't have sc.nr_to_reclaim and it brought sc.swap_cluster_max misuse. huge sc.swap_cluster_max might makes unnecessary OOM risk and no performance benefit. Now, we can stop its insane thing. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
shrink_all_zone() was introduced by commit d6277db4 (swsusp: rework memory shrinker) for hibernate performance improvement. and sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing memory for suspend). commit a06fe4d307 said Without the patch: Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!) Freed 88563 pages in 14719 jiffies = 23.50 MB/s Freed 205734 pages in 32389 jiffies = 24.81 MB/s With the patch: Freed 68252 pages in 496 jiffies = 537.52 MB/s Freed 116464 pages in 569 jiffies = 798.54 MB/s Freed 209699 pages in 705 jiffies = 1161.89 MB/s At that time, their patch was pretty worth. However, Modern Hardware trend and recent VM improvement broke its worth. From several reason, I think we should remove shrink_all_zones() at all. detail: 1) Old days, shrink_zone()'s slowness was mainly caused by stupid io-throttle at no i/o congestion. but current shrink_zone() is sane, not slow. 2) shrink_all_zone() try to shrink all pages at a time. but it doesn't works fine on numa system. example) System has 4GB memory and each node have 2GB. and hibernate need 1GB. optimal) steal 500MB from each node. shrink_all_zones) steal 1GB from node-0. Oh, Cache balancing logic was broken. ;) Unfortunately, Desktop system moved ahead NUMA at nowadays. (Side note, if hibernate require 2GB, shrink_all_zones() never success on above machine) 3) if the node has several I/O flighting pages, shrink_all_zones() makes pretty bad result. schenario) hibernate need 1GB 1) shrink_all_zones() try to reclaim 1GB from Node-0 2) but it only reclaimed 990MB 3) stupidly, shrink_all_zones() try to reclaim 1GB from Node-1 4) it reclaimed 990MB Oh, well. it reclaimed twice much than required. In the other hand, current shrink_zone() has sane baling out logic. then, it doesn't make overkill reclaim. then, we lost shrink_zones()'s risk. 4) SplitLRU VM always keep active/inactive ratio very carefully. inactive list only shrinking break its assumption. it makes unnecessary OOM risk. it obviously suboptimal. Now, shrink_all_memory() is only the wrapper function of do_try_to_free_pages(). it bring good reviewability and debuggability, and solve above problems. side note: Reclaim logic unificication makes two good side effect. - Fix recursive reclaim bug on shrink_all_memory(). it did forgot to use PF_MEMALLOC. it mean the system be able to stuck into deadlock. - Now, shrink_all_memory() got lockdep awareness. it bring good debuggability. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NRafael J. Wysocki <rjw@sisk.pl> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Currently, sc.scap_cluster_max has double meanings. 1) reclaim batch size as isolate_lru_pages()'s argument 2) reclaim baling out thresolds The two meanings pretty unrelated. Thus, Let's separate it. this patch doesn't change any behavior. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NRik van Riel <riel@redhat.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
If reclaim fails to make sufficient progress, the priority is raised. Once the priority is higher, kswapd starts waiting on congestion. However, if the zone is below the min watermark then kswapd needs to continue working without delay as there is a danger of an increased rate of GFP_ATOMIC allocation failure. This patch changes the conditions under which kswapd waits on congestion by only going to sleep if the min watermarks are being met. [mel@csn.ul.ie: add stats to track how relevant the logic is] [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters] Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NMel Gorman <mel@csn.ul.ie> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
After kswapd balances all zones in a pgdat, it goes to sleep. In the event of no IO congestion, kswapd can go to sleep very shortly after the high watermark was reached. If there are a constant stream of allocations from parallel processes, it can mean that kswapd went to sleep too quickly and the high watermark is not being maintained for sufficient length time. This patch makes kswapd go to sleep as a two-stage process. It first tries to sleep for HZ/10. If it is woken up by another process or the high watermark is no longer met, it's considered a premature sleep and kswapd continues work. Otherwise it goes fully to sleep. This adds more counters to distinguish between fast and slow breaches of watermarks. A "fast" premature sleep is one where the low watermark was hit in a very short time after kswapd going to sleep. A "slow" premature sleep indicates that the high watermark was breached after a very short interval. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Cc: Frans Pop <elendil@planet.nl> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vincent Li 提交于
Commit 543ade1f ("Streamline generic_file_* interfaces and filemap cleanups") removed generic_file_write() in filemap. Change the comment in vmscan pageout() to __generic_file_aio_write(). Signed-off-by: NVincent Li <macli@brc.ubc.ca> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if there are no present pages left. In such a situation, kswapd must also be stopped since it has nothing left to do. Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Randy Dunlap <randy.dunlap@oracle.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Cc: Adam Litke <agl@us.ibm.com> Cc: Andy Whitcroft <apw@canonical.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 29 10月, 2009 3 次提交
-
-
由 Johannes Weiner 提交于
Isolators putting a page back to the LRU do not hold the page lock, and if the page is mlocked, another thread might munlock it concurrently. Expecting this, the putback code re-checks the evictability of a page when it just moved it to the unevictable list in order to correct its decision. The problem, however, is that ordering is not garuanteed between setting PG_lru when moving the page to the list and checking PG_mlocked afterwards: #0: #1 spin_lock() if (TestClearPageMlocked()) if (PageLRU()) move to evictable list SetPageLRU() spin_unlock() if (!PageMlocked()) move to evictable list The PageMlocked() check may get reordered before SetPageLRU() in #0, resulting in #0 not moving the still mlocked page, and in #1 failing to isolate and move the page as well. The page is now stranded on the unevictable list. The race condition is very unlikely. The consequence currently is one page falling off the reclaim grid and eventually getting freed with PG_unevictable set, which triggers a warning in the page allocator. TestClearPageMlocked() in #1 already provides full memory barrier semantics. This patch adds an explicit full barrier to force ordering between SetPageLRU() and PageMlocked() so that either one of the competitors rescues the page. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wu Fengguang 提交于
It is possible to have !Anon but SwapBacked pages, and some apps could create huge number of such pages with MAP_SHARED|MAP_ANONYMOUS. These pages go into the ANON lru list, and hence shall not be protected: we only care mapped executable files. Failing to do so may trigger OOM. Tested-by: NChristian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: NRik van Riel <riel@redhat.com> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
commit 8aa7e847 (Fix congestion_wait() sync/async vs read/write confusion) replace WRITE with BLK_RW_ASYNC. Unfortunately, concurrent mm development made the unchanged place accidentally. This patch fixes it too. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NJens Axboe <jens.axboe@oracle.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 26 9月, 2009 1 次提交
-
-
由 Jens Axboe 提交于
Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
-
- 24 9月, 2009 2 次提交
-
-
由 Alexey Dobriyan 提交于
It's unused. It isn't needed -- read or write flag is already passed and sysctl shouldn't care about the rest. It _was_ used in two places at arch/frv for some reason. Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "David S. Miller" <davem@davemloft.net> Cc: James Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Balbir Singh 提交于
Implement reclaim from groups over their soft limit Permit reclaim from memory cgroups on contention (via the direct reclaim path). memory cgroup soft limit reclaim finds the group that exceeds its soft limit by the largest number of pages and reclaims pages from it and then reinserts the cgroup into its correct place in the rbtree. Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long loops in case all swap is turned off. The code has been refactored and the loop check (loop < 2) has been enhanced for soft limits. For soft limits, we try to do more targetted reclaim. Instead of bailing out after two loops, the routine now reclaims memory proportional to the size by which the soft limit is exceeded. The proportion has been empirically determined. [akpm@linux-foundation.org: build fix] [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling] [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop] Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 9月, 2009 2 次提交
-
-
由 Vincent Li 提交于
Commit 084f71ae(kill page_queue_congested()) removed page_queue_congested(). Remove the page_queue_congested() comment in vmscan pageout() too. Signed-off-by: NVincent Li <macli@brc.ubc.ca> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wu Fengguang 提交于
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in which case shrink_list() _still_ calls isolate_pages() with the much larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan rate by up to 32 times. For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4. So when shrink_zone() expects to scan 4 pages in the active/inactive list, the active list will be scanned 4 pages, while the inactive list will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break the balance between the two lists. It can further impact the scan of anon active list, due to the anon active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone(): inactive anon list over scanned => inactive_anon_is_low() == TRUE => shrink_active_list() => active anon list over scanned So the end result may be - anon inactive => over scanned - anon active => over scanned (maybe not as much) - file inactive => over scanned - file active => under scanned (relatively) The accesses to nr_saved_scan are not lock protected and so not 100% accurate, however we can tolerate small errors and the resulted small imbalanced scan rates between zones. Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-