- 17 10月, 2007 40 次提交
-
-
由 Christoph Lameter 提交于
We need the check for a node with cpu in zone reclaim. Zone reclaim will not allow remote zone reclaim if a node has a cpu. [Lee.Schermerhorn@hp.com: Move setup of N_CPU node state mask] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Online nodes now may have no memory. The checks and initialization must therefore be changed to no longer use the online functions. This will correctly initialize the interleave on bootup to only target nodes with memory and will make sys_move_pages return an error when a page is to be moved to a memoryless node. Similarly we will get an error if MPOL_BIND and MPOL_INTERLEAVE is used on a memoryless node. These are somewhat new semantics. So far one could specify memoryless nodes and we would maybe do the right thing and just ignore the node (or we'd do something strange like with MPOL_INTERLEAVE). If we want to allow the specification of memoryless nodes via memory policies then we need to keep checking for online nodes. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Simply switch all for_each_online_node to for_each_node_state(NORMAL_MEMORY). That way SLUB only operates on nodes with regular memory. Any allocation attempt on a memoryless node or a node with just highmem will fall whereupon SLUB will fetch memory from a nearby node (depending on how memory policies and cpuset describe fallback). Signed-off-by: NChristoph Lameter <clameter@sgi.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Slab should not allocate control structures for nodes without memory. This may seem to work right now but its unreliable since not all allocations can fall back due to the use of GFP_THISNODE. Switching a few for_each_online_node's to N_NORMAL_MEMORY will allow us to only allocate for nodes that have regular memory. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
A node without memory does not need a kswapd. So use the memory map instead of the online map when starting kswapd. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
constrained_alloc() builds its own memory map for nodes with memory. We have that available in N_HIGH_MEMORY now. So simplify the code. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NNishanth Aravamudan <nacc@us.ibm.com> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
MPOL_INTERLEAVE currently simply loops over all nodes. Allocations on memoryless nodes will be redirected to nodes with memory. This results in an imbalance because the neighboring nodes to memoryless nodes will get significantly more interleave hits that the rest of the nodes on the system. We can avoid this imbalance by clearing the nodes in the interleave node set that have no memory. If we use the node map of the memory nodes instead of the online nodes then we have only the nodes we want. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
It is necessary to know if nodes have memory since we have recently begun to add support for memoryless nodes. For that purpose we introduce a two new node states: N_HIGH_MEMORY and N_NORMAL_MEMORY. A node has its bit in N_HIGH_MEMORY set if it has any memory regardless of the type of mmemory. If a node has memory then it has at least one zone defined in its pgdat structure that is located in the pgdat itself. A node has its bit in N_NORMAL_MEMORY set if it has a lower zone than ZONE_HIGHMEM. This means it is possible to allocate memory that is not subject to kmap. N_HIGH_MEMORY and N_NORMAL_MEMORY can then be used in various places to insure that we do the right thing when we encounter a memoryless node. [akpm@linux-foundation.org: build fix] [Lee.Schermerhorn@hp.com: update N_HIGH_MEMORY node state for memory hotadd] [y-goto@jp.fujitsu.com: Fix memory hotplug + sparsemem build] Signed-off-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com> Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NYasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: NPaul Mundt <lethal@linux-sh.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Why do we need to support memoryless nodes? KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > For fujitsu, problem is called "empty" node. > > When ACPI's SRAT table includes "possible nodes", ia64 bootstrap(acpi_numa_init) > creates nodes, which includes no memory, no cpu. > > I tried to remove empty-node in past, but that was denied. > It was because we can hot-add cpu to the empty node. > (node-hotplug triggered by cpu is not implemented now. and it will be ugly.) > > > For HP, (Lee can comment on this later), they have memory-less-node. > As far as I hear, HP's machine can have following configration. > > (example) > Node0: CPU0 memory AAA MB > Node1: CPU1 memory AAA MB > Node2: CPU2 memory AAA MB > Node3: CPU3 memory AAA MB > Node4: Memory XXX GB > > AAA is very small value (below 16MB) and will be omitted by ia64 bootstrap. > After boot, only Node 4 has valid memory (but have no cpu.) > > Maybe this is memory-interleave by firmware config. Christoph Lameter <clameter@sgi.com> wrote: > Future SGI platforms (actually also current one can have but nothing like > that is deployed to my knowledge) have nodes with only cpus. Current SGI > platforms have nodes with just I/O that we so far cannot manage in the > core. So the arch code maps them to the nearest memory node. Lee Schermerhorn <Lee.Schermerhorn@hp.com> wrote: > For the HP platforms, we can configure each cell with from 0% to 100% > "cell local memory". When we configure with <100% CLM, the "missing > percentages" are interleaved by hardware on a cache-line granularity to > improve bandwidth at the expense of latency for numa-challenged > applications [and OSes, but not our problem ;-)]. When we boot Linux on > such a config, all of the real nodes have no memory--it all resides in a > single interleaved pseudo-node. > > When we boot Linux on a 100% CLM configuration [== NUMA], we still have > the interleaved pseudo-node. It contains a few hundred MB stolen from > the real nodes to contain the DMA zone. [Interleaved memory resides at > phys addr 0]. The memoryless-nodes patches, along with the zoneorder > patches, support this config as well. > > Also, when we boot a NUMA config with the "mem=" command line, > specifying less memory than actually exists, Linux takes the excluded > memory "off the top" rather than distributing it across the nodes. This > can result in memoryless nodes, as well. > This patch: Preparation for memoryless node patches. Provide a generic way to keep nodemasks describing various characteristics of NUMA nodes. Remove the node_online_map and the node_possible map and realize the same functionality using two nodes stats: N_POSSIBLE and N_ONLINE. [Lee.Schermerhorn@hp.com: Initialize N_*_MEMORY and N_CPU masks for non-NUMA config] Signed-off-by: NChristoph Lameter <clameter@sgi.com> Tested-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NBob Picco <bob.picco@hp.com> Cc: Nishanth Aravamudan <nacc@us.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Mel Gorman <mel@skynet.ie> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Cc: "Serge E. Hallyn" <serge@hallyn.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
prepare/commit_write no longer returns AOP_TRUNCATED_PAGE since OCFS2 and GFS2 were converted to the new aops, so we can make some simplifications for that. [michal.k.k.piotrowski@gmail.com: fix warning] Signed-off-by: NNick Piggin <npiggin@suse.de> Cc: Michael Halcrow <mhalcrow@us.ibm.com> Cc: Mark Fasheh <mark.fasheh@oracle.com> Cc: Steven Whitehouse <swhiteho@redhat.com> Signed-off-by: NMichal Piotrowski <michal.k.k.piotrowski@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Rework the generic block "cont" routines to handle the new aops. Supporting cont_prepare_write would take quite a lot of code to support, so remove it instead (and we later convert all filesystems to use it). write_begin gets passed AOP_FLAG_CONT_EXPAND when called from generic_cont_expand, so filesystems can avoid the old hacks they used. Signed-off-by: NNick Piggin <npiggin@suse.de> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Implement new aops for some of the simpler filesystems. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Restore the KERNEL_DS optimisation, especially helpful to the 2copy write path. This may be a pretty questionable gain in most cases, especially after the legacy 2copy write path is removed, but it doesn't cost much. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
These are intended to replace prepare_write and commit_write with more flexible alternatives that are also able to avoid the buffered write deadlock problems efficiently (which prepare_write is unable to do). [mark.fasheh@oracle.com: API design contributions, code review and fixes] [akpm@linux-foundation.org: various fixes] [dmonakhov@sw.ru: new aop block_write_begin fix] Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NMark Fasheh <mark.fasheh@oracle.com> Signed-off-by: NDmitriy Monakhov <dmonakhov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Add an iterator data structure to operate over an iovec. Add usercopy operators needed by generic_file_buffered_write, and convert that function over. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Modify the core write() code so that it won't take a pagefault while holding a lock on the pagecache page. There are a number of different deadlocks possible if we try to do such a thing: 1. generic_buffered_write 2. lock_page 3. prepare_write 4. unlock_page+vmtruncate 5. copy_from_user 6. mmap_sem(r) 7. handle_mm_fault 8. lock_page (filemap_nopage) 9. commit_write 10. unlock_page a. sys_munmap / sys_mlock / others b. mmap_sem(w) c. make_pages_present d. get_user_pages e. handle_mm_fault f. lock_page (filemap_nopage) 2,8 - recursive deadlock if page is same 2,8;2,8 - ABBA deadlock is page is different 2,6;b,f - ABBA deadlock if page is same The solution is as follows: 1. If we find the destination page is uptodate, continue as normal, but use atomic usercopies which do not take pagefaults and do not zero the uncopied tail of the destination. The destination is already uptodate, so we can commit_write the full length even if there was a partial copy: it does not matter that the tail was not modified, because if it is dirtied and written back to disk it will not cause any problems (uptodate *means* that the destination page is as new or newer than the copy on disk). 1a. The above requires that fault_in_pages_readable correctly returns access information, because atomic usercopies cannot distinguish between non-present pages in a readable mapping, from lack of a readable mapping. 2. If we find the destination page is non uptodate, unlock it (this could be made slightly more optimal), then allocate a temporary page to copy the source data into. Relock the destination page and continue with the copy. However, instead of a usercopy (which might take a fault), copy the data from the pinned temporary page via the kernel address space. (also, rename maxlen to seglen, because it was confusing) This increases the CPU/memory copy cost by almost 50% on the affected workloads. That will be solved by introducing a new set of pagecache write aops in a subsequent patch. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Hide some of the open-coded nr_segs tests into the iovec helpers. This is all to simplify generic_file_buffered_write, because that gets more complex in the next patch. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Quite a bit of code is used in maintaining these "cached pages" that are probably pretty unlikely to get used. It would require a narrow race where the page is inserted concurrently while this process is allocating a page in order to create the spare page. Then a multi-page write into an uncached part of the file, to make use of it. Next, the buffered write path (and others) uses its own LRU pagevec when it should be just using the per-CPU LRU pagevec (which will cut down on both data and code size cacheline footprint). Also, these private LRU pagevecs are emptied after just a very short time, in contrast with the per-CPU pagevecs that are persistent. Net result: 7.3 times fewer lru_lock acquisitions required to add the pages to pagecache for a bulk write (in 4K chunks). [this gets rid of some cond_resched() calls in readahead.c and mpage.c due to clashes in -mm. What put them there, and why? ] Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
If prepare_write fails with AOP_TRUNCATED_PAGE, or if commit_write fails, then we may have failed the write operation despite prepare_write having instantiated blocks past i_size. Fix this, and consolidate the trimming into one place. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Allow CONFIG_DEBUG_VM to switch off the prefaulting logic, to simulate the Makes the race much easier to hit. This is useful for demonstration and testing purposes, but is removed in a subsequent patch. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
Rename some variables and fix some types. Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
This reverts commit 6527c2bd, which fixed the following bug: When prefaulting in the pages in generic_file_buffered_write(), we only faulted in the pages for the firts segment of the iovec. If the second of successive segment described a mmapping of the page into which we're write()ing, and that page is not up-to-date, the fault handler tries to lock the already-locked page (to bring it up to date) and deadlocks. An exploit for this bug is in writev-deadlock-demo.c, in http://www.zip.com.au/~akpm/linux/patches/stuff/ext3-tools.tar.gz. (These demos assume blocksize < PAGE_CACHE_SIZE). The problem with this fix is that it takes the kernel back to doing a single prepare_write()/commit_write() per iovec segment. So in the worst case we'll run prepare_write+commit_write 1024 times where we previously would have run it once. The other problem with the fix is that it fix all the locking problems. <insert numbers obtained via ext3-tools's writev-speed.c here> And apparently this change killed NFS overwrite performance, because, I suppose, it talks to the server for each prepare_write+commit_write. So just back that patch out - we'll be fixing the deadlock by other means. Nick says: also it only ever actually papered over the bug, because after faulting in the pages, they might be unmapped or reclaimed. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
This reverts commit 81b0c871, which was a bugfix against 6527c2bd ("[PATCH] generic_file_buffered_write(): deadlock on vectored write"), which we also revert. Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Revert the patch from Neil Brown to optimise NFSD writev handling. Cc: Neil Brown <neilb@suse.de> Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hisashi Hifumi 提交于
While running some memory intensive load, system response deteriorated just after swap-out started. The cause of this problem is that when a PG_reclaim page is moved to the tail of the inactive LRU list in rotate_reclaimable_page(), lru_lock spin lock is acquired every page writeback . This deteriorates system performance and makes interrupt hold off time longer when swap-out started. Following patch solves this problem. I use pagevec in rotating reclaimable pages to mitigate LRU spin lock contention and reduce interrupt hold off time. I did a test that allocating and touching pages in multiple processes, and pinging to the test machine in flooding mode to measure response under memory intensive load. The test result is: -2.6.23-rc5 --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 17.746/0.092 ms -2.6.23-rc5-patched --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms Max round-trip-time was improved. The test machine spec is that 4CPU(3.16GHz, Hyper-threading enabled) 8GB memory , 8GB swap. I did ping test again to observe performance deterioration caused by taking a ref. -2.6.23-rc6-with-modifiedpatch --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms The result for my original patch is as follows. -2.6.23-rc5-with-originalpatch --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms The influence to response was small. [akpm@linux-foundation.org: fix uninitalised var warning] [hugh@veritas.com: fix locking] [randy.dunlap@oracle.com: fix function declaration] [hugh@veritas.com: fix BUG at include/linux/mm.h:220!] [hugh@veritas.com: kill redundancy in rotate_reclaimable_page] [hugh@veritas.com: move_tail_pages into lru_add_drain] Signed-off-by: NHisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lee Schermerhorn 提交于
Allow an application to query the memories allowed by its context. Updated numa_memory_policy.txt to mention that applications can use this to obtain allowed memories for constructing valid policies. TODO: update out-of-tree libnuma wrapper[s], or maybe add a new wrapper--e.g., numa_get_mems_allowed() ? Also, update numa syscall man pages. Tested with memtoy V>=0.13. Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Rik van Riel 提交于
The current VM can get itself into trouble fairly easily on systems with a small ZONE_HIGHMEM, which is common on i686 computers with 1GB of memory. On one side, page_alloc() will allocate down to zone->pages_low, while on the other side, kswapd() and balance_pgdat() will try to free memory from every zone, until every zone has more free pages than zone->pages_high. Highmem can be filled up to zone->pages_low with page tables, ramfs, vmalloc allocations and other unswappable things quite easily and without many bad side effects, since we still have a huge ZONE_NORMAL to do future allocations from. However, as long as the number of free pages in the highmem zone is below zone->pages_high, kswapd will continue swapping things out from ZONE_NORMAL, too! Sami Farin managed to get his system into a stage where kswapd had freed about 700MB of low memory and was still "going strong". The attached patch will make kswapd stop paging out data from zones when there is more than enough memory free. We do go above zone->pages_high in order to keep pressure between zones equal in normal circumstances, but the patch should prevent the kind of excesses that made Sami's computer totally unusable. Signed-off-by: NRik van Riel <riel@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jesper Juhl 提交于
vmalloc() returns a void pointer, so there's no need to cast its return value in mm/page_alloc.c::zone_wait_table_init(). Signed-off-by: NJesper Juhl <jesper.juhl@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
A NULL pointer means that the object was not allocated. One cannot determine the size of an object that has not been allocated. Currently we return 0 but we really should BUG() on attempts to determine the size of something nonexistent. krealloc() interprets NULL to mean a zero sized object. Handle that separately in krealloc(). Signed-off-by: NChristoph Lameter <clameter@sgi.com> Acked-by: NPekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dean Nelson 提交于
The calculation of pgoff in do_linear_fault() should use PAGE_SHIFT and not PAGE_CACHE_SHIFT since vma->vm_pgoff is in units of PAGE_SIZE and not PAGE_CACHE_SIZE. At the moment linux/pagemap.h has PAGE_CACHE_SHIFT defined as PAGE_SHIFT, but should that ever change this calculation would break. Signed-off-by: NDean Nelson <dcn@sgi.com> Acked-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Satyam Sharma 提交于
Considering kfree(NULL) would normally occur only in error paths and kfree(ZERO_SIZE_PTR) is uncommon as well, so let's use unlikely() for the condition check in SLUB's and SLOB's kfree() to optimize for the common case. SLAB has this already. Signed-off-by: NSatyam Sharma <satyam@infradead.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
__add_to_swap_cache unconditionally sets the page locked, which can be a bit alarming to the unsuspecting reader: in the code paths where the page is visible to other CPUs, the page should be (and is) already locked. Instead, just add a check to ensure the page is locked here, and teach the one path relying on the old behaviour to call SetPageLocked itself. [hugh@veritas.com: locking fix] Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
find_lock_page does not need to recheck ->index because if the page is in the right mapping then the index must be the same. Also, tree_lock does not need to be retaken after the page is locked in order to test that ->mapping has not changed, because holding the page lock pins its mapping. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
Probing pages and radix_tree_tagged are lockless operations with the lockless radix-tree. Convert these users to RCU locking rather than using tree_lock. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Piggin 提交于
The commit b5810039 contains the note A last caveat: the ZERO_PAGE is now refcounted and managed with rmap (and thus mapcounted and count towards shared rss). These writes to the struct page could cause excessive cacheline bouncing on big systems. There are a number of ways this could be addressed if it is an issue. And indeed this cacheline bouncing has shown up on large SGI systems. There was a situation where an Altix system was essentially livelocked tearing down ZERO_PAGE pagetables when an HPC app aborted during startup. This situation can be avoided in userspace, but it does highlight the potential scalability problem with refcounting ZERO_PAGE, and corner cases where it can really hurt (we don't want the system to livelock!). There are several broad ways to fix this problem: 1. add back some special casing to avoid refcounting ZERO_PAGE 2. per-node or per-cpu ZERO_PAGES 3. remove the ZERO_PAGE completely I will argue for 3. The others should also fix the problem, but they result in more complex code than does 3, with little or no real benefit that I can see. Why? Inserting a ZERO_PAGE for anonymous read faults appears to be a false optimisation: if an application is performance critical, it would not be doing many read faults of new memory, or at least it could be expected to write to that memory soon afterwards. If cache or memory use is critical, it should not be working with a significant number of ZERO_PAGEs anyway (a more compact representation of zeroes should be used). As a sanity check -- mesuring on my desktop system, there are never many mappings to the ZERO_PAGE (eg. 2 or 3), thus memory usage here should not increase much without it. When running a make -j4 kernel compile on my dual core system, there are about 1,000 mappings to the ZERO_PAGE created per second, but about 1,000 ZERO_PAGE COW faults per second (less than 1 ZERO_PAGE mapping per second is torn down without being COWed). So removing ZERO_PAGE will save 1,000 page faults per second when running kbuild, while keeping it only saves less than 1 page clearing operation per second. 1 page clear is cheaper than a thousand faults, presumably, so there isn't an obvious loss. Neither the logical argument nor these basic tests give a guarantee of no regressions. However, this is a reasonable opportunity to try to remove the ZERO_PAGE from the pagefault path. If it is found to cause regressions, we can reintroduce it and just avoid refcounting it. The /dev/zero ZERO_PAGE usage and TLB tricks also get nuked. I don't see much use to them except on benchmarks. All other users of ZERO_PAGE are converted just to use ZERO_PAGE(0) for simplicity. We can look at replacing them all and maybe ripping out ZERO_PAGE completely when we are more satisfied with this solution. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus "snif" Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
This gets rid of all kmalloc caches larger than page size. A kmalloc request larger than PAGE_SIZE > 2 is going to be passed through to the page allocator. This works both inline where we will call __get_free_pages instead of kmem_cache_alloc and in __kmalloc. kfree is modified to check if the object is in a slab page. If not then the page is freed via the page allocator instead. Roughly similar to what SLOB does. Advantages: - Reduces memory overhead for kmalloc array - Large kmalloc operations are faster since they do not need to pass through the slab allocator to get to the page allocator. - Performance increase of 10%-20% on alloc and 50% on free for PAGE_SIZEd allocations. SLUB must call page allocator for each alloc anyways since the higher order pages which that allowed avoiding the page alloc calls are not available in a reliable way anymore. So we are basically removing useless slab allocator overhead. - Large kmallocs yields page aligned object which is what SLAB did. Bad things like using page sized kmalloc allocations to stand in for page allocate allocs can be transparently handled and are not distinguishable from page allocator uses. - Checking for too large objects can be removed since it is done by the page allocator. Drawbacks: - No accounting for large kmalloc slab allocations anymore - No debugging of large kmalloc slab allocations. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Convert some 'unsigned long' to pgoff_t. Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
- remove unused local next_index in do_generic_mapping_read() - remove a redudant page_cache_read() declaration Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
Remove VM_MAX_CACHE_HIT, MAX_RA_PAGES and MIN_RA_PAGES. Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Fengguang Wu 提交于
The local copy of ra in do_generic_mapping_read() can now go away. It predates readanead(req_size). In a time when the readahead code was called on *every* single page. Hence a local has to be made to reduce the chance of the readahead state being overwritten by a concurrent reader. More details in: Linux: Random File I/O Regressions In 2.6 <http://kerneltrap.org/node/3039> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: NFengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-