- 01 10月, 2005 1 次提交
-
-
由 Linus Torvalds 提交于
As requested by Thomas Gleixner <tglx@linutronix.de>: "5d3d0f77 breaks a couple of ARM boards, which depend on the historical bootmem allocation order. There is a cleaner solution around to remove the pgdat list completely, but this is a topic for post 2.6.14 Andi signalled ACK already." Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 28 9月, 2005 2 次提交
-
-
由 Alok N Kataria 提交于
In kmalloc_node we are checking if the allocation is for the same node when interrupts are "on". This may lead to an allocation on another node than intended. This patch just shifts the check for the current node in __cache_alloc_node when interrupts are disabled. Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com> Acked-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Nick Piggin 提交于
Move the ZERO_PAGE remapping complexity to the move_pte macro in asm-generic, have it conditionally depend on __HAVE_ARCH_MULTIPLE_ZERO_PAGE, which gets defined for MIPS. For architectures without __HAVE_ARCH_MULTIPLE_ZERO_PAGE, move_pte becomes a noop. From: Hugh Dickins <hugh@veritas.com> Fix nasty little bug we've missed in Nick's mremap move ZERO_PAGE patch. The "pte" at that point may be a swap entry or a pte_file entry: we must check pte_present before perhaps corrupting such an entry. Patch below against 2.6.14-rc2-mm1, but the same bug is in 2.6.14-rc2's mm/mremap.c, and more dangerous there since it's affecting all arches: I think the safest course is to send Nick's patch and Yoichi's build fix and this fix (build tested) on to Linus - so only MIPS can be affected. Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 24 9月, 2005 1 次提交
-
-
由 Andrew Morton 提交于
As davem points out, this wasn't such a great idea. There may be some code which does: size = 1024*1024; while (kmalloc(size, ...) == 0) size /= 2; which will now explode. Cc: "David S. Miller" <davem@davemloft.net> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 23 9月, 2005 4 次提交
-
-
由 Rob Landley 提交于
Problem: In some circumstances, bd_claim() is returning the wrong error code. If we try to swapon an unused block device that isn't swap formatted, we get -EINVAL. But if that same block device is already mounted, we instead get -EBUSY, even though it still isn't a valid swap device. This issue came up on the busybox list trying to get the error message from "swapon -a" right. If a swap device is already enabled, we get -EBUSY, and we shouldn't report this as an error. But we can't distinguish the two -EBUSY conditions, which are very different errors. In the code, bd_claim() returns either 0 or -EBUSY, but in this case busy means "somebody other than sys_swapon has already claimed this", and _that_ means this block device can't be a valid swap device. So return -EINVAL there. Signed-off-by: NRob Landley <rob@landley.net> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
I had an issue on ia64 where I got a bug in kernel/workqueue because kzalloc returned a NULL pointer due to the task structure getting too big for the slab allocator. Usually these cases are caught by the kmalloc macro in include/linux/slab.h. Compilation will fail if a too big value is passed to kmalloc. However, kzalloc uses __kmalloc which has no check for that. This patch makes __kmalloc bug if a too large entity is requested. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
The numa slab allocator may allocate pages from foreign nodes onto the lists for a particular node if a node runs out of memory. Inspecting the slab->nodeid field will not reflect that the page is now in use for the slabs of another node. This patch fixes that issue by adding a node field to free_block so that the caller can indicate which node currently uses a slab. Also removes the check for the current node from kmalloc_cache_node since the process may shift later to another node which may lead to an allocation on another node than intended. Signed-off-by: NChristoph Lameter <clameter@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Ivan Kokshaysky 提交于
It is essential that index_of() be inlined. But alpha undoes the gcc inlining hackery and index_of() ends up out-of-line. So fiddle with things to make that function inline again. Cc: Richard Henderson <rth@twiddle.net> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 22 9月, 2005 2 次提交
-
-
Hugh made me note this line for permission checking in mprotect(): if ((newflags & ~(newflags >> 4)) & 0xf) { after figuring out what's that about, I decided it's nasty enough. Btw Hugh itself didn't like the 0xf. We can safely change it to VM_READ|VM_WRITE|VM_EXEC because we never change VM_SHARED, so no need to check that. Signed-off-by: NPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
That comment is plain wrong (we even take the pagetable lock inside unmap_region()). Signed-off-by: NPaolo 'Blaisorblade' Giarrusso <blaisorblade@yahoo.it> Acked-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 18 9月, 2005 1 次提交
-
-
由 Dave Hansen 提交于
Signed-off-by: NDave Hansen <haveblue@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 15 9月, 2005 2 次提交
-
-
由 Alok Kataria 提交于
With the new changes that we made in the initialization of the slab allocator, we first setup the cache from which array caches are allocated, and then the cache, from which kmem_list3's are allocated. Now if the array cache comes from a cache in which objsize > 32, (in this instance size-64) then, first size-64 cache will be allocated and then the size-128 (if this is the cache from which kmem_list3's are going to be allocated). So with these new changes, we are not guaranteed that we will be initializing the malloc_sizes array in a serialized order. Thus there is a bug in __find_general_cachep, as we are checking whether the first cache_sizes ptr is NULL. This is replaced by checking whether the array-cache cache is initialized. Attached is a patch which does that. Boots fine on a x86-64, with DEBUG_SPIN, DEBUG_SLAB, and preempt. Attached is a patch which does that. Boots fine on a x86-64, with DEBUG_SPIN, DEBUG_SLAB, and preempt.Thanks & Regards, Alok Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhitdayal.com> Cc: Manfred Spraul <manfred@colorfullife.com> Cc: Christoph Lameter <christoph@lameter.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Hugh Dickins 提交于
Pavel Emelianov and Kirill Korotaev observe that fs and arch users of security_vm_enough_memory tend to forget to vm_unacct_memory when a failure occurs further down (typically in setup_arg_pages variants). These are all users of insert_vm_struct, and that reservation will only be unaccounted on exit if the vma is marked VM_ACCOUNT: which in some cases it is (hidden inside VM_STACK_FLAGS) and in some cases it isn't. So x86_64 32-bit and ppc64 vDSO ELFs have been leaking memory into Committed_AS each time they're run. But don't add VM_ACCOUNT to them, it's inappropriate to reserve against the very unlikely case that gdb be used to COW a vDSO page - we ought to do something about that in do_wp_page, but there are yet other inconsistencies to be resolved. The safe and economical way to fix this is to let insert_vm_struct do the security_vm_enough_memory check when it finds VM_ACCOUNT is set. And the MIPS irix_brk has been calling security_vm_enough_memory before calling do_brk which repeats it, doubly accounting and so also leaking. Remove that, and all the fs and arch calls to security_vm_enough_memory: give it a less misleading name later on. Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-Off-By: NKirill Korotaev <dev@sw.ru> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 13 9月, 2005 4 次提交
-
-
由 Randy Dunlap 提交于
Use the add_taint() interface for setting tainted bit flags instead of doing it manually. Signed-off-by: NRandy Dunlap <rdunlap@xenotime.net> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Andi Kleen 提交于
There was a pretty bad bug in there that the code would always check the full VMA, not the range the user requested. When the VMA to be checked was merged with the previous VMA this could lead to spurious failures. Signed-off-by: N"Andi Kleen" <ak@suse.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Con Kolivas 提交于
Use the pgdat pointer we've already defined in wakeup_kswapd Signed-off-by: NCon Kolivas <kernel@kolivas.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Andi Kleen 提交于
This leads to bootmem allocating first from node 0 instead of from the last node. This avoids swiotlb allocating on the last node, which doesn't really work on a machine with >4GB. Note: there is a better patch around from someone else that gets rid of the pgdat list completely. Signed-off-by: NAndi Kleen <ak@suse.de> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 12 9月, 2005 1 次提交
-
-
由 Greg Ungerer 提交于
Move call to get_mm_counter() in update_mem_hiwater() to be inside the check for tsk->mm being null. Otherwise you can be following a null pointer here. This patch submitted by Javier Herrero <jherrero@hvsistemas.es>. Modify the end check for munmap regions to allow for the legacy behavior of 0 being valid. Pretty much all current uClinux system libc malloc's pass in 0 as the end point. A hard check will fail on these, so change the check so that if it is non-zero it must be valid otherwise it fails. A passed in value will always succeed (as it used too). Also export a few more mm system functions - to be consistent with the VM code exports. Signed-off-by: NGreg Ungerer <gerg@uclinux.com> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 11 9月, 2005 5 次提交
-
-
由 Nishanth Aravamudan 提交于
Use schedule_timeout_{,un}interruptible() instead of set_current_state()/schedule_timeout() to reduce kernel size. Signed-off-by: NNishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Renaud Lienhart 提交于
free_pages_bulk() doesn't free the entire list if count == 0. Signed-off-by: NRenaud Lienhart <renaud.lienhart@free.fr> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Victor Fusco 提交于
Fix the sparse warning "implicit cast to nocast type" Signed-off-by: NVictor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: NDomen Puncer <domen@coderock.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Victor Fusco 提交于
Fix the sparse warning "implicit cast to nocast type" Signed-off-by: NVictor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: NDomen Puncer <domen@coderock.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Adrian Bunk 提交于
With Nick Piggin <npiggin@suse.de> Give some things static scope. Signed-off-by: NAdrian Bunk <bunk@stusta.de> Signed-off-by: NNick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 10 9月, 2005 5 次提交
-
-
由 Ingo Molnar 提交于
Clean up timer initialization by introducing DEFINE_TIMER a'la DEFINE_SPINLOCK. Build and boot-tested on x86. A similar patch has been been in the -RT tree for some time. Signed-off-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Pekka Enberg 提交于
This patch clarifies NULL handling of kfree() and vfree(). I addition, wording of calling context restriction for vfree() and vunmap() are changed from "may not" to "must not." Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi> Acked-by: NManfred Spraul <manfred@colorfullife.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
The NUMA API change that introduced kmalloc_node was accepted for 2.6.12-rc3. Now it is possible to do slab allocations on a node to localize memory structures. This API was used by the pageset localization patch and the block layer localization patch now in mm. The existing kmalloc_node is slow since it simply searches through all pages of the slab to find a page that is on the node requested. The two patches do a one time allocation of slab structures at initialization and therefore the speed of kmalloc node does not matter. This patch allows kmalloc_node to be as fast as kmalloc by introducing node specific page lists for partial, free and full slabs. Slab allocation improves in a NUMA system so that we are seeing a performance gain in AIM7 of about 5% with this patch alone. More NUMA localizations are possible if kmalloc_node operates in an fast way like kmalloc. Test run on a 32p systems with 32G Ram. w/o patch Tasks jobs/min jti jobs/min/task real cpu 1 485.36 100 485.3640 11.99 1.91 Sat Apr 30 14:01:51 2005 100 26582.63 88 265.8263 21.89 144.96 Sat Apr 30 14:02:14 2005 200 29866.83 81 149.3342 38.97 286.08 Sat Apr 30 14:02:53 2005 300 33127.16 78 110.4239 52.71 426.54 Sat Apr 30 14:03:46 2005 400 34889.47 80 87.2237 66.72 568.90 Sat Apr 30 14:04:53 2005 500 35654.34 76 71.3087 81.62 714.55 Sat Apr 30 14:06:15 2005 600 36460.83 75 60.7681 95.77 853.42 Sat Apr 30 14:07:51 2005 700 35957.00 75 51.3671 113.30 990.67 Sat Apr 30 14:09:45 2005 800 33380.65 73 41.7258 139.48 1140.86 Sat Apr 30 14:12:05 2005 900 35095.01 76 38.9945 149.25 1281.30 Sat Apr 30 14:14:35 2005 1000 36094.37 74 36.0944 161.24 1419.66 Sat Apr 30 14:17:17 2005 w/patch Tasks jobs/min jti jobs/min/task real cpu 1 484.27 100 484.2736 12.02 1.93 Sat Apr 30 15:59:45 2005 100 28262.03 90 282.6203 20.59 143.57 Sat Apr 30 16:00:06 2005 200 32246.45 82 161.2322 36.10 282.89 Sat Apr 30 16:00:42 2005 300 37945.80 83 126.4860 46.01 418.75 Sat Apr 30 16:01:28 2005 400 40000.69 81 100.0017 58.20 561.48 Sat Apr 30 16:02:27 2005 500 40976.10 78 81.9522 71.02 696.95 Sat Apr 30 16:03:38 2005 600 41121.54 78 68.5359 84.92 834.86 Sat Apr 30 16:05:04 2005 700 44052.77 78 62.9325 92.48 971.53 Sat Apr 30 16:06:37 2005 800 41066.89 79 51.3336 113.38 1111.15 Sat Apr 30 16:08:31 2005 900 38918.77 79 43.2431 134.59 1252.57 Sat Apr 30 16:10:46 2005 1000 41842.21 76 41.8422 139.09 1392.33 Sat Apr 30 16:13:05 2005 These are measurement taken directly after boot and show a greater improvement than 5%. However, the performance improvements become less over time if the AIM7 runs are repeated and settle down at around 5%. Links to earlier discussions: http://marc.theaimsgroup.com/?t=111094594500003&r=1&w=2 http://marc.theaimsgroup.com/?t=111603406600002&r=1&w=2 Changelog V4-V5: - alloc_arraycache and alloc_aliencache take node parameter instead of cpu - fix initialization so that nodes without cpus are properly handled. - simplify code in kmem_cache_init - patch against Andrews temp mm3 release - Add Shai to credits - fallback to __cache_alloc from __cache_alloc_node if the node's cache is not available yet. Changelog V3-V4: - Patch against 2.6.12-rc5-mm1 - Cleanup patch integrated - More and better use of for_each_node and for_each_cpu - GCC 2.95 fix (do not use [] use [0]) - Correct determination of INDEX_AC - Remove hack to cause an error on platforms that have no CONFIG_NUMA but nodes. - Remove list3_data and list3_data_ptr macros for better readability Changelog V2-V3: - Made to patch against 2.6.12-rc4-mm1 - Revised bootstrap mechanism so that larger size kmem_list3 structs can be supported. Do a generic solution so that the right slab can be found for the internal structs. - use for_each_online_node Changelog V1-V2: - Batching for freeing of wrong-node objects (alien caches) - Locking changes and NUMA #ifdefs as requested by Manfred Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com> Signed-off-by: NShobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: NShai Fultheim <Shai@Scalex86.org> Signed-off-by: NChristoph Lameter <clameter@sgi.com> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Stephen Smalley 提交于
This patch modifies tmpfs to call the inode_init_security LSM hook to set up the incore inode security state for new inodes before the inode becomes accessible via the dcache. As there is no underlying storage of security xattrs in this case, it is not necessary for the hook to return the (name, value, len) triple to the tmpfs code, so this patch also modifies the SELinux hook function to correctly handle the case where the (name, value, len) pointers are NULL. The hook call is needed in tmpfs in order to support proper security labeling of tmpfs inodes (e.g. for udev with tmpfs /dev in Fedora). With this change in place, we should then be able to remove the security_inode_post_create/mkdir/... hooks safely. Signed-off-by: NStephen Smalley <sds@tycho.nsa.gov> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Mark Fasheh 提交于
Update the file systems in fs/ implementing a delete_inode() callback to call truncate_inode_pages(). One implementation note: In developing this patch I put the calls to truncate_inode_pages() at the very top of those filesystems delete_inode() callbacks in order to retain the previous behavior. I'm guessing that some of those could probably be optimized. Signed-off-by: NMark Fasheh <mark.fasheh@oracle.com> Acked-by: NChristoph Hellwig <hch@infradead.org> Signed-off-by: NHugh Dickins <hugh@veritas.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 09 9月, 2005 1 次提交
-
-
由 Andi Kleen 提交于
Run PCI driver initialization on local node Instead of adding messy kmalloc_node()s everywhere run the PCI driver probe on the node local to the device. This would not have helped for IDE, but should for other more clean drivers that do more initialization in probe(). It won't help for drivers that do most of the work on first open (like many network drivers) Signed-off-by: NAndi Kleen <ak@suse.de> Signed-off-by: NGreg Kroah-Hartman <gregkh@suse.de>
-
- 08 9月, 2005 8 次提交
-
-
由 Pekka J Enberg 提交于
This patch introduces a kzalloc wrapper and converts kernel/ to use it. It saves a little program text. Signed-off-by: NPekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: NAdrian Bunk <bunk@stusta.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Paul Jackson 提交于
Now the real motivation for this cpuset mem_exclusive patch series seems trivial. This patch keeps a task in or under one mem_exclusive cpuset from provoking an oom kill of a task under a non-overlapping mem_exclusive cpuset. Since only interrupt and GFP_ATOMIC allocations are allowed to escape mem_exclusive containment, there is little to gain from oom killing a task under a non-overlapping mem_exclusive cpuset, as almost all kernel and user memory allocation must come from disjoint memory nodes. This patch enables configuring a system so that a runaway job under one mem_exclusive cpuset cannot cause the killing of a job in another such cpuset that might be using very high compute and memory resources for a prolonged time. Signed-off-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Paul Jackson 提交于
This patch makes use of the previously underutilized cpuset flag 'mem_exclusive' to provide what amounts to another layer of memory placement resolution. With this patch, there are now the following four layers of memory placement available: 1) The whole system (interrupt and GFP_ATOMIC allocations can use this), 2) The nearest enclosing mem_exclusive cpuset (GFP_KERNEL allocations can use), 3) The current tasks cpuset (GFP_USER allocations constrained to here), and 4) Specific node placement, using mbind and set_mempolicy. These nest - each layer is a subset (same or within) of the previous. Layer (2) above is new, with this patch. The call used to check whether a zone (its node, actually) is in a cpuset (in its mems_allowed, actually) is extended to take a gfp_mask argument, and its logic is extended, in the case that __GFP_HARDWALL is not set in the flag bits, to look up the cpuset hierarchy for the nearest enclosing mem_exclusive cpuset, to determine if placement is allowed. The definition of GFP_USER, which used to be identical to GFP_KERNEL, is changed to also set the __GFP_HARDWALL bit, in the previous cpuset_gfp_hardwall_flag patch. GFP_ATOMIC and GFP_KERNEL allocations will stay within the current tasks cpuset, so long as any node therein is not too tight on memory, but will escape to the larger layer, if need be. The intended use is to allow something like a batch manager to handle several jobs, each job in its own cpuset, but using common kernel memory for caches and such. Swapper and oom_kill activity is also constrained to Layer (2). A task in or below one mem_exclusive cpuset should not cause swapping on nodes in another non-overlapping mem_exclusive cpuset, nor provoke oom_killing of a task in another such cpuset. Heavy use of kernel memory for i/o caching and such by one job should not impact the memory available to jobs in other non-overlapping mem_exclusive cpusets. This patch enables providing hardwall, inescapable cpusets for memory allocations of each job, while sharing kernel memory allocations between several jobs, in an enclosing mem_exclusive cpuset. Like Dinakar's patch earlier to enable administering sched domains using the cpu_exclusive flag, this patch also provides a useful meaning to a cpuset flag that had previously done nothing much useful other than restrict what cpuset configurations were allowed. Signed-off-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Paul Jackson 提交于
This patch series extends the use of the cpuset attribute 'mem_exclusive' to support cpuset configurations that: 1) allow GFP_KERNEL allocations to come from a potentially larger set of memory nodes than GFP_USER allocations, and 2) can constrain the oom killer to tasks running in cpusets in a specified subtree of the cpuset hierarchy. Here's an example usage scenario. For a few hours or more, a large NUMA system at a University is to be divided in two halves, with a bunch of student jobs running in half the system under some form of batch manager, and with a big research project running in the other half. Each of the student jobs is placed in a small cpuset, but should share the classic Unix time share facilities, such as buffered pages of files in /bin and /usr/lib. The big research project wants no interference whatsoever from the student jobs, and has highly tuned, unusual memory and i/o patterns that intend to make full use of all the main memory on the nodes available to it. In this example, we have two big sibling cpusets, one of which is further divided into a more dynamic set of child cpusets. We want kernel memory allocations constrained by the two big cpusets, and user allocations constrained by the smaller child cpusets where present. And we require that the oom killer not operate across the two halves of this system, or else the first time a student job runs amuck, the big research project will likely be first inline to get shot. Tweaking /proc/<pid>/oom_adj is not ideal -- if the big research project really does run amuck allocating memory, it should be shot, not some other task outside the research projects mem_exclusive cpuset. I propose to extend the use of the 'mem_exclusive' flag of cpusets to manage such scenarios. Let memory allocations for user space (GFP_USER) be constrained by a tasks current cpuset, but memory allocations for kernel space (GFP_KERNEL) by constrained by the nearest mem_exclusive ancestor of the current cpuset, even though kernel space allocations will still _prefer_ to remain within the current tasks cpuset, if memory is easily available. Let the oom killer be constrained to consider only tasks that are in overlapping mem_exclusive cpusets (it won't help much to kill a task that normally cannot allocate memory on any of the same nodes as the ones on which the current task can allocate.) The current constraints imposed on setting mem_exclusive are unchanged. A cpuset may only be mem_exclusive if its parent is also mem_exclusive, and a mem_exclusive cpuset may not overlap any of its siblings memory nodes. This patch was presented on linux-mm in early July 2005, though did not generate much feedback at that time. It has been built for a variety of arch's using cross tools, and built, booted and tested for function on SN2 (ia64). There are 4 patches in this set: 1) Some minor cleanup, and some improvements to the code layout of one routine to make subsequent patches cleaner. 2) Add another GFP flag - __GFP_HARDWALL. It marks memory requests for USER space, which are tightly confined by the current tasks cpuset. 3) Now memory requests (such as KERNEL) that not marked HARDWALL can if short on memory, look in the potentially larger pool of memory defined by the nearest mem_exclusive ancestor cpuset of the current tasks cpuset. 4) Finally, modify the oom killer to skip any task whose mem_exclusive cpuset doesn't overlap ours. Patch (1), the one time I looked on an SN2 (ia64) build, actually saved 32 bytes of kernel text space. Patch (2) has no affect on the size of kernel text space (it just adds a preprocessor flag). Patches (3) and (4) added about 600 bytes each of kernel text space, mostly in kernel/cpuset.c, which matters only if CONFIG_CPUSET is enabled. This patch: This patch applies a few comment and code cleanups to mm/oom_kill.c prior to applying a few small patches to improve cpuset management of memory placement. The comment changed in oom_kill.c was seriously misleading. The code layout change in select_bad_process() makes room for adding another condition on which a process can be spared the oom killer (see the subsequent cpuset_nodes_overlap patch for this addition). Also a couple typos and spellos that bugged me, while I was here. This patch should have no material affect. Signed-off-by: NPaul Jackson <pj@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Ravikiran G Thirumalai 提交于
Mark variables which are usually accessed for reads with __readmostly. Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com> Signed-off-by: NShai Fultheim <shai@scalex86.org> Signed-off-by: NRavikiran Thirumalai <kiran@scalex86.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Steven Pratt 提交于
We don't reset the cache hit count until after readahead does a successful readahead. This seems to leave a corner case open where we miss in cache, but don't restart the readhead right away. Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Hellwig 提交于
Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Christoph Lameter 提交于
Move some more frequently read variables that showed up during some of our performance tests as sometimes ending up in hot cachelines to the read_mostly section. Fix: Move the __read_mostly from before hpet_usec_quotient to follow the variable like the other uses of __read_mostly. Signed-off-by: NAlok N Kataria <alokk@calsoftinc.com> Signed-off-by: NChristoph Lameter <christoph@scalex86.org> Signed-off-by: NShai Fultheim <shai@scalex86.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
- 05 9月, 2005 3 次提交
-
-
由 Stephen Smalley 提交于
This patch modifies the VFS setxattr, getxattr, and listxattr code to fall back to the security module for security xattrs if the filesystem does not support xattrs natively. This allows security modules to export the incore inode security label information to userspace even if the filesystem does not provide xattr storage, and eliminates the need to individually patch various pseudo filesystem types to provide such access. The patch removes the existing xattr code from devpts and tmpfs as it is then no longer needed. The patch restructures the code flow slightly to reduce duplication between the normal path and the fallback path, but this should only have one user-visible side effect - a program may get -EACCES rather than -EOPNOTSUPP if policy denied access but the filesystem didn't support the operation anyway. Note that the post_setxattr hook call is not needed in the fallback case, as the inode_setsecurity hook call handles the incore inode security state update directly. In contrast, we do call fsnotify in both cases. Signed-off-by: NStephen Smalley <sds@tycho.nsa.gov> Acked-by: NJames Morris <jmorris@namei.org> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Martin Hicks 提交于
Add page_state info to the per-node meminfo file in sysfs. This is mostly just for informational purposes. The lack of this information was brought up recently during a discussion regarding pagecache clearing, and I put this patch together to test out one of the suggestions. It seems like interesting info to have, so I'm submitting the patch. Signed-off-by: NMartin Hicks <mort@sgi.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-
由 Manfred Spraul 提交于
Proposed by and based on a patch from Eric Dumazet <dada1@cosmosbay.com>: This patch removes unnecessary critical section in ksize() function, as cli/sti are rather expensive on modern CPUS. It additionally adds a docbook entry for ksize() and further simplifies the code. Signed-Off-By: NManfred Spraul <manfred@colorfullife.com> Signed-off-by: NAndrew Morton <akpm@osdl.org> Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
-