- 23 Sep 2008, 2 commits
-
-
By Daisuke Nishimura
Currently the memory cgroup (both in mainline and -mm) doesn't account swap caches as memory (swap cache support is dropped temporarily now), so try_to_free_mem_cgroup_pages doesn't reflect the count of pages that have been moved to swap cache. But this makes mem_cgroup_shrink_usage fail easily if most of the pages are anon/shmem, and then shmem_getpage returns -ENOMEM and the process gets killed. This patch adds a res_counter_check_under_limit check to avoid these cases.

BTW, even if swap cache support is enabled again, if a process is moved to another cgroup, one which has just been made, between precharge and shrink_usage in shmem_getpage, shrink_usage may fail just because there are no pages to reclaim. So this change would make sense anyway.

Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Pavel Emelyanov <xemul@openvz.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
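A rough sketch of what the added check amounts to, assuming the 2.6.27-era memcg reclaim loop (names illustrative; the exact code may differ):

    do {
        progress = try_to_free_mem_cgroup_pages(mem, gfp_mask);
        /*
         * Being back under the limit counts as progress even when
         * nothing could be reclaimed (e.g. pages only moved to swap
         * cache, or a freshly created group is already under its limit).
         */
        progress += res_counter_check_under_limit(&mem->res);
    } while (!progress && retry--);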
-
By Nick Piggin
tiny-shmem calls do_truncate in shmem_file_setup. do_truncate takes i_mutex, and shmem_file_setup is called with mmap_sem held. However, i_mutex nests outside mmap_sem. Copy the code in shmem.c to avoid this problem.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reported-and-tested-by: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 15 Sep 2008, 1 commit
-
-
By Salman Qazi
Initialized total objects atomic for the node in init_kmem_cache_node. The uninitialized value was ruining the stats in /proc/slabinfo.

Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Salman Qazi <sqazi@google.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
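The fix is essentially a one-line initialization; a rough sketch based on the 2.6.27-era slub.c (signature and surrounding lines may differ slightly):

    static void
    init_kmem_cache_node(struct kmem_cache_node *n, struct kmem_cache *s)
    {
        n->nr_partial = 0;
        spin_lock_init(&n->list_lock);
        INIT_LIST_HEAD(&n->partial);
    #ifdef CONFIG_SLUB_DEBUG
        atomic_long_set(&n->nr_slabs, 0);
        atomic_long_set(&n->total_objects, 0);  /* previously left uninitialized */
        INIT_LIST_HEAD(&n->full);
    #endif
    }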
-
- 14 Sep 2008, 1 commit
-
-
By Mel Gorman
The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when scanning zonelists to keep track of where in the zonelist it is. The zoneref that is returned corresponds to the next zone that is to be scanned, not the current one. It was intended to be treated as an opaque list.

When the page allocator is scanning a zonelist, it marks elements in the zonelist corresponding to zones that are temporarily full. As the zonelist is being updated, it uses the cursor here:

    if (NUMA_BUILD)
        zlc_mark_zone_full(zonelist, z);

This is intended to prevent rescanning in the near future, but the zoneref cursor does not correspond to the zone that has been found to be full. This is an easy misunderstanding to make, so this patch corrects the problem by changing the zoneref cursor to be the current zone being scanned instead of the next one.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
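In loop form, the allocator-side pattern looks roughly like this (a hedged sketch; the point is that zlc_mark_zone_full() needs the zoneref of the zone that was just found full, which is what the patch makes z refer to):

    for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
        if (!zone_watermark_ok(zone, order, mark,
                               classzone_idx, alloc_flags)) {
            /* z now refers to the zone just scanned, so the
             * correct zonelist entry gets marked full */
            if (NUMA_BUILD)
                zlc_mark_zone_full(zonelist, z);
            continue;
        }
        /* otherwise try to allocate from this zone ... */
    }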
-
- 04 Sep 2008, 1 commit
-
-
By Tejun Heo
Anonymous mappings should ignore offset, but shared anonymous mappings forgot to clear it, which makes the following legit test program trigger SIGBUS:

    #include <sys/mman.h>
    #include <stdio.h>
    #include <errno.h>

    #define PAGE_SIZE 4096

    int main(void)
    {
        char *p;
        int i;

        p = mmap(NULL, 2 * PAGE_SIZE, PROT_READ|PROT_WRITE,
                 MAP_SHARED|MAP_ANONYMOUS, -1, PAGE_SIZE);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        for (i = 0; i < 2; i++) {
            printf("page %d\n", i);
            p[i * 4096] = i;
        }
        return 0;
    }

Fix it.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Hugh Dickins <hugh@veritas.com>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 03 Sep 2008, 4 commits
-
-
By KOSAKI Motohiro
Quicklists store pages for each CPU as caches. (Each CPU can cache node_free_pages/16 pages.) It is used as a page table cache. exit() will increase the cache size, while fork() consumes it. So for example, if an apache-style application runs (one parent and many children model), one CPU process will fork() while another CPU will process the middleware work and exit(). At that time, the CPU on which the parent runs doesn't have any page table cache at all. Others (on which the children run) have maximum caches.

    QList_max = (#ofCPUs - 1) x Free / 16
    => QList_max / (Free + QList_max) = (#ofCPUs - 1) / (16 + #ofCPUs - 1)

So, how much quicklist memory is used in the maximum case? It is proportional to the number of CPUs because the per-cpu quicklist cache limit doesn't take the number of cpus into account. The above calculation means:

    Number of CPUs per node           2     4     8    16
    ==============================  ====================
    QList_max / (Free + QList_max)  5.8%   16%   30%   48%

Wow! Quicklist can spend about 50% of memory in the worst case.

My demonstration program is here:

    --------------------------------------------------------------------------------
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sched.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/wait.h>

    #define BUFFSIZE 512

    int max_cpu(void)   /* get max number of logical cpus from /proc/cpuinfo */
    {
        FILE *fd;
        char *ret, buffer[BUFFSIZE];
        int cpu = 1;

        fd = fopen("/proc/cpuinfo", "r");
        if (fd == NULL) {
            perror("fopen(/proc/cpuinfo)");
            exit(EXIT_FAILURE);
        }
        while (1) {
            ret = fgets(buffer, BUFFSIZE, fd);
            if (ret == NULL)
                break;
            if (!strncmp(buffer, "processor", 9))
                cpu = atoi(strchr(buffer, ':') + 2);
        }
        fclose(fd);
        return cpu;
    }

    void cpu_bind(int cpu)  /* bind current process to one cpu */
    {
        cpu_set_t mask;
        int ret;

        CPU_ZERO(&mask);
        CPU_SET(cpu, &mask);
        ret = sched_setaffinity(0, sizeof(mask), &mask);
        if (ret == -1) {
            perror("sched_setaffinity()");
            exit(EXIT_FAILURE);
        }
        sched_yield();  /* not necessary */
    }

    #define MMAP_SIZE (10 * 1024 * 1024)    /* 10 MB */
    #define FORK_INTERVAL 1                 /* 1 second */

    main(int argc, char *argv[])
    {
        int cpu_max, nextcpu;
        long pagesize;
        pid_t pid;

        /* set max number of logical cpu */
        if (argc > 1)
            cpu_max = atoi(argv[1]) - 1;
        else
            cpu_max = max_cpu();

        /* get the page size */
        pagesize = sysconf(_SC_PAGESIZE);
        if (pagesize == -1) {
            perror("sysconf(_SC_PAGESIZE)");
            exit(EXIT_FAILURE);
        }

        /* prepare parent process */
        cpu_bind(0);
        nextcpu = cpu_max;

    loop:
        /* select destination cpu for child process by round-robin rule */
        if (++nextcpu > cpu_max)
            nextcpu = 1;

        pid = fork();
        if (pid == 0) { /* child action */
            char *p;
            int i;

            /* consume page tables */
            p = mmap(0, MMAP_SIZE, PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
            i = MMAP_SIZE / pagesize;
            while (i-- > 0) {
                *p = 1;
                p += pagesize;
            }

            /* move to other cpu */
            cpu_bind(nextcpu);
            /*
            printf("a child moved to cpu%d after mmap().\n", nextcpu);
            fflush(stdout);
            */

            /* back page tables to pgtable_quicklist */
            exit(0);
        } else if (pid > 0) { /* parent action */
            sleep(FORK_INTERVAL);
            waitpid(pid, NULL, WNOHANG);
        }
        goto loop;
    }
    ----------------------------------------

When the above program, which does task migration, runs, my 8GB box spends 800MB of memory for quicklist. This is not a memory leak, but it doesn't seem good.

    % cat /proc/meminfo
    MemTotal:      7701568 kB
    MemFree:       4724672 kB
    (snip)
    Quicklists:     844800 kB

because

 - My machine spec is
       number of numa node: 2
       number of cpus: 8 (4CPU x2 node)
       total mem: 8GB (4GB x2 node)
       free mem: about 5GB
 - Then, 4.7GB x 16% ~= 880MB.

So, Quicklist can use 800MB. So, if a machine with the following spec runs that program:

    CPUs: 64 (8cpu x 8node)
    Mem: 1TB (128GB x8node)

then quicklist can waste 300GB (= 1TB x 30%). It is too large. So, I don't like cache policies which are proportional to the number of cpus.

My patch changes the number of caches from:

    per-cpu-cache-amount = memory_on_node / 16

to

    per-cpu-cache-amount = memory_on_node / 16 / number_of_cpus_on_node

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Keiichiro Tokunaga <tokunaga.keiich@jp.fujitsu.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Tested-by: David Miller <davem@davemloft.net>
Acked-by: Mike Travis <travis@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
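A minimal sketch of the policy change (a hypothetical helper for illustration only, not the kernel's exact code):

    /* Hypothetical illustration of the per-cpu quicklist sizing change. */
    static unsigned long quicklist_limit(unsigned long node_free_pages,
                                         unsigned int cpus_on_node)
    {
        /* old: each CPU could cache up to 1/16 of the node's free pages */
        unsigned long per_node_budget = node_free_pages / 16;

        /* new: that 1/16 budget is shared among the CPUs on the node */
        return per_node_budget / cpus_on_node;
    }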
-
By Marcin Slusarz
WARNING: vmlinux.o(.data+0x1f5c0): Section mismatch in reference from the variable contig_page_data to the variable .init.data:bootmem_node_data
The variable contig_page_data references the variable __initdata bootmem_node_data
If the reference is valid then annotate the variable with __init* (see linux/init.h) or name the variable:
*driver, *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Cc: Johannes Weiner <hannes@saeurebad.de>
Cc: Sean MacLennan <smaclennan@pikatech.com>
Cc: Sam Ravnborg <sam@ravnborg.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Hisashi Hifumi
Dio write returns EIO when try_to_release_page fails because bh is still referenced.

The patch

    commit 3f31fddf
    Author: Mingming Cao <cmm@us.ibm.com>
    Date:   Fri Jul 25 01:46:22 2008 -0700

        jbd: fix race between free buffer and commit transaction

was merged into 2.6.27-rc1, but I noticed that this patch is not enough to fix the race. I did heavy fsstress testing on 2.6.27-rc1 and found that dio write still sometimes got EIO through this test. The patch above fixed the race between freeing buffer (dio) and committing transaction (jbd), but I discovered that there is another race, between freeing buffer (dio) and ext3/4_ordered_writepage:

    : background_writeout()
         ->write_cache_pages()
           ->ext3_ordered_writepage()
                walk_page_buffers() -> take a bh ref
                block_write_full_page() -> unlock_page
                  : <- end_page_writeback
                  : <- race! (dio write -> try_to_release_page fails)
                walk_page_buffers() -> release a bh ref

ext3_ordered_writepage does unlock_page while still holding a bh ref, so this causes the race and the failure of try_to_release_page. To fix this race, I used the approach of falling back to buffered writes if try_to_release_page() fails on a page.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mingming Cao <cmm@us.ibm.com>
Cc: Zach Brown <zach.brown@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Adam Litke
I have gotten to the root cause of the hugetlb badness I reported back on August 15th. My system has the following memory topology (note the overlapping node):

    Node 0 Memory: 0x8000000-0x44000000
    Node 1 Memory: 0x0-0x8000000 0x44000000-0x80000000

setup_zone_migrate_reserve() scans the address range 0x0-0x8000000 looking for a pageblock to move onto the MIGRATE_RESERVE list. Finding no candidates, it happily continues the scan into 0x8000000-0x44000000. When a pageblock is found, the pages are moved to the MIGRATE_RESERVE list on the wrong zone. Oops.

setup_zone_migrate_reserve() should skip pageblocks in overlapping nodes.

Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
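The skip is essentially a node check inside the pageblock scan; roughly (hedged sketch of the 2.6.27-era loop in setup_zone_migrate_reserve()):

    for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
        if (!pfn_valid(pfn))
            continue;
        page = pfn_to_page(pfn);

        /* Watch out for overlapping nodes */
        if (page_to_nid(page) != zone_to_nid(zone))
            continue;

        /* ... existing MIGRATE_RESERVE bookkeeping ... */
    }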
-
- 02 Sep 2008, 1 commit
-
-
By David Woodhouse
Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
-
- 28 Aug 2008, 1 commit
-
-
By Mel Gorman
Ordinarily, memory holes in flatmem still have a valid memmap and are safe to use. However, an architecture (ARM) frees up the memmap backing memory holes on the assumption it is never used. /proc/pagetypeinfo reads the whole range of pages in a zone, believing that the memmap is valid and that pfn_valid will return false if it is not. On ARM, freeing the memmap breaks the page->zone linkages even though pfn_valid() returns true, and the kernel can oops shortly afterwards due to accessing a bogus struct zone *.

This patch lets architectures say when FLATMEM can have holes in the memmap. Rather than an expensive check for valid memory, /proc/pagetypeinfo will confirm that the page linkages are still valid by checking that page->zone is still the expected zone. The lookup of page_zone is safe as there is a limited range of memory that is accessed when calling page_zone. Even if page_zone happens to return the correct zone, the impact is that the counters in /proc/pagetypeinfo are slightly off, but fragmentation monitoring is unlikely to be relevant on an embedded system.

Reported-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: H Hartley Sweeten <hsweeten@visionengravers.com>
Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
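A hedged sketch of the validity check described above (illustrative; the actual /proc/pagetypeinfo loop and helper may differ):

    for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
        struct page *page;

        if (!pfn_valid(pfn))
            continue;
        page = pfn_to_page(pfn);

        /*
         * On FLATMEM with holes punched in the memmap, pfn_valid() can
         * still return true; confirm the linkages before trusting them.
         */
        if (page_zone(page) != zone)
            continue;

        /* ... count the pageblock's migrate type ... */
    }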
-
- 21 Aug 2008, 8 commits
-
-
By Nick Piggin
XIP can call into get_xip_mem concurrently with the same file,offset with create=1. This usually maps down to get_block, which expects the page lock to prevent such a situation. This causes ext2 to explode for one reason or another. Serialise those calls for the moment. For common usages today, I suspect get_xip_mem rarely is called to create new blocks. In future as XIP technologies evolve we might need to look at which operations require scalability, and rework the locking to suit.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Acked-by: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
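A minimal sketch of the serialisation, assuming a single mutex around the block-creating call (names follow the 2.6.27-era mm/filemap_xip.c but the details may differ):

    static DEFINE_MUTEX(xip_sparse_mutex);

    /* in the fault path, when a new block has to be created */
    mutex_lock(&xip_sparse_mutex);
    error = mapping->a_ops->get_xip_mem(mapping, pgoff, 1,
                                        &xip_mem, &xip_pfn);
    mutex_unlock(&xip_sparse_mutex);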
-
By Nick Piggin
XIP has a race between sparse pages being inserted into page tables, and sparse pages being zapped when it's time to put a non-sparse page in. What can happen is that a process can be left with a dangling sparse page in a MAP_SHARED mapping, while the rest of the world sees the non-sparse version. I.e. data corruption.

Guard these operations with a seqlock, making fault-in-sparse-pages the slowpath, and try-to-unmap-sparse-pages the fastpath.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Acked-by: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Nick Piggin
There is a race with dirty page accounting where a page may not properly be accounted for.

clear_page_dirty_for_io() calls page_mkclean; then TestClearPageDirty. page_mkclean walks the rmaps for that page, and for each one it cleans and write protects the pte if it was dirty. It uses page_check_address to find the pte. That function has a shortcut to avoid the ptl if the pte is not present. Unfortunately, the pte can be switched to not-present then back to present by other code while holding the page table lock -- this should not be a signal for page_mkclean to ignore that pte, because it may be dirty.

For example, powerpc64's set_pte_at will clear a previously present pte before setting it to the desired value. There may also be other code in core mm or in arch which do similar things.

The consequence of the bug is loss of data integrity due to msync, and loss of dirty page accounting accuracy. XIP's __xip_unmap could easily also be unreliable (depending on the exact XIP locking scheme), which can lead to data corruption.

Fix this by having an option to always take ptl to check the pte in page_check_address. It's possible to retain this optimization for page_referenced and try_to_unmap.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Jared Hulbert <jaredeh@gmail.com>
Cc: Carsten Otte <cotte@freenet.de>
Cc: Hugh Dickins <hugh@veritas.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Johannes Weiner
Absolute alignment requirements may never be applied to node-relative offsets. Andreas Herrmann spotted this flaw when a bootmem allocation on an unaligned node was itself not aligned, because the combination of an unaligned node with an aligned offset into that node is not guaranteed to be aligned itself.

This patch introduces two helper functions that align a node-relative index or offset with respect to the node's starting address, so that the absolute PFN or virtual address that results from combining the two satisfies the requested alignment. Then all the broken ALIGN()s in alloc_bootmem_core() are replaced by these helpers.

Signed-off-by: Johannes Weiner <hannes@saeurebad.de>
Reported-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Debugged-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Reviewed-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Tested-by: Andreas Herrmann <andreas.herrmann3@amd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
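Roughly, such a helper aligns a node-relative index so that the resulting absolute PFN is aligned (hedged sketch; the real bootmem helpers may differ in name and detail):

    static unsigned long align_idx(struct bootmem_data *bdata,
                                   unsigned long idx, unsigned long step)
    {
        unsigned long base = bdata->node_min_pfn;

        /*
         * Align the index with respect to the node start so that the
         * combined base + idx satisfies the requested alignment.
         */
        return ALIGN(base + idx, step) - base;
    }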
-
By Marcin Slusarz
mminit_loglevel is now used from:

    mminit_verify_zonelist
    <- build_all_zonelists
    <- 1. online_pages <- memory_block_action <- memory_block_change_state <- store_mem_state (sys handler)
       2. numa_zonelist_order_handler (proc handler)

so it cannot be annotated __meminit - drop it.

This fixes the following section mismatch warning:

    WARNING: vmlinux.o(.text+0x71628): Section mismatch in reference from the function mminit_verify_zonelist() to the variable .meminit.data:mminit_loglevel
    The function mminit_verify_zonelist() references the variable __meminitdata mminit_loglevel.
    This is often because mminit_verify_zonelist lacks a __meminitdata annotation or the annotation of mminit_loglevel is wrong.

Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Hugh Dickins
Adjust <Alt><SysRq>m show_swap_cache_info() to show "Free swap" as a signed long: the signed format is preferable, because during swapoff nr_swap_pages can legitimately go negative, so it makes more sense thus (it used to be shown redundantly, once as signed and once as unsigned).

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
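In printk terms the change boils down to printing the counter with a signed format, roughly (hedged sketch):

    printk("Free swap  = %ldkB\n", nr_swap_pages << (PAGE_SHIFT - 10));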
-
By Hugh Dickins
Add a comment to s390's page_test_dirty/page_clear_dirty/page_set_dirty dance in page_remove_rmap(): I was wrong to think the PageSwapCache test could be avoided, and would like a comment in there to remind me. And mention s390, to help us remember that this block is not really common.

Also move down the "It would be tidy to reset PageAnon" comment: it does not belong to s390's block, and it would be unwise to reset PageAnon before we're done with testing it.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Christoph Lameter
Switch remote node defragmentation off by default. The current settings can cause excessive node local allocations with hackbench:

    SLAB:
    % cat /proc/meminfo
    MemTotal:      7701760 kB
    MemFree:       5940096 kB
    Slab:           123840 kB

    SLUB:
    % cat /proc/meminfo
    MemTotal:      7701376 kB
    MemFree:       4740928 kB
    Slab:          1591680 kB

[Note: this feature is not related to slab defragmentation.]

You can find the original discussion here:

    http://lkml.org/lkml/2008/8/4/308

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Tested-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
- 15 Aug 2008, 1 commit
-
-
By Mikulas Patocka
This is the minimal sequence that jams the allocator:

    void *p, *q, *r;
    p = alloc_bootmem(PAGE_SIZE);
    q = alloc_bootmem(64);
    free_bootmem(p, PAGE_SIZE);
    p = alloc_bootmem(PAGE_SIZE);
    r = alloc_bootmem(64);

After this sequence (assuming that the allocator was empty or page-aligned before), pointer "q" will be equal to pointer "r".

What's happening inside the allocator:

    p = alloc_bootmem(PAGE_SIZE);
      in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000...
    q = alloc_bootmem(64);
      in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000...
    free_bootmem(p, PAGE_SIZE);
      in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000...
    p = alloc_bootmem(PAGE_SIZE);
      in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000...
    r = alloc_bootmem(64);

and now: it finds bit "2" as a place where to allocate (sidx), and it hits the condition

    if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx)
        start_off = ALIGN(bdata->last_end_off, align);

- you can see that the condition is true, so it assigns start_off = ALIGN(bdata->last_end_off, align); (that is PAGE_SIZE) and allocates over an already allocated block.

With the patch it tries to continue at the end of the previous allocation only if the previous allocation ended in the middle of a page.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Acked-by: Johannes Weiner <hannes@saeurebad.de>
Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
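The patch therefore tightens the merge condition; roughly (hedged sketch of the 2.6.27-era alloc_bootmem_core() check):

    /* only merge with the previous allocation if it ended mid-page */
    if (bdata->last_end_off & (PAGE_SIZE - 1) &&
            PFN_DOWN(bdata->last_end_off) + 1 == sidx)
        start_off = ALIGN(bdata->last_end_off, align);
    else
        start_off = PFN_PHYS(sidx);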
-
- 14 Aug 2008, 1 commit
-
-
By David Howells
Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags of the target process if that is not the current process and it is trying to change its own flags in a different way at the same time.

__capable() is using neither atomic ops nor locking to protect t->flags. This patch removes __capable() and introduces has_capability() that doesn't set PF_SUPERPRIV on the process being queried.

This patch further splits security_ptrace() in two:

 (1) security_ptrace_may_access(). This passes judgement on whether one process may access another only (PTRACE_MODE_ATTACH for ptrace() and PTRACE_MODE_READ for /proc), and takes a pointer to the child process. current is the parent.

 (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only, and takes only a pointer to the parent process. current is the child.

In Smack and commoncap, this uses has_capability() to determine whether the parent will be permitted to use PTRACE_ATTACH if normal checks fail. This does not set PF_SUPERPRIV.

Two of the instances of __capable() actually only act on current, and so have been changed to calls to capable().

Of the places that were using __capable():

 (1) The OOM killer calls __capable() thrice when weighing the killability of a process. All of these now use has_capability().

 (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see whether the parent was allowed to trace any process. As mentioned above, these have been split. For PTRACE_ATTACH and /proc, capable() is now used, and for PTRACE_TRACEME, has_capability() is used.

 (3) cap_safe_nice() only ever saw current, so now uses capable().

 (4) smack_setprocattr() rejected accesses to tasks other than current just after calling __capable(), so the order of these two tests has been switched and capable() is used instead.

 (5) In smack_file_send_sigiotask(), we need to allow privileged processes to receive SIGIO on files they're manipulating.

 (6) In smack_task_wait(), we let a process wait for a privileged process, whether or not the process doing the waiting is privileged.

I've tested this with the LTP SELinux and syscalls testscripts.

Signed-off-by: David Howells <dhowells@redhat.com>
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Casey Schaufler <casey@schaufler-ca.com>
Acked-by: Andrew G. Morgan <morgan@kernel.org>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: James Morris <jmorris@namei.org>
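A hedged illustration of the difference, using the OOM-killer case as an example (exact call site is an assumption):

    /*
     * capable() acts on current and may set PF_SUPERPRIV; when scoring
     * another task, a passive query must be used instead so that the
     * queried task's flags are never modified.
     */
    if (has_capability(p, CAP_SYS_ADMIN))
        points /= 4;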
-
- 13 Aug 2008, 7 commits
-
-
By Huang Weiyi
Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By MinChan Kim
Signed-off-by: MinChan Kim <minchan.kim@gmail.com>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Andy Whitcroft
[Andrew, this should replace the previous version, which did not check the returns from the region prepare for errors. This has been tested by us and Gerald and it looks good.

Bah, while reviewing the locking based on your previous email I spotted that we need to check the return from the vma_needs_reservation call for allocation errors. Here is an updated patch to correct this. This passes testing here.]

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Andy Whitcroft
In the normal case, hugetlbfs reserves hugepages at map time so that the pages exist for future faults. A struct file_region is used to track when reservations have been consumed and where. These file_regions are allocated as necessary with kmalloc(), which can sleep with the mm->page_table_lock held. This is wrong and triggers a may-sleep warning when PREEMPT is enabled.

Updates to the underlying file_region are done in two phases. The first phase prepares the region for the change, allocating any necessary memory, without actually making the change. The second phase actually commits the change. This patch makes use of this by checking the reservations before the page_table_lock is taken, triggering any necessary allocations. This may then be safely repeated within the locks without any allocations being required.

Credit to Mel Gorman for diagnosing this failure and initial versions of the patch.

Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Tested-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
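A hedged sketch of the two-phase pattern (illustrative only; the actual hugetlb helpers and call sites may differ):

    /* phase 1: prepare -- may allocate a file_region, may sleep */
    if (vma_needs_reservation(h, vma, addr) < 0)
        return VM_FAULT_OOM;

    spin_lock(&mm->page_table_lock);
    /* phase 2: commit -- no allocation needed while the lock is held */
    vma_commit_reservation(h, vma, addr);
    /* ... install the pte ... */
    spin_unlock(&mm->page_table_lock);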
-
By Hugh Dickins
Got an oops in mem_cgroup_shrink_usage() when testing loop over tmpfs: yes, of course, loop0 has no mm: other entry points check but this didn't.

Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
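The guard is essentially a one-liner at the top of mem_cgroup_shrink_usage(); roughly (hedged sketch):

    if (mm == NULL)   /* kernel threads such as loop0 have no mm */
        return 0;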
-
By Jan Beulich
.. since a failed allocation is being (initially) handled gracefully, and panic()-ed upon failure explicitly in the function if retries with smaller sizes failed.

Signed-off-by: Jan Beulich <jbeulich@novell.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Gerald Schaefer
The s390 software large page emulation implements shared page tables by using page->index of the first tail page from a compound large page to store page table information. This is set up in arch_prepare_hugepage(), which is called from alloc_fresh_huge_page_node(). A similar call to arch_prepare_hugepage() is missing for surplus large pages that are allocated in alloc_buddy_huge_page(), which breaks the software emulation mode for (surplus) large pages on s390. This patch adds the missing call to arch_prepare_hugepage(). It will have no effect on other architectures where arch_prepare_hugepage() is a nop.

Also, use the correct order in the error path in alloc_fresh_huge_page_node().

Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 12 Aug 2008, 1 commit
-
-
By Rusty Russell
Out-of-line get_user_pages_fast fallback implementation: make it a weak symbol and get rid of CONFIG_HAVE_GET_USER_PAGES_FAST. Export the symbol to modules so lguest can use it.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
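The generic fallback simply takes mmap_sem and calls get_user_pages(); a hedged sketch of its shape in the 2.6.27-era mm/util.c (details may differ):

    int __attribute__((weak)) get_user_pages_fast(unsigned long start,
                            int nr_pages, int write, struct page **pages)
    {
        struct mm_struct *mm = current->mm;
        int ret;

        down_read(&mm->mmap_sem);
        ret = get_user_pages(current, mm, start, nr_pages,
                             write, 0, pages, NULL);
        up_read(&mm->mmap_sem);

        return ret;
    }
    EXPORT_SYMBOL_GPL(get_user_pages_fast);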
-
- 11 Aug 2008, 2 commits
-
-
By Peter Zijlstra
Lockdep spotted:

    =======================================================
    [ INFO: possible circular locking dependency detected ]
    2.6.27-rc1 #270
    -------------------------------------------------------
    qemu-kvm/2033 is trying to acquire lock:
     (&inode->i_data.i_mmap_lock){----}, at: [<ffffffff802996cc>] mm_take_all_locks+0xc2/0xea

    but task is already holding lock:
     (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (&anon_vma->lock){----}:
           [<ffffffff8025cd37>] __lock_acquire+0x11be/0x14d2
           [<ffffffff8025d0a9>] lock_acquire+0x5e/0x7a
           [<ffffffff804c655b>] _spin_lock+0x3b/0x47
           [<ffffffff8029a2ef>] vma_adjust+0x200/0x444
           [<ffffffff8029a662>] split_vma+0x12f/0x146
           [<ffffffff8029bc60>] mprotect_fixup+0x13c/0x536
           [<ffffffff8029c203>] sys_mprotect+0x1a9/0x21e
           [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
           [<ffffffffffffffff>] 0xffffffffffffffff

    -> #0 (&inode->i_data.i_mmap_lock){----}:
           [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
           [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
           [<ffffffff8025d515>] lock_release+0x127/0x14a
           [<ffffffff804c6403>] _spin_unlock+0x1e/0x50
           [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
           [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
           [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
           [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
           [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
           [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
           [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
           [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b
           [<ffffffffffffffff>] 0xffffffffffffffff

    other info that might help us debug this:

    5 locks held by qemu-kvm/2033:
     #0: (&mm->mmap_sem){----}, at: [<ffffffff802a95d0>] do_mmu_notifier_register+0x55/0x112
     #1: (mm_all_locks_mutex){--..}, at: [<ffffffff8029963e>] mm_take_all_locks+0x34/0xea
     #2: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
     #3: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea
     #4: (&anon_vma->lock){----}, at: [<ffffffff8029967a>] mm_take_all_locks+0x70/0xea

    stack backtrace:
    Pid: 2033, comm: qemu-kvm Not tainted 2.6.27-rc1 #270

    Call Trace:
     [<ffffffff8025b7c7>] print_circular_bug_tail+0xb8/0xc3
     [<ffffffff8025ca54>] __lock_acquire+0xedb/0x14d2
     [<ffffffff80259bb1>] ? add_lock_to_list+0x7e/0xad
     [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
     [<ffffffff8029967a>] ? mm_take_all_locks+0x70/0xea
     [<ffffffff8025d397>] lock_release_non_nested+0x1c2/0x219
     [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
     [<ffffffff802996cc>] ? mm_take_all_locks+0xc2/0xea
     [<ffffffff8025b202>] ? trace_hardirqs_on_caller+0x4d/0x115
     [<ffffffff802995d9>] ? mm_drop_all_locks+0x7f/0xb0
     [<ffffffff8025d515>] lock_release+0x127/0x14a
     [<ffffffff804c6403>] _spin_unlock+0x1e/0x50
     [<ffffffff802995d9>] mm_drop_all_locks+0x7f/0xb0
     [<ffffffff802a965d>] do_mmu_notifier_register+0xe2/0x112
     [<ffffffff802a96a8>] mmu_notifier_register+0xe/0x10
     [<ffffffffa0043b6b>] kvm_dev_ioctl+0x11e/0x287 [kvm]
     [<ffffffff8033f9f2>] ? file_has_perm+0x83/0x8e
     [<ffffffff802bd0ca>] vfs_ioctl+0x2a/0x78
     [<ffffffff802bd36f>] do_vfs_ioctl+0x257/0x274
     [<ffffffff802bd3e1>] sys_ioctl+0x55/0x78
     [<ffffffff8020c0db>] system_call_fastpath+0x16/0x1b

Which the locking hierarchy in mm/rmap.c confirms as valid.

Fix this by first taking all the mapping->i_mmap_lock instances and only then taking all anon_vma->lock instances.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
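A hedged sketch of the resulting ordering in mm_take_all_locks() (illustrative; helper names follow the mm/mmap.c of that era but may differ):

    /* first pass: take every file mapping's i_mmap_lock */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
        if (signal_pending(current))
            goto out_unlock;
        if (vma->vm_file && vma->vm_file->f_mapping)
            vm_lock_mapping(mm, vma->vm_file->f_mapping);
    }

    /* second pass: only then take the anon_vma locks */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
        if (signal_pending(current))
            goto out_unlock;
        if (vma->anon_vma)
            vm_lock_anon_vma(mm, vma->anon_vma);
    }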
-
By Peter Zijlstra
The nesting is correct due to holding mmap_sem; use the new annotation to annotate this.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
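A hedged one-line illustration of the annotation (assuming the spin_lock_nest_lock() API introduced for this purpose):

    /* all these locks nest inside mm->mmap_sem, which is held for writing */
    spin_lock_nest_lock(&anon_vma->lock, &mm->mmap_sem);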
-
- 07 Aug 2008, 1 commit
-
-
By Linus Torvalds
This reverts commit 7cb93181, since we did that patch twice, and the problem was already fixed earlier by 78a34ae2.

Reported-by: Andi Kleen <andi@firstfloor.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 06 Aug 2008, 2 commits
-
-
By Benny Halevy
gcc 4.3.0 correctly emits the following warnings. When a vma covering addr is found, find_vma_prepare indeed returns without setting pprev, rb_link, and rb_parent.

    mm/mmap.c: In function `insert_vm_struct':
    mm/mmap.c:2085: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2085: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2084: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `copy_vma':
    mm/mmap.c:2124: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:2124: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:2123: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `do_brk':
    mm/mmap.c:1951: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1951: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1949: warning: `prev' may be used uninitialized in this function
    mm/mmap.c: In function `mmap_region':
    mm/mmap.c:1092: warning: `rb_parent' may be used uninitialized in this function
    mm/mmap.c:1092: warning: `rb_link' may be used uninitialized in this function
    mm/mmap.c:1089: warning: `prev' may be used uninitialized in this function

Hugh adds: in fact, none of find_vma_prepare's callers use those values when a vma is found to be already covering addr; it's either an error or an occasion to munmap and repeat. Okay, let's quieten the compiler (but I would prefer it if pprev, rb_link and rb_parent were meaningful in that case, rather than whatever's in them from descending the tree).

Signed-off-by: Benny Halevy <bhalevy@panasas.com>
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: "Ryan Hope" <rmh3093@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Andrew Morton
gcc-3.2:

    mm/mm_init.c:77:1: directives may not be used inside a macro argument
    mm/mm_init.c:76:47: unterminated argument list invoking macro "mminit_dprintk"
    mm/mm_init.c: In function `mminit_verify_pageflags_layout':
    mm/mm_init.c:80: `mminit_dprintk' undeclared (first use in this function)
    mm/mm_init.c:80: (Each undeclared identifier is reported only once
    mm/mm_init.c:80: for each function it appears in.)
    mm/mm_init.c:80: syntax error before numeric constant

Also fix a typo in a comment.

Reported-by: Adrian Bunk <bunk@kernel.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 05 Aug 2008, 3 commits
-
-
By Pekka Enberg
This patch changes the static MIN_PARTIAL to a dynamic per-cache ->min_partial value that is calculated from object size. The bigger the object size, the more pages we keep on the partial list.

I tested SLAB, SLUB, and SLUB with this patch on Jens Axboe's 'netio' example script of the fio benchmarking tool. The script stresses the networking subsystem, which should also give a fairly good beating of kmalloc() et al. To run the test yourself, first clone the fio repository:

    git clone git://git.kernel.dk/fio.git

and then run the following command n times on your machine:

    time ./fio examples/netio

The results on my 2-way 64-bit x86 machine are as follows (the minimum, maximum, and average are captured from 50 individual runs):

    real time (seconds)
                     min    max    avg    sd
    SLAB            22.76  23.38  22.98  0.17
    SLUB            22.80  25.78  23.46  0.72
    SLUB (dynamic)  22.74  23.54  23.00  0.20

    sys time (seconds)
                     min    max    avg    sd
    SLAB             6.90   8.28   7.70  0.28
    SLUB             7.42  16.95   8.89  2.28
    SLUB (dynamic)   7.17   8.64   7.73  0.29

    user time (seconds)
                     min    max    avg    sd
    SLAB            36.89  38.11  37.50  0.29
    SLUB            30.85  37.99  37.06  1.67
    SLUB (dynamic)  36.75  38.07  37.59  0.32

As you can see from the above numbers, this patch brings SLUB to the same level as SLAB for this particular workload, fixing a ~2% regression. I'd expect this change to help similar workloads that allocate a lot of objects that are close to the size of a page.

Cc: Matthew Wilcox <matthew@wil.cx>
Cc: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
-
By Nick Piggin
Converting page lock to new locking bitops requires a change of page flag operation naming, so we might as well convert it to something nicer (!TestSetPageLocked_Lock => trylock_page, SetPageLocked => set_page_locked). This also facilitates lockdeping of page lock.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
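For reference, the new name wraps the lock-bitop form roughly as follows (hedged sketch of the pagemap.h helper):

    static inline int trylock_page(struct page *page)
    {
        return !test_and_set_bit_lock(PG_locked, &page->flags);
    }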
-
By KOSAKI Motohiro
Halesh says:

Please find the below testcase provided to test mlock.

Test Case:

    ===========================
    #include <sys/resource.h>
    #include <stdio.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <stdlib.h>

    int main(void)
    {
        int fd, ret, i = 0;
        char *addr, *addr1 = NULL;
        unsigned int page_size;
        struct rlimit rlim;

        if (0 != geteuid()) {
            printf("Execute this pgm as root\n");
            exit(1);
        }

        /* create a file */
        if ((fd = open("mmap_test.c", O_RDWR|O_CREAT, 0755)) == -1) {
            printf("cant create test file\n");
            exit(1);
        }

        page_size = sysconf(_SC_PAGE_SIZE);

        /* set the MEMLOCK limit */
        rlim.rlim_cur = 2000;
        rlim.rlim_max = 2000;
        if ((ret = setrlimit(RLIMIT_MEMLOCK, &rlim)) != 0) {
            printf("Cant change limit values\n");
            exit(1);
        }

        addr = 0;
        while (1) {
            /* map a page into memory each time */
            if ((addr = (char *) mmap(addr, page_size, PROT_READ | PROT_WRITE,
                                      MAP_SHARED, fd, 0)) == MAP_FAILED) {
                printf("cant do mmap on file\n");
                exit(1);
            }

            if (0 == i)
                addr1 = addr;
            i++;
            errno = 0;
            /* lock the mapped memory pagewise */
            if ((ret = mlock((char *)addr, 1500)) == -1) {
                printf("errno value is %d\n", errno);
                printf("cant lock maped region\n");
                exit(1);
            }
            addr = addr + page_size;
        }
    }
    ======================================================

This testcase results in an mlock() failure with errno 14, that is EFAULT, but it has nowhere been specified that mlock() will return EFAULT. When I tested the same on older kernels like 2.6.18, I got the correct result, i.e. errno 12 (ENOMEM).

I think in the mlock(2) source code, setting errno to ENOMEM has been missed in do_mlock() on mlock_fixup() failure.

SUSv3 requires the following behavior from mlock(2):

    [ENOMEM] Some or all of the address range specified by the addr and len arguments does not correspond to valid mapped pages in the address space of the process.

    [EAGAIN] Some or all of the memory identified by the operation could not be locked when the call was made.

This rule isn't so nice and is slightly strange, but many people think POSIX/SUS compliance is important.

Reported-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com>
Tested-by: Halesh Sadashiv <halesh.sadashiv@ap.sony.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: <stable@kernel.org> [2.6.25.x, 2.6.26.x]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 04 Aug 2008, 1 commit
-
-
By Paul Mundt
Now that SH has switched to vmalloc_exec() for PAGE_KERNEL_EXEC usage, it's apparent that nommu has no vmalloc_exec() definition of its own. Stub in the one from mm/vmalloc.c.

Signed-off-by: Paul Mundt <lethal@linux-sh.org>
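The stub in question is essentially the mm/vmalloc.c version; roughly (hedged sketch):

    void *vmalloc_exec(unsigned long size)
    {
        /* nommu: same behaviour as the mm/vmalloc.c implementation */
        return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM, PAGE_KERNEL_EXEC);
    }
    EXPORT_SYMBOL(vmalloc_exec);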
-
- 03 Aug 2008, 1 commit
-
-
By Miklos Szeredi
Brian Wang reported that a FUSE filesystem exported through NFS could return I/O errors on read. This was traced to splice_direct_to_actor() returning a short or zero count when racing with page invalidation.

However this is not FUSE or NFSD specific; other filesystems (notably NFS) also call invalidate_inode_pages2() to purge stale data from the cache. If this happens while such pages are sitting in a pipe buffer, then splice(2) from the pipe can return zero, and read(2) from the pipe can return ENODATA.

The zero return is especially bad, since it implies end-of-file or disconnected pipe/socket, and is documented as such for splice. But returning an error for read() is also nasty, when in fact there was no error (data becoming stale is not an error).

The same problems can be triggered by "hole punching" with madvise(MADV_REMOVE).

Fix this by not clearing the PG_uptodate flag on truncation and invalidation.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <jens.axboe@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 02 Aug 2008, 1 commit
-
-
By Jack Steiner
Delete 2 EXPORTs that were accidentally sent upstream.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-