- 23 9月, 2009 6 次提交
-
-
由 Scott James Remnant 提交于
The act of a process becoming a session leader is a useful signal to a supervising init daemon such as Upstart. While a daemon will normally do this as part of the process of becoming a daemon, it is rare for its children to do so. When the children do, it is nearly always a sign that the child should be considered detached from the parent and not supervised along with it. The poster-child example is OpenSSH; the per-login children call setsid() so that they may control the pty connected to them. If the primary daemon dies or is restarted, we do not want to consider the per-login children and want to respawn the primary daemon without killing the children. This patch adds a new PROC_SID_EVENT and associated structure to the proc_event event_data union, it arranges for this to be emitted when the special PIDTYPE_SID pid is set. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: NScott James Remnant <scott@ubuntu.com> Acked-by: NMatt Helsley <matthltc@us.ibm.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: Evgeniy Polyakov <johnpol@2ka.mipt.ru> Acked-by: N"David S. Miller" <davem@davemloft.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 James Morris 提交于
Make all seq_operations structs const, to help mitigate against revectoring user-triggerable function pointers. This is derived from the grsecurity patch, although generated from scratch because it's simpler than extracting the changes from there. Signed-off-by: NJames Morris <jmorris@namei.org> Acked-by: NSerge Hallyn <serue@us.ibm.com> Acked-by: NCasey Schaufler <casey@schaufler-ca.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Xiao Guangrong 提交于
This patch can remove spinlock from struct call_function_data, the reasons are below: 1: add a new interface for cpumask named cpumask_test_and_clear_cpu(), it can atomically test and clear specific cpu, we can use it instead of cpumask_test_cpu() and cpumask_clear_cpu() and no need data->lock to protect those in generic_smp_call_function_interrupt(). 2: in smp_call_function_many(), after csd_lock() return, the current's cfd_data is deleted from call_function list, so it not have race between other cpus, then cfs_data is only used in smp_call_function_many() that must disable preemption and not from a hardware interrupthandler or from a bottom half handler to call, only the correspond cpu can use it, so it not have race in current cpu, no need cfs_data->lock to protect it. 3: after 1 and 2, cfs_data->lock is only use to protect cfs_data->refs in generic_smp_call_function_interrupt(), so we can define cfs_data->refs to atomic_t, and no need cfs_data->lock any more. Signed-off-by: NXiao Guangrong <xiaoguangrong@cn.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Peter Zijlstra <peterz@infradead.org> Acked-by: NRusty Russell <rusty@rustcorp.com.au> [akpm@linux-foundation.org: use atomic_dec_return()] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Nick Black 提交于
Move various magic-number definitions into magic.h. Signed-off-by: NNick Black <dank@qemfd.net> Acked-by: NPekka Enberg <penberg@cs.helsinki.fi> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Casey Schaufler <casey@schaufler-ca.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Dave Young 提交于
When syslog is not possible, at the same time there's no serial/net console available, it will be hard to read the printk messages. For example oops/panic/warning messages in shutdown phase. Add a printk delay feature, we can make each printk message delay some milliseconds. Setting the delay by proc/sysctl interface: /proc/sys/kernel/printk_delay The value range from 0 - 10000, default value is 0 [akpm@linux-foundation.org: fix a few things] Signed-off-by: NDave Young <hidave.darkstar@gmail.com> Acked-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrew Morton 提交于
of the form include/net/inet_sock.h:208: warning: ISO C90 forbids mixed declarations and code Cc: Johannes Berg <johannes@sipsolutions.net> Acked-by: NVegard Nossum <vegard.nossum@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 9月, 2009 34 次提交
-
-
由 David Härdeman 提交于
The shutdown method is used by the winbond cir driver to setup the hardware for wake-from-S5. Signed-off-by: NBjorn Helgaas <bjorn.helgaas@hp.com> Signed-off-by: NDavid Härdeman <david@hardeman.nu> Cc: Dmitry Torokhov <dtor@mail.ru> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daniel Mack 提交于
This offers a way for platforms to define flags and thresholds for the free-fall/wakeup functions of the lis302d chips. More registers needed to be seperated as they are specific to the Signed-off-by: NDaniel Mack <daniel@caiaq.de> Acked-by: NPavel Machek <pavel@ucw.cz> Cc: Eric Piel <eric.piel@tremplin-utc.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Daniel Mack 提交于
Bit 0x80 in CTRL_REG3 is an ACTIVE_LOW rather than an ACTIVE_HIGH function, I got that wrong during my last change. Signed-off-by: NDaniel Mack <daniel@caiaq.de> Acked-by: NPavel Machek <pavel@ucw.cz> Cc: Eric Piel <eric.piel@tremplin-utc.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
FLEX_ARRAY_INIT(element_size, total_nr_elements) cannot determine if either parameter is valid, so flex arrays which are statically allocated with this interface can easily become corrupted or reference beyond its allocated memory. This removes FLEX_ARRAY_INIT() as a struct flex_array initializer since no initializer may perform the required checking. Instead, the array is now defined with a new interface: DEFINE_FLEX_ARRAY(name, element_size, total_nr_elements) This may be prefixed with `static' for file scope. This interface includes compile-time checking of the parameters to ensure they are valid. Since the validity of both element_size and total_nr_elements depend on FLEX_ARRAY_BASE_SIZE and FLEX_ARRAY_PART_SIZE, the kernel build will fail if either of these predefined values changes such that the array parameters are no longer valid. Since BUILD_BUG_ON() requires compile time constants, several of the static inline functions that were once local to lib/flex_array.c had to be moved to include/linux/flex_array.h. Signed-off-by: NDavid Rientjes <rientjes@google.com> Acked-by: NDave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Add a new function to the flex_array API: int flex_array_shrink(struct flex_array *fa) This function will free all unused second-level pages. Since elements are now poisoned if they are not allocated with __GFP_ZERO, it's possible to identify parts that consist solely of unused elements. flex_array_shrink() returns the number of pages freed. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Newly initialized flex_array's and/or flex_array_part's are now poisoned with a new poison value, FLEX_ARRAY_FREE. It's value is similar to POISON_FREE used in the various slab allocators, but is different to distinguish between flex array's poisoned kmem and slab allocator poisoned kmem. This will allow us to identify flex_array_part's that only contain free elements (and free them with an addition to the flex_array API). This could also be extended in the future to identify `get' uses on elements that have not been `put'. If __GFP_ZERO is passed for a part's gfp mask, the poisoning is avoided. These elements are considered to be in-use since they have been initialized. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
Add a new function to the flex_array API: int flex_array_clear(struct flex_array *fa, unsigned int element_nr) This function will zero the element at element_nr in the flex_array. Although this is equivalent to using flex_array_put() and passing a pointer to zero'd memory, flex_array_clear() does not require such a pointer to memory that would most likely need to be allocated on the caller's stack which could be significantly large depending on element_size. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Dave Hansen <dave@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arjan van de Ven 提交于
Fix the menu idle governor which balances power savings, energy efficiency and performance impact. The reason for a reworked governor is that there have been serious performance issues reported with the existing code on Nehalem server systems. To show this I'm sure Andrew wants to see benchmark results: (benchmark is "fio", "no cstates" is using "idle=poll") no cstates current linux new algorithm 1 disk 107 Mb/s 85 Mb/s 105 Mb/s 2 disks 215 Mb/s 123 Mb/s 209 Mb/s 12 disks 590 Mb/s 320 Mb/s 585 Mb/s In various power benchmark measurements, no degredation was found by our measurement&diagnostics team. Obviously a small percentage more power was used in the "fio" benchmark, due to the much higher performance. While it would be a novel idea to describe the new algorithm in this commit message, I cheaped out and described it in comments in the code instead. [changes since first post: spelling fixes from akpm, review feedback, folded menu-tng into menu.c] Signed-off-by: NArjan van de Ven <arjan@linux.intel.com> Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com> Cc: Len Brown <lenb@kernel.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michael S. Tsirkin 提交于
Anyone who wants to do copy to/from user from a kernel thread, needs use_mm (like what fs/aio has). Move that into mm/, to make reusing and exporting easier down the line, and make aio use it. Next intended user, besides aio, will be vhost-net. Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric B Munson 提交于
Add a flag for mmap that will be used to request a huge page region that will look like anonymous memory to userspace. This is accomplished by using a file on the internal vfsmount. MAP_HUGETLB is a modifier of MAP_ANONYMOUS and so must be specified with it. The region will behave the same as a MAP_ANONYMOUS region using small pages. [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB] Signed-off-by: NEric B Munson <ebmunson@us.ibm.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Arnd Bergmann <arnd@arndb.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Arnd Bergmann 提交于
Add a flag for mmap that will be used to request a huge page region that will look like anonymous memory to user space. This is accomplished by using a file on the internal vfsmount. MAP_HUGETLB is a modifier of MAP_ANONYMOUS and so must be specified with it. The region will behave the same as a MAP_ANONYMOUS region using small pages. The patch also adds the MAP_STACK flag, which was previously defined only on some architectures but not on others. Since MAP_STACK is meant to be a hint only, architectures can define it without assigning a specific meaning to it. Signed-off-by: NArnd Bergmann <arnd@arndb.de> Cc: Eric B Munson <ebmunson@us.ibm.com> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: David Rientjes <rientjes@google.com> Cc: <linux-arch@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Eric B Munson 提交于
This patchset adds a flag to mmap that allows the user to request that an anonymous mapping be backed with huge pages. This mapping will borrow functionality from the huge page shm code to create a file on the kernel internal mount and use it to approximate an anonymous mapping. The MAP_HUGETLB flag is a modifier to MAP_ANONYMOUS and will not work without both flags being preset. A new flag is necessary because there is no other way to hook into huge pages without creating a file on a hugetlbfs mount which wouldn't be MAP_ANONYMOUS. To userspace, this mapping will behave just like an anonymous mapping because the file is not accessible outside of the kernel. This patchset is meant to simplify the programming model. Presently there is a large chunk of boiler platecode, contained in libhugetlbfs, required to create private, hugepage backed mappings. This patch set would allow use of hugepages without linking to libhugetlbfs or having hugetblfs mounted. Unification of the VM code would provide these same benefits, but it has been resisted each time that it has been suggested for several reasons: it would break PAGE_SIZE assumptions across the kernel, it makes page-table abstractions really expensive, and it does not provide any benefit on architectures that do not support huge pages, incurring fast path penalties without providing any benefit on these architectures. This patch: There are two means of creating mappings backed by huge pages: 1. mmap() a file created on hugetlbfs 2. Use shm which creates a file on an internal mount which essentially maps it MAP_SHARED The internal mount is only used for shared mappings but there is very little that stops it being used for private mappings. This patch extends hugetlbfs_file_setup() to deal with the creation of files that will be mapped MAP_PRIVATE on the internal hugetlbfs mount. This extended API is used in a subsequent patch to implement the MAP_HUGETLB mmap() flag. Signed-off-by: NEric Munson <ebmunson@us.ibm.com> Acked-by: NDavid Rientjes <rientjes@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
CONFIG_SHMEM off gives you (ramfs masquerading as) tmpfs, even when CONFIG_TMPFS is off: that's a little anomalous, and I'd intended to make more sense of it by removing CONFIG_TMPFS altogether, always enabling its code when CONFIG_SHMEM; but so many defconfigs have CONFIG_SHMEM on CONFIG_TMPFS off that we'd better leave that as is. But there is no point in asking for CONFIG_TMPFS if CONFIG_SHMEM is off: make TMPFS depend on SHMEM, which also prevents TMPFS_POSIX_ACL shmem_acl.o being pointlessly built into the kernel when SHMEM is off. And a selfish change, to prevent the world from being rebuilt when I switch between CONFIG_SHMEM on and off: the only CONFIG_SHMEM in the header files is mm.h shmem_lock() - give that a shmem.c stub instead. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NMatt Mackall <mpm@selenic.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
__get_user_pages() has been taking its own GUP flags, then processing them into FOLL flags for follow_page(). Though oddly named, the FOLL flags are more widely used, so pass them to __get_user_pages() now. Sorry, VM flags, VM_FAULT flags and FAULT_FLAGs are still distinct. (The patch to __get_user_pages() looks peculiar, with both gup_flags and foll_flags: the gup_flags remain constant; but as before there's an exceptional case, out of scope of the patch, in which foll_flags per page have FOLL_WRITE masked off.) Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
follow_hugetlb_page() shouldn't be guessing about the coredump case either: pass the foll_flags down to it, instead of just the write bit. Remove that obscure huge_zeropage_ok() test. The decision is easy, though unlike the non-huge case - here vm_ops->fault is always set. But we know that a fault would serve up zeroes, unless there's already a hugetlbfs pagecache page to back the range. (Alternatively, since hugetlb pages aren't swapped out under pressure, you could save more dump space by arguing that a page not yet faulted into this process cannot be relevant to the dump; but that would be more surprising.) Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NRik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
The "FOLL_ANON optimization" and its use_zero_page() test have caused confusion and bugs: why does it test VM_SHARED? for the very good but unsatisfying reason that VMware crashed without. As we look to maybe reinstating anonymous use of the ZERO_PAGE, we need to sort this out. Easily done: it's silly for __get_user_pages() and follow_page() to be guessing whether it's safe to assume that they're being used for a coredump (which can take a shortcut snapshot where other uses must handle a fault) - just tell them with GUP_FLAGS_DUMP and FOLL_DUMP. get_dump_page() doesn't even want a ZERO_PAGE: an error suits fine. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
In preparation for the next patch, add a simple get_dump_page(addr) interface for the CONFIG_ELF_CORE dumpers to use, instead of calling get_user_pages() directly. They're not interested in errors: they just want to use holes as much as possible, to save space and make sure that the data is aligned where the headers said it would be. Oh, and don't use that horrid DUMP_SEEK(off) macro! Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NRik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The following two patches remove searching in the page allocator fast-path by maintaining multiple free-lists in the per-cpu structure. At the time the search was introduced, increasing the per-cpu structures would waste a lot of memory as per-cpu structures were statically allocated at compile-time. This is no longer the case. The patches are as follows. They are based on mmotm-2009-08-27. Patch 1 adds multiple lists to struct per_cpu_pages, one per migratetype that can be stored on the PCP lists. Patch 2 notes that the pcpu drain path check empty lists multiple times. The patch reduces the number of checks by maintaining a count of free lists encountered. Lists containing pages will then free multiple pages in batch The patches were tested with kernbench, netperf udp/tcp, hackbench and sysbench. The netperf tests were not bound to any CPU in particular and were run such that the results should be 99% confidence that the reported results are within 1% of the estimated mean. sysbench was run with a postgres background and read-only tests. Similar to netperf, it was run multiple times so that it's 99% confidence results are within 1%. The patches were tested on x86, x86-64 and ppc64 as x86: Intel Pentium D 3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.34% to 2.28% gain netperf-tcp - 0.45% to 1.22% gain hackbench - Small variances, very close to noise sysbench - Very small gains x86-64: AMD Phenom 9950 1.3GHz with 8G RAM (no-brand machine) kernbench - No significant difference, variance well within noise netperf-udp - 1.83% to 10.42% gains netperf-tcp - No conclusive until buffer >= PAGE_SIZE 4096 +15.83% 8192 + 0.34% (not significant) 16384 + 1% hackbench - Small gains, very close to noise sysbench - 0.79% to 1.6% gain ppc64: PPC970MP 2.5GHz with 10GB RAM (it's a terrasoft powerstation) kernbench - No significant difference, variance well within noise netperf-udp - 2-3% gain for almost all buffer sizes tested netperf-tcp - losses on small buffers, gains on larger buffers possibly indicates some bad caching effect. hackbench - No significant difference sysbench - 2-4% gain This patch: Currently the per-cpu page allocator searches the PCP list for pages of the correct migrate-type to reduce the possibility of pages being inappropriate placed from a fragmentation perspective. This search is potentially expensive in a fast-path and undesirable. Splitting the per-cpu list into multiple lists increases the size of a per-cpu structure and this was potentially a major problem at the time the search was introduced. These problem has been mitigated as now only the necessary number of structures is allocated for the running system. This patch replaces a list search in the per-cpu allocator with one list per migrate type. The potential snag with this approach is when bulk freeing pages. We round-robin free pages based on migrate type which has little bearing on the cache hotness of the page and potentially checks empty lists repeatedly in the event the majority of PCP pages are of one type. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NNick Piggin <npiggin@suse.de> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Currently, OOM logic callflow is here. __out_of_memory() select_bad_process() for each task badness() calculate badness of one task oom_kill_process() search child oom_kill_task() kill target task and mm shared tasks with it example, process-A have two thread, thread-A and thread-B and it have very fat memory and each thread have following oom_adj and oom_score. thread-A: oom_adj = OOM_DISABLE, oom_score = 0 thread-B: oom_adj = 0, oom_score = very-high Then, select_bad_process() select thread-B, but oom_kill_task() refuse kill the task because thread-A have OOM_DISABLE. Thus __out_of_memory() call select_bad_process() again. but select_bad_process() select the same task. It mean kernel fall in livelock. The fact is, select_bad_process() must select killable task. otherwise OOM logic go into livelock. And root cause is, oom_adj shouldn't be per-thread value. it should be per-process value because OOM-killer kill a process, not thread. Thus This patch moves oomkilladj (now more appropriately named oom_adj) from struct task_struct to struct signal_struct. it naturally prevent select_bad_process() choose wrong task. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Wu Fengguang 提交于
For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in which case shrink_list() _still_ calls isolate_pages() with the much larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan rate by up to 32 times. For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4. So when shrink_zone() expects to scan 4 pages in the active/inactive list, the active list will be scanned 4 pages, while the inactive list will be (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break the balance between the two lists. It can further impact the scan of anon active list, due to the anon active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone(): inactive anon list over scanned => inactive_anon_is_low() == TRUE => shrink_active_list() => active anon list over scanned So the end result may be - anon inactive => over scanned - anon active => over scanned (maybe not as much) - file inactive => over scanned - file active => under scanned (relatively) The accesses to nr_saved_scan are not lock protected and so not 100% accurate, however we can tolerate small errors and the resulted small imbalanced scan rates between zones. Cc: Rik van Riel <riel@redhat.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NWu Fengguang <fengguang.wu@intel.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Alexey Dobriyan 提交于
Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com> Acked-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Beulich 提交于
This is being done by allowing boot time allocations to specify that they may want a sub-page sized amount of memory. Overall this seems more consistent with the other hash table allocations, and allows making two supposedly mm-only variables really mm-only (nr_{kernel,all}_pages). Signed-off-by: NJan Beulich <jbeulich@novell.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jan Beulich 提交于
Sizing of memory allocations shouldn't depend on the number of physical pages found in a system, as that generally includes (perhaps a huge amount of) non-RAM pages. The amount of what actually is usable as storage should instead be used as a basis here. Some of the calculations (i.e. those not intending to use high memory) should likely even use (totalram_pages - totalhigh_pages). Signed-off-by: NJan Beulich <jbeulich@novell.com> Acked-by: NRusty Russell <rusty@rustcorp.com.au> Acked-by: NIngo Molnar <mingo@elte.hu> Cc: Dave Airlie <airlied@linux.ie> Cc: Kyle McMartin <kyle@mcmartin.ca> Cc: Jeremy Fitzhardinge <jeremy@goop.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk> Cc: "David S. Miller" <davem@davemloft.net> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Make page_has_private() return a true boolean value and remove the double negations from the two callsites using it for arithmetic. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Christoph Lameter <cl@linux-foundation.org> Reviewed-by: NChristoph Lameter <cl@linux-foundation.org> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
page_is_file_cache() has been used for both boolean checks and LRU arithmetic, which was always a bit weird. Now that page_lru_base_type() exists for LRU arithmetic, make page_is_file_cache() a real predicate function and adjust the boolean-using callsites to drop those pesky double negations. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
Instead of abusing page_is_file_cache() for LRU list index arithmetic, add another helper with a more appropriate name and convert the non-boolean users of page_is_file_cache() accordingly. This new helper gives the LRU base type a page is supposed to live on, inactive anon or inactive file. [hugh.dickins@tiscali.co.uk: convert del_page_from_lru() also] Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sage Weil 提交于
The kzalloc mempool zeros items when they are initially allocated, but does not rezero used items that are returned to the pool. Consequently mempool_alloc()s may return non-zeroed memory. Since there are/were only two in-tree users for mempool_create_kzalloc_pool(), and 'fixing' this in a way that will re-zero used (but not new) items before first use is non-trivial, just remove it. Signed-off-by: NSage Weil <sage@newdream.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The page allocation trace event reports that a page was successfully allocated but it does not specify where it came from. When analysing performance, it can be important to distinguish between pages coming from the per-cpu allocator and pages coming from the buddy lists as the latter requires the zone lock to the taken and more data structures to be examined. This patch adds a trace event for __rmqueue reporting when a page is being allocated from the buddy lists. It distinguishes between being called to refill the per-cpu lists or whether it is a high-order allocation. Similarly, this patch adds an event to catch when the PCP lists are being drained a little and pages are going back to the buddy lists. This is trickier to draw conclusions from but high activity on those events could explain why there were a large number of cache misses on a page-allocator-intensive workload. The coalescing and splitting of buddies involves a lot of writing of page metadata and cache line bounces not to mention the acquisition of an interrupt-safe lock necessary to enter this path. [akpm@linux-foundation.org: fix build] Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NIngo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Fragmentation avoidance depends on being able to use free pages from lists of the appropriate migrate type. In the event this is not possible, __rmqueue_fallback() selects a different list and in some circumstances change the migratetype of the pageblock. Simplistically, the more times this event occurs, the more likely that fragmentation will be a problem later for hugepage allocation at least but there are other considerations such as the order of page being split to satisfy the allocation. This patch adds a trace event for __rmqueue_fallback() that reports what page is being used for the fallback, the orders of relevant pages, the desired migratetype and the migratetype of the lists being used, whether the pageblock changed type and whether this event is important with respect to fragmentation avoidance or not. This information can be used to help analyse fragmentation avoidance and help decide whether min_free_kbytes should be increased or not. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NIngo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
This patch adds trace events for the allocation and freeing of pages, including the freeing of pagevecs. Using the events, it will be known what struct page and pfns are being allocated and freed and what the call site was in many cases. The page alloc tracepoints be used as an indicator as to whether the workload was heavily dependant on the page allocator or not. You can make a guess based on vmstat but you can't get a per-process breakdown. Depending on the call path, the call_site for page allocation may be __get_free_pages() instead of a useful callsite. Instead of passing down a return address similar to slab debugging, the user should enable the stacktrace and seg-addr options to get a proper stack trace. The pagevec free tracepoint has a different usecase. It can be used to get a idea of how many pages are being dumped off the LRU and whether it is kswapd doing the work or a process doing direct reclaim. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NRik van Riel <riel@redhat.com> Reviewed-by: NIngo Molnar <mingo@elte.hu> Cc: Larry Woodman <lwoodman@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Li Ming Chun <macli@brc.ubc.ca> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The function free_cold_page() has no callers so delete it. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Just as the swapoff system call allocates many pages of RAM to various processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run" (unmerge) is liable to allocate many pages of RAM to various processes, perhaps triggering OOM; and each is normally run from a modest admin process (swapoff or shell), easily repeated until it succeeds. So treat unmerge_and_remove_all_rmap_items() in the same way that we treat try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both with that, to ask the OOM killer to kill them first, to prevent them from spawning more and more OOM kills. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NIzik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
A few cleanups, given the munlock fix: the comment on ksm_test_exit() no longer applies, and it can be made private to ksm.c; there's no more reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h. Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: NIzik Eidus <ieidus@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Rawhide users have reported hang at startup when cryptsetup is run: the same problem can be simply reproduced by running a program int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; } The problem is that exit_mmap() applies munlock_vma_pages_all() to clean up VM_LOCKED areas, and its current implementation (stupidly) tries to fault in absent pages, for example where PROT_NONE prevented them being faulted in when mlocking. Whereas the "ksm: fix oom deadlock" patch, knowing there's a race by which KSM might try to fault in pages after exit_mmap() had finally zapped the range, backs out of such faults doing nothing when its ksm_test_exit() notices mm_users 0. So revert that part of "ksm: fix oom deadlock" which moved the ksm_exit() call from before exit_mmap() to the middle of exit_mmap(); and remove those ksm_test_exit() checks from the page fault paths, so allowing the munlocking to proceed without interference. ksm_exit, if there are rmap_items still chained on this mm slot, takes mmap_sem write side: so preventing KSM from working on an mm while exit_mmap runs. And KSM will bail out as soon as it notices that mm_users is already zero, thanks to its internal ksm_test_exit checks. So that when a task is killed by OOM killer or the user, KSM will not indefinitely prevent it from running exit_mmap to release its memory. This does break a part of what "ksm: fix oom deadlock" was trying to achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even when ksmd itself has to cancel a KSM page, it is possible that the first OOM-kill victim would be the KSM process being faulted: then its memory won't be freed until a second victim has been selected (freeing memory for the unmerging fault to complete). But the OOM killer is already liable to kill a second victim once the intended victim's p->mm goes to NULL: so there's not much point in rejecting this KSM patch before fixing that OOM behaviour. It is very much more important to allow KSM users to boot up, than to haggle over an unlikely and poorly supported OOM case. We also intend to fix munlocking to not fault pages: at which point this patch _could_ be reverted; though that would be controversial, so we hope to find a better solution. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NJustin M. Forbes <jforbes@redhat.com> Acked-for-now-by: NHugh Dickins <hugh.dickins@tiscali.co.uk> Cc: Izik Eidus <ieidus@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-