- 23 3月, 2011 5 次提交
-
-
由 Andrey Vagin 提交于
We shouldn't defer oom killing if a thread has already detached its ->mm and still has TIF_MEMDIE set. Memory needs to be freed, so find kill other threads that pin the same ->mm or find another task to kill. Signed-off-by: NAndrey Vagin <avagin@openvz.org> Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
This patch prevents unnecessary oom kills or kernel panics by reverting two commits: 495789a5 (oom: make oom_score to per-process value) cef1d352 (oom: multi threaded process coredump don't make deadlock) First, 495789a5 (oom: make oom_score to per-process value) ignores the fact that all threads in a thread group do not necessarily exit at the same time. It is imperative that select_bad_process() detect threads that are in the exit path, specifically those with PF_EXITING set, to prevent needlessly killing additional tasks. If a process is oom killed and the thread group leader exits, select_bad_process() cannot detect the other threads that are PF_EXITING by iterating over only processes. Thus, it currently chooses another task unnecessarily for oom kill or panics the machine when nothing else is eligible. By iterating over threads instead, it is possible to detect threads that are exiting and nominate them for oom kill so they get access to memory reserves. Second, cef1d352 (oom: multi threaded process coredump don't make deadlock) erroneously avoids making the oom killer a no-op when an eligible thread other than current isfound to be exiting. We want to detect this situation so that we may allow that exiting thread time to exit and free its memory; if it is able to exit on its own, that should free memory so current is no loner oom. If it is not able to exit on its own, the oom killer will nominate it for oom kill which, in this case, only means it will get access to memory reserves. Without this change, it is easy for the oom killer to unnecessarily target tasks when all threads of a victim don't exit before the thread group leader or, in the worst case, panic the machine. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrey Vagin <avagin@openvz.org> Cc: <stable@kernel.org> [2.6.38.x] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
If an administrator tries to swapon a file backed by NFS, the inode mutex is taken (as it is for any swapfile) but later identified to be a bad swapfile due to the lack of bmap and tries to cleanup. During cleanup, an attempt is made to close the file but with inode->i_mutex still held. Closing an NFS file syncs it which tries to acquire the inode mutex leading to deadlock. If lockdep is enabled the following appears on the console; ============================================= [ INFO: possible recursive locking detected ] 2.6.38-rc8-autobuild #1 --------------------------------------------- swapon/2192 is trying to acquire lock: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: vfs_fsync_range+0x47/0x7c but task is already holding lock: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7 other info that might help us debug this: 1 lock held by swapon/2192: #0: (&sb->s_type->i_mutex_key#13){+.+.+.}, at: sys_swapon+0x28d/0xae7 stack backtrace: Pid: 2192, comm: swapon Not tainted 2.6.38-rc8-autobuild #1 Call Trace: __lock_acquire+0x2eb/0x1623 find_get_pages_tag+0x14a/0x174 pagevec_lookup_tag+0x25/0x2e vfs_fsync_range+0x47/0x7c lock_acquire+0xd3/0x100 vfs_fsync_range+0x47/0x7c nfs_flush_one+0x0/0xdf [nfs] mutex_lock_nested+0x40/0x2b1 vfs_fsync_range+0x47/0x7c vfs_fsync_range+0x47/0x7c vfs_fsync+0x1c/0x1e nfs_file_flush+0x64/0x69 [nfs] filp_close+0x43/0x72 sys_swapon+0xa39/0xae7 sysret_check+0x2e/0x69 system_call_fastpath+0x16/0x1b This patch releases the mutex if its held before calling filep_close() so swapon fails as expected without deadlock when the swapfile is backed by NFS. If accepted for 2.6.39, it should also be considered a -stable candidate for 2.6.38 and 2.6.37. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NHugh Dickins <hughd@google.com> Cc: <stable@kernel.org> [2.6.37+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Christoph Lameter 提交于
Add some statistics for debugging. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
OOM path is missing the irq restore in the CONFIG_CMPXCHG_LOCAL case. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 21 3月, 2011 1 次提交
-
-
由 Christoph Lameter 提交于
The redo label needs #ifdeffery. Fixes the following problem introduced by commit 8a5ec0ba ("Lockless (and preemptless) fastpaths for slub"): mm/slub.c: In function 'slab_free': mm/slub.c:2124: warning: label 'redo' defined but not used Reported-by: NStephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 18 3月, 2011 4 次提交
-
-
由 Andrea Arcangeli 提交于
Change the _mapcount value indicating PageBuddy from -2 to -128 for more robusteness against page_mapcount() undeflows. Use reset_page_mapcount instead of __ClearPageBuddy in bad_page to ignore the previous retval of PageBuddy(). Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Huang Ying 提交于
Unused. Signed-off-by: NHuang Ying <ying.huang@intel.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
由 Huang Ying 提交于
Make __get_user_pages return -EHWPOISON for HWPOISON page only if FOLL_HWPOISON is specified. With this patch, the interested callers can distinguish HWPOISON pages from general FAULT pages, while other callers will still get -EFAULT for all these pages, so the user space interface need not to be changed. This feature is needed by KVM, where UCR MCE should be relayed to guest for HWPOISON page, while instruction emulation and MMIO will be tried for general FAULT page. The idea comes from Andrew Morton. Signed-off-by: NHuang Ying <ying.huang@intel.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com> Signed-off-by: NAvi Kivity <avi@redhat.com>
-
由 Huang Ying 提交于
In most cases, get_user_pages and get_user_pages_fast should be used to pin user pages in memory. But sometimes, some special flags except FOLL_GET, FOLL_WRITE and FOLL_FORCE are needed, for example in following patch, KVM needs FOLL_HWPOISON. To support these users, __get_user_pages is exported directly. There are some symbol name conflicts in infiniband driver, fixed them too. Signed-off-by: NHuang Ying <ying.huang@intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Michel Lespinasse <walken@google.com> CC: Roland Dreier <roland@kernel.org> CC: Ralph Campbell <infinipath@qlogic.com> Signed-off-by: NMarcelo Tosatti <mtosatti@redhat.com>
-
- 15 3月, 2011 2 次提交
-
-
由 Linus Torvalds 提交于
This reverts the parent commit. I hate doing that, but it's generating some discussion ("half of it is right"), and since I am planning on doing the 2.6.38 release later today we can punt it to stable if required. Let's not rock the boat right now. Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
oom_kill_process() starts with victim_points == 0. This means that (most likely) any child has more points and can be killed erroneously. Also, "children has a different mm" doesn't match the reality, we should check child->mm != t->mm. This check is not exactly correct if t->mm == NULL but this doesn't really matter, oom_kill_task() will kill them anyway. Note: "Kill all processes sharing p->mm" in oom_kill_task() is wrong too. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 3月, 2011 3 次提交
-
-
由 Hugh Dickins 提交于
THP's collapse_huge_page() has an understandable but ugly difference in when its huge page is allocated: inside if NUMA but outside if not. It's hardly surprising that the memcg failure path forgot that, freeing the page in the non-NUMA case, then hitting a VM_BUG_ON in get_page() (or even worse, using the freed page). Signed-off-by: NHugh Dickins <hughd@google.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Aneesh Kumar K.V 提交于
The exportfs encode handle function should return the minimum required handle size. This helps user to find out the handle size by passing 0 handle size in the first step and then redoing to the call again with the returned handle size value. Acked-by: NSerge Hallyn <serue@us.ibm.com> Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Andrea Arcangeli 提交于
When vmscan.c calls page_referenced(), if an anon page was created before a process forked, rmap will search for it in both of the processes, even though one of them might have since broken COW. If the child process mlocks the vma where the COWed page belongs to, page_referenced() running on the page mapped by the parent would lead to *vm_flags getting VM_LOCKED set erroneously (leading to the references on the parent page being ignored and evicting the parent page too early). *mapcount would also be decremented by page_referenced_one even if the page wasn't found by page_check_address. This also lets pmdp_clear_flush_young_notify() go ahead on a pmd_trans_splitting() pmd. We hold the page_table_lock so __split_huge_page_map() must wait the pmdp_clear_flush_young_notify() to complete before it can modify the pmd. The pmd is also still mapped in userland so the young bit may materialize through a tlb miss before split_huge_page_map runs. This will provide a more accurate page_referenced() behavior during split_huge_page(). Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reported-by: NMichel Lespinasse <walken@google.com> Reviewed-by: NMichel Lespinasse <walken@google.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Reviewed-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: Rik van Riel<riel@redhat.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 3月, 2011 3 次提交
-
-
由 Lai Jiangshan 提交于
The size of struct rcu_head may be changed. When it becomes larger, it may pollute the data after struct slab. Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Lai Jiangshan 提交于
The size of struct rcu_head may be changed. When it becomes larger, it will pollute the page array. We reserve some some bytes for struct rcu_head when a slab is allocated in this situation. Changed from V1: use VM_BUG_ON instead BUG_ON Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Lai Jiangshan 提交于
There is no "struct" for slub's slab, it shares with struct page. But struct page is very small, it is insufficient when we need to add some metadata for slab. So we add a field "reserved" to struct kmem_cache, when a slab is allocated, kmem_cache->reserved bytes are automatically reserved at the end of the slab for slab's metadata. Changed from v1: Export the reserved field via sysfs Acked-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 11 3月, 2011 2 次提交
-
-
由 Christoph Lameter 提交于
Use the this_cpu_cmpxchg_double functionality to implement a lockless allocation algorithm on arches that support fast this_cpu_ops. Each of the per cpu pointers is paired with a transaction id that ensures that updates of the per cpu information can only occur in sequence on a certain cpu. A transaction id is a "long" integer that is comprised of an event number and the cpu number. The event number is incremented for every change to the per cpu state. This means that the cmpxchg instruction can verify for an update that nothing interfered and that we are updating the percpu structure for the processor where we picked up the information and that we are also currently on that processor when we update the information. This results in a significant decrease of the overhead in the fastpaths. It also makes it easy to adopt the fast path for realtime kernels since this is lockless and does not require the use of the current per cpu area over the critical section. It is only important that the per cpu area is current at the beginning of the critical section and at the end. So there is no need even to disable preemption. Test results show that the fastpath cycle count is reduced by up to ~ 40% (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree adds a few cycles. Sadly this does nothing for the slowpath which is where the main issues with performance in slub are but the best case performance rises significantly. (For that see the more complex slub patches that require cmpxchg_double) Kmalloc: alloc/free test Before: 10000 times kmalloc(8)/kfree -> 134 cycles 10000 times kmalloc(16)/kfree -> 152 cycles 10000 times kmalloc(32)/kfree -> 144 cycles 10000 times kmalloc(64)/kfree -> 142 cycles 10000 times kmalloc(128)/kfree -> 142 cycles 10000 times kmalloc(256)/kfree -> 132 cycles 10000 times kmalloc(512)/kfree -> 132 cycles 10000 times kmalloc(1024)/kfree -> 135 cycles 10000 times kmalloc(2048)/kfree -> 135 cycles 10000 times kmalloc(4096)/kfree -> 135 cycles 10000 times kmalloc(8192)/kfree -> 144 cycles 10000 times kmalloc(16384)/kfree -> 754 cycles After: 10000 times kmalloc(8)/kfree -> 78 cycles 10000 times kmalloc(16)/kfree -> 78 cycles 10000 times kmalloc(32)/kfree -> 82 cycles 10000 times kmalloc(64)/kfree -> 88 cycles 10000 times kmalloc(128)/kfree -> 79 cycles 10000 times kmalloc(256)/kfree -> 79 cycles 10000 times kmalloc(512)/kfree -> 85 cycles 10000 times kmalloc(1024)/kfree -> 82 cycles 10000 times kmalloc(2048)/kfree -> 82 cycles 10000 times kmalloc(4096)/kfree -> 85 cycles 10000 times kmalloc(8192)/kfree -> 82 cycles 10000 times kmalloc(16384)/kfree -> 706 cycles Kmalloc: Repeatedly allocate then free test Before: 10000 times kmalloc(8) -> 211 cycles kfree -> 113 cycles 10000 times kmalloc(16) -> 174 cycles kfree -> 115 cycles 10000 times kmalloc(32) -> 235 cycles kfree -> 129 cycles 10000 times kmalloc(64) -> 222 cycles kfree -> 120 cycles 10000 times kmalloc(128) -> 343 cycles kfree -> 139 cycles 10000 times kmalloc(256) -> 827 cycles kfree -> 147 cycles 10000 times kmalloc(512) -> 1048 cycles kfree -> 272 cycles 10000 times kmalloc(1024) -> 2043 cycles kfree -> 528 cycles 10000 times kmalloc(2048) -> 4002 cycles kfree -> 571 cycles 10000 times kmalloc(4096) -> 7740 cycles kfree -> 628 cycles 10000 times kmalloc(8192) -> 8062 cycles kfree -> 850 cycles 10000 times kmalloc(16384) -> 8895 cycles kfree -> 1249 cycles After: 10000 times kmalloc(8) -> 190 cycles kfree -> 129 cycles 10000 times kmalloc(16) -> 76 cycles kfree -> 123 cycles 10000 times kmalloc(32) -> 126 cycles kfree -> 124 cycles 10000 times kmalloc(64) -> 181 cycles kfree -> 128 cycles 10000 times kmalloc(128) -> 310 cycles kfree -> 140 cycles 10000 times kmalloc(256) -> 809 cycles kfree -> 165 cycles 10000 times kmalloc(512) -> 1005 cycles kfree -> 269 cycles 10000 times kmalloc(1024) -> 1999 cycles kfree -> 527 cycles 10000 times kmalloc(2048) -> 3967 cycles kfree -> 570 cycles 10000 times kmalloc(4096) -> 7658 cycles kfree -> 637 cycles 10000 times kmalloc(8192) -> 8111 cycles kfree -> 859 cycles 10000 times kmalloc(16384) -> 8791 cycles kfree -> 1173 cycles Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
由 Christoph Lameter 提交于
The following patch will make the fastpaths lockless and will no longer require interrupts to be disabled. Calling the free hook with irq disabled will no longer be possible. Move the slab_free_hook_irq() logic into slab_free_hook. Only disable interrupts if the features are selected that require callbacks with interrupts off and reenable after calls have been made. Signed-off-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 05 3月, 2011 3 次提交
-
-
由 Andi Kleen 提交于
Pass down the correct node for a transparent hugepage allocation. Most callers continue to use the current node, however the hugepaged daemon now uses the previous node of the first to be collapsed page instead. This ensures that khugepaged does not mess up local memory for an existing process which uses local policy. The choice of node is somewhat primitive currently: it just uses the node of the first page in the pmd range. An alternative would be to look at multiple pages and use the most popular node. I used the simplest variant for now which should work well enough for the case of all pages being on the same node. [akpm@linux-foundation.org: coding-style fixes] Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
This makes a difference for LOCAL policy, where the node cannot be determined from the policy itself, but has to be gotten from the original page. Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
Currently alloc_pages_vma() always uses the local node as policy node for the LOCAL policy. Pass this node down as an argument instead. No behaviour change from this patch, but will be needed for followons. Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 3月, 2011 1 次提交
-
-
由 Justin P. Mattock 提交于
Signed-off-by: NJustin P. Mattock <justinmattock@gmail.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 27 2月, 2011 1 次提交
-
-
由 Mariusz Kozlowski 提交于
mm/slub.c: In function 'ksize': mm/slub.c:2728: error: implicit declaration of function 'slab_ksize' slab_ksize() needs to go out of CONFIG_SLUB_DEBUG section. Acked-by: NRandy Dunlap <randy.dunlap@oracle.com> Acked-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NMariusz Kozlowski <mk@lab.zgora.pl> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 26 2月, 2011 6 次提交
-
-
由 Yinghai Lu 提交于
Heiko found recent memblock change triggers these warnings on s390: mm/page_alloc.c:3623:22: warning: 'last_active_region_index_in_nid' defined but not used mm/page_alloc.c:3638:22: warning: 'previous_active_region_index_in_nid' defined but not used Need to move those two function under HAVE_MEMBLOCK with its only user, find_memory_core_early(). -tj: Minor updates to description. Reported-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: NYinghai Lu <yinghai@kernel.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Hugh Dickins 提交于
It seems odd that truncate_inode_pages_range(), called not only when truncating but also when evicting inodes, has mem_cgroup_uncharge_start and _end() batching in its second loop to clear up a few leftovers, but not in its first loop that does almost all the work: add them there too. Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NBalbir Singh <balbir@linux.vnet.ibm.com> Acked-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andi Kleen 提交于
The THP code didn't pass the correct interleaving shift to the memory policy code. Fix this here by adjusting for the order. Signed-off-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NChristoph Lameter <cl@linux.com> Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Namhyung Kim 提交于
When pfn_valid_within() failed 'iter' was incremented twice. Signed-off-by: NNamhyung Kim <namhyung@gmail.com> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
should_continue_reclaim() for reclaim/compaction allows scanning to continue even if pages are not being reclaimed until the full list is scanned. In terms of allocation success, this makes sense but potentially it introduces unwanted latency for high-order allocations such as transparent hugepages and network jumbo frames that would prefer to fail the allocation attempt and fallback to order-0 pages. Worse, there is a potential that the full LRU scan will clear all the young bits, distort page aging information and potentially push pages into swap that would have otherwise remained resident. This patch will stop reclaim/compaction if no pages were reclaimed in the last SWAP_CLUSTER_MAX pages that were considered. For allocations such as hugetlbfs that use __GFP_REPEAT and have fewer fallback options, the full LRU list may still be scanned. Order-0 allocation should not be affected because RECLAIM_MODE_COMPACTION is not set so the following avoids the gfp_mask being examined: if (!(sc->reclaim_mode & RECLAIM_MODE_COMPACTION)) return false; A tool was developed based on ftrace that tracked the latency of high-order allocations while transparent hugepage support was enabled and three benchmarks were run. The "fix-infinite" figures are 2.6.38-rc4 with Johannes's patch "vmscan: fix zone shrinking exit when scan work is done" applied. STREAM Highorder Allocation Latency Statistics fix-infinite break-early 1 :: Count 10298 10229 1 :: Min 0.4560 0.4640 1 :: Mean 1.0589 1.0183 1 :: Max 14.5990 11.7510 1 :: Stddev 0.5208 0.4719 2 :: Count 2 1 2 :: Min 1.8610 3.7240 2 :: Mean 3.4325 3.7240 2 :: Max 5.0040 3.7240 2 :: Stddev 1.5715 0.0000 9 :: Count 111696 111694 9 :: Min 0.5230 0.4110 9 :: Mean 10.5831 10.5718 9 :: Max 38.4480 43.2900 9 :: Stddev 1.1147 1.1325 Mean time for order-1 allocations is reduced. order-2 looks increased but with so few allocations, it's not particularly significant. THP mean allocation latency is also reduced. That said, allocation time varies so significantly that the reductions are within noise. Max allocation time is reduced by a significant amount for low-order allocations but reduced for THP allocations which presumably are now breaking before reclaim has done enough work. SysBench Highorder Allocation Latency Statistics fix-infinite break-early 1 :: Count 15745 15677 1 :: Min 0.4250 0.4550 1 :: Mean 1.1023 1.0810 1 :: Max 14.4590 10.8220 1 :: Stddev 0.5117 0.5100 2 :: Count 1 1 2 :: Min 3.0040 2.1530 2 :: Mean 3.0040 2.1530 2 :: Max 3.0040 2.1530 2 :: Stddev 0.0000 0.0000 9 :: Count 2017 1931 9 :: Min 0.4980 0.7480 9 :: Mean 10.4717 10.3840 9 :: Max 24.9460 26.2500 9 :: Stddev 1.1726 1.1966 Again, mean time for order-1 allocations is reduced while order-2 allocations are too few to draw conclusions from. The mean time for THP allocations is also slightly reduced albeit the reductions are within varianes. Once again, our maximum allocation time is significantly reduced for low-order allocations and slightly increased for THP allocations. Anon stream mmap reference Highorder Allocation Latency Statistics 1 :: Count 1376 1790 1 :: Min 0.4940 0.5010 1 :: Mean 1.0289 0.9732 1 :: Max 6.2670 4.2540 1 :: Stddev 0.4142 0.2785 2 :: Count 1 - 2 :: Min 1.9060 - 2 :: Mean 1.9060 - 2 :: Max 1.9060 - 2 :: Stddev 0.0000 - 9 :: Count 11266 11257 9 :: Min 0.4990 0.4940 9 :: Mean 27250.4669 24256.1919 9 :: Max 11439211.0000 6008885.0000 9 :: Stddev 226427.4624 186298.1430 This benchmark creates one thread per CPU which references an amount of anonymous memory 1.5 times the size of physical RAM. This pounds swap quite heavily and is intended to exercise THP a bit. Mean allocation time for order-1 is reduced as before. It's also reduced for THP allocations but the variations here are pretty massive due to swap. As before, maximum allocation times are significantly reduced. Overall, the patch reduces the mean and maximum allocation latencies for the smaller high-order allocations. This was with Slab configured so it would be expected to be more significant with Slub which uses these size allocations more aggressively. The mean allocation times for THP allocations are also slightly reduced. The maximum latency was slightly increased as predicted by the comments due to reclaim/compaction breaking early. However, workloads care more about the latency of lower-order allocations than THP so it's an acceptable trade-off. Signed-off-by: NMel Gorman <mel@csn.ul.ie> Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Acked-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Kent Overstreet <kent.overstreet@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Greg Thelen 提交于
The move_pages() usage of find_task_by_vpid() requires rcu_read_lock() to prevent free_pid() from reclaiming the pid. Without this patch, RCU warnings are printed in v2.6.38-rc4 move_pages() with: CONFIG_LOCKUP_DETECTOR=y CONFIG_PREEMPT=y CONFIG_LOCKDEP=y CONFIG_PROVE_LOCKING=y CONFIG_PROVE_RCU=y Previously, migrate_pages() went through a similar transformation replacing usage of tasklist_lock with rcu read lock: commit 55cfaa3c Author: Zeng Zhaoming <zengzm.kernel@gmail.com> Date: Thu Dec 2 14:31:13 2010 -0800 mm/mempolicy.c: add rcu read lock to protect pid structure commit 1e50df39 Author: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Date: Thu Jan 13 15:46:14 2011 -0800 mempolicy: remove tasklist_lock from migrate_pages Signed-off-by: NGreg Thelen <gthelen@google.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Zeng Zhaoming <zengzm.kernel@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 2月, 2011 1 次提交
-
-
由 Miklos Szeredi 提交于
Grab a reference to bdev before calling blkdev_get(), which expects the refcount to be already incremented and either returns success or decrements the refcount and returns an error. The bug was introduced by e525fd89 (block: make blkdev_get/put() handle exclusive access), which didn't take into account this behavior of blkdev_get(). Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 2月, 2011 5 次提交
-
-
由 Yinghai Lu 提交于
Now that bootmem.c and nobootmem.c are separate, there's no reason to define __alloc_memory_core_early(), which is used only by nobootmem, inside #ifdef in page_alloc.c. Move it to nobootmem.c and make it static. This patch doesn't introduce any behavior change. -tj: Updated commit description. Signed-off-by: NYinghai Lu <yinghai@kernel.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Yinghai Lu 提交于
Now that bootmem.c and nobootmem.c are separate, it's cleaner to define contig_page_data in each file than in page_alloc.c with #ifdef. Move it. This patch doesn't introduce any behavior change. -v2: According to Andrew, fixed the struct layout. -tj: Updated commit description. Signed-off-by: NYinghai Lu <yinghai@kernel.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Yinghai Lu 提交于
mm/bootmem.c contained code paths for both bootmem and no bootmem configurations. They implement about the same set of APIs in different ways and as a result bootmem.c contains massive amount of #ifdef CONFIG_NO_BOOTMEM. Separate out CONFIG_NO_BOOTMEM code into mm/nobootmem.c. As the common part is relatively small, duplicate them in nobootmem.c instead of creating a common file or ifdef'ing in bootmem.c. The followings are duplicated. * {min|max}_low_pfn, max_pfn, saved_max_pfn * free_bootmem_late() * ___alloc_bootmem() * __alloc_bootmem_low() The followings are applicable only to nobootmem and moved verbatim. * __free_pages_memory() * free_all_memory_core_early() The followings are not applicable to nobootmem and omitted in nobootmem.c. * reserve_bootmem_node() * reserve_bootmem() The rest split function bodies according to CONFIG_NO_BOOTMEM. Makefile is updated so that only either bootmem.c or nobootmem.c is built according to CONFIG_NO_BOOTMEM. This patch doesn't introduce any behavior change. -tj: Rewrote commit description. Suggested-by: NIngo Molnar <mingo@elte.hu> Signed-off-by: NYinghai Lu <yinghai@kernel.org> Acked-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NTejun Heo <tj@kernel.org>
-
由 Hugh Dickins 提交于
Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching a hole with madvise(,, MADV_REMOVE). That path is under mutex, and cannot be explained by lack of serialization in unmap_mapping_range(). Reviewing the code, I found one place where vm_truncate_count handling should have been updated, when I switched at the last minute from one way of managing the restart_addr to another: mremap move changes the virtual addresses, so it ought to adjust the restart_addr. But rather than exporting the notion of restart_addr from memory.c, or converting to restart_pgoff throughout, simply reset vm_truncate_count to 0 to force a rescan if mremap move races with preempted truncation. We have no confirmation that this fixes Robert's BUG, but it is a fix that's worth making anyway. Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Miklos Szeredi 提交于
Michael Leun reported that running parallel opens on a fuse filesystem can trigger a "kernel BUG at mm/truncate.c:475" Gurudas Pai reported the same bug on NFS. The reason is, unmap_mapping_range() is not prepared for more than one concurrent invocation per inode. For example: thread1: going through a big range, stops in the middle of a vma and stores the restart address in vm_truncate_count. thread2: comes in with a small (e.g. single page) unmap request on the same vma, somewhere before restart_address, finds that the vma was already unmapped up to the restart address and happily returns without doing anything. Another scenario would be two big unmap requests, both having to restart the unmapping and each one setting vm_truncate_count to its own value. This could go on forever without any of them being able to finish. Truncate and hole punching already serialize with i_mutex. Other callers of unmap_mapping_range() do not, and it's difficult to get i_mutex protection for all callers. In particular ->d_revalidate(), which calls invalidate_inode_pages2_range() in fuse, may be called with or without i_mutex. This patch adds a new mutex to 'struct address_space' to prevent running multiple concurrent unmap_mapping_range() on the same mapping. [ We'll hopefully get rid of all this with the upcoming mm preemptibility series by Peter Zijlstra, the "mm: Remove i_mmap_mutex lockbreak" patch in particular. But that is for 2.6.39 ] Signed-off-by: NMiklos Szeredi <mszeredi@suse.cz> Reported-by: NMichael Leun <lkml20101129@newton.leun.net> Reported-by: NGurudas Pai <gurudas.pai@oracle.com> Tested-by: NGurudas Pai <gurudas.pai@oracle.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 2月, 2011 1 次提交
-
-
由 Eric Dumazet 提交于
Recent use of ksize() in network stack (commit ca44ac38 : net: don't reallocate skb->head unless the current one hasn't the needed extra size or is shared) triggers kmemcheck warnings, because ksize() can return more space than kmemcheck is aware of. Pekka Enberg noticed SLAB+kmemcheck is doing the right thing, while SLUB +kmemcheck doesnt. Bugzilla reference #27212 Reported-by: NChristian Casteyde <casteyde.christian@free.fr> Suggested-by: NPekka Enberg <penberg@kernel.org> Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com> Acked-by: NDavid S. Miller <davem@davemloft.net> Acked-by: NDavid Rientjes <rientjes@google.com> Acked-by: NChristoph Lameter <cl@linux.com> CC: Changli Gao <xiaosuo@gmail.com> CC: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: NPekka Enberg <penberg@kernel.org>
-
- 17 2月, 2011 1 次提交
-
-
由 Ryota Ozaki 提交于
do_file_page and do_no_page don't exist anymore, but some comments still refers them. The patch fixes them by replacing them with existing ones. Signed-off-by: NRyota Ozaki <ozaki.ryota@gmail.com> Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Signed-off-by: NJiri Kosina <jkosina@suse.cz>
-
- 16 2月, 2011 1 次提交
-
-
由 Andrea Arcangeli 提交于
Transparent hugepages can only be created if rmap is fully functional. So we must prevent hugepages to be created while is_vma_temporary_stack() is true. This also optmizes away some harmless but unnecessary setting of khugepaged_scan.address and it switches some BUG_ON to VM_BUG_ON. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NRik van Riel <riel@redhat.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-