- 24 2月, 2013 10 次提交
-
-
由 Hugh Dickins 提交于
Complaints are rare, but lockdep still does not understand the way ksm_memory_callback(MEM_GOING_OFFLINE) takes ksm_thread_mutex, and holds it until the ksm_memory_callback(MEM_OFFLINE): that appears to be a problem because notifier callbacks are made under down_read of blocking_notifier_head->rwsem (so first the mutex is taken while holding the rwsem, then later the rwsem is taken while still holding the mutex); but is not in fact a problem because mem_hotplug_mutex is held throughout the dance. There was an attempt to fix this with mutex_lock_nested(); but if that happened to fool lockdep two years ago, apparently it does so no longer. I had hoped to eradicate this issue in extending KSM page migration not to need the ksm_thread_mutex. But then realized that although the page migration itself is safe, we do still need to lock out ksmd and other users of get_ksm_page() while offlining memory - at some point between MEM_GOING_OFFLINE and MEM_OFFLINE, the struct pages themselves may vanish, and get_ksm_page()'s accesses to them become a violation. So, give up on holding ksm_thread_mutex itself from MEM_GOING_OFFLINE to MEM_OFFLINE, and add a KSM_RUN_OFFLINE flag, and wait_while_offlining() checks, to achieve the same lockout without being caught by lockdep. This is less elegant for KSM, but it's more important to keep lockdep useful to other users - and I apologize for how long it took to fix. Signed-off-by: NHugh Dickins <hughd@google.com> Reported-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Tested-by: NGerald Schaefer <gerald.schaefer@de.ibm.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
The new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? ksm_migrate_page() can do no more than it's already doing, updating stable_node->kpfn: the stable tree itself cannot be manipulated without holding ksm_thread_mutex. So accept that a stable tree may temporarily indicate a page belonging to the wrong NUMA node, leave updating until the next pass of ksmd, just be careful not to merge other pages on to a misplaced page. Note nid of holding tree in stable_node, and recognize that it will not always match nid of kpfn. A misplaced KSM page is discovered, either when ksm_do_scan() next comes around to one of its rmap_items (we now have to go to cmp_and_merge_page even on pages in a stable tree), or when stable_tree_search() arrives at a matching node for another page, and this node page is found misplaced. In each case, move the misplaced stable_node to a list of migrate_nodes (and use the address of migrate_nodes as magic by which to identify them): we don't need them in a tree. If stable_tree_search() finds no match for a page, but it's currently exiled to this list, then slot its stable_node right there into the tree, bringing all of its mappings with it; otherwise they get migrated one by one to the original page of the colliding node. stable_tree_search() is now modelled more like stable_tree_insert(), in order to handle these insertions of migrated nodes. remove_node_from_stable_tree(), remove_all_stable_nodes() and ksm_check_stable_tree() have to handle the migrate_nodes list as well as the stable tree itself. Less obviously, we do need to prune the list of stale entries from time to time (scan_get_next_rmap_item() does it once each full scan): whereas stale nodes in the stable tree get naturally pruned as searches try to brush past them, these migrate_nodes may get forgotten and accumulate. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
KSM page migration is already supported in the case of memory hotremove, which takes the ksm_thread_mutex across all its migrations to keep life simple. But the new KSM NUMA merge_across_nodes knob introduces a problem, when it's set to non-default 0: if a KSM page is migrated to a different NUMA node, how do we migrate its stable node to the right tree? And what if that collides with an existing stable node? So far there's no provision for that, and this patch does not attempt to deal with it either. But how will I test a solution, when I don't know how to hotremove memory? The best answer is to enable KSM page migration in all cases now, and test more common cases. With THP and compaction added since KSM came in, page migration is now mainstream, and it's a shame that a KSM page can frustrate freeing a page block. Without worrying about merge_across_nodes 0 for now, this patch gets KSM page migration working reliably for default merge_across_nodes 1 (but leave the patch enabling it until near the end of the series). It's much simpler than I'd originally imagined, and does not require an additional tier of locking: page migration relies on the page lock, KSM page reclaim relies on the page lock, the page lock is enough for KSM page migration too. Almost all the care has to be in get_ksm_page(): that's the function which worries about when a stable node is stale and should be freed, now it also has to worry about the KSM page being migrated. The only new overhead is an additional put/get/lock/unlock_page when stable_tree_search() arrives at a matching node: to make sure migration respects the raised page count, and so does not migrate the page while we're busy with it here. That's probably avoidable, either by changing internal interfaces from using kpage to stable_node, or by moving the ksm_migrate_page() callsite into a page_freeze_refs() section (even if not swapcache); but this works well, I've no urge to pull it apart now. (Descents of the stable tree may pass through nodes whose KSM pages are under migration: being unlocked, the raised page count does not prevent that, nor need it: it's safe to memcmp against either old or new page.) You might worry about mremap, and whether page migration's rmap_walk to remove migration entries will find all the KSM locations where it inserted earlier: that should already be handled, by the satisfyingly heavy hammer of move_vma()'s call to ksm_madvise(,,,MADV_UNMERGEABLE,). Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree. It's not something that people will often want to do, but it would be lame to demand a reboot when they're trying to determine which merge_across_nodes setting is best. How can this happen? We only permit switching merge_across_nodes when pages_shared is 0, and usually set run 2 to force that beforehand, which ought to unmerge everything: yet oopses still occur when you then run 1. Three causes: 1. The old stable tree (built according to the inverse merge_across_nodes) has not been fully torn down. A stable node lingers until get_ksm_page() notices that the page it references no longer references it: but the page is not necessarily freed as soon as expected, particularly when swapcache. Fix this with a pass through the old stable tree, applying get_ksm_page() to each of the remaining nodes (most found stale and removed immediately), with forced removal of any left over. Unless the page is still mapped: I've not seen that case, it shouldn't occur, but better to WARN_ON_ONCE and EBUSY than BUG. 2. __ksm_enter() has a nice little optimization, to insert the new mm just behind ksmd's cursor, so there's a full pass for it to stabilize (or be removed) before ksmd addresses it. Nice when ksmd is running, but not so nice when we're trying to unmerge all mms: we were missing those mms forked and inserted behind the unmerge cursor. Easily fixed by inserting at the end when KSM_RUN_UNMERGE. 3. It is possible for a KSM page to be faulted back from swapcache into an mm, just after unmerge_and_remove_all_rmap_items() scanned past it. Fix this by copying on fault when KSM_RUN_UNMERGE: but that is private to ksm.c, so dissolve the distinction between ksm_might_need_to_copy() and ksm_does_need_to_copy(), doing it all in the one call into ksm.c. A long outstanding, unrelated bugfix sneaks in with that third fix: ksm_does_need_to_copy() would copy from a !PageUptodate page (implying I/O error when read in from swap) to a page which it then marks Uptodate. Fix this case by not copying, letting do_swap_page() discover the error. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
In some places where get_ksm_page() is used, we need the page to be locked. When KSM migration is fully enabled, we shall want that to make sure that the page just acquired cannot be migrated beneath us (raised page count is only effective when there is serialization to make sure migration notices). Whereas when navigating through the stable tree, we certainly do not want to lock each node (raised page count is enough to guarantee the memcmps, even if page is migrated to another node). Since we're about to add another use case, add the locked argument to get_ksm_page() now. Hmm, what's that rcu_read_lock() about? Complete misunderstanding, I really got the wrong end of the stick on that! There's a configuration in which page_cache_get_speculative() can do something cheaper than get_page_unless_zero(), relying on its caller's rcu_read_lock() to have disabled preemption for it. There's no need for rcu_read_lock() around get_page_unless_zero() (and mapping checks) here. Cut out that silliness before making this any harder to understand. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Memory hotremove's ksm_check_stable_tree() is pitifully inefficient (restarting whenever it finds a stale node to remove), but rearrange so that at least it does not needlessly restart from nid 0 each time. And add a couple of comments: here is why we keep pfn instead of page. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Add NUMA() and DO_NUMA() macros to minimize blight of #ifdef CONFIG_NUMAs (but indeed we don't want to expand struct rmap_item by nid when not NUMA). Add comment, remove "unsigned" from rmap_item->nid, as "int nid" elsewhere. Define ksm_merge_across_nodes 1U when #ifndef NUMA to help optimizing out. Use ?: in get_kpfn_nid(). Adjust a few comments noticed in ongoing work. Leave stable_tree_insert()'s rb_linkage until after the node has been set up, as unstable_tree_search_insert() does: ksm_thread_mutex and page lock make either way safe, but we're going to copy and I prefer this precedent. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Petr Holasek 提交于
Here's a KSM series, based on mmotm 2013-01-23-17-04: starting with Petr's v7 "KSM: numa awareness sysfs knob"; then fixing the two issues we had with that, fully enabling KSM page migration on the way. (A different kind of KSM/NUMA issue which I've certainly not begun to address here: when KSM pages are unmerged, there's usually no sense in preferring to allocate the new pages local to the caller's node.) This patch: Introduces new sysfs boolean knob /sys/kernel/mm/ksm/merge_across_nodes which control merging pages across different numa nodes. When it is set to zero only pages from the same node are merged, otherwise pages from all nodes can be merged together (default behavior). Typical use-case could be a lot of KVM guests on NUMA machine and cpus from more distant nodes would have significant increase of access latency to the merged ksm page. Sysfs knob was choosen for higher variability when some users still prefers higher amount of saved physical memory regardless of access latency. Every numa node has its own stable & unstable trees because of faster searching and inserting. Changing of merge_across_nodes value is possible only when there are not any ksm shared pages in system. I've tested this patch on numa machines with 2, 4 and 8 nodes and measured speed of memory access inside of KVM guests with memory pinned to one of nodes with this benchmark: http://pholasek.fedorapeople.org/alloc_pg.c Population standard deviations of access times in percentage of average were following: merge_across_nodes=1 2 nodes 1.4% 4 nodes 1.6% 8 nodes 1.7% merge_across_nodes=0 2 nodes 1% 4 nodes 0.32% 8 nodes 0.018% RFC: https://lkml.org/lkml/2011/11/30/91 v1: https://lkml.org/lkml/2012/1/23/46 v2: https://lkml.org/lkml/2012/6/29/105 v3: https://lkml.org/lkml/2012/9/14/550 v4: https://lkml.org/lkml/2012/9/23/137 v5: https://lkml.org/lkml/2012/12/10/540 v6: https://lkml.org/lkml/2012/12/23/154 v7: https://lkml.org/lkml/2012/12/27/225 Hugh notes that this patch brings two problems, whose solution needs further support in mm/ksm.c, which follows in subsequent patches: 1) switching merge_across_nodes after running KSM is liable to oops on stale nodes still left over from the previous stable tree; 2) memory hotremove may migrate KSM pages, but there is no provision here for !merge_across_nodes to migrate nodes to the proper tree. Signed-off-by: NPetr Holasek <pholasek@redhat.com> Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Sasha Levin 提交于
Switch ksm to use the new hashtable implementation. This reduces the amount of generic unrelated code in the ksm module. Signed-off-by: NSasha Levin <levinsasha928@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Mel Gorman <mgorman@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Johannes Weiner 提交于
When ex-KSM pages are faulted from swap cache, the fault handler is not capable of re-establishing anon_vma-spanning KSM pages. In this case, a copy of the page is created instead, just like during a COW break. These freshly made copies are known to be exclusive to the faulting VMA and there is no reason to go look for this page in parent and sibling processes during rmap operations. Use page_add_new_anon_rmap() for these copies. This also puts them on the proper LRU lists and marks them SwapBacked, so we can get rid of doing this ad-hoc in the KSM copy code. Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Simon Jeons <simon.jeons@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Michal Hocko <mhocko@suse.cz> Cc: Satoru Moriya <satoru.moriya@hds.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 12月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
The rmap walks in ksm.c are like those in rmap.c: they can safely be done with anon_vma_lock_read(). Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 12月, 2012 3 次提交
-
-
由 David Rientjes 提交于
test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to specify that current should be killed first if an oom condition occurs in between the two calls. The usage is short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX); ... compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj); to store the thread's oom_score_adj, temporarily change it to the maximum score possible, and then restore the old value if it is still the same. This happens to still be racy, however, if the user writes OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls. The compare_swap_oom_score_adj() will then incorrectly reset the old value prior to the write of OOM_SCORE_ADJ_MAX. To fix this, introduce a new oom_flags_t member in struct signal_struct that will be used for per-thread oom killer flags. KSM and swapoff can now use a bit in this member to specify that threads should be killed first in oom conditions without playing around with oom_score_adj. This also allows the correct oom_score_adj to always be shown when reading /proc/pid/oom_score. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 David Rientjes 提交于
The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000, so this range can be represented by the signed short type with no functional change. The extra space this frees up in struct signal_struct will be used for per-thread oom kill flags in the next patch. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NMichal Hocko <mhocko@suse.cz> Cc: Anton Vorontsov <anton.vorontsov@linaro.org> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Bob Liu 提交于
Several place need to find the pmd by(mm_struct, address), so introduce a function to simplify it. [akpm@linux-foundation.org: fix warning] Signed-off-by: NBob Liu <lliubbo@gmail.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: Ni zhan Chen <nizhan.chen@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 1 次提交
-
-
由 Ingo Molnar 提交于
rmap_walk_anon() and try_to_unmap_anon() appears to be too careful about locking the anon vma: while it needs protection against anon vma list modifications, it does not need exclusive access to the list itself. Transforming this exclusive lock to a read-locked rwsem removes a global lock from the hot path of page-migration intense threaded workloads which can cause pathological performance like this: 96.43% process 0 [kernel.kallsyms] [k] perf_trace_sched_switch | --- perf_trace_sched_switch __schedule schedule schedule_preempt_disabled __mutex_lock_common.isra.6 __mutex_lock_slowpath mutex_lock | |--50.61%-- rmap_walk | move_to_new_page | migrate_pages | migrate_misplaced_page | __do_numa_page.isra.69 | handle_pte_fault | handle_mm_fault | __do_page_fault | do_page_fault | page_fault | __memset_sse2 | | | --100.00%-- worker_thread | | | --100.00%-- start_thread | --49.39%-- page_lock_anon_vma try_to_unmap_anon try_to_unmap migrate_pages migrate_misplaced_page __do_numa_page.isra.69 handle_pte_fault handle_mm_fault __do_page_fault do_page_fault page_fault __memset_sse2 | --100.00%-- worker_thread start_thread With this change applied the profile is now nicely flat and there's no anon-vma related scheduling/blocking. Rename anon_vma_[un]lock() => anon_vma_[un]lock_write(), to make it clearer that it's an exclusive write-lock in that case - suggested by Rik van Riel. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Turner <pjt@google.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux.com> Cc: Rik van Riel <riel@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
- 09 10月, 2012 6 次提交
-
-
由 Haggai Eran 提交于
In order to allow sleeping during invalidate_page mmu notifier calls, we need to avoid calling when holding the PT lock. In addition to its direct calls, invalidate_page can also be called as a substitute for a change_pte call, in case the notifier client hasn't implemented change_pte. This patch drops the invalidate_page call from change_pte, and instead wraps all calls to change_pte with invalidate_range_start and invalidate_range_end calls. Note that change_pte still cannot sleep after this patch, and that clients implementing change_pte should not take action on it in case the number of outstanding invalidate_range_start calls is larger than one, otherwise they might miss a later invalidation. Signed-off-by: NHaggai Eran <haggaie@mellanox.com> Cc: Andrea Arcangeli <andrea@qumranet.com> Cc: Sagi Grimberg <sagig@mellanox.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Cc: Or Gerlitz <ogerlitz@mellanox.com> Cc: Haggai Eran <haggaie@mellanox.com> Cc: Shachar Raindel <raindel@mellanox.com> Cc: Liran Liss <liranl@mellanox.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Avi Kivity <avi@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
page_evictable(page, vma) is an irritant: almost all its callers pass NULL for vma. Remove the vma arg and use mlocked_vma_newpage(vma, page) explicitly in the couple of places it's needed. But in those places we don't even need page_evictable() itself! They're dealing with a freshly allocated anonymous page, which has no "mapping" and cannot be mlocked yet. Signed-off-by: NHugh Dickins <hughd@google.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Cc: Rik van Riel <riel@redhat.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Michel Lespinasse <walken@google.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Michel Lespinasse 提交于
When a large VMA (anon or private file mapping) is first touched, which will populate its anon_vma field, and then split into many regions through the use of mprotect(), the original anon_vma ends up linking all of the vmas on a linked list. This can cause rmap to become inefficient, as we have to walk potentially thousands of irrelevent vmas before finding the one a given anon page might fall into. By replacing the same_anon_vma linked list with an interval tree (where each avc's interval is determined by its vma's start and last pgoffs), we can make rmap efficient for this use case again. While the change is large, all of its pieces are fairly simple. Most places that were walking the same_anon_vma list were looking for a known pgoff, so they can just use the anon_vma_interval_tree_foreach() interval tree iterator instead. The exception here is ksm, where the page's index is not known. It would probably be possible to rework ksm so that the index would be known, but for now I have decided to keep things simple and just walk the entirety of the interval tree there. When updating vma's that already have an anon_vma assigned, we must take care to re-index the corresponding avc's on their interval tree. This is done through the use of anon_vma_interval_tree_pre_update_vma() and anon_vma_interval_tree_post_update_vma(), which remove the avc's from their interval tree before the update and re-insert them after the update. The anon_vma stays locked during the update, so there is no chance that rmap would miss the vmas that are being updated. Signed-off-by: NMichel Lespinasse <walken@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Daniel Santos <daniel.santos@pobox.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA, currently it lost original meaning but still has some effects: | effect | alternative flags -+------------------------+--------------------------------------------- 1| account as reserved_vm | VM_IO 2| skip in core dump | VM_IO, VM_DONTDUMP 3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP 4| do not mlock | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP This patch removes reserved_vm counter from mm_struct. Seems like nobody cares about it, it does not exported into userspace directly, it only reduces total_vm showed in proc. Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP. remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP. remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP. [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup] Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Merge VM_INSERTPAGE into VM_MIXEDMAP. VM_MIXEDMAP VMA can mix pure-pfn ptes, special ptes and normal ptes. Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like VM_PFNMAP. If driver populates whole VMA at mmap() it probably not expects page-faults. This patch removes special check from vma_wants_writenotify() which disables pages write tracking for VMA populated via vm_instert_page(). BDI below mapped file should not use dirty-accounting, moreover do_wp_page() can handle this. vm_insert_page() still marks vma after first usage. Usually it is called from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to change vma->vm_flags. Caller must set VM_MIXEDMAP at mmap time if it wants to call this function from other places, for example from page-fault handler. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Konstantin Khlebnikov 提交于
Combine several arch-specific vma flags into one. before patch: 0x00000200 0x01000000 0x20000000 0x40000000 x86 VM_NOHUGEPAGE VM_HUGEPAGE - VM_PAT powerpc - - VM_SAO - parisc VM_GROWSUP - - - ia64 VM_GROWSUP - - - nommu - VM_MAPPED_COPY - - others - - - - after patch: 0x00000200 0x01000000 0x20000000 0x40000000 x86 - VM_PAT VM_HUGEPAGE VM_NOHUGEPAGE powerpc - VM_SAO - - parisc - VM_GROWSUP - - ia64 - VM_GROWSUP - - nommu - VM_MAPPED_COPY - - others - VM_ARCH_1 - - And voila! One completely free bit. Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Carsten Otte <cotte@de.ibm.com> Cc: Chris Metcalf <cmetcalf@tilera.com> Cc: Cyrill Gorcunov <gorcunov@openvz.org> Cc: Eric Paris <eparis@redhat.com> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: James Morris <james.l.morris@oracle.com> Cc: Jason Baron <jbaron@redhat.com> Cc: Kentaro Takeda <takedakn@nttdata.co.jp> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Robert Richter <robert.richter@amd.com> Cc: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Venkatesh Pallipadi <venki@google.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 3月, 2012 1 次提交
-
-
由 Bob Liu 提交于
There are multiple places which perform the same check. Add a new find_mergeable_vma() to handle this. Signed-off-by: NBob Liu <lliubbo@gmail.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 20 3月, 2012 1 次提交
-
-
由 Cong Wang 提交于
Signed-off-by: NCong Wang <amwang@redhat.com>
-
- 06 3月, 2012 1 次提交
-
-
由 Hugh Dickins 提交于
When moving tasks from old memcg (with move_charge_at_immigrate on new memcg), followed by removal of old memcg, hit General Protection Fault in mem_cgroup_lru_del_list() (called from release_pages called from free_pages_and_swap_cache from tlb_flush_mmu from tlb_finish_mmu from exit_mmap from mmput from exit_mm from do_exit). Somewhat reproducible, takes a few hours: the old struct mem_cgroup has been freed and poisoned by SLAB_DEBUG, but mem_cgroup_lru_del_list() is still trying to update its stats, and take page off lru before freeing. A task, or a charge, or a page on lru: each secures a memcg against removal. In this case, the last task has been moved out of the old memcg, and it is exiting: anonymous pages are uncharged one by one from the memcg, as they are zapped from its pagetables, so the charge gets down to 0; but the pages themselves are queued in an mmu_gather for freeing. Most of those pages will be on lru (and force_empty is careful to lru_add_drain_all, to add pages from pagevec to lru first), but not necessarily all: perhaps some have been isolated for page reclaim, perhaps some isolated for other reasons. So, force_empty may find no task, no charge and no page on lru, and let the removal proceed. There would still be no problem if these pages were immediately freed; but typically (and the put_page_testzero protocol demands it) they have to be added back to lru before they are found freeable, then removed from lru and freed. We don't see the issue when adding, because the mem_cgroup_iter() loops keep their own reference to the memcg being scanned; but when it comes to mem_cgroup_lru_del_list(). I believe this was not an issue in v3.2: there, PageCgroupAcctLRU and PageCgroupUsed flags were used (like a trick with mirrors) to deflect view of pc->mem_cgroup to the stable root_mem_cgroup when neither set. 38c5d72f ("memcg: simplify LRU handling by new rule") mercifully removed those convolutions, but left this General Protection Fault. But it's surprisingly easy to restore the old behaviour: just check PageCgroupUsed in mem_cgroup_lru_add_list() (which decides on which lruvec to add), and reset pc to root_mem_cgroup if page is uncharged. A risky change? just going back to how it worked before; testing, and an audit of uses of pc->mem_cgroup, show no problem. And there's a nice bonus: with mem_cgroup_lru_add_list() itself making sure that an uncharged page goes to root lru, mem_cgroup_reset_owner() no longer has any purpose, and we can safely revert 4e5f01c2 ("memcg: clear pc->mem_cgroup if necessary"). Calling update_page_reclaim_stat() after add_page_to_lru_list() in swap.c is not strictly necessary: the lru_lock there, with RCU before memcg structures are freed, makes mem_cgroup_get_reclaim_stat_from_page safe without that; but it seems cleaner to rely on one dependency less. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 1月, 2012 1 次提交
-
-
由 KAMEZAWA Hiroyuki 提交于
This is a preparation before removing a flag PCG_ACCT_LRU in page_cgroup and reducing atomic ops/complexity in memcg LRU handling. In some cases, pages are added to lru before charge to memcg and pages are not classfied to memory cgroup at lru addtion. Now, the lru where the page should be added is determined a bit in page_cgroup->flags and pc->mem_cgroup. I'd like to remove the check of flag. To handle the case pc->mem_cgroup may contain stale pointers if pages are added to LRU before classification. This patch resets pc->mem_cgroup to root_mem_cgroup before lru additions. [akpm@linux-foundation.org: fix CONFIG_CGROUP_MEM_CONT=n build] [hughd@google.com: fix CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_CGROUP_MEM_RES_CTLR_SWAP=n build] [akpm@linux-foundation.org: ksm.c needs memcontrol.h, per Michal] [hughd@google.com: stop oops in mem_cgroup_reset_owner()] [hughd@google.com: fix page migration to reset_owner] Signed-off-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Miklos Szeredi <mszeredi@suse.cz> Acked-by: NMichal Hocko <mhocko@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Ying Han <yinghan@google.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 11月, 2011 1 次提交
-
-
由 David Rientjes 提交于
test_set_oom_score_adj() was introduced in 72788c38 ("oom: replace PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate current's oom_score_adj for ksm and swapoff without requiring an additional per-process flag. Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and then reinstate the previous value is racy since it's possible that userspace can set the value to something else itself before the old value is reinstated. That results in userspace setting current's oom_score_adj to a different value and then the kernel immediately setting it back to its previous value without notification. To fix this, a new compare_swap_oom_score_adj() function is introduced with the same semantics as the compare and swap CAS instruction, or CMPXCHG on x86. It is used to reinstate the previous value of oom_score_adj if and only if the present value is the same as the old value. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Ying Han <yinghan@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 6月, 2011 1 次提交
-
-
由 Hugh Dickins 提交于
Andrea Righi reported a case where an exiting task can race against ksmd::scan_get_next_rmap_item (http://lkml.org/lkml/2011/6/1/742) easily triggering a NULL pointer dereference in ksmd. ksm_scan.mm_slot == &ksm_mm_head with only one registered mm CPU 1 (__ksm_exit) CPU 2 (scan_get_next_rmap_item) list_empty() is false lock slot == &ksm_mm_head list_del(slot->mm_list) (list now empty) unlock lock slot = list_entry(slot->mm_list.next) (list is empty, so slot is still ksm_mm_head) unlock slot->mm == NULL ... Oops Close this race by revalidating that the new slot is not simply the list head again. Andrea's test case: #include <stdio.h> #include <stdlib.h> #include <unistd.h> #include <sys/mman.h> #define BUFSIZE getpagesize() int main(int argc, char **argv) { void *ptr; if (posix_memalign(&ptr, getpagesize(), BUFSIZE) < 0) { perror("posix_memalign"); exit(1); } if (madvise(ptr, BUFSIZE, MADV_MERGEABLE) < 0) { perror("madvise"); exit(1); } *(char *)NULL = 0; return 0; } Reported-by: NAndrea Righi <andrea@betterlinux.com> Tested-by: NAndrea Righi <andrea@betterlinux.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NChris Wright <chrisw@sous-sol.org> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 25 5月, 2011 1 次提交
-
-
由 David Rientjes 提交于
There's a kernel-wide shortage of per-process flags, so it's always helpful to trim one when possible without incurring a significant penalty. It's even more important when you're planning on adding a per- process flag yourself, which I plan to do shortly for transparent hugepages. PF_OOM_ORIGIN is used by ksm and swapoff to prefer current since it has a tendency to allocate large amounts of memory and should be preferred for killing over other tasks. We'd rather immediately kill the task making the errant syscall rather than penalizing an innocent task. This patch removes PF_OOM_ORIGIN since its behavior is equivalent to setting the process's oom_score_adj to OOM_SCORE_ADJ_MAX. The process's old oom_score_adj is stored and then set to OOM_SCORE_ADJ_MAX during the time it used to have PF_OOM_ORIGIN. The old value is then reinstated when the process should no longer be considered a high priority for oom killing. Signed-off-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NMinchan Kim <minchan.kim@gmail.com> Cc: Hugh Dickins <hughd@google.com> Cc: Izik Eidus <ieidus@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 3月, 2011 1 次提交
-
-
由 Lucas De Marchi 提交于
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: NLucas De Marchi <lucas.demarchi@profusion.mobi>
-
- 23 3月, 2011 1 次提交
-
-
由 Peter Zijlstra 提交于
The normal code pattern used in the kernel is: get/put. Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 14 1月, 2011 6 次提交
-
-
由 Hugh Dickins 提交于
It was hard to explain the page counts which were causing new LTP tests of KSM to fail: we need to drain the per-cpu pagevecs to LRU occasionally. Signed-off-by: NHugh Dickins <hughd@google.com> Reported-by: NCAI Qian <caiqian@redhat.com> Cc:Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Cleanup some code with common compound_trans_head helper. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Johannes Weiner <jweiner@redhat.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Avi Kivity <avi@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
This makes KSM full operational with THP pages. Subpages are scanned while the hugepage is still in place and delivering max cpu performance, and only if there's a match and we're going to deduplicate memory, the single hugepages with the subpage match is split. There will be no false sharing between ksmd and khugepaged. khugepaged won't collapse 2m virtual regions with KSM pages inside. ksmd also should only split pages when the checksum matches and we're likely to split an hugepage for some long living ksm page (usual ksm heuristic to avoid sharing pages that get de-cowed). Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
It's unclear why schedule friendly kernel threads can't be taken away by the CPU through the scheduler itself. It's safer to stop them as they can trigger memory allocation, if kswapd also freezes itself to avoid generating I/O they have too. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Andrea Arcangeli 提交于
Skip transhuge pages in ksm for now. Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Reviewed-by: NRik van Riel <riel@redhat.com> Acked-by: NMel Gorman <mel@csn.ul.ie> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
When a swapcache page is replaced by a ksm page, it's best to free that swap immediately. Reported-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 12月, 2010 1 次提交
-
-
由 KOSAKI Motohiro 提交于
commit 62b61f61 ("ksm: memory hotremove migration only") caused the following new lockdep warning. ======================================================= [ INFO: possible circular locking dependency detected ] ------------------------------------------------------- bash/1621 is trying to acquire lock: ((memory_chain).rwsem){.+.+.+}, at: [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0 but task is already holding lock: (ksm_thread_mutex){+.+.+.}, at: [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (ksm_thread_mutex){+.+.+.}: [<ffffffff8108b70a>] lock_acquire+0xaa/0x140 [<ffffffff81505d74>] __mutex_lock_common+0x44/0x3f0 [<ffffffff81506228>] mutex_lock_nested+0x48/0x60 [<ffffffff8113a3aa>] ksm_memory_callback+0x3a/0xc0 [<ffffffff8150c21c>] notifier_call_chain+0x8c/0xe0 [<ffffffff8107934e>] __blocking_notifier_call_chain+0x7e/0xc0 [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20 [<ffffffff813afbfb>] memory_notify+0x1b/0x20 [<ffffffff81141b7c>] remove_memory+0x1cc/0x5f0 [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0 [<ffffffff813afd62>] store_mem_state+0xe2/0xf0 [<ffffffff813a0bb0>] sysdev_store+0x20/0x30 [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170 [<ffffffff8114f398>] vfs_write+0xc8/0x190 [<ffffffff8114fc14>] sys_write+0x54/0x90 [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b -> #0 ((memory_chain).rwsem){.+.+.+}: [<ffffffff8108b5ba>] __lock_acquire+0x155a/0x1600 [<ffffffff8108b70a>] lock_acquire+0xaa/0x140 [<ffffffff81506601>] down_read+0x51/0xa0 [<ffffffff81079339>] __blocking_notifier_call_chain+0x69/0xc0 [<ffffffff810793a6>] blocking_notifier_call_chain+0x16/0x20 [<ffffffff813afbfb>] memory_notify+0x1b/0x20 [<ffffffff81141f1e>] remove_memory+0x56e/0x5f0 [<ffffffff813af53d>] memory_block_change_state+0xfd/0x1a0 [<ffffffff813afd62>] store_mem_state+0xe2/0xf0 [<ffffffff813a0bb0>] sysdev_store+0x20/0x30 [<ffffffff811bc116>] sysfs_write_file+0xe6/0x170 [<ffffffff8114f398>] vfs_write+0xc8/0x190 [<ffffffff8114fc14>] sys_write+0x54/0x90 [<ffffffff810028b2>] system_call_fastpath+0x16/0x1b But it's a false positive. Both memory_chain.rwsem and ksm_thread_mutex have an outer lock (mem_hotplug_mutex). So they cannot deadlock. Thus, This patch annotate ksm_thread_mutex is not deadlock source. [akpm@linux-foundation.org: update comment, from Hugh] Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: NHugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 10月, 2010 1 次提交
-
-
由 Hugh Dickins 提交于
Building under memory pressure, with KSM on 2.6.36-rc5, collapsed with an internal compiler error: typically indicating an error in swapping. Perhaps there's a timing issue which makes it now more likely, perhaps it's just a long time since I tried for so long: this bug goes back to KSM swapping in 2.6.33. Notice how reuse_swap_page() allows an exclusive page to be reused, but only does SetPageDirty if it can delete it from swap cache right then - if it's currently under Writeback, it has to be left in cache and we don't SetPageDirty, but the page can be reused. Fine, the dirty bit will get set in the pte; but notice how zap_pte_range() does not bother to transfer pte_dirty to page_dirty when unmapping a PageAnon. If KSM chooses to share such a page, it will look like a clean copy of swapcache, and not be written out to swap when its memory is needed; then stale data read back from swap when it's needed again. We could fix this in reuse_swap_page() (or even refuse to reuse a page under writeback), but it's more honest to fix my oversight in KSM's write_protect_page(). Several days of testing on three machines confirms that this fixes the issue they showed. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: stable@kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 9月, 2010 1 次提交
-
-
由 Andrea Arcangeli 提交于
The pte_same check is reliable only if the swap entry remains pinned (by the page lock on swapcache). We've also to ensure the swapcache isn't removed before we take the lock as try_to_free_swap won't care about the page pin. One of the possible impacts of this patch is that a KSM-shared page can point to the anon_vma of another process, which could exit before the page is freed. This can leave a page with a pointer to a recycled anon_vma object, or worse, a pointer to something that is no longer an anon_vma. [riel@redhat.com: changelog help] Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com> Acked-by: NHugh Dickins <hughd@google.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: <stable@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 10 8月, 2010 1 次提交
-
-
由 Lai Jiangshan 提交于
Use compile-allocated memory instead of dynamic allocated memory for mm_slots_hash. Use hash_ptr() instead divisions for bucket calculation. Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NIzik Eidus <ieidus@redhat.com> Cc: Avi Kivity <avi@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-