- 09 10月, 2013 2 次提交
-
-
由 Rik van Riel 提交于
The load balancer can move tasks between nodes and does not take NUMA locality into account. With automatic NUMA balancing this may result in the tasks working set being migrated to the new node. However, as the fault buffer will still store faults from the old node the schduler may decide to reset the preferred node and migrate the task back resulting in more migrations. The ideal would be that the scheduler did not migrate tasks with a heavy memory footprint but this may result nodes being overloaded. We could also discard the fault information on task migration but this would still cause all the tasks working set to be migrated. This patch simply avoids migrating the memory for a short time after a task is migrated. Signed-off-by: NRik van Riel <riel@redhat.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-31-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
由 Mel Gorman 提交于
Ideally it would be possible to distinguish between NUMA hinting faults that are private to a task and those that are shared. If treated identically there is a risk that shared pages bounce between nodes depending on the order they are referenced by tasks. Ultimately what is desirable is that task private pages remain local to the task while shared pages are interleaved between sharing tasks running on different nodes to give good average performance. This is further complicated by THP as even applications that partition their data may not be partitioning on a huge page boundary. To start with, this patch assumes that multi-threaded or multi-process applications partition their data and that in general the private accesses are more important for cpu->memory locality in the general case. Also, no new infrastructure is required to treat private pages properly but interleaving for shared pages requires additional infrastructure. To detect private accesses the pid of the last accessing task is required but the storage requirements are a high. This patch borrows heavily from Ingo Molnar's patch "numa, mm, sched: Implement last-CPU+PID hash tracking" to encode some bits from the last accessing task in the page flags as well as the node information. Collisions will occur but it is better than just depending on the node information. Node information is then used to determine if a page needs to migrate. The PID information is used to detect private/shared accesses. The preferred NUMA node is selected based on where the maximum number of approximately private faults were measured. Shared faults are not taken into consideration for a few reasons. First, if there are many tasks sharing the page then they'll all move towards the same node. The node will be compute overloaded and then scheduled away later only to bounce back again. Alternatively the shared tasks would just bounce around nodes because the fault information is effectively noise. Either way accounting for shared faults the same as private faults can result in lower performance overall. The second reason is based on a hypothetical workload that has a small number of very important, heavily accessed private pages but a large shared array. The shared array would dominate the number of faults and be selected as a preferred node even though it's the wrong decision. The third reason is that multiple threads in a process will race each other to fault the shared page making the fault information unreliable. Signed-off-by: NMel Gorman <mgorman@suse.de> [ Fix complication error when !NUMA_BALANCING. ] Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: NPeter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1381141781-10992-30-git-send-email-mgorman@suse.deSigned-off-by: NIngo Molnar <mingo@kernel.org>
-
- 12 9月, 2013 6 次提交
-
-
由 Naoya Horiguchi 提交于
new_vma_page() is called only by page migration called from do_mbind(), where pages to be migrated are queued into a pagelist by queue_pages_range(). queue_pages_range() confirms that a queued page belongs to some vma, so !vma case is not supposed to be happen. This patch adds BUG_ON() to catch this unexpected case. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Reported-by: NDan Carpenter <dan.carpenter@oracle.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
The function check_range() (and its family) is not well-named, because it does not only checking something, but moving pages from list to list to do page migration for them. So queue_pages_*range is more desirable name. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Extend do_mbind() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with mbind(2) after applying the enablement patch which comes later in this series. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: NHillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Extend check_range() to handle vma with VM_HUGETLB set. We will be able to migrate hugepage with migrate_pages(2) after applying the enablement patch which comes later in this series. Note that for larger hugepages (covered by pud entries, 1GB for x86_64 for example), we simply skip it now. Note that using pmd_huge/pud_huge assumes that hugepages are pointed to by pmd/pud. This is not true in some architectures implementing hugepage with other mechanisms like ia64, but it's OK because pmd_huge/pud_huge simply return 0 in such arch and page walker simply ignores such hugepages. Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: NAndi Kleen <ak@linux.intel.com> Reviewed-by: NWanpeng Li <liwanp@linux.vnet.ibm.com> Acked-by: NHillf Danton <dhillf@gmail.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Hugh Dickins <hughd@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Hocko <mhocko@suse.cz> Cc: Rik van Riel <riel@redhat.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Jianguo Wu 提交于
If node == NUMA_NO_NODE, pol is NULL, we should return NULL instead of do "if (!pol->mode)" check. [akpm@linux-foundation.org: reorganise code] Signed-off-by: NJianguo Wu <wujianguo@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Hanjun Guo <guohanjun@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Oleg Nesterov 提交于
Simple cleanup. Every user of vma_set_policy() does the same work, this looks a bit annoying imho. And the new trivial helper which does mpol_dup() + vma_set_policy() to simplify the callers. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 01 8月, 2013 1 次提交
-
-
由 Oleg Nesterov 提交于
vma_adjust() does vma_set_policy(vma, vma_policy(next)) and this is doubly wrong: 1. This leaks vma->vm_policy if it is not NULL and not equal to next->vm_policy. This can happen if vma_merge() expands "area", not prev (case 8). 2. This sets the wrong policy if vma_merge() joins prev and area, area is the vma the caller needs to update and it still has the old policy. Revert commit 1444f92c ("mm: merging memory blocks resets mempolicy") which introduced these problems. Change mbind_range() to recheck mpol_equal() after vma_merge() to fix the problem that commit tried to address. Signed-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Steven T Hampson <steven.t.hampson@intel.com> Cc: Mel Gorman <mgorman@suse.de> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 3月, 2013 2 次提交
-
-
由 KOSAKI Motohiro 提交于
Currently, n_new is wrongly initialized. start and end parameter are inverted. Let's fix it. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Jones <davej@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hillf Danton 提交于
n->end is accessed in sp_insert(). Thus it should be update before calling sp_insert(). This mistake may make kernel panic. Signed-off-by: NHillf Danton <dhillf@gmail.com> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Sasha Levin <sasha.levin@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Dave Jones <davej@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 24 2月, 2013 5 次提交
-
-
由 David Rientjes 提交于
Make a sweep through mm/ and convert code that uses -1 directly to using the more appropriate NUMA_NO_NODE. Signed-off-by: NDavid Rientjes <rientjes@google.com> Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
No functional change, but the only purpose of the offlining argument to migrate_pages() etc, was to ensure that __unmap_and_move() could migrate a KSM page for memory hotremove (which took ksm_thread_mutex) but not for other callers. Now all cases are safe, remove the arg. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Migration of KSM pages is now safe: remove the PageKsm restrictions from mempolicy.c and migrate.c. But keep PageKsm out of __unmap_and_move()'s anon_vma contortions, which are irrelevant to KSM: it looks as if that code was preventing hotremove migration of KSM pages, unless they happened to be in swapcache. There is some question as to whether enforcing a NUMA mempolicy migration ought to migrate KSM pages, mapped into entirely unrelated processes; but moving page_mapcount > 1 is only permitted with MPOL_MF_MOVE_ALL anyway, and it seems reasonable to assume that you wouldn't set MADV_MERGEABLE on any area where this is a worry. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: Rik van Riel <riel@redhat.com> Cc: Petr Holasek <pholasek@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Izik Eidus <izik.eidus@ravellosystems.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
The function names page_xchg_last_nid(), page_last_nid() and reset_page_last_nid() were judged to be inconsistent so rename them to a struct_field_op style pattern. As it looked jarring to have reset_page_mapcount() and page_nid_reset_last() beside each other in memmap_init_zone(), this patch also renames reset_page_mapcount() to page_mapcount_reset(). There are others like init_page_count() but as it is used throughout the arch code a rename would likely cause more conflicts than it is worth. [akpm@linux-foundation.org: fix zcache] Signed-off-by: NMel Gorman <mgorman@suse.de> Suggested-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Lai Jiangshan 提交于
is_valid_nodemask() was introduced by commit 19770b32 ("mm: filter based on a nodemask as well as a gfp_mask"). but it does not match its comments, because it does not check the zone which > policy_zone. Also in commit b377fd39 ("Apply memory policies to top two highest zones when highest zone is ZONE_MOVABLE"), this commits told us, if highest zone is ZONE_MOVABLE, we should also apply memory policies to it. so ZONE_MOVABLE should be valid zone for policies. is_valid_nodemask() need to be changed to match it. Fix: check all zones, even its zoneid > policy_zone. Use nodes_intersects() instead open code to check it. Reported-by: NWen Congyang <wency@cn.fujitsu.com> Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Jiang Liu <jiang.liu@huawei.com> Cc: Jianguo Wu <wujianguo@huawei.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 03 1月, 2013 3 次提交
-
-
由 Mel Gorman 提交于
Sasha was fuzzing with trinity and reported the following problem: BUG: sleeping function called from invalid context at kernel/mutex.c:269 in_atomic(): 1, irqs_disabled(): 0, pid: 6361, name: trinity-main 2 locks held by trinity-main/6361: #0: (&mm->mmap_sem){++++++}, at: [<ffffffff810aa314>] __do_page_fault+0x1e4/0x4f0 #1: (&(&mm->page_table_lock)->rlock){+.+...}, at: [<ffffffff8122f017>] handle_pte_fault+0x3f7/0x6a0 Pid: 6361, comm: trinity-main Tainted: G W 3.7.0-rc2-next-20121024-sasha-00001-gd95ef01-dirty #74 Call Trace: __might_sleep+0x1c3/0x1e0 mutex_lock_nested+0x29/0x50 mpol_shared_policy_lookup+0x2e/0x90 shmem_get_policy+0x2e/0x30 get_vma_policy+0x5a/0xa0 mpol_misplaced+0x41/0x1d0 handle_pte_fault+0x465/0x6a0 This was triggered by a different version of automatic NUMA balancing but in theory the current version is vunerable to the same problem. do_numa_page -> numa_migrate_prep -> mpol_misplaced -> get_vma_policy -> shmem_get_policy It's very unlikely this will happen as shared pages are not marked pte_numa -- see the page_mapcount() check in change_pte_range() -- but it is possible. To address this, this patch restores sp->lock as originally implemented by Kosaki Motohiro. In the path where get_vma_policy() is called, it should not be calling sp_alloc() so it is not necessary to treat the PTL specially. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Remove the unused argument (formerly no_context) from mpol_parse_str() and from mpol_to_str(). Signed-off-by: NHugh Dickins <hughd@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Hugh Dickins 提交于
Recently I suggested using "mount -o remount,mpol=local /tmp" in NUMA mempolicy testing. Very nasty. Reading /proc/mounts, /proc/pid/mounts or /proc/pid/mountinfo may then corrupt one bit of kernel memory, often in a page table (causing "Bad swap" or "Bad page map" warning or "Bad pagetable" oops), sometimes in a vm_area_struct or rbnode or somewhere worse. "mpol=prefer" and "mpol=prefer:Node" are equally toxic. Recent NUMA enhancements are not to blame: this dates back to 2.6.35, when commit e17f74af "mempolicy: don't call mpol_set_nodemask() when no_context" skipped mpol_parse_str()'s call to mpol_set_nodemask(), which used to initialize v.preferred_node, or set MPOL_F_LOCAL in flags. With slab poisoning, you can then rely on mpol_to_str() to set the bit for node 0x6b6b, probably in the next page above the caller's stack. mpol_parse_str() is only called from shmem_parse_options(): no_context is always true, so call it unused for now, and remove !no_context code. Set v.nodes or v.preferred_node or MPOL_F_LOCAL as mpol_to_str() might expect. Then mpol_to_str() can ignore its no_context argument also, the mpol being appropriately initialized whether contextualized or not. Rename its no_context unused too, and let subsequent patch remove them (that's not needed for stable backporting, which would involve rejects). I don't understand why MPOL_LOCAL is described as a pseudo-policy: it's a reasonable policy which suffers from a confusing implementation in terms of MPOL_PREFERRED with MPOL_F_LOCAL. I believe this would be much more robust if MPOL_LOCAL were recognized in switch statements throughout, MPOL_F_LOCAL deleted, and MPOL_PREFERRED use the (possibly empty) nodes mask like everyone else, instead of its preferred_node variant (I presume an optimization from the days before MPOL_LOCAL). But that would take me too long to get right and fully tested. Signed-off-by: NHugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 12月, 2012 2 次提交
-
-
由 Lai Jiangshan 提交于
N_HIGH_MEMORY stands for the nodes that has normal or high memory. N_MEMORY stands for the nodes that has any memory. The code here need to handle with the nodes which have memory, we should use N_MEMORY instead. Signed-off-by: NLai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: NWen Congyang <wency@cn.fujitsu.com> Cc: Christoph Lameter <cl@linux.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: Lin Feng <linfeng@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Kirill A. Shutemov 提交于
Pass vma instead of mm and add address parameter. In most cases we already have vma on the stack. We provides split_huge_page_pmd_mm() for few cases when we have mm, but not vma. This change is preparation to huge zero pmd splitting implementation. Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: "H. Peter Anvin" <hpa@linux.intel.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 12 12月, 2012 1 次提交
-
-
由 David Rientjes 提交于
Remove some duplicate code and simplify alloc_pages_vma(). No functional change. Signed-off-by: NDavid Rientjes <rientjes@google.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 11 12月, 2012 11 次提交
-
-
由 Mel Gorman 提交于
This patch adds Kconfig options and kernel parameters to allow the enabling and disabling of automatic NUMA balancing. The existance of such a switch was and is very important when debugging problems related to transparent hugepages and we should have the same for automatic NUMA placement. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
mm: numa: Use a two-stage filter to restrict pages being migrated for unlikely task<->node relationships Note: This two-stage filter was taken directly from the sched/numa patch "sched, numa, mm: Add the scanning page fault machinery" but is only a partial extraction. As the end result is not necessarily recognisable, the signed-offs-by had to be removed. Will be added back if requested. While it is desirable that all threads in a process run on its home node, this is not always possible or necessary. There may be more threads than exist within the node or the node might over-subscribed with unrelated processes. This can cause a situation whereby a page gets migrated off its home node because the threads clearing pte_numa were running off-node. This patch uses page->last_nid to build a two-stage filter before pages get migrated to avoid problems with short or unlikely task<->node relationships. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
This is the simplest possible policy that still does something of note. When a pte_numa is faulted, it is moved immediately. Any replacement policy must at least do better than this and in all likelihood this policy regresses normal workloads. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com>
-
由 Mel Gorman 提交于
It is tricky to quantify the basic cost of automatic NUMA placement in a meaningful manner. This patch adds some vmstats that can be used as part of a basic costing model. u = basic unit = sizeof(void *) Ca = cost of struct page access = sizeof(struct page) / u Cpte = Cost PTE access = Ca Cupdate = Cost PTE update = (2 * Cpte) + (2 * Wlock) where Cpte is incurred twice for a read and a write and Wlock is a constant representing the cost of taking or releasing a lock Cnumahint = Cost of a minor page fault = some high constant e.g. 1000 Cpagerw = Cost to read or write a full page = Ca + PAGE_SIZE/u Ci = Cost of page isolation = Ca + Wi where Wi is a constant that should reflect the approximate cost of the locking operation Cpagecopy = Cpagerw + (Cpagerw * Wnuma) + Ci + (Ci * Wnuma) where Wnuma is the approximate NUMA factor. 1 is local. 1.2 would imply that remote accesses are 20% more expensive Balancing cost = Cpte * numa_pte_updates + Cnumahint * numa_hint_faults + Ci * numa_pages_migrated + Cpagecopy * numa_pages_migrated Note that numa_pages_migrated is used as a measure of how many pages were isolated even though it would miss pages that failed to migrate. A vmstat counter could have been added for it but the isolation cost is pretty marginal in comparison to the overall cost so it seemed overkill. The ideal way to measure automatic placement benefit would be to count the number of remote accesses versus local accesses and do something like benefit = (remote_accesses_before - remove_access_after) * Wnuma but the information is not readily available. As a workload converges, the expection would be that the number of remote numa hints would reduce to 0. convergence = numa_hint_faults_local / numa_hint_faults where this is measured for the last N number of numa hints recorded. When the workload is fully converged the value is 1. This can measure if the placement policy is converging and how fast it is doing it. Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NRik van Riel <riel@redhat.com>
-
由 Mel Gorman 提交于
The use of MPOL_NOOP and MPOL_MF_LAZY to allow an application to explicitly request lazy migration is a good idea but the actual API has not been well reviewed and once released we have to support it. For now this patch prevents an application using the services. This will need to be revisited. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
This patch converts change_prot_numa() to use change_protection(). As pte_numa and friends check the PTE bits directly it is necessary for change_protection() to use pmd_mknuma(). Hence the required modifications to change_protection() are a little clumsy but the end result is that most of the numa page table helpers are just one or two instructions. Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Lee Schermerhorn 提交于
NOTE: Once again there is a lot of patch stealing and the end result is sufficiently different that I had to drop the signed-offs. Will re-add if the original authors are ok with that. This patch adds another mbind() flag to request "lazy migration". The flag, MPOL_MF_LAZY, modifies MPOL_MF_MOVE* such that the selected pages are marked PROT_NONE. The pages will be migrated in the fault path on "first touch", if the policy dictates at that time. "Lazy Migration" will allow testing of migrate-on-fault via mbind(). Also allows applications to specify that only subsequently touched pages be migrated to obey new policy, instead of all pages in range. This can be useful for multi-threaded applications working on a large shared data area that is initialized by an initial thread resulting in all pages on one [or a few, if overflowed] nodes. After PROT_NONE, the pages in regions assigned to the worker threads will be automatically migrated local to the threads on 1st touch. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com>
-
由 Lee Schermerhorn 提交于
This patch provides a new function to test whether a page resides on a node that is appropriate for the mempolicy for the vma and address where the page is supposed to be mapped. This involves looking up the node where the page belongs. So, the function returns that node so that it may be used to allocated the page without consulting the policy again. A subsequent patch will call this function from the fault path. Because of this, I don't want to go ahead and allocate the page, e.g., via alloc_page_vma() only to have to free it if it has the correct policy. So, I just mimic the alloc_page_vma() node computation logic--sort of. Note: we could use this function to implement a MPOL_MF_STRICT behavior when migrating pages to match mbind() mempolicy--e.g., to ensure that pages in an interleaved range are reinterleaved rather than left where they are when they reside on any page in the interleave nodemask. Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> [ Added MPOL_F_LAZY to trigger migrate-on-fault; simplified code now that we don't have to bother with special crap for interleaved ] Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Lee Schermerhorn 提交于
This patch augments the MPOL_MF_LAZY feature by adding a "NOOP" policy to mbind(). When the NOOP policy is used with the 'MOVE and 'LAZY flags, mbind() will map the pages PROT_NONE so that they will be migrated on the next touch. This allows an application to prepare for a new phase of operation where different regions of shared storage will be assigned to worker threads, w/o changing policy. Note that we could just use "default" policy in this case. However, this also allows an application to request that pages be migrated, only if necessary, to follow any arbitrary policy that might currently apply to a range of pages, without knowing the policy, or without specifying multiple mbind()s for ranges with different policies. [ Bug in early version of mpol_parse_str() reported by Fengguang Wu. ] Bug-Reported-by: NReported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: NLee Schermerhorn <lee.schermerhorn@hp.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Peter Zijlstra 提交于
Make MPOL_LOCAL a real and exposed policy such that applications that relied on the previous default behaviour can explicitly request it. Requested-by: NChristoph Lameter <cl@linux.com> Reviewed-by: NRik van Riel <riel@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: NIngo Molnar <mingo@kernel.org> Signed-off-by: NMel Gorman <mgorman@suse.de>
-
由 Mel Gorman 提交于
The pgmigrate_success and pgmigrate_fail vmstat counters tells the user about migration activity but not the type or the reason. This patch adds a tracepoint to identify the type of page migration and why the page is being migrated. Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NRik van Riel <riel@redhat.com>
-
- 07 12月, 2012 1 次提交
-
-
由 Mel Gorman 提交于
This fixes a regression in 3.7-rc, which has since gone into stable. Commit 00442ad0 ("mempolicy: fix a memory corruption by refcount imbalance in alloc_pages_vma()") changed get_vma_policy() to raise the refcount on a shmem shared mempolicy; whereas shmem_alloc_page() went on expecting alloc_page_vma() to drop the refcount it had acquired. This deserves a rework: but for now fix the leak in shmem_alloc_page(). Hugh: shmem_swapin() did not need a fix, but surely it's clearer to use the same refcounting there as in shmem_alloc_page(), delete its onstack mempolicy, and the strange mpol_cond_copy() and __mpol_cond_copy() - those were invented to let swapin_readahead() make an unknown number of calls to alloc_pages_vma() with one mempolicy; but since 00442ad0, alloc_pages_vma() has kept refcount in balance, so now no problem. Reported-and-tested-by: NTommi Rantala <tt.rantala@gmail.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Signed-off-by: NHugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 17 10月, 2012 1 次提交
-
-
由 David Rientjes 提交于
When reading /proc/pid/numa_maps, it's possible to return the contents of the stack where the mempolicy string should be printed if the policy gets freed from beneath us. This happens because mpol_to_str() may return an error the stack-allocated buffer is then printed without ever being stored. There are two possible error conditions in mpol_to_str(): - if the buffer allocated is insufficient for the string to be stored, and - if the mempolicy has an invalid mode. The first error condition is not triggered in any of the callers to mpol_to_str(): at least 50 bytes is always allocated on the stack and this is sufficient for the string to be written. A future patch should convert this into BUILD_BUG_ON() since we know the maximum strlen possible, but that's not -rc material. The second error condition is possible if a race occurs in dropping a reference to a task's mempolicy causing it to be freed during the read(). The slab poison value is then used for the mode and mpol_to_str() returns -EINVAL. This race is only possible because get_vma_policy() believes that mm->mmap_sem protects task->mempolicy, which isn't true. The exit path does not hold mm->mmap_sem when dropping the reference or setting task->mempolicy to NULL: it uses task_lock(task) instead. Thus, it's required for the caller of a task mempolicy to hold task_lock(task) while grabbing the mempolicy and reading it. Callers with a vma policy store their mempolicy earlier and can simply increment the reference count so it's guaranteed not to be freed. Reported-by: NDave Jones <davej@redhat.com> Signed-off-by: NDavid Rientjes <rientjes@google.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 09 10月, 2012 5 次提交
-
-
由 Minchan Kim 提交于
Revert commit 0def08e3 because check_range can't fail in migrate_to_node with considering current usecases. Quote from Johannes : I think it makes sense to revert. Not because of the semantics, but I : just don't see how check_range() could even fail for this callsite: : : 1. we pass mm->mmap->vm_start in there, so we should not fail due to : find_vma() : : 2. we pass MPOL_MF_DISCONTIG_OK, so the discontig checks do not apply : and so can not fail : : 3. we pass MPOL_MF_MOVE | MPOL_MF_MOVE_ALL, the page table loops will : continue until addr == end, so we never fail with -EIO And I added a new VM_BUG_ON for checking migrate_to_node's future usecase which might pass to MPOL_MF_STRICT. Suggested-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NMinchan Kim <minchan@kernel.org> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Vasiliy Kulikov <segooon@gmail.com> Acked-by: NJohannes Weiner <hannes@cmpxchg.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
Commit cc9a6c87 ("cpuset: mm: reduce large amounts of memory barrier related damage v3") introduced a potential memory corruption. shmem_alloc_page() uses a pseudo vma and it has one significant unique combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED. get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL and mpol_cond_put() DOES decrease a policy ref when a policy has MPOL_F_SHARED. Therefore, when a cpuset update race occurs, alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the reference count and frees the policy prematurely. Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NChristoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
When shared_policy_replace() fails to allocate new->policy is not freed correctly by mpol_set_shared_policy(). The problem is that shared mempolicy code directly call kmem_cache_free() in multiple places where it is easy to make a mistake. This patch creates an sp_free wrapper function and uses it. The bug was introduced pre-git age (IOW, before 2.6.12-rc2). [mgorman@suse.de: Editted changelog] Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Reviewed-by: NChristoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Mel Gorman 提交于
shared_policy_replace() use of sp_alloc() is unsafe. 1) sp_node cannot be dereferenced if sp->lock is not held and 2) another thread can modify sp_node between spin_unlock for allocating a new sp node and next spin_lock. The bug was introduced before 2.6.12-rc2. Kosaki's original patch for this problem was to allocate an sp node and policy within shared_policy_replace and initialise it when the lock is reacquired. I was not keen on this approach because it partially duplicates sp_alloc(). As the paths were sp->lock is taken are not that performance critical this patch converts sp->lock to sp->mutex so it can sleep when calling sp_alloc(). [kosaki.motohiro@jp.fujitsu.com: Original patch] Signed-off-by: NMel Gorman <mgorman@suse.de> Acked-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Reviewed-by: NChristoph Lameter <cl@linux.com> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 KOSAKI Motohiro 提交于
Dave Jones' system call fuzz testing tool "trinity" triggered the following bug error with slab debugging enabled ============================================================================= BUG numa_policy (Not tainted): Poison overwritten ----------------------------------------------------------------------------- INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154 __slab_alloc+0x3d3/0x445 kmem_cache_alloc+0x29d/0x2b0 mpol_new+0xa3/0x140 sys_mbind+0x142/0x620 system_call_fastpath+0x16/0x1b INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154 __slab_free+0x2e/0x1de kmem_cache_free+0x25a/0x260 __mpol_put+0x27/0x30 remove_vma+0x68/0x90 exit_mmap+0x118/0x140 mmput+0x73/0x110 exit_mm+0x108/0x130 do_exit+0x162/0xb90 do_group_exit+0x4f/0xc0 sys_exit_group+0x17/0x20 system_call_fastpath+0x16/0x1b INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080 INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0 The problem is that the structure is being prematurely freed due to a reference count imbalance. In the following case mbind(addr, len) should replace the memory policies of both vma1 and vma2 and thus they will become to share the same mempolicy and the new mempolicy will have the MPOL_F_SHARED flag. +-------------------+-------------------+ | vma1 | vma2(shmem) | +-------------------+-------------------+ | | addr addr+len alloc_pages_vma() uses get_vma_policy() and mpol_cond_put() pair for maintaining the mempolicy reference count. The current rule is that get_vma_policy() only increments refcount for shmem VMA and mpol_conf_put() only decrements refcount if the policy has MPOL_F_SHARED. In above case, vma1 is not shmem vma and vma->policy has MPOL_F_SHARED! The reference count will be decreased even though was not increased whenever alloc_page_vma() is called. This has been broken since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting] in 2008. There is another serious bug with the sharing of memory policies. Currently, mempolicy rebind logic (it is called from cpuset rebinding) ignores a refcount of mempolicy and override it forcibly. Thus, any mempolicy sharing may cause mempolicy corruption. The bug was introduced by commit [68860ec1: cpusets: automatic numa mempolicy rebinding]. Ideally, the shared policy handling would be rewritten to either properly handle COW of the policy structures or at least reference count MPOL_F_SHARED based exclusively on information within the policy. However, this patch takes the easier approach of disabling any policy sharing between VMAs. Each new range allocated with sp_alloc will allocate a new policy, set the reference count to 1 and drop the reference count of the old policy. This increases the memory footprint but is not expected to be a major problem as mbind() is unlikely to be used for fine-grained ranges. It is also inefficient because it means we allocate a new policy even in cases where mbind_range() could use the new_policy passed to it. However, it is more straight-forward and the change should be invisible to the user. [mgorman@suse.de: Edited changelog] Reported-by: Dave Jones <davej@redhat.com>, Cc: Christoph Lameter <cl@linux.com>, Reviewed-by: NChristoph Lameter <cl@linux.com> Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: NMel Gorman <mgorman@suse.de> Cc: Josh Boyer <jwboyer@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-