[IA64] optimize flush_tlb_range on large numa box

It was reported from a field customer that global spin lock ptcg_lock is giving a lot of grief on munmap performance running on a large numa machine. What appears to be a problem coming from flush_tlb_range(), which currently unconditionally calls platform_global_tlb_purge(). For some of the numa machines in existence today, this function is mapped into ia64_global_tlb_purge(), which holds ptcg_lock spin lock while executing ptc.ga instruction. Here is a patch that attempt to avoid global tlb purge whenever possible. It will use local tlb purge as much as possible. Though the conditions to use local tlb purge is pretty restrictive. One of the side effect of having flush tlb range instruction on ia64 is that kernel don't get a chance to clear out cpu_vm_mask. On ia64, this mask is sticky and it will accumulate if process bounces around. Thus diminishing the possible use of ptc.l. Thoughts? Signed-off-by: N Ken Chen <kenneth.w.chen@intel.com> Acked-by: N Jack Steiner <steiner@sgi.com> Acked-by: N KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: N Tony Luck <tony.luck@intel.com>

[IA64] optimize flush_tlb_range on large numa box
It was reported from a field customer that global spin lock ptcg_lock is giving a lot of grief on munmap performance running on a large numa machine. What appears to be a problem coming from flush_tlb_range(), which currently unconditionally calls platform_global_tlb_purge(). For some of the numa machines in existence today, this function is mapped into ia64_global_tlb_purge(), which holds ptcg_lock spin lock while executing ptc.ga instruction. Here is a patch that attempt to avoid global tlb purge whenever possible. It will use local tlb purge as much as possible. Though the conditions to use local tlb purge is pretty restrictive. One of the side effect of having flush tlb range instruction on ia64 is that kernel don't get a chance to clear out cpu_vm_mask. On ia64, this mask is sticky and it will accumulate if process bounces around. Thus diminishing the possible use of ptc.l. Thoughts? Signed-off-by: N Ken Chen <kenneth.w.chen@intel.com> Acked-by: N Jack Steiner <steiner@sgi.com> Acked-by: N KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: N Tony Luck <tony.luck@intel.com>
ce9eed5a · Chen, Kenneth W · Tony Luck · 5e48521e · ce9eed5a
隐藏空白更改
内联并排

Showing with 7 addition and 5 deletion

arch/ia64/mm/tlb.c arch/ia64/mm/tlb.c +7 -5

未找到文件。
--- a/arch/ia64/mm/tlb.c
+++ b/arch/ia64/mm/tlb.c
@@ -156,17 +156,19 @@ flush_tlb_range (struct vm_area_struct *vma, unsigned long start,
 		nbits = purge.max_bits;
 	start &= ~((1UL << nbits) - 1);

-# ifdef CONFIG_SMP
-	platform_global_tlb_purge(mm, start, end, nbits);
-# else
 	preempt_disable();
+#ifdef CONFIG_SMP
+	if (mm != current->active_mm || cpus_weight(mm->cpu_vm_mask) != 1) {
+		platform_global_tlb_purge(mm, start, end, nbits);
+		preempt_enable();
+		return;
+	}
+#endif
 	do {
 		ia64_ptcl(start, (nbits<<2));
 		start += (1UL << nbits);
 	} while (start < end);
 	preempt_enable();
-# endif
-
 	ia64_srlz_i();			/* srlz.i implies srlz.d */
 }
 EXPORT_SYMBOL(flush_tlb_range);