• M
    sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
    Mel Gorman 提交于
    Threads share an address space and each can change the protections of the
    same address space to trap NUMA faults. This is redundant and potentially
    counter-productive as any thread doing the update will suffice. Potentially
    only one thread is required but that thread may be idle or it may not have
    any locality concerns and pick an unsuitable scan rate.
    
    This patch uses independent scan period but they are staggered based on
    the number of address space users when the thread is created.  The intent
    is that threads will avoid scanning at the same time and have a chance
    to adapt their scan rate later if necessary. This reduces the total scan
    activity early in the lifetime of the threads.
    
    The different in headline performance across a range of machines and
    workloads is marginal but the system CPU usage is reduced as well as overall
    scan activity.  The following is the time reported by NAS Parallel Benchmark
    using unbound openmp threads and a D size class:
    
    			      4.17.0-rc1             4.17.0-rc1
    				 vanilla           stagger-v1r1
    	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
    	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
    	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
    	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
    	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
    	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
    	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
    	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
    
    Note it's not a universal win but we have no prior knowledge of which
    thread matters but the number of threads created often exceeds the size
    of the node when the threads are not bound. However, there is a reducation
    of overall system CPU usage:
    
    				    4.17.0-rc1             4.17.0-rc1
    				       vanilla           stagger-v1r1
    	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
    	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
    	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
    	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
    	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
    	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
    	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
    	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
    
    NUMA scan activity is also reduced:
    
    	NUMA alloc local               1042828     1342670
    	NUMA base PTE updates        140481138    93577468
    	NUMA huge PMD updates           272171      180766
    	NUMA page range updates      279832690   186129660
    	NUMA hint faults               1395972     1193897
    	NUMA hint local faults          877925      855053
    	NUMA hint local percent             62          71
    	NUMA pages migrated           12057909     9158023
    
    Similar observations are made for other thread-intensive workloads. System
    CPU usage is lower even though the headline gains in performance tend to be
    small. For example, specjbb 2005 shows almost no difference in performance
    but scan activity is reduced by a third on a 4-socket box. I didn't find
    a workload (thread intensive or otherwise) that suffered badly.
    Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
    Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matt Fleming <matt@codeblueprint.co.uk>
    Cc: Mike Galbraith <efault@gmx.de>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
    13784475
fair.c 274.6 KB