[PATCH] sched: reduce overhead of calc_load

Currently, count_active_tasks() calls both nr_running() and nr_uninterruptible(). Each of these functions does a "for_each_cpu" and reads values from the runqueue of each CPU. Although this is not a lot of instructions, each runqueue may be located on a different node. Depending on the architecture, a unique TLB entry may be required to access each runqueue.

Since there may be more runqueues than CPU TLB entries, a scan of all runqueues can trash the TLB: each memory reference then incurs a TLB miss and refill.

In addition, the runqueue cacheline that contains nr_running and nr_uninterruptible may be evicted from the cache between the two passes. This causes unnecessary cache misses.

Combining nr_running() and nr_uninterruptible() into a single function substantially reduces the TLB and cache misses on large systems. This should have no measurable effect on smaller systems.

On a 128p IA64 system running a memory stress workload, the new function reduced the overhead of calc_load() from 605 usec/call to 324 usec/call.

Signed-off-by: Jack Steiner <steiner@sgi.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>
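
The access pattern described in the commit message can be sketched in user space as follows. This is only an illustrative model, not kernel code: struct rq_model, MODEL_NR_CPUS, the runqueues array, and both function names are hypothetical stand-ins, and the counter values are made up. It shows that a single pass over the per-CPU counters yields the same total as two separate passes while visiting each runqueue only once.

```c
#define MODEL_NR_CPUS 4	/* hypothetical CPU count for this model */

/* Hypothetical stand-in for the two per-runqueue counters. */
struct rq_model {
	unsigned long nr_running;
	unsigned long nr_uninterruptible;	/* a single CPU's value may be "negative" */
};

/* Made-up sample values; CPU 1 holds -1 stored as an unsigned long. */
static struct rq_model runqueues[MODEL_NR_CPUS] = {
	{ 2, 1 }, { 0, (unsigned long)-1 }, { 3, 2 }, { 1, 0 },
};

/* Two passes: every runqueue is visited twice. */
static unsigned long two_pass_active(void)
{
	unsigned long i, running = 0, uninterruptible = 0;

	for (i = 0; i < MODEL_NR_CPUS; i++)
		running += runqueues[i].nr_running;
	for (i = 0; i < MODEL_NR_CPUS; i++)
		uninterruptible += runqueues[i].nr_uninterruptible;
	/* Clamp the summed counter so it never reads as negative. */
	if ((long)uninterruptible < 0)
		uninterruptible = 0;
	return running + uninterruptible;
}

/* One pass: both counters are read while the runqueue is being visited. */
static unsigned long one_pass_active(void)
{
	unsigned long i, running = 0, uninterruptible = 0;

	for (i = 0; i < MODEL_NR_CPUS; i++) {
		running += runqueues[i].nr_running;
		uninterruptible += runqueues[i].nr_uninterruptible;
	}
	if ((long)uninterruptible < 0)
		uninterruptible = 0;
	return running + uninterruptible;
}
```

With the sample values above, both functions return 8 (6 running plus 2 uninterruptible). The model says nothing about TLB or cache behavior; the savings reported in the commit message come from the hardware effects of touching each remote runqueue once instead of twice.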

db1b1fef · Jack Steiner · Linus Torvalds · 3055adda · db1b1fef · db1b1fef

Showing 3 changed files with 17 additions and 1 deletion

include/linux/sched.h +1 -0

kernel/sched.c +15 -0

kernel/timer.c +1 -1

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -100,6 +100,7 @@ DECLARE_PER_CPU(unsigned long, process_counts);
 extern int nr_processes(void);
 extern unsigned long nr_running(void);
 extern unsigned long nr_uninterruptible(void);
+extern unsigned long nr_active(void);
 extern unsigned long nr_iowait(void);
 #include <linux/time.h>

--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1658,6 +1658,21 @@ unsigned long nr_iowait(void)
 	return sum;
 }
+unsigned long nr_active(void)
+{
+	unsigned long i, running = 0, uninterruptible = 0;
+	for_each_online_cpu(i) {
+		running += cpu_rq(i)->nr_running;
+		uninterruptible += cpu_rq(i)->nr_uninterruptible;
+	}
+	if (unlikely((long)uninterruptible < 0))
+		uninterruptible = 0;
+	return running + uninterruptible;
+}
 #ifdef CONFIG_SMP
 /*

--- a/kernel/timer.c
+++ b/kernel/timer.c
@@ -825,7 +825,7 @@ void update_process_times(int user_tick)
 */
 static unsigned long count_active_tasks(void)
 {
-	return (nr_running() + nr_uninterruptible()) * FIXED_1;
+	return nr_active() * FIXED_1;
 }
 /*