1. 24 9月, 2009 1 次提交
  2. 23 9月, 2009 2 次提交
    • S
      procfs: provide stack information for threads · d899bf7b
      Stefani Seibold 提交于
      A patch to give a better overview of the userland application stack usage,
      especially for embedded linux.
      
      Currently you are only able to dump the main process/thread stack usage
      which is showed in /proc/pid/status by the "VmStk" Value.  But you get no
      information about the consumed stack memory of the the threads.
      
      There is an enhancement in the /proc/<pid>/{task/*,}/*maps and which marks
      the vm mapping where the thread stack pointer reside with "[thread stack
      xxxxxxxx]".  xxxxxxxx is the maximum size of stack.  This is a value
      information, because libpthread doesn't set the start of the stack to the
      top of the mapped area, depending of the pthread usage.
      
      A sample output of /proc/<pid>/task/<tid>/maps looks like:
      
      08048000-08049000 r-xp 00000000 03:00 8312       /opt/z
      08049000-0804a000 rw-p 00001000 03:00 8312       /opt/z
      0804a000-0806b000 rw-p 00000000 00:00 0          [heap]
      a7d12000-a7d13000 ---p 00000000 00:00 0
      a7d13000-a7f13000 rw-p 00000000 00:00 0          [thread stack: 001ff4b4]
      a7f13000-a7f14000 ---p 00000000 00:00 0
      a7f14000-a7f36000 rw-p 00000000 00:00 0
      a7f36000-a8069000 r-xp 00000000 03:00 4222       /lib/libc.so.6
      a8069000-a806b000 r--p 00133000 03:00 4222       /lib/libc.so.6
      a806b000-a806c000 rw-p 00135000 03:00 4222       /lib/libc.so.6
      a806c000-a806f000 rw-p 00000000 00:00 0
      a806f000-a8083000 r-xp 00000000 03:00 14462      /lib/libpthread.so.0
      a8083000-a8084000 r--p 00013000 03:00 14462      /lib/libpthread.so.0
      a8084000-a8085000 rw-p 00014000 03:00 14462      /lib/libpthread.so.0
      a8085000-a8088000 rw-p 00000000 00:00 0
      a8088000-a80a4000 r-xp 00000000 03:00 8317       /lib/ld-linux.so.2
      a80a4000-a80a5000 r--p 0001b000 03:00 8317       /lib/ld-linux.so.2
      a80a5000-a80a6000 rw-p 0001c000 03:00 8317       /lib/ld-linux.so.2
      afaf5000-afb0a000 rw-p 00000000 00:00 0          [stack]
      ffffe000-fffff000 r-xp 00000000 00:00 0          [vdso]
      
      Also there is a new entry "stack usage" in /proc/<pid>/{task/*,}/status
      which will you give the current stack usage in kb.
      
      A sample output of /proc/self/status looks like:
      
      Name:	cat
      State:	R (running)
      Tgid:	507
      Pid:	507
      .
      .
      .
      CapBnd:	fffffffffffffeff
      voluntary_ctxt_switches:	0
      nonvoluntary_ctxt_switches:	0
      Stack usage:	12 kB
      
      I also fixed stack base address in /proc/<pid>/{task/*,}/stat to the base
      address of the associated thread stack and not the one of the main
      process.  This makes more sense.
      
      [akpm@linux-foundation.org: fs/proc/array.c now needs walk_page_range()]
      Signed-off-by: NStefani Seibold <stefani@seibold.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Randy Dunlap <randy.dunlap@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d899bf7b
    • J
      getrusage: fill ru_maxrss value · 1f10206c
      Jiri Pirko 提交于
      Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
      mark.  This struct is filled as a parameter to getrusage syscall.
      ->ru_maxrss value is set to KBs which is the way it is done in BSD
      systems.  /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
      which seems to be incorrect behavior.  Maintainer of this util was
      notified by me with the patch which corrects it and cc'ed.
      
      To make this happen we extend struct signal_struct by two fields.  The
      first one is ->maxrss which we use to store rss hiwater of the task.  The
      second one is ->cmaxrss which we use to store highest rss hiwater of all
      task childs.  These values are used in k_getrusage() to actually fill
      ->ru_maxrss.  k_getrusage() uses current rss hiwater value directly if mm
      struct exists.
      
      Note:
      exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
      it is intetionally behavior. *BSD getrusage have exec() inheriting.
      
      test programs
      ========================================================
      
      getrusage.c
      ===========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
      
       #define err(str) perror(str), exit(1)
      
      int main(int argc, char** argv)
      {
      	int status;
      
      	printf("allocate 100MB\n");
      	consume(100);
      
      	printf("testcase1: fork inherit? \n");
      	printf("  expect: initial.self ~= child.self\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase2: fork inherit? (cont.) \n");
      	printf("  expect: initial.children ~= 100MB, but child.children = 0\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		show_rusage("child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase3: fork + malloc \n");
      	printf("  expect: child.self ~= initial.self + 50MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      	} else {
      		printf("allocate +50MB\n");
      		consume(50);
      		show_rusage("fork child");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase4: grandchild maxrss\n");
      	printf("  expect: post_wait.children ~= 300MB\n");
      	show_rusage("initial");
      	if (__fork()) {
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 0 -g 300");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase5: zombie\n");
      	printf("  expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
      	printf("          post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
      	show_rusage("initial");
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("pre_wait");
      		wait(&status);
      		show_rusage("post_wait");
      	} else {
      		system("./child -n 400");
      		_exit(0);
      	}
      	printf("\n");
      
      	printf("testcase6: SIG_IGN\n");
      	printf("  expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
      	show_rusage("initial");
      	signal(SIGCHLD, SIG_IGN);
      	if (__fork()) {
      		sleep(1); /* children become zombie */
      		show_rusage("after_zombie");
      	} else {
      		system("./child -n 500");
      		_exit(0);
      	}
      	printf("\n");
      	signal(SIGCHLD, SIG_DFL);
      
      	printf("testcase7: exec (without fork) \n");
      	printf("  expect: initial ~= exec \n");
      	show_rusage("initial");
      	execl("./child", "child", "-v", NULL);
      
      	return 0;
      }
      
      child.c
      =======
       #include <sys/types.h>
       #include <unistd.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
      
       #include "common.h"
      
      int main(int argc, char** argv)
      {
      	int status;
      	int c;
      	long consume_size = 0;
      	long grandchild_consume_size = 0;
      	int show = 0;
      
      	while ((c = getopt(argc, argv, "n:g:v")) != -1) {
      		switch (c) {
      		case 'n':
      			consume_size = atol(optarg);
      			break;
      		case 'v':
      			show = 1;
      			break;
      		case 'g':
      
      			grandchild_consume_size = atol(optarg);
      			break;
      		default:
      			break;
      		}
      	}
      
      	if (show)
      		show_rusage("exec");
      
      	if (consume_size) {
      		printf("child alloc %ldMB\n", consume_size);
      		consume(consume_size);
      	}
      
      	if (grandchild_consume_size) {
      		if (fork()) {
      			wait(&status);
      		} else {
      			printf("grandchild alloc %ldMB\n", grandchild_consume_size);
      			consume(grandchild_consume_size);
      
      			exit(0);
      		}
      	}
      
      	return 0;
      }
      
      common.c
      ========
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
       #include <sys/types.h>
       #include <sys/time.h>
       #include <sys/resource.h>
       #include <sys/types.h>
       #include <sys/wait.h>
       #include <unistd.h>
       #include <signal.h>
       #include <sys/mman.h>
      
       #include "common.h"
       #define err(str) perror(str), exit(1)
      
      void show_rusage(char *prefix)
      {
          	int err, err2;
          	struct rusage rusage_self;
          	struct rusage rusage_children;
      
          	printf("%s: ", prefix);
          	err = getrusage(RUSAGE_SELF, &rusage_self);
          	if (!err)
          		printf("self %ld ", rusage_self.ru_maxrss);
          	err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
          	if (!err2)
          		printf("children %ld ", rusage_children.ru_maxrss);
      
          	printf("\n");
      }
      
      /* Some buggy OS need this worthless CPU waste. */
      void make_pagefault(void)
      {
      	void *addr;
      	int size = getpagesize();
      	int i;
      
      	for (i=0; i<1000; i++) {
      		addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
      		if (addr == MAP_FAILED)
      			err("make_pagefault");
      		memset(addr, 0, size);
      		munmap(addr, size);
      	}
      }
      
      void consume(int mega)
      {
          	size_t sz = mega * 1024 * 1024;
          	void *ptr;
      
          	ptr = malloc(sz);
          	memset(ptr, 0, sz);
      	make_pagefault();
      }
      
      pid_t __fork(void)
      {
      	pid_t pid;
      
      	pid = fork();
      	make_pagefault();
      
      	return pid;
      }
      
      common.h
      ========
      void show_rusage(char *prefix);
      void make_pagefault(void);
      void consume(int mega);
      pid_t __fork(void);
      
      FreeBSD result (expected result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 103492 children 0
      fork child: self 103540 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 103540 children 103540
      child: self 103564 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 103564 children 103564
      allocate +50MB
      fork child: self 154860 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 103564 children 154860
      grandchild alloc 300MB
      post_wait: self 103564 children 308720
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 103564 children 308720
      child alloc 400MB
      pre_wait: self 103564 children 308720
      post_wait: self 103564 children 411312
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 103564 children 411312
      child alloc 500MB
      after_zombie: self 103624 children 411312
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 103624 children 411312
      exec: self 103624 children 411312
      
      Linux result (actual test result)
      ========================================================
      allocate 100MB
      testcase1: fork inherit?
        expect: initial.self ~= child.self
      initial: self 102848 children 0
      fork child: self 102572 children 0
      
      testcase2: fork inherit? (cont.)
        expect: initial.children ~= 100MB, but child.children = 0
      initial: self 102876 children 102644
      child: self 102572 children 0
      
      testcase3: fork + malloc
        expect: child.self ~= initial.self + 50MB
      initial: self 102876 children 102644
      allocate +50MB
      fork child: self 153804 children 0
      
      testcase4: grandchild maxrss
        expect: post_wait.children ~= 300MB
      initial: self 102876 children 153864
      grandchild alloc 300MB
      post_wait: self 102876 children 307536
      
      testcase5: zombie
        expect: pre_wait ~= initial, IOW the zombie process is not accounted.
                post_wait ~= 400MB, IOW wait() collect child's max_rss.
      initial: self 102876 children 307536
      child alloc 400MB
      pre_wait: self 102876 children 307536
      post_wait: self 102876 children 410076
      
      testcase6: SIG_IGN
        expect: initial ~= after_zombie (child's 500MB alloc should be ignored).
      initial: self 102876 children 410076
      child alloc 500MB
      after_zombie: self 102880 children 410076
      
      testcase7: exec (without fork)
        expect: initial ~= exec
      initial: self 102880 children 410076
      exec: self 102880 children 410076
      Signed-off-by: NJiri Pirko <jpirko@redhat.com>
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1f10206c
  3. 22 9月, 2009 4 次提交
    • A
      cpuidle: fix the menu governor to boost IO performance · 69d25870
      Arjan van de Ven 提交于
      Fix the menu idle governor which balances power savings, energy efficiency
      and performance impact.
      
      The reason for a reworked governor is that there have been serious
      performance issues reported with the existing code on Nehalem server
      systems.
      
      To show this I'm sure Andrew wants to see benchmark results:
      (benchmark is "fio", "no cstates" is using "idle=poll")
      
      		no cstates	current linux	new algorithm
      1 disk		107 Mb/s	85 Mb/s		105 Mb/s
      2 disks		215 Mb/s	123 Mb/s	209 Mb/s
      12 disks	590 Mb/s	320 Mb/s	585 Mb/s
      
      In various power benchmark measurements, no degredation was found by our
      measurement&diagnostics team.  Obviously a small percentage more power was
      used in the "fio" benchmark, due to the much higher performance.
      
      While it would be a novel idea to describe the new algorithm in this
      commit message, I cheaped out and described it in comments in the code
      instead.
      
      [changes since first post: spelling fixes from akpm, review feedback,
      folded menu-tng into menu.c]
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69d25870
    • K
      oom: move oom_adj value from task_struct to signal_struct · 28b83c51
      KOSAKI Motohiro 提交于
      Currently, OOM logic callflow is here.
      
          __out_of_memory()
              select_bad_process()            for each task
                  badness()                   calculate badness of one task
                      oom_kill_process()      search child
                          oom_kill_task()     kill target task and mm shared tasks with it
      
      example, process-A have two thread, thread-A and thread-B and it have very
      fat memory and each thread have following oom_adj and oom_score.
      
           thread-A: oom_adj = OOM_DISABLE, oom_score = 0
           thread-B: oom_adj = 0,           oom_score = very-high
      
      Then, select_bad_process() select thread-B, but oom_kill_task() refuse
      kill the task because thread-A have OOM_DISABLE.  Thus __out_of_memory()
      call select_bad_process() again.  but select_bad_process() select the same
      task.  It mean kernel fall in livelock.
      
      The fact is, select_bad_process() must select killable task.  otherwise
      OOM logic go into livelock.
      
      And root cause is, oom_adj shouldn't be per-thread value.  it should be
      per-process value because OOM-killer kill a process, not thread.  Thus
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct signal_struct.  it naturally prevent
      select_bad_process() choose wrong task.
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      28b83c51
    • H
      ksm: unmerge is an origin of OOMs · 35451bee
      Hugh Dickins 提交于
      Just as the swapoff system call allocates many pages of RAM to various
      processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
      (unmerge) is liable to allocate many pages of RAM to various processes,
      perhaps triggering OOM; and each is normally run from a modest admin
      process (swapoff or shell), easily repeated until it succeeds.
      
      So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
      try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
      with that, to ask the OOM killer to kill them first, to prevent them from
      spawning more and more OOM kills.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Acked-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35451bee
    • H
      ksm: the mm interface to ksm · f8af4da3
      Hugh Dickins 提交于
      This patch presents the mm interface to a dummy version of ksm.c, for
      better scrutiny of that interface: the real ksm.c follows later.
      
      When CONFIG_KSM is not set, madvise(2) reject MADV_MERGEABLE and
      MADV_UNMERGEABLE with EINVAL, since that seems more helpful than
      pretending that they can be serviced.  But when CONFIG_KSM=y, accept them
      even if KSM is not currently running, and even on areas which KSM will not
      touch (e.g.  hugetlb or shared file or special driver mappings).
      
      Like other madvices, report ENOMEM despite success if any area in the
      range is unmapped, and use EAGAIN to report out of memory.
      
      Define vma flag VM_MERGEABLE to identify an area on which KSM may try
      merging pages: leave it to ksm_madvise() to decide whether to set it.
      Define mm flag MMF_VM_MERGEABLE to identify an mm which might contain
      VM_MERGEABLE areas, to minimize callouts when forking or exiting.
      
      Based upon earlier patches by Chris Wright and Izik Eidus.
      Signed-off-by: NHugh Dickins <hugh.dickins@tiscali.co.uk>
      Signed-off-by: NChris Wright <chrisw@redhat.com>
      Signed-off-by: NIzik Eidus <ieidus@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8af4da3
  4. 21 9月, 2009 3 次提交
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar 提交于
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cdd6c482
    • P
      sched: Simplify sys_sched_rr_get_interval() system call · 0d721cea
      Peter Williams 提交于
      By removing the need for it to know details of scheduling classes.
      
      This allows PlugSched to define orthogonal scheduling classes.
      Signed-off-by: NPeter Williams <pwil3058@bigpond.net.au>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <06d1b89ee15a0eef82d7.1253496713@mudlark.pw.nest>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0d721cea
    • A
      sched: Fix raciness in runqueue_is_locked() · 89f19f04
      Andrew Morton 提交于
      runqueue_is_locked() is unavoidably racy due to a poor interface design.
      It does
      
      	cpu = get_cpu()
      	ret = some_perpcu_thing(cpu);
      	put_cpu(cpu);
      	return ret;
      
      Its return value is unreliable.
      
      Fix.
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <200909191855.n8JItiko022148@imap1.linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      89f19f04
  5. 18 9月, 2009 1 次提交
    • P
      rcu: Simplify rcu_read_unlock_special() quiescent-state accounting · c3422bea
      Paul E. McKenney 提交于
      The earlier approach required two scheduling-clock ticks to note an
      preemptable-RCU quiescent state in the situation in which the
      scheduling-clock interrupt is unlucky enough to always interrupt an
      RCU read-side critical section.
      
      With this change, the quiescent state is instead noted by the
      outermost rcu_read_unlock() immediately following the first
      scheduling-clock tick, or, alternatively, by the first subsequent
      context switch.  Therefore, this change also speeds up grace
      periods.
      Suggested-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      LKML-Reference: <12528585111945-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c3422bea
  6. 17 9月, 2009 1 次提交
    • P
      sched: Add new wakeup preemption mode: WAKEUP_RUNNING · ad4b78bb
      Peter Zijlstra 提交于
      Create a new wakeup preemption mode, preempt towards tasks that run
      shorter on avg. It sets next buddy to be sure we actually run the task
      we preempted for.
      
      Test results:
      
       root@twins:~# while :; do :; done &
       [1] 6537
       root@twins:~# while :; do :; done &
       [2] 6538
       root@twins:~# while :; do :; done &
       [3] 6539
       root@twins:~# while :; do :; done &
       [4] 6540
      
       root@twins:/home/peter# ./latt -c4 sleep 4
       Entries: 48 (clients=4)
      
       Averages:
       ------------------------------
              Max          4750 usec
              Avg           497 usec
              Stdev         737 usec
      
       root@twins:/home/peter# echo WAKEUP_RUNNING > /debug/sched_features
      
       root@twins:/home/peter# ./latt -c4 sleep 4
       Entries: 48 (clients=4)
      
       Averages:
       ------------------------------
              Max            14 usec
              Avg             5 usec
              Stdev           3 usec
      
      Disabled by default - needs more testing.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NMike Galbraith <efault@gmx.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      LKML-Reference: <new-submission>
      ad4b78bb
  7. 16 9月, 2009 1 次提交
  8. 15 9月, 2009 7 次提交
    • P
      sched: Add WF_FORK · a7558e01
      Peter Zijlstra 提交于
      Avoid the cache buddies from biasing the time distribution away
      from fork()ers. Normally the next buddy will be the preferred
      scheduling target, but this makes fork()s prefer to run the new
      child, whereas we prefer to run the parent, since that will
      generate more work.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a7558e01
    • P
      sched: Rename sync arguments · 7d478721
      Peter Zijlstra 提交于
      In order to extend the functions to have more than 1 flag (sync),
      rename the argument to flags, and explicitly define a WF_ space for
      individual flags.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7d478721
    • P
      sched: Rename select_task_rq() argument · 0763a660
      Peter Zijlstra 提交于
      In order to be able to rename the sync argument, we need to rename
      the current flag argument.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0763a660
    • P
      x86: sched: Provide arch implementations using aperf/mperf · 47fe38fc
      Peter Zijlstra 提交于
      APERF/MPERF support for cpu_power.
      
      APERF/MPERF is arch defined to be a relative scale of work capacity
      per logical cpu, this is assumed to include SMT and Turbo mode.
      
      APERF/MPERF are specified to both reset to 0 when either counter
      wraps, which is highly inconvenient, since that'll give a blimp
      when that happens. The manual specifies writing 0 to the counters
      after each read, but that's 1) too expensive, and 2) destroys the
      possibility of sharing these counters with other users, so we live
      with the blimp - the other existing user does too.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      47fe38fc
    • P
      sched: Merge select_task_rq_fair() and sched_balance_self() · c88d5910
      Peter Zijlstra 提交于
      The problem with wake_idle() is that is doesn't respect things like
      cpu_power, which means it doesn't deal well with SMT nor the recent
      RT interaction.
      
      To cure this, it needs to do what sched_balance_self() does, which
      leads to the possibility of merging select_task_rq_fair() and
      sched_balance_self().
      
      Modify sched_balance_self() to:
      
        - update_shares() when walking up the domain tree,
          (it only called it for the top domain, but it should
           have done this anyway), which allows us to remove
          this ugly bit from try_to_wake_up().
      
        - do wake_affine() on the smallest domain that contains
          both this (the waking) and the prev (the wakee) cpu for
          WAKE invocations.
      
      Then use the top-down balance steps it had to replace wake_idle().
      
      This leads to the dissapearance of SD_WAKE_BALANCE and
      SD_WAKE_IDLE_FAR, with SD_WAKE_IDLE replaced with SD_BALANCE_WAKE.
      
      SD_WAKE_AFFINE needs SD_BALANCE_WAKE to be effective.
      
      Touch all topology bits to replace the old with new SD flags --
      platforms might need re-tuning, enabling SD_BALANCE_WAKE
      conditionally on a NUMA distance seems like a good additional
      feature, magny-core and small nehalem systems would want this
      enabled, systems with slow interconnects would not.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      c88d5910
    • P
      sched: Add TASK_WAKING · e9c84311
      Peter Zijlstra 提交于
      We're going to want to drop rq->lock in try_to_wake_up() for a
      longer period of time, however we also want to deal with concurrent
      waking of the same task, which is currently handled by holding
      rq->lock.
      
      So introduce a new TASK state, namely TASK_WAKING, which indicates
      someone is already waking the task (other wakers will fail p->state
      & state).
      
      We also keep preemption disabled over the whole ttwu().
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e9c84311
    • P
      sched: Hook sched_balance_self() into sched_class::select_task_rq() · 5f3edc1b
      Peter Zijlstra 提交于
      Rather ugly patch to fully place the sched_balance_self() code
      inside the fair class.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5f3edc1b
  9. 09 9月, 2009 1 次提交
    • M
      sched: Turn off child_runs_first · 2bba22c5
      Mike Galbraith 提交于
      Set child_runs_first default to off.
      
      It hurts 'optimal' make -j<NR_CPUS> workloads as make jobs
      get preempted by child tasks, reducing parallelism.
      
      Note, this patch might make existing races in user
      applications more prominent than before - so breakages
      might be bisected to this commit.
      
      Child-runs-first is broken on SMP to begin with, and we
      already had it off briefly in v2.6.23 so most of the
      offenders ought to be fixed. Would be nice not to revert
      this commit but fix those apps finally ...
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1252486344.28645.18.camel@marge.simson.net>
      [ made the sysctl independent of CONFIG_SCHED_DEBUG, in case
        people want to work around broken apps. ]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2bba22c5
  10. 04 9月, 2009 4 次提交
  11. 02 9月, 2009 2 次提交
    • D
      KEYS: Add a keyctl to install a process's session keyring on its parent [try #6] · ee18d64c
      David Howells 提交于
      Add a keyctl to install a process's session keyring onto its parent.  This
      replaces the parent's session keyring.  Because the COW credential code does
      not permit one process to change another process's credentials directly, the
      change is deferred until userspace next starts executing again.  Normally this
      will be after a wait*() syscall.
      
      To support this, three new security hooks have been provided:
      cred_alloc_blank() to allocate unset security creds, cred_transfer() to fill in
      the blank security creds and key_session_to_parent() - which asks the LSM if
      the process may replace its parent's session keyring.
      
      The replacement may only happen if the process has the same ownership details
      as its parent, and the process has LINK permission on the session keyring, and
      the session keyring is owned by the process, and the LSM permits it.
      
      Note that this requires alteration to each architecture's notify_resume path.
      This has been done for all arches barring blackfin, m68k* and xtensa, all of
      which need assembly alteration to support TIF_NOTIFY_RESUME.  This allows the
      replacement to be performed at the point the parent process resumes userspace
      execution.
      
      This allows the userspace AFS pioctl emulation to fully emulate newpag() and
      the VIOCSETTOK and VIOCSETTOK2 pioctls, all of which require the ability to
      alter the parent process's PAG membership.  However, since kAFS doesn't use
      PAGs per se, but rather dumps the keys into the session keyring, the session
      keyring of the parent must be replaced if, for example, VIOCSETTOK is passed
      the newpag flag.
      
      This can be tested with the following program:
      
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <keyutils.h>
      
      	#define KEYCTL_SESSION_TO_PARENT	18
      
      	#define OSERROR(X, S) do { if ((long)(X) == -1) { perror(S); exit(1); } } while(0)
      
      	int main(int argc, char **argv)
      	{
      		key_serial_t keyring, key;
      		long ret;
      
      		keyring = keyctl_join_session_keyring(argv[1]);
      		OSERROR(keyring, "keyctl_join_session_keyring");
      
      		key = add_key("user", "a", "b", 1, keyring);
      		OSERROR(key, "add_key");
      
      		ret = keyctl(KEYCTL_SESSION_TO_PARENT);
      		OSERROR(ret, "KEYCTL_SESSION_TO_PARENT");
      
      		return 0;
      	}
      
      Compiled and linked with -lkeyutils, you should see something like:
      
      	[dhowells@andromeda ~]$ keyctl show
      	Session Keyring
      	       -3 --alswrv   4043  4043  keyring: _ses
      	355907932 --alswrv   4043    -1   \_ keyring: _uid.4043
      	[dhowells@andromeda ~]$ /tmp/newpag
      	[dhowells@andromeda ~]$ keyctl show
      	Session Keyring
      	       -3 --alswrv   4043  4043  keyring: _ses
      	1055658746 --alswrv   4043  4043   \_ user: a
      	[dhowells@andromeda ~]$ /tmp/newpag hello
      	[dhowells@andromeda ~]$ keyctl show
      	Session Keyring
      	       -3 --alswrv   4043  4043  keyring: hello
      	340417692 --alswrv   4043  4043   \_ user: a
      
      Where the test program creates a new session keyring, sticks a user key named
      'a' into it and then installs it on its parent.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      ee18d64c
    • A
      sched: Provide iowait counters · 8f0dfc34
      Arjan van de Ven 提交于
      For counting how long an application has been waiting for
      (disk) IO, there currently is only the HZ sample driven
      information available, while for all other counters in this
      class, a high resolution version is available via
      CONFIG_SCHEDSTATS.
      
      In order to make an improved bootchart tool possible, we also
      need a higher resolution version of the iowait time.
      
      This patch below adds this scheduler statistic to the kernel.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4A64B813.1080506@linux.intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8f0dfc34
  12. 29 8月, 2009 2 次提交
  13. 23 8月, 2009 2 次提交
    • P
      rcu: Remove CONFIG_PREEMPT_RCU · 6b3ef48a
      Paul E. McKenney 提交于
      Now that CONFIG_TREE_PREEMPT_RCU is in place, there is no
      further need for CONFIG_PREEMPT_RCU.  Remove it, along with
      whatever subtle bugs it may (or may not) contain.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <125097461396-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6b3ef48a
    • P
      rcu: Merge preemptable-RCU functionality into hierarchical RCU · f41d911f
      Paul E. McKenney 提交于
      Create a kernel/rcutree_plugin.h file that contains definitions
      for preemptable RCU (or, under the #else branch of the #ifdef,
      empty definitions for the classic non-preemptable semantics).
      These definitions fit into plugins defined in kernel/rcutree.c
      for this purpose.
      
      This variant of preemptable RCU uses a new algorithm whose
      read-side expense is roughly that of classic hierarchical RCU
      under CONFIG_PREEMPT. This new algorithm's update-side expense
      is similar to that of classic hierarchical RCU, and, in absence
      of read-side preemption or blocking, is exactly that of classic
      hierarchical RCU.  Perhaps more important, this new algorithm
      has a much simpler implementation, saving well over 1,000 lines
      of code compared to mainline's implementation of preemptable
      RCU, which will hopefully be retired in favor of this new
      algorithm.
      
      The simplifications are obtained by maintaining per-task
      nesting state for running tasks, and using a simple
      lock-protected algorithm to handle accounting when tasks block
      within RCU read-side critical sections, making use of lessons
      learned while creating numerous user-level RCU implementations
      over the past 18 months.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <12509746134003-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f41d911f
  14. 19 8月, 2009 1 次提交
    • K
      mm: revert "oom: move oom_adj value" · 0753ba01
      KOSAKI Motohiro 提交于
      The commit 2ff05b2b (oom: move oom_adj value) moveed the oom_adj value to
      the mm_struct.  It was a very good first step for sanitize OOM.
      
      However Paul Menage reported the commit makes regression to his job
      scheduler.  Current OOM logic can kill OOM_DISABLED process.
      
      Why? His program has the code of similar to the following.
      
      	...
      	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
      	...
      	if (vfork() == 0) {
      		set_oom_adj(0); /* Invoked child can be killed */
      		execve("foo-bar-cmd");
      	}
      	....
      
      vfork() parent and child are shared the same mm_struct.  then above
      set_oom_adj(0) doesn't only change oom_adj for vfork() child, it's also
      change oom_adj for vfork() parent.  Then, vfork() parent (job scheduler)
      lost OOM immune and it was killed.
      
      Actually, fork-setting-exec idiom is very frequently used in userland program.
      We must not break this assumption.
      
      Then, this patch revert commit 2ff05b2b and related commit.
      
      Reverted commit list
      ---------------------
      - commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct)
      - commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
      - commit 81236810 (oom: only oom kill exiting tasks with attached memory)
      - commit 933b787b (mm: copy over oom_adj value at fork time)
      Signed-off-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0753ba01
  15. 18 8月, 2009 1 次提交
  16. 03 8月, 2009 2 次提交
  17. 02 8月, 2009 2 次提交
    • G
      sched: Enhance the pre/post scheduling logic · 3f029d3c
      Gregory Haskins 提交于
      We currently have an explicit "needs_post" vtable method which
      returns a stack variable for whether we should later run
      post-schedule.  This leads to an awkward exchange of the
      variable as it bubbles back up out of the context switch. Peter
      Zijlstra observed that this information could be stored in the
      run-queue itself instead of handled on the stack.
      
      Therefore, we revert to the method of having context_switch
      return void, and update an internal rq->post_schedule variable
      when we require further processing.
      
      In addition, we fix a race condition where we try to access
      current->sched_class without holding the rq->lock.  This is
      technically racy, as the sched-class could change out from
      under us.  Instead, we reference the per-rq post_schedule
      variable with the runqueue unlocked, but with preemption
      disabled to see if we need to reacquire the rq->lock.
      
      Finally, we clean the code up slightly by removing the #ifdef
      CONFIG_SMP conditionals from the schedule() call, and implement
      some inline helper functions instead.
      
      This patch passes checkpatch, and rt-migrate.
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090729150422.17691.55590.stgit@dev.haskins.net>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3f029d3c
    • F
      sched: Fix cond_resched_lock() in !CONFIG_PREEMPT · 716a4234
      Frederic Weisbecker 提交于
      The might_sleep() test inside cond_resched_lock() assumes the
      spinlock is held and then preemption is disabled. This is true
      with CONFIG_PREEMPT but the preempt_count() doesn't change
      otherwise.
      
      Check by starting from the appropriate preempt offset depending
      on the config.
      Reported-by: NLi Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1248458723-12146-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      716a4234
  18. 18 7月, 2009 3 次提交