1. 25 Jul 2008, 1 commit
  2. 22 Jul 2008, 1 commit
  3. 19 Jul 2008, 1 commit
  4. 27 Jun 2008, 1 commit
  5. 19 Jun 2008, 1 commit
    • rcu: make rcutorture more vicious: reinstate boot-time testing · 31a72bce
      Authored by Paul E. McKenney
      This patch reinstates the ability to build rcutorture directly into
      the Linux kernel.  This capability had been removed because it could
      leave the kernel pretty much useless: rcutorture would be running
      starting from early boot.  That problem is now avoided by (1) making
      rcutorture run only three seconds of every six by default, (2) adding
      a CONFIG_RCU_TORTURE_TEST_RUNNABLE option that permits rcutorture to
      be quiesced at boot time, and (3) adding a sysctl named
      /proc/sys/kernel/rcutorture_runnable that permits rcutorture to be
      quiesced and unquiesced when built into the kernel.
      
      Please note that this /proc file is -not- available when rcutorture
      is built as a module.  Please also note that to get the earlier
      take-no-prisoners behavior, you must use the boot command line to set
      rcutorture's "stutter" parameter to zero.
      
      The rcutorture quiescing mechanism is currently quite crude: each
      rcutorture process loops, polling a global variable once per tick.
      Suggestions for improvement are welcome; the default action will be
      to reduce the polling rate to a few times per second.
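      
      As a rough illustration (not part of the patch), a userspace toggle
      might look like the following minimal sketch; the helper name is made
      up, and the /proc path is the one introduced above:
      
      	/* quiesce (0) or restart (1) a built-in rcutorture */
      	#include <fcntl.h>
      	#include <stdio.h>
      	#include <unistd.h>
      
      	static int set_rcutorture_runnable(int runnable)
      	{
      		int fd = open("/proc/sys/kernel/rcutorture_runnable", O_WRONLY);
      
      		if (fd < 0) {
      			perror("open"); /* absent when rcutorture is a module */
      			return -1;
      		}
      		if (write(fd, runnable ? "1" : "0", 1) != 1)
      			perror("write");
      		close(fd);
      		return 0;
      	}
      
      	int main(int argc, char *argv[])
      	{
      		return set_rcutorture_runnable(argc > 1 && argv[1][0] == '1') ? 1 : 0;
      	}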
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Suggested-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 25 May 2008, 3 commits
  7. 24 May 2008, 1 commit
  8. 17 May 2008, 1 commit
  9. 13 May 2008, 1 commit
  10. 29 Apr 2008, 4 commits
  11. 20 Apr 2008, 2 commits
  12. 05 Mar 2008, 1 commit
    • sched: revert load_balance_monitor() changes · 62fb1851
      Authored by Peter Zijlstra
      The following commits cause a number of regressions:
      
        commit 58e2d4ca
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduling, change how cpu load is calculated
      
        commit 6b2d7700
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
      
      Namely:
       - very frequent wakeups on SMP, reported by PowerTop users.
       - cacheline thrashing on (large) SMP
       - some latencies larger than 500ms
      
      While there is a mergeable patch to fix the latter, the former issues
      are not fixable in a manner suitable for .25 (we're at -rc3 now).
      
      Hence we revert them and try again in v2.6.26.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Tested-by: Alexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 14 Feb 2008, 1 commit
  14. 13 Feb 2008, 1 commit
  15. 09 Feb 2008, 4 commits
  16. 08 Feb 2008, 1 commit
    • oom: add sysctl to enable task memory dump · fef1bdd6
      Authored by David Rientjes
      Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce
      a dump of all system tasks (excluding kernel threads) when it performs
      an OOM kill.  The information includes pid, uid, tgid, vm size, rss,
      cpu, oom_adj score, and name.
      
      This is helpful for determining why there was an OOM condition and which
      rogue task caused it.
      
      It is configurable so that large systems, such as those with several
      thousand tasks, do not incur a performance penalty associated with dumping
      data they may not desire.
      
      If the OOM was triggered by a memory controller, the tasklist is
      filtered to exclude tasks that are not members of the same cgroup.
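      
      As a minimal sketch (ours, not part of the patch), the dump can be
      enabled from userspace by writing to the new sysctl file:
      
      	#include <stdio.h>
      
      	int main(void)
      	{
      		FILE *f = fopen("/proc/sys/vm/oom_dump_tasks", "w");
      
      		if (!f) {
      			perror("fopen");
      			return 1;
      		}
      		fputs("1\n", f); /* zero disables, nonzero enables */
      		return fclose(f) ? 1 : 0;
      	}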
      
      Cc: Andrea Arcangeli <andrea@suse.de>
      Cc: Christoph Lameter <clameter@sgi.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  17. 07 Feb 2008, 1 commit
    • get rid of NR_OPEN and introduce a sysctl_nr_open · 9cfe015a
      Authored by Eric Dumazet
      NR_OPEN (historically set to 1024*1024) forbids a process from opening
      more than 1024*1024 file handles.
      
      Unfortunately, some production servers hit this not-so-"ridiculously
      high" limit of 1024*1024 file descriptors per process.
      
      Raising NR_OPEN itself is not considered safe, because it could
      potentially exhaust vmalloc space.
      
      This patch introduces a new sysctl (/proc/sys/fs/nr_open) which defaults
      to 1024*1024, so that admins can decide to raise this limit if their
      workload needs it.
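      
      A minimal sketch (ours, with an arbitrary new ceiling) of how an admin
      could raise the limit and then lift a process's fd limit up to it:
      
      	#include <stdio.h>
      	#include <sys/resource.h>
      
      	int main(void)
      	{
      		FILE *f = fopen("/proc/sys/fs/nr_open", "w");
      		struct rlimit rl = { 2 * 1024 * 1024, 2 * 1024 * 1024 };
      
      		if (!f || fputs("2097152\n", f) == EOF || fclose(f)) {
      			perror("nr_open");
      			return 1;
      		}
      		/* RLIMIT_NOFILE may now be raised past the old NR_OPEN cap
      		 * (needs root or CAP_SYS_RESOURCE). */
      		if (setrlimit(RLIMIT_NOFILE, &rl)) {
      			perror("setrlimit");
      			return 1;
      		}
      		return 0;
      	}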
      
      [akpm@linux-foundation.org: export it for sparc64]
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 06 Feb 2008, 2 commits
    • capabilities: introduce per-process capability bounding set · 3b7391de
      Authored by Serge E. Hallyn
      The capability bounding set is a set beyond which capabilities cannot
      grow.  Currently cap_bset is per-system: it can be manipulated through
      sysctl, but only init can add capabilities, and root can remove them.
      By default it includes all caps except CAP_SETPCAP.
      
      This patch makes the bounding set per-process when file capabilities
      are enabled.  It is inherited from the parent at fork.  No one can add
      elements; CAP_SETPCAP is required to remove them.
      
      One example use of this is to start a safer container.  For instance, until
      device namespaces or per-container device whitelists are introduced, it is
      best to take CAP_MKNOD away from a container.
      
      The bounding set will not affect pP and pE immediately; it will only
      affect pP' and pE' after subsequent exec()s.  It also does not affect
      pI, and exec() does not constrain pI'.  So to really start a shell with
      no way of regaining CAP_MKNOD, you would do:
      
      	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
      	cap_t cap = cap_get_proc();
      	cap_value_t caparray[1];
      	caparray[0] = CAP_MKNOD;
      	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
      	cap_set_proc(cap);
      	cap_free(cap);
      
      The following test program will get and set the bounding set (but not
      pI).  For instance:
      
      	./bset get
      		(lists capabilities in bset)
      	./bset drop cap_net_raw
      		(starts shell with new bset)
      		(use capset, setuid binary, or binary with
      		file capabilities to try to increase caps)
      
      ************************************************************
      cap_bound.c
      ************************************************************
       #include <sys/prctl.h>
       #include <linux/capability.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
      
       #ifndef PR_CAPBSET_READ
       #define PR_CAPBSET_READ 23
       #endif
      
       #ifndef PR_CAPBSET_DROP
       #define PR_CAPBSET_DROP 24
       #endif
      
      int usage(char *me)
      {
      	printf("Usage: %s get\n", me);
      	printf("       %s drop <capability>\n", me);
      	return 1;
      }
      
       #define numcaps 32
      char *captable[numcaps] = {
      	"cap_chown",
      	"cap_dac_override",
      	"cap_dac_read_search",
      	"cap_fowner",
      	"cap_fsetid",
      	"cap_kill",
      	"cap_setgid",
      	"cap_setuid",
      	"cap_setpcap",
      	"cap_linux_immutable",
      	"cap_net_bind_service",
      	"cap_net_broadcast",
      	"cap_net_admin",
      	"cap_net_raw",
      	"cap_ipc_lock",
      	"cap_ipc_owner",
      	"cap_sys_module",
      	"cap_sys_rawio",
      	"cap_sys_chroot",
      	"cap_sys_ptrace",
      	"cap_sys_pacct",
      	"cap_sys_admin",
      	"cap_sys_boot",
      	"cap_sys_nice",
      	"cap_sys_resource",
      	"cap_sys_time",
      	"cap_sys_tty_config",
      	"cap_mknod",
      	"cap_lease",
      	"cap_audit_write",
      	"cap_audit_control",
      	"cap_setfcap"
      };
      
      int getbcap(void)
      {
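      	/* walk every known capability; PR_CAPBSET_READ returns 1 for
      	 * those still present in the bounding set */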
      	int comma=0;
      	unsigned long i;
      	int ret;
      
      	printf("i know of %d capabilities\n", numcaps);
      	printf("capability bounding set:");
      	for (i=0; i<numcaps; i++) {
      		ret = prctl(PR_CAPBSET_READ, i);
      		if (ret < 0)
      			perror("prctl");
      		else if (ret==1)
      			printf("%s%s", (comma++) ? ", " : " ", captable[i]);
      	}
      	printf("\n");
      	return 0;
      }
      
      int capdrop(char *str)
      {
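      	/* map the capability name to its index, then drop it from the
      	 * bounding set via prctl(PR_CAPBSET_DROP) */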
      	unsigned long i;
      
      	int found=0;
      	for (i=0; i<numcaps; i++) {
      		if (strcmp(captable[i], str) == 0) {
      			found=1;
      			break;
      		}
      	}
      	if (!found)
      		return 1;
      	if (prctl(PR_CAPBSET_DROP, i)) {
      		perror("prctl");
      		return 1;
      	}
      	return 0;
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc<2)
      		return usage(argv[0]);
      	if (strcmp(argv[1], "get")==0)
      		return getbcap();
      	if (strcmp(argv[1], "drop")!=0 || argc<3)
      		return usage(argv[0]);
      	if (capdrop(argv[2])) {
      		printf("unknown capability\n");
      		return 1;
      	}
      	return execl("/bin/bash", "/bin/bash", NULL);
      }
      ************************************************************
      
      [serue@us.ibm.com: fix typo]
      Signed-off-by: Serge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>
      Signed-off-by: "Serge E. Hallyn" <serue@us.ibm.com>
      Tested-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page-writeback: highmem_is_dirtyable option · 195cf453
      Authored by Bron Gondwana
      Add a vm.highmem_is_dirtyable toggle.
      
      A 32-bit machine with HIGHMEM64 enabled, running DCC, has an mmap()ed
      file of approximately 2GB containing a hash format that is written to
      randomly by the dbclean process.  On 2.6.16 this process took a few
      minutes; with lowmem-only accounting of dirty ratios, it takes about
      12 hours of 100% disk I/O, all random writes.
      
      This patch adds a toggle in /proc/sys/vm/highmem_is_dirtyable which can
      be set to 1 to add highmem back to the total available memory count.
      
      [akpm@linux-foundation.org: Fix the CONFIG_DETECT_SOFTLOCKUP=y build]
      Signed-off-by: Bron Gondwana <brong@fastmail.fm>
      Cc: Ethan Solomita <solo@google.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: WU Fengguang <wfg@mail.ustc.edu.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 02 Feb 2008, 1 commit
    • [AUDIT] break large execve argument logging into smaller messages · de6bbd1d
      Authored by Eric Paris
      Execve arguments can be quite large: there is no limit on the number of
      arguments, and each argument can be up to 4GB in size.
      
      This patch prints those arguments in bite-sized pieces.  A userspace
      size limitation of 8k was discovered, so this keeps messages around
      7.5k.
      
      Single arguments longer than 7.5k are split into multiple records,
      which can be identified as aX[Y]=.
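      
      A minimal userspace sketch of the splitting idea (ours, not the kernel
      code; the 7500-byte cap and the aX[Y]= record format follow the
      description above):
      
      	#include <stdio.h>
      	#include <string.h>
      
      	#define MAX_AUDIT_ARG_LEN 7500	/* ~7.5k per record */
      
      	/* print execve argument argnum, splitting long ones into aX[Y]= records */
      	static void log_arg(int argnum, const char *arg)
      	{
      		size_t len = strlen(arg), off, piece;
      
      		if (len <= MAX_AUDIT_ARG_LEN) {
      			printf("a%d=%s\n", argnum, arg);
      			return;
      		}
      		for (off = 0, piece = 0; off < len;
      		     off += MAX_AUDIT_ARG_LEN, piece++)
      			printf("a%d[%zu]=%.*s\n", argnum, piece,
      			       MAX_AUDIT_ARG_LEN, arg + off);
      	}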
      Signed-off-by: Eric Paris <eparis@redhat.com>
  20. 30 Jan 2008, 1 commit
  21. 29 Jan 2008, 4 commits
  22. 26 Jan 2008, 5 commits
    • softlockup: fix signedness · 90739081
      Authored by Ingo Molnar
      Fix the signedness of the softlockup tunables.
      
      Mark the tunables read-mostly.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: latencytop support · 9745512c
      Authored by Arjan van de Ven
      LatencyTOP kernel infrastructure; it measures latencies in the
      scheduler and tracks them system-wide and per-process.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt time limit · fa85ae24
      Authored by Peter Zijlstra
      A very simple time limit on the realtime scheduling classes: allow the
      rq's realtime class to consume sched_rt_ratio of every sched_rt_period
      slice.  If the class exceeds this quota, the fair class will preempt it.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks · 82a1fcb9
      Authored by Ingo Molnar
      This patch extends the soft-lockup detector to automatically detect
      hung TASK_UNINTERRUPTIBLE tasks.  Such hung tasks are reported the
      following way:
      
       ------------------>
       INFO: task prctl:3042 blocked for more than 120 seconds.
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message
       prctl         D fd5e3793     0  3042   2997
              f6050f38 00000046 00000001 fd5e3793 00000009 c06d8264 c06dae80 00000286
              f6050f40 f6050f00 f7d34d90 f7d34fc8 c1e1be80 00000001 f6050000 00000000
              f7e92d00 00000286 f6050f18 c0489d1a f6050f40 00006605 00000000 c0133a5b
       Call Trace:
        [<c04883a5>] schedule_timeout+0x6d/0x8b
        [<c04883d8>] schedule_timeout_uninterruptible+0x15/0x17
        [<c0133a76>] msleep+0x10/0x16
        [<c0138974>] sys_prctl+0x30/0x1e2
        [<c0104c52>] sysenter_past_esp+0x5f/0xa5
        =======================
       2 locks held by prctl/3042:
       #0:  (&sb->s_type->i_mutex_key#5){--..}, at: [<c0197d11>] do_fsync+0x38/0x7a
       #1:  (jbd_handle){--..}, at: [<c01ca3d2>] journal_start+0xc7/0xe9
       <------------------
      
      The current default timeout is 120 seconds.  Such messages are printed
      up to 10 times per bootup.  If the system has crashed already, the
      messages are not printed.
      
      If lockdep is enabled, all held locks are printed as well.
      
      This feature is a natural extension of the softlockup detector (kernel
      locked up without scheduling) and of the NMI watchdog (kernel locked up
      with IRQs disabled).
      
      [ Gautham R Shenoy <ego@in.ibm.com>: CPU hotplug fixes. ]
      [ Andrew Morton <akpm@linux-foundation.org>: build warning fix. ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
    • sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups · 6b2d7700
      Authored by Srivatsa Vaddagiri
      The current load balancing scheme isn't good enough for precise
      group fairness.
      
      For example: on an 8-cpu system, I created 3 groups as follows:
      
      	a = 8 tasks (cpu.shares = 1024)
      	b = 4 tasks (cpu.shares = 1024)
      	c = 3 tasks (cpu.shares = 1024)
      
      a, b and c are task groups that have equal weight. We would expect each
      of the groups to receive 33.33% of cpu bandwidth under a fair scheduler.
      
      This is what I get with the latest scheduler git tree:
      --------------------------------------------------------------------------------
      Col1  | Col2    | Col3  |  Col4
      ------|---------|-------|-------------------------------------------------------
      a     | 277.676 | 57.8% | 54.1%  54.1%  54.1%  54.2%  56.7%  62.2%  62.8% 64.5%
      b     | 116.108 | 24.2% | 47.4%  48.1%  48.7%  49.3%
      c     |  86.326 | 18.0% | 47.5%  47.9%  48.5%
      --------------------------------------------------------------------------------
      
      Explanation of the output:
      
      Col1 -> Group name
      Col2 -> Cumulative execution time (in seconds) received by all tasks of that
      	group in a 60sec window across 8 cpus
      Col3 -> CPU bandwidth received by the group in the 60sec window, expressed in
              percentage. Col3 data is derived as:
      		Col3 = 100 * Col2 / (NR_CPUS * 60)
      Col4 -> CPU bandwidth received by each individual task of the group.
      		Col4 = 100 * cpu_time_recd_by_task / 60
      
      [I can share the test case that produces similar output if required]
      
      The deviation from desired group fairness is as below:
      
      	a = +24.47%
      	b = -9.13%
      	c = -15.33%
      
      which is quite high.
      
      After the patch below is applied, here are the results:
      
      --------------------------------------------------------------------------------
      Col1  | Col2    | Col3  |  Col4
      ------|---------|-------|-------------------------------------------------------
      a     | 163.112 | 34.0% | 33.2%  33.4%  33.5%  33.5%  33.7%  34.4%  34.8% 35.3%
      b     | 156.220 | 32.5% | 63.3%  64.5%  66.1%  66.5%
      c     | 160.653 | 33.5% | 85.8%  90.6%  91.4%
      --------------------------------------------------------------------------------
      
      Deviation from desired group fairness is as below:
      
      	a = +0.67%
      	b = -0.83%
      	c = +0.17%
      
      which is far better IMO.  Most other runs have yielded a deviation
      within +-2% at the most, which is good.
      
      Why do we see bad (group) fairness with the current scheduler?
      =========================================================
      
      Currently a cpu's weight is just the summation of the individual task
      weights.  This can yield incorrect results.  For example, consider
      three groups as below on a 2-cpu system:
      
      	CPU0	CPU1
      ---------------------------
      	A (10)  B(5)
      		C(5)
      ---------------------------
      
      Group A has 10 tasks, all on CPU0; groups B and C have 5 tasks each,
      all of which are on CPU1.  Each task has the same weight (NICE_0_LOAD =
      1024).
      
      The current scheme yields a cpu weight of 10240 (10*1024) for each cpu,
      so the load balancer will think both cpus are perfectly balanced and
      won't move any tasks around.  This, however, yields this bandwidth:
      
      	A = 50%
      	B = 25%
      	C = 25%
      
      which is not the desired result.
      
      What's changing in the patch?
      =============================
      
      	- How cpu weights are calculated when CONFIG_FAIR_GROUP_SCHED is
      	  defined (see below)
      	- API Change
      		- Two tunables introduced in sysfs (under SCHED_DEBUG) to
      		  control the frequency at which the load balance monitor
      		  thread runs.
      
      The basic change made in this patch is how cpu weight (rq->load.weight)
      is calculated.  It is now calculated as the summation of group weights
      on a cpu, rather than the summation of task weights.  The weight
      exerted by a group on a cpu depends on the shares allocated to it, and
      also on the number of tasks the group has on that cpu compared to the
      total number of (runnable) tasks the group has in the system.
      
      Let,
      	W(K,i)  = Weight of group K on cpu i
      	T(K,i)  = Task load present in group K's cfs_rq on cpu i
      	T(K)    = Total task load of group K across various cpus
      	S(K) 	= Shares allocated to group K
      	NRCPUS	= Number of online cpus in the scheduler domain to
      	 	  which group K is assigned.
      
      Then,
      	W(K,i) = S(K) * NRCPUS * T(K,i) / T(K)
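      
      As a sanity check (our arithmetic, reusing the 2-cpu example above):
      with S = 1024 and NRCPUS = 2, each group runs all of its load on a
      single cpu, so T(K,i) = T(K) and W(K,i) = 1024 * 2 * 1 = 2048 for each
      of A, B and C.  CPU0 then carries a weight of 2048 (A alone) while CPU1
      carries 4096 (B plus C), so the load balancer sees the imbalance and
      migrates tasks, instead of treating 10240 vs 10240 as perfectly
      balanced.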
      
      A load balance monitor thread is created at bootup, which periodically
      runs and adjusts group's weight on each cpu. To avoid its overhead, two
      min/max tunables are introduced (under SCHED_DEBUG) to control the rate
      at which it runs.
      
      Fixes from: Peter Zijlstra <a.p.zijlstra@chello.nl>
      
      - don't start the load_balance_monitor when there is only a single cpu.
      - rename the kthread, because its name is currently longer than
        TASK_COMM_LEN allows.
      Signed-off-by: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  23. 18 12月, 2007 1 次提交