1. 28 4月, 2008 1 次提交
    • A
      capabilities: implement per-process securebits · 3898b1b4
      Andrew G. Morgan 提交于
      Filesystem capability support makes it possible to do away with (set)uid-0
      based privilege and use capabilities instead.  That is, with filesystem
      support for capabilities but without this present patch, it is (conceptually)
      possible to manage a system with capabilities alone and never need to obtain
      privilege via (set)uid-0.
      
      Of course, conceptually isn't quite the same as currently possible since few
      user applications, certainly not enough to run a viable system, are currently
      prepared to leverage capabilities to exercise privilege.  Further, many
      applications exist that may never get upgraded in this way, and the kernel
      will continue to want to support their setuid-0 base privilege needs.
      
      Where pure-capability applications evolve and replace setuid-0 binaries, it is
      desirable that there be a mechanisms by which they can contain their
      privilege.  In addition to leveraging the per-process bounding and inheritable
      sets, this should include suppressing the privilege of the uid-0 superuser
      from the process' tree of children.
      
      The feature added by this patch can be leveraged to suppress the privilege
      associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
      initiate, and only immediately affects the 'current' process (it is inherited
      through fork()/exec()).  This reimplementation differs significantly from the
      historical support for securebits which was system-wide, unwieldy and which
      has ultimately withered to a dead relic in the source of the modern kernel.
      
      With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
      all legacy privilege (through uid=0) for itself and all subsequently
      fork()'d/exec()'d children with:
      
        prctl(PR_SET_SECUREBITS, 0x2f);
      
      This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
      enabled at configure time.
      
      [akpm@linux-foundation.org: fix uninitialised var warning]
      [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
      Signed-off-by: NAndrew G. Morgan <morgan@kernel.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Paul Moore <paul.moore@hp.com>
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3898b1b4
  2. 27 4月, 2008 1 次提交
    • C
      s390: KVM preparation: provide hook to enable pgstes in user pagetable · 402b0862
      Carsten Otte 提交于
      The SIE instruction on s390 uses the 2nd half of the page table page to
      virtualize the storage keys of a guest. This patch offers the s390_enable_sie
      function, which reorganizes the page tables of a single-threaded process to
      reserve space in the page table:
      s390_enable_sie makes sure that the process is single threaded and then uses
      dup_mm to create a new mm with reorganized page tables. The old mm is freed
      and the process has now a page status extended field after every page table.
      
      Code that wants to exploit pgstes should SELECT CONFIG_PGSTE.
      
      This patch has a small common code hit, namely making dup_mm non-static.
      
      Edit (Carsten): I've modified Martin's patch, following Jeremy Fitzhardinge's
      review feedback. Now we do have the prototype for dup_mm in
      include/linux/sched.h. Following Martin's suggestion, s390_enable_sie() does now
      call task_lock() to prevent race against ptrace modification of mm_users.
      Signed-off-by: NMartin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: NCarsten Otte <cotte@de.ibm.com>
      Acked-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAvi Kivity <avi@qumranet.com>
      402b0862
  3. 24 4月, 2008 1 次提交
  4. 20 4月, 2008 10 次提交
  5. 19 4月, 2008 1 次提交
  6. 26 3月, 2008 1 次提交
    • T
      NOHZ: reevaluate idle sleep length after add_timer_on() · 06d8308c
      Thomas Gleixner 提交于
      add_timer_on() can add a timer on a CPU which is currently in a long
      idle sleep, but the timer wheel is not reevaluated by the nohz code on
      that CPU. So a timer can be delayed for quite a long time. This
      triggered a false positive in the clocksource watchdog code.
      
      To avoid this we need to wake up the idle CPU and enforce the
      reevaluation of the timer wheel for the next timer event.
      
      Add a function, which checks a given CPU for idle state, marks the
      idle task with NEED_RESCHED and sends a reschedule IPI to notify the
      other CPU of the change in the timer wheel.
      
      Call this function from add_timer_on().
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: stable@kernel.org
      
      --
       include/linux/sched.h |    6 ++++++
       kernel/sched.c        |   43 +++++++++++++++++++++++++++++++++++++++++++
       kernel/timer.c        |   10 +++++++++-
       3 files changed, 58 insertions(+), 1 deletion(-)
      06d8308c
  7. 21 3月, 2008 1 次提交
  8. 19 3月, 2008 1 次提交
    • I
      sched: improve affine wakeups · 4ae7d5ce
      Ingo Molnar 提交于
      improve affine wakeups. Maintain the 'overlap' metric based on CFS's
      sum_exec_runtime - which means the amount of time a task executes
      after it wakes up some other task.
      
      Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
      it means there's strong workload coupling between this task and the
      woken up task. If the 'overlap' is large then the workload is decoupled
      and the scheduler will move them to separate CPUs more easily.
      
      ( Also slightly move the preempt_check within try_to_wake_up() - this has
        no effect on functionality but allows 'early wakeups' (for still-on-rq
        tasks) to be correctly accounted as well.)
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4ae7d5ce
  9. 07 3月, 2008 1 次提交
  10. 05 3月, 2008 1 次提交
    • P
      sched: revert load_balance_monitor() changes · 62fb1851
      Peter Zijlstra 提交于
      The following commits cause a number of regressions:
      
        commit 58e2d4ca
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduling, change how cpu load is calculated
      
        commit 6b2d7700
        Author: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
        Date:   Fri Jan 25 21:08:00 2008 +0100
        sched: group scheduler, fix fairness of cpu bandwidth allocation for task groups
      
      Namely:
       - very frequent wakeups on SMP, reported by PowerTop users.
       - cacheline trashing on (large) SMP
       - some latencies larger than 500ms
      
      While there is a mergeable patch to fix the latter, the former issues
      are not fixable in a manner suitable for .25 (we're at -rc3 now).
      
      Hence we revert them and try again in v2.6.26.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      CC: Srivatsa Vaddagiri <vatsa@linux.vnet.ibm.com>
      Tested-by: NAlexey Zaytsev <alexey.zaytsev@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      62fb1851
  11. 26 2月, 2008 1 次提交
  12. 25 2月, 2008 1 次提交
  13. 14 2月, 2008 1 次提交
  14. 13 2月, 2008 2 次提交
  15. 09 2月, 2008 4 次提交
  16. 08 2月, 2008 1 次提交
  17. 07 2月, 2008 1 次提交
  18. 06 2月, 2008 4 次提交
    • S
      capabilities: introduce per-process capability bounding set · 3b7391de
      Serge E. Hallyn 提交于
      The capability bounding set is a set beyond which capabilities cannot grow.
       Currently cap_bset is per-system.  It can be manipulated through sysctl,
      but only init can add capabilities.  Root can remove capabilities.  By
      default it includes all caps except CAP_SETPCAP.
      
      This patch makes the bounding set per-process when file capabilities are
      enabled.  It is inherited at fork from parent.  Noone can add elements,
      CAP_SETPCAP is required to remove them.
      
      One example use of this is to start a safer container.  For instance, until
      device namespaces or per-container device whitelists are introduced, it is
      best to take CAP_MKNOD away from a container.
      
      The bounding set will not affect pP and pE immediately.  It will only
      affect pP' and pE' after subsequent exec()s.  It also does not affect pI,
      and exec() does not constrain pI'.  So to really start a shell with no way
      of regain CAP_MKNOD, you would do
      
      	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
      	cap_t cap = cap_get_proc();
      	cap_value_t caparray[1];
      	caparray[0] = CAP_MKNOD;
      	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
      	cap_set_proc(cap);
      	cap_free(cap);
      
      The following test program will get and set the bounding
      set (but not pI).  For instance
      
      	./bset get
      		(lists capabilities in bset)
      	./bset drop cap_net_raw
      		(starts shell with new bset)
      		(use capset, setuid binary, or binary with
      		file capabilities to try to increase caps)
      
      ************************************************************
      cap_bound.c
      ************************************************************
       #include <sys/prctl.h>
       #include <linux/capability.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
      
       #ifndef PR_CAPBSET_READ
       #define PR_CAPBSET_READ 23
       #endif
      
       #ifndef PR_CAPBSET_DROP
       #define PR_CAPBSET_DROP 24
       #endif
      
      int usage(char *me)
      {
      	printf("Usage: %s get\n", me);
      	printf("       %s drop <capability>\n", me);
      	return 1;
      }
      
       #define numcaps 32
      char *captable[numcaps] = {
      	"cap_chown",
      	"cap_dac_override",
      	"cap_dac_read_search",
      	"cap_fowner",
      	"cap_fsetid",
      	"cap_kill",
      	"cap_setgid",
      	"cap_setuid",
      	"cap_setpcap",
      	"cap_linux_immutable",
      	"cap_net_bind_service",
      	"cap_net_broadcast",
      	"cap_net_admin",
      	"cap_net_raw",
      	"cap_ipc_lock",
      	"cap_ipc_owner",
      	"cap_sys_module",
      	"cap_sys_rawio",
      	"cap_sys_chroot",
      	"cap_sys_ptrace",
      	"cap_sys_pacct",
      	"cap_sys_admin",
      	"cap_sys_boot",
      	"cap_sys_nice",
      	"cap_sys_resource",
      	"cap_sys_time",
      	"cap_sys_tty_config",
      	"cap_mknod",
      	"cap_lease",
      	"cap_audit_write",
      	"cap_audit_control",
      	"cap_setfcap"
      };
      
      int getbcap(void)
      {
      	int comma=0;
      	unsigned long i;
      	int ret;
      
      	printf("i know of %d capabilities\n", numcaps);
      	printf("capability bounding set:");
      	for (i=0; i<numcaps; i++) {
      		ret = prctl(PR_CAPBSET_READ, i);
      		if (ret < 0)
      			perror("prctl");
      		else if (ret==1)
      			printf("%s%s", (comma++) ? ", " : " ", captable[i]);
      	}
      	printf("\n");
      	return 0;
      }
      
      int capdrop(char *str)
      {
      	unsigned long i;
      
      	int found=0;
      	for (i=0; i<numcaps; i++) {
      		if (strcmp(captable[i], str) == 0) {
      			found=1;
      			break;
      		}
      	}
      	if (!found)
      		return 1;
      	if (prctl(PR_CAPBSET_DROP, i)) {
      		perror("prctl");
      		return 1;
      	}
      	return 0;
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc<2)
      		return usage(argv[0]);
      	if (strcmp(argv[1], "get")==0)
      		return getbcap();
      	if (strcmp(argv[1], "drop")!=0 || argc<3)
      		return usage(argv[0]);
      	if (capdrop(argv[2])) {
      		printf("unknown capability\n");
      		return 1;
      	}
      	return execl("/bin/bash", "/bin/bash", NULL);
      }
      ************************************************************
      
      [serue@us.ibm.com: fix typo]
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew G. Morgan <morgan@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>a
      Signed-off-by: N"Serge E. Hallyn" <serue@us.ibm.com>
      Tested-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b7391de
    • D
      maps4: rework TASK_SIZE macros · 82455257
      Dave Hansen 提交于
      The following replaces the earlier patches sent.  It should address
      David Rientjes's comments, and has been compile tested on all the
      architectures that it touches, save for parisc.
      
      For the /proc/<pid>/pagemap code[1], we need to able to query how
      much virtual address space a particular task has.  The trick is
      that we do it through /proc and can't use TASK_SIZE since it
      references "current" on some arches.  The process opening the
      /proc file might be a 32-bit process opening a 64-bit process's
      pagemap file.
      
      x86_64 already has a TASK_SIZE_OF() macro:
      
      #define TASK_SIZE_OF(child)     ((test_tsk_thread_flag(child, TIF_IA32)) ? IA32_PAGE_OFFSET : TASK_SIZE64)
      
      I'd like to have that for other architectures.  So, add it
      for all the architectures that actually use "current" in
      their TASK_SIZE.  For the others, just add a quick #define
      in sched.h to use plain old TASK_SIZE.
      
      1. http://www.linuxworld.com/news/2007/042407-kernel.html
      
      - MIPS portion from Ralf Baechle <ralf@linux-mips.org>
      
      [akpm@linux-foundation.org: fix mips build]
      Signed-off-by: NDave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NRalf Baechle <ralf@linux-mips.org>
      Signed-off-by: NMatt Mackall <mpm@selenic.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82455257
    • O
      exec: rework the group exit and fix the race with kill · ed5d2cac
      Oleg Nesterov 提交于
      As Roland pointed out, we have the very old problem with exec.  de_thread()
      sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
      clears signal->flags.  All signals (even fatal ones) sent in this window
      (which is not too small) will be lost.
      
      With this patch exec doesn't abuse SIGNAL_GROUP_EXIT.  signal_group_exit(),
      the new helper, should be used to detect exit_group() or exec() in progress.
      It can have more users, but this patch does only strictly necessary changes.
      Signed-off-by: NOleg Nesterov <oleg@tv-sign.ru>
      Cc: Davide Libenzi <davidel@xmailserver.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Robin Holt <holt@sgi.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed5d2cac
    • A
      get_task_comm(): return the result · 59714d65
      Andrew Morton 提交于
      It was dumb to make get_task_comm() return void.  Change it to return a
      pointer to the resulting output for caller convenience.
      
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59714d65
  19. 02 2月, 2008 2 次提交
  20. 30 1月, 2008 1 次提交
    • N
      spinlock: lockbreak cleanup · 95c354fe
      Nick Piggin 提交于
      The break_lock data structure and code for spinlocks is quite nasty.
      Not only does it double the size of a spinlock but it changes locking to
      a potentially less optimal trylock.
      
      Put all of that under CONFIG_GENERIC_LOCKBREAK, and introduce a
      __raw_spin_is_contended that uses the lock data itself to determine whether
      there are waiters on the lock, to be used if CONFIG_GENERIC_LOCKBREAK is
      not set.
      
      Rename need_lockbreak to spin_needbreak, make it use spin_is_contended to
      decouple it from the spinlock implementation, and make it typesafe (rwlocks
      do not have any need_lockbreak sites -- why do they even get bloated up
      with that break_lock then?).
      Signed-off-by: NNick Piggin <npiggin@suse.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      95c354fe
  21. 28 1月, 2008 2 次提交
  22. 26 1月, 2008 1 次提交