1. 13 1月, 2012 1 次提交
    • C
      c/r: prctl: add PR_SET_MM codes to set up mm_struct entries · 028ee4be
      Cyrill Gorcunov 提交于
      When we restore a task we need to set up text, data and data heap sizes
      from userspace to the values a task had at checkpoint time.  This patch
      adds auxilary prctl codes for that.
      
      While most of them have a statistical nature (their values are involved
      into calculation of /proc/<pid>/statm output) the start_brk and brk values
      are used to compute an allowed size of program data segment expansion.
      Which means an arbitrary changes of this values might be dangerous
      operation.  So to restrict access the following requirements applied to
      prctl calls:
      
       - The process has to have CAP_SYS_ADMIN capability granted.
       - For all opcodes except start_brk/brk members an appropriate
         VMA area must exist and should fit certain VMA flags,
         such as:
         - code segment must be executable but not writable;
         - data segment must not be executable.
      
      start_brk/brk values must not intersect with data segment and must not
      exceed RLIMIT_DATA resource limit.
      
      Still the main guard is CAP_SYS_ADMIN capability check.
      
      Note the kernel should be compiled with CONFIG_CHECKPOINT_RESTORE support
      otherwise these prctl calls will return -EINVAL.
      
      [akpm@linux-foundation.org: cache current->mm in a local, saving 200 bytes text]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Andrew Vagin <avagin@openvz.org>
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      028ee4be
  2. 04 10月, 2009 1 次提交
    • A
      HWPOISON: Clean up PR_MCE_KILL interface · 1087e9b4
      Andi Kleen 提交于
      While writing the manpage I noticed some shortcomings in the
      current interface.
      
      - Define symbolic names for all the different values
      - Boundary check the kill mode values
      - For symmetry add a get interface too. This allows library
      code to get/set the current state.
      - For consistency define a PR_MCE_KILL_DEFAULT value
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      1087e9b4
  3. 21 9月, 2009 1 次提交
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar 提交于
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cdd6c482
  4. 16 9月, 2009 1 次提交
    • A
      HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process · 4db96cf0
      Andi Kleen 提交于
      This allows processes to override their early/late kill
      behaviour on hardware memory errors.
      
      Typically applications which are memory error aware is
      better of with early kill (see the error as soon
      as possible), all others with late kill (only
      see the error when the error is really impacting execution)
      
      There's a global sysctl, but this way an application
      can set its specific policy.
      
      We're using two bits, one to signify that the process
      stated its intention and that
      
      I also made the prctl future proof by enforcing
      the unused arguments are 0.
      
      The state is inherited to children.
      
      Note this makes us officially run out of process flags
      on 32bit, but the next patch can easily add another field.
      
      Manpage patch will be supplied separately.
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      4db96cf0
  5. 11 12月, 2008 1 次提交
  6. 06 9月, 2008 1 次提交
    • A
      hrtimer: create a "timer_slack" field in the task struct · 6976675d
      Arjan van de Ven 提交于
      We want to be able to control the default "rounding" that is used by
      select() and poll() and friends. This is a per process property
      (so that we can have a "nice" like program to start certain programs with
      a looser or stricter rounding) that can be set/get via a prctl().
      
      For this purpose, a field called "timer_slack_ns" is added to the task
      struct. In addition, a field called "default_timer_slack"ns" is added
      so that tasks easily can temporarily to a more/less accurate slack and then
      back to the default.
      
      The default value of the slack is set to 50 usec; this is significantly less
      than 2.6.27's average select() and poll() timing error but still allows
      the kernel to group timers somewhat to preserve power behavior. Applications
      and admins can override this via the prctl()
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      6976675d
  7. 28 4月, 2008 1 次提交
    • A
      capabilities: implement per-process securebits · 3898b1b4
      Andrew G. Morgan 提交于
      Filesystem capability support makes it possible to do away with (set)uid-0
      based privilege and use capabilities instead.  That is, with filesystem
      support for capabilities but without this present patch, it is (conceptually)
      possible to manage a system with capabilities alone and never need to obtain
      privilege via (set)uid-0.
      
      Of course, conceptually isn't quite the same as currently possible since few
      user applications, certainly not enough to run a viable system, are currently
      prepared to leverage capabilities to exercise privilege.  Further, many
      applications exist that may never get upgraded in this way, and the kernel
      will continue to want to support their setuid-0 base privilege needs.
      
      Where pure-capability applications evolve and replace setuid-0 binaries, it is
      desirable that there be a mechanisms by which they can contain their
      privilege.  In addition to leveraging the per-process bounding and inheritable
      sets, this should include suppressing the privilege of the uid-0 superuser
      from the process' tree of children.
      
      The feature added by this patch can be leveraged to suppress the privilege
      associated with (set)uid-0.  This suppression requires CAP_SETPCAP to
      initiate, and only immediately affects the 'current' process (it is inherited
      through fork()/exec()).  This reimplementation differs significantly from the
      historical support for securebits which was system-wide, unwieldy and which
      has ultimately withered to a dead relic in the source of the modern kernel.
      
      With this patch applied a process, that is capable(CAP_SETPCAP), can now drop
      all legacy privilege (through uid=0) for itself and all subsequently
      fork()'d/exec()'d children with:
      
        prctl(PR_SET_SECUREBITS, 0x2f);
      
      This patch represents a no-op unless CONFIG_SECURITY_FILE_CAPABILITIES is
      enabled at configure time.
      
      [akpm@linux-foundation.org: fix uninitialised var warning]
      [serue@us.ibm.com: capabilities: use cap_task_prctl when !CONFIG_SECURITY]
      Signed-off-by: NAndrew G. Morgan <morgan@kernel.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: Paul Moore <paul.moore@hp.com>
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3898b1b4
  8. 20 4月, 2008 1 次提交
  9. 06 2月, 2008 1 次提交
    • S
      capabilities: introduce per-process capability bounding set · 3b7391de
      Serge E. Hallyn 提交于
      The capability bounding set is a set beyond which capabilities cannot grow.
       Currently cap_bset is per-system.  It can be manipulated through sysctl,
      but only init can add capabilities.  Root can remove capabilities.  By
      default it includes all caps except CAP_SETPCAP.
      
      This patch makes the bounding set per-process when file capabilities are
      enabled.  It is inherited at fork from parent.  Noone can add elements,
      CAP_SETPCAP is required to remove them.
      
      One example use of this is to start a safer container.  For instance, until
      device namespaces or per-container device whitelists are introduced, it is
      best to take CAP_MKNOD away from a container.
      
      The bounding set will not affect pP and pE immediately.  It will only
      affect pP' and pE' after subsequent exec()s.  It also does not affect pI,
      and exec() does not constrain pI'.  So to really start a shell with no way
      of regain CAP_MKNOD, you would do
      
      	prctl(PR_CAPBSET_DROP, CAP_MKNOD);
      	cap_t cap = cap_get_proc();
      	cap_value_t caparray[1];
      	caparray[0] = CAP_MKNOD;
      	cap_set_flag(cap, CAP_INHERITABLE, 1, caparray, CAP_DROP);
      	cap_set_proc(cap);
      	cap_free(cap);
      
      The following test program will get and set the bounding
      set (but not pI).  For instance
      
      	./bset get
      		(lists capabilities in bset)
      	./bset drop cap_net_raw
      		(starts shell with new bset)
      		(use capset, setuid binary, or binary with
      		file capabilities to try to increase caps)
      
      ************************************************************
      cap_bound.c
      ************************************************************
       #include <sys/prctl.h>
       #include <linux/capability.h>
       #include <sys/types.h>
       #include <unistd.h>
       #include <stdio.h>
       #include <stdlib.h>
       #include <string.h>
      
       #ifndef PR_CAPBSET_READ
       #define PR_CAPBSET_READ 23
       #endif
      
       #ifndef PR_CAPBSET_DROP
       #define PR_CAPBSET_DROP 24
       #endif
      
      int usage(char *me)
      {
      	printf("Usage: %s get\n", me);
      	printf("       %s drop <capability>\n", me);
      	return 1;
      }
      
       #define numcaps 32
      char *captable[numcaps] = {
      	"cap_chown",
      	"cap_dac_override",
      	"cap_dac_read_search",
      	"cap_fowner",
      	"cap_fsetid",
      	"cap_kill",
      	"cap_setgid",
      	"cap_setuid",
      	"cap_setpcap",
      	"cap_linux_immutable",
      	"cap_net_bind_service",
      	"cap_net_broadcast",
      	"cap_net_admin",
      	"cap_net_raw",
      	"cap_ipc_lock",
      	"cap_ipc_owner",
      	"cap_sys_module",
      	"cap_sys_rawio",
      	"cap_sys_chroot",
      	"cap_sys_ptrace",
      	"cap_sys_pacct",
      	"cap_sys_admin",
      	"cap_sys_boot",
      	"cap_sys_nice",
      	"cap_sys_resource",
      	"cap_sys_time",
      	"cap_sys_tty_config",
      	"cap_mknod",
      	"cap_lease",
      	"cap_audit_write",
      	"cap_audit_control",
      	"cap_setfcap"
      };
      
      int getbcap(void)
      {
      	int comma=0;
      	unsigned long i;
      	int ret;
      
      	printf("i know of %d capabilities\n", numcaps);
      	printf("capability bounding set:");
      	for (i=0; i<numcaps; i++) {
      		ret = prctl(PR_CAPBSET_READ, i);
      		if (ret < 0)
      			perror("prctl");
      		else if (ret==1)
      			printf("%s%s", (comma++) ? ", " : " ", captable[i]);
      	}
      	printf("\n");
      	return 0;
      }
      
      int capdrop(char *str)
      {
      	unsigned long i;
      
      	int found=0;
      	for (i=0; i<numcaps; i++) {
      		if (strcmp(captable[i], str) == 0) {
      			found=1;
      			break;
      		}
      	}
      	if (!found)
      		return 1;
      	if (prctl(PR_CAPBSET_DROP, i)) {
      		perror("prctl");
      		return 1;
      	}
      	return 0;
      }
      
      int main(int argc, char *argv[])
      {
      	if (argc<2)
      		return usage(argv[0]);
      	if (strcmp(argv[1], "get")==0)
      		return getbcap();
      	if (strcmp(argv[1], "drop")!=0 || argc<3)
      		return usage(argv[0]);
      	if (capdrop(argv[2])) {
      		printf("unknown capability\n");
      		return 1;
      	}
      	return execl("/bin/bash", "/bin/bash", NULL);
      }
      ************************************************************
      
      [serue@us.ibm.com: fix typo]
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew G. Morgan <morgan@kernel.org>
      Cc: Stephen Smalley <sds@tycho.nsa.gov>
      Cc: James Morris <jmorris@namei.org>
      Cc: Chris Wright <chrisw@sous-sol.org>
      Cc: Casey Schaufler <casey@schaufler-ca.com>a
      Signed-off-by: N"Serge E. Hallyn" <serue@us.ibm.com>
      Tested-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3b7391de
  10. 17 7月, 2007 1 次提交
  11. 09 6月, 2006 1 次提交
    • A
      [PATCH] Add a prctl to change the endianness of a process. · 651d765d
      Anton Blanchard 提交于
      This new prctl is intended for changing the execution mode of the
      processor, on processors that support both a little-endian mode and a
      big-endian mode.  It is intended for use by programs such as
      instruction set emulators (for example an x86 emulator on PowerPC),
      which may find it convenient to use the processor in an alternate
      endianness mode when executing translated instructions.
      
      Note that this does not imply the existence of a fully-fledged ABI for
      both endiannesses, or of compatibility code for converting system
      calls done in the non-native endianness mode.  The program is expected
      to arrange for all of its system call arguments to be presented in the
      native endianness.
      
      Switching between big and little-endian mode will require some care in
      constructing the instruction sequence for the switch.  Generally the
      instructions up to the instruction that invokes the prctl system call
      will have to be in the old endianness, and subsequent instructions
      will have to be in the new endianness.
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      651d765d
  12. 17 4月, 2005 1 次提交
    • L
      Linux-2.6.12-rc2 · 1da177e4
      Linus Torvalds 提交于
      Initial git repository build. I'm not bothering with the full history,
      even though we have it. We can create a separate "historical" git
      archive of that later if we want to, and in the meantime it's about
      3.2GB when imported into git - space that would just make the early
      git days unnecessarily complicated, when we don't have a lot of good
      infrastructure for it.
      
      Let it rip!
      1da177e4