1. 22 5月, 2009 1 次提交
    • P
      perf_counter: Dynamically allocate tasks' perf_counter_context struct · a63eaf34
      Paul Mackerras 提交于
      This replaces the struct perf_counter_context in the task_struct with
      a pointer to a dynamically allocated perf_counter_context struct.  The
      main reason for doing is this is to allow us to transfer a
      perf_counter_context from one task to another when we do lazy PMU
      switching in a later patch.
      
      This has a few side-benefits: the task_struct becomes a little smaller,
      we save some memory because only tasks that have perf_counters attached
      get a perf_counter_context allocated for them, and we can remove the
      inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
      end up recompiling nearly everything whenever perf_counter.h changes.
      
      The perf_counter_context structures are reference-counted and freed
      when the last reference is dropped.  A context can have references
      from its task and the counters on its task.  Counters can outlive the
      task so it is possible that a context will be freed well after its
      task has exited.
      
      Contexts are allocated on fork if the parent had a context, or
      otherwise the first time that a per-task counter is created on a task.
      In the latter case, we set the context pointer in the task struct
      locklessly using an atomic compare-and-exchange operation in case we
      raced with some other task in creating a context for the subject task.
      
      This also removes the task pointer from the perf_counter struct.  The
      task pointer was not used anywhere and would make it harder to move a
      context from one task to another.  Anything that needed to know which
      task a counter was attached to was already using counter->ctx->task.
      
      The __perf_counter_init_context function moves up in perf_counter.c
      so that it can be called from find_get_context, and now initializes
      the refcount, but is otherwise unchanged.
      
      We were potentially calling list_del_counter twice: once from
      __perf_counter_exit_task when the task exits and once from
      __perf_counter_remove_from_context when the counter's fd gets closed.
      This adds a check in list_del_counter so it doesn't do anything if
      the counter has already been removed from the lists.
      
      Since perf_counter_task_sched_in doesn't do anything if the task doesn't
      have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
      __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
      the case where the current task adds the first counter to itself and
      thus creates a context for itself.
      
      This also adds similar code to __perf_counter_enable to handle a
      similar situation which can arise when the counters have been disabled
      using prctl; that also leaves cpuctx->task_ctx = NULL.
      
      [ Impact: refactor counter context management to prepare for new feature ]
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a63eaf34
  2. 08 4月, 2009 1 次提交
  3. 03 4月, 2009 4 次提交
    • O
      pids: kill signal_struct-> __pgrp/__session and friends · 1b0f7ffd
      Oleg Nesterov 提交于
      We are wasting 2 words in signal_struct without any reason to implement
      task_pgrp_nr() and task_session_nr().
      
      task_session_nr() has no callers since
      2e2ba22e, we can remove it.
      
      task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and
      fs/coda.
      
      This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills
      __pgrp/__session and the related helpers.
      
      The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense
      anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: Alan Cox <number6@the-village.bc.nu>		[tty parts]
      Cc: Cedric Le Goater <clg@fr.ibm.com>
      Cc: Dave Hansen <haveblue@us.ibm.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Serge Hallyn <serue@us.ibm.com>
      Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b0f7ffd
    • S
      signals: protect cinit from blocked fatal signals · b3bfa0cb
      Sukadev Bhattiprolu 提交于
      Normally SIG_DFL signals to global and container-init are dropped early.
      But if a signal is blocked when it is posted, we cannot drop the signal
      since the receiver may install a handler before unblocking the signal.
      Once this signal is queued however, the receiver container-init has no way
      of knowing if the signal was sent from an ancestor or descendant
      namespace.  This patch ensures that contianer-init drops all SIG_DFL
      signals in get_signal_to_deliver() except SIGKILL/SIGSTOP.
      
      If SIGSTOP/SIGKILL originate from a descendant of container-init they are
      never queued (i.e dropped in sig_ignored() in an earler patch).
      
      If SIGSTOP/SIGKILL originate from parent namespace, the signal is queued
      and container-init processes the signal.
      
      IOW, if get_signal_to_deliver() sees a sig_kernel_only() signal for global
      or container-init, the signal must have been generated internally or must
      have come from an ancestor ns and we process the signal.
      
      Further, the signal_group_exit() check was needed to cover the case of a
      multi-threaded init sending SIGKILL to other threads when doing an exit()
      or exec().  But since the new sig_kernel_only() check covers the SIGKILL,
      the signal_group_exit() check is no longer needed and can be removed.
      
      Finally, now that we have all pieces in place, set SIGNAL_UNKILLABLE for
      container-inits.
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Daniel Lezcano <daniel.lezcano@free.fr>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3bfa0cb
    • A
      Simplify copy_thread() · 6f2c55b8
      Alexey Dobriyan 提交于
      First argument unused since 2.3.11.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f2c55b8
    • D
      nommu: fix a number of issues with the per-MM VMA patch · 33e5d769
      David Howells 提交于
      Fix a number of issues with the per-MM VMA patch:
      
       (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
           a NOMMU system with more than 2G pages.  Makes no difference on a 32-bit
           system.
      
       (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
           lest it overflow.
      
       (3) Move the allocation of the vm_area_struct slab back for fork.c.
      
       (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.
      
       (5) Use BUG_ON() rather than if () BUG().
      
       (6) Make the default validate_nommu_regions() a static inline rather than a
           #define.
      
       (7) Make free_page_series()'s objection to pages with a refcount != 1 more
           informative.
      
       (8) Adjust the __put_nommu_region() banner comment to indicate that the
           semaphore must be held for writing.
      
       (9) Limit the number of warnings about munmaps of non-mmapped regions.
      Reported-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Cc: Greg Ungerer <gerg@snapgear.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33e5d769
  4. 01 4月, 2009 3 次提交
    • A
      Get rid of indirect include of fs_struct.h · 5ad4e53b
      Al Viro 提交于
      Don't pull it in sched.h; very few files actually need it and those
      can include directly.  sched.h itself only needs forward declaration
      of struct fs_struct;
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      5ad4e53b
    • A
      New locking/refcounting for fs_struct · 498052bb
      Al Viro 提交于
      * all changes of current->fs are done under task_lock and write_lock of
        old fs->lock
      * refcount is not atomic anymore (same protection)
      * its decrements are done when removing reference from current; at the
        same time we decide whether to free it.
      * put_fs_struct() is gone
      * new field - ->in_exec.  Set by check_unsafe_exec() if we are trying to do
        execve() and only subthreads share fs_struct.  Cleared when finishing exec
        (success and failure alike).  Makes CLONE_FS fail with -EAGAIN if set.
      * check_unsafe_exec() may fail with -EAGAIN if another execve() from subthread
        is in progress.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      498052bb
    • A
      Take fs_struct handling to new file (fs/fs_struct.c) · 3e93cd67
      Al Viro 提交于
      Pure code move; two new helper functions for nfsd and daemonize
      (unshare_fs_struct() and daemonize_fs_struct() resp.; for now -
      the same code as used to be in callers).  unshare_fs_struct()
      exported (for nfsd, as copy_fs_struct()/exit_fs() used to be),
      copy_fs_struct() and exit_fs() don't need exports anymore.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      3e93cd67
  5. 30 3月, 2009 1 次提交
  6. 10 3月, 2009 1 次提交
  7. 11 2月, 2009 1 次提交
    • O
      ptrace, x86: fix the usage of ptrace_fork() · 06eb23b1
      Oleg Nesterov 提交于
      I noticed by pure accident we have ptrace_fork() and friends. This was
      added by "x86, bts: add fork and exit handling", commit
      bf53de90.
      
      I can't test this, ds_request_bts() returns -EOPNOTSUPP, but I strongly
      believe this needs the fix. I think something like this program
      
      	int main(void)
      	{
      		int pid = fork();
      
      		if (!pid) {
      			ptrace(PTRACE_TRACEME, 0, NULL, NULL);
      			kill(getpid(), SIGSTOP);
      			fork();
      		} else {
      			struct ptrace_bts_config bts = {
      				.flags = PTRACE_BTS_O_ALLOC,
      				.size  = 4 * 4096,
      			};
      
      			wait(NULL);
      
      			ptrace(PTRACE_SETOPTIONS, pid, NULL, PTRACE_O_TRACEFORK);
      			ptrace(PTRACE_BTS_CONFIG, pid, &bts, sizeof(bts));
      			ptrace(PTRACE_CONT, pid, NULL, NULL);
      
      			sleep(1);
      		}
      
      		return 0;
      	}
      
      should crash the kernel.
      
      If the task is traced by its natural parent ptrace_reparented() returns 0
      but we should clear ->btsxxx anyway.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NMarkus Metzger <markus.t.metzger@intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      06eb23b1
  8. 09 2月, 2009 1 次提交
  9. 07 2月, 2009 1 次提交
  10. 05 2月, 2009 1 次提交
    • P
      signal: re-add dead task accumulation stats. · 32bd671d
      Peter Zijlstra 提交于
      We're going to split the process wide cpu accounting into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context.
      
       - timers; which need constant time tracing but can affort the overhead
                 because they're default off -- and rare.
      
      The clock readout will go back to a full sum of the thread group, for this
      we need to re-add the exit stats that were removed in the initial itimer
      rework (f06febc9: timers: fix itimer/many thread hang).
      
      Furthermore, since that full sum can be rather slow for large thread groups
      and we have the complete dead task stats, revert the do_notify_parent time
      computation.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      32bd671d
  11. 17 1月, 2009 1 次提交
  12. 14 1月, 2009 2 次提交
  13. 11 1月, 2009 1 次提交
  14. 09 1月, 2009 1 次提交
    • E
      pid: generalize task_active_pid_ns · 61bce0f1
      Eric W. Biederman 提交于
      Currently task_active_pid_ns is not safe to call after a task becomes a
      zombie and exit_task_namespaces is called, as nsproxy becomes NULL.  By
      reading the pid namespace from the pid of the task we can trivially solve
      this problem at the cost of one extra memory read in what should be the
      same cacheline as we read the namespace from.
      
      When moving things around I have made task_active_pid_ns out of line
      because keeping it in pid_namespace.h would require adding includes of
      pid.h and sched.h that I don't think we want.
      
      This change does make task_active_pid_ns unsafe to call during
      copy_process until we attach a pid on the task_struct which seems to be a
      reasonable trade off.
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: NSukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Bastian Blank <bastian@waldi.eu.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: Nadia Derbey <Nadia.Derbey@bull.net>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      61bce0f1
  15. 08 1月, 2009 2 次提交
    • D
      NOMMU: Make VMAs per MM as for MMU-mode linux · 8feae131
      David Howells 提交于
      Make VMAs per mm_struct as for MMU-mode linux.  This solves two problems:
      
       (1) In SYSV SHM where nattch for a segment does not reflect the number of
           shmat's (and forks) done.
      
       (2) In mmap() where the VMA's vm_mm is set to point to the parent mm by an
           exec'ing process when VM_EXECUTABLE is specified, regardless of the fact
           that a VMA might be shared and already have its vm_mm assigned to another
           process or a dead process.
      
      A new struct (vm_region) is introduced to track a mapped region and to remember
      the circumstances under which it may be shared and the vm_list_struct structure
      is discarded as it's no longer required.
      
      This patch makes the following additional changes:
      
       (1) Regions are now allocated with alloc_pages() rather than kmalloc() and
           with no recourse to __GFP_COMP, so the pages are not composite.  Instead,
           each page has a reference on it held by the region.  Anything else that is
           interested in such a page will have to get a reference on it to retain it.
           When the pages are released due to unmapping, each page is passed to
           put_page() and will be freed when the page usage count reaches zero.
      
       (2) Excess pages are trimmed after an allocation as the allocation must be
           made as a power-of-2 quantity of pages.
      
       (3) VMAs are added to the parent MM's R/B tree and mmap lists.  As an MM may
           end up with overlapping VMAs within the tree, the VMA struct address is
           appended to the sort key.
      
       (4) Non-anonymous VMAs are now added to the backing inode's prio list.
      
       (5) Holes may be punched in anonymous VMAs with munmap(), releasing parts of
           the backing region.  The VMA and region structs will be split if
           necessary.
      
       (6) sys_shmdt() only releases one attachment to a SYSV IPC shared memory
           segment instead of all the attachments at that addresss.  Multiple
           shmat()'s return the same address under NOMMU-mode instead of different
           virtual addresses as under MMU-mode.
      
       (7) Core dumping for ELF-FDPIC requires fewer exceptions for NOMMU-mode.
      
       (8) /proc/maps is now the global list of mapped regions, and may list bits
           that aren't actually mapped anywhere.
      
       (9) /proc/meminfo gains a line (tagged "MmapCopy") that indicates the amount
           of RAM currently allocated by mmap to hold mappable regions that can't be
           mapped directly.  These are copies of the backing device or file if not
           anonymous.
      
      These changes make NOMMU mode more similar to MMU mode.  The downside is that
      NOMMU mode requires some extra memory to track things over NOMMU without this
      patch (VMAs are no longer shared, and there are now region structs).
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NMike Frysinger <vapier.adi@gmail.com>
      Acked-by: NPaul Mundt <lethal@linux-sh.org>
      8feae131
    • P
      itimers: remove the per-cpu-ish-ness · 490dea45
      Peter Zijlstra 提交于
      Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
      or we just bounce a single..
      
      Also, using per-cpu allocations for the thread-groups complicates the
      per-cpu allocator in that its currently aimed to be a fixed sized
      allocator and the only possible extention to that would be vmap based,
      which is seriously constrained on 32 bit archs.
      
      So making the per-cpu memory requirement depend on the number of
      processes is an issue.
      
      Lastly, it didn't deal with cpu-hotplug, although admittedly that might
      be fixable.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      490dea45
  16. 07 1月, 2009 2 次提交
  17. 29 12月, 2008 1 次提交
    • J
      aio: make the lookup_ioctx() lockless · abf137dd
      Jens Axboe 提交于
      The mm->ioctx_list is currently protected by a reader-writer lock,
      so we always grab that lock on the read side for doing ioctx
      lookups. As the workload is extremely reader biased, turn this into
      an rcu hlist so we can make lookup_ioctx() lockless. Get rid of
      the rwlock and use a spinlock for providing update side exclusion.
      
      There's usually only 1 entry on this list, so it doesn't make sense
      to look into fancier data structures.
      Reviewed-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      abf137dd
  18. 20 12月, 2008 1 次提交
    • M
      x86, bts: add fork and exit handling · bf53de90
      Markus Metzger 提交于
      Impact: introduce new ptrace facility
      
      Add arch_ptrace_untrace() function that is called when the tracer
      detaches (either voluntarily or when the tracing task dies);
      ptrace_disable() is only called on a voluntary detach.
      
      Add ptrace_fork() and arch_ptrace_fork(). They are called when a
      traced task is forked.
      
      Clear DS and BTS related fields on fork.
      
      Release DS resources and reclaim memory in ptrace_untrace(). This
      releases resources already when the tracing task dies. We used to do
      that when the traced task dies.
      Signed-off-by: NMarkus Metzger <markus.t.metzger@intel.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      bf53de90
  19. 11 12月, 2008 1 次提交
    • H
      fix mapping_writably_mapped() · b88ed205
      Hugh Dickins 提交于
      Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped
      test in 2.6.7!  Bad bad bug, good good find.
      
      The i_mmap_writable count must be incremented for VM_SHARED (just as
      i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock)
      when dup_mmap() copies the vma for fork: it has its own more optimal
      version of __vma_link_file(), and I missed this out.  So the count
      was later going down to 0 (dangerous) when one end unmapped, then
      wrapping negative (inefficient) when the other end unmapped.
      
      The only impact on x86 would have been that setting a mandatory lock on
      a file which has at some time been opened O_RDWR and mapped MAP_SHARED
      (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN
      when it should succeed, or succeed when it should fail.
      
      But those architectures which rely on flush_dcache_page() to flush
      userspace modifications back into the page before the kernel reads it,
      may in some cases have skipped the flush after such a fork - though any
      repetitive test will soon wrap the count negative, in which case it will
      flush_dcache_page() unnecessarily.
      
      Fix would be a two-liner, but mapping variable added, and comment moved.
      Reported-by: NLee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b88ed205
  20. 09 12月, 2008 1 次提交
  21. 08 12月, 2008 2 次提交
    • T
      performance counters: core code · 0793a61d
      Thomas Gleixner 提交于
      Implement the core kernel bits of Performance Counters subsystem.
      
      The Linux Performance Counter subsystem provides an abstraction of
      performance counter hardware capabilities. It provides per task and per
      CPU counters, and it provides event capabilities on top of those.
      
      Performance counters are accessed via special file descriptors.
      There's one file descriptor per virtual counter used.
      
      The special file descriptor is opened via the perf_counter_open()
      system call:
      
       int
       perf_counter_open(u32 hw_event_type,
                         u32 hw_event_period,
                         u32 record_type,
                         pid_t pid,
                         int cpu);
      
      The syscall returns the new fd. The fd can be used via the normal
      VFS system calls: read() can be used to read the counter, fcntl()
      can be used to set the blocking mode, etc.
      
      Multiple counters can be kept open at a time, and the counters
      can be poll()ed.
      
      See more details in Documentation/perf-counters.txt.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0793a61d
    • S
      user namespaces: require cap_set{ug}id for CLONE_NEWUSER · 7657d904
      Serge E. Hallyn 提交于
      While ideally CLONE_NEWUSER will eventually require no
      privilege, the required permission checks are currently
      not there.  As a result, CLONE_NEWUSER has the same effect
      as a setuid(0)+setgroups(1,"0").  While we already require
      CAP_SYS_ADMIN, requiring CAP_SETUID and CAP_SETGID seems
      appropriate.
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      7657d904
  22. 04 12月, 2008 1 次提交
    • S
      ftrace: fix race in function graph during fork · e8e1abe9
      Steven Rostedt 提交于
      Impact: graph tracer race/crash fix
      
      There is a nasy race in startup of a new process running the
      function graph tracer. In fork.c:
      
      	total_forks++;
      	spin_unlock(&current->sighand->siglock);
      	write_unlock_irq(&tasklist_lock);
      	ftrace_graph_init_task(p);
      	proc_fork_connector(p);
      	cgroup_post_fork(p);
      	return p;
      
      The new task is free to run as soon as the tasklist_lock is released.
      This is before the ftrace_graph_init_task. If the task does run
      it will be using the same ret_stack and curr_ret_stack as the parent.
      This will cause crashes that are difficult to debug.
      
      This patch moves the ftrace_graph_init_task to just after the alloc_pid
      code. This fixes the above race.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e8e1abe9
  23. 26 11月, 2008 1 次提交
  24. 25 11月, 2008 1 次提交
    • S
      User namespaces: set of cleanups (v2) · 18b6e041
      Serge Hallyn 提交于
      The user_ns is moved from nsproxy to user_struct, so that a struct
      cred by itself is sufficient to determine access (which it otherwise
      would not be).  Corresponding ecryptfs fixes (by David Howells) are
      here as well.
      
      Fix refcounting.  The following rules now apply:
              1. The task pins the user struct.
              2. The user struct pins its user namespace.
              3. The user namespace pins the struct user which created it.
      
      User namespaces are cloned during copy_creds().  Unsharing a new user_ns
      is no longer possible.  (We could re-add that, but it'll cause code
      duplication and doesn't seem useful if PAM doesn't need to clone user
      namespaces).
      
      When a user namespace is created, its first user (uid 0) gets empty
      keyrings and a clean group_info.
      
      This incorporates a previous patch by David Howells.  Here
      is his original patch description:
      
      >I suggest adding the attached incremental patch.  It makes the following
      >changes:
      >
      > (1) Provides a current_user_ns() macro to wrap accesses to current's user
      >     namespace.
      >
      > (2) Fixes eCryptFS.
      >
      > (3) Renames create_new_userns() to create_user_ns() to be more consistent
      >     with the other associated functions and because the 'new' in the name is
      >     superfluous.
      >
      > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
      >     beginning of do_fork() so that they're done prior to making any attempts
      >     at allocation.
      >
      > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
      >     to fill in rather than have it return the new root user.  I don't imagine
      >     the new root user being used for anything other than filling in a cred
      >     struct.
      >
      >     This also permits me to get rid of a get_uid() and a free_uid(), as the
      >     reference the creds were holding on the old user_struct can just be
      >     transferred to the new namespace's creator pointer.
      >
      > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
      >     preparation rather than doing it in copy_creds().
      >
      >David
      
      >Signed-off-by: David Howells <dhowells@redhat.com>
      
      Changelog:
      	Oct 20: integrate dhowells comments
      		1. leave thread_keyring alone
      		2. use current_user_ns() in set_user()
      Signed-off-by: NSerge Hallyn <serue@us.ibm.com>
      18b6e041
  25. 24 11月, 2008 1 次提交
  26. 23 11月, 2008 2 次提交
  27. 16 11月, 2008 2 次提交
    • M
      tracepoints: add DECLARE_TRACE() and DEFINE_TRACE() · 7e066fb8
      Mathieu Desnoyers 提交于
      Impact: API *CHANGE*. Must update all tracepoint users.
      
      Add DEFINE_TRACE() to tracepoints to let them declare the tracepoint
      structure in a single spot for all the kernel. It helps reducing memory
      consumption, especially when declaring a lot of tracepoints, e.g. for
      kmalloc tracing.
      
      *API CHANGE WARNING*: now, DECLARE_TRACE() must be used in headers for
      tracepoint declarations rather than DEFINE_TRACE(). This is the sane way
      to do it. The name previously used was misleading.
      
      Updates scheduler instrumentation to follow this API change.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7e066fb8
    • L
      Move "exit_robust_list" into mm_release() · 8141c7f3
      Linus Torvalds 提交于
      We don't want to get rid of the futexes just at exit() time, we want to
      drop them when doing an execve() too, since that gets rid of the
      previous VM image too.
      
      Doing it at mm_release() time means that we automatically always do it
      when we disassociate a VM map from the task.
      
      Reported-by: pageexec@freemail.hu
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Alex Efros <powerman@powerman.name>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8141c7f3
  28. 14 11月, 2008 2 次提交
    • D
      CRED: Differentiate objective and effective subjective credentials on a task · 3b11a1de
      David Howells 提交于
      Differentiate the objective and real subjective credentials from the effective
      subjective credentials on a task by introducing a second credentials pointer
      into the task_struct.
      
      task_struct::real_cred then refers to the objective and apparent real
      subjective credentials of a task, as perceived by the other tasks in the
      system.
      
      task_struct::cred then refers to the effective subjective credentials of a
      task, as used by that task when it's actually running.  These are not visible
      to the other tasks in the system.
      
      __task_cred(task) then refers to the objective/real credentials of the task in
      question.
      
      current_cred() refers to the effective subjective credentials of the current
      task.
      
      prepare_creds() uses the objective creds as a base and commit_creds() changes
      both pointers in the task_struct (indeed commit_creds() requires them to be the
      same).
      
      override_creds() and revert_creds() change the subjective creds pointer only,
      and the former returns the old subjective creds.  These are used by NFSD,
      faccessat() and do_coredump(), and will by used by CacheFiles.
      
      In SELinux, current_has_perm() is provided as an alternative to
      task_has_perm().  This uses the effective subjective context of current,
      whereas task_has_perm() uses the objective/real context of the subject.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      3b11a1de
    • D
      CRED: Inaugurate COW credentials · d84f4f99
      David Howells 提交于
      Inaugurate copy-on-write credentials management.  This uses RCU to manage the
      credentials pointer in the task_struct with respect to accesses by other tasks.
      A process may only modify its own credentials, and so does not need locking to
      access or modify its own credentials.
      
      A mutex (cred_replace_mutex) is added to the task_struct to control the effect
      of PTRACE_ATTACHED on credential calculations, particularly with respect to
      execve().
      
      With this patch, the contents of an active credentials struct may not be
      changed directly; rather a new set of credentials must be prepared, modified
      and committed using something like the following sequence of events:
      
      	struct cred *new = prepare_creds();
      	int ret = blah(new);
      	if (ret < 0) {
      		abort_creds(new);
      		return ret;
      	}
      	return commit_creds(new);
      
      There are some exceptions to this rule: the keyrings pointed to by the active
      credentials may be instantiated - keyrings violate the COW rule as managing
      COW keyrings is tricky, given that it is possible for a task to directly alter
      the keys in a keyring in use by another task.
      
      To help enforce this, various pointers to sets of credentials, such as those in
      the task_struct, are declared const.  The purpose of this is compile-time
      discouragement of altering credentials through those pointers.  Once a set of
      credentials has been made public through one of these pointers, it may not be
      modified, except under special circumstances:
      
        (1) Its reference count may incremented and decremented.
      
        (2) The keyrings to which it points may be modified, but not replaced.
      
      The only safe way to modify anything else is to create a replacement and commit
      using the functions described in Documentation/credentials.txt (which will be
      added by a later patch).
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           This now prepares and commits credentials in various places in the
           security code rather than altering the current creds directly.
      
       (2) Temporary credential overrides.
      
           do_coredump() and sys_faccessat() now prepare their own credentials and
           temporarily override the ones currently on the acting thread, whilst
           preventing interference from other threads by holding cred_replace_mutex
           on the thread being dumped.
      
           This will be replaced in a future patch by something that hands down the
           credentials directly to the functions being called, rather than altering
           the task's objective credentials.
      
       (3) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_capset_check(), ->capset_check()
           (*) security_capset_set(), ->capset_set()
      
           	 Removed in favour of security_capset().
      
           (*) security_capset(), ->capset()
      
           	 New.  This is passed a pointer to the new creds, a pointer to the old
           	 creds and the proposed capability sets.  It should fill in the new
           	 creds or return an error.  All pointers, barring the pointer to the
           	 new creds, are now const.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
      
           	 Changed; now returns a value, which will cause the process to be
           	 killed if it's an error.
      
           (*) security_task_alloc(), ->task_alloc_security()
      
           	 Removed in favour of security_prepare_creds().
      
           (*) security_cred_free(), ->cred_free()
      
           	 New.  Free security data attached to cred->security.
      
           (*) security_prepare_creds(), ->cred_prepare()
      
           	 New. Duplicate any security data attached to cred->security.
      
           (*) security_commit_creds(), ->cred_commit()
      
           	 New. Apply any security effects for the upcoming installation of new
           	 security by commit_creds().
      
           (*) security_task_post_setuid(), ->task_post_setuid()
      
           	 Removed in favour of security_task_fix_setuid().
      
           (*) security_task_fix_setuid(), ->task_fix_setuid()
      
           	 Fix up the proposed new credentials for setuid().  This is used by
           	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
           	 setuid() changes.  Changes are made to the new credentials, rather
           	 than the task itself as in security_task_post_setuid().
      
           (*) security_task_reparent_to_init(), ->task_reparent_to_init()
      
           	 Removed.  Instead the task being reparented to init is referred
           	 directly to init's credentials.
      
      	 NOTE!  This results in the loss of some state: SELinux's osid no
      	 longer records the sid of the thread that forked it.
      
           (*) security_key_alloc(), ->key_alloc()
           (*) security_key_permission(), ->key_permission()
      
           	 Changed.  These now take cred pointers rather than task pointers to
           	 refer to the security context.
      
       (4) sys_capset().
      
           This has been simplified and uses less locking.  The LSM functions it
           calls have been merged.
      
       (5) reparent_to_kthreadd().
      
           This gives the current thread the same credentials as init by simply using
           commit_thread() to point that way.
      
       (6) __sigqueue_alloc() and switch_uid()
      
           __sigqueue_alloc() can't stop the target task from changing its creds
           beneath it, so this function gets a reference to the currently applicable
           user_struct which it then passes into the sigqueue struct it returns if
           successful.
      
           switch_uid() is now called from commit_creds(), and possibly should be
           folded into that.  commit_creds() should take care of protecting
           __sigqueue_alloc().
      
       (7) [sg]et[ug]id() and co and [sg]et_current_groups.
      
           The set functions now all use prepare_creds(), commit_creds() and
           abort_creds() to build and check a new set of credentials before applying
           it.
      
           security_task_set[ug]id() is called inside the prepared section.  This
           guarantees that nothing else will affect the creds until we've finished.
      
           The calling of set_dumpable() has been moved into commit_creds().
      
           Much of the functionality of set_user() has been moved into
           commit_creds().
      
           The get functions all simply access the data directly.
      
       (8) security_task_prctl() and cap_task_prctl().
      
           security_task_prctl() has been modified to return -ENOSYS if it doesn't
           want to handle a function, or otherwise return the return value directly
           rather than through an argument.
      
           Additionally, cap_task_prctl() now prepares a new set of credentials, even
           if it doesn't end up using it.
      
       (9) Keyrings.
      
           A number of changes have been made to the keyrings code:
      
           (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
           	 all been dropped and built in to the credentials functions directly.
           	 They may want separating out again later.
      
           (b) key_alloc() and search_process_keyrings() now take a cred pointer
           	 rather than a task pointer to specify the security context.
      
           (c) copy_creds() gives a new thread within the same thread group a new
           	 thread keyring if its parent had one, otherwise it discards the thread
           	 keyring.
      
           (d) The authorisation key now points directly to the credentials to extend
           	 the search into rather pointing to the task that carries them.
      
           (e) Installing thread, process or session keyrings causes a new set of
           	 credentials to be created, even though it's not strictly necessary for
           	 process or session keyrings (they're shared).
      
      (10) Usermode helper.
      
           The usermode helper code now carries a cred struct pointer in its
           subprocess_info struct instead of a new session keyring pointer.  This set
           of credentials is derived from init_cred and installed on the new process
           after it has been cloned.
      
           call_usermodehelper_setup() allocates the new credentials and
           call_usermodehelper_freeinfo() discards them if they haven't been used.  A
           special cred function (prepare_usermodeinfo_creds()) is provided
           specifically for call_usermodehelper_setup() to call.
      
           call_usermodehelper_setkeys() adjusts the credentials to sport the
           supplied keyring as the new session keyring.
      
      (11) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) selinux_setprocattr() no longer does its check for whether the
           	 current ptracer can access processes with the new SID inside the lock
           	 that covers getting the ptracer's SID.  Whilst this lock ensures that
           	 the check is done with the ptracer pinned, the result is only valid
           	 until the lock is released, so there's no point doing it inside the
           	 lock.
      
      (12) is_single_threaded().
      
           This function has been extracted from selinux_setprocattr() and put into
           a file of its own in the lib/ directory as join_session_keyring() now
           wants to use it too.
      
           The code in SELinux just checked to see whether a task shared mm_structs
           with other tasks (CLONE_VM), but that isn't good enough.  We really want
           to know if they're part of the same thread group (CLONE_THREAD).
      
      (13) nfsd.
      
           The NFS server daemon now has to use the COW credentials to set the
           credentials it is going to use.  It really needs to pass the credentials
           down to the functions it calls, but it can't do that until other patches
           in this series have been applied.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      d84f4f99