1. 24 4月, 2011 1 次提交
  2. 04 4月, 2011 1 次提交
  3. 09 12月, 2010 1 次提交
  4. 30 11月, 2010 1 次提交
  5. 28 10月, 2010 1 次提交
  6. 20 8月, 2010 2 次提交
    • P
      rcu: Add a TINY_PREEMPT_RCU · a57eb940
      Paul E. McKenney 提交于
      Implement a small-memory-footprint uniprocessor-only implementation of
      preemptible RCU.  This implementation uses but a single blocked-tasks
      list rather than the combinatorial number used per leaf rcu_node by
      TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
      processing.  This version also takes advantage of uniprocessor execution
      to accelerate grace periods in the case where there are no readers.
      
      The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.
      
      This implementation is a step towards having RCU implementation driven
      off of the SMP and PREEMPT kernel configuration variables, which can
      happen once this implementation has accumulated sufficient experience.
      
      Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
      suggested by Steve Rostedt in order to avoid the compiler-reordering
      issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).
      
      As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte
      savings compared to CONFIG_TREE_PREEMPT_RCU.  Of course, for non-real-time
      workloads, CONFIG_TINY_RCU is even better.
      
      	CONFIG_TREE_PREEMPT_RCU
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	   6170	    825	     28	   7023	   kernel/rcutree.o
      				   ----
      				   7026    Total
      
      	CONFIG_TINY_PREEMPT_RCU
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	   2081	     81	      8	   2170	   kernel/rcutiny.o
      				   ----
      				   2183    Total
      
      	CONFIG_TINY_RCU (non-preemptible)
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	    719	     25	      0	    744	   kernel/rcutiny.o
      				    ---
      				    757    Total
      Requested-by: NLoïc Minier <loic.minier@canonical.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a57eb940
    • A
      kernel: __rcu annotations · 4d2deb40
      Arnd Bergmann 提交于
      This adds annotations for RCU operations in core kernel components
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      4d2deb40
  7. 28 5月, 2010 4 次提交
  8. 12 5月, 2010 1 次提交
  9. 13 3月, 2010 1 次提交
  10. 03 3月, 2010 1 次提交
  11. 18 12月, 2009 1 次提交
  12. 16 12月, 2009 1 次提交
  13. 15 12月, 2009 1 次提交
  14. 24 11月, 2009 1 次提交
    • S
      remove CONFIG_SECURITY_FILE_CAPABILITIES compile option · b3a222e5
      Serge E. Hallyn 提交于
      As far as I know, all distros currently ship kernels with default
      CONFIG_SECURITY_FILE_CAPABILITIES=y.  Since having the option on
      leaves a 'no_file_caps' option to boot without file capabilities,
      the main reason to keep the option is that turning it off saves
      you (on my s390x partition) 5k.  In particular, vmlinux sizes
      came to:
      
      without patch fscaps=n:		 	53598392
      without patch fscaps=y:		 	53603406
      with this patch applied:		53603342
      
      with the security-next tree.
      
      Against this we must weigh the fact that there is no simple way for
      userspace to figure out whether file capabilities are supported,
      while things like per-process securebits, capability bounding
      sets, and adding bits to pI if CAP_SETPCAP is in pE are not supported
      with SECURITY_FILE_CAPABILITIES=n, leaving a bit of a problem for
      applications wanting to know whether they can use them and/or why
      something failed.
      
      It also adds another subtly different set of semantics which we must
      maintain at the risk of severe security regressions.
      
      So this patch removes the SECURITY_FILE_CAPABILITIES compile
      option.  It drops the kernel size by about 50k over the stock
      SECURITY_FILE_CAPABILITIES=y kernel, by removing the
      cap_limit_ptraced_target() function.
      
      Changelog:
      	Nov 20: remove cap_limit_ptraced_target() as it's logic
      		was ifndef'ed.
      Signed-off-by: NSerge E. Hallyn <serue@us.ibm.com>
      Acked-by: NAndrew G. Morgan" <morgan@kernel.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      b3a222e5
  15. 21 9月, 2009 1 次提交
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar 提交于
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cdd6c482
  16. 29 8月, 2009 1 次提交
    • P
      rcu: Create rcutree plugins to handle hotplug CPU for multi-level trees · dd5d19ba
      Paul E. McKenney 提交于
      When offlining CPUs from a multi-level tree, there is the
      possibility of offlining the last CPU from a given node when
      there are preempted RCU read-side critical sections that
      started life on one of the CPUs on that node.
      
      In this case, the corresponding tasks will be enqueued via the
      task_struct's rcu_node_entry list_head onto one of the
      rcu_node's blocked_tasks[] lists.  These tasks need to be moved
      somewhere else so that they will prevent the current grace
      period from ending. That somewhere is the root rcu_node.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <20090827215816.GA30472@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      dd5d19ba
  17. 23 8月, 2009 2 次提交
    • P
      rcu: Remove CONFIG_PREEMPT_RCU · 6b3ef48a
      Paul E. McKenney 提交于
      Now that CONFIG_TREE_PREEMPT_RCU is in place, there is no
      further need for CONFIG_PREEMPT_RCU.  Remove it, along with
      whatever subtle bugs it may (or may not) contain.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <125097461396-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      6b3ef48a
    • P
      rcu: Merge preemptable-RCU functionality into hierarchical RCU · f41d911f
      Paul E. McKenney 提交于
      Create a kernel/rcutree_plugin.h file that contains definitions
      for preemptable RCU (or, under the #else branch of the #ifdef,
      empty definitions for the classic non-preemptable semantics).
      These definitions fit into plugins defined in kernel/rcutree.c
      for this purpose.
      
      This variant of preemptable RCU uses a new algorithm whose
      read-side expense is roughly that of classic hierarchical RCU
      under CONFIG_PREEMPT. This new algorithm's update-side expense
      is similar to that of classic hierarchical RCU, and, in absence
      of read-side preemption or blocking, is exactly that of classic
      hierarchical RCU.  Perhaps more important, this new algorithm
      has a much simpler implementation, saving well over 1,000 lines
      of code compared to mainline's implementation of preemptable
      RCU, which will hopefully be retired in favor of this new
      algorithm.
      
      The simplifications are obtained by maintaining per-task
      nesting state for running tasks, and using a simple
      lock-protected algorithm to handle accounting when tasks block
      within RCU read-side critical sections, making use of lessons
      learned while creating numerous user-level RCU implementations
      over the past 18 months.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <12509746134003-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f41d911f
  18. 27 6月, 2009 1 次提交
  19. 17 6月, 2009 1 次提交
  20. 24 5月, 2009 1 次提交
  21. 22 5月, 2009 1 次提交
    • P
      perf_counter: Dynamically allocate tasks' perf_counter_context struct · a63eaf34
      Paul Mackerras 提交于
      This replaces the struct perf_counter_context in the task_struct with
      a pointer to a dynamically allocated perf_counter_context struct.  The
      main reason for doing is this is to allow us to transfer a
      perf_counter_context from one task to another when we do lazy PMU
      switching in a later patch.
      
      This has a few side-benefits: the task_struct becomes a little smaller,
      we save some memory because only tasks that have perf_counters attached
      get a perf_counter_context allocated for them, and we can remove the
      inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
      end up recompiling nearly everything whenever perf_counter.h changes.
      
      The perf_counter_context structures are reference-counted and freed
      when the last reference is dropped.  A context can have references
      from its task and the counters on its task.  Counters can outlive the
      task so it is possible that a context will be freed well after its
      task has exited.
      
      Contexts are allocated on fork if the parent had a context, or
      otherwise the first time that a per-task counter is created on a task.
      In the latter case, we set the context pointer in the task struct
      locklessly using an atomic compare-and-exchange operation in case we
      raced with some other task in creating a context for the subject task.
      
      This also removes the task pointer from the perf_counter struct.  The
      task pointer was not used anywhere and would make it harder to move a
      context from one task to another.  Anything that needed to know which
      task a counter was attached to was already using counter->ctx->task.
      
      The __perf_counter_init_context function moves up in perf_counter.c
      so that it can be called from find_get_context, and now initializes
      the refcount, but is otherwise unchanged.
      
      We were potentially calling list_del_counter twice: once from
      __perf_counter_exit_task when the task exits and once from
      __perf_counter_remove_from_context when the counter's fd gets closed.
      This adds a check in list_del_counter so it doesn't do anything if
      the counter has already been removed from the lists.
      
      Since perf_counter_task_sched_in doesn't do anything if the task doesn't
      have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
      __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
      the case where the current task adds the first counter to itself and
      thus creates a context for itself.
      
      This also adds similar code to __perf_counter_enable to handle a
      similar situation which can arise when the counters have been disabled
      using prctl; that also leaves cpuctx->task_ctx = NULL.
      
      [ Impact: refactor counter context management to prepare for new feature ]
      Signed-off-by: NPaul Mackerras <paulus@samba.org>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a63eaf34
  22. 11 5月, 2009 1 次提交
  23. 18 4月, 2009 1 次提交
    • S
      tracing: add same level recursion detection · 261842b7
      Steven Rostedt 提交于
      The tracing infrastructure allows for recursion. That is, an interrupt
      may interrupt the act of tracing an event, and that interrupt may very well
      perform its own trace. This is a recursive trace, and is fine to do.
      
      The problem arises when there is a bug, and the utility doing the trace
      calls something that recurses back into the tracer. This recursion is not
      caused by an external event like an interrupt, but by code that is not
      expected to recurse. The result could be a lockup.
      
      This patch adds a bitmask to the task structure that keeps track
      of the trace recursion. To find the interrupt depth, the following
      algorithm is used:
      
        level = hardirq_count() + softirq_count() + in_nmi;
      
      Here, level will be the depth of interrutps and softirqs, and even handles
      the nmi. Then the corresponding bit is set in the recursion bitmask.
      If the bit was already set, we know we had a recursion at the same level
      and we warn about it and fail the writing to the buffer.
      
      After the data has been committed to the buffer, we clear the bit.
      No atomics are needed. The only races are with interrupts and they reset
      the bitmask before returning anywy.
      
      [ Impact: detect same irq level trace recursion ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      261842b7
  24. 14 4月, 2009 1 次提交
  25. 07 4月, 2009 1 次提交
    • S
      function-graph: add proper initialization for init task · 5ac9f622
      Steven Rostedt 提交于
      Impact: fix to crash going to kexec
      
      The init task did not properly initialize the function graph pointers.
      Altough these pointers are NULL, they can not be assumed to be NULL
      for the init task, and must still be properly initialize.
      
      This usually is not an issue since a problem only arises when a task
      exits, and the init tasks do not usually exit. But when doing tests
      with kexec, the init tasks do exit, and the bug appears.
      
      This patch properly initializes the init tasks function graph data
      structures.
      Reported-and-Tested-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <alpine.DEB.2.00.0903252053080.5675@gandalf.stny.rr.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5ac9f622
  26. 06 4月, 2009 1 次提交
  27. 05 2月, 2009 1 次提交
    • P
      timers: split process wide cpu clocks/timers · 4cd4c1b4
      Peter Zijlstra 提交于
      Change the process wide cpu timers/clocks so that we:
      
       1) don't mess up the kernel with too many threads,
       2) don't have a per-cpu allocation for each process,
       3) have no impact when not used.
      
      In order to accomplish this we're going to split it into two parts:
      
       - clocks; which can take all the time they want since they run
                 from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)
      
       - timers; which need constant time sampling but since they're
                 explicity used, the user can pay the overhead.
      
      The clock readout will go back to a full sum of the thread group, while the
      timers will run of a global 'clock' that only runs when needed, so only
      programs that make use of the facility pay the price.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Reviewed-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4cd4c1b4
  28. 08 1月, 2009 1 次提交
    • P
      itimers: remove the per-cpu-ish-ness · 490dea45
      Peter Zijlstra 提交于
      Either we bounce once cacheline per cpu per tick, yielding n^2 bounces
      or we just bounce a single..
      
      Also, using per-cpu allocations for the thread-groups complicates the
      per-cpu allocator in that its currently aimed to be a fixed sized
      allocator and the only possible extention to that would be vmap based,
      which is seriously constrained on 32 bit archs.
      
      So making the per-cpu memory requirement depend on the number of
      processes is an issue.
      
      Lastly, it didn't deal with cpu-hotplug, although admittedly that might
      be fixable.
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      490dea45
  29. 01 1月, 2009 1 次提交
  30. 29 12月, 2008 1 次提交
    • G
      sched: create "pushable_tasks" list to limit pushing to one attempt · 917b627d
      Gregory Haskins 提交于
      The RT scheduler employs a "push/pull" design to actively balance tasks
      within the system (on a per disjoint cpuset basis).  When a task is
      awoken, it is immediately determined if there are any lower priority
      cpus which should be preempted.  This is opposed to the way normal
      SCHED_OTHER tasks behave, which will wait for a periodic rebalancing
      operation to occur before spreading out load.
      
      When a particular RQ has more than 1 active RT task, it is said to
      be in an "overloaded" state.  Once this occurs, the system enters
      the active balancing mode, where it will try to push the task away,
      or persuade a different cpu to pull it over.  The system will stay
      in this state until the system falls back below the <= 1 queued RT
      task per RQ.
      
      However, the current implementation suffers from a limitation in the
      push logic.  Once overloaded, all tasks (other than current) on the
      RQ are analyzed on every push operation, even if it was previously
      unpushable (due to affinity, etc).  Whats more, the operation stops
      at the first task that is unpushable and will not look at items
      lower in the queue.  This causes two problems:
      
      1) We can have the same tasks analyzed over and over again during each
         push, which extends out the fast path in the scheduler for no
         gain.  Consider a RQ that has dozens of tasks that are bound to a
         core.  Each one of those tasks will be encountered and skipped
         for each push operation while they are queued.
      
      2) There may be lower-priority tasks under the unpushable task that
         could have been successfully pushed, but will never be considered
         until either the unpushable task is cleared, or a pull operation
         succeeds.  The net result is a potential latency source for mid
         priority tasks.
      
      This patch aims to rectify these two conditions by introducing a new
      priority sorted list: "pushable_tasks".  A task is added to the list
      each time a task is activated or preempted.  It is removed from the
      list any time it is deactivated, made current, or fails to push.
      
      This works because a task only needs to be attempted to push once.
      After an initial failure to push, the other cpus will eventually try to
      pull the task when the conditions are proper.  This also solves the
      problem that we don't completely analyze all tasks due to encountering
      an unpushable tasks.  Now every task will have a push attempted (when
      appropriate).
      
      This reduces latency both by shorting the critical section of the
      rq->lock for certain workloads, and by making sure the algorithm
      considers all eligible tasks in the system.
      
      [ rostedt: added a couple more BUG_ONs ]
      Signed-off-by: NGregory Haskins <ghaskins@novell.com>
      Acked-by: NSteven Rostedt <srostedt@redhat.com>
      917b627d
  31. 23 12月, 2008 2 次提交
  32. 25 11月, 2008 1 次提交
    • S
      User namespaces: set of cleanups (v2) · 18b6e041
      Serge Hallyn 提交于
      The user_ns is moved from nsproxy to user_struct, so that a struct
      cred by itself is sufficient to determine access (which it otherwise
      would not be).  Corresponding ecryptfs fixes (by David Howells) are
      here as well.
      
      Fix refcounting.  The following rules now apply:
              1. The task pins the user struct.
              2. The user struct pins its user namespace.
              3. The user namespace pins the struct user which created it.
      
      User namespaces are cloned during copy_creds().  Unsharing a new user_ns
      is no longer possible.  (We could re-add that, but it'll cause code
      duplication and doesn't seem useful if PAM doesn't need to clone user
      namespaces).
      
      When a user namespace is created, its first user (uid 0) gets empty
      keyrings and a clean group_info.
      
      This incorporates a previous patch by David Howells.  Here
      is his original patch description:
      
      >I suggest adding the attached incremental patch.  It makes the following
      >changes:
      >
      > (1) Provides a current_user_ns() macro to wrap accesses to current's user
      >     namespace.
      >
      > (2) Fixes eCryptFS.
      >
      > (3) Renames create_new_userns() to create_user_ns() to be more consistent
      >     with the other associated functions and because the 'new' in the name is
      >     superfluous.
      >
      > (4) Moves the argument and permission checks made for CLONE_NEWUSER to the
      >     beginning of do_fork() so that they're done prior to making any attempts
      >     at allocation.
      >
      > (5) Calls create_user_ns() after prepare_creds(), and gives it the new creds
      >     to fill in rather than have it return the new root user.  I don't imagine
      >     the new root user being used for anything other than filling in a cred
      >     struct.
      >
      >     This also permits me to get rid of a get_uid() and a free_uid(), as the
      >     reference the creds were holding on the old user_struct can just be
      >     transferred to the new namespace's creator pointer.
      >
      > (6) Makes create_user_ns() reset the UIDs and GIDs of the creds under
      >     preparation rather than doing it in copy_creds().
      >
      >David
      
      >Signed-off-by: David Howells <dhowells@redhat.com>
      
      Changelog:
      	Oct 20: integrate dhowells comments
      		1. leave thread_keyring alone
      		2. use current_user_ns() in set_user()
      Signed-off-by: NSerge Hallyn <serue@us.ibm.com>
      18b6e041
  33. 14 11月, 2008 2 次提交
    • D
      CRED: Differentiate objective and effective subjective credentials on a task · 3b11a1de
      David Howells 提交于
      Differentiate the objective and real subjective credentials from the effective
      subjective credentials on a task by introducing a second credentials pointer
      into the task_struct.
      
      task_struct::real_cred then refers to the objective and apparent real
      subjective credentials of a task, as perceived by the other tasks in the
      system.
      
      task_struct::cred then refers to the effective subjective credentials of a
      task, as used by that task when it's actually running.  These are not visible
      to the other tasks in the system.
      
      __task_cred(task) then refers to the objective/real credentials of the task in
      question.
      
      current_cred() refers to the effective subjective credentials of the current
      task.
      
      prepare_creds() uses the objective creds as a base and commit_creds() changes
      both pointers in the task_struct (indeed commit_creds() requires them to be the
      same).
      
      override_creds() and revert_creds() change the subjective creds pointer only,
      and the former returns the old subjective creds.  These are used by NFSD,
      faccessat() and do_coredump(), and will by used by CacheFiles.
      
      In SELinux, current_has_perm() is provided as an alternative to
      task_has_perm().  This uses the effective subjective context of current,
      whereas task_has_perm() uses the objective/real context of the subject.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      3b11a1de
    • D
      CRED: Inaugurate COW credentials · d84f4f99
      David Howells 提交于
      Inaugurate copy-on-write credentials management.  This uses RCU to manage the
      credentials pointer in the task_struct with respect to accesses by other tasks.
      A process may only modify its own credentials, and so does not need locking to
      access or modify its own credentials.
      
      A mutex (cred_replace_mutex) is added to the task_struct to control the effect
      of PTRACE_ATTACHED on credential calculations, particularly with respect to
      execve().
      
      With this patch, the contents of an active credentials struct may not be
      changed directly; rather a new set of credentials must be prepared, modified
      and committed using something like the following sequence of events:
      
      	struct cred *new = prepare_creds();
      	int ret = blah(new);
      	if (ret < 0) {
      		abort_creds(new);
      		return ret;
      	}
      	return commit_creds(new);
      
      There are some exceptions to this rule: the keyrings pointed to by the active
      credentials may be instantiated - keyrings violate the COW rule as managing
      COW keyrings is tricky, given that it is possible for a task to directly alter
      the keys in a keyring in use by another task.
      
      To help enforce this, various pointers to sets of credentials, such as those in
      the task_struct, are declared const.  The purpose of this is compile-time
      discouragement of altering credentials through those pointers.  Once a set of
      credentials has been made public through one of these pointers, it may not be
      modified, except under special circumstances:
      
        (1) Its reference count may incremented and decremented.
      
        (2) The keyrings to which it points may be modified, but not replaced.
      
      The only safe way to modify anything else is to create a replacement and commit
      using the functions described in Documentation/credentials.txt (which will be
      added by a later patch).
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           This now prepares and commits credentials in various places in the
           security code rather than altering the current creds directly.
      
       (2) Temporary credential overrides.
      
           do_coredump() and sys_faccessat() now prepare their own credentials and
           temporarily override the ones currently on the acting thread, whilst
           preventing interference from other threads by holding cred_replace_mutex
           on the thread being dumped.
      
           This will be replaced in a future patch by something that hands down the
           credentials directly to the functions being called, rather than altering
           the task's objective credentials.
      
       (3) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_capset_check(), ->capset_check()
           (*) security_capset_set(), ->capset_set()
      
           	 Removed in favour of security_capset().
      
           (*) security_capset(), ->capset()
      
           	 New.  This is passed a pointer to the new creds, a pointer to the old
           	 creds and the proposed capability sets.  It should fill in the new
           	 creds or return an error.  All pointers, barring the pointer to the
           	 new creds, are now const.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
      
           	 Changed; now returns a value, which will cause the process to be
           	 killed if it's an error.
      
           (*) security_task_alloc(), ->task_alloc_security()
      
           	 Removed in favour of security_prepare_creds().
      
           (*) security_cred_free(), ->cred_free()
      
           	 New.  Free security data attached to cred->security.
      
           (*) security_prepare_creds(), ->cred_prepare()
      
           	 New. Duplicate any security data attached to cred->security.
      
           (*) security_commit_creds(), ->cred_commit()
      
           	 New. Apply any security effects for the upcoming installation of new
           	 security by commit_creds().
      
           (*) security_task_post_setuid(), ->task_post_setuid()
      
           	 Removed in favour of security_task_fix_setuid().
      
           (*) security_task_fix_setuid(), ->task_fix_setuid()
      
           	 Fix up the proposed new credentials for setuid().  This is used by
           	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
           	 setuid() changes.  Changes are made to the new credentials, rather
           	 than the task itself as in security_task_post_setuid().
      
           (*) security_task_reparent_to_init(), ->task_reparent_to_init()
      
           	 Removed.  Instead the task being reparented to init is referred
           	 directly to init's credentials.
      
      	 NOTE!  This results in the loss of some state: SELinux's osid no
      	 longer records the sid of the thread that forked it.
      
           (*) security_key_alloc(), ->key_alloc()
           (*) security_key_permission(), ->key_permission()
      
           	 Changed.  These now take cred pointers rather than task pointers to
           	 refer to the security context.
      
       (4) sys_capset().
      
           This has been simplified and uses less locking.  The LSM functions it
           calls have been merged.
      
       (5) reparent_to_kthreadd().
      
           This gives the current thread the same credentials as init by simply using
           commit_thread() to point that way.
      
       (6) __sigqueue_alloc() and switch_uid()
      
           __sigqueue_alloc() can't stop the target task from changing its creds
           beneath it, so this function gets a reference to the currently applicable
           user_struct which it then passes into the sigqueue struct it returns if
           successful.
      
           switch_uid() is now called from commit_creds(), and possibly should be
           folded into that.  commit_creds() should take care of protecting
           __sigqueue_alloc().
      
       (7) [sg]et[ug]id() and co and [sg]et_current_groups.
      
           The set functions now all use prepare_creds(), commit_creds() and
           abort_creds() to build and check a new set of credentials before applying
           it.
      
           security_task_set[ug]id() is called inside the prepared section.  This
           guarantees that nothing else will affect the creds until we've finished.
      
           The calling of set_dumpable() has been moved into commit_creds().
      
           Much of the functionality of set_user() has been moved into
           commit_creds().
      
           The get functions all simply access the data directly.
      
       (8) security_task_prctl() and cap_task_prctl().
      
           security_task_prctl() has been modified to return -ENOSYS if it doesn't
           want to handle a function, or otherwise return the return value directly
           rather than through an argument.
      
           Additionally, cap_task_prctl() now prepares a new set of credentials, even
           if it doesn't end up using it.
      
       (9) Keyrings.
      
           A number of changes have been made to the keyrings code:
      
           (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
           	 all been dropped and built in to the credentials functions directly.
           	 They may want separating out again later.
      
           (b) key_alloc() and search_process_keyrings() now take a cred pointer
           	 rather than a task pointer to specify the security context.
      
           (c) copy_creds() gives a new thread within the same thread group a new
           	 thread keyring if its parent had one, otherwise it discards the thread
           	 keyring.
      
           (d) The authorisation key now points directly to the credentials to extend
           	 the search into rather pointing to the task that carries them.
      
           (e) Installing thread, process or session keyrings causes a new set of
           	 credentials to be created, even though it's not strictly necessary for
           	 process or session keyrings (they're shared).
      
      (10) Usermode helper.
      
           The usermode helper code now carries a cred struct pointer in its
           subprocess_info struct instead of a new session keyring pointer.  This set
           of credentials is derived from init_cred and installed on the new process
           after it has been cloned.
      
           call_usermodehelper_setup() allocates the new credentials and
           call_usermodehelper_freeinfo() discards them if they haven't been used.  A
           special cred function (prepare_usermodeinfo_creds()) is provided
           specifically for call_usermodehelper_setup() to call.
      
           call_usermodehelper_setkeys() adjusts the credentials to sport the
           supplied keyring as the new session keyring.
      
      (11) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) selinux_setprocattr() no longer does its check for whether the
           	 current ptracer can access processes with the new SID inside the lock
           	 that covers getting the ptracer's SID.  Whilst this lock ensures that
           	 the check is done with the ptracer pinned, the result is only valid
           	 until the lock is released, so there's no point doing it inside the
           	 lock.
      
      (12) is_single_threaded().
      
           This function has been extracted from selinux_setprocattr() and put into
           a file of its own in the lib/ directory as join_session_keyring() now
           wants to use it too.
      
           The code in SELinux just checked to see whether a task shared mm_structs
           with other tasks (CLONE_VM), but that isn't good enough.  We really want
           to know if they're part of the same thread group (CLONE_THREAD).
      
      (13) nfsd.
      
           The NFS server daemon now has to use the COW credentials to set the
           credentials it is going to use.  It really needs to pass the credentials
           down to the functions it calls, but it can't do that until other patches
           in this series have been applied.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      d84f4f99