1. 14 11月, 2008 4 次提交
    • D
      CRED: Inaugurate COW credentials · d84f4f99
      David Howells 提交于
      Inaugurate copy-on-write credentials management.  This uses RCU to manage the
      credentials pointer in the task_struct with respect to accesses by other tasks.
      A process may only modify its own credentials, and so does not need locking to
      access or modify its own credentials.
      
      A mutex (cred_replace_mutex) is added to the task_struct to control the effect
      of PTRACE_ATTACHED on credential calculations, particularly with respect to
      execve().
      
      With this patch, the contents of an active credentials struct may not be
      changed directly; rather a new set of credentials must be prepared, modified
      and committed using something like the following sequence of events:
      
      	struct cred *new = prepare_creds();
      	int ret = blah(new);
      	if (ret < 0) {
      		abort_creds(new);
      		return ret;
      	}
      	return commit_creds(new);
      
      There are some exceptions to this rule: the keyrings pointed to by the active
      credentials may be instantiated - keyrings violate the COW rule as managing
      COW keyrings is tricky, given that it is possible for a task to directly alter
      the keys in a keyring in use by another task.
      
      To help enforce this, various pointers to sets of credentials, such as those in
      the task_struct, are declared const.  The purpose of this is compile-time
      discouragement of altering credentials through those pointers.  Once a set of
      credentials has been made public through one of these pointers, it may not be
      modified, except under special circumstances:
      
        (1) Its reference count may incremented and decremented.
      
        (2) The keyrings to which it points may be modified, but not replaced.
      
      The only safe way to modify anything else is to create a replacement and commit
      using the functions described in Documentation/credentials.txt (which will be
      added by a later patch).
      
      This patch and the preceding patches have been tested with the LTP SELinux
      testsuite.
      
      This patch makes several logical sets of alteration:
      
       (1) execve().
      
           This now prepares and commits credentials in various places in the
           security code rather than altering the current creds directly.
      
       (2) Temporary credential overrides.
      
           do_coredump() and sys_faccessat() now prepare their own credentials and
           temporarily override the ones currently on the acting thread, whilst
           preventing interference from other threads by holding cred_replace_mutex
           on the thread being dumped.
      
           This will be replaced in a future patch by something that hands down the
           credentials directly to the functions being called, rather than altering
           the task's objective credentials.
      
       (3) LSM interface.
      
           A number of functions have been changed, added or removed:
      
           (*) security_capset_check(), ->capset_check()
           (*) security_capset_set(), ->capset_set()
      
           	 Removed in favour of security_capset().
      
           (*) security_capset(), ->capset()
      
           	 New.  This is passed a pointer to the new creds, a pointer to the old
           	 creds and the proposed capability sets.  It should fill in the new
           	 creds or return an error.  All pointers, barring the pointer to the
           	 new creds, are now const.
      
           (*) security_bprm_apply_creds(), ->bprm_apply_creds()
      
           	 Changed; now returns a value, which will cause the process to be
           	 killed if it's an error.
      
           (*) security_task_alloc(), ->task_alloc_security()
      
           	 Removed in favour of security_prepare_creds().
      
           (*) security_cred_free(), ->cred_free()
      
           	 New.  Free security data attached to cred->security.
      
           (*) security_prepare_creds(), ->cred_prepare()
      
           	 New. Duplicate any security data attached to cred->security.
      
           (*) security_commit_creds(), ->cred_commit()
      
           	 New. Apply any security effects for the upcoming installation of new
           	 security by commit_creds().
      
           (*) security_task_post_setuid(), ->task_post_setuid()
      
           	 Removed in favour of security_task_fix_setuid().
      
           (*) security_task_fix_setuid(), ->task_fix_setuid()
      
           	 Fix up the proposed new credentials for setuid().  This is used by
           	 cap_set_fix_setuid() to implicitly adjust capabilities in line with
           	 setuid() changes.  Changes are made to the new credentials, rather
           	 than the task itself as in security_task_post_setuid().
      
           (*) security_task_reparent_to_init(), ->task_reparent_to_init()
      
           	 Removed.  Instead the task being reparented to init is referred
           	 directly to init's credentials.
      
      	 NOTE!  This results in the loss of some state: SELinux's osid no
      	 longer records the sid of the thread that forked it.
      
           (*) security_key_alloc(), ->key_alloc()
           (*) security_key_permission(), ->key_permission()
      
           	 Changed.  These now take cred pointers rather than task pointers to
           	 refer to the security context.
      
       (4) sys_capset().
      
           This has been simplified and uses less locking.  The LSM functions it
           calls have been merged.
      
       (5) reparent_to_kthreadd().
      
           This gives the current thread the same credentials as init by simply using
           commit_thread() to point that way.
      
       (6) __sigqueue_alloc() and switch_uid()
      
           __sigqueue_alloc() can't stop the target task from changing its creds
           beneath it, so this function gets a reference to the currently applicable
           user_struct which it then passes into the sigqueue struct it returns if
           successful.
      
           switch_uid() is now called from commit_creds(), and possibly should be
           folded into that.  commit_creds() should take care of protecting
           __sigqueue_alloc().
      
       (7) [sg]et[ug]id() and co and [sg]et_current_groups.
      
           The set functions now all use prepare_creds(), commit_creds() and
           abort_creds() to build and check a new set of credentials before applying
           it.
      
           security_task_set[ug]id() is called inside the prepared section.  This
           guarantees that nothing else will affect the creds until we've finished.
      
           The calling of set_dumpable() has been moved into commit_creds().
      
           Much of the functionality of set_user() has been moved into
           commit_creds().
      
           The get functions all simply access the data directly.
      
       (8) security_task_prctl() and cap_task_prctl().
      
           security_task_prctl() has been modified to return -ENOSYS if it doesn't
           want to handle a function, or otherwise return the return value directly
           rather than through an argument.
      
           Additionally, cap_task_prctl() now prepares a new set of credentials, even
           if it doesn't end up using it.
      
       (9) Keyrings.
      
           A number of changes have been made to the keyrings code:
      
           (a) switch_uid_keyring(), copy_keys(), exit_keys() and suid_keys() have
           	 all been dropped and built in to the credentials functions directly.
           	 They may want separating out again later.
      
           (b) key_alloc() and search_process_keyrings() now take a cred pointer
           	 rather than a task pointer to specify the security context.
      
           (c) copy_creds() gives a new thread within the same thread group a new
           	 thread keyring if its parent had one, otherwise it discards the thread
           	 keyring.
      
           (d) The authorisation key now points directly to the credentials to extend
           	 the search into rather pointing to the task that carries them.
      
           (e) Installing thread, process or session keyrings causes a new set of
           	 credentials to be created, even though it's not strictly necessary for
           	 process or session keyrings (they're shared).
      
      (10) Usermode helper.
      
           The usermode helper code now carries a cred struct pointer in its
           subprocess_info struct instead of a new session keyring pointer.  This set
           of credentials is derived from init_cred and installed on the new process
           after it has been cloned.
      
           call_usermodehelper_setup() allocates the new credentials and
           call_usermodehelper_freeinfo() discards them if they haven't been used.  A
           special cred function (prepare_usermodeinfo_creds()) is provided
           specifically for call_usermodehelper_setup() to call.
      
           call_usermodehelper_setkeys() adjusts the credentials to sport the
           supplied keyring as the new session keyring.
      
      (11) SELinux.
      
           SELinux has a number of changes, in addition to those to support the LSM
           interface changes mentioned above:
      
           (a) selinux_setprocattr() no longer does its check for whether the
           	 current ptracer can access processes with the new SID inside the lock
           	 that covers getting the ptracer's SID.  Whilst this lock ensures that
           	 the check is done with the ptracer pinned, the result is only valid
           	 until the lock is released, so there's no point doing it inside the
           	 lock.
      
      (12) is_single_threaded().
      
           This function has been extracted from selinux_setprocattr() and put into
           a file of its own in the lib/ directory as join_session_keyring() now
           wants to use it too.
      
           The code in SELinux just checked to see whether a task shared mm_structs
           with other tasks (CLONE_VM), but that isn't good enough.  We really want
           to know if they're part of the same thread group (CLONE_THREAD).
      
      (13) nfsd.
      
           The NFS server daemon now has to use the COW credentials to set the
           credentials it is going to use.  It really needs to pass the credentials
           down to the functions it calls, but it can't do that until other patches
           in this series have been applied.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      d84f4f99
    • D
      CRED: Wrap current->cred and a few other accessors · 86a264ab
      David Howells 提交于
      Wrap current->cred and a few other accessors to hide their actual
      implementation.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      86a264ab
    • D
      CRED: Separate task security context from task_struct · b6dff3ec
      David Howells 提交于
      Separate the task security context from task_struct.  At this point, the
      security data is temporarily embedded in the task_struct with two pointers
      pointing to it.
      
      Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
      entry.S via asm-offsets.
      
      With comment fixes Signed-off-by: Marc Dionne <marc.c.dionne@gmail.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      b6dff3ec
    • D
      CRED: Wrap task credential accesses in the filesystem subsystem · da9592ed
      David Howells 提交于
      Wrap access to task credentials so that they can be separated more easily from
      the task_struct during the introduction of COW creds.
      
      Change most current->(|e|s|fs)[ug]id to current_(|e|s|fs)[ug]id().
      
      Change some task->e?[ug]id to task_e?[ug]id().  In some places it makes more
      sense to use RCU directly rather than a convenient wrapper; these will be
      addressed by later patches.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Reviewed-by: NJames Morris <jmorris@namei.org>
      Acked-by: NSerge Hallyn <serue@us.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NJames Morris <jmorris@namei.org>
      da9592ed
  2. 20 10月, 2008 1 次提交
  3. 17 10月, 2008 3 次提交
  4. 16 10月, 2008 1 次提交
  5. 29 9月, 2008 1 次提交
    • B
      mm owner: fix race between swapoff and exit · 31a78f23
      Balbir Singh 提交于
      There's a race between mm->owner assignment and swapoff, more easily
      seen when task slab poisoning is turned on.  The condition occurs when
      try_to_unuse() runs in parallel with an exiting task.  A similar race
      can occur with callers of get_task_mm(), such as /proc/<pid>/<mmstats>
      or ptrace or page migration.
      
      CPU0                                    CPU1
                                              try_to_unuse
                                              looks at mm = task0->mm
                                              increments mm->mm_users
      task 0 exits
      mm->owner needs to be updated, but no
      new owner is found (mm_users > 1, but
      no other task has task->mm = task0->mm)
      mm_update_next_owner() leaves
                                              mmput(mm) decrements mm->mm_users
      task0 freed
                                              dereferencing mm->owner fails
      
      The fix is to notify the subsystem via mm_owner_changed callback(),
      if no new owner is found, by specifying the new task as NULL.
      
      Jiri Slaby:
      mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
      must be set after that, so as not to pass NULL as old owner causing oops.
      
      Daisuke Nishimura:
      mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
      and its callers need to take account of this situation to avoid oops.
      
      Hugh Dickins:
      Lockdep warning and hang below exec_mmap() when testing these patches.
      exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
      so exec_mmap() now needs to do the same.  And with that repositioning,
      there's now no point in mm_need_new_owner() allowing for NULL mm.
      Reported-by: NHugh Dickins <hugh@veritas.com>
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Signed-off-by: NJiri Slaby <jirislaby@gmail.com>
      Signed-off-by: NDaisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      31a78f23
  6. 29 7月, 2008 1 次提交
  7. 27 7月, 2008 5 次提交
  8. 26 7月, 2008 12 次提交
  9. 25 7月, 2008 1 次提交
  10. 11 7月, 2008 1 次提交
    • H
      exec: fix stack excutability without PT_GNU_STACK · 96a8e13e
      Hugh Dickins 提交于
      Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
      exec'ing an ELF without a PT_GNU_STACK program header should default to an
      executable stack; but this got broken by the unlimited argv feature because
      stack vma is now created before the right personality has been established:
      so breaking old binaries using nested function trampolines.
      
      Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
      vm_flags used to be set, before the mprotect_fixup.  Checking through
      our existing VM_flags, none would have changed since insert_vm_struct:
      so this seems safer than finding a way through the personality labyrinth.
      
      Reported-by: pageexec@freemail.hu
      Signed-off-by: NHugh Dickins <hugh@veritas.com>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96a8e13e
  11. 17 6月, 2008 1 次提交
  12. 27 5月, 2008 1 次提交
  13. 17 5月, 2008 1 次提交
  14. 13 5月, 2008 1 次提交
  15. 02 5月, 2008 1 次提交
  16. 30 4月, 2008 2 次提交
  17. 29 4月, 2008 3 次提交
    • M
      procfs task exe symlink · 925d1c40
      Matt Helsley 提交于
      The kernel implements readlink of /proc/pid/exe by getting the file from
      the first executable VMA.  Then the path to the file is reconstructed and
      reported as the result.
      
      Because of the VMA walk the code is slightly different on nommu systems.
      This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      walking the VMAs to find the first executable file-backed VMA we store a
      reference to the exec'd file in the mm_struct.
      
      That reference would prevent the filesystem holding the executable file
      from being unmounted even after unmapping the VMAs.  So we track the number
      of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      unmapped.  This avoids pinning the mounted filesystem.
      
      [akpm@linux-foundation.org: improve comments]
      [yamamoto@valinux.co.jp: fix dup_mmap]
      Signed-off-by: NMatt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Cc: David Howells <dhowells@redhat.com>
      Cc:"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: NYAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      925d1c40
    • B
      cgroups: add an owner to the mm_struct · cf475ad2
      Balbir Singh 提交于
      Remove the mem_cgroup member from mm_struct and instead adds an owner.
      
      This approach was suggested by Paul Menage.  The advantage of this approach
      is that, once the mm->owner is known, using the subsystem id, the cgroup
      can be determined.  It also allows several control groups that are
      virtually grouped by mm_struct, to exist independent of the memory
      controller i.e., without adding mem_cgroup's for each controller, to
      mm_struct.
      
      A new config option CONFIG_MM_OWNER is added and the memory resource
      controller selects this config option.
      
      This patch also adds cgroup callbacks to notify subsystems when mm->owner
      changes.  The mm_cgroup_changed callback is called with the task_lock() of
      the new task held and is called just prior to changing the mm->owner.
      
      I am indebted to Paul Menage for the several reviews of this patchset and
      helping me make it lighter and simpler.
      
      This patch was tested on a powerpc box, it was compiled with both the
      MM_OWNER config turned on and off.
      
      After the thread group leader exits, it's moved to init_css_state by
      cgroup_exit(), thus all future charges from runnings threads would be
      redirected to the init_css_set's subsystem.
      Signed-off-by: NBalbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelianov <xemul@openvz.org>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: Hirokazu Takahashi <taka@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>,
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Reviewed-by: NPaul Menage <menage@google.com>
      Cc: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cf475ad2
    • T
      exec: remove argv_len from struct linux_binprm · 175a06ae
      Tetsuo Handa 提交于
      I noticed that 2.6.24.2 calculates bprm->argv_len at do_execve().  But it
      doesn't update bprm->argv_len after "remove_arg_zero() +
      copy_strings_kernel()" at load_script() etc.
      
      audit_bprm() is called from search_binary_handler() and
      search_binary_handler() is called from load_script() etc.  Thus, I think the
      condition check
      
        if (bprm->argv_len > (audit_argv_kb << 10))
                return -E2BIG;
      
      in audit_bprm() might return wrong result when strlen(removed_arg) !=
      strlen(spliced_args).  Why not update bprm->argv_len at load_script() etc.  ?
      
      By the way, 2.6.25-rc3 seems to not doing the condition check.  Is the field
      bprm->argv_len no longer needed?
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Ollie Wild <aaw@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      175a06ae