1. 21 10月, 2015 2 次提交
  2. 19 10月, 2015 2 次提交
  3. 25 9月, 2015 1 次提交
  4. 05 9月, 2015 1 次提交
    • A
      capabilities: ambient capabilities · 58319057
      Andy Lutomirski 提交于
      Credit where credit is due: this idea comes from Christoph Lameter with
      a lot of valuable input from Serge Hallyn.  This patch is heavily based
      on Christoph's patch.
      
      ===== The status quo =====
      
      On Linux, there are a number of capabilities defined by the kernel.  To
      perform various privileged tasks, processes can wield capabilities that
      they hold.
      
      Each task has four capability masks: effective (pE), permitted (pP),
      inheritable (pI), and a bounding set (X).  When the kernel checks for a
      capability, it checks pE.  The other capability masks serve to modify
      what capabilities can be in pE.
      
      Any task can remove capabilities from pE, pP, or pI at any time.  If a
      task has a capability in pP, it can add that capability to pE and/or pI.
      If a task has CAP_SETPCAP, then it can add any capability to pI, and it
      can remove capabilities from X.
      
      Tasks are not the only things that can have capabilities; files can also
      have capabilities.  A file can have no capabilty information at all [1].
      If a file has capability information, then it has a permitted mask (fP)
      and an inheritable mask (fI) as well as a single effective bit (fE) [2].
      File capabilities modify the capabilities of tasks that execve(2) them.
      
      A task that successfully calls execve has its capabilities modified for
      the file ultimately being excecuted (i.e.  the binary itself if that
      binary is ELF or for the interpreter if the binary is a script.) [3] In
      the capability evolution rules, for each mask Z, pZ represents the old
      value and pZ' represents the new value.  The rules are:
      
        pP' = (X & fP) | (pI & fI)
        pI' = pI
        pE' = (fE ? pP' : 0)
        X is unchanged
      
      For setuid binaries, fP, fI, and fE are modified by a moderately
      complicated set of rules that emulate POSIX behavior.  Similarly, if
      euid == 0 or ruid == 0, then fP, fI, and fE are modified differently
      (primary, fP and fI usually end up being the full set).  For nonroot
      users executing binaries with neither setuid nor file caps, fI and fP
      are empty and fE is false.
      
      As an extra complication, if you execute a process as nonroot and fE is
      set, then the "secure exec" rules are in effect: AT_SECURE gets set,
      LD_PRELOAD doesn't work, etc.
      
      This is rather messy.  We've learned that making any changes is
      dangerous, though: if a new kernel version allows an unprivileged
      program to change its security state in a way that persists cross
      execution of a setuid program or a program with file caps, this
      persistent state is surprisingly likely to allow setuid or file-capped
      programs to be exploited for privilege escalation.
      
      ===== The problem =====
      
      Capability inheritance is basically useless.
      
      If you aren't root and you execute an ordinary binary, fI is zero, so
      your capabilities have no effect whatsoever on pP'.  This means that you
      can't usefully execute a helper process or a shell command with elevated
      capabilities if you aren't root.
      
      On current kernels, you can sort of work around this by setting fI to
      the full set for most or all non-setuid executable files.  This causes
      pP' = pI for nonroot, and inheritance works.  No one does this because
      it's a PITA and it isn't even supported on most filesystems.
      
      If you try this, you'll discover that every nonroot program ends up with
      secure exec rules, breaking many things.
      
      This is a problem that has bitten many people who have tried to use
      capabilities for anything useful.
      
      ===== The proposed change =====
      
      This patch adds a fifth capability mask called the ambient mask (pA).
      pA does what most people expect pI to do.
      
      pA obeys the invariant that no bit can ever be set in pA if it is not
      set in both pP and pI.  Dropping a bit from pP or pI drops that bit from
      pA.  This ensures that existing programs that try to drop capabilities
      still do so, with a complication.  Because capability inheritance is so
      broken, setting KEEPCAPS, using setresuid to switch to nonroot uids, and
      then calling execve effectively drops capabilities.  Therefore,
      setresuid from root to nonroot conditionally clears pA unless
      SECBIT_NO_SETUID_FIXUP is set.  Processes that don't like this can
      re-add bits to pA afterwards.
      
      The capability evolution rules are changed:
      
        pA' = (file caps or setuid or setgid ? 0 : pA)
        pP' = (X & fP) | (pI & fI) | pA'
        pI' = pI
        pE' = (fE ? pP' : pA')
        X is unchanged
      
      If you are nonroot but you have a capability, you can add it to pA.  If
      you do so, your children get that capability in pA, pP, and pE.  For
      example, you can set pA = CAP_NET_BIND_SERVICE, and your children can
      automatically bind low-numbered ports.  Hallelujah!
      
      Unprivileged users can create user namespaces, map themselves to a
      nonzero uid, and create both privileged (relative to their namespace)
      and unprivileged process trees.  This is currently more or less
      impossible.  Hallelujah!
      
      You cannot use pA to try to subvert a setuid, setgid, or file-capped
      program: if you execute any such program, pA gets cleared and the
      resulting evolution rules are unchanged by this patch.
      
      Users with nonzero pA are unlikely to unintentionally leak that
      capability.  If they run programs that try to drop privileges, dropping
      privileges will still work.
      
      It's worth noting that the degree of paranoia in this patch could
      possibly be reduced without causing serious problems.  Specifically, if
      we allowed pA to persist across executing non-pA-aware setuid binaries
      and across setresuid, then, naively, the only capabilities that could
      leak as a result would be the capabilities in pA, and any attacker
      *already* has those capabilities.  This would make me nervous, though --
      setuid binaries that tried to privilege-separate might fail to do so,
      and putting CAP_DAC_READ_SEARCH or CAP_DAC_OVERRIDE into pA could have
      unexpected side effects.  (Whether these unexpected side effects would
      be exploitable is an open question.) I've therefore taken the more
      paranoid route.  We can revisit this later.
      
      An alternative would be to require PR_SET_NO_NEW_PRIVS before setting
      ambient capabilities.  I think that this would be annoying and would
      make granting otherwise unprivileged users minor ambient capabilities
      (CAP_NET_BIND_SERVICE or CAP_NET_RAW for example) much less useful than
      it is with this patch.
      
      ===== Footnotes =====
      
      [1] Files that are missing the "security.capability" xattr or that have
      unrecognized values for that xattr end up with has_cap set to false.
      The code that does that appears to be complicated for no good reason.
      
      [2] The libcap capability mask parsers and formatters are dangerously
      misleading and the documentation is flat-out wrong.  fE is *not* a mask;
      it's a single bit.  This has probably confused every single person who
      has tried to use file capabilities.
      
      [3] Linux very confusingly processes both the script and the interpreter
      if applicable, for reasons that elude me.  The results from thinking
      about a script's file capabilities and/or setuid bits are mostly
      discarded.
      
      Preliminary userspace code is here, but it needs updating:
      https://git.kernel.org/cgit/linux/kernel/git/luto/util-linux-playground.git/commit/?h=cap_ambient&id=7f5afbd175d2
      
      Here is a test program that can be used to verify the functionality
      (from Christoph):
      
      /*
       * Test program for the ambient capabilities. This program spawns a shell
       * that allows running processes with a defined set of capabilities.
       *
       * (C) 2015 Christoph Lameter <cl@linux.com>
       * Released under: GPL v3 or later.
       *
       *
       * Compile using:
       *
       *	gcc -o ambient_test ambient_test.o -lcap-ng
       *
       * This program must have the following capabilities to run properly:
       * Permissions for CAP_NET_RAW, CAP_NET_ADMIN, CAP_SYS_NICE
       *
       * A command to equip the binary with the right caps is:
       *
       *	setcap cap_net_raw,cap_net_admin,cap_sys_nice+p ambient_test
       *
       *
       * To get a shell with additional caps that can be inherited by other processes:
       *
       *	./ambient_test /bin/bash
       *
       *
       * Verifying that it works:
       *
       * From the bash spawed by ambient_test run
       *
       *	cat /proc/$$/status
       *
       * and have a look at the capabilities.
       */
      
      #include <stdlib.h>
      #include <stdio.h>
      #include <errno.h>
      #include <cap-ng.h>
      #include <sys/prctl.h>
      #include <linux/capability.h>
      
      /*
       * Definitions from the kernel header files. These are going to be removed
       * when the /usr/include files have these defined.
       */
      #define PR_CAP_AMBIENT 47
      #define PR_CAP_AMBIENT_IS_SET 1
      #define PR_CAP_AMBIENT_RAISE 2
      #define PR_CAP_AMBIENT_LOWER 3
      #define PR_CAP_AMBIENT_CLEAR_ALL 4
      
      static void set_ambient_cap(int cap)
      {
      	int rc;
      
      	capng_get_caps_process();
      	rc = capng_update(CAPNG_ADD, CAPNG_INHERITABLE, cap);
      	if (rc) {
      		printf("Cannot add inheritable cap\n");
      		exit(2);
      	}
      	capng_apply(CAPNG_SELECT_CAPS);
      
      	/* Note the two 0s at the end. Kernel checks for these */
      	if (prctl(PR_CAP_AMBIENT, PR_CAP_AMBIENT_RAISE, cap, 0, 0)) {
      		perror("Cannot set cap");
      		exit(1);
      	}
      }
      
      int main(int argc, char **argv)
      {
      	int rc;
      
      	set_ambient_cap(CAP_NET_RAW);
      	set_ambient_cap(CAP_NET_ADMIN);
      	set_ambient_cap(CAP_SYS_NICE);
      
      	printf("Ambient_test forking shell\n");
      	if (execv(argv[1], argv + 1))
      		perror("Cannot exec");
      
      	return 0;
      }
      
      Signed-off-by: Christoph Lameter <cl@linux.com> # Original author
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Acked-by: NSerge E. Hallyn <serge.hallyn@ubuntu.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Aaron Jones <aaronmdjones@gmail.com>
      Cc: Ted Ts'o <tytso@mit.edu>
      Cc: Andrew G. Morgan <morgan@kernel.org>
      Cc: Mimi Zohar <zohar@linux.vnet.ibm.com>
      Cc: Austin S Hemmelgarn <ahferroin7@gmail.com>
      Cc: Markku Savela <msa@moth.iki.fi>
      Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58319057
  5. 28 7月, 2015 1 次提交
  6. 12 4月, 2015 1 次提交
  7. 16 2月, 2015 1 次提交
  8. 23 1月, 2015 1 次提交
  9. 05 1月, 2015 1 次提交
    • S
      KEYS: close race between key lookup and freeing · a3a87844
      Sasha Levin 提交于
      When a key is being garbage collected, it's key->user would get put before
      the ->destroy() callback is called, where the key is removed from it's
      respective tracking structures.
      
      This leaves a key hanging in a semi-invalid state which leaves a window open
      for a different task to try an access key->user. An example is
      find_keyring_by_name() which would dereference key->user for a key that is
      in the process of being garbage collected (where key->user was freed but
      ->destroy() wasn't called yet - so it's still present in the linked list).
      
      This would cause either a panic, or corrupt memory.
      
      Fixes CVE-2014-9529.
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      a3a87844
  10. 16 12月, 2014 1 次提交
  11. 07 12月, 2014 1 次提交
  12. 02 12月, 2014 3 次提交
    • D
      KEYS: request_key() should reget expired keys rather than give EKEYEXPIRED · 0b0a8415
      David Howells 提交于
      Since the keyring facility can be viewed as a cache (at least in some
      applications), the local expiration time on the key should probably be viewed
      as a 'needs updating after this time' property rather than an absolute 'anyone
      now wanting to use this object is out of luck' property.
      
      Since request_key() is the main interface for the usage of keys, this should
      update or replace an expired key rather than issuing EKEYEXPIRED if the local
      expiration has been reached (ie. it should refresh the cache).
      
      For absolute conditions where refreshing the cache probably doesn't help, the
      key can be negatively instantiated using KEYCTL_REJECT_KEY with EKEYEXPIRED
      given as the error to issue.  This will still cause request_key() to return
      EKEYEXPIRED as that was explicitly set.
      
      In the future, if the key type has an update op available, we might want to
      upcall with the expired key and allow the upcall to update it.  We would pass
      a different operation name (the first column in /etc/request-key.conf) to the
      request-key program.
      
      request_key() returning EKEYEXPIRED is causing an NFS problem which Chuck
      Lever describes thusly:
      
      	After about 10 minutes, my NFSv4 functional tests fail because the
      	ownership of the test files goes to "-2". Looking at /proc/keys
      	shows that the id_resolv keys that map to my test user ID have
      	expired. The ownership problem persists until the expired keys are
      	purged from the keyring, and fresh keys are obtained.
      
      	I bisected the problem to 3.13 commit b2a4df20 ("KEYS: Expand
      	the capacity of a keyring"). This commit inadvertantly changes the
      	API contract of the internal function keyring_search_aux().
      
      	The root cause appears to be that b2a4df20 made "no state check"
      	the default behavior. "No state check" means the keyring search
      	iterator function skips checking the key's expiry timeout, and
      	returns expired keys.  request_key_and_link() depends on getting
      	an -EAGAIN result code to know when to perform an upcall to refresh
      	an expired key.
      
      This patch can be tested directly by:
      
      	keyctl request2 user debug:fred a @s
      	keyctl timeout %user:debug:fred 3
      	sleep 4
      	keyctl request2 user debug:fred a @s
      
      Without the patch, the last command gives error EKEYEXPIRED, but with the
      command it gives a new key.
      Reported-by: NCarl Hetherington <cth@carlh.net>
      Reported-by: NChuck Lever <chuck.lever@oracle.com>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NChuck Lever <chuck.lever@oracle.com>
      0b0a8415
    • D
      KEYS: Simplify KEYRING_SEARCH_{NO,DO}_STATE_CHECK flags · 054f6180
      David Howells 提交于
      Simplify KEYRING_SEARCH_{NO,DO}_STATE_CHECK flags to be two variations of the
      same flag.  They are effectively mutually exclusive and one or the other
      should be provided, but not both.
      
      Keyring cycle detection and key possession determination are the only things
      that set NO_STATE_CHECK, except that neither flag really does anything there
      because neither purpose makes use of the keyring_search_iterator() function,
      but rather provides their own.
      
      For cycle detection we definitely want to check inside of expired keyrings,
      just so that we don't create a cycle we can't get rid of.  Revoked keyrings
      are cleared at revocation time and can't then be reused, so shouldn't be a
      problem either way.
      
      For possession determination, we *might* want to validate each keyring before
      searching it: do you possess a key that's hidden behind an expired or just
      plain inaccessible keyring?  Currently, the answer is yes.  Note that you
      cannot, however, possess a key behind a revoked keyring because they are
      cleared on revocation.
      
      keyring_search() sets DO_STATE_CHECK, which is correct.
      
      request_key_and_link() currently doesn't specify whether to check the key
      state or not - but it should set DO_STATE_CHECK.
      
      key_get_instantiation_authkey() also currently doesn't specify whether to
      check the key state or not - but it probably should also set DO_STATE_CHECK.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NChuck Lever <chuck.lever@oracle.com>
      054f6180
    • D
      KEYS: Fix the size of the key description passed to/from userspace · aa9d4437
      David Howells 提交于
      When a key description argument is imported into the kernel from userspace, as
      happens in add_key(), request_key(), KEYCTL_JOIN_SESSION_KEYRING,
      KEYCTL_SEARCH, the description is copied into a buffer up to PAGE_SIZE in size.
      PAGE_SIZE, however, is a variable quantity, depending on the arch.  Fix this at
      4096 instead (ie. 4095 plus a NUL termination) and define a constant
      (KEY_MAX_DESC_SIZE) to this end.
      
      When reading the description back with KEYCTL_DESCRIBE, a PAGE_SIZE internal
      buffer is allocated into which the information and description will be
      rendered.  This means that the description will get truncated if an extremely
      long description it has to be crammed into the buffer with the stringified
      information.  There is no particular need to copy the description into the
      buffer, so just copy it directly to userspace in a separate operation.
      Reported-by: NChristian Kastner <debian@kvr.at>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NChristian Kastner <debian@kvr.at>
      aa9d4437
  13. 17 9月, 2014 6 次提交
  14. 03 9月, 2014 1 次提交
  15. 28 7月, 2014 1 次提交
  16. 23 7月, 2014 7 次提交
  17. 19 7月, 2014 1 次提交
  18. 18 7月, 2014 1 次提交
    • D
      KEYS: Allow special keys (eg. DNS results) to be invalidated by CAP_SYS_ADMIN · 0c7774ab
      David Howells 提交于
      Special kernel keys, such as those used to hold DNS results for AFS, CIFS and
      NFS and those used to hold idmapper results for NFS, used to be
      'invalidateable' with key_revoke().  However, since the default permissions for
      keys were reduced:
      
      	Commit: 96b5c8fe
      	KEYS: Reduce initial permissions on keys
      
      it has become impossible to do this.
      
      Add a key flag (KEY_FLAG_ROOT_CAN_INVAL) that will permit a key to be
      invalidated by root.  This should not be used for system keyrings as the
      garbage collector will try and remove any invalidate key.  For system keyrings,
      KEY_FLAG_ROOT_CAN_CLEAR can be used instead.
      
      After this, from userspace, keyctl_invalidate() and "keyctl invalidate" can be
      used by any possessor of CAP_SYS_ADMIN (typically root) to invalidate DNS and
      idmapper keys.  Invalidated keys are immediately garbage collected and will be
      immediately rerequested if needed again.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NSteve Dickson <steved@redhat.com>
      0c7774ab
  19. 17 7月, 2014 1 次提交
  20. 16 7月, 2014 1 次提交
    • N
      sched: Remove proliferation of wait_on_bit() action functions · 74316201
      NeilBrown 提交于
      The current "wait_on_bit" interface requires an 'action'
      function to be provided which does the actual waiting.
      There are over 20 such functions, many of them identical.
      Most cases can be satisfied by one of just two functions, one
      which uses io_schedule() and one which just uses schedule().
      
      So:
       Rename wait_on_bit and        wait_on_bit_lock to
              wait_on_bit_action and wait_on_bit_lock_action
       to make it explicit that they need an action function.
      
       Introduce new wait_on_bit{,_lock} and wait_on_bit{,_lock}_io
       which are *not* given an action function but implicitly use
       a standard one.
       The decision to error-out if a signal is pending is now made
       based on the 'mode' argument rather than being encoded in the action
       function.
      
       All instances of the old wait_on_bit and wait_on_bit_lock which
       can use the new version have been changed accordingly and their
       action functions have been discarded.
       wait_on_bit{_lock} does not return any specific error code in the
       event of a signal so the caller must check for non-zero and
       interpolate their own error code as appropriate.
      
      The wait_on_bit() call in __fscache_wait_on_invalidate() was
      ambiguous as it specified TASK_UNINTERRUPTIBLE but used
      fscache_wait_bit_interruptible as an action function.
      David Howells confirms this should be uniformly
      "uninterruptible"
      
      The main remaining user of wait_on_bit{,_lock}_action is NFS
      which needs to use a freezer-aware schedule() call.
      
      A comment in fs/gfs2/glock.c notes that having multiple 'action'
      functions is useful as they display differently in the 'wchan'
      field of 'ps'. (and /proc/$PID/wchan).
      As the new bit_wait{,_io} functions are tagged "__sched", they
      will not show up at all, but something higher in the stack.  So
      the distinction will still be visible, only with different
      function names (gds2_glock_wait versus gfs2_glock_dq_wait in the
      gfs2/glock.c case).
      
      Since first version of this patch (against 3.15) two new action
      functions appeared, on in NFS and one in CIFS.  CIFS also now
      uses an action function that makes the same freezer aware
      schedule call as NFS.
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Acked-by: David Howells <dhowells@redhat.com> (fscache, keys)
      Acked-by: Steven Whitehouse <swhiteho@redhat.com> (gfs2)
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Steve French <sfrench@samba.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20140707051603.28027.72349.stgit@notabene.brownSigned-off-by: NIngo Molnar <mingo@kernel.org>
      74316201
  21. 15 4月, 2014 1 次提交
  22. 15 3月, 2014 1 次提交
  23. 10 3月, 2014 1 次提交
  24. 06 3月, 2014 1 次提交
  25. 06 2月, 2014 1 次提交