1. 14 12月, 2014 1 次提交
    • D
      x86: hook up execveat system call · 27d6ec7a
      David Drysdale 提交于
      Hook up x86-64, i386 and x32 ABIs.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d6ec7a
  2. 27 9月, 2014 1 次提交
  3. 09 8月, 2014 1 次提交
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
  4. 06 8月, 2014 1 次提交
    • T
      random: introduce getrandom(2) system call · c6e9d6f3
      Theodore Ts'o 提交于
      The getrandom(2) system call was requested by the LibreSSL Portable
      developers.  It is analoguous to the getentropy(2) system call in
      OpenBSD.
      
      The rationale of this system call is to provide resiliance against
      file descriptor exhaustion attacks, where the attacker consumes all
      available file descriptors, forcing the use of the fallback code where
      /dev/[u]random is not available.  Since the fallback code is often not
      well-tested, it is better to eliminate this potential failure mode
      entirely.
      
      The other feature provided by this new system call is the ability to
      request randomness from the /dev/urandom entropy pool, but to block
      until at least 128 bits of entropy has been accumulated in the
      /dev/urandom entropy pool.  Historically, the emphasis in the
      /dev/urandom development has been to ensure that urandom pool is
      initialized as quickly as possible after system boot, and preferably
      before the init scripts start execution.
      
      This is because changing /dev/urandom reads to block represents an
      interface change that could potentially break userspace which is not
      acceptable.  In practice, on most x86 desktop and server systems, in
      general the entropy pool can be initialized before it is needed (and
      in modern kernels, we will printk a warning message if not).  However,
      on an embedded system, this may not be the case.  And so with this new
      interface, we can provide the functionality of blocking until the
      urandom pool has been initialized.  Any userspace program which uses
      this new functionality must take care to assure that if it is used
      during the boot process, that it will not cause the init scripts or
      other portions of the system startup to hang indefinitely.
      
      SYNOPSIS
      	#include <linux/random.h>
      
      	int getrandom(void *buf, size_t buflen, unsigned int flags);
      
      DESCRIPTION
      	The system call getrandom() fills the buffer pointed to by buf
      	with up to buflen random bytes which can be used to seed user
      	space random number generators (i.e., DRBG's) or for other
      	cryptographic uses.  It should not be used for Monte Carlo
      	simulations or other programs/algorithms which are doing
      	probabilistic sampling.
      
      	If the GRND_RANDOM flags bit is set, then draw from the
      	/dev/random pool instead of the /dev/urandom pool.  The
      	/dev/random pool is limited based on the entropy that can be
      	obtained from environmental noise, so if there is insufficient
      	entropy, the requested number of bytes may not be returned.
      	If there is no entropy available at all, getrandom(2) will
      	either block, or return an error with errno set to EAGAIN if
      	the GRND_NONBLOCK bit is set in flags.
      
      	If the GRND_RANDOM bit is not set, then the /dev/urandom pool
      	will be used.  Unlike using read(2) to fetch data from
      	/dev/urandom, if the urandom pool has not been sufficiently
      	initialized, getrandom(2) will block (or return -1 with the
      	errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags).
      
      	The getentropy(2) system call in OpenBSD can be emulated using
      	the following function:
      
                  int getentropy(void *buf, size_t buflen)
                  {
                          int     ret;
      
                          if (buflen > 256)
                                  goto failure;
                          ret = getrandom(buf, buflen, 0);
                          if (ret < 0)
                                  return ret;
                          if (ret == buflen)
                                  return 0;
                  failure:
                          errno = EIO;
                          return -1;
                  }
      
      RETURN VALUE
             On success, the number of bytes that was filled in the buf is
             returned.  This may not be all the bytes requested by the
             caller via buflen if insufficient entropy was present in the
             /dev/random pool, or if the system call was interrupted by a
             signal.
      
             On error, -1 is returned, and errno is set appropriately.
      
      ERRORS
      	EINVAL		An invalid flag was passed to getrandom(2)
      
      	EFAULT		buf is outside the accessible address space.
      
      	EAGAIN		The requested entropy was not available, and
      			getentropy(2) would have blocked if the
      			GRND_NONBLOCK flag was not set.
      
      	EINTR		While blocked waiting for entropy, the call was
      			interrupted by a signal handler; see the description
      			of how interrupted read(2) calls on "slow" devices
      			are handled with and without the SA_RESTART flag
      			in the signal(7) man page.
      
      NOTES
      	For small requests (buflen <= 256) getrandom(2) will not
      	return EINTR when reading from the urandom pool once the
      	entropy pool has been initialized, and it will return all of
      	the bytes that have been requested.  This is the recommended
      	way to use getrandom(2), and is designed for compatibility
      	with OpenBSD's getentropy() system call.
      
      	However, if you are using GRND_RANDOM, then getrandom(2) may
      	block until the entropy accounting determines that sufficient
      	environmental noise has been gathered such that getrandom(2)
      	will be operating as a NRBG instead of a DRBG for those people
      	who are working in the NIST SP 800-90 regime.  Since it may
      	block for a long time, these guarantees do *not* apply.  The
      	user may want to interrupt a hanging process using a signal,
      	so blocking until all of the requested bytes are returned
      	would be unfriendly.
      
      	For this reason, the user of getrandom(2) MUST always check
      	the return value, in case it returns some error, or if fewer
      	bytes than requested was returned.  In the case of
      	!GRND_RANDOM and small request, the latter should never
      	happen, but the careful userspace code (and all crypto code
      	should be careful) should check for this anyway!
      
      	Finally, unless you are doing long-term key generation (and
      	perhaps not even then), you probably shouldn't be using
      	GRND_RANDOM.  The cryptographic algorithms used for
      	/dev/urandom are quite conservative, and so should be
      	sufficient for all purposes.  The disadvantage of GRND_RANDOM
      	is that it can block, and the increased complexity required to
      	deal with partially fulfilled getrandom(2) requests.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NZach Brown <zab@zabbo.net>
      c6e9d6f3
  5. 19 7月, 2014 1 次提交
    • K
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook 提交于
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      48dc92b9
  6. 12 4月, 2014 1 次提交
  7. 13 1月, 2014 1 次提交
    • D
      sched: Add new scheduler syscalls to support an extended scheduling parameters ABI · d50dde5a
      Dario Faggioli 提交于
      Add the syscalls needed for supporting scheduling algorithms
      with extended scheduling parameters (e.g., SCHED_DEADLINE).
      
      In general, it makes possible to specify a periodic/sporadic task,
      that executes for a given amount of runtime at each instance, and is
      scheduled according to the urgency of their own timing constraints,
      i.e.:
      
       - a (maximum/typical) instance execution time,
       - a minimum interval between consecutive instances,
       - a time constraint by which each instance must be completed.
      
      Thus, both the data structure that holds the scheduling parameters of
      the tasks and the system calls dealing with it must be extended.
      Unfortunately, modifying the existing struct sched_param would break
      the ABI and result in potentially serious compatibility issues with
      legacy binaries.
      
      For these reasons, this patch:
      
       - defines the new struct sched_attr, containing all the fields
         that are necessary for specifying a task in the computational
         model described above;
      
       - defines and implements the new scheduling related syscalls that
         manipulate it, i.e., sched_setattr() and sched_getattr().
      
      Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
      proof of concept and for developing and testing purposes. Making them
      available on other architectures is straightforward.
      
      Since no "user" for these new parameters is introduced in this patch,
      the implementation of the new system calls is just identical to their
      already existing counterpart. Future patches that implement scheduling
      policies able to exploit the new data structure must also take care of
      modifying the sched_*attr() calls accordingly with their own purposes.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      [ Rewrote to use sched_attr. ]
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      [ Removed sched_setscheduler2() for now. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d50dde5a
  8. 10 5月, 2013 1 次提交
  9. 04 3月, 2013 4 次提交
  10. 25 2月, 2013 1 次提交
  11. 24 2月, 2013 1 次提交
  12. 04 2月, 2013 9 次提交
  13. 20 12月, 2012 2 次提交
  14. 14 12月, 2012 1 次提交
    • K
      module: add syscall to load module from fd · 34e1169d
      Kees Cook 提交于
      As part of the effort to create a stronger boundary between root and
      kernel, Chrome OS wants to be able to enforce that kernel modules are
      being loaded only from our read-only crypto-hash verified (dm_verity)
      root filesystem. Since the init_module syscall hands the kernel a module
      as a memory blob, no reasoning about the origin of the blob can be made.
      
      Earlier proposals for appending signatures to kernel modules would not be
      useful in Chrome OS, since it would involve adding an additional set of
      keys to our kernel and builds for no good reason: we already trust the
      contents of our root filesystem. We don't need to verify those kernel
      modules a second time. Having to do signature checking on module loading
      would slow us down and be redundant. All we need to know is where a
      module is coming from so we can say yes/no to loading it.
      
      If a file descriptor is used as the source of a kernel module, many more
      things can be reasoned about. In Chrome OS's case, we could enforce that
      the module lives on the filesystem we expect it to live on.  In the case
      of IMA (or other LSMs), it would be possible, for example, to examine
      extended attributes that may contain signatures over the contents of
      the module.
      
      This introduces a new syscall (on x86), similar to init_module, that has
      only two arguments. The first argument is used as a file descriptor to
      the module and the second argument is a pointer to the NULL terminated
      string of module arguments.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (merge fixes)
      34e1169d
  15. 29 11月, 2012 1 次提交
  16. 01 10月, 2012 1 次提交
  17. 01 6月, 2012 1 次提交
    • C
      syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov 提交于
      While doing the checkpoint-restore in the user space one need to determine
      whether various kernel objects (like mm_struct-s of file_struct-s) are
      shared between tasks and restore this state.
      
      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there's currently no ways for solving the 1st one.
      
      One of the ways for checking whether two tasks share e.g.  mm_struct is to
      provide some mm_struct ID of a task to its proc file, but showing such
      info considered to be not that good for security reasons.
      
      Thus after some debates we end up in conclusion that using that named
      'comparison' syscall might be the best candidate.  So here is it --
      __NR_kcmp.
      
      It takes up to 5 arguments - the pids of the two tasks (which
      characteristics should be compared), the comparison type and (in case of
      comparison of files) two file descriptors.
      
      Lookups for pids are done in the caller's PID namespace only.
      
      At moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d97b46a6
  18. 23 3月, 2012 1 次提交
  19. 21 2月, 2012 1 次提交
  20. 18 11月, 2011 1 次提交