1. 13 4月, 2016 1 次提交
  2. 05 3月, 2016 1 次提交
  3. 02 12月, 2015 1 次提交
  4. 06 11月, 2015 1 次提交
  5. 09 10月, 2015 1 次提交
  6. 12 9月, 2015 1 次提交
    • M
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers 提交于
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  7. 05 9月, 2015 1 次提交
  8. 21 7月, 2015 1 次提交
    • A
      x86/entry/syscalls: Wire up 32-bit direct socket calls · 9dea5dc9
      Andy Lutomirski 提交于
      On x86_64, there's no socketcall syscall; instead all of the
      socket calls are real syscalls.  For 32-bit programs, we're
      stuck offering the socketcall syscall, but it would be nice to
      expose the direct calls as well.  This will enable seccomp to
      filter socket calls (for new userspace only, but that's fine for
      some applications) and it will provide a tiny performance boost.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Alexander Larsson <alexl@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Cosimo Cecchi <cosimo@endlessm.com>
      Cc: Dan Nicholson <nicholson@endlessm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rajalakshmi Srinivasaraghavan <raji@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tulio Magno Quites Machado Filho <tuliom@linux.vnet.ibm.com>
      Cc: libc-alpha <libc-alpha@sourceware.org>
      Link: http://lkml.kernel.org/r/cb5138299d37d5800e2d135b01a7667fa6115854.1436912629.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9dea5dc9
  9. 04 6月, 2015 1 次提交
    • I
      x86/asm/entry: Move the arch/x86/syscalls/ definitions to arch/x86/entry/syscalls/ · 1f57d5d8
      Ingo Molnar 提交于
      The build time generated syscall definitions are entry code related, move
      them into the arch/x86/entry/ directory.
      
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1f57d5d8
  10. 04 3月, 2015 1 次提交
  11. 14 12月, 2014 1 次提交
    • D
      x86: hook up execveat system call · 27d6ec7a
      David Drysdale 提交于
      Hook up x86-64, i386 and x32 ABIs.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27d6ec7a
  12. 27 9月, 2014 1 次提交
  13. 09 8月, 2014 1 次提交
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
  14. 06 8月, 2014 1 次提交
    • T
      random: introduce getrandom(2) system call · c6e9d6f3
      Theodore Ts'o 提交于
      The getrandom(2) system call was requested by the LibreSSL Portable
      developers.  It is analoguous to the getentropy(2) system call in
      OpenBSD.
      
      The rationale of this system call is to provide resiliance against
      file descriptor exhaustion attacks, where the attacker consumes all
      available file descriptors, forcing the use of the fallback code where
      /dev/[u]random is not available.  Since the fallback code is often not
      well-tested, it is better to eliminate this potential failure mode
      entirely.
      
      The other feature provided by this new system call is the ability to
      request randomness from the /dev/urandom entropy pool, but to block
      until at least 128 bits of entropy has been accumulated in the
      /dev/urandom entropy pool.  Historically, the emphasis in the
      /dev/urandom development has been to ensure that urandom pool is
      initialized as quickly as possible after system boot, and preferably
      before the init scripts start execution.
      
      This is because changing /dev/urandom reads to block represents an
      interface change that could potentially break userspace which is not
      acceptable.  In practice, on most x86 desktop and server systems, in
      general the entropy pool can be initialized before it is needed (and
      in modern kernels, we will printk a warning message if not).  However,
      on an embedded system, this may not be the case.  And so with this new
      interface, we can provide the functionality of blocking until the
      urandom pool has been initialized.  Any userspace program which uses
      this new functionality must take care to assure that if it is used
      during the boot process, that it will not cause the init scripts or
      other portions of the system startup to hang indefinitely.
      
      SYNOPSIS
      	#include <linux/random.h>
      
      	int getrandom(void *buf, size_t buflen, unsigned int flags);
      
      DESCRIPTION
      	The system call getrandom() fills the buffer pointed to by buf
      	with up to buflen random bytes which can be used to seed user
      	space random number generators (i.e., DRBG's) or for other
      	cryptographic uses.  It should not be used for Monte Carlo
      	simulations or other programs/algorithms which are doing
      	probabilistic sampling.
      
      	If the GRND_RANDOM flags bit is set, then draw from the
      	/dev/random pool instead of the /dev/urandom pool.  The
      	/dev/random pool is limited based on the entropy that can be
      	obtained from environmental noise, so if there is insufficient
      	entropy, the requested number of bytes may not be returned.
      	If there is no entropy available at all, getrandom(2) will
      	either block, or return an error with errno set to EAGAIN if
      	the GRND_NONBLOCK bit is set in flags.
      
      	If the GRND_RANDOM bit is not set, then the /dev/urandom pool
      	will be used.  Unlike using read(2) to fetch data from
      	/dev/urandom, if the urandom pool has not been sufficiently
      	initialized, getrandom(2) will block (or return -1 with the
      	errno set to EAGAIN if the GRND_NONBLOCK bit is set in flags).
      
      	The getentropy(2) system call in OpenBSD can be emulated using
      	the following function:
      
                  int getentropy(void *buf, size_t buflen)
                  {
                          int     ret;
      
                          if (buflen > 256)
                                  goto failure;
                          ret = getrandom(buf, buflen, 0);
                          if (ret < 0)
                                  return ret;
                          if (ret == buflen)
                                  return 0;
                  failure:
                          errno = EIO;
                          return -1;
                  }
      
      RETURN VALUE
             On success, the number of bytes that was filled in the buf is
             returned.  This may not be all the bytes requested by the
             caller via buflen if insufficient entropy was present in the
             /dev/random pool, or if the system call was interrupted by a
             signal.
      
             On error, -1 is returned, and errno is set appropriately.
      
      ERRORS
      	EINVAL		An invalid flag was passed to getrandom(2)
      
      	EFAULT		buf is outside the accessible address space.
      
      	EAGAIN		The requested entropy was not available, and
      			getentropy(2) would have blocked if the
      			GRND_NONBLOCK flag was not set.
      
      	EINTR		While blocked waiting for entropy, the call was
      			interrupted by a signal handler; see the description
      			of how interrupted read(2) calls on "slow" devices
      			are handled with and without the SA_RESTART flag
      			in the signal(7) man page.
      
      NOTES
      	For small requests (buflen <= 256) getrandom(2) will not
      	return EINTR when reading from the urandom pool once the
      	entropy pool has been initialized, and it will return all of
      	the bytes that have been requested.  This is the recommended
      	way to use getrandom(2), and is designed for compatibility
      	with OpenBSD's getentropy() system call.
      
      	However, if you are using GRND_RANDOM, then getrandom(2) may
      	block until the entropy accounting determines that sufficient
      	environmental noise has been gathered such that getrandom(2)
      	will be operating as a NRBG instead of a DRBG for those people
      	who are working in the NIST SP 800-90 regime.  Since it may
      	block for a long time, these guarantees do *not* apply.  The
      	user may want to interrupt a hanging process using a signal,
      	so blocking until all of the requested bytes are returned
      	would be unfriendly.
      
      	For this reason, the user of getrandom(2) MUST always check
      	the return value, in case it returns some error, or if fewer
      	bytes than requested was returned.  In the case of
      	!GRND_RANDOM and small request, the latter should never
      	happen, but the careful userspace code (and all crypto code
      	should be careful) should check for this anyway!
      
      	Finally, unless you are doing long-term key generation (and
      	perhaps not even then), you probably shouldn't be using
      	GRND_RANDOM.  The cryptographic algorithms used for
      	/dev/urandom are quite conservative, and so should be
      	sufficient for all purposes.  The disadvantage of GRND_RANDOM
      	is that it can block, and the increased complexity required to
      	deal with partially fulfilled getrandom(2) requests.
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Reviewed-by: NZach Brown <zab@zabbo.net>
      c6e9d6f3
  15. 19 7月, 2014 1 次提交
    • K
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook 提交于
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      48dc92b9
  16. 12 4月, 2014 1 次提交
  17. 13 1月, 2014 1 次提交
    • D
      sched: Add new scheduler syscalls to support an extended scheduling parameters ABI · d50dde5a
      Dario Faggioli 提交于
      Add the syscalls needed for supporting scheduling algorithms
      with extended scheduling parameters (e.g., SCHED_DEADLINE).
      
      In general, it makes possible to specify a periodic/sporadic task,
      that executes for a given amount of runtime at each instance, and is
      scheduled according to the urgency of their own timing constraints,
      i.e.:
      
       - a (maximum/typical) instance execution time,
       - a minimum interval between consecutive instances,
       - a time constraint by which each instance must be completed.
      
      Thus, both the data structure that holds the scheduling parameters of
      the tasks and the system calls dealing with it must be extended.
      Unfortunately, modifying the existing struct sched_param would break
      the ABI and result in potentially serious compatibility issues with
      legacy binaries.
      
      For these reasons, this patch:
      
       - defines the new struct sched_attr, containing all the fields
         that are necessary for specifying a task in the computational
         model described above;
      
       - defines and implements the new scheduling related syscalls that
         manipulate it, i.e., sched_setattr() and sched_getattr().
      
      Syscalls are introduced for x86 (32 and 64 bits) and ARM only, as a
      proof of concept and for developing and testing purposes. Making them
      available on other architectures is straightforward.
      
      Since no "user" for these new parameters is introduced in this patch,
      the implementation of the new system calls is just identical to their
      already existing counterpart. Future patches that implement scheduling
      policies able to exploit the new data structure must also take care of
      modifying the sched_*attr() calls accordingly with their own purposes.
      Signed-off-by: NDario Faggioli <raistlin@linux.it>
      [ Rewrote to use sched_attr. ]
      Signed-off-by: NJuri Lelli <juri.lelli@gmail.com>
      [ Removed sched_setscheduler2() for now. ]
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1383831828-15501-3-git-send-email-juri.lelli@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d50dde5a
  18. 10 5月, 2013 1 次提交
  19. 04 3月, 2013 4 次提交
  20. 25 2月, 2013 1 次提交
  21. 24 2月, 2013 1 次提交
  22. 04 2月, 2013 9 次提交
  23. 20 12月, 2012 2 次提交
  24. 14 12月, 2012 1 次提交
    • K
      module: add syscall to load module from fd · 34e1169d
      Kees Cook 提交于
      As part of the effort to create a stronger boundary between root and
      kernel, Chrome OS wants to be able to enforce that kernel modules are
      being loaded only from our read-only crypto-hash verified (dm_verity)
      root filesystem. Since the init_module syscall hands the kernel a module
      as a memory blob, no reasoning about the origin of the blob can be made.
      
      Earlier proposals for appending signatures to kernel modules would not be
      useful in Chrome OS, since it would involve adding an additional set of
      keys to our kernel and builds for no good reason: we already trust the
      contents of our root filesystem. We don't need to verify those kernel
      modules a second time. Having to do signature checking on module loading
      would slow us down and be redundant. All we need to know is where a
      module is coming from so we can say yes/no to loading it.
      
      If a file descriptor is used as the source of a kernel module, many more
      things can be reasoned about. In Chrome OS's case, we could enforce that
      the module lives on the filesystem we expect it to live on.  In the case
      of IMA (or other LSMs), it would be possible, for example, to examine
      extended attributes that may contain signatures over the contents of
      the module.
      
      This introduces a new syscall (on x86), similar to init_module, that has
      only two arguments. The first argument is used as a file descriptor to
      the module and the second argument is a pointer to the NULL terminated
      string of module arguments.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (merge fixes)
      34e1169d
  25. 29 11月, 2012 1 次提交
  26. 01 10月, 2012 1 次提交
  27. 01 6月, 2012 1 次提交
    • C
      syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov 提交于
      While doing the checkpoint-restore in the user space one need to determine
      whether various kernel objects (like mm_struct-s of file_struct-s) are
      shared between tasks and restore this state.
      
      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there's currently no ways for solving the 1st one.
      
      One of the ways for checking whether two tasks share e.g.  mm_struct is to
      provide some mm_struct ID of a task to its proc file, but showing such
      info considered to be not that good for security reasons.
      
      Thus after some debates we end up in conclusion that using that named
      'comparison' syscall might be the best candidate.  So here is it --
      __NR_kcmp.
      
      It takes up to 5 arguments - the pids of the two tasks (which
      characteristics should be compared), the comparison type and (in case of
      comparison of files) two file descriptors.
      
      Lookups for pids are done in the caller's PID namespace only.
      
      At moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d97b46a6
  28. 23 3月, 2012 1 次提交