1. 23 12月, 2016 1 次提交
    • A
      move aio compat to fs/aio.c · c00d2c7e
      Al Viro 提交于
      ... and fix the minor buglet in compat io_submit() - native one
      kills ioctx as cleanup when put_user() fails.  Get rid of
      bogus compat_... in !CONFIG_AIO case, while we are at it - they
      should simply fail with ENOSYS, same as for native counterparts.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c00d2c7e
  2. 13 9月, 2016 1 次提交
  3. 02 12月, 2015 1 次提交
    • Z
      vfs: add copy_file_range syscall and vfs helper · 29732938
      Zach Brown 提交于
      Add a copy_file_range() system call for offloading copies between
      regular files.
      
      This gives an interface to underlying layers of the storage stack which
      can copy without reading and writing all the data.  There are a few
      candidates that should support copy offloading in the nearer term:
      
      - btrfs shares extent references with its clone ioctl
      - NFS has patches to add a COPY command which copies on the server
      - SCSI has a family of XCOPY commands which copy in the device
      
      This system call avoids the complexity of also accelerating the creation
      of the destination file by operating on an existing destination file
      descriptor, not a path.
      
      Currently the high level vfs entry point limits copy offloading to files
      on the same mount and super (and not in the same file).  This can be
      relaxed if we get implementations which can copy between file systems
      safely.
      Signed-off-by: NZach Brown <zab@redhat.com>
      [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                       Change flags parameter from int to unsigned int,
                       Add function to include/linux/syscalls.h,
                       Check copy len after file open mode,
                       Don't forbid ranges inside the same file,
                       Use rw_verify_area() to veriy ranges,
                       Use file_out rather than file_in,
                       Add COPY_FR_REFLINK flag]
      Signed-off-by: NAnna Schumaker <Anna.Schumaker@Netapp.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      29732938
  4. 06 11月, 2015 1 次提交
  5. 12 9月, 2015 1 次提交
    • M
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Mathieu Desnoyers 提交于
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s =3D 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every process
      threads (as we currently do), we diminish the number of unnecessary wake
      ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barrier anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                    are de facto in such a state). This covers threads from all pro=E2=80=90
                    cesses running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
             thread is guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b25b13a
  6. 05 9月, 2015 1 次提交
  7. 31 7月, 2015 1 次提交
    • A
      x86/ldt: Make modify_ldt() optional · a5b9e5a2
      Andy Lutomirski 提交于
      The modify_ldt syscall exposes a large attack surface and is
      unnecessary for modern userspace.  Make it optional.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Jan Beulich <jbeulich@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: security@kernel.org <security@kernel.org>
      Cc: xen-devel <xen-devel@lists.xen.org>
      Link: http://lkml.kernel.org/r/a605166a771c343fd64802dece77a903507333bd.1438291540.git.luto@kernel.org
      [ Made MATH_EMULATION dependent on MODIFY_LDT_SYSCALL. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a5b9e5a2
  8. 16 4月, 2015 1 次提交
  9. 14 12月, 2014 1 次提交
    • D
      syscalls: implement execveat() system call · 51f39a1f
      David Drysdale 提交于
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
      
      Based on patches by Meredydd Luff.
      Signed-off-by: NDavid Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
  10. 19 11月, 2014 1 次提交
  11. 27 9月, 2014 1 次提交
  12. 18 8月, 2014 1 次提交
    • J
      mm: Support compiling out madvise and fadvise · d3ac21ca
      Josh Triplett 提交于
      Many embedded systems will not need these syscalls, and omitting them
      saves space.  Add a new EXPERT config option CONFIG_ADVISE_SYSCALLS
      (default y) to support compiling them out.
      
      bloat-o-meter:
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-2250 (-2250)
      function                                     old     new   delta
      sys_fadvise64                                 57       -     -57
      sys_fadvise64_64                             691       -    -691
      sys_madvise                                 1502       -   -1502
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      d3ac21ca
  13. 09 8月, 2014 2 次提交
    • V
      kexec: new syscall kexec_file_load() declaration · f0895685
      Vivek Goyal 提交于
      This is the new syscall kexec_file_load() declaration/interface.  I have
      reserved the syscall number only for x86_64 so far.  Other architectures
      (including i386) can reserve syscall number when they enable the support
      for this new syscall.
      Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Matthew Garrett <mjg59@srcf.ucam.org>
      Cc: Greg Kroah-Hartman <greg@kroah.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: WANG Chao <chaowang@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0895685
    • D
      shm: add memfd_create() syscall · 9183df25
      David Herrmann 提交于
      memfd_create() is similar to mmap(MAP_ANON), but returns a file-descriptor
      that you can pass to mmap().  It can support sealing and avoids any
      connection to user-visible mount-points.  Thus, it's not subject to quotas
      on mounted file-systems, but can be used like malloc()'ed memory, but with
      a file-descriptor to it.
      
      memfd_create() returns the raw shmem file, so calls like ftruncate() can
      be used to modify the underlying inode.  Also calls like fstat() will
      return proper information and mark the file as regular file.  If you want
      sealing, you can specify MFD_ALLOW_SEALING.  Otherwise, sealing is not
      supported (like on all other regular files).
      
      Compared to O_TMPFILE, it does not require a tmpfs mount-point and is not
      subject to a filesystem size limit.  It is still properly accounted to
      memcg limits, though, and to the same overcommit or no-overcommit
      accounting as all user memory.
      Signed-off-by: NDavid Herrmann <dh.herrmann@gmail.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ryan Lortie <desrt@desrt.ca>
      Cc: Lennart Poettering <lennart@poettering.net>
      Cc: Daniel Mack <zonque@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9183df25
  14. 19 7月, 2014 1 次提交
    • K
      seccomp: add "seccomp" syscall · 48dc92b9
      Kees Cook 提交于
      This adds the new "seccomp" syscall with both an "operation" and "flags"
      parameter for future expansion. The third argument is a pointer value,
      used with the SECCOMP_SET_MODE_FILTER operation. Currently, flags must
      be 0. This is functionally equivalent to prctl(PR_SET_SECCOMP, ...).
      
      In addition to the TSYNC flag later in this patch series, there is a
      non-zero chance that this syscall could be used for configuring a fixed
      argument area for seccomp-tracer-aware processes to pass syscall arguments
      in the future. Hence, the use of "seccomp" not simply "seccomp_add_filter"
      for this syscall. Additionally, this syscall uses operation, flags,
      and user pointer for arguments because strictly passing arguments via
      a user pointer would mean seccomp itself would be unable to trivially
      filter the seccomp syscall itself.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NAndy Lutomirski <luto@amacapital.net>
      48dc92b9
  15. 05 6月, 2014 1 次提交
    • F
      sys_sgetmask/sys_ssetmask: add CONFIG_SGETMASK_SYSCALL · f6187769
      Fabian Frederick 提交于
      sys_sgetmask and sys_ssetmask are obsolete system calls no longer
      supported in libc.
      
      This patch replaces architecture related __ARCH_WANT_SYS_SGETMAX by expert
      mode configuration.That option is enabled by default for those
      architectures.
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f6187769
  16. 04 4月, 2014 2 次提交
  17. 10 5月, 2013 1 次提交
  18. 04 3月, 2013 2 次提交
  19. 14 12月, 2012 1 次提交
    • K
      module: add syscall to load module from fd · 34e1169d
      Kees Cook 提交于
      As part of the effort to create a stronger boundary between root and
      kernel, Chrome OS wants to be able to enforce that kernel modules are
      being loaded only from our read-only crypto-hash verified (dm_verity)
      root filesystem. Since the init_module syscall hands the kernel a module
      as a memory blob, no reasoning about the origin of the blob can be made.
      
      Earlier proposals for appending signatures to kernel modules would not be
      useful in Chrome OS, since it would involve adding an additional set of
      keys to our kernel and builds for no good reason: we already trust the
      contents of our root filesystem. We don't need to verify those kernel
      modules a second time. Having to do signature checking on module loading
      would slow us down and be redundant. All we need to know is where a
      module is coming from so we can say yes/no to loading it.
      
      If a file descriptor is used as the source of a kernel module, many more
      things can be reasoned about. In Chrome OS's case, we could enforce that
      the module lives on the filesystem we expect it to live on.  In the case
      of IMA (or other LSMs), it would be possible, for example, to examine
      extended attributes that may contain signatures over the contents of
      the module.
      
      This introduces a new syscall (on x86), similar to init_module, that has
      only two arguments. The first argument is used as a file descriptor to
      the module and the second argument is a pointer to the NULL terminated
      string of module arguments.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au> (merge fixes)
      34e1169d
  20. 01 6月, 2012 1 次提交
    • C
      syscalls, x86: add __NR_kcmp syscall · d97b46a6
      Cyrill Gorcunov 提交于
      While doing the checkpoint-restore in the user space one need to determine
      whether various kernel objects (like mm_struct-s of file_struct-s) are
      shared between tasks and restore this state.
      
      The 2nd step can be solved by using appropriate CLONE_ flags and the
      unshare syscall, while there's currently no ways for solving the 1st one.
      
      One of the ways for checking whether two tasks share e.g.  mm_struct is to
      provide some mm_struct ID of a task to its proc file, but showing such
      info considered to be not that good for security reasons.
      
      Thus after some debates we end up in conclusion that using that named
      'comparison' syscall might be the best candidate.  So here is it --
      __NR_kcmp.
      
      It takes up to 5 arguments - the pids of the two tasks (which
      characteristics should be compared), the comparison type and (in case of
      comparison of files) two file descriptors.
      
      Lookups for pids are done in the caller's PID namespace only.
      
      At moment only x86 is supported and tested.
      
      [akpm@linux-foundation.org: fix up selftests, warnings]
      [akpm@linux-foundation.org: include errno.h]
      [akpm@linux-foundation.org: tweak comment text]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Andrey Vagin <avagin@openvz.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Vasiliy Kulikov <segoon@openwall.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Valdis.Kletnieks@vt.edu
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d97b46a6
  21. 01 11月, 2011 1 次提交
    • C
      Cross Memory Attach · fcf63409
      Christopher Yeoh 提交于
      The basic idea behind cross memory attach is to allow MPI programs doing
      intra-node communication to do a single copy of the message rather than a
      double copy of the message via shared memory.
      
      The following patch attempts to achieve this by allowing a destination
      process, given an address and size from a source process, to copy memory
      directly from the source process into its own address space via a system
      call.  There is also a symmetrical ability to copy from the current
      process's address space into a destination process's address space.
      
      - Use of /proc/pid/mem has been considered, but there are issues with
        using it:
        - Does not allow for specifying iovecs for both src and dest, assuming
          preadv or pwritev was implemented either the area read from or
        written to would need to be contiguous.
        - Currently mem_read allows only processes who are currently
        ptrace'ing the target and are still able to ptrace the target to read
        from the target. This check could possibly be moved to the open call,
        but its not clear exactly what race this restriction is stopping
        (reason  appears to have been lost)
        - Having to send the fd of /proc/self/mem via SCM_RIGHTS on unix
        domain socket is a bit ugly from a userspace point of view,
        especially when you may have hundreds if not (eventually) thousands
        of processes  that all need to do this with each other
        - Doesn't allow for some future use of the interface we would like to
        consider adding in the future (see below)
        - Interestingly reading from /proc/pid/mem currently actually
        involves two copies! (But this could be fixed pretty easily)
      
      As mentioned previously use of vmsplice instead was considered, but has
      problems.  Since you need the reader and writer working co-operatively if
      the pipe is not drained then you block.  Which requires some wrapping to
      do non blocking on the send side or polling on the receive.  In all to all
      communication it requires ordering otherwise you can deadlock.  And in the
      example of many MPI tasks writing to one MPI task vmsplice serialises the
      copying.
      
      There are some cases of MPI collectives where even a single copy interface
      does not get us the performance gain we could.  For example in an
      MPI_Reduce rather than copy the data from the source we would like to
      instead use it directly in a mathops (say the reduce is doing a sum) as
      this would save us doing a copy.  We don't need to keep a copy of the data
      from the source.  I haven't implemented this, but I think this interface
      could in the future do all this through the use of the flags - eg could
      specify the math operation and type and the kernel rather than just
      copying the data would apply the specified operation between the source
      and destination and store it in the destination.
      
      Although we don't have a "second user" of the interface (though I've had
      some nibbles from people who may be interested in using it for intra
      process messaging which is not MPI).  This interface is something which
      hardware vendors are already doing for their custom drivers to implement
      fast local communication.  And so in addition to this being useful for
      OpenMPI it would mean the driver maintainers don't have to fix things up
      when the mm changes.
      
      There was some discussion about how much faster a true zero copy would
      go. Here's a link back to the email with some testing I did on that:
      
      http://marc.info/?l=linux-mm&m=130105930902915&w=2
      
      There is a basic man page for the proposed interface here:
      
      http://ozlabs.org/~cyeoh/cma/process_vm_readv.txt
      
      This has been implemented for x86 and powerpc, other architecture should
      mainly (I think) just need to add syscall numbers for the process_vm_readv
      and process_vm_writev. There are 32 bit compatibility versions for
      64-bit kernels.
      
      For arch maintainers there are some simple tests to be able to quickly
      verify that the syscalls are working correctly here:
      
      http://ozlabs.org/~cyeoh/cma/cma-test-20110718.tgzSigned-off-by: NChris Yeoh <yeohc@au1.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: <linux-man@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcf63409
  22. 27 8月, 2011 1 次提交
  23. 21 5月, 2011 1 次提交
  24. 06 5月, 2011 1 次提交
    • A
      net: Add sendmmsg socket system call · 228e548e
      Anton Blanchard 提交于
      This patch adds a multiple message send syscall and is the send
      version of the existing recvmmsg syscall. This is heavily
      based on the patch by Arnaldo that added recvmmsg.
      
      I wrote a microbenchmark to test the performance gains of using
      this new syscall:
      
      http://ozlabs.org/~anton/junkcode/sendmmsg_test.c
      
      The test was run on a ppc64 box with a 10 Gbit network card. The
      benchmark can send both UDP and RAW ethernet packets.
      
      64B UDP
      
      batch   pkts/sec
      1       804570
      2       872800 (+ 8 %)
      4       916556 (+14 %)
      8       939712 (+17 %)
      16      952688 (+18 %)
      32      956448 (+19 %)
      64      964800 (+20 %)
      
      64B raw socket
      
      batch   pkts/sec
      1       1201449
      2       1350028 (+12 %)
      4       1461416 (+22 %)
      8       1513080 (+26 %)
      16      1541216 (+28 %)
      32      1553440 (+29 %)
      64      1557888 (+30 %)
      
      We see a 20% improvement in throughput on UDP send and 30%
      on raw socket send.
      
      [ Add sparc syscall entries. -DaveM ]
      Signed-off-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      228e548e
  25. 15 3月, 2011 2 次提交
  26. 23 9月, 2010 1 次提交
  27. 28 7月, 2010 2 次提交
  28. 13 3月, 2010 1 次提交
    • C
      Add generic sys_ipc wrapper · baed7fc9
      Christoph Hellwig 提交于
      Add a generic implementation of the ipc demultiplexer syscall.  Except for
      s390 and sparc64 all implementations of the sys_ipc are nearly identical.
      
      There are slight differences in the types of the parameters, where mips
      and powerpc as the only 64-bit architectures with sys_ipc use unsigned
      long for the "third" argument as it gets casted to a pointer later, while
      it traditionally is an "int" like most other paramters.  frv goes even
      further and uses unsigned long for all parameters execept for "ptr" which
      is a pointer type everywhere.  The change from int to unsigned long for
      "third" and back to "int" for the others on frv should be fine due to the
      in-register calling conventions for syscalls (we already had a similar
      issue with the generic sys_ptrace), but I'd prefer to have the arch
      maintainers looks over this in details.
      
      Except for that h8300, m68k and m68knommu lack an impplementation of the
      semtimedop sub call which this patch adds, and various architectures have
      gets used - at least on i386 it seems superflous as the compat code on
      x86-64 and ia64 doesn't even bother to implement it.
      
      [akpm@linux-foundation.org: add sys_ipc to sys_ni.c]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Hirokazu Takata <takata@linux-m32r.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@elte.hu>
      Reviewed-by: NH. Peter Anvin <hpa@zytor.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Andreas Schwab <schwab@linux-m68k.org>
      Acked-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Acked-by: NRussell King <rmk+kernel@arm.linux.org.uk>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NKyle McMartin <kyle@mcmartin.ca>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      baed7fc9
  29. 06 11月, 2009 1 次提交
  30. 13 10月, 2009 1 次提交
    • A
      net: Introduce recvmmsg socket syscall · a2e27255
      Arnaldo Carvalho de Melo 提交于
      Meaning receive multiple messages, reducing the number of syscalls and
      net stack entry/exit operations.
      
      Next patches will introduce mechanisms where protocols that want to
      optimize this operation will provide an unlocked_recvmsg operation.
      
      This takes into account comments made by:
      
      . Paul Moore: sock_recvmsg is called only for the first datagram,
        sock_recvmsg_nosec is used for the rest.
      
      . Caitlin Bestler: recvmmsg now has a struct timespec timeout, that
        works in the same fashion as the ppoll one.
      
        If the underlying protocol returns a datagram with MSG_OOB set, this
        will make recvmmsg return right away with as many datagrams (+ the OOB
        one) it has received so far.
      
      . Rémi Denis-Courmont & Steven Whitehouse: If we receive N < vlen
        datagrams and then recvmsg returns an error, recvmmsg will return
        the successfully received datagrams, store the error and return it
        in the next call.
      
      This paves the way for a subsequent optimization, sk_prot->unlocked_recvmsg,
      where we will be able to acquire the lock only at batch start and end, not at
      every underlying recvmsg call.
      Signed-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a2e27255
  31. 22 9月, 2009 1 次提交
  32. 21 9月, 2009 1 次提交
    • I
      perf: Do the big rename: Performance Counters -> Performance Events · cdd6c482
      Ingo Molnar 提交于
      Bye-bye Performance Counters, welcome Performance Events!
      
      In the past few months the perfcounters subsystem has grown out its
      initial role of counting hardware events, and has become (and is
      becoming) a much broader generic event enumeration, reporting, logging,
      monitoring, analysis facility.
      
      Naming its core object 'perf_counter' and naming the subsystem
      'perfcounters' has become more and more of a misnomer. With pending
      code like hw-breakpoints support the 'counter' name is less and
      less appropriate.
      
      All in one, we've decided to rename the subsystem to 'performance
      events' and to propagate this rename through all fields, variables
      and API names. (in an ABI compatible fashion)
      
      The word 'event' is also a bit shorter than 'counter' - which makes
      it slightly more convenient to write/handle as well.
      
      Thanks goes to Stephane Eranian who first observed this misnomer and
      suggested a rename.
      
      User-space tooling and ABI compatibility is not affected - this patch
      should be function-invariant. (Also, defconfigs were not touched to
      keep the size down.)
      
      This patch has been generated via the following script:
      
        FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
      
        sed -i \
          -e 's/PERF_EVENT_/PERF_RECORD_/g' \
          -e 's/PERF_COUNTER/PERF_EVENT/g' \
          -e 's/perf_counter/perf_event/g' \
          -e 's/nb_counters/nb_events/g' \
          -e 's/swcounter/swevent/g' \
          -e 's/tpcounter_event/tp_event/g' \
          $FILES
      
        for N in $(find . -name perf_counter.[ch]); do
          M=$(echo $N | sed 's/perf_counter/perf_event/g')
          mv $N $M
        done
      
        FILES=$(find . -name perf_event.*)
      
        sed -i \
          -e 's/COUNTER_MASK/REG_MASK/g' \
          -e 's/COUNTER/EVENT/g' \
          -e 's/\<event\>/event_id/g' \
          -e 's/counter/event/g' \
          -e 's/Counter/Event/g' \
          $FILES
      
      ... to keep it as correct as possible. This script can also be
      used by anyone who has pending perfcounters patches - it converts
      a Linux kernel tree over to the new naming. We tried to time this
      change to the point in time where the amount of pending patches
      is the smallest: the end of the merge window.
      
      Namespace clashes were fixed up in a preparatory patch - and some
      stylistic fallout will be fixed up in a subsequent patch.
      
      ( NOTE: 'counters' are still the proper terminology when we deal
        with hardware registers - and these sed scripts are a bit
        over-eager in renaming them. I've undone some of that, but
        in case there's something left where 'counter' would be
        better than 'event' we can undo that on an individual basis
        instead of touching an otherwise nicely automated patch. )
      Suggested-by: NStephane Eranian <eranian@google.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NPaul Mackerras <paulus@samba.org>
      Reviewed-by: NArjan van de Ven <arjan@linux.intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: <linux-arch@vger.kernel.org>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      cdd6c482
  33. 14 1月, 2009 1 次提交
  34. 08 12月, 2008 1 次提交
    • T
      performance counters: core code · 0793a61d
      Thomas Gleixner 提交于
      Implement the core kernel bits of Performance Counters subsystem.
      
      The Linux Performance Counter subsystem provides an abstraction of
      performance counter hardware capabilities. It provides per task and per
      CPU counters, and it provides event capabilities on top of those.
      
      Performance counters are accessed via special file descriptors.
      There's one file descriptor per virtual counter used.
      
      The special file descriptor is opened via the perf_counter_open()
      system call:
      
       int
       perf_counter_open(u32 hw_event_type,
                         u32 hw_event_period,
                         u32 record_type,
                         pid_t pid,
                         int cpu);
      
      The syscall returns the new fd. The fd can be used via the normal
      VFS system calls: read() can be used to read the counter, fcntl()
      can be used to set the blocking mode, etc.
      
      Multiple counters can be kept open at a time, and the counters
      can be poll()ed.
      
      See more details in Documentation/perf-counters.txt.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      0793a61d
  35. 20 11月, 2008 1 次提交
    • U
      reintroduce accept4 · de11defe
      Ulrich Drepper 提交于
      Introduce a new accept4() system call.  The addition of this system call
      matches analogous changes in 2.6.27 (dup3(), evenfd2(), signalfd4(),
      inotify_init1(), epoll_create1(), pipe2()) which added new system calls
      that differed from analogous traditional system calls in adding a flags
      argument that can be used to access additional functionality.
      
      The accept4() system call is exactly the same as accept(), except that
      it adds a flags bit-mask argument.  Two flags are initially implemented.
      (Most of the new system calls in 2.6.27 also had both of these flags.)
      
      SOCK_CLOEXEC causes the close-on-exec (FD_CLOEXEC) flag to be enabled
      for the new file descriptor returned by accept4().  This is a useful
      security feature to avoid leaking information in a multithreaded
      program where one thread is doing an accept() at the same time as
      another thread is doing a fork() plus exec().  More details here:
      http://udrepper.livejournal.com/20407.html "Secure File Descriptor Handling",
      Ulrich Drepper).
      
      The other flag is SOCK_NONBLOCK, which causes the O_NONBLOCK flag
      to be enabled on the new open file description created by accept4().
      (This flag is merely a convenience, saving the use of additional calls
      fcntl(F_GETFL) and fcntl (F_SETFL) to achieve the same result.
      
      Here's a test program.  Works on x86-32.  Should work on x86-64, but
      I (mtk) don't have a system to hand to test with.
      
      It tests accept4() with each of the four possible combinations of
      SOCK_CLOEXEC and SOCK_NONBLOCK set/clear in 'flags', and verifies
      that the appropriate flags are set on the file descriptor/open file
      description returned by accept4().
      
      I tested Ulrich's patch in this thread by applying against 2.6.28-rc2,
      and it passes according to my test program.
      
      /* test_accept4.c
      
        Copyright (C) 2008, Linux Foundation, written by Michael Kerrisk
             <mtk.manpages@gmail.com>
      
        Licensed under the GNU GPLv2 or later.
      */
      #define _GNU_SOURCE
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <sys/socket.h>
      #include <netinet/in.h>
      #include <stdlib.h>
      #include <fcntl.h>
      #include <stdio.h>
      #include <string.h>
      
      #define PORT_NUM 33333
      
      #define die(msg) do { perror(msg); exit(EXIT_FAILURE); } while (0)
      
      /**********************************************************************/
      
      /* The following is what we need until glibc gets a wrapper for
        accept4() */
      
      /* Flags for socket(), socketpair(), accept4() */
      #ifndef SOCK_CLOEXEC
      #define SOCK_CLOEXEC    O_CLOEXEC
      #endif
      #ifndef SOCK_NONBLOCK
      #define SOCK_NONBLOCK   O_NONBLOCK
      #endif
      
      #ifdef __x86_64__
      #define SYS_accept4 288
      #elif __i386__
      #define USE_SOCKETCALL 1
      #define SYS_ACCEPT4 18
      #else
      #error "Sorry -- don't know the syscall # on this architecture"
      #endif
      
      static int
      accept4(int fd, struct sockaddr *sockaddr, socklen_t *addrlen, int flags)
      {
         printf("Calling accept4(): flags = %x", flags);
         if (flags != 0) {
             printf(" (");
             if (flags & SOCK_CLOEXEC)
                 printf("SOCK_CLOEXEC");
             if ((flags & SOCK_CLOEXEC) && (flags & SOCK_NONBLOCK))
                 printf(" ");
             if (flags & SOCK_NONBLOCK)
                 printf("SOCK_NONBLOCK");
             printf(")");
         }
         printf("\n");
      
      #if USE_SOCKETCALL
         long args[6];
      
         args[0] = fd;
         args[1] = (long) sockaddr;
         args[2] = (long) addrlen;
         args[3] = flags;
      
         return syscall(SYS_socketcall, SYS_ACCEPT4, args);
      #else
         return syscall(SYS_accept4, fd, sockaddr, addrlen, flags);
      #endif
      }
      
      /**********************************************************************/
      
      static int
      do_test(int lfd, struct sockaddr_in *conn_addr,
             int closeonexec_flag, int nonblock_flag)
      {
         int connfd, acceptfd;
         int fdf, flf, fdf_pass, flf_pass;
         struct sockaddr_in claddr;
         socklen_t addrlen;
      
         printf("=======================================\n");
      
         connfd = socket(AF_INET, SOCK_STREAM, 0);
         if (connfd == -1)
             die("socket");
         if (connect(connfd, (struct sockaddr *) conn_addr,
                     sizeof(struct sockaddr_in)) == -1)
             die("connect");
      
         addrlen = sizeof(struct sockaddr_in);
         acceptfd = accept4(lfd, (struct sockaddr *) &claddr, &addrlen,
                            closeonexec_flag | nonblock_flag);
         if (acceptfd == -1) {
             perror("accept4()");
             close(connfd);
             return 0;
         }
      
         fdf = fcntl(acceptfd, F_GETFD);
         if (fdf == -1)
             die("fcntl:F_GETFD");
         fdf_pass = ((fdf & FD_CLOEXEC) != 0) ==
                    ((closeonexec_flag & SOCK_CLOEXEC) != 0);
         printf("Close-on-exec flag is %sset (%s); ",
                 (fdf & FD_CLOEXEC) ? "" : "not ",
                 fdf_pass ? "OK" : "failed");
      
         flf = fcntl(acceptfd, F_GETFL);
         if (flf == -1)
             die("fcntl:F_GETFD");
         flf_pass = ((flf & O_NONBLOCK) != 0) ==
                    ((nonblock_flag & SOCK_NONBLOCK) !=0);
         printf("nonblock flag is %sset (%s)\n",
                 (flf & O_NONBLOCK) ? "" : "not ",
                 flf_pass ? "OK" : "failed");
      
         close(acceptfd);
         close(connfd);
      
         printf("Test result: %s\n", (fdf_pass && flf_pass) ? "PASS" : "FAIL");
         return fdf_pass && flf_pass;
      }
      
      static int
      create_listening_socket(int port_num)
      {
         struct sockaddr_in svaddr;
         int lfd;
         int optval;
      
         memset(&svaddr, 0, sizeof(struct sockaddr_in));
         svaddr.sin_family = AF_INET;
         svaddr.sin_addr.s_addr = htonl(INADDR_ANY);
         svaddr.sin_port = htons(port_num);
      
         lfd = socket(AF_INET, SOCK_STREAM, 0);
         if (lfd == -1)
             die("socket");
      
         optval = 1;
         if (setsockopt(lfd, SOL_SOCKET, SO_REUSEADDR, &optval,
                        sizeof(optval)) == -1)
             die("setsockopt");
      
         if (bind(lfd, (struct sockaddr *) &svaddr,
                  sizeof(struct sockaddr_in)) == -1)
             die("bind");
      
         if (listen(lfd, 5) == -1)
             die("listen");
      
         return lfd;
      }
      
      int
      main(int argc, char *argv[])
      {
         struct sockaddr_in conn_addr;
         int lfd;
         int port_num;
         int passed;
      
         passed = 1;
      
         port_num = (argc > 1) ? atoi(argv[1]) : PORT_NUM;
      
         memset(&conn_addr, 0, sizeof(struct sockaddr_in));
         conn_addr.sin_family = AF_INET;
         conn_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
         conn_addr.sin_port = htons(port_num);
      
         lfd = create_listening_socket(port_num);
      
         if (!do_test(lfd, &conn_addr, 0, 0))
             passed = 0;
         if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, 0))
             passed = 0;
         if (!do_test(lfd, &conn_addr, 0, SOCK_NONBLOCK))
             passed = 0;
         if (!do_test(lfd, &conn_addr, SOCK_CLOEXEC, SOCK_NONBLOCK))
             passed = 0;
      
         close(lfd);
      
         exit(passed ? EXIT_SUCCESS : EXIT_FAILURE);
      }
      
      [mtk.manpages@gmail.com: rewrote changelog, updated test program]
      Signed-off-by: NUlrich Drepper <drepper@redhat.com>
      Tested-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Cc: <linux-api@vger.kernel.org>
      Cc: <linux-arch@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de11defe