1. 03 Apr, 2018 - 19 commits
  2. 26 Mar, 2018 - 1 commit
  3. 18 Sep, 2017 - 1 commit
  4. 12 Sep, 2017 - 1 commit
  5. 01 Sep, 2017 - 2 commits
  6. 08 Aug, 2017 - 1 commit
      bpf: add support for sys_enter_* and sys_exit_* tracepoints · cf5f5cea
      Committed by Yonghong Song
      Currently, bpf programs cannot be attached to sys_enter_* and sys_exit_*
      style tracepoints. The iovisor/bcc issue #748
      (https://github.com/iovisor/bcc/issues/748) documents this issue.
      For example, if you try to attach a bpf program to the tracepoint
      syscalls/sys_enter_newfstat, you will get the following error:
         # ./tools/trace.py t:syscalls:sys_enter_newfstat
         Ioctl(PERF_EVENT_IOC_SET_BPF): Invalid argument
         Failed to attach BPF to tracepoint
      
      The main reason is that syscalls/sys_enter_* and syscalls/sys_exit_*
      tracepoints are treated differently from other tracepoints and there
      is no bpf hook for them.
      
      This patch adds bpf support for these syscalls tracepoints by
        . permitting bpf attachment in ioctl PERF_EVENT_IOC_SET_BPF
        . calling bpf programs in perf_syscall_enter and perf_syscall_exit
      
      The legality of bpf program ctx access is also checked.
      The function trace_event_get_offsets returns the correct max offset for
      each specific syscall tracepoint, which is compared against the maximum
      offset accessed by the bpf program.
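
      To illustrate what this enables, here is a minimal sketch of a bpf
      program for syscalls/sys_enter_newfstat.  It assumes present-day
      libbpf conventions (SEC(), bpf_helpers.h) and approximates the ctx
      layout from the tracepoint format file; neither is part of this
      patch.  Userspace would load the program and attach it to the
      tracepoint perf event with ioctl(PERF_EVENT_IOC_SET_BPF, prog_fd).

      	/* Sketch only: layout approximated from
      	 * /sys/kernel/debug/tracing/events/syscalls/sys_enter_newfstat/format */
      	#include <linux/bpf.h>
      	#include <bpf/bpf_helpers.h>

      	struct sys_enter_newfstat_ctx {
      		unsigned long long common;	/* common trace entry fields */
      		int syscall_nr;
      		int pad;
      		unsigned long fd;		/* syscall args start here */
      		unsigned long statbuf;
      	};

      	SEC("tracepoint/syscalls/sys_enter_newfstat")
      	int on_newfstat(struct sys_enter_newfstat_ctx *ctx)
      	{
      		char fmt[] = "newfstat fd %ld\n";

      		bpf_trace_printk(fmt, sizeof(fmt), ctx->fd);
      		return 0;
      	}

      	char _license[] SEC("license") = "GPL";
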
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 08 Jul, 2017 - 1 commit
      x86/syscalls: Check address limit on user-mode return · 5ea0727b
      Committed by Thomas Garnier
      Ensure the address limit is a user-mode segment before returning to
      user-mode. Otherwise a process can corrupt kernel-mode memory and elevate
      privileges [1].
      
      The set_fs function sets the TIF_SETFS flag to force a slow path on
      return. In the slow path, the address limit is checked to be USER_DS if
      needed.
      
      The addr_limit_user_check function is added as a cross-architecture
      function to check the address limit.
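
      A simplified sketch of what such a helper can look like is below.
      The names are taken from the commit text and the body is an
      approximation for illustration, not the exact merged code:

      	/* Called on the return-to-user slow path (sketch). */
      	static inline void addr_limit_user_check(void)
      	{
      	#ifdef TIF_SETFS
      		/* Only pay the cost when set_fs() flagged this task. */
      		if (!test_thread_flag(TIF_SETFS))
      			return;
      	#endif
      		/* Returning to user mode with a kernel address limit is fatal. */
      		if (WARN_ON(!segment_eq(get_fs(), USER_DS)))
      			force_sig(SIGKILL, current);
      	#ifdef TIF_SETFS
      		clear_thread_flag(TIF_SETFS);
      	#endif
      	}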
      
      [1] https://bugs.chromium.org/p/project-zero/issues/detail?id=990
      Signed-off-by: Thomas Garnier <thgarnie@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Pratyush Anand <panand@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Will Drewry <wad@chromium.org>
      Cc: linux-api@vger.kernel.org
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Link: http://lkml.kernel.org/r/20170615011203.144108-1-thgarnie@google.com
  8. 28 May, 2017 - 1 commit
  9. 03 Mar, 2017 - 1 commit
      statx: Add a system call to make enhanced file info available · a528d35e
      Committed by David Howells
      Add a system call to make extended file information available, including
      file creation and some attribute flags where available through the
      underlying filesystem.
      
      The getattr inode operation is altered to take two additional arguments: a
      u32 request_mask and an unsigned int flags that indicate the
      synchronisation mode.  This change is propagated to the vfs_getattr*()
      function.
      
      Functions like vfs_stat() are now inline wrappers around new functions
      vfs_statx() and vfs_statx_fd() to reduce stack usage.
      
      ========
      OVERVIEW
      ========
      
      The idea was initially proposed as a set of xattrs that could be retrieved
      with getxattr(), but the general preference proved to be for a new syscall
      with an extended stat structure.
      
      A number of requests were gathered for features to be included.  The
      following have been included:
      
       (1) Make the fields a consistent size on all arches and make them large.
      
       (2) Spare space, request flags and information flags are provided for
           future expansion.
      
       (3) Better support for the y2038 problem [Arnd Bergmann] (tv_sec is an
           __s64).
      
       (4) Creation time: The SMB protocol carries the creation time, which could
           be exported by Samba, which will in turn help CIFS make use of
           FS-Cache as that can be used for coherency data (stx_btime).
      
           This is also specified in NFSv4 as a recommended attribute and could
           be exported by NFSD [Steve French].
      
       (5) Lightweight stat: Ask for just those details of interest, and allow a
           netfs (such as NFS) to approximate anything not of interest, possibly
           without going to the server [Trond Myklebust, Ulrich Drepper, Andreas
           Dilger] (AT_STATX_DONT_SYNC).
      
       (6) Heavyweight stat: Force a netfs to go to the server, even if it thinks
           its cached attributes are up to date [Trond Myklebust]
           (AT_STATX_FORCE_SYNC).
      
      And the following have been left out for future extension:
      
       (7) Data version number: Could be used by userspace NFS servers [Aneesh
           Kumar].
      
           Can also be used to modify fill_post_wcc() in NFSD which retrieves
           i_version directly, but has just called vfs_getattr().  It could get
           it from the kstat struct if it used vfs_xgetattr() instead.
      
           (There's disagreement on the exact semantics of a single field, since
           not all filesystems do this the same way).
      
       (8) BSD stat compatibility: Including more fields from the BSD stat such
           as creation time (st_btime) and inode generation number (st_gen)
           [Jeremy Allison, Bernd Schubert].
      
       (9) Inode generation number: Useful for FUSE and userspace NFS servers
           [Bernd Schubert].
      
           (This was asked for but later deemed unnecessary with the
           open-by-handle capability available and caused disagreement as to
           whether it's a security hole or not).
      
      (10) Extra coherency data may be useful in making backups [Andreas Dilger].
      
           (No particular data were offered, but things like last backup
           timestamp, the data version number and the DOS archive bit would come
           into this category).
      
      (11) Allow the filesystem to indicate what it can/cannot provide: A
           filesystem can now say it doesn't support a standard stat feature if
           that isn't available, so if, for instance, inode numbers or UIDs don't
           exist or are fabricated locally...
      
           (This requires a separate system call - I have an fsinfo() call idea
           for this).
      
      (12) Store a 16-byte volume ID in the superblock that can be returned in
           struct xstat [Steve French].
      
           (Deferred to fsinfo).
      
      (13) Include granularity fields in the time data to indicate the
           granularity of each of the times (NFSv4 time_delta) [Steve French].
      
           (Deferred to fsinfo).
      
      (14) FS_IOC_GETFLAGS value.  These could be translated to BSD's st_flags.
           Note that the Linux IOC flags are a mess and filesystems such as Ext4
           define flags that aren't in linux/fs.h, so translation in the kernel
           may be a necessity (or, possibly, we provide the filesystem type too).
      
           (Some attributes are made available in stx_attributes, but the general
           feeling was that the IOC flags were too ext[234]-specific and shouldn't
           be exposed through statx this way).
      
      (15) Mask of features available on file (eg: ACLs, seclabel) [Brad Boyer,
           Michael Kerrisk].
      
           (Deferred, probably to fsinfo.  Finding out if there's an ACL or
           seclabel might require extra filesystem operations).
      
      (16) Femtosecond-resolution timestamps [Dave Chinner].
      
           (A __reserved field has been left in the statx_timestamp struct for
           this - if there proves to be a need).
      
      (17) A set multiple attributes syscall to go with this.
      
      ===============
      NEW SYSTEM CALL
      ===============
      
      The new system call is:
      
      	int ret = statx(int dfd,
      			const char *filename,
      			unsigned int flags,
      			unsigned int mask,
      			struct statx *buffer);
      
      The dfd, filename and flags parameters indicate the file to query, in a
      similar way to fstatat().  There is no equivalent of lstat() as that can be
      emulated with statx() by passing AT_SYMLINK_NOFOLLOW in flags.  There is
      also no equivalent of fstat() as that can be emulated by passing a NULL
      filename to statx() with the fd of interest in dfd.
      
      Whether or not statx() synchronises the attributes with the backing store
      can be controlled by OR'ing a value into the flags argument (this typically
      only affects network filesystems):
      
       (1) AT_STATX_SYNC_AS_STAT tells statx() to behave as stat() does in this
           respect.
      
       (2) AT_STATX_FORCE_SYNC will require a network filesystem to synchronise
           its attributes with the server - which might require data writeback to
           occur to get the timestamps correct.
      
       (3) AT_STATX_DONT_SYNC will suppress synchronisation with the server in a
           network filesystem.  The resulting values should be considered
           approximate.
      
      mask is a bitmask indicating the fields in struct statx that are of
      interest to the caller.  The user should set this to STATX_BASIC_STATS to
      get the basic set returned by stat().  It should be noted that asking for
      more information may entail extra I/O operations.
      
      buffer points to the destination for the data.  This must be 256 bytes in
      size.
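
      As a usage illustration (assuming a libc that exposes the statx()
      wrapper and the STATX_* constants; on older systems the raw
      syscall would be invoked via syscall(2) instead), a minimal caller
      might look like:

      	#define _GNU_SOURCE
      	#include <sys/stat.h>
      	#include <fcntl.h>
      	#include <stdio.h>

      	int main(int argc, char **argv)
      	{
      		struct statx stx;

      		if (argc < 2)
      			return 1;
      		/* Ask for the basic stat fields plus the creation time. */
      		if (statx(AT_FDCWD, argv[1], AT_STATX_SYNC_AS_STAT,
      			  STATX_BASIC_STATS | STATX_BTIME, &stx) == -1) {
      			perror("statx");
      			return 1;
      		}
      		printf("size=%llu blocks=%llu\n",
      		       (unsigned long long)stx.stx_size,
      		       (unsigned long long)stx.stx_blocks);
      		return 0;
      	}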
      
      ======================
      MAIN ATTRIBUTES RECORD
      ======================
      
      The following structures are defined in which to return the main attribute
      set:
      
      	struct statx_timestamp {
      		__s64	tv_sec;
      		__s32	tv_nsec;
      		__s32	__reserved;
      	};
      
      	struct statx {
      		__u32	stx_mask;
      		__u32	stx_blksize;
      		__u64	stx_attributes;
      		__u32	stx_nlink;
      		__u32	stx_uid;
      		__u32	stx_gid;
      		__u16	stx_mode;
      		__u16	__spare0[1];
      		__u64	stx_ino;
      		__u64	stx_size;
      		__u64	stx_blocks;
      		__u64	__spare1[1];
      		struct statx_timestamp	stx_atime;
      		struct statx_timestamp	stx_btime;
      		struct statx_timestamp	stx_ctime;
      		struct statx_timestamp	stx_mtime;
      		__u32	stx_rdev_major;
      		__u32	stx_rdev_minor;
      		__u32	stx_dev_major;
      		__u32	stx_dev_minor;
      		__u64	__spare2[14];
      	};
      
      The defined bits in request_mask and stx_mask are:
      
      	STATX_TYPE		Want/got stx_mode & S_IFMT
      	STATX_MODE		Want/got stx_mode & ~S_IFMT
      	STATX_NLINK		Want/got stx_nlink
      	STATX_UID		Want/got stx_uid
      	STATX_GID		Want/got stx_gid
      	STATX_ATIME		Want/got stx_atime{,_ns}
      	STATX_MTIME		Want/got stx_mtime{,_ns}
      	STATX_CTIME		Want/got stx_ctime{,_ns}
      	STATX_INO		Want/got stx_ino
      	STATX_SIZE		Want/got stx_size
      	STATX_BLOCKS		Want/got stx_blocks
      	STATX_BASIC_STATS	[The stuff in the normal stat struct]
      	STATX_BTIME		Want/got stx_btime{,_ns}
      	STATX_ALL		[All currently available stuff]
      
      stx_btime is the file creation time, stx_mask is a bitmask indicating the
      data provided and __spares*[] are where as-yet undefined fields can be
      placed.
      
      Time fields are structures with separate seconds and nanoseconds fields
      plus a reserved field in case we want to add even finer resolution.  Note
      that times will be negative if before 1970; in such a case, the nanosecond
      fields will also be negative if not zero.
      
      The bits defined in the stx_attributes field convey information about a
      file, how it is accessed, where it is and what it does.  The following
      attributes map to FS_*_FL flags and are the same numerical value:
      
      	STATX_ATTR_COMPRESSED		File is compressed by the fs
      	STATX_ATTR_IMMUTABLE		File is marked immutable
      	STATX_ATTR_APPEND		File is append-only
      	STATX_ATTR_NODUMP		File is not to be dumped
      	STATX_ATTR_ENCRYPTED		File requires key to decrypt in fs
      
      Within the kernel, the supported flags are listed by:
      
      	KSTAT_ATTR_FS_IOC_FLAGS
      
      [Are any other IOC flags of sufficient general interest to be exposed
      through this interface?]
      
      New flags include:
      
      	STATX_ATTR_AUTOMOUNT		Object is an automount trigger
      
      These are for the use of GUI tools that might want to mark files specially,
      depending on what they are.
      
      Fields in struct statx come in a number of classes:
      
       (0) stx_dev_*, stx_blksize.
      
           These are local system information and are always available.
      
       (1) stx_mode, stx_nlink, stx_uid, stx_gid, stx_[amc]time, stx_ino,
           stx_size, stx_blocks.
      
           These will be returned whether the caller asks for them or not.  The
           corresponding bits in stx_mask will be set to indicate whether they
           actually have valid values.
      
           If the caller didn't ask for them, then they may be approximated.  For
           example, NFS won't waste any time updating them from the server,
           unless as a byproduct of updating something requested.
      
           If the values don't actually exist for the underlying object (such as
           UID or GID on a DOS file), then the bit won't be set in the stx_mask,
           even if the caller asked for the value.  In such a case, the returned
           value will be a fabrication.
      
           Note that there are instances where the type might not be valid, for
           instance Windows reparse points.
      
       (2) stx_rdev_*.
      
           This will be set only if stx_mode indicates we're looking at a
           blockdev or a chardev, otherwise will be 0.
      
       (3) stx_btime.
      
           Similar to (1), except this will be set to 0 if it doesn't exist.
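
      Putting classes (1) and (3) together, a caller is expected to test
      the corresponding bit in stx_mask before trusting such a field.  A
      small sketch, operating on a struct statx already filled in by the
      call shown earlier:

      	#include <linux/stat.h>
      	#include <stdio.h>

      	static void report_optional_fields(const struct statx *stx)
      	{
      		if (stx->stx_mask & STATX_UID)
      			printf("uid=%u\n", stx->stx_uid);
      		else
      			printf("uid not provided; value is fabricated\n");

      		if (stx->stx_mask & STATX_BTIME)
      			printf("btime=%lld\n", (long long)stx->stx_btime.tv_sec);
      		else
      			printf("no creation time (stx_btime left zeroed)\n");
      	}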
      
      =======
      TESTING
      =======
      
      The following test program can be used to test the statx system call:
      
      	samples/statx/test-statx.c
      
      Just compile and run, passing it paths to the files you want to examine.
      The file is built automatically if CONFIG_SAMPLES is enabled.
      
      Here's some example output.  Firstly, an NFS directory that crosses to
      another FSID.  Note that the AUTOMOUNT attribute is set because transiting
      this directory will cause d_automount to be invoked by the VFS.
      
      	[root@andromeda ~]# /tmp/test-statx -A /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:26           Inode: 1703937     Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      	Attributes: 0000000000001000 (-------- -------- -------- -------- -------- -------- ---m---- --------)
      
      Secondly, the result of automounting on that directory.
      
      	[root@andromeda ~]# /tmp/test-statx /warthog/data
      	statx(/warthog/data) = 0
      	results=7ff
      	  Size: 4096            Blocks: 8          IO Block: 1048576  directory
      	Device: 00:27           Inode: 2           Links: 125
      	Access: (3777/drwxrwxrwx)  Uid:     0   Gid:  4041
      	Access: 2016-11-24 09:02:12.219699527+0000
      	Modify: 2016-11-17 10:44:36.225653653+0000
      	Change: 2016-11-17 10:44:36.225653653+0000
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  10. 18 Oct, 2016 - 1 commit
      generic syscalls: kill cruft from removed pkey syscalls · 71757904
      Committed by Dave Hansen
      pkey_set() and pkey_get() were syscalls present in older versions
      of the protection keys patches.  They were fully excised from the
      x86 code, but some cruft was left in the generic syscall code.  The
      C++ comments were intended to help to make it more glaring to me to
      fix them before actually submitting them.  That technique worked,
      but later than I would have liked.
      
      I test-compiled this for arm64.
      
      Fixes: a60f7b69 ("generic syscalls: Wire up memory protection keys syscalls")
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Acked-by: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: x86@kernel.org
      Cc: linux-arch@vger.kernel.org
      Cc: mgorman@techsingularity.net
      Cc: linux-api@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: luto@kernel.org
      Cc: akpm@linux-foundation.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 09 Sep, 2016 - 1 commit
  12. 21 May, 2016 - 1 commit
  13. 05 Mar, 2016 - 1 commit
  14. 02 Dec, 2015 - 1 commit
      vfs: add copy_file_range syscall and vfs helper · 29732938
      Committed by Zach Brown
      Add a copy_file_range() system call for offloading copies between
      regular files.
      
      This gives an interface to underlying layers of the storage stack which
      can copy without reading and writing all the data.  There are a few
      candidates that should support copy offloading in the nearer term:
      
      - btrfs shares extent references with its clone ioctl
      - NFS has patches to add a COPY command which copies on the server
      - SCSI has a family of XCOPY commands which copy in the device
      
      This system call avoids the complexity of also accelerating the creation
      of the destination file by operating on an existing destination file
      descriptor, not a path.
      
      Currently the high level vfs entry point limits copy offloading to files
      on the same mount and super (and not in the same file).  This can be
      relaxed if we get implementations which can copy between file systems
      safely.
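
      As a usage illustration (assuming a libc that exposes a
      copy_file_range() wrapper; otherwise the raw syscall would be
      used), a whole-file copy might look like:

      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <fcntl.h>
      	#include <stdio.h>

      	int main(int argc, char **argv)
      	{
      		ssize_t n;
      		int in, out;

      		if (argc < 3)
      			return 1;
      		in = open(argv[1], O_RDONLY);
      		out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
      		if (in < 0 || out < 0) {
      			perror("open");
      			return 1;
      		}
      		/* NULL offsets: copy from/to the current file positions. */
      		while ((n = copy_file_range(in, NULL, out, NULL, 1 << 20, 0)) > 0)
      			;
      		if (n < 0) {
      			perror("copy_file_range");
      			return 1;
      		}
      		return 0;
      	}
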
      Signed-off-by: Zach Brown <zab@redhat.com>
      [Anna Schumaker: Change -EINVAL to -EBADF during file verification,
                       Change flags parameter from int to unsigned int,
                       Add function to include/linux/syscalls.h,
                       Check copy len after file open mode,
                       Don't forbid ranges inside the same file,
                       Use rw_verify_area() to verify ranges,
                       Use file_out rather than file_in,
                       Add COPY_FR_REFLINK flag]
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  15. 25 Nov, 2015 - 1 commit
      arm64: fix building without CONFIG_UID16 · fbc416ff
      Committed by Arnd Bergmann
      As reported by Michal Simek, building an ARM64 kernel with CONFIG_UID16
      disabled currently fails because the system call table still needs to
      reference the individual function entry points that are provided by
      kernel/sys_ni.c in this case, and the declarations are hidden inside
      of #ifdef CONFIG_UID16:
      
      arch/arm64/include/asm/unistd32.h:57:8: error: 'sys_lchown16' undeclared here (not in a function)
       __SYSCALL(__NR_lchown, sys_lchown16)
      
      I believe this problem only exists on ARM64, because older architectures
      tend to not need declarations when their system call table is built
      in assembly code, while newer architectures tend to not need UID16
      support. ARM64 only uses these system calls for compatibility with
      32-bit ARM binaries.
      
      This changes the CONFIG_UID16 check into CONFIG_HAVE_UID16, which is
      set unconditionally on ARM64 with CONFIG_COMPAT, so we see the
      declarations whenever we need them, but otherwise the behavior is
      unchanged.
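
      The shape of the change, roughly (an illustrative excerpt, not the
      exact hunk): the sys_*16 declarations are guarded by the
      architecture capability rather than by the user-visible option,
      e.g.

      	#ifdef CONFIG_HAVE_UID16
      	asmlinkage long sys_lchown16(const char __user *filename,
      				     old_uid_t user, old_gid_t group);
      	/* ... remaining sys_*16 declarations ... */
      	#endif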
      
      Fixes: af1839eb ("Kconfig: clean up the long arch list for the UID16 config option")
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
  16. 06 Nov, 2015 - 1 commit
  17. 12 Sep, 2015 - 1 commit
      sys_membarrier(): system-wide memory barrier (generic, x86) · 5b25b13a
      Committed by Mathieu Desnoyers
      Here is an implementation of a new system call, sys_membarrier(), which
      executes a memory barrier on all threads running on the system.  It is
      implemented by calling synchronize_sched().  It can be used to
      distribute the cost of user-space memory barriers asymmetrically by
      transforming pairs of memory barriers into pairs consisting of
      sys_membarrier() and a compiler barrier.  For synchronization primitives
      that distinguish between read-side and write-side (e.g.  userspace RCU
      [1], rwlocks), the read-side can be accelerated significantly by moving
      the bulk of the memory barrier overhead to the write-side.
      
      The existing applications of which I am aware that would be improved by
      this system call are as follows:
      
      * Through Userspace RCU library (http://urcu.so)
        - DNS server (Knot DNS) https://www.knot-dns.cz/
        - Network sniffer (http://netsniff-ng.org/)
        - Distributed object storage (https://sheepdog.github.io/sheepdog/)
        - User-space tracing (http://lttng.org)
        - Network storage system (https://www.gluster.org/)
        - Virtual routers (https://events.linuxfoundation.org/sites/events/files/slides/DPDK_RCU_0MQ.pdf)
        - Financial software (https://lkml.org/lkml/2015/3/23/189)
      
      Those projects use RCU in userspace to increase read-side speed and
      scalability compared to locking.  Especially in the case of RCU used by
      libraries, sys_membarrier can speed up the read-side by moving the bulk of
      the memory barrier cost to synchronize_rcu().
      
      * Direct users of sys_membarrier
        - core dotnet garbage collector (https://github.com/dotnet/coreclr/issues/198)
      
      Microsoft core dotnet GC developers are planning to use the mprotect()
      side-effect of issuing memory barriers through IPIs as a way to implement
      Windows FlushProcessWriteBuffers() on Linux.  They are referring to
      sys_membarrier in their github thread, specifically stating that
      sys_membarrier() is what they are looking for.
      
      To explain the benefit of this scheme, let's introduce two example threads:
      
      Thread A (non-frequent, e.g. executing liburcu synchronize_rcu())
      Thread B (frequent, e.g. executing liburcu
      rcu_read_lock()/rcu_read_unlock())
      
      In a scheme where all smp_mb() in thread A are ordering memory accesses
      with respect to smp_mb() present in Thread B, we can change each
      smp_mb() within Thread A into calls to sys_membarrier() and each
      smp_mb() within Thread B into compiler barriers "barrier()".
      
      Before the change, we had, for each smp_mb() pairs:
      
      Thread A                    Thread B
      previous mem accesses       previous mem accesses
      smp_mb()                    smp_mb()
      following mem accesses      following mem accesses
      
      After the change, these pairs become:
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      As we can see, there are two possible scenarios: either Thread B memory
      accesses do not happen concurrently with Thread A accesses (1), or they
      do (2).
      
      1) Non-concurrent Thread A vs Thread B accesses:
      
      Thread A                    Thread B
      prev mem accesses
      sys_membarrier()
      follow mem accesses
                                  prev mem accesses
                                  barrier()
                                  follow mem accesses
      
      In this case, thread B accesses will be weakly ordered. This is OK,
      because at that point, thread A is not particularly interested in
      ordering them with respect to its own accesses.
      
      2) Concurrent Thread A vs Thread B accesses
      
      Thread A                    Thread B
      prev mem accesses           prev mem accesses
      sys_membarrier()            barrier()
      follow mem accesses         follow mem accesses
      
      In this case, thread B accesses, which are ensured to be in program
      order thanks to the compiler barrier, will be "upgraded" to full
      smp_mb() by synchronize_sched().
      
      * Benchmarks
      
      On Intel Xeon E5405 (8 cores)
      (one thread is calling sys_membarrier, the other 7 threads are busy
      looping)
      
      1000 non-expedited sys_membarrier calls in 33s = 33 milliseconds/call.
      
      * User-space user of this system call: Userspace RCU library
      
      Both the signal-based and the sys_membarrier userspace RCU schemes
      permit us to remove the memory barrier from the userspace RCU
      rcu_read_lock() and rcu_read_unlock() primitives, thus significantly
      accelerating them. These memory barriers are replaced by compiler
      barriers on the read-side, and all matching memory barriers on the
      write-side are turned into an invocation of a memory barrier on all
      active threads in the process. By letting the kernel perform this
      synchronization rather than dumbly sending a signal to every thread of
      the process (as we currently do), we diminish the number of unnecessary
      wake-ups and only issue the memory barriers on active threads. Non-running
      threads do not need to execute such barriers anyway, because these are
      implied by the scheduler context switches.
      
      Results in liburcu:
      
      Operations in 10s, 6 readers, 2 writers:
      
      memory barriers in reader:    1701557485 reads, 2202847 writes
      signal-based scheme:          9830061167 reads,    6700 writes
      sys_membarrier:               9952759104 reads,     425 writes
      sys_membarrier (dyn. check):  7970328887 reads,     425 writes
      
      The dynamic sys_membarrier availability check adds some overhead to
      the read-side compared to the signal-based scheme, but besides that,
      sys_membarrier slightly outperforms the signal-based scheme. However,
      this non-expedited sys_membarrier implementation has a much slower grace
      period than signal and memory barrier schemes.
      
      Besides diminishing the number of wake-ups, one major advantage of the
      membarrier system call over the signal-based scheme is that it does not
      need to reserve a signal. This plays much more nicely with libraries,
      and with processes injected into for tracing purposes, for which we
      cannot expect that signals will be unused by the application.
      
      An expedited version of this system call can be added later on to speed
      up the grace period. Its implementation will likely depend on reading
      the cpu_curr()->mm without holding each CPU's rq lock.
      
      This patch adds the system call to x86 and to asm-generic.
      
      [1] http://urcu.so
      
      membarrier(2) man page:
      
      MEMBARRIER(2)              Linux Programmer's Manual             MEMBARRIER(2)
      
      NAME
             membarrier - issue memory barriers on a set of threads
      
      SYNOPSIS
             #include <linux/membarrier.h>
      
             int membarrier(int cmd, int flags);
      
      DESCRIPTION
             The cmd argument is one of the following:
      
             MEMBARRIER_CMD_QUERY
                    Query  the  set  of  supported commands. It returns a bitmask of
                    supported commands.
      
             MEMBARRIER_CMD_SHARED
                    Execute a memory barrier on all threads running on  the  system.
                    Upon  return from system call, the caller thread is ensured that
                    all running threads have passed through a state where all memory
                    accesses  to  user-space  addresses  match program order between
                    entry to and return from the system  call  (non-running  threads
                     are de facto in such a state). This covers threads from all
                     processes running on the system.  This command returns 0.
      
             The flags argument needs to be 0. For future extensions.
      
             All memory accesses performed  in  program  order  from  each  targeted
       thread are guaranteed to be ordered with respect to sys_membarrier(). If
             we use the semantic "barrier()" to represent a compiler barrier forcing
             memory  accesses  to  be performed in program order across the barrier,
             and smp_mb() to represent explicit memory barriers forcing full  memory
             ordering  across  the barrier, we have the following ordering table for
             each pair of barrier(), sys_membarrier() and smp_mb():
      
             The pair ordering is detailed as (O: ordered, X: not ordered):
      
                                    barrier()   smp_mb() sys_membarrier()
                    barrier()          X           X            O
                    smp_mb()           X           O            O
                    sys_membarrier()   O           O            O
      
      RETURN VALUE
             On success, these system calls return zero.  On error, -1 is  returned,
             and errno is set appropriately. For a given command, with flags
             argument set to 0, this system call is guaranteed to always return the
             same value until reboot.
      
      ERRORS
             ENOSYS System call is not implemented.
      
             EINVAL Invalid arguments.
      
      Linux                             2015-04-15                     MEMBARRIER(2)
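
      A minimal user of the new call, as a sketch (no libc wrapper
      existed when this was merged, so the raw syscall is used; the
      MEMBARRIER_CMD_* constants come from <linux/membarrier.h>):

      	#define _GNU_SOURCE
      	#include <linux/membarrier.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <stdio.h>

      	static int membarrier(int cmd, int flags)
      	{
      		return syscall(__NR_membarrier, cmd, flags);
      	}

      	int main(void)
      	{
      		int cmds = membarrier(MEMBARRIER_CMD_QUERY, 0);

      		if (cmds < 0 || !(cmds & MEMBARRIER_CMD_SHARED)) {
      			fprintf(stderr, "membarrier not available\n");
      			return 1;
      		}
      		/* Pairs with barrier() on the fast-path side, as described above. */
      		membarrier(MEMBARRIER_CMD_SHARED, 0);
      		return 0;
      	}
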
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Reviewed-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Josh Triplett <josh@joshtriplett.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Nicholas Miell <nmiell@comcast.net>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alan Cox <gnomes@lxorguk.ukuu.org.uk>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  18. 05 Sep, 2015 - 1 commit
  19. 26 Jun, 2015 - 1 commit
      clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Committed by Josh Triplett
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
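
      The resulting arch hook, per the description above, looks roughly
      like this (declaration sketch; the do_fork()/copy_process()
      plumbing that calls it is omitted):

      	/* tls now arrives as a normal C argument for opted-in architectures. */
      	#ifdef CONFIG_HAVE_COPY_THREAD_TLS
      	extern int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
      				   unsigned long arg, struct task_struct *p,
      				   unsigned long tls);
      	#else
      	extern int copy_thread(unsigned long clone_flags, unsigned long sp,
      			       unsigned long arg, struct task_struct *p);
      	#endif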
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 14 May, 2015 - 1 commit
  21. 27 Jan, 2015 - 1 commit