1. 20 Apr, 2018 4 commits
    • A
      y2038: ipc: Enable COMPAT_32BIT_TIME · b0d17578
      Arnd Bergmann committed
      Three ipc syscalls (mq_timedsend, mq_timedreceive and semtimedop)
      take a timespec argument. After we move 32-bit architectures over to
      using 64-bit time_t based syscalls, we need separate entry points for
      the old 32-bit based interfaces.

      This changes the #ifdef guards for the existing 32-bit compat syscalls
      to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
      enabled on all existing 32-bit architectures.

      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • A
      y2038: ipc: Use __kernel_timespec · 21fc538d
      Arnd Bergmann committed
      This is a preparation for changing over __kernel_timespec to 64-bit
      times, which involves assigning new system call numbers for mq_timedsend(),
      mq_timedreceive() and semtimedop() for compatibility with future
      y2038-proof user space.

      The existing ABIs will remain available through compat code.

      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    • A
      y2038: ipc: Report long times to user space · c2ab975c
      Arnd Bergmann committed
      The shmid64_ds/semid64_ds/msqid64_ds data structures have been extended
      to contain extra fields for storing the upper bits of the time stamps;
      this patch does the other half of the job and fills the new fields on
      32-bit architectures as well as 32-bit tasks running on a 64-bit kernel
      in compat mode.

      There should be no change for native 64-bit tasks.

      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
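
The idea of carrying the upper bits of a time stamp in a separate field can be sketched in user space as follows. This is only an illustration, not the kernel's code: `struct split_time`, `split_time64()` and `join_time64()` are hypothetical names, and the two-word layout is an assumption chosen to mirror the "low word plus extra high word" description above.

```c
#include <stdint.h>

/* Hypothetical layout: one 64-bit time stamp stored as two 32-bit words. */
struct split_time {
	uint32_t lo;   /* low 32 bits, where old 32-bit user space already looks */
	uint32_t hi;   /* upper bits, stored in the newly added field */
};

/* Split a 64-bit seconds value into its low and high 32-bit halves. */
struct split_time split_time64(int64_t t)
{
	struct split_time s;

	s.lo = (uint32_t)((uint64_t)t & 0xffffffffu);
	s.hi = (uint32_t)((uint64_t)t >> 32);
	return s;
}

/* Reassemble the 64-bit value, as new user space would do. */
int64_t join_time64(struct split_time s)
{
	return (int64_t)(((uint64_t)s.hi << 32) | s.lo);
}
```

Old user space that reads only the low word keeps seeing the same value it saw before, while new user space can combine both words to get the full range.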
    • A
      y2038: ipc: Use ktime_get_real_seconds consistently · 2a70b787
      Arnd Bergmann committed
      In some places, we still used get_seconds() instead of
      ktime_get_real_seconds(), and I'm changing the remaining ones now to
      all use ktime_get_real_seconds() so we use the full available range for
      timestamps instead of overflowing the 'unsigned long' return value in
      year 2106 on 32-bit kernels.

      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
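
The year 2106 figure follows from the size of a 32-bit unsigned counter of seconds since 1970. A small user-space check, assuming a platform with a 64-bit `time_t` (the function names are made up for this sketch):

```c
#include <stdint.h>
#include <time.h>

/* Return the UTC calendar year in which a 32-bit unsigned seconds counter
 * (the size of 'unsigned long' on a 32-bit kernel) reaches its maximum.
 * Assumes time_t is 64 bits wide; returns -1 if the conversion fails. */
int u32_seconds_last_year(void)
{
	time_t t = (time_t)UINT32_MAX;   /* 4294967295 seconds after the epoch */
	struct tm *tm = gmtime(&t);

	if (tm == NULL)
		return -1;
	return tm->tm_year + 1900;
}

/* Truncating a 64-bit seconds count to 32 bits wraps back to small values. */
uint32_t truncate_seconds(int64_t seconds)
{
	return (uint32_t)seconds;
}
```

One second past the maximum, the truncated value starts again from zero, which is the overflow the commit avoids by using a 64-bit return type.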
  2. 14 Apr, 2018 1 commit
    • E
      ipc/shm: fix use-after-free of shm file via remap_file_pages() · 3f05317d
      Eric Biggers committed
      syzbot reported a use-after-free of shm_file_data(file)->file->f_op in
      shm_get_unmapped_area(), called via sys_remap_file_pages().
      
      Unfortunately it couldn't generate a reproducer, but I found a bug which
      I think caused it.  When remap_file_pages() is passed a full System V
      shared memory segment, the memory is first unmapped, then a new map is
      created using the ->vm_file.  Between these steps, the shm ID can be
      removed and reused for a new shm segment.  But, shm_mmap() only checks
      whether the ID is currently valid before calling the underlying file's
      ->mmap(); it doesn't check whether it was reused.  Thus it can use the
      wrong underlying file, one that was already freed.
      
      Fix this by making the "outer" shm file (the one that gets put in
      ->vm_file) hold a reference to the real shm file, and by making
      __shm_open() require that the file associated with the shm ID matches
      the one associated with the "outer" file.
      
      Taking the reference to the real shm file is needed to fully solve the
      problem, since otherwise sfd->file could point to a freed file, which
      then could be reallocated for the reused shm ID, causing the wrong shm
      segment to be mapped (and without the required permission checks).
      
      Commit 1ac0b6de ("ipc/shm: handle removed segments gracefully in
      shm_mmap()") almost fixed this bug, but it didn't go far enough because
      it didn't consider the case where the shm ID is reused.
      
      The following program usually reproduces this bug:
      
      	#include <stdlib.h>
      	#include <sys/shm.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      
      	int main()
      	{
      		int is_parent = (fork() != 0);
      		srand(getpid());
      		for (;;) {
      			int id = shmget(0xF00F, 4096, IPC_CREAT|0700);
      			if (is_parent) {
      				void *addr = shmat(id, NULL, 0);
      				usleep(rand() % 50);
      				while (!syscall(__NR_remap_file_pages, addr, 4096, 0, 0, 0));
      			} else {
      				usleep(rand() % 50);
      				shmctl(id, IPC_RMID, NULL);
      			}
      		}
      	}
      
      It causes the following NULL pointer dereference due to a 'struct file'
      being used while it's being freed.  (I couldn't actually get a KASAN
      use-after-free splat like in the syzbot report.  But I think it's
      possible with this bug; it would just take a more extraordinary race...)
      
      	BUG: unable to handle kernel NULL pointer dereference at 0000000000000058
      	PGD 0 P4D 0
      	Oops: 0000 [#1] SMP NOPTI
      	CPU: 9 PID: 258 Comm: syz_ipc Not tainted 4.16.0-05140-gf8cf2f16 #189
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
      	RIP: 0010:d_inode include/linux/dcache.h:519 [inline]
      	RIP: 0010:touch_atime+0x25/0xd0 fs/inode.c:1724
      	[...]
      	Call Trace:
      	 file_accessed include/linux/fs.h:2063 [inline]
      	 shmem_mmap+0x25/0x40 mm/shmem.c:2149
      	 call_mmap include/linux/fs.h:1789 [inline]
      	 shm_mmap+0x34/0x80 ipc/shm.c:465
      	 call_mmap include/linux/fs.h:1789 [inline]
      	 mmap_region+0x309/0x5b0 mm/mmap.c:1712
      	 do_mmap+0x294/0x4a0 mm/mmap.c:1483
      	 do_mmap_pgoff include/linux/mm.h:2235 [inline]
      	 SYSC_remap_file_pages mm/mmap.c:2853 [inline]
      	 SyS_remap_file_pages+0x232/0x310 mm/mmap.c:2769
      	 do_syscall_64+0x64/0x1a0 arch/x86/entry/common.c:287
      	 entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
      [ebiggers@google.com: add comment]
        Link: http://lkml.kernel.org/r/20180410192850.235835-1-ebiggers3@gmail.com
      Link: http://lkml.kernel.org/r/20180409043039.28915-1-ebiggers3@gmail.com
      Reported-by: syzbot+d11f321e7f1923157eac80aa990b446596f46439@syzkaller.appspotmail.com
      Fixes: c8d78c18 ("mm: replace remap_file_pages() syscall with emulation")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 12 Apr, 2018 5 commits
    • A
      ipc/shm.c: shm_split(): remove unneeded test for NULL shm_file_data.vm_ops · a61fc2cb
      Andrew Morton committed
      This was added by the recent "ipc/shm.c: add split function to
      shm_vm_ops", but it is not necessary.

      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      ipc/msg: introduce msgctl(MSG_STAT_ANY) · 23c8cec8
      Davidlohr Bueso committed
      There is a permission discrepancy when consulting msq ipc object
      metadata between /proc/sysvipc/msg (0444) and the MSG_STAT msgctl
      command.  The latter does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCES is returned via syscall but the
      info is displayed anyway in the procfs files.

      While this might have security implications via info leaking (albeit no
      writing to the msq metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      oversight - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (e.g.
      ipcs).  Some applications need the procfs info (without root
      privileges), and parsing it can be rather slow in comparison with a
      syscall -- up to 500x in some reported cases for shm.
      
      This patch introduces a new MSG_STAT_ANY command such that the msq ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-4-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Robert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      ipc/sem: introduce semctl(SEM_STAT_ANY) · a280d6dc
      Davidlohr Bueso committed
      There is a permission discrepancy when consulting sem ipc object
      metadata between /proc/sysvipc/sem (0444) and the SEM_STAT semctl
      command.  The latter does permission checks for the object vs S_IRUGO.
      As such there can be cases where EACCES is returned via syscall but the
      info is displayed anyway in the procfs files.

      While this might have security implications via info leaking (albeit no
      writing to the sma metadata), this behavior goes way back and showing
      all the objects regardless of the permissions was most likely an
      oversight - so we are stuck with it.  Furthermore, modifying either the
      syscall or the procfs file can cause userspace programs to break (e.g.
      ipcs).  Some applications need the procfs info (without root
      privileges), and parsing it can be rather slow in comparison with a
      syscall -- up to 500x in some reported cases for shm.
      
      This patch introduces a new SEM_STAT_ANY command such that the sem ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-3-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Reported-by: Robert Kettler <robert.kettler@outlook.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      ipc/shm: introduce shmctl(SHM_STAT_ANY) · c21a6970
      Davidlohr Bueso committed
      Patch series "sysvipc: introduce STAT_ANY commands", v2.

      The following patches add the discussed (see [1]) new command for shm
      as well as for sems and msq as they are subject to the same
      discrepancies for ipc object permission checks between the syscall and
      via procfs.  These new commands are justified in that (1) we are stuck
      with these semantics as changing syscall and procfs can break userland;
      and (2) some users can benefit from performance (for large amounts of
      shm segments, for example) from not having to parse the procfs
      interface.

      Once merged, I will submit the necessary manpage updates.  But I'm thinking
      something like:
      
      : diff --git a/man2/shmctl.2 b/man2/shmctl.2
      : index 7bb503999941..bb00bbe21a57 100644
      : --- a/man2/shmctl.2
      : +++ b/man2/shmctl.2
      : @@ -41,6 +41,7 @@
      :  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
      :  .\"	attaches to a segment that has already been marked for deletion.
      :  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
      : +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
      :  .\"
      :  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
      :  .SH NAME
      : @@ -242,6 +243,18 @@ However, the
      :  argument is not a segment identifier, but instead an index into
      :  the kernel's internal array that maintains information about
      :  all shared memory segments on the system.
      : +.TP
      : +.BR SHM_STAT_ANY " (Linux-specific)"
      : +Return a
      : +.I shmid_ds
      : +structure as for
      : +.BR SHM_STAT .
      : +However, the
      : +.I shm_perm.mode
      : +is not checked for read access for
      : +.IR shmid ,
      : +resembing the behaviour of
      : +/proc/sysvipc/shm.
      :  .PP
      :  The caller can prevent or allow swapping of a shared
      :  memory segment with the following \fIcmd\fP values:
      : @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
      :  kernel's internal array recording information about all
      :  shared memory segments.
      :  (This information can be used with repeated
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operations to obtain information about all shared memory segments
      :  on the system.)
      :  A successful
      : @@ -328,7 +341,7 @@ isn't accessible.
      :  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
      :  is not a valid command.
      :  Or: for a
      : -.B SHM_STAT
      : +.B SHM_STAT/SHM_STAT_ANY
      :  operation, the index value specified in
      :  .I shmid
      :  referred to an array slot that is currently unused.
      
      This patch (of 3):
      
      There is a permission discrepancy when consulting shm ipc object metadata
      between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
      latter does permission checks for the object vs S_IRUGO.  As such there can
      be cases where EACCES is returned via syscall but the info is displayed
      anyway in the procfs files.

      While this might have security implications via info leaking (albeit no
      writing to the shm metadata), this behavior goes way back and showing all
      the objects regardless of the permissions was most likely an oversight - so
      we are stuck with it.  Furthermore, modifying either the syscall or the
      procfs file can cause userspace programs to break (e.g. ipcs).  Some
      applications need the procfs info (without root privileges), and parsing
      it can be rather slow in comparison with a syscall -- up to 500x in some
      reported cases.
      
      This patch introduces a new SHM_STAT_ANY command such that the shm ipc
      object permissions are ignored, and only audited instead.  In addition,
      I've left the lsm security hook checks in place, as if some policy can
      block the call, then the user has no other choice than just parsing the
      procfs file.
      
      [1] https://lkml.org/lkml/2017/12/19/220
      
      Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Robert Kettler <robert.kettler@outlook.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
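
From user space, the new command would be used the same way as SHM_STAT, with a kernel array index instead of a segment ID. The following is a minimal sketch, not taken from the patch: the wrapper names `shm_stat_any()` and `shm_highest_index()` are hypothetical, and the fallback value 15 for `SHM_STAT_ANY` is the one from the kernel's uapi header, supplied here only for C libraries whose headers do not define it yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_STAT_ANY
#define SHM_STAT_ANY 15   /* value from the kernel's uapi <linux/shm.h> */
#endif

/* Query the segment stored at kernel array slot 'index', ignoring the
 * segment's read permission bits. Returns the segment's shm ID on success,
 * or -1 with errno set (for example EINVAL for an unused slot or on a
 * kernel that predates SHM_STAT_ANY). */
int shm_stat_any(int index, struct shmid_ds *out)
{
	return shmctl(index, SHM_STAT_ANY, out);
}

/* Highest slot index currently in use, or -1 on error. */
int shm_highest_index(void)
{
	struct shm_info info;

	return shmctl(0, SHM_INFO, (struct shmid_ds *)&info);
}
```

A tool in the style of ipcs could loop from 0 to `shm_highest_index()` and call `shm_stat_any()` on each slot, skipping slots that fail, instead of parsing /proc/sysvipc/shm.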
    • A
      proc: move /proc/sysvipc creation to where it belongs · e74a0eff
      Alexey Dobriyan committed
      Move the proc_mkdir() call within the sysvipc subsystem such that we
      avoid polluting proc_root_init() with petty cpp.

      [dave@stgolabs.net: contributed changelog]
      Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 03 Apr, 2018 10 commits
  5. 29 Mar, 2018 2 commits
    • M
      ipc/shm.c: add split function to shm_vm_ops · 3d942ee0
      Mike Kravetz committed
      If System V shmget/shmat operations are used to create a hugetlbfs
      backed mapping, it is possible to munmap part of the mapping and split
      the underlying vma such that it is not huge page aligned.  This will
      ultimately result in the following BUG:
      
        kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
        Oops: Exception in kernel mode, sig: 5 [#1]
        LE SMP NR_CPUS=2048 NUMA PowerNV
        Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
        CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G         C  E 4.15.0-10-generic #11-Ubuntu
        NIP:  c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
        REGS: c000003fbcdcf810 TRAP: 0700   Tainted: G         C  E (4.15.0-10-generic)
        MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 20040000
        CFAR: c00000000036ee44 SOFTE: 1
        NIP __unmap_hugepage_range+0xa4/0x760
        LR __unmap_hugepage_range_final+0x28/0x50
        Call Trace:
          0x7115e4e00000 (unreliable)
          __unmap_hugepage_range_final+0x28/0x50
          unmap_single_vma+0x11c/0x190
          unmap_vmas+0x94/0x140
          exit_mmap+0x9c/0x1d0
          mmput+0xa8/0x1d0
          do_exit+0x360/0xc80
          do_group_exit+0x60/0x100
          SyS_exit_group+0x24/0x30
          system_call+0x58/0x6c
        ---[ end trace ee88f958a1c62605 ]---
      
      This bug was introduced by commit 31383c68 ("mm, hugetlbfs:
      introduce ->split() to vm_operations_struct").  A split function was
      added to vm_operations_struct to determine if a mapping can be split.
      This was mostly for device-dax and hugetlbfs mappings which have
      specific alignment constraints.
      
      Mappings initiated via shmget/shmat have their original vm_ops
      overwritten with shm_vm_ops.  shm_vm_ops functions will call back to the
      original vm_ops if needed.  Add such a split function to shm_vm_ops.
      
      Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
      Fixes: 31383c68 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Tested-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
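
The delegation described above, where the wrapping ops table forwards a split request to the original ops so that alignment constraints are still enforced, can be modelled in user space roughly like this. It is a toy model under stated assumptions, not the kernel's implementation: `struct toy_vm_ops`, `hugepage_split()` and `wrapper_split()` are hypothetical names, and the page size is passed in explicitly instead of being looked up from the mapping.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy ops table with an optional split hook. */
struct toy_vm_ops {
	int (*split)(uintptr_t addr, uintptr_t page_size);
};

/* Split hook for mappings with an alignment constraint: refuse to split
 * at an address that is not a multiple of page_size (a power of two). */
int hugepage_split(uintptr_t addr, uintptr_t page_size)
{
	if (addr & (page_size - 1))
		return -EINVAL;
	return 0;
}

/* Wrapper hook: call back into the original ops if they provide a split
 * function, otherwise allow the split. */
int wrapper_split(const struct toy_vm_ops *orig, uintptr_t addr,
		  uintptr_t page_size)
{
	if (orig->split)
		return orig->split(addr, page_size);
	return 0;
}
```

Without the wrapper hook, the original constraint is never consulted, which matches the failure mode in the report: a split at an unaligned address is accepted and only trips an assertion much later.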
    • E
      ipc/shm: Fix pid freeing. · 2236d4d3
      Eric W. Biederman committed
      The 0day kernel test robot reported an oops:
      >
      >  IP: put_pid+0x22/0x5c
      >  PGD 19efa067 P4D 19efa067 PUD 0
      >  Oops: 0000 [#1]
      >  CPU: 0 PID: 727 Comm: trinity Not tainted 4.16.0-rc2-00010-g98f929b1 #1
      >  RIP: 0010:put_pid+0x22/0x5c
      >  RSP: 0018:ffff986719f73e48 EFLAGS: 00010202
      >  RAX: 00000006d765f710 RBX: ffff98671a4fa4d0 RCX: ffff986719f73d40
      >  RDX: 000000006f6e6125 RSI: 0000000000000000 RDI: ffffffffa01e6d21
      >  RBP: ffffffffa0955fe0 R08: 0000000000000020 R09: 0000000000000000
      >  R10: 0000000000000078 R11: ffff986719f73e76 R12: 0000000000001000
      >  R13: 00000000ffffffea R14: 0000000054000fb0 R15: 0000000000000000
      >  FS:  00000000028c2880(0000) GS:ffffffffa06ad000(0000) knlGS:0000000000000000
      >  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      >  CR2: 0000000677846439 CR3: 0000000019fc1005 CR4: 00000000000606b0
      >  Call Trace:
      >   ? ipc_update_pid+0x36/0x3e
      >   ? newseg+0x34c/0x3a6
      >   ? ipcget+0x5d/0x528
      >   ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
      >   ? SyS_shmget+0x5a/0x84
      >   ? do_syscall_64+0x194/0x1b3
      >   ? entry_SYSCALL_64_after_hwframe+0x42/0xb7
      >  Code: ff 05 e7 20 9b 03 58 c9 c3 48 ff 05 85 21 9b 03 48 85 ff 74 4f 8b 47 04 8b 17 48 ff 05 7c 21 9b 03 48 83 c0 03 48 c1 e0 04 ff ca <48> 8b 44 07 08 74 1f 48 ff 05 6c 21 9b 03 ff 0f 0f 94 c2 48 ff
      >  RIP: put_pid+0x22/0x5c RSP: ffff986719f73e48
      >  CR2: 0000000677846439
      >  ---[ end trace ab8c5cb4389d37c5 ]---
      >  Kernel panic - not syncing: Fatal exception
      
      In newseg when changing shm_cprid and shm_lprid from pid_t to struct
      pid* I misread the kvmalloc as kvzalloc and thought shp was
      initialized to 0.  As that is not the case it is not safe for the
      error handling to address shm_cprid and shm_lprid before they are
      initialized.

      Therefore move the cleanup of shm_cprid and shm_lprid from the no_file
      error cleanup path to the no_id error cleanup path, ensuring that an
      early error exit won't cause the oops above.

      Reported-by: kernel test robot <fengguang.wu@intel.com>
      Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
      Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
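
The ordering problem can be illustrated with a user-space analogue, where `malloc()` plays the role of the non-zeroing allocator. Everything here is hypothetical (`struct toy_seg`, `toy_newseg()`, the `toy_fail` enum); only the label names `no_id` and `no_file` are borrowed from the commit message. The point of the sketch is that the early error path must not touch fields that have not been initialized yet.

```c
#include <stdlib.h>

struct toy_seg {
	int *creator;   /* stands in for shm_cprid */
	int *last;      /* stands in for shm_lprid */
};

enum toy_fail { TOY_OK, TOY_FAIL_FILE, TOY_FAIL_ID };

/* Build a segment, optionally simulating a failure at one of two stages. */
struct toy_seg *toy_newseg(enum toy_fail fail)
{
	/* malloc() does not zero the memory, so both fields are garbage here */
	struct toy_seg *seg = malloc(sizeof(*seg));

	if (!seg)
		return NULL;

	if (fail == TOY_FAIL_FILE)
		goto no_file;   /* fields not initialized: skip their cleanup */

	seg->creator = malloc(sizeof(int));
	seg->last = NULL;

	if (fail == TOY_FAIL_ID)
		goto no_id;

	return seg;

no_id:
	free(seg->creator);     /* safe: both fields were set above */
	free(seg->last);
no_file:
	free(seg);
	return NULL;
}

/* Release a successfully created segment. */
void toy_freeseg(struct toy_seg *seg)
{
	if (!seg)
		return;
	free(seg->creator);
	free(seg->last);
	free(seg);
}
```

Had the field cleanup lived under `no_file`, the early exit would have passed uninitialized pointers to `free()`, which is the user-space counterpart of the oops above.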
  6. 28 Mar, 2018 4 commits
    • E
      ipc: Directly call the security hook in ipc_ops.associate · 50ab44b1
      Eric W. Biederman committed
      After the last round of cleanups the shm, sem, and msg associate
      operations just became trivial wrappers around the appropriate security
      method.  Simplify things further by just calling the security method
      directly.

      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces · 51d6f263
      Eric W. Biederman committed
      Today the last process to update a semaphore is remembered and
      reported in the pid namespace of that process.  If there are processes
      in any other pid namespace querying that process id with GETPID the
      result will be unusable nonsense as it does not make any
      sense in your own pid namespace.

      Due to ipc_update_pid I don't think you will be able to get System V
      ipc semaphores into a troublesome cache line ping-pong.  Using struct
      pids from separate processes is not a problem because they do not share
      a cache line.  Using struct pid from different threads of the same
      process is unlikely to be a problem as the reference count update
      can be avoided.

      Further, Linux futexes are a much better tool for the job of mutual
      exclusion between processes than System V semaphores.  So I expect
      programs that are performance limited by their interprocess mutual
      exclusion primitive will be using futexes.

      So while it is possible that enhancing the storage of the last
      process of a System V semaphore from an integer to a struct pid
      will cause a performance regression because of the effect
      of frequently updating the pid reference count, I don't expect
      that to happen in practice.

      This change updates semctl(..., GETPID, ...) to return the
      process id of the last process to update a semaphore in the
      pid namespace of the calling process.

      Fixes: b488893a ("pid namespaces: changes to show virtual ids to user")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces · 39a4940e
      Eric W. Biederman committed
      Today msg_lspid and msg_lrpid are remembered in the pid namespace of
      the creator and the processes that last sent or received a sysvipc
      message.  If you have processes in multiple pid namespaces that is
      just wrong.  The process ids reported will not make the least bit of
      sense.

      This fix is slightly more susceptible to a performance problem than
      the related fix for System V shared memory.  By definition the pids
      are updated by msgsnd and msgrcv, the fast path of System V message
      queues.  The only concern over the previous implementation is the
      incrementing and decrementing of the pid reference count.  That is
      the only difference, and multiple updates of the task_tgid by
      threads in the same process have been shown in af_unix sockets to
      create a cache line ping-pong between cpus of the same processor.
      
      In this case I don't expect cache lines holding pid reference counts
      to ping pong between cpus.  As senders and receivers update different
      pids there is a natural separation there.  Further if multiple threads
      of the same process either send or receive messages the pid will be
      updated to the same value and ipc_update_pid will avoid the reference
      count update.
      
      Which means in the common case I expect msg_lspid and msg_lrpid to
      remain constant, and reference counts not to be updated when messages
      are sent.
      
      In rare cases it may be possible to trigger the issue which was
      observed for af_unix sockets, but it will require multiple processes
      with multiple threads to be either sending or receiving messages.  It
      just does not feel likely that anyone would do that in practice.
      
      This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and
      msg_lrpid in the pid namespace of the process calling stat.
      
      This change also updates cat /proc/sysvipc/msg to print msg_lspid
      and msg_lrpid in the pid namespace of the process that opened the proc
      file.

      Fixes: b488893a ("pid namespaces: changes to show virtual ids to user")
      Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces. · 98f929b1
      Eric W. Biederman committed
      Today shm_cpid and shm_lpid are remembered in the pid namespace of the
      creator and the processes that last touched a sysvipc shared memory
      segment.  If you have processes in multiple pid namespaces that
      is just wrong, and I don't know how this has been overlooked for
      so long.

      As only creation and shared memory attach and shared memory detach
      update the pids I do not expect there to be a repeat of the issues
      when struct pid was attached to each af_unix skb, which in some
      notable cases cut the performance in half.  The problem was threads of
      the same process updating the same struct pid from different cpus causing
      the cache line to be highly contended and bounce between cpus.

      As creation, attach, and detach are expected to be rare operations for
      sysvipc shared memory segments I do not expect that kind of cache line
      ping pong to cause problems.  In addition because the pid is at a fixed
      location in the structure instead of being dynamic on a skb, the
      reference count of the pid does not need to be updated on each
      operation if the pid is the same.  This ability to simply skip the pid
      reference count changes if the pid is unchanging further reduces the
      likelihood of a cache line holding a pid reference count
      ping-ponging between cpus.

      Fixes: b488893a ("pid namespaces: changes to show virtual ids to user")
      Reviewed-by: Nagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  7. 25 Mar, 2018 5 commits
    • E
      Revert "mqueue: switch to on-demand creation of internal mount" · cfb2f6f6
      Eric W. Biederman committed
      This reverts commit 36735a6a.
      
      Aleksa Sarai <asarai@suse.de> writes:
      > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
      >
      > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
      > which was introduced by 36735a6a ("mqueue: switch to on-demand
      > creation of internal mount").
      >
      > Basically, the reproducer boils down to being able to mount mqueue if
      > you create a new user namespace, even if you don't unshare the IPC
      > namespace.
      >
      > Previously this was not possible, and you would get an -EPERM. The mount
      > is the *host* mqueue mount, which is being cached and just returned from
      > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
      > it was intentional -- since I'm not familiar with mqueue).
      >
      > To me it looks like there is a missing permission check. I've included a
      > patch below that I've compile-tested, and should block the above case.
      > Can someone please tell me if I'm missing something? Is this actually
      > safe?
      >
      > [1]: https://github.com/docker/docker/issues/36674
      
      The issue is a lot deeper than a missing permission check.  sb->s_user_ns
      was improperly set as well.  So in addition to the filesystem being
      mounted when it should not be mounted, some things are not allowed that
      should be.

      We are practically at the release of 4.16 and there is no agreement between
      Al Viro and myself on what the code should look like to fix things properly.
      So revert the code to what it was before so that we can take our time
      and discuss this properly.

      Fixes: 36735a6a ("mqueue: switch to on-demand creation of internal mount")
      Reported-by: Felix Abecassis <fabecassis@nvidia.com>
      Reported-by: Aleksa Sarai <asarai@suse.de>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • E
      ipc/util: Helpers for making the sysvipc operations pid namespace aware · 03f1fc09
      Eric W. Biederman committed
      Capture the pid namespace when /proc/sysvipc/msg, /proc/sysvipc/shm
      and /proc/sysvipc/sem are opened, and make it available through
      the new helper ipc_seq_pid_ns.

      This makes it possible to report the pids in these files in the
      pid namespace of the opener of the files.

      Implement ipc_update_pid, a simple inline helper that will only update
      a struct pid pointer if the new value does not equal the old value.  This
      removes the need for wordy code sequences like:
      
      	old = object->pid;
      	object->pid = new;
      	put_pid(old);
      
      and
      
      	old = object->pid;
      	if (old != new) {
      		object->pid = new;
      		put_pid(old);
      	}
      
      Allowing the following to be written instead:
      
      	ipc_update_pid(&object->pid, new);
      
      This is easier to read and ensures that the pid reference count is
      not touched when the old and the new values are the same.  Not touching
      the reference count in this case is important to help avoid issues
      like the one af_unix experienced, where multiple threads of the same
      process managed to bounce the struct pid between cpu cache lines
      by updating the pid's reference count.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      03f1fc09
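The update-only-on-change pattern described in the commit above can be sketched in userspace C. This is a minimal illustration, not the kernel code: the `struct pid` here is a hypothetical stand-in with a plain integer counter, `get_pid()` and `put_pid()` are simplified mock-ups of the kernel helpers of the same name, and it is an assumption of this sketch that the helper takes its own reference on the new value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's reference-counted struct pid. */
struct pid {
	int count;	/* number of references currently held */
	int nr;		/* illustrative pid number, unused by the helper */
};

/* Mock: take a reference and return the same pointer. */
static inline struct pid *get_pid(struct pid *pid)
{
	if (pid)
		pid->count++;
	return pid;
}

/* Mock: drop a reference; a real implementation would free at zero. */
static inline void put_pid(struct pid *pid)
{
	if (!pid)
		return;
	assert(pid->count > 0);
	pid->count--;
}

/*
 * Point *pos at pid, touching reference counts only when the value
 * actually changes.  Assumption: the helper takes the new reference
 * itself rather than consuming one from the caller.
 */
static inline void ipc_update_pid(struct pid **pos, struct pid *pid)
{
	struct pid *old = *pos;

	if (old != pid) {
		*pos = get_pid(pid);
		put_pid(old);
	}
}
```

When the stored pointer already equals the new value, neither counter is written, which is the property the commit message cares about for avoiding cache line bouncing.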
    • E
      ipc: Move IPCMNI from include/ipc.h into ipc/util.h · f83a396d
      Eric W. Biederman committed
      The definition of IPCMNI is only used in ipc/util.h and ipc/util.c.  So
      there is no reason to keep it in a header file that the whole kernel
      can see.  Move it into util.h to simplify future maintenance.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      f83a396d
    • E
      msg: Move struct msg_queue into ipc/msg.c · 34b56df9
      Eric W. Biederman committed
      All of the users are now in ipc/msg.c so make the definition local to
      that file to make code maintenance easier.  AKA to prevent rebuilding
      the entire kernel when struct msg_queue changes.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      34b56df9
    • E
      shm: Move struct shmid_kernel into ipc/shm.c · a2e102cd
      Eric W. Biederman committed
      All of the users are now in ipc/shm.c so make the definition local to
      that file to make code maintenance easier.  AKA to prevent rebuilding
      the entire kernel when struct shmid_kernel changes.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      a2e102cd
  8. 23 Mar, 2018 4 commits
  9. 12 Feb, 2018 1 commit
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds committed
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  10. 07 Feb, 2018 2 commits
  11. 13 Jan, 2018 1 commit
    • E
      signal: Ensure generic siginfos the kernel sends have all bits initialized · faf1f22b
      Eric W. Biederman committed
      Call clear_siginfo to ensure stack allocated siginfos are fully
      initialized before being passed to the signal sending functions.
      
      This ensures that if there is the kind of confusion documented by
      TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send uninitialized
      data to userspace when the kernel generates a signal with SI_USER but
      the copy to userspace assumes it is a different kind of signal, and
      different fields are initialized.
      
      This also prepares the way for turning copy_siginfo_to_user
      into a copy_to_user, by removing the need in many cases to perform
      a field by field copy simply to skip the uninitialized fields.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      faf1f22b
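The idea in the commit above, zeroing a stack-allocated siginfo before filling in only the fields one signal type uses, can be sketched in userspace C. This is an illustration under assumptions, not the kernel implementation: `clear_siginfo_sketch()` and `fill_user_siginfo()` are hypothetical names, and the sketch uses the POSIX `siginfo_t` from `<signal.h>` rather than the kernel's own structure.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical userspace analogue of clear_siginfo(): zero every byte. */
static inline void clear_siginfo_sketch(siginfo_t *info)
{
	memset(info, 0, sizeof(*info));
}

/*
 * Fill a siginfo for a user-sent signal.  Because the structure is
 * cleared first, fields this signal type does not set (and any padding)
 * hold zeros instead of leftover stack contents.
 */
static inline void fill_user_siginfo(siginfo_t *info, int sig,
				     pid_t pid, uid_t uid)
{
	clear_siginfo_sketch(info);
	info->si_signo = sig;
	info->si_errno = 0;
	info->si_code = SI_USER;
	info->si_pid = pid;
	info->si_uid = uid;
}
```

With the whole structure initialized up front, a later copy of the full structure cannot expose stale bytes, even if the reader interprets it as a different kind of signal.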
  12. 06 Jan, 2018 1 commit
    • A
      mqueue: switch to on-demand creation of internal mount · 36735a6a
      Al Viro committed
      Instead of doing that upon each ipcns creation, we do that the first
      time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
      doing that allows us to get rid of mount_ns() use - we can go with
      considerably cheaper mount_nodev(), avoiding the loop over all
      mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
      instance in O(1) time instead of O(instances) mount_ns() would've
      cost us.
      
      Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've
      added handling of userland mqueue mounts (original had been broken in
      that area) and added a switch to mount_nodev().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      36735a6a