提交 · c21a6970ae727839a2f300cd8dd957de0d0238c3 · openanolis / cloud-kernel

12 4月, 2018 2 次提交

ipc/shm: introduce shmctl(SHM_STAT_ANY) · c21a6970

由 Davidlohr Bueso 提交于 4月 10, 2018

Patch series "sysvipc: introduce STAT_ANY commands", v2.

The following patches adds the discussed (see [1]) new command for shm
as well as for sems and msq as they are subject to the same
discrepancies for ipc object permission checks between the syscall and
via procfs.  These new commands are justified in that (1) we are stuck
with this semantics as changing syscall and procfs can break userland;
and (2) some users can benefit from performance (for large amounts of
shm segments, for example) from not having to parse the procfs
interface.

Once merged, I will submit the necesary manpage updates.  But I'm thinking
something like:

: diff --git a/man2/shmctl.2 b/man2/shmctl.2
: index 7bb503999941..bb00bbe21a57 100644
: --- a/man2/shmctl.2
: +++ b/man2/shmctl.2
: @@ -41,6 +41,7 @@
:  .\" 2005-04-25, mtk -- noted aberrant Linux behavior w.r.t. new
:  .\"	attaches to a segment that has already been marked for deletion.
:  .\" 2005-08-02, mtk: Added IPC_INFO, SHM_INFO, SHM_STAT descriptions.
: +.\" 2018-02-13, dbueso: Added SHM_STAT_ANY description.
:  .\"
:  .TH SHMCTL 2 2017-09-15 "Linux" "Linux Programmer's Manual"
:  .SH NAME
: @@ -242,6 +243,18 @@ However, the
:  argument is not a segment identifier, but instead an index into
:  the kernel's internal array that maintains information about
:  all shared memory segments on the system.
: +.TP
: +.BR SHM_STAT_ANY " (Linux-specific)"
: +Return a
: +.I shmid_ds
: +structure as for
: +.BR SHM_STAT .
: +However, the
: +.I shm_perm.mode
: +is not checked for read access for
: +.IR shmid ,
: +resembing the behaviour of
: +/proc/sysvipc/shm.
:  .PP
:  The caller can prevent or allow swapping of a shared
:  memory segment with the following \fIcmd\fP values:
: @@ -287,7 +300,7 @@ operation returns the index of the highest used entry in the
:  kernel's internal array recording information about all
:  shared memory segments.
:  (This information can be used with repeated
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operations to obtain information about all shared memory segments
:  on the system.)
:  A successful
: @@ -328,7 +341,7 @@ isn't accessible.
:  \fIshmid\fP is not a valid identifier, or \fIcmd\fP
:  is not a valid command.
:  Or: for a
: -.B SHM_STAT
: +.B SHM_STAT/SHM_STAT_ANY
:  operation, the index value specified in
:  .I shmid
:  referred to an array slot that is currently unused.

This patch (of 3):

There is a permission discrepancy when consulting shm ipc object metadata
between /proc/sysvipc/shm (0444) and the SHM_STAT shmctl command.  The
later does permission checks for the object vs S_IRUGO.  As such there can
be cases where EACCESS is returned via syscall but the info is displayed
anyways in the procfs files.

While this might have security implications via info leaking (albeit no
writing to the shm metadata), this behavior goes way back and showing all
the objects regardless of the permissions was most likely an overlook - so
we are stuck with it.  Furthermore, modifying either the syscall or the
procfs file can cause userspace programs to break (ie ipcs).  Some
applications require getting the procfs info (without root privileges) and
can be rather slow in comparison with a syscall -- up to 500x in some
reported cases.

This patch introduces a new SHM_STAT_ANY command such that the shm ipc
object permissions are ignored, and only audited instead.  In addition,
I've left the lsm security hook checks in place, as if some policy can
block the call, then the user has no other choice than just parsing the
procfs file.

[1] https://lkml.org/lkml/2017/12/19/220

Link: http://lkml.kernel.org/r/20180215162458.10059-2-dave@stgolabs.netSigned-off-by: NDavidlohr Bueso <dbueso@suse.de>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Robert Kettler <robert.kettler@outlook.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c21a6970

proc: move /proc/sysvipc creation to where it belongs · e74a0eff

由 Alexey Dobriyan 提交于 4月 10, 2018

Move the proc_mkdir() call within the sysvipc subsystem such that we
avoid polluting proc_root_init() with petty cpp.

[dave@stgolabs.net: contributed changelog]
Link: http://lkml.kernel.org/r/20180216161732.GA10297@avx2Signed-off-by: NAlexey Dobriyan <adobriyan@gmail.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NDavidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

e74a0eff

03 4月, 2018 10 次提交

ipc: add msgsnd syscall/compat_syscall wrappers · 31c213f2

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_msgsnd() and compat_ksys_msgsnd() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_msgsnd() and compat_sys_msgsnd().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

31c213f2

ipc: add msgrcv syscall/compat_syscall wrappers · 078faac9

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_msgrcv() and compat_ksys_msgrcv() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_msgrcv() and compat_sys_msgrcv().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

078faac9

ipc: add msgctl syscall/compat_syscall wrappers · e340db56

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_msgctl() and compat_ksys_msgctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_msgctl() and compat_sys_msgctl().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

e340db56

ipc: add shmctl syscall/compat_syscall wrappers · c84d0791

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_shmctl() and compat_ksys_shmctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_shmctl() and compat_sys_shmctl().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

c84d0791

ipc: add shmdt syscall wrapper · da1e2744

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_shmdt() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_shmdt().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

da1e2744

ipc: add shmget syscall wrapper · 65749e0b

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_shmget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_shmget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

65749e0b

ipc: add msgget syscall wrapper · 3d65661a

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_msgget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_msgget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

3d65661a

ipc: add semctl syscall/compat_syscall wrappers · d969c6fa

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_semctl() and compat_ksys_semctl() wrappers to avoid in-kernel
calls to these syscalls. The ksys_ prefix denotes that these functions are
meant as a drop-in replacement for the syscalls. In particular, they use
the same calling convention as sys_semctl() and compat_sys_semctl().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

d969c6fa

ipc: add semget syscall wrapper · 69894718

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_semget() wrapper to avoid in-kernel calls to this syscall.
The ksys_ prefix denotes that this function is meant as a drop-in
replacement for the syscall. In particular, it uses the same calling
convention as sys_semget().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

69894718

ipc: add semtimedop syscall/compat_syscall wrappers · 41f4f0e2

由 Dominik Brodowski 提交于 3月 20, 2018

Provide ksys_semtimedop() and compat_ksys_semtimedop() wrappers to avoid
in-kernel calls to these syscalls. The ksys_ prefix denotes that these
functions are meant as a drop-in replacement for the syscalls. In
particular, they use the same calling convention as sys_semtimedop() and
compat_sys_semtimedop().

This patch is part of a series which removes in-kernel calls to syscalls.
On this basis, the syscall entry path can be streamlined. For details, see
http://lkml.kernel.org/r/20180325162527.GA17492@light.dominikbrodowski.net

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>

41f4f0e2

29 3月, 2018 2 次提交

ipc/shm.c: add split function to shm_vm_ops · 3d942ee0

由 Mike Kravetz 提交于 3月 28, 2018

If System V shmget/shmat operations are used to create a hugetlbfs
backed mapping, it is possible to munmap part of the mapping and split
the underlying vma such that it is not huge page aligned.  This will
untimately result in the following BUG:

  kernel BUG at /build/linux-jWa1Fv/linux-4.15.0/mm/hugetlb.c:3310!
  Oops: Exception in kernel mode, sig: 5 [#1]
  LE SMP NR_CPUS=2048 NUMA PowerNV
  Modules linked in: kcm nfc af_alg caif_socket caif phonet fcrypt
  CPU: 18 PID: 43243 Comm: trinity-subchil Tainted: G         C  E 4.15.0-10-generic #11-Ubuntu
  NIP:  c00000000036e764 LR: c00000000036ee48 CTR: 0000000000000009
  REGS: c000003fbcdcf810 TRAP: 0700   Tainted: G         C  E (4.15.0-10-generic)
  MSR:  9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE>  CR: 24002222  XER: 20040000
  CFAR: c00000000036ee44 SOFTE: 1
  NIP __unmap_hugepage_range+0xa4/0x760
  LR __unmap_hugepage_range_final+0x28/0x50
  Call Trace:
    0x7115e4e00000 (unreliable)
    __unmap_hugepage_range_final+0x28/0x50
    unmap_single_vma+0x11c/0x190
    unmap_vmas+0x94/0x140
    exit_mmap+0x9c/0x1d0
    mmput+0xa8/0x1d0
    do_exit+0x360/0xc80
    do_group_exit+0x60/0x100
    SyS_exit_group+0x24/0x30
    system_call+0x58/0x6c
  ---[ end trace ee88f958a1c62605 ]---

This bug was introduced by commit 31383c68 ("mm, hugetlbfs:
introduce ->split() to vm_operations_struct").  A split function was
added to vm_operations_struct to determine if a mapping can be split.
This was mostly for device-dax and hugetlbfs mappings which have
specific alignment constraints.

Mappings initiated via shmget/shmat have their original vm_ops
overwritten with shm_vm_ops.  shm_vm_ops functions will call back to the
original vm_ops if needed.  Add such a split function to shm_vm_ops.

Link: http://lkml.kernel.org/r/20180321161314.7711-1-mike.kravetz@oracle.com
Fixes: 31383c68 ("mm, hugetlbfs: introduce ->split() to vm_operations_struct")
Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
Reported-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
Tested-by: NLaurent Dufour <ldufour@linux.vnet.ibm.com>
Reviewed-by: NDan Williams <dan.j.williams@intel.com>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

3d942ee0

ipc/shm: Fix pid freeing. · 2236d4d3

由 Eric W. Biederman 提交于 3月 28, 2018

The 0day kernel test build report reported an oops:
>
>  IP: put_pid+0x22/0x5c
>  PGD 19efa067 P4D 19efa067 PUD 0
>  Oops: 0000 [#1]
>  CPU: 0 PID: 727 Comm: trinity Not tainted 4.16.0-rc2-00010-g98f929b1 #1
>  RIP: 0010:put_pid+0x22/0x5c
>  RSP: 0018:ffff986719f73e48 EFLAGS: 00010202
>  RAX: 00000006d765f710 RBX: ffff98671a4fa4d0 RCX: ffff986719f73d40
>  RDX: 000000006f6e6125 RSI: 0000000000000000 RDI: ffffffffa01e6d21
>  RBP: ffffffffa0955fe0 R08: 0000000000000020 R09: 0000000000000000
>  R10: 0000000000000078 R11: ffff986719f73e76 R12: 0000000000001000
>  R13: 00000000ffffffea R14: 0000000054000fb0 R15: 0000000000000000
>  FS:  00000000028c2880(0000) GS:ffffffffa06ad000(0000) knlGS:0000000000000000
>  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  CR2: 0000000677846439 CR3: 0000000019fc1005 CR4: 00000000000606b0
>  Call Trace:
>   ? ipc_update_pid+0x36/0x3e
>   ? newseg+0x34c/0x3a6
>   ? ipcget+0x5d/0x528
>   ? entry_SYSCALL_64_after_hwframe+0x52/0xb7
>   ? SyS_shmget+0x5a/0x84
>   ? do_syscall_64+0x194/0x1b3
>   ? entry_SYSCALL_64_after_hwframe+0x42/0xb7
>  Code: ff 05 e7 20 9b 03 58 c9 c3 48 ff 05 85 21 9b 03 48 85 ff 74 4f 8b 47 04 8b 17 48 ff 05 7c 21 9b 03 48 83 c0 03 48 c1 e0 04 ff ca <48> 8b 44 07 08 74 1f 48 ff 05 6c 21 9b 03 ff 0f 0f 94 c2 48 ff
>  RIP: put_pid+0x22/0x5c RSP: ffff986719f73e48
>  CR2: 0000000677846439
>  ---[ end trace ab8c5cb4389d37c5 ]---
>  Kernel panic - not syncing: Fatal exception

In newseg when changing shm_cprid and shm_lprid from pid_t to struct
pid* I misread the kvmalloc as kvzalloc and thought shp was
initialized to 0.  As that is not the case it is not safe to for the
error handling to address shm_cprid and shm_lprid before they are
initialized.

Therefore move the cleanup of shm_cprid and shm_lprid from the no_file
error cleanup path to the no_id error cleanup path.  Ensuring that an
early error exit won't cause the oops above.
Reported-by: Nkernel test robot <fengguang.wu@intel.com>
Reviewed-by: NNagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>

2236d4d3

28 3月, 2018 4 次提交

ipc: Directly call the security hook in ipc_ops.associate · 50ab44b1

由 Eric W. Biederman 提交于 3月 23, 2018

After the last round of cleanups the shm, sem, and msg associate
operations just became trivial wrappers around the appropriate security
method.  Simplify things further by just calling the security method
directly.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

50ab44b1

ipc/sem: Fix semctl(..., GETPID, ...) between pid namespaces · 51d6f263

由 Eric W. Biederman 提交于 3月 23, 2018

Today the last process to update a semaphore is remembered and
reported in the pid namespace of that process.  If there are processes
in any other pid namespace querying that process id with GETPID the
result will be unusable nonsense as it does not make any
sense in your own pid namespace.

Due to ipc_update_pid I don't think you will be able to get System V
ipc semaphores into a troublesome cache line ping-pong.  Using struct
pids from separate process are not a problem because they do not share
a cache line.  Using struct pid from different threads of the same
process are unlikely to be a problem as the reference count update
can be avoided.

Further linux futexes are a much better tool for the job of mutual
exclusion between processes than System V semaphores.  So I expect
programs that  are performance limited by their interprocess mutual
exclusion primitive will be using futexes.

So while it is possible that enhancing the storage of the last
rocess of a System V semaphore from an integer to a struct pid
will cause a performance regression because of the effect
of frequently updating the pid reference count.  I don't expect
that to happen in practice.

This change updates semctl(..., GETPID, ...) to return the
process id of the last process to update a semphore inthe
pid namespace of the calling process.

Fixes: b488893a ("pid namespaces: changes to show virtual ids to user")
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

51d6f263

ipc/msg: Fix msgctl(..., IPC_STAT, ...) between pid namespaces · 39a4940e

由 Eric W. Biederman 提交于 3月 23, 2018

Today msg_lspid and msg_lrpid are remembered in the pid namespace of
the creator and the processes that last send or received a sysvipc
message.  If you have processes in multiple pid namespaces that is
just wrong.  The process ids reported will not make the least bit of
sense.

This fix is slightly more susceptible to a performance problem than
the related fix for System V shared memory.  By definition the pids
are updated by msgsnd and msgrcv, the fast path of System V message
queues.  The only concern over the previous implementation is the
incrementing and decrementing of the pid reference count.  As that is
the only difference and multiple updates by of the task_tgid by
threads in the same process have been shown in af_unix sockets to
create a cache line ping-pong between cpus of the same processor.

In this case I don't expect cache lines holding pid reference counts
to ping pong between cpus.  As senders and receivers update different
pids there is a natural separation there.  Further if multiple threads
of the same process either send or receive messages the pid will be
updated to the same value and ipc_update_pid will avoid the reference
count update.

Which means in the common case I expect msg_lspid and msg_lrpid to
remain constant, and reference counts not to be updated when messages
are sent.

In rare cases it may be possible to trigger the issue which was
observed for af_unix sockets, but it will require multiple processes
with multiple threads to be either sending or receiving messages.  It
just does not feel likely that anyone would do that in practice.

This change updates msgctl(..., IPC_STAT, ...) to return msg_lspid and
msg_lrpid in the pid namespace of the process calling stat.

This change also updates cat /proc/sysvipc/msg to return print msg_lspid
and msg_lrpid in the pid namespace of the process that opened the proc
file.

Fixes: b488893a ("pid namespaces: changes to show virtual ids to user")
Reviewed-by: NNagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

39a4940e

ipc/shm: Fix shmctl(..., IPC_STAT, ...) between pid namespaces. · 98f929b1

由 Eric W. Biederman 提交于 3月 23, 2018

Today shm_cpid and shm_lpid are remembered in the pid namespace of the
creator and the processes that last touched a sysvipc shared memory
segment. If you have processes in multiple pid namespaces that
is just wrong, and I don't know how this has been over-looked for
so long.

As only creation and shared memory attach and shared memory detach
update the pids I do not expect there to be a repeat of the issues
when struct pid was attached to each af_unix skb, which in some
notable cases cut the performance in half. The problem was threads of
the same process updating same struct pid from different cpus causing
the cache line to be highly contended and bounce between cpus.

As creation, attach, and detach are expected to be rare operations for
sysvipc shared memory segments I do not expect that kind of cache line
ping pong to cause probems. In addition because the pid is at a fixed
location in the structure instead of being dynamic on a skb, the
reference count of the pid does not need to be updated on each
operation if the pid is the same. This ability to simply skip the pid
reference count changes if the pid is unchanging further reduces the
likelihood of the a cache line holding a pid reference count
ping-ponging between cpus.

Fixes: b488893a ("pid namespaces: changes to show virtual ids to user")
Reviewed-by: NNagarathnam Muthusamy <nagarathnam.muthusamy@oracle.com>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

98f929b1

25 3月, 2018 5 次提交

Revert "mqueue: switch to on-demand creation of internal mount" · cfb2f6f6

由 Eric W. Biederman 提交于 3月 24, 2018

This reverts commit 36735a6a.

Aleksa Sarai <asarai@suse.de> writes:
> [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
>
> Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
> which was introduced by 36735a6a ("mqueue: switch to on-demand
> creation of internal mount").
>
> Basically, the reproducer boils down to being able to mount mqueue if
> you create a new user namespace, even if you don't unshare the IPC
> namespace.
>
> Previously this was not possible, and you would get an -EPERM. The mount
> is the *host* mqueue mount, which is being cached and just returned from
> mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
> it was intentional -- since I'm not familiar with mqueue).
>
> To me it looks like there is a missing permission check. I've included a
> patch below that I've compile-tested, and should block the above case.
> Can someone please tell me if I'm missing something? Is this actually
> safe?
>
> [1]: https://github.com/docker/docker/issues/36674

The issue is a lot deeper than a missing permission check.  sb->s_user_ns
was is improperly set as well.  So in addition to the filesystem being
mounted when it should not be mounted, so things are not allow that should
be.

We are practically to the release of 4.16 and there is no agreement between
Al Viro and myself on what the code should looks like to fix things properly.
So revert the code to what it was before so that we can take our time
and discuss this properly.

Fixes: 36735a6a ("mqueue: switch to on-demand creation of internal mount")
Reported-by: NFelix Abecassis <fabecassis@nvidia.com>
Reported-by: NAleksa Sarai <asarai@suse.de>
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

cfb2f6f6

ipc/util: Helpers for making the sysvipc operations pid namespace aware · 03f1fc09

由 Eric W. Biederman 提交于 3月 23, 2018

Capture the pid namespace when /proc/sysvipc/msg /proc/sysvipc/shm
and /proc/sysvipc/sem are opened, and make it available through
the new helper ipc_seq_pid_ns.

This makes it possible to report the pids in these files in the
pid namespace of the opener of the files.

Implement ipc_update_pid.  A simple impline helper that will only update
a struct pid pointer if the new value does not equal the old value.  This
removes the need for wordy code sequences like:

	old = object->pid;
	object->pid = new;
	put_pid(old);

and

	old = object->pid;
	if (old != new) {
		object->pid = new;
		put_pid(old);
	}

Allowing the following to be written instead:

	ipc_update_pid(&object->pid, new);

Which is easier to read and ensures that the pid reference count is
not touched the old and the new values are the same.  Not touching
the reference count in this case is important to help avoid issues
like af_unix experienced, where multiple threads of the same
process managed to bounce the struct pid between cpu cache lines,
but updating the pids reference count.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

03f1fc09

ipc: Move IPCMNI from include/ipc.h into ipc/util.h · f83a396d

由 Eric W. Biederman 提交于 3月 22, 2018

The definition IPCMNI is only used in ipc/util.h and ipc/util.c. So
there is no reason to keep it in a header file that the whole kernel
can see. Move it into util.h to simplify future maintenance.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

f83a396d

msg: Move struct msg_queue into ipc/msg.c · 34b56df9

由 Eric W. Biederman 提交于 3月 22, 2018

All of the users are now in ipc/msg.c so make the definition local to
that file to make code maintenance easier. AKA to prevent rebuilding
the entire kernel when struct msg_queue changes.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

34b56df9

shm: Move struct shmid_kernel into ipc/shm.c · a2e102cd

由 Eric W. Biederman 提交于 3月 22, 2018

All of the users are now in ipc/shm.c so make the definition local to
that file to make code maintenance easier. AKA to prevent rebuilding
the entire kernel when struct shmid_kernel changes.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

a2e102cd

23 3月, 2018 4 次提交

sem: Move struct sem and struct sem_array into ipc/sem.c · 1a5c1349

由 Eric W. Biederman 提交于 3月 22, 2018

All of the users are now in ipc/sem.c so make the definitions
local to that file to make code maintenance easier.  AKA
to prevent rebuilding the entire kernel when one of these
files is changed.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

1a5c1349

msg/security: Pass kern_ipc_perm not msg_queue into the msg_queue security hooks · d8c6e854

由 Eric W. Biederman 提交于 3月 22, 2018

All of the implementations of security hooks that take msg_queue only
access q_perm the struct kern_ipc_perm member.  This means the
dependencies of the msg_queue security hooks can be simplified by
passing the kern_ipc_perm member of msg_queue.

Making this change will allow struct msg_queue to become private to
ipc/msg.c.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

d8c6e854

shm/security: Pass kern_ipc_perm not shmid_kernel into the shm security hooks · 7191adff

由 Eric W. Biederman 提交于 3月 22, 2018

All of the implementations of security hooks that take shmid_kernel only
access shm_perm the struct kern_ipc_perm member. This means the
dependencies of the shm security hooks can be simplified by passing
the kern_ipc_perm member of shmid_kernel..

Making this change will allow struct shmid_kernel to become private to ipc/shm.c.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

7191adff

sem/security: Pass kern_ipc_perm not sem_array into the sem security hooks · aefad959

由 Eric W. Biederman 提交于 3月 22, 2018

All of the implementations of security hooks that take sem_array only
access sem_perm the struct kern_ipc_perm member.  This means the
dependencies of the sem security hooks can be simplified by passing
the kern_ipc_perm member of sem_array.

Making this change will allow struct sem and struct sem_array
to become private to ipc/sem.c.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

aefad959

12 2月, 2018 1 次提交

vfs: do bulk POLL* -> EPOLL* replacement · a9a08845

由 Linus Torvalds 提交于 2月 11, 2018

This is the mindless scripted replacement of kernel use of POLL*
variables as described by Al, done by this script:

    for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
        L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
        for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
    done

with de-mangling cleanups yet to come.

NOTE! On almost all architectures, the EPOLL* constants have the same
values as the POLL* constants do.  But they keyword here is "almost".
For various bad reasons they aren't the same, and epoll() doesn't
actually work quite correctly in some cases due to this on Sparc et al.

The next patch from Al will sort out the final differences, and we
should be all done.
Scripted-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

a9a08845

07 2月, 2018 2 次提交

ipc/mqueue.c: have RT tasks queue in by priority in wq_add() · 68e34f4e

由 Jonathan Haws 提交于 2月 06, 2018

Previous behavior added tasks to the work queue using the static_prio
value instead of the dynamic priority value in prio.  This caused RT tasks
to be added to the work queue in a FIFO manner rather than by priority.
Normal tasks were handled by priority.

This fix utilizes the dynamic priority of the task to ensure that both RT
and normal tasks are added to the work queue in priority order.  Utilizing
the dynamic priority (prio) rather than the base priority (normal_prio)
was chosen to ensure that if a task had a boosted priority when it was
added to the work queue, it would be woken sooner to to ensure that it
releases any other locks it may be holding in a more timely manner.  It is
understood that the task could have a lower priority when it wakes than
when it was added to the queue in this (unlikely) case.

Link: http://lkml.kernel.org/r/1513006652-7014-1-git-send-email-jhaws@sdl.usu.eduSigned-off-by: NJonathan Haws <jhaws@sdl.usu.edu>
Reviewed-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Reviewed-by: NDavidlohr Bueso <dbueso@suse.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

68e34f4e

ipc: fix ipc data structures inconsistency · 87ad4b0d

由 Philippe Mikoyan 提交于 2月 06, 2018

As described in the title, this patch fixes <ipc>id_ds inconsistency when
<ipc>ctl_stat executes concurrently with some ds-changing function, e.g.
shmat, msgsnd or whatever.

For instance, if shmctl(IPC_STAT) is running concurrently
with shmat, following data structure can be returned:
{... shm_lpid = 0, shm_nattch = 1, ...}

Link: http://lkml.kernel.org/r/20171202153456.6514-1-philippe.mikoyan@skat.systemsSigned-off-by: NPhilippe Mikoyan <philippe.mikoyan@skat.systems>
Reviewed-by: NDavidlohr Bueso <dbueso@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

87ad4b0d

13 1月, 2018 1 次提交

signal: Ensure generic siginfos the kernel sends have all bits initialized · faf1f22b

由 Eric W. Biederman 提交于 1月 05, 2018

Call clear_siginfo to ensure stack allocated siginfos are fully
initialized before being passed to the signal sending functions.

This ensures that if there is the kind of confusion documented by
TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send unitialized
data to userspace when the kernel generates a signal with SI_USER but
the copy to userspace assumes it is a different kind of signal, and
different fields are initialized.

This also prepares the way for turning copy_siginfo_to_user
into a copy_to_user, by removing the need in many cases to perform
a field by field copy simply to skip the uninitialized fields.
Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>

faf1f22b

06 1月, 2018 7 次提交

mqueue: switch to on-demand creation of internal mount · 36735a6a

由 Al Viro 提交于 12月 25, 2017

Instead of doing that upon each ipcns creation, we do that the first
time mq_open(2) or mqueue mount is done in an ipcns.  What's more,
doing that allows to get rid of mount_ns() use - we can go with
considerably cheaper mount_nodev(), avoiding the loop over all
mqueue superblock instances; ipcns->mq_mnt is used to locate preexisting
instance in O(1) time instead of O(instances) mount_ns() would've
cost us.

Based upon the version by Giuseppe Scrivano <gscrivan@redhat.com>; I've
added handling of userland mqueue mounts (original had been broken in
that area) and added a switch to mount_nodev().
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

36735a6a

A
tidy do_mq_open() up a bit · a713fd7f
由 Al Viro 提交于 12月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
a713fd7f
A
mqueue: clean prepare_open() up · 9b20d7fc
由 Al Viro 提交于 12月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
9b20d7fc
A
do_mq_open(): move all work prior to dentry_open() into a helper · 066cc813
由 Al Viro 提交于 12月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
066cc813
A
mqueue: fold mq_attr_ok() into mqueue_get_inode() · 05c1b290
由 Al Viro 提交于 12月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
05c1b290
A
move dentry_open() calls up into do_mq_open() · af4a5372
由 Al Viro 提交于 12月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
af4a5372
A
mqueue: switch to vfs_mkobj(), quit abusing ->d_fsdata · eecec19d
由 Al Viro 提交于 12月 01, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
eecec19d

28 11月, 2017 2 次提交

A
ipc, kernel, mm: annotate ->poll() instances · 9dd95748
由 Al Viro 提交于 7月 03, 2017
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
9dd95748

Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6

由 Linus Torvalds 提交于 11月 27, 2017

This is a pure automated search-and-replace of the internal kernel
superblock flags.

The s_flags are now called SB_*, with the names and the values for the
moment mirroring the MS_* flags that they're equivalent to.

Note how the MS_xyz flags are the ones passed to the mount system call,
while the SB_xyz flags are what we then use in sb->s_flags.

The script to do this was:

    # places to look in; re security/*: it generally should *not* be
    # touched (that stuff parses mount(2) arguments directly), but
    # there are two places where we really deal with superblock flags.
    FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
            include/linux/fs.h include/uapi/linux/bfs_fs.h \
            security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
    # the list of MS_... constants
    SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
          DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
          POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
          I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
          ACTIVE NOUSER"

    SED_PROG=
    for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done

    # we want files that contain at least one of MS_...,
    # with fs/namespace.c and fs/pnode.c excluded.
    L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')

    for f in $L; do sed -i $f $SED_PROG; done
Requested-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

1751e8a6

openanolis / cloud-kernel 大约 1 年 前同步成功

openanolis / cloud-kernel
大约 1 年前同步成功