1. 08 Apr, 2020 1 commit
  2. 04 Feb, 2020 2 commits
  3. 26 Sep, 2019 2 commits
  4. 06 Sep, 2019 1 commit
  5. 17 Jul, 2019 1 commit
    • K
      ipc/mqueue.c: only perform resource calculation if user valid · a318f12e
      Kees Cook committed
      Andreas Christoforou reported:
      
        UBSAN: Undefined behaviour in ipc/mqueue.c:414:49 signed integer overflow:
        9 * 2305843009213693951 cannot be represented in type 'long int'
        ...
        Call Trace:
          mqueue_evict_inode+0x8e7/0xa10 ipc/mqueue.c:414
          evict+0x472/0x8c0 fs/inode.c:558
          iput_final fs/inode.c:1547 [inline]
          iput+0x51d/0x8c0 fs/inode.c:1573
          mqueue_get_inode+0x8eb/0x1070 ipc/mqueue.c:320
          mqueue_create_attr+0x198/0x440 ipc/mqueue.c:459
          vfs_mkobj+0x39e/0x580 fs/namei.c:2892
          prepare_open ipc/mqueue.c:731 [inline]
          do_mq_open+0x6da/0x8e0 ipc/mqueue.c:771
      
      Which could be triggered by:
      
              struct mq_attr attr = {
                      .mq_flags = 0,
                      .mq_maxmsg = 9,
                      .mq_msgsize = 0x1fffffffffffffff,
                      .mq_curmsgs = 0,
              };
      
              if (mq_open("/testing", 0x40, 3, &attr) == (mqd_t) -1)
                      perror("mq_open");
      
      mqueue_get_inode() was correctly rejecting the giant mq_msgsize, and
      preparing to return -EINVAL.  During the cleanup, it calls
      mqueue_evict_inode() which performed resource usage tracking math for
      updating "user", before checking if there was a valid "user" at all
      (which would indicate that the calculations would be sane).  Instead,
      delay this calculation until after seeing a valid "user".
      
      The overflow was real, but the results went unused, so while the flaw is
      harmless, it's noisy for kernel fuzzers, so just fix it by moving the
      calculation under the non-NULL "user" where it actually gets used.
      
      Link: http://lkml.kernel.org/r/201906072207.ECB65450@keescook
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reported-by: Andreas Christoforou <andreaschristofo@gmail.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a318f12e
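The fix described in the commit above can be sketched in user space as follows. This is only an illustration under stated assumptions: `struct user_acct`, `struct mq_info` and `mq_evict()` are hypothetical stand-ins, not the kernel's structures, and `__builtin_mul_overflow` is a GCC/Clang builtin used here as an extra guard that the real patch does not necessarily contain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for per-user accounting; not the kernel's type. */
struct user_acct {
    unsigned long long mq_bytes;    /* bytes currently charged to the user */
};

/* Hypothetical stand-in for the per-queue state. */
struct mq_info {
    struct user_acct *user;         /* NULL if creation failed before charging */
    unsigned long long maxmsg;
    unsigned long long msgsize;
};

/*
 * Returns true if the accounting was updated. The multiplication is only
 * performed once a valid user is known, which implies the attributes had
 * already passed validation when the queue was created.
 */
bool mq_evict(struct mq_info *info)
{
    unsigned long long bytes;

    assert(info != NULL);

    if (info->user == NULL)
        return false;               /* nothing was charged: skip the math */

    /* Defensive check (an assumption of this sketch, GCC/Clang builtin). */
    if (__builtin_mul_overflow(info->maxmsg, info->msgsize, &bytes))
        return false;

    info->user->mq_bytes -= bytes;  /* undo the charge */
    info->user = NULL;
    return true;
}
```

With a NULL `user` and the attributes from the reproducer (`maxmsg = 9`, `msgsize = 0x1fffffffffffffff`), `mq_evict()` returns before any arithmetic happens, which is the behaviour the commit message aims for.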
  6. 26 May, 2019 1 commit
  7. 15 May, 2019 3 commits
    • D
      ipc/mqueue: optimize msg_get() · a5091fda
      Davidlohr Bueso committed
      Our msg priorities became an rbtree as of d6629859 ("ipc/mqueue:
      improve performance of send/recv").  However, consuming a msg in
      msg_get() remains logarithmic (still being better than the case before
      of course).  By applying well known techniques to cache pointers we can
      have the node with the highest priority in O(1), which is especially nice
      for the rt cases.  Furthermore, some callers can call msg_get() in a
      loop.
      
      A new msg_tree_erase() helper is also added to encapsulate the tree
      removal and node_cache game.  Passes ltp mq testcases.
      
      Link: http://lkml.kernel.org/r/20190321190216.1719-2-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a5091fda
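The pointer-caching idea from the commit above can be illustrated with a much simpler structure. The sketch below uses a plain unbalanced binary search tree rather than the kernel rbtree, and every name in it (`prio_node`, `prio_tree`, `tree_insert`, `tree_pop_max`) is hypothetical. It shows only the technique: keep a pointer to the rightmost (highest-priority) node so that finding it is O(1), and refresh that pointer when the node is erased.

```c
#include <assert.h>
#include <stddef.h>

struct prio_node {
    int priority;
    struct prio_node *left, *right, *parent;
};

struct prio_tree {
    struct prio_node *root;
    struct prio_node *rightmost;    /* cached highest-priority node */
};

/* Insert n; the cache is updated only if the descent never turned left. */
void tree_insert(struct prio_tree *t, struct prio_node *n)
{
    struct prio_node **link = &t->root, *parent = NULL;
    int went_only_right = 1;

    n->left = n->right = NULL;
    while (*link) {
        parent = *link;
        if (n->priority > parent->priority) {
            link = &parent->right;
        } else {
            link = &parent->left;
            went_only_right = 0;
        }
    }
    n->parent = parent;
    *link = n;
    if (went_only_right)
        t->rightmost = n;
}

/* Remove and return the highest-priority node, found via the cache. */
struct prio_node *tree_pop_max(struct prio_tree *t)
{
    struct prio_node *n = t->rightmost, *m;

    if (!n)
        return NULL;
    assert(n->right == NULL);       /* the rightmost node has no right child */

    /* Splice n out: its left subtree takes its place. */
    if (n->parent)
        n->parent->right = n->left;
    else
        t->root = n->left;
    if (n->left)
        n->left->parent = n->parent;

    /* New rightmost is n's in-order predecessor. */
    if (n->left) {
        for (m = n->left; m->right; m = m->right)
            ;
        t->rightmost = m;
    } else {
        t->rightmost = n->parent;
    }
    return n;
}
```

Only the lookup is constant time here; refreshing the cache after an erase still walks to the predecessor. The sketch also ignores the per-priority message lists that the real code keeps in each tree node.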
    • D
      ipc/mqueue: remove redundant wq task assignment · 0ecb5821
      Davidlohr Bueso committed
      We already store the current task for the new waiter before calling
      wq_sleep() in both send and recv paths.  Trivially remove the redundant
      assignment.
      
      Link: http://lkml.kernel.org/r/20190321190216.1719-1-dave@stgolabs.net
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0ecb5821
    • L
      ipc: prevent lockup on alloc_msg and free_msg · d6a2946a
      Li Rongqing committed
      msgctl10 of LTP triggers the following lockup when CONFIG_KASAN is
      enabled on large-memory SMP systems: page initialization can take a
      long time if msgctl10 requests a huge block of memory, and it blocks
      the RCU scheduler, so release the CPU actively.
      
      After adding schedule() in free_msg(), free_msg() cannot be called while
      holding a spinlock, so add each msg to a tmp list and free it outside
      the spinlock.
      
        rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        rcu:     Tasks blocked on level-1 rcu_node (CPUs 16-31): P32505
        rcu:     Tasks blocked on level-1 rcu_node (CPUs 48-63): P34978
        rcu:     (detected by 11, t=35024 jiffies, g=44237529, q=16542267)
        msgctl10        R  running task    21608 32505   2794 0x00000082
        Call Trace:
         preempt_schedule_irq+0x4c/0xb0
         retint_kernel+0x1b/0x2d
        RIP: 0010:__is_insn_slot_addr+0xfb/0x250
        Code: 82 1d 00 48 8b 9b 90 00 00 00 4c 89 f7 49 c1 ee 03 e8 59 83 1d 00 48 b8 00 00 00 00 00 fc ff df 4c 39 eb 48 89 9d 58 ff ff ff <41> c6 04 06 f8 74 66 4c 8d 75 98 4c 89 f1 48 c1 e9 03 48 01 c8 48
        RSP: 0018:ffff88bce041f758 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
        RAX: dffffc0000000000 RBX: ffffffff8471bc50 RCX: ffffffff828a2a57
        RDX: dffffc0000000000 RSI: dffffc0000000000 RDI: ffff88bce041f780
        RBP: ffff88bce041f828 R08: ffffed15f3f4c5b3 R09: ffffed15f3f4c5b3
        R10: 0000000000000001 R11: ffffed15f3f4c5b2 R12: 000000318aee9b73
        R13: ffffffff8471bc50 R14: 1ffff1179c083ef0 R15: 1ffff1179c083eec
         kernel_text_address+0xc1/0x100
         __kernel_text_address+0xe/0x30
         unwind_get_return_address+0x2f/0x50
         __save_stack_trace+0x92/0x100
         create_object+0x380/0x650
         __kmalloc+0x14c/0x2b0
         load_msg+0x38/0x1a0
         do_msgsnd+0x19e/0xcf0
         do_syscall_64+0x117/0x400
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
        rcu:     Tasks blocked on level-1 rcu_node (CPUs 0-15): P32170
        rcu:     (detected by 14, t=35016 jiffies, g=44237525, q=12423063)
        msgctl10        R  running task    21608 32170  32155 0x00000082
        Call Trace:
         preempt_schedule_irq+0x4c/0xb0
         retint_kernel+0x1b/0x2d
        RIP: 0010:lock_acquire+0x4d/0x340
        Code: 48 81 ec c0 00 00 00 45 89 c6 4d 89 cf 48 8d 6c 24 20 48 89 3c 24 48 8d bb e4 0c 00 00 89 74 24 0c 48 c7 44 24 20 b3 8a b5 41 <48> c1 ed 03 48 c7 44 24 28 b4 25 18 84 48 c7 44 24 30 d0 54 7a 82
        RSP: 0018:ffff88af83417738 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
        RAX: dffffc0000000000 RBX: ffff88bd335f3080 RCX: 0000000000000002
        RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88bd335f3d64
        RBP: ffff88af83417758 R08: 0000000000000000 R09: 0000000000000000
        R10: 0000000000000001 R11: ffffed13f3f745b2 R12: 0000000000000000
        R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
         is_bpf_text_address+0x32/0xe0
         kernel_text_address+0xec/0x100
         __kernel_text_address+0xe/0x30
         unwind_get_return_address+0x2f/0x50
         __save_stack_trace+0x92/0x100
         save_stack+0x32/0xb0
         __kasan_slab_free+0x130/0x180
         kfree+0xfa/0x2d0
         free_msg+0x24/0x50
         do_msgrcv+0x508/0xe60
         do_syscall_64+0x117/0x400
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Davidlohr said:
       "So after releasing the lock, the msg rbtree/list is empty and new
        calls will not see those in the newly populated tmp_msg list, and
        therefore they cannot access the delayed msg freeing pointers, which
        is good. Also the fact that the node_cache is now freed before the
        actual messages seems to be harmless as this is wanted for
        msg_insert() avoiding GFP_ATOMIC allocations, and after releasing the
        info->lock the thing is freed anyway so it should not change things"
      
      Link: http://lkml.kernel.org/r/1552029161-4957-1-git-send-email-lirongqing@baidu.com
      Signed-off-by: Li RongQing <lirongqing@baidu.com>
      Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
      Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6a2946a
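The "move to a tmp list, free outside the lock" pattern from the commit above might look roughly like this in user space. It is a sketch under assumptions: the `atomic_flag` spinlock stands in for the kernel's `info->lock`, `sched_yield()` stands in for the kernel's voluntary reschedule, and `struct queue`, `queue_push()` and `queue_drain()` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdlib.h>

struct msg {
    struct msg *next;
};

struct queue {
    atomic_flag lock;       /* toy spinlock standing in for info->lock */
    struct msg *head;
};

static void q_lock(struct queue *q)
{
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;                   /* spin until the flag was previously clear */
}

static void q_unlock(struct queue *q)
{
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
}

/* Allocate one message and link it in; returns 0 on success, -1 on OOM. */
int queue_push(struct queue *q)
{
    struct msg *m = malloc(sizeof(*m));

    if (!m)
        return -1;
    q_lock(q);
    m->next = q->head;
    q->head = m;
    q_unlock(q);
    return 0;
}

/* Detach everything under the lock, then free it with the lock dropped. */
size_t queue_drain(struct queue *q)
{
    struct msg *tmp, *m;
    size_t freed = 0;

    assert(q != NULL);

    q_lock(q);
    tmp = q->head;          /* the whole list moves to a private tmp list */
    q->head = NULL;
    q_unlock(q);

    while ((m = tmp) != NULL) {
        tmp = m->next;
        free(m);            /* may be slow; no lock is held here */
        freed++;
        sched_yield();      /* give other tasks a chance between frees */
    }
    return freed;
}
```

Once the lock is released the queue is empty, so concurrent callers cannot reach the messages waiting on the private list, which matches the reasoning quoted from Davidlohr in the commit message.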
  8. 02 May, 2019 1 commit
  9. 28 Feb, 2019 1 commit
    • D
      ipc: Convert mqueue fs to fs_context · 935c6912
      David Howells committed
      Convert the mqueue filesystem to use the filesystem context stuff.
      
      Notes:
      
       (1) The relevant ipc namespace is selected when the context is
           initialised (and it defaults to the current task's ipc namespace).
           The caller can override this before calling vfs_get_tree().
      
       (2) Rather than simply calling kern_mount_data(), mq_init_ns() and
           mq_internal_mount() create a context, adjust it and then do the rest
           of the mount procedure.
      
       (3) The lazy mqueue mounting on creation of a new namespace is retained
           from a previous patch, but the avoidance of sget() if no superblock
           yet exists is reverted and the superblock is again keyed on the
           namespace pointer.
      
           Yes, there was a performance gain in not searching the superblock
           hash, but it's only paid once per ipc namespace - and only if someone
           uses mqueue within that namespace, so I'm not sure it's worth it,
           especially as calling sget() allows avoidance of recursion.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      935c6912
  10. 07 Feb, 2019 1 commit
    • A
      y2038: syscalls: rename y2038 compat syscalls · 8dabe724
      Arnd Bergmann committed
      A lot of system calls that pass a time_t somewhere have an implementation
      using a COMPAT_SYSCALL_DEFINEx() on 64-bit architectures, and have
      been reworked so that this implementation can now be used on 32-bit
      architectures as well.
      
      The missing step is to redefine them using the regular SYSCALL_DEFINEx()
      to get them out of the compat namespace and make it possible to build them
      on 32-bit architectures.
      
      Any system call that ends in 'time' gets a '32' suffix on its name for
      that version, while the others get a '_time32' suffix, to distinguish
      them from the normal version, which takes a 64-bit time argument in the
      future.
      
      In this step, only 64-bit architectures are changed; doing this rename
      first lets us avoid touching the 32-bit architectures twice.
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      8dabe724
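The naming rule stated in the commit above (names ending in 'time' get '32', all others get '_time32') can be written down as a small helper. The function `time32_name()` is hypothetical and exists only to make the rule concrete; the kernel applied the rename by hand, not through code like this.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Write the 32-bit-time variant of a syscall name into out.
 * Returns 0 on success, -1 if the buffer is too small.
 */
int time32_name(const char *name, char *out, size_t outlen)
{
    size_t len;
    const char *suffix;
    int n;

    assert(name != NULL && out != NULL);

    len = strlen(name);
    /* Names that already end in "time" only need the "32". */
    if (len >= 4 && strcmp(name + len - 4, "time") == 0)
        suffix = "32";
    else
        suffix = "_time32";

    n = snprintf(out, outlen, "%s%s", name, suffix);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For example, `clock_gettime` maps to `clock_gettime32`, while `mq_timedsend` maps to `mq_timedsend_time32`.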
  11. 03 Oct, 2018 1 commit
    • E
      signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
      Eric W. Biederman committed
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
      So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
      traditional name for the userspace definition.  While the version that
      is used internally to the kernel and ultimately will not be padded to
      128 bytes is called kernel_siginfo.
      
      The definition of struct kernel_siginfo I have put in include/linux/signal_types.h
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
      
      To make it easy to verify the change, kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      ae7795bc
  12. 27 Aug, 2018 1 commit
    • A
      y2038: globally rename compat_time to old_time32 · 9afc5eee
      Arnd Bergmann committed
      Christoph Hellwig suggested a slightly different path for handling
      backwards compatibility with the 32-bit time_t based system calls:
      
      Rather than simply reusing the compat_sys_* entry points on 32-bit
      architectures unchanged, we get rid of those entry points and the
      compat_time types by renaming them to something that makes more sense
      on 32-bit architectures (which don't have a compat mode otherwise),
      and then share the entry points under the new name with the 64-bit
      architectures that use them for implementing the compatibility.
      
      The following types and interfaces are renamed here, and moved
      from linux/compat_time.h to linux/time32.h:
      
      old				new
      ---				---
      compat_time_t			old_time32_t
      struct compat_timeval		struct old_timeval32
      struct compat_timespec		struct old_timespec32
      struct compat_itimerspec	struct old_itimerspec32
      ns_to_compat_timeval()		ns_to_old_timeval32()
      get_compat_itimerspec64()	get_old_itimerspec32()
      put_compat_itimerspec64()	put_old_itimerspec32()
      compat_get_timespec64()		get_old_timespec32()
      compat_put_timespec64()		put_old_timespec32()
      
      As we already have aliases in place, this patch addresses only the
      instances that are relevant to the system call interface in particular,
      not those that occur in device drivers and other modules. Those
      will get handled separately, while providing the 64-bit version
      of the respective interfaces.
      
      I'm not renaming the timex, rusage and itimerval structures, as we are
      still debating what the new interface will look like, and whether we
      will need a replacement at all.
      
      This also doesn't change the names of the syscall entry points, which can
      be done more easily when we actually switch over the 32-bit architectures
      to use them.  At that point we need to change COMPAT_SYSCALL_DEFINEx to
      SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.
      Suggested-by: Christoph Hellwig <hch@infradead.org>
      Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      9afc5eee
  13. 20 Apr, 2018 2 commits
    • A
      y2038: ipc: Enable COMPAT_32BIT_TIME · b0d17578
      Arnd Bergmann committed
      Three ipc syscalls (mq_timedsend, mq_timedreceive and semtimedop)
      take a timespec argument. After we move 32-bit architectures over to
      using 64-bit time_t based syscalls, we need separate entry points for
      the old 32-bit based interfaces.
      
      This changes the #ifdef guards for the existing 32-bit compat syscalls
      to check for CONFIG_COMPAT_32BIT_TIME instead, which will then be
      enabled on all existing 32-bit architectures.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      b0d17578
    • A
      y2038: ipc: Use __kernel_timespec · 21fc538d
      Arnd Bergmann committed
      This is a preparation for changing over __kernel_timespec to 64-bit
      times, which involves assigning new system call numbers for mq_timedsend(),
      mq_timedreceive() and semtimedop() for compatibility with future y2038
      proof user space.
      
      The existing ABIs will remain available through compat code.
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      21fc538d
  14. 25 Mar, 2018 1 commit
    • E
      Revert "mqueue: switch to on-demand creation of internal mount" · cfb2f6f6
      Eric W. Biederman committed
      This reverts commit 36735a6a.
      
      Aleksa Sarai <asarai@suse.de> writes:
      > [REGRESSION v4.16-rc6] [PATCH] mqueue: forbid unprivileged user access to internal mount
      >
      > Felix reported weird behaviour on 4.16.0-rc6 with regards to mqueue[1],
      > which was introduced by 36735a6a ("mqueue: switch to on-demand
      > creation of internal mount").
      >
      > Basically, the reproducer boils down to being able to mount mqueue if
      > you create a new user namespace, even if you don't unshare the IPC
      > namespace.
      >
      > Previously this was not possible, and you would get an -EPERM. The mount
      > is the *host* mqueue mount, which is being cached and just returned from
      > mqueue_mount(). To be honest, I'm not sure if this is safe or not (or if
      > it was intentional -- since I'm not familiar with mqueue).
      >
      > To me it looks like there is a missing permission check. I've included a
      > patch below that I've compile-tested, and should block the above case.
      > Can someone please tell me if I'm missing something? Is this actually
      > safe?
      >
      > [1]: https://github.com/docker/docker/issues/36674
      
      The issue is a lot deeper than a missing permission check.  sb->s_user_ns
      was improperly set as well.  So in addition to the filesystem being
      mounted when it should not be mounted, things are not allowed that should
      be.
      
      We are practically to the release of 4.16 and there is no agreement between
      Al Viro and myself on what the code should look like to fix things properly.
      So revert the code to what it was before so that we can take our time
      and discuss this properly.
      
      Fixes: 36735a6a ("mqueue: switch to on-demand creation of internal mount")
      Reported-by: Felix Abecassis <fabecassis@nvidia.com>
      Reported-by: Aleksa Sarai <asarai@suse.de>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      cfb2f6f6
  15. 12 Feb, 2018 1 commit
    • L
      vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Linus Torvalds committed
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9a08845
  16. 07 Feb, 2018 1 commit
  17. 13 Jan, 2018 1 commit
    • E
      signal: Ensure generic siginfos the kernel sends have all bits initialized · faf1f22b
      Eric W. Biederman committed
      Call clear_siginfo to ensure stack allocated siginfos are fully
      initialized before being passed to the signal sending functions.
      
      This ensures that if there is the kind of confusion documented by
      TRAP_FIXME, FPE_FIXME, or BUS_FIXME the kernel won't send uninitialized
      data to userspace when the kernel generates a signal with SI_USER but
      the copy to userspace assumes it is a different kind of signal, and
      different fields are initialized.
      
      This also prepares the way for turning copy_siginfo_to_user
      into a copy_to_user, by removing the need in many cases to perform
      a field by field copy simply to skip the uninitialized fields.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      faf1f22b
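The initialization pattern from the commit above has a direct user-space analogue: zero the whole structure before filling in the few fields a given signal kind uses. The sketch below uses the POSIX `siginfo_t` and `memset()` in place of the kernel's `clear_siginfo()`; `fill_user_signal()` is a hypothetical helper, not a kernel or libc function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>

/*
 * Prepare a stack-allocated siginfo_t for an SI_USER style signal.
 * Clearing first guarantees that fields this signal kind does not use,
 * and any padding, never carry stale stack contents.
 */
void fill_user_signal(siginfo_t *info, int signo, pid_t pid, uid_t uid)
{
    assert(info != NULL);

    memset(info, 0, sizeof(*info));     /* analogue of clear_siginfo() */
    info->si_signo = signo;
    info->si_code = SI_USER;
    info->si_pid = pid;
    info->si_uid = uid;
}
```

Because every byte is defined after the call, the structure can later be copied as a single block rather than field by field, which is the follow-up simplification the commit message mentions.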
  18. 06 Jan, 2018 7 commits
  19. 28 Nov, 2017 2 commits
    • A
      ipc, kernel, mm: annotate ->poll() instances · 9dd95748
      Al Viro committed
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      9dd95748
    • L
      Rename superblock flags (MS_xyz -> SB_xyz) · 1751e8a6
      Linus Torvalds committed
      This is a pure automated search-and-replace of the internal kernel
      superblock flags.
      
      The s_flags are now called SB_*, with the names and the values for the
      moment mirroring the MS_* flags that they're equivalent to.
      
      Note how the MS_xyz flags are the ones passed to the mount system call,
      while the SB_xyz flags are what we then use in sb->s_flags.
      
      The script to do this was:
      
          # places to look in; re security/*: it generally should *not* be
          # touched (that stuff parses mount(2) arguments directly), but
          # there are two places where we really deal with superblock flags.
          FILES="drivers/mtd drivers/staging/lustre fs ipc mm \
                  include/linux/fs.h include/uapi/linux/bfs_fs.h \
                  security/apparmor/apparmorfs.c security/apparmor/include/lib.h"
          # the list of MS_... constants
          SYMS="RDONLY NOSUID NODEV NOEXEC SYNCHRONOUS REMOUNT MANDLOCK \
                DIRSYNC NOATIME NODIRATIME BIND MOVE REC VERBOSE SILENT \
                POSIXACL UNBINDABLE PRIVATE SLAVE SHARED RELATIME KERNMOUNT \
                I_VERSION STRICTATIME LAZYTIME SUBMOUNT NOREMOTELOCK NOSEC BORN \
                ACTIVE NOUSER"
      
          SED_PROG=
          for i in $SYMS; do SED_PROG="$SED_PROG -e s/MS_$i/SB_$i/g"; done
      
          # we want files that contain at least one of MS_...,
          # with fs/namespace.c and fs/pnode.c excluded.
          L=$(for i in $SYMS; do git grep -w -l MS_$i $FILES; done| sort|uniq|grep -v '^fs/namespace.c'|grep -v '^fs/pnode.c')
      
          for f in $L; do sed -i $f $SED_PROG; done
      Requested-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1751e8a6
  20. 04 Sep, 2017 1 commit
  21. 10 Jul, 2017 1 commit
    • C
      mqueue: fix a use-after-free in sys_mq_notify() · f991af3d
      Cong Wang committed
      The retry logic for netlink_attachskb() inside sys_mq_notify()
      is nasty and vulnerable:
      
      1) The sock refcnt is already released when retry is needed
      2) The fd is controllable by user-space because we already
         release the file refcnt
      
      so when we retry but the fd has just been closed by user-space
      during this small window, we end up calling netlink_detachskb()
      on the error path which releases the sock again, later when
      the user-space closes this socket a use-after-free could be
      triggered.
      
      Setting 'sock' to NULL here should be sufficient to fix it.
      Reported-by: GeneBlue <geneblue.mail@gmail.com>
      Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f991af3d
  22. 05 Jul, 2017 1 commit
  23. 02 Mar, 2017 3 commits
  24. 28 Feb, 2017 1 commit
  25. 21 Nov, 2016 1 commit
  26. 28 Sep, 2016 1 commit