1. 07 6月, 2014 15 次提交
  2. 08 4月, 2014 2 次提交
  3. 17 3月, 2014 1 次提交
    • M
      ipc: Fix 2 bugs in msgrcv() MSG_COPY implementation · 4f87dac3
      Michael Kerrisk 提交于
      While testing and documenting the msgrcv() MSG_COPY flag that Stanislav
      Kinsbursky added in commit 4a674f34 ("ipc: introduce message queue
      copy feature" => kernel 3.8), I discovered a couple of bugs in the
      implementation.  The two bugs concern MSG_COPY interactions with other
      msgrcv() flags, namely:
      
       (A) MSG_COPY + MSG_EXCEPT
       (B) MSG_COPY + !IPC_NOWAIT
      
      The bugs are distinct (and the fix for the first one is obvious),
      however my fix for both is a single-line patch, which is why I'm
      combining them in a single mail, rather than writing two mails+patches.
      
       ===== (A) MSG_COPY + MSG_EXCEPT =====
      
      With the addition of the MSG_COPY flag, there are now two msgrcv()
      flags--MSG_COPY and MSG_EXCEPT--that modify the meaning of the 'msgtyp'
      argument in unrelated ways.  Specifying both in the same call is a
      logical error that is currently permitted, with the effect that MSG_COPY
      has priority and MSG_EXCEPT is ignored.  The call should give an error
      if both flags are specified.  The patch below implements that behavior.
      
       ===== (B) (B) MSG_COPY + !IPC_NOWAIT =====
      
      The test code that was submitted in commit 3a665531 ("selftests: IPC
      message queue copy feature test") shows MSG_COPY being used in
      conjunction with IPC_NOWAIT.  In other words, if there is no message at
      the position 'msgtyp'.  return immediately with the error in ENOMSG.
      
      What was not (fully) tested is the behavior if MSG_COPY is specified
      *without* IPC_NOWAIT, and there is an odd behavior.  If the queue
      contains less than 'msgtyp' messages, then the call blocks until the
      next message is written to the queue.  At that point, the msgrcv() call
      returns a copy of the newly added message, regardless of whether that
      message is at the ordinal position 'msgtyp'.  This is clearly bogus, and
      problematic for applications that might want to make use of the MSG_COPY
      flag.
      
      I considered the following possible solutions to this problem:
      
       (1) Force the call to block until a message *does* appear at the
           position 'msgtyp'.
      
       (2) If the MSG_COPY flag is specified, the kernel should implicitly add
           IPC_NOWAIT, so that the call fails with ENOMSG for this case.
      
       (3) If the MSG_COPY flag is specified, but IPC_NOWAIT is not, generate
           an error (probably, EINVAL is the right one).
      
      I do not know if any application would really want to have the
      functionality of solution (1), especially since an application can
      determine in advance the number of messages in the queue using msgctl()
      IPC_STAT.  Obviously, this solution would be the most work to implement.
      
      Solution (2) would have the effect of silently fixing any applications
      that tried to employ broken behavior.  However, it would mean that if we
      later decided to implement solution (1), then user-space could not
      easily detect what the kernel supports (but, since I'm somewhat doubtful
      that solution (1) is needed, I'm not sure that this is much of a
      problem).
      
      Solution (3) would have the effect of informing broken applications that
      they are doing something broken.  The downside is that this would cause
      a ABI breakage for any applications that are currently employing the
      broken behavior.  However:
      
      a) Those applications are almost certainly not getting the results they
         expect.
      b) Possibly, those applications don't even exist, because MSG_COPY is
         currently hidden behind CONFIG_CHECKPOINT_RESTORE.
      
      The upside of solution (3) is that if we later decided to implement
      solution (1), user-space could determine what the kernel supports, via
      the error return.
      
      In my view, solution (3) is mildly preferable to solution (2), and
      solution (1) could still be done later if anyone really cares.  The
      patch below implements solution (3).
      
      PS.  For anyone out there still listening, it's the usual story:
      documenting an API (and the thinking about, and the testing of the API,
      that documentation entails) is the one of the single best ways of
      finding bugs in the API, as I've learned from a lot of experience.  Best
      to do that documentation before releasing the API.
      Signed-off-by: NMichael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NStanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: stable@vger.kernel.org
      Cc: Serge Hallyn <serge.hallyn@canonical.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4f87dac3
  4. 06 3月, 2014 3 次提交
  5. 26 2月, 2014 1 次提交
    • D
      ipc,mqueue: remove limits for the amount of system-wide queues · f3713fd9
      Davidlohr Bueso 提交于
      Commit 93e6f119 ("ipc/mqueue: cleanup definition names and
      locations") added global hardcoded limits to the amount of message
      queues that can be created.  While these limits are per-namespace,
      reality is that it ends up breaking userspace applications.
      Historically users have, at least in theory, been able to create up to
      INT_MAX queues, and limiting it to just 1024 is way too low and dramatic
      for some workloads and use cases.  For instance, Madars reports:
      
       "This update imposes bad limits on our multi-process application.  As
        our app uses approaches that each process opens its own set of queues
        (usually something about 3-5 queues per process).  In some scenarios
        we might run up to 3000 processes or more (which of-course for linux
        is not a problem).  Thus we might need up to 9000 queues or more.  All
        processes run under one user."
      
      Other affected users can be found in launchpad bug #1155695:
        https://bugs.launchpad.net/ubuntu/+source/manpages/+bug/1155695
      
      Instead of increasing this limit, revert it entirely and fallback to the
      original way of dealing queue limits -- where once a user's resource
      limit is reached, and all memory is used, new queues cannot be created.
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Reported-by: NMadars Vitolins <m@silodev.com>
      Acked-by: NDoug Ledford <dledford@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3713fd9
  6. 03 2月, 2014 1 次提交
    • H
      compat: Get rid of (get|put)_compat_time(val|spec) · 81993e81
      H. Peter Anvin 提交于
      We have two APIs for compatiblity timespec/val, with confusingly
      similar names.  compat_(get|put)_time(val|spec) *do* handle the case
      where COMPAT_USE_64BIT_TIME is set, whereas
      (get|put)_compat_time(val|spec) do not.  This is an accident waiting
      to happen.
      
      Clean it up by favoring the full-service version; the limited version
      is replaced with double-underscore versions static to kernel/compat.c.
      
      A common pattern is to convert a struct timespec to kernel format in
      an allocation on the user stack.  Unfortunately it is open-coded in
      several places.  Since this allocation isn't actually needed if
      COMPAT_USE_64BIT_TIME is true (since user format == kernel format)
      encapsulate that whole pattern into the function
      compat_convert_timespec().  An equivalent function should be written
      for struct timeval if it is needed in the future.
      
      Finally, get rid of compat_(get|put)_timeval_convert(): each was only
      used once, and the latter was not even doing what the function said
      (no conversion actually was being done.)  Moving the conversion into
      compat_sys_settimeofday() itself makes the code much more similar to
      sys_settimeofday() itself.
      
      v3: Remove unused compat_convert_timeval().
      
      v2: Drop bogus "const" in the destination argument for
          compat_convert_time*().
      
      Cc: Mauro Carvalho Chehab <m.chehab@samsung.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Hans Verkuil <hans.verkuil@cisco.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Mateusz Guzik <mguzik@redhat.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Tested-by: NH.J. Lu <hjl.tools@gmail.com>
      Signed-off-by: NH. Peter Anvin <hpa@linux.intel.com>
      81993e81
  7. 28 1月, 2014 11 次提交
  8. 22 11月, 2013 2 次提交
    • J
      ipc,shm: correct error return value in shmctl (SHM_UNLOCK) · 3a72660b
      Jesper Nilsson 提交于
      Commit 2caacaa8 ("ipc,shm: shorten critical region for shmctl")
      restructured the ipc shm to shorten critical region, but introduced a
      path where the return value could be -EPERM, even if the operation
      actually was performed.
      
      Before the commit, the err return value was reset by the return value
      from security_shm_shmctl() after the if (!ns_capable(...)) statement.
      
      Now, we still exit the if statement with err set to -EPERM, and in the
      case of SHM_UNLOCK, it is not reset at all, and used as the return value
      from shmctl.
      
      To fix this, we only set err when errors occur, leaving the fallthrough
      case alone.
      Signed-off-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a72660b
    • G
      ipc,shm: fix shm_file deletion races · a399b29d
      Greg Thelen 提交于
      When IPC_RMID races with other shm operations there's potential for
      use-after-free of the shm object's associated file (shm_file).
      
      Here's the race before this patch:
      
        TASK 1                     TASK 2
        ------                     ------
        shm_rmid()
          ipc_lock_object()
                                   shmctl()
                                   shp = shm_obtain_object_check()
      
          shm_destroy()
            shum_unlock()
            fput(shp->shm_file)
                                   ipc_lock_object()
                                   shmem_lock(shp->shm_file)
                                   <OOPS>
      
      The oops is caused because shm_destroy() calls fput() after dropping the
      ipc_lock.  fput() clears the file's f_inode, f_path.dentry, and
      f_path.mnt, which causes various NULL pointer references in task 2.  I
      reliably see the oops in task 2 if with shmlock, shmu
      
      This patch fixes the races by:
      1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock().
      2) modify at risk operations to check shm_file while holding
         ipc_object_lock().
      
      Example workloads, which each trigger oops...
      
      Workload 1:
        while true; do
          id=$(shmget 1 4096)
          shm_rmid $id &
          shmlock $id &
          wait
        done
      
        The oops stack shows accessing NULL f_inode due to racing fput:
          _raw_spin_lock
          shmem_lock
          SyS_shmctl
      
      Workload 2:
        while true; do
          id=$(shmget 1 4096)
          shmat $id 4096 &
          shm_rmid $id &
          wait
        done
      
        The oops stack is similar to workload 1 due to NULL f_inode:
          touch_atime
          shmem_mmap
          shm_mmap
          mmap_region
          do_mmap_pgoff
          do_shmat
          SyS_shmat
      
      Workload 3:
        while true; do
          id=$(shmget 1 4096)
          shmlock $id
          shm_rmid $id &
          shmunlock $id &
          wait
        done
      
        The oops stack shows second fput tripping on an NULL f_inode.  The
        first fput() completed via from shm_destroy(), but a racing thread did
        a get_file() and queued this fput():
          locks_remove_flock
          __fput
          ____fput
          task_work_run
          do_notify_resume
          int_signal
      
      Fixes: c2c737a0 ("ipc,shm: shorten critical region for shmat")
      Fixes: 2caacaa8 ("ipc,shm: shorten critical region for shmctl")
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>  # 3.10.17+ 3.11.6+
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a399b29d
  9. 13 11月, 2013 2 次提交
    • M
      ipc, msg: fix message length check for negative values · 4e9b45a1
      Mathias Krause 提交于
      On 64 bit systems the test for negative message sizes is bogus as the
      size, which may be positive when evaluated as a long, will get truncated
      to an int when passed to load_msg().  So a long might very well contain a
      positive value but when truncated to an int it would become negative.
      
      That in combination with a small negative value of msg_ctlmax (which will
      be promoted to an unsigned type for the comparison against msgsz, making
      it a big positive value and therefore make it pass the check) will lead to
      two problems: 1/ The kmalloc() call in alloc_msg() will allocate a too
      small buffer as the addition of alen is effectively a subtraction.  2/ The
      copy_from_user() call in load_msg() will first overflow the buffer with
      userland data and then, when the userland access generates an access
      violation, the fixup handler copy_user_handle_tail() will try to fill the
      remainder with zeros -- roughly 4GB.  That almost instantly results in a
      system crash or reset.
      
        ,-[ Reproducer (needs to be run as root) ]--
        | #include <sys/stat.h>
        | #include <sys/msg.h>
        | #include <unistd.h>
        | #include <fcntl.h>
        |
        | int main(void) {
        |     long msg = 1;
        |     int fd;
        |
        |     fd = open("/proc/sys/kernel/msgmax", O_WRONLY);
        |     write(fd, "-1", 2);
        |     close(fd);
        |
        |     msgsnd(0, &msg, 0xfffffff0, IPC_NOWAIT);
        |
        |     return 0;
        | }
        '---
      
      Fix the issue by preventing msgsz from getting truncated by consistently
      using size_t for the message length.  This way the size checks in
      do_msgsnd() could still be passed with a negative value for msg_ctlmax but
      we would fail on the buffer allocation in that case and error out.
      
      Also change the type of m_ts from int to size_t to avoid similar nastiness
      in other code paths -- it is used in similar constructs, i.e.  signed vs.
      unsigned checks.  It should never become negative under normal
      circumstances, though.
      
      Setting msg_ctlmax to a negative value is an odd configuration and should
      be prevented.  As that might break existing userland, it will be handled
      in a separate commit so it could easily be reverted and reworked without
      reintroducing the above described bug.
      
      Hardening mechanisms for user copy operations would have catched that bug
      early -- e.g.  checking slab object sizes on user copy operations as the
      usercopy feature of the PaX patch does.  Or, for that matter, detect the
      long vs.  int sign change due to truncation, as the size overflow plugin
      of the very same patch does.
      
      [akpm@linux-foundation.org: fix i386 min() warnings]
      Signed-off-by: NMathias Krause <minipli@googlemail.com>
      Cc: Pax Team <pageexec@freemail.hu>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>	[ v2.3.27+ -- yes, that old ;) ]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e9b45a1
    • X
      ipc/util.c: remove unnecessary work pending test · 206fa940
      Xie XiuQi 提交于
      Remove unnecessary work pending test before calling schedule_work().  It
      has been tested in queue_work_on() already.  No functional changed.
      Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com>
      Cc: Tejun Heo <tj@kernel.org>
      Reviewed-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      206fa940
  10. 09 11月, 2013 1 次提交
    • J
      locks: break delegations on unlink · b21996e3
      J. Bruce Fields 提交于
      We need to break delegations on any operation that changes the set of
      links pointing to an inode.  Start with unlink.
      
      Such operations also hold the i_mutex on a parent directory.  Breaking a
      delegation may require waiting for a timeout (by default 90 seconds) in
      the case of a unresponsive NFS client.  To avoid blocking all directory
      operations, we therefore drop locks before waiting for the delegation.
      The logic then looks like:
      
      	acquire locks
      	...
      	test for delegation; if found:
      		take reference on inode
      		release locks
      		wait for delegation break
      		drop reference on inode
      		retry
      
      It is possible this could never terminate.  (Even if we take precautions
      to prevent another delegation being acquired on the same inode, we could
      get a different inode on each retry.)  But this seems very unlikely.
      
      The initial test for a delegation happens after the lock on the target
      inode is acquired, but the directory inode may have been acquired
      further up the call stack.  We therefore add a "struct inode **"
      argument to any intervening functions, which we use to pass the inode
      back up to the caller in the case it needs a delegation synchronously
      broken.
      
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tyler Hicks <tyhicks@canonical.com>
      Cc: Dustin Kirkland <dustin.kirkland@gazzang.com>
      Acked-by: NJeff Layton <jlayton@redhat.com>
      Signed-off-by: NJ. Bruce Fields <bfields@redhat.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      b21996e3
  11. 04 11月, 2013 1 次提交