1. 14 10月, 2014 1 次提交
    • O
      ipc/shm: kill the historical/wrong mm->start_stack check · bf77b94c
      Oleg Nesterov 提交于
      do_shmat() is the only user of ->start_stack (proc just reports its
      value), and this check looks ugly and wrong.
      
      The reason for this check is not clear at all, and it wrongly assumes that
      the stack can only grow down.
      
      But the main problem is that in general mm->start_stack has nothing to do
      with stack_vma->vm_start.  Not only the application can switch to another
      stack and even unmap this area, setup_arg_pages() expands the stack
      without updating mm->start_stack during exec().  This means that in the
      likely case "addr > start_stack - size - PAGE_SIZE * 5" is simply
      impossible after find_vma_intersection() == F, or the stack can't grow
      anyway because of RLIMIT_STACK.
      
      Many thanks to Hugh for his explanations.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bf77b94c
  2. 09 8月, 2014 2 次提交
    • J
      shm: allow exit_shm in parallel if only marking orphans · 83293c0f
      Jack Miller 提交于
      If shm_rmid_force (the default state) is not set then the shmids are only
      marked as orphaned and does not require any add, delete, or locking of the
      tree structure.
      
      Seperate the sysctl on and off case, and only obtain the read lock.  The
      newly added list head can be deleted under the read lock because we are
      only called with current and will only change the semids allocated by this
      task and not manipulate the list.
      
      This commit assumes that up_read includes a sufficient memory barrier for
      the writes to be seen my others that later obtain a write lock.
      Signed-off-by: NMilton Miller <miltonm@bga.com>
      Signed-off-by: NJack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83293c0f
    • J
      shm: make exit_shm work proportional to task activity · ab602f79
      Jack Miller 提交于
      This is small set of patches our team has had kicking around for a few
      versions internally that fixes tasks getting hung on shm_exit when there
      are many threads hammering it at once.
      
      Anton wrote a simple test to cause the issue:
      
        http://ozlabs.org/~anton/junkcode/bust_shm_exit.c
      
      Before applying this patchset, this test code will cause either hanging
      tracebacks or pthread out of memory errors.
      
      After this patchset, it will still produce output like:
      
        root@somehost:~# ./bust_shm_exit 1024 160
        ...
        INFO: rcu_sched detected stalls on CPUs/tasks: {} (detected by 116, t=2111 jiffies, g=241, c=240, q=7113)
        INFO: Stall ended before state dump start
        ...
      
      But the task will continue to run along happily, so we consider this an
      improvement over hanging, even if it's a bit noisy.
      
      This patch (of 3):
      
      exit_shm obtains the ipc_ns shm rwsem for write and holds it while it
      walks every shared memory segment in the namespace.  Thus the amount of
      work is related to the number of shm segments in the namespace not the
      number of segments that might need to be cleaned.
      
      In addition, this occurs after the task has been notified the thread has
      exited, so the number of tasks waiting for the ns shm rwsem can grow
      without bound until memory is exausted.
      
      Add a list to the task struct of all shmids allocated by this task.  Init
      the list head in copy_process.  Use the ns->rwsem for locking.  Add
      segments after id is added, remove before removing from id.
      
      On unshare of NEW_IPCNS orphan any ids as if the task had exited, similar
      to handling of semaphore undo.
      
      I chose a define for the init sequence since its a simple list init,
      otherwise it would require a function call to avoid include loops between
      the semaphore code and the task struct.  Converting the list_del to
      list_del_init for the unshare cases would remove the exit followed by
      init, but I left it blow up if not inited.
      Signed-off-by: NMilton Miller <miltonm@bga.com>
      Signed-off-by: NJack Miller <millerjo@us.ibm.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab602f79
  3. 07 6月, 2014 6 次提交
  4. 28 1月, 2014 3 次提交
  5. 22 11月, 2013 2 次提交
    • J
      ipc,shm: correct error return value in shmctl (SHM_UNLOCK) · 3a72660b
      Jesper Nilsson 提交于
      Commit 2caacaa8 ("ipc,shm: shorten critical region for shmctl")
      restructured the ipc shm to shorten critical region, but introduced a
      path where the return value could be -EPERM, even if the operation
      actually was performed.
      
      Before the commit, the err return value was reset by the return value
      from security_shm_shmctl() after the if (!ns_capable(...)) statement.
      
      Now, we still exit the if statement with err set to -EPERM, and in the
      case of SHM_UNLOCK, it is not reset at all, and used as the return value
      from shmctl.
      
      To fix this, we only set err when errors occur, leaving the fallthrough
      case alone.
      Signed-off-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>	[3.12.x]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a72660b
    • G
      ipc,shm: fix shm_file deletion races · a399b29d
      Greg Thelen 提交于
      When IPC_RMID races with other shm operations there's potential for
      use-after-free of the shm object's associated file (shm_file).
      
      Here's the race before this patch:
      
        TASK 1                     TASK 2
        ------                     ------
        shm_rmid()
          ipc_lock_object()
                                   shmctl()
                                   shp = shm_obtain_object_check()
      
          shm_destroy()
            shum_unlock()
            fput(shp->shm_file)
                                   ipc_lock_object()
                                   shmem_lock(shp->shm_file)
                                   <OOPS>
      
      The oops is caused because shm_destroy() calls fput() after dropping the
      ipc_lock.  fput() clears the file's f_inode, f_path.dentry, and
      f_path.mnt, which causes various NULL pointer references in task 2.  I
      reliably see the oops in task 2 if with shmlock, shmu
      
      This patch fixes the races by:
      1) set shm_file=NULL in shm_destroy() while holding ipc_object_lock().
      2) modify at risk operations to check shm_file while holding
         ipc_object_lock().
      
      Example workloads, which each trigger oops...
      
      Workload 1:
        while true; do
          id=$(shmget 1 4096)
          shm_rmid $id &
          shmlock $id &
          wait
        done
      
        The oops stack shows accessing NULL f_inode due to racing fput:
          _raw_spin_lock
          shmem_lock
          SyS_shmctl
      
      Workload 2:
        while true; do
          id=$(shmget 1 4096)
          shmat $id 4096 &
          shm_rmid $id &
          wait
        done
      
        The oops stack is similar to workload 1 due to NULL f_inode:
          touch_atime
          shmem_mmap
          shm_mmap
          mmap_region
          do_mmap_pgoff
          do_shmat
          SyS_shmat
      
      Workload 3:
        while true; do
          id=$(shmget 1 4096)
          shmlock $id
          shm_rmid $id &
          shmunlock $id &
          wait
        done
      
        The oops stack shows second fput tripping on an NULL f_inode.  The
        first fput() completed via from shm_destroy(), but a racing thread did
        a get_file() and queued this fput():
          locks_remove_flock
          __fput
          ____fput
          task_work_run
          do_notify_resume
          int_signal
      
      Fixes: c2c737a0 ("ipc,shm: shorten critical region for shmat")
      Fixes: 2caacaa8 ("ipc,shm: shorten critical region for shmctl")
      Signed-off-by: NGreg Thelen <gthelen@google.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: <stable@vger.kernel.org>  # 3.10.17+ 3.11.6+
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a399b29d
  6. 25 9月, 2013 1 次提交
    • D
      ipc: fix race with LSMs · 53dad6d3
      Davidlohr Bueso 提交于
      Currently, IPC mechanisms do security and auditing related checks under
      RCU.  However, since security modules can free the security structure,
      for example, through selinux_[sem,msg_queue,shm]_free_security(), we can
      race if the structure is freed before other tasks are done with it,
      creating a use-after-free condition.  Manfred illustrates this nicely,
      for instance with shared mem and selinux:
      
       -> do_shmat calls rcu_read_lock()
       -> do_shmat calls shm_object_check().
           Checks that the object is still valid - but doesn't acquire any locks.
           Then it returns.
       -> do_shmat calls security_shm_shmat (e.g. selinux_shm_shmat)
       -> selinux_shm_shmat calls ipc_has_perm()
       -> ipc_has_perm accesses ipc_perms->security
      
      shm_close()
       -> shm_close acquires rw_mutex & shm_lock
       -> shm_close calls shm_destroy
       -> shm_destroy calls security_shm_free (e.g. selinux_shm_free_security)
       -> selinux_shm_free_security calls ipc_free_security(&shp->shm_perm)
       -> ipc_free_security calls kfree(ipc_perms->security)
      
      This patch delays the freeing of the security structures after all RCU
      readers are done.  Furthermore it aligns the security life cycle with
      that of the rest of IPC - freeing them based on the reference counter.
      For situations where we need not free security, the current behavior is
      kept.  Linus states:
      
       "... the old behavior was suspect for another reason too: having the
        security blob go away from under a user sounds like it could cause
        various other problems anyway, so I think the old code was at least
        _prone_ to bugs even if it didn't have catastrophic behavior."
      
      I have tested this patch with IPC testcases from LTP on both my
      quad-core laptop and on a 64 core NUMA server.  In both cases selinux is
      enabled, and tests pass for both voluntary and forced preemption models.
      While the mentioned races are theoretical (at least no one as reported
      them), I wanted to make sure that this new logic doesn't break anything
      we weren't aware of.
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NDavidlohr Bueso <davidlohr@hp.com>
      Acked-by: NManfred Spraul <manfred@colorfullife.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53dad6d3
  7. 12 9月, 2013 10 次提交
  8. 10 7月, 2013 4 次提交
  9. 10 5月, 2013 1 次提交
  10. 08 5月, 2013 1 次提交
  11. 01 5月, 2013 1 次提交
    • R
      ipc: sysv shared memory limited to 8TiB · d69f3bad
      Robin Holt 提交于
      Trying to run an application which was trying to put data into half of
      memory using shmget(), we found that having a shmall value below 8EiB-8TiB
      would prevent us from using anything more than 8TiB.  By setting
      kernel.shmall greater than 8EiB-8TiB would make the job work.
      
      In the newseg() function, ns->shm_tot which, at 8TiB is INT_MAX.
      
      ipc/shm.c:
       458 static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
       459 {
      ...
       465         int numpages = (size + PAGE_SIZE -1) >> PAGE_SHIFT;
      ...
       474         if (ns->shm_tot + numpages > ns->shm_ctlall)
       475                 return -ENOSPC;
      
      [akpm@linux-foundation.org: make ipc/shm.c:newseg()'s numpages size_t, not int]
      Signed-off-by: NRobin Holt <holt@sgi.com>
      Reported-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d69f3bad
  12. 24 2月, 2013 2 次提交
  13. 23 2月, 2013 2 次提交
  14. 12 12月, 2012 1 次提交
    • A
      mm: support more pagesizes for MAP_HUGETLB/SHM_HUGETLB · 42d7395f
      Andi Kleen 提交于
      There was some desire in large applications using MAP_HUGETLB or
      SHM_HUGETLB to use 1GB huge pages on some mappings, and stay with 2MB on
      others.  This is useful together with NUMA policy: use 2MB interleaving
      on some mappings, but 1GB on local mappings.
      
      This patch extends the IPC/SHM syscall interfaces slightly to allow
      specifying the page size.
      
      It borrows some upper bits in the existing flag arguments and allows
      encoding the log of the desired page size in addition to the *_HUGETLB
      flag.  When 0 is specified the default size is used, this makes the
      change fully compatible.
      
      Extending the internal hugetlb code to handle this is straight forward.
      Instead of a single mount it just keeps an array of them and selects the
      right mount based on the specified page size.  When no page size is
      specified it uses the mount of the default page size.
      
      The change is not visible in /proc/mounts because internal mounts don't
      appear there.  It also has very little overhead: the additional mounts
      just consume a super block, but not more memory when not used.
      
      I also exported the new flags to the user headers (they were previously
      under __KERNEL__).  Right now only symbols for x86 and some other
      architecture for 1GB and 2MB are defined.  The interface should already
      work for all other architectures though.  Only architectures that define
      multiple hugetlb sizes actually need it (that is currently x86, tile,
      powerpc).  However tile and powerpc have user configurable hugetlb
      sizes, so it's not easy to add defines.  A program on those
      architectures would need to query sysfs and use the appropiate log2.
      
      [akpm@linux-foundation.org: cleanups]
      [rientjes@google.com: fix build]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NAndi Kleen <ak@linux.intel.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      42d7395f
  15. 07 9月, 2012 1 次提交
  16. 31 7月, 2012 1 次提交
  17. 08 6月, 2012 1 次提交