1. 17 4月, 2015 10 次提交
    • D
      prctl: avoid using mmap_sem for exe_file serialization · 6e399cd1
      Davidlohr Bueso 提交于
      Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
      of calling set_mm_exe_file() which requires some form of serialization --
      mmap_sem in this case.  For archs that do not have atomic rmw instructions
      we still fallback to a spinlock alternative, so this should always be
      safe.  As such, we only need the mmap_sem for looking up the backing
      vm_file, which can be done sharing the lock.  Naturally, this means we
      need to manually deal with both the new and old file reference counting,
      and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
      probably be deleted in the future anyway.
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e399cd1
    • K
      mm: rcu-protected get_mm_exe_file() · 90f31d0e
      Konstantin Khlebnikov 提交于
      This patch removes mm->mmap_sem from mm->exe_file read side.
      Also it kills dup_mm_exe_file() and moves exe_file duplication into
      dup_mmap() where both mmap_sems are locked.
      
      [akpm@linux-foundation.org: fix comment typo]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f31d0e
    • H
      kernel/sysctl.c: threads-max observe limits · 16db3d3f
      Heinrich Schuchardt 提交于
      Users can change the maximum number of threads by writing to
      /proc/sys/kernel/threads-max.
      
      With the patch the value entered is checked against the same limits that
      apply when fork_init is called.
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16db3d3f
    • H
      kernel/fork.c: avoid division by zero · ac1b398d
      Heinrich Schuchardt 提交于
      PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
      THREAD_SIZE.
      
      E.g.  architecture hexagon may have page size 1M and thread size 4096.
      This would lead to a division by zero in the calculation of max_threads.
      
      With 32-bit calculation there is no solution which delivers valid results
      for all possible combinations of the parameters.  The code is only called
      once.  Hence a 64-bit calculation can be used as solution.
      
      [akpm@linux-foundation.org: use clamp_t(), per Oleg]
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac1b398d
    • H
      kernel/fork.c: new function for max_threads · ff691f6e
      Heinrich Schuchardt 提交于
      PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
      THREAD_SIZE.
      
      E.g.  architecture hexagon may have page size 1M and thread size 4096.
      This would lead to a division by zero in the calculation of max_threads.
      
      With this patch the buggy code is moved to a separate function
      set_max_threads.  The error is not fixed.
      
      After fixing the problem in a separate patch the new function can be
      reused to adjust max_threads after adding or removing memory.
      
      Argument mempages of function fork_init() is removed as totalram_pages is
      an exported symbol.
      
      The creation of separate patches for refactoring to a new function and for
      fixing the logic was suggested by Ingo Molnar.
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff691f6e
    • J
      fork_init: update max_threads comment · 3ea7f5e2
      Jean Delvare 提交于
      The comment explaining what value max_threads is set to is outdated.  The
      maximum memory consumption ratio for thread structures was 1/2 until
      February 2002, then it was briefly changed to 1/16 before being set to 1/8
      which we still use today.  The comment was never updated to reflect that
      change, it's about time.
      Signed-off-by: NJean Delvare <jdelvare@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea7f5e2
    • M
      fork: report pid reservation failure properly · 35f71bc0
      Michal Hocko 提交于
      copy_process will report any failure in alloc_pid as ENOMEM currently
      which is misleading because the pid allocation might fail not only when
      the memory is short but also when the pid space is consumed already.
      
      The current man page even mentions this case:
      
      : EAGAIN
      :
      :       A system-imposed limit on the number of threads was encountered.
      :       There are a number of limits that may trigger this error: the
      :       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      :       limits the number of processes and threads for a real user ID, was
      :       reached; the kernel's system-wide limit on the number of processes
      :       and threads, /proc/sys/kernel/threads-max, was reached (see
      :       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      :       was reached (see proc(5)).
      
      so the current behavior is also incorrect wrt.  documentation.  POSIX man
      page also suggest returing EAGAIN when the process count limit is reached.
      
      This patch simply propagates error code from alloc_pid and makes sure we
      return -EAGAIN due to reservation failure.  This will make behavior of
      fork closer to both our documentation and POSIX.
      
      alloc_pid might alsoo fail when the reaper in the pid namespace is dead
      (the namespace basically disallows all new processes) and there is no
      good error code which would match documented ones. We have traditionally
      returned ENOMEM for this case which is misleading as well but as per
      Eric W. Biederman this behavior is documented in man pid_namespaces(7)
      
      : If the "init" process of a PID namespace terminates, the kernel
      : terminates all of the processes in the namespace via a SIGKILL signal.
      : This behavior reflects the fact that the "init" process is essential for
      : the correct operation of a PID namespace.  In this case, a subsequent
      : fork(2) into this PID namespace will fail with the error ENOMEM; it is
      : not possible to create a new processes in a PID namespace whose "init"
      : process has terminated.
      
      and introducing a new error code would be too risky so let's stick to
      ENOMEM for this case.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35f71bc0
    • V
      signal: remove warning about using SI_TKILL in rt_[tg]sigqueueinfo · 69828dce
      Vladimir Davydov 提交于
      Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue
      a warning on the first attempt of doing it.  We use WARN_ON_ONCE, which is
      not informative and, what is worse, taints the kernel, making the trinity
      syscall fuzzer complain false-positively from time to time.
      
      It does not look like we need this warning at all, because the behaviour
      changed quite a long time ago (2.6.39), and if an application relies on
      the old API, it gets EPERM anyway and can issue a warning by itself.
      
      So let us zap the warning in kernel.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69828dce
    • O
      ptrace: ptrace_detach() can no longer race with SIGKILL · 64a4096c
      Oleg Nesterov 提交于
      ptrace_detach() re-checks ->ptrace under tasklist lock and calls
      release_task() if __ptrace_detach() returns true.  This was needed because
      the __TASK_TRACED tracee could be killed/untraced, and it could even pass
      exit_notify() before we take tasklist_lock.
      
      But this is no longer possible after 9899d11f "ptrace: ensure
      arch_ptrace/ptrace_request can never race with SIGKILL".  We can turn
      these checks into WARN_ON() and remove release_task().
      
      While at it, document the setting of child->exit_code.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Pavel Labath <labath@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64a4096c
    • O
      ptrace: fix race between ptrace_resume() and wait_task_stopped() · b72c1869
      Oleg Nesterov 提交于
      ptrace_resume() is called when the tracee is still __TASK_TRACED.  We set
      tracee->exit_code and then wake_up_state() changes tracee->state.  If the
      tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T)
      wrongly looks like another report from tracee.
      
      This confuses debugger, and since wait_task_stopped() clears ->exit_code
      the tracee can miss a signal.
      
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      	#include <sys/ptrace.h>
      	#include <pthread.h>
      	#include <assert.h>
      
      	int pid;
      
      	void *waiter(void *arg)
      	{
      		int stat;
      
      		for (;;) {
      			assert(pid == wait(&stat));
      			assert(WIFSTOPPED(stat));
      			if (WSTOPSIG(stat) == SIGHUP)
      				continue;
      
      			assert(WSTOPSIG(stat) == SIGCONT);
      			printf("ERR! extra/wrong report:%x\n", stat);
      		}
      	}
      
      	int main(void)
      	{
      		pthread_t thread;
      
      		pid = fork();
      		if (!pid) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			for (;;)
      				kill(getpid(), SIGHUP);
      		}
      
      		assert(pthread_create(&thread, NULL, waiter, NULL) == 0);
      
      		for (;;)
      			ptrace(PTRACE_CONT, pid, 0, SIGCONT);
      
      		return 0;
      	}
      
      Note for stable: the bug is very old, but without 9899d11f "ptrace:
      ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix
      should use lock_task_sighand(child).
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NPavel Labath <labath@google.com>
      Tested-by: NPavel Labath <labath@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b72c1869
  2. 16 4月, 2015 8 次提交
  3. 15 4月, 2015 10 次提交
  4. 13 4月, 2015 3 次提交
  5. 12 4月, 2015 1 次提交
  6. 09 4月, 2015 4 次提交
    • J
      locking/mutex: Further simplify mutex_spin_on_owner() · 01ac33c1
      Jason Low 提交于
      Similar to what Linus suggested for rwsem_spin_on_owner(), in
      mutex_spin_on_owner() instead of having while (true) and
      breaking out of the spin loop on lock->owner != owner, we can
      have the loop directly check for while (lock->owner == owner) to
      improve the readability of the code.
      
      It also shrinks the code a bit:
      
         text    data     bss     dec     hex filename
         3721       0       0    3721     e89 mutex.o.before
         3705       0       0    3705     e79 mutex.o.after
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Link: http://lkml.kernel.org/r/1428521960-5268-2-git-send-email-jason.low2@hp.com
      [ Added code generation info. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      01ac33c1
    • L
      Copy the kernel module data from user space in chunks · 3afe9f84
      Linus Torvalds 提交于
      Unlike most (all?) other copies from user space, kernel module loading
      is almost unlimited in size.  So we do a potentially huge
      "copy_from_user()" when we copy the module data from user space to the
      kernel buffer, which can be a latency concern when preemption is
      disabled (or voluntary).
      
      Also, because 'copy_from_user()' clears the tail of the kernel buffer on
      failures, even a *failed* copy can end up wasting a lot of time.
      
      Normally neither of these are concerns in real life, but they do trigger
      when doing stress-testing with trinity.  Running in a VM seems to add
      its own overheadm causing trinity module load testing to even trigger
      the watchdog.
      
      The simple fix is to just chunk up the module loading, so that it never
      tries to copy insanely big areas in one go.  That bounds the latency,
      and also the amount of (unnecessarily, in this case) cleared memory for
      the failure case.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3afe9f84
    • M
      genirq: Allow the irqchip state of an IRQ to be save/restored · 1b7047ed
      Marc Zyngier 提交于
      There is a number of cases where a kernel subsystem may want to
      introspect the state of an interrupt at the irqchip level:
      
      - When a peripheral is shared between virtual machines,
        its interrupt state becomes part of the guest's state,
        and must be switched accordingly. KVM on arm/arm64 requires
        this for its guest-visible timer
      - Some GPIO controllers seem to require peeking into the
        interrupt controller they are connected to to report
        their internal state
      
      This seem to be a pattern that is common enough for the core code
      to try and support this without too many horrible hacks. Introduce
      a pair of accessors (irq_get_irqchip_state/irq_set_irqchip_state)
      to retrieve the bits that can be of interest to another subsystem:
      pending, active, and masked.
      
      - irq_get_irqchip_state returns the state of the interrupt according
        to a parameter set to IRQCHIP_STATE_PENDING, IRQCHIP_STATE_ACTIVE,
        IRQCHIP_STATE_MASKED or IRQCHIP_STATE_LINE_LEVEL.
      - irq_set_irqchip_state similarly sets the state of the interrupt.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: NBjorn Andersson <bjorn.andersson@sonymobile.com>
      Tested-by: NBjorn Andersson <bjorn.andersson@sonymobile.com>
      Cc: linux-arm-kernel@lists.infradead.org
      Cc: Abhijeet Dharmapurikar <adharmap@codeaurora.org>
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Phong Vo <pvo@apm.com>
      Cc: Linus Walleij <linus.walleij@linaro.org>
      Cc: Tin Huynh <tnhuynh@apm.com>
      Cc: Y Vo <yvo@apm.com>
      Cc: Toan Le <toanle@apm.com>
      Cc: Bjorn Andersson <bjorn@kryo.se>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Link: http://lkml.kernel.org/r/1426676484-21812-2-git-send-email-marc.zyngier@arm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      1b7047ed
    • M
      genirq: MSI: Fix freeing of unallocated MSI · fe0c52fc
      Marc Zyngier 提交于
      While debugging an unrelated issue with the GICv3 ITS driver, the
      following trace triggered:
      
      WARNING: CPU: 1 PID: 1 at kernel/irq/irqdomain.c:1121 irq_domain_free_irqs+0x160/0x17c()
      NULL pointer, cannot free irq
      Modules linked in:
      CPU: 1 PID: 1 Comm: swapper/0 Tainted: G        W      3.19.0-rc6+ #3690
      Hardware name: FVP Base (DT)
      Call trace:
      [<ffffffc000089398>] dump_backtrace+0x0/0x13c
      [<ffffffc0000894e4>] show_stack+0x10/0x1c
      [<ffffffc00066d134>] dump_stack+0x74/0x94
      [<ffffffc0000a92f8>] warn_slowpath_common+0x9c/0xd4
      [<ffffffc0000a938c>] warn_slowpath_fmt+0x5c/0x80
      [<ffffffc0000ee04c>] irq_domain_free_irqs+0x15c/0x17c
      [<ffffffc0000ef918>] msi_domain_free_irqs+0x58/0x74
      [<ffffffc000386f58>] free_msi_irqs+0xb4/0x1c0
      
          // The msi_prepare callback fails here
      
      [<ffffffc0003872c0>] pci_enable_msix+0x25c/0x3d4
      [<ffffffc00038746c>] pci_enable_msix_range+0x34/0x80
      [<ffffffc0003924ac>] vp_try_to_find_vqs+0xec/0x528
      [<ffffffc000392954>] vp_find_vqs+0x6c/0xa8
      [<ffffffc0003ee2a8>] init_vq+0x120/0x248
      [<ffffffc0003eefb0>] virtblk_probe+0xb0/0x6bc
      [<ffffffc00038fc34>] virtio_dev_probe+0x17c/0x214
      [<ffffffc0003d4a04>] driver_probe_device+0x7c/0x23c
      [<ffffffc0003d4cb0>] __driver_attach+0x98/0xa0
      [<ffffffc0003d2c60>] bus_for_each_dev+0x60/0xb4
      [<ffffffc0003d455c>] driver_attach+0x1c/0x28
      [<ffffffc0003d41b0>] bus_add_driver+0x150/0x208
      [<ffffffc0003d54c0>] driver_register+0x64/0x130
      [<ffffffc00038f9e8>] register_virtio_driver+0x24/0x68
      [<ffffffc00091320c>] init+0x70/0xac
      [<ffffffc0000828f0>] do_one_initcall+0x94/0x1d0
      [<ffffffc0008e9b00>] kernel_init_freeable+0x144/0x1e4
      [<ffffffc00066a434>] kernel_init+0xc/0xd8
      ---[ end trace f9ee562a77cc7bae ]---
      
      The ITS msi_prepare callback having failed, we end-up trying to
      free MSIs that have never been allocated. Oddly enough, the kernel
      is pretty upset about it.
      
      It turns out that this behaviour was expected before the MSI domain
      was introduced (and dealt with in arch_teardown_msi_irqs).
      
      The obvious fix is to detect this early enough and bail out.
      Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
      Reviewed-by: NJiang Liu <jiang.liu@linux.intel.com>
      Link: http://lkml.kernel.org/r/1422299419-6051-1-git-send-email-marc.zyngier@arm.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      fe0c52fc
  7. 08 4月, 2015 4 次提交