1. 27 5月, 2015 3 次提交
    • T
      cgroup: simplify threadgroup locking · b5ba75b5
      Tejun Heo 提交于
      Now that threadgroup locking is made global, code paths around it can
      be simplified.
      
      * lock-verify-unlock-retry dancing removed from __cgroup_procs_write().
      
      * Race protection against de_thread() removed from
        cgroup_update_dfl_csses().
      Signed-off-by: NTejun Heo <tj@kernel.org>
      b5ba75b5
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · d59cfc09
      Tejun Heo 提交于
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      d59cfc09
    • T
      sched, cgroup: reorganize threadgroup locking · 7d7efec3
      Tejun Heo 提交于
      threadgroup_change_begin/end() are used to mark the beginning and end
      of threadgroup modifying operations to allow code paths which require
      a threadgroup to stay stable across blocking operations to synchronize
      against those sections using threadgroup_lock/unlock().
      
      It's currently implemented as a general mechanism in sched.h using
      per-signal_struct rwsem; however, this never grew non-cgroup use cases
      and becomes noop if !CONFIG_CGROUPS.  It turns out that cgroups is
      gonna be better served with a different sycnrhonization scheme and is
      a bit silly to keep cgroups specific details as a general mechanism.
      
      What's general here is identifying the places where threadgroups are
      modified.  This patch restructures threadgroup locking so that
      threadgroup_change_begin/end() become a place where subsystems which
      need to sycnhronize against threadgroup changes can hook into.
      
      cgroup_threadgroup_change_begin/end() which operate on the
      per-signal_struct rwsem are created and threadgroup_lock/unlock() are
      moved to cgroup.c and made static.
      
      This is pure reorganization which doesn't cause any functional
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      7d7efec3
  2. 19 5月, 2015 1 次提交
  3. 23 4月, 2015 1 次提交
  4. 20 4月, 2015 1 次提交
  5. 17 4月, 2015 18 次提交
    • S
      tracing: Fix possible out of bounds memory access when parsing enums · 3193899d
      Steven Rostedt (Red Hat) 提交于
      The code that replaces the enum names with the enum values in the
      tracepoints' format files could possible miss the end of string nul
      character. This was caused by processing things like backslashes, quotes
      and other tokens. After processing the tokens, a check for the nul
      character needed to be done before continuing the loop, because the loop
      incremented the pointer before doing the check, which could bypass the nul
      character.
      
      Link: http://lkml.kernel.org/r/552E661D.5060502@oracle.com
      
      Reported-by: Sasha Levin <sasha.levin@oracle.com> # via KASan
      Tested-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Fixes: 0c564a53 "tracing: Add TRACE_DEFINE_ENUM() macro to map enums to their values"
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      3193899d
    • D
      oprofile: reduce mmap_sem hold for mm->exe_file · 11163348
      Davidlohr Bueso 提交于
      sync_buffer() needs the mmap_sem for two distinct operations, both only
      occurring upon user context switch handling:
      
       1) Dealing with the exe_file.
      
       2) Adding the dcookie data as we need to lookup the vma that
         backs it. This is done via add_sample() and add_data().
      
      This patch isolates 1), for it will no longer need the mmap_sem for
      serialization.  However, for now, make of the more standard
      get_mm_exe_file(), requiring only holding the mmap_sem to read the value,
      and relying on reference counting to make sure that the exe file won't
      dissappear underneath us while doing the get dcookie.
      
      As a consequence, for 2) we move the mmap_sem locking into where we really
      need it, in lookup_dcookie().  The benefits are twofold: reduce mmap_sem
      hold times, and cleaner code.
      
      [akpm@linux-foundation.org: export get_mm_exe_file for arch/x86/oprofile/oprofile.ko]
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Cc: Robert Richter <rric@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11163348
    • A
      gcov: fix softlockups · 9d796e66
      Andrey Ryabinin 提交于
      gcov profiling if enabled with other heavy compile-time instrumentation
      like KASan could trigger following softlockups:
      
        NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        Modules linked in:
        irq event stamp: 22823276
        hardirqs last  enabled at (22823275): [<ffffffff86e8d10d>] mutex_lock_nested+0x7d9/0x930
        hardirqs last disabled at (22823276): [<ffffffff86e9521d>] apic_timer_interrupt+0x6d/0x80
        softirqs last  enabled at (22823172): [<ffffffff811ed969>] __do_softirq+0x4db/0x729
        softirqs last disabled at (22823167): [<ffffffff811edfcf>] irq_exit+0x7d/0x15b
        CPU: 0 PID: 1 Comm: swapper/0 Tainted: G        W       3.19.0-05245-gbb33326-dirty #3
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5.1-0-g8936dbb-20141113_115728-nilsson.home.kraxel.org 04/01/2014
        task: ffff88006cba8000 ti: ffff88006cbb0000 task.ti: ffff88006cbb0000
        RIP: kasan_mem_to_shadow+0x1e/0x1f
        Call Trace:
          strcmp+0x28/0x70
          get_node_by_name+0x66/0x99
          gcov_event+0x4f/0x69e
          gcov_enable_events+0x54/0x7b
          gcov_fs_init+0xf8/0x134
          do_one_initcall+0x1b2/0x288
          kernel_init_freeable+0x467/0x580
          kernel_init+0x15/0x18b
          ret_from_fork+0x7c/0xb0
        Kernel panic - not syncing: softlockup: hung tasks
      
      Fix this by sticking cond_resched() in gcov_enable_events().
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Reported-by: NFengguang Wu <fengguang.wu@intel.com>
      Cc: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d796e66
    • H
      kernel/sysctl.c: detect overflows when converting to int · 230633d1
      Heinrich Schuchardt 提交于
      When converting unsigned long to int overflows may occur.  These currently
      are not detected when writing to the sysctl file system.
      
      E.g. on a system where int has 32 bits and long has 64 bits
        echo 0x800001234 > /proc/sys/kernel/threads-max
      has the same effect as
        echo 0x1234 > /proc/sys/kernel/threads-max
      
      The patch adds the missing check in do_proc_dointvec_conv.
      
      With the patch an overflow will result in an error EINVAL when writing to
      the the sysctl file system.
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      230633d1
    • D
      prctl: avoid using mmap_sem for exe_file serialization · 6e399cd1
      Davidlohr Bueso 提交于
      Oleg cleverly suggested using xchg() to set the new mm->exe_file instead
      of calling set_mm_exe_file() which requires some form of serialization --
      mmap_sem in this case.  For archs that do not have atomic rmw instructions
      we still fallback to a spinlock alternative, so this should always be
      safe.  As such, we only need the mmap_sem for looking up the backing
      vm_file, which can be done sharing the lock.  Naturally, this means we
      need to manually deal with both the new and old file reference counting,
      and we need not worry about the MMF_EXE_FILE_CHANGED bits, which can
      probably be deleted in the future anyway.
      Signed-off-by: NDavidlohr Bueso <dbueso@suse.de>
      Suggested-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6e399cd1
    • K
      mm: rcu-protected get_mm_exe_file() · 90f31d0e
      Konstantin Khlebnikov 提交于
      This patch removes mm->mmap_sem from mm->exe_file read side.
      Also it kills dup_mm_exe_file() and moves exe_file duplication into
      dup_mmap() where both mmap_sems are locked.
      
      [akpm@linux-foundation.org: fix comment typo]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f31d0e
    • H
      kernel/sysctl.c: threads-max observe limits · 16db3d3f
      Heinrich Schuchardt 提交于
      Users can change the maximum number of threads by writing to
      /proc/sys/kernel/threads-max.
      
      With the patch the value entered is checked against the same limits that
      apply when fork_init is called.
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16db3d3f
    • H
      kernel/fork.c: avoid division by zero · ac1b398d
      Heinrich Schuchardt 提交于
      PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
      THREAD_SIZE.
      
      E.g.  architecture hexagon may have page size 1M and thread size 4096.
      This would lead to a division by zero in the calculation of max_threads.
      
      With 32-bit calculation there is no solution which delivers valid results
      for all possible combinations of the parameters.  The code is only called
      once.  Hence a 64-bit calculation can be used as solution.
      
      [akpm@linux-foundation.org: use clamp_t(), per Oleg]
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac1b398d
    • H
      kernel/fork.c: new function for max_threads · ff691f6e
      Heinrich Schuchardt 提交于
      PAGE_SIZE is not guaranteed to be equal to or less than 8 times the
      THREAD_SIZE.
      
      E.g.  architecture hexagon may have page size 1M and thread size 4096.
      This would lead to a division by zero in the calculation of max_threads.
      
      With this patch the buggy code is moved to a separate function
      set_max_threads.  The error is not fixed.
      
      After fixing the problem in a separate patch the new function can be
      reused to adjust max_threads after adding or removing memory.
      
      Argument mempages of function fork_init() is removed as totalram_pages is
      an exported symbol.
      
      The creation of separate patches for refactoring to a new function and for
      fixing the logic was suggested by Ingo Molnar.
      Signed-off-by: NHeinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff691f6e
    • J
      fork_init: update max_threads comment · 3ea7f5e2
      Jean Delvare 提交于
      The comment explaining what value max_threads is set to is outdated.  The
      maximum memory consumption ratio for thread structures was 1/2 until
      February 2002, then it was briefly changed to 1/16 before being set to 1/8
      which we still use today.  The comment was never updated to reflect that
      change, it's about time.
      Signed-off-by: NJean Delvare <jdelvare@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ea7f5e2
    • M
      fork: report pid reservation failure properly · 35f71bc0
      Michal Hocko 提交于
      copy_process will report any failure in alloc_pid as ENOMEM currently
      which is misleading because the pid allocation might fail not only when
      the memory is short but also when the pid space is consumed already.
      
      The current man page even mentions this case:
      
      : EAGAIN
      :
      :       A system-imposed limit on the number of threads was encountered.
      :       There are a number of limits that may trigger this error: the
      :       RLIMIT_NPROC soft resource limit (set via setrlimit(2)), which
      :       limits the number of processes and threads for a real user ID, was
      :       reached; the kernel's system-wide limit on the number of processes
      :       and threads, /proc/sys/kernel/threads-max, was reached (see
      :       proc(5)); or the maximum number of PIDs, /proc/sys/kernel/pid_max,
      :       was reached (see proc(5)).
      
      so the current behavior is also incorrect wrt.  documentation.  POSIX man
      page also suggest returing EAGAIN when the process count limit is reached.
      
      This patch simply propagates error code from alloc_pid and makes sure we
      return -EAGAIN due to reservation failure.  This will make behavior of
      fork closer to both our documentation and POSIX.
      
      alloc_pid might alsoo fail when the reaper in the pid namespace is dead
      (the namespace basically disallows all new processes) and there is no
      good error code which would match documented ones. We have traditionally
      returned ENOMEM for this case which is misleading as well but as per
      Eric W. Biederman this behavior is documented in man pid_namespaces(7)
      
      : If the "init" process of a PID namespace terminates, the kernel
      : terminates all of the processes in the namespace via a SIGKILL signal.
      : This behavior reflects the fact that the "init" process is essential for
      : the correct operation of a PID namespace.  In this case, a subsequent
      : fork(2) into this PID namespace will fail with the error ENOMEM; it is
      : not possible to create a new processes in a PID namespace whose "init"
      : process has terminated.
      
      and introducing a new error code would be too risky so let's stick to
      ENOMEM for this case.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      35f71bc0
    • V
      signal: remove warning about using SI_TKILL in rt_[tg]sigqueueinfo · 69828dce
      Vladimir Davydov 提交于
      Sending SI_TKILL from rt_[tg]sigqueueinfo was deprecated, so now we issue
      a warning on the first attempt of doing it.  We use WARN_ON_ONCE, which is
      not informative and, what is worse, taints the kernel, making the trinity
      syscall fuzzer complain false-positively from time to time.
      
      It does not look like we need this warning at all, because the behaviour
      changed quite a long time ago (2.6.39), and if an application relies on
      the old API, it gets EPERM anyway and can issue a warning by itself.
      
      So let us zap the warning in kernel.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69828dce
    • O
      ptrace: ptrace_detach() can no longer race with SIGKILL · 64a4096c
      Oleg Nesterov 提交于
      ptrace_detach() re-checks ->ptrace under tasklist lock and calls
      release_task() if __ptrace_detach() returns true.  This was needed because
      the __TASK_TRACED tracee could be killed/untraced, and it could even pass
      exit_notify() before we take tasklist_lock.
      
      But this is no longer possible after 9899d11f "ptrace: ensure
      arch_ptrace/ptrace_request can never race with SIGKILL".  We can turn
      these checks into WARN_ON() and remove release_task().
      
      While at it, document the setting of child->exit_code.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Pavel Labath <labath@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64a4096c
    • O
      ptrace: fix race between ptrace_resume() and wait_task_stopped() · b72c1869
      Oleg Nesterov 提交于
      ptrace_resume() is called when the tracee is still __TASK_TRACED.  We set
      tracee->exit_code and then wake_up_state() changes tracee->state.  If the
      tracer's sub-thread does wait() in between, task_stopped_code(ptrace => T)
      wrongly looks like another report from tracee.
      
      This confuses debugger, and since wait_task_stopped() clears ->exit_code
      the tracee can miss a signal.
      
      Test-case:
      
      	#include <stdio.h>
      	#include <unistd.h>
      	#include <sys/wait.h>
      	#include <sys/ptrace.h>
      	#include <pthread.h>
      	#include <assert.h>
      
      	int pid;
      
      	void *waiter(void *arg)
      	{
      		int stat;
      
      		for (;;) {
      			assert(pid == wait(&stat));
      			assert(WIFSTOPPED(stat));
      			if (WSTOPSIG(stat) == SIGHUP)
      				continue;
      
      			assert(WSTOPSIG(stat) == SIGCONT);
      			printf("ERR! extra/wrong report:%x\n", stat);
      		}
      	}
      
      	int main(void)
      	{
      		pthread_t thread;
      
      		pid = fork();
      		if (!pid) {
      			assert(ptrace(PTRACE_TRACEME, 0,0,0) == 0);
      			for (;;)
      				kill(getpid(), SIGHUP);
      		}
      
      		assert(pthread_create(&thread, NULL, waiter, NULL) == 0);
      
      		for (;;)
      			ptrace(PTRACE_CONT, pid, 0, SIGCONT);
      
      		return 0;
      	}
      
      Note for stable: the bug is very old, but without 9899d11f "ptrace:
      ensure arch_ptrace/ptrace_request can never race with SIGKILL" the fix
      should use lock_task_sighand(child).
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Reported-by: NPavel Labath <labath@google.com>
      Tested-by: NPavel Labath <labath@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b72c1869
    • L
      smp: Fix smp_call_function_single_async() locking · 8053871d
      Linus Torvalds 提交于
      The current smp_function_call code suffers a number of problems, most
      notably smp_call_function_single_async() is broken.
      
      The problem is that flush_smp_call_function_queue() does csd_unlock()
      _after_ calling csd->func(). This means that a caller cannot properly
      synchronize the csd usage as it has to.
      
      Change the code to release the csd before calling ->func() for the
      async case, and put a WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK) in
      smp_call_function_single_async() to warn us of improper serialization,
      because any waiting there can results in deadlocks when called with
      IRQs disabled.
      
      Rename the (currently) unused WAIT flag to SYNCHRONOUS and (re)use it
      such that we know what to do in flush_smp_call_function_queue().
      
      Rework csd_{,un}lock() to use smp_load_acquire() / smp_store_release()
      to avoid some full barriers while more clearly providing lock
      semantics.
      
      Finally move the csd maintenance out of generic_exec_single() into its
      callers for clearer code.
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [ Added changelog. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Rafael David Tinoco <inaddy@ubuntu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8053871d
    • P
      lockdep: Make print_lock() robust against concurrent release · d7bc3197
      Peter Zijlstra 提交于
      During sysrq's show-held-locks command it is possible that
      hlock_class() returns NULL for a given lock. The result is then (after
      the warning):
      
      	|BUG: unable to handle kernel NULL pointer dereference at 0000001c
      	|IP: [<c1088145>] get_usage_chars+0x5/0x100
      	|Call Trace:
      	| [<c1088263>] print_lock_name+0x23/0x60
      	| [<c1576b57>] print_lock+0x5d/0x7e
      	| [<c1088314>] lockdep_print_held_locks+0x74/0xe0
      	| [<c1088652>] debug_show_all_locks+0x132/0x1b0
      	| [<c1315c48>] sysrq_handle_showlocks+0x8/0x10
      
      This *might* happen because the thread on the other CPU drops the lock
      after we are looking ->lockdep_depth and ->held_locks points no longer
      to a lock that is held.
      
      The fix here is to simply ignore it and continue.
      Reported-by: NAndreas Messerschmid <andreas@linutronix.de>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      d7bc3197
    • A
      bpf: fix two bugs in verification logic when accessing 'ctx' pointer · 725f9dcd
      Alexei Starovoitov 提交于
      1.
      first bug is a silly mistake. It broke tracing examples and prevented
      simple bpf programs from loading.
      
      In the following code:
      if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
      } else if (...) {
        // this part should have been executed when
        // insn->code == BPF_W and insn->imm != 0
      }
      
      Obviously it's not doing that. So simple instructions like:
      r2 = *(u64 *)(r1 + 8)
      will be rejected. Note the comments in the code around these branches
      were and still valid and indicate the true intent.
      
      Replace it with:
      if (BPF_SIZE(insn->code) != BPF_W)
        continue;
      
      if (insn->imm == 0) {
      } else if (...) {
        // now this code will be executed when
        // insn->code == BPF_W and insn->imm != 0
      }
      
      2.
      second bug is more subtle.
      If malicious code is using the same dest register as source register,
      the checks designed to prevent the same instruction to be used with different
      pointer types will fail to trigger, since we were assigning src_reg_type
      when it was already overwritten by check_mem_access().
      The fix is trivial. Just move line:
      src_reg_type = regs[insn->src_reg].type;
      before check_mem_access().
      Add new 'access skb fields bad4' test to check this case.
      
      Fixes: 9bac3d6d ("bpf: allow extended BPF programs access skb fields")
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      725f9dcd
    • A
      bpf: fix verifier memory corruption · c3de6317
      Alexei Starovoitov 提交于
      Due to missing bounds check the DAG pass of the BPF verifier can corrupt
      the memory which can cause random crashes during program loading:
      
      [8.449451] BUG: unable to handle kernel paging request at ffffffffffffffff
      [8.451293] IP: [<ffffffff811de33d>] kmem_cache_alloc_trace+0x8d/0x2f0
      [8.452329] Oops: 0000 [#1] SMP
      [8.452329] Call Trace:
      [8.452329]  [<ffffffff8116cc82>] bpf_check+0x852/0x2000
      [8.452329]  [<ffffffff8116b7e4>] bpf_prog_load+0x1e4/0x310
      [8.452329]  [<ffffffff811b190f>] ? might_fault+0x5f/0xb0
      [8.452329]  [<ffffffff8116c206>] SyS_bpf+0x806/0xa30
      
      Fixes: f1bca824 ("bpf: add search pruning optimization to verifier")
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c3de6317
  6. 16 4月, 2015 10 次提交
    • J
      tracing: Fix incorrect enabling of trace events by boot cmdline · 84fce9db
      Joonsoo Kim 提交于
      There is a problem that trace events are not properly enabled with
      boot cmdline. The problem is that if we pass "trace_event=kmem:mm_page_alloc"
      to the boot cmdline, it enables all kmem trace events, and not just
      the page_alloc event.
      
      This is caused by the parsing mechanism. When we parse the cmdline, the buffer
      contents is modified due to tokenization. And, if we use this buffer
      again, we will get the wrong result.
      
      Unfortunately, this buffer is be accessed three times to set trace events
      properly at boot time. So, we need to handle this situation.
      
      There is already code handling ",", but we need another for ":".
      This patch adds it.
      
      Link: http://lkml.kernel.org/r/1429159484-22977-1-git-send-email-iamjoonsoo.kim@lge.com
      
      Cc: stable@vger.kernel.org # 3.19+
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      [ added missing return ret; ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      84fce9db
    • R
      tracing: Handle ftrace_dump() atomic context in graph_trace_open() · ef99b88b
      Rabin Vincent 提交于
      graph_trace_open() can be called in atomic context from ftrace_dump().
      Use GFP_ATOMIC for the memory allocations when that's the case, in order
      to avoid the following splat.
      
       BUG: sleeping function called from invalid context at mm/slab.c:2849
       in_atomic(): 1, irqs_disabled(): 128, pid: 0, name: swapper/0
       Backtrace:
       ..
       [<8004dc94>] (__might_sleep) from [<801371f4>] (kmem_cache_alloc_trace+0x160/0x238)
        r7:87800040 r6:000080d0 r5:810d16e8 r4:000080d0
       [<80137094>] (kmem_cache_alloc_trace) from [<800cbd60>] (graph_trace_open+0x30/0xd0)
        r10:00000100 r9:809171a8 r8:00008e28 r7:810d16f0 r6:00000001 r5:810d16e8
        r4:810d16f0
       [<800cbd30>] (graph_trace_open) from [<800c79c4>] (trace_init_global_iter+0x50/0x9c)
        r8:00008e28 r7:808c853c r6:00000001 r5:810d16e8 r4:810d16f0 r3:800cbd30
       [<800c7974>] (trace_init_global_iter) from [<800c7aa0>] (ftrace_dump+0x90/0x2ec)
        r4:810d2580 r3:00000000
       [<800c7a10>] (ftrace_dump) from [<80414b2c>] (sysrq_ftrace_dump+0x1c/0x20)
        r10:00000100 r9:809171a8 r8:808f6e7c r7:00000001 r6:00000007 r5:0000007a
        r4:808d5394
       [<80414b10>] (sysrq_ftrace_dump) from [<800169b8>] (return_to_handler+0x0/0x18)
       [<80415498>] (__handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18)
        r8:808c8100 r7:808c8444 r6:00000101 r5:00000010 r4:84eb3210
       [<80415668>] (handle_sysrq) from [<800169b8>] (return_to_handler+0x0/0x18)
       [<8042a760>] (pl011_int) from [<800169b8>] (return_to_handler+0x0/0x18)
        r10:809171bc r9:809171a8 r8:00000001 r7:00000026 r6:808c6000 r5:84f01e60
        r4:8454fe00
       [<8007782c>] (handle_irq_event_percpu) from [<80077b44>] (handle_irq_event+0x4c/0x6c)
        r10:808c7ef0 r9:87283e00 r8:00000001 r7:00000000 r6:8454fe00 r5:84f01e60
        r4:84f01e00
       [<80077af8>] (handle_irq_event) from [<8007aa28>] (handle_fasteoi_irq+0xf0/0x1ac)
        r6:808f52a4 r5:84f01e60 r4:84f01e00 r3:00000000
       [<8007a938>] (handle_fasteoi_irq) from [<80076dc0>] (generic_handle_irq+0x3c/0x4c)
        r6:00000026 r5:00000000 r4:00000026 r3:8007a938
       [<80076d84>] (generic_handle_irq) from [<80077128>] (__handle_domain_irq+0x8c/0xfc)
        r4:808c1e38 r3:0000002e
       [<8007709c>] (__handle_domain_irq) from [<800087b8>] (gic_handle_irq+0x34/0x6c)
        r10:80917748 r9:00000001 r8:88802100 r7:808c7ef0 r6:808c8fb0 r5:00000015
        r4:8880210c r3:808c7ef0
       [<80008784>] (gic_handle_irq) from [<80014044>] (__irq_svc+0x44/0x7c)
      
      Link: http://lkml.kernel.org/r/1428953721-31349-1-git-send-email-rabin@rab.in
      Link: http://lkml.kernel.org/r/1428957012-2319-1-git-send-email-rabin@rab.in
      
      Cc: stable@vger.kernel.org # 3.13+
      Signed-off-by: NRabin Vincent <rabin@rab.in>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      ef99b88b
    • J
      tracing: remove use of seq_printf return value · 962e3707
      Joe Perches 提交于
      The seq_printf return value, because it's frequently misused,
      will eventually be converted to void.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      
      Miscellanea:
      
      o Remove unused return value from trace_lookup_stack
      Signed-off-by: NJoe Perches <joe@perches.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      962e3707
    • J
      cgroup: remove use of seq_printf return value · 94ff212d
      Joe Perches 提交于
      The seq_printf return value, because it's frequently misused,
      will eventually be converted to void.
      
      See: commit 1f33c41c ("seq_file: Rename seq_overflow() to
           seq_has_overflowed() and make public")
      Signed-off-by: NJoe Perches <joe@perches.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94ff212d
    • J
      kernel/reboot.c: add orderly_reboot for graceful reboot · 7a54f46b
      Joel Stanley 提交于
      The kernel has orderly_poweroff which allows the kernel to initiate a
      graceful shutdown of userspace, by running /sbin/poweroff.  This adds
      orderly_reboot that will cause userspace to shut itself down by calling
      /sbin/reboot.
      
      This will be used for shutdown initiated by a system controller on
      platforms that do not use ACPI.
      
      orderly_reboot() should be used when the system wants to allow userspace
      to gracefully shut itself down.  For cases where the system may imminently
      catch on fire, the existing emergency_restart() provides an immediate
      reboot without involving userspace.
      Signed-off-by: NJoel Stanley <joel@jms.id.au>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Jeremy Kerr <jk@ozlabs.org>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7a54f46b
    • A
      kernel/hung_task.c: change hung_task.c to use for_each_process_thread() · 972fae69
      Aaron Tomlin 提交于
      In check_hung_uninterruptible_tasks() avoid the use of deprecated
      while_each_thread().
      
      The "max_count" logic will prevent a livelock - see commit 0c740d0a
      ("introduce for_each_thread() to replace the buggy while_each_thread()").
      Having said this let's use for_each_process_thread().
      Signed-off-by: NAaron Tomlin <atomlin@redhat.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dave Wysochanski <dwysocha@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      972fae69
    • J
      kernel/resource.c: remove deprecated __check_region() and friends · 96831c0a
      Jakub Sitnicki 提交于
      All users of __check_region(), check_region(), and check_mem_region() are
      gone.  We got rid of the last user in v4.0-rc1.  Remove them.
      
      bloat-o-meter on x86_64 shows:
      
      add/remove: 0/3 grow/shrink: 0/0 up/down: 0/-102 (-102)
      function                                     old     new   delta
      __kstrtab___check_region                      15       -     -15
      __ksymtab___check_region                      16       -     -16
      __check_region                                71       -     -71
      Signed-off-by: NJakub Sitnicki <jsitnicki@gmail.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96831c0a
    • I
      kernel: conditionally support non-root users, groups and capabilities · 2813893f
      Iulia Manda 提交于
      There are a lot of embedded systems that run most or all of their
      functionality in init, running as root:root.  For these systems,
      supporting multiple users is not necessary.
      
      This patch adds a new symbol, CONFIG_MULTIUSER, that makes support for
      non-root users, non-root groups, and capabilities optional.  It is enabled
      under CONFIG_EXPERT menu.
      
      When this symbol is not defined, UID and GID are zero in any possible case
      and processes always have all capabilities.
      
      The following syscalls are compiled out: setuid, setregid, setgid,
      setreuid, setresuid, getresuid, setresgid, getresgid, setgroups,
      getgroups, setfsuid, setfsgid, capget, capset.
      
      Also, groups.c is compiled out completely.
      
      In kernel/capability.c, capable function was moved in order to avoid
      adding two ifdef blocks.
      
      This change saves about 25 KB on a defconfig build.  The most minimal
      kernels have total text sizes in the high hundreds of kB rather than
      low MB.  (The 25k goes down a bit with allnoconfig, but not that much.
      
      The kernel was booted in Qemu.  All the common functionalities work.
      Adding users/groups is not possible, failing with -ENOSYS.
      
      Bloat-o-meter output:
      add/remove: 7/87 grow/shrink: 19/397 up/down: 1675/-26325 (-24650)
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NIulia Manda <iulia.manda21@gmail.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      Acked-by: NGeert Uytterhoeven <geert@linux-m68k.org>
      Tested-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2813893f
    • E
      mm: allow compaction of unevictable pages · 5bbe3547
      Eric B Munson 提交于
      Currently, pages which are marked as unevictable are protected from
      compaction, but not from other types of migration.  The POSIX real time
      extension explicitly states that mlock() will prevent a major page
      fault, but the spirit of this is that mlock() should give a process the
      ability to control sources of latency, including minor page faults.
      However, the mlock manpage only explicitly says that a locked page will
      not be written to swap and this can cause some confusion.  The
      compaction code today does not give a developer who wants to avoid swap
      but wants to have large contiguous areas available any method to achieve
      this state.  This patch introduces a sysctl for controlling compaction
      behavior with respect to the unevictable lru.  Users who demand no page
      faults after a page is present can set compact_unevictable_allowed to 0
      and users who need the large contiguous areas can enable compaction on
      locked memory by leaving the default value of 1.
      
      To illustrate this problem I wrote a quick test program that mmaps a
      large number of 1MB files filled with random data.  These maps are
      created locked and read only.  Then every other mmap is unmapped and I
      attempt to allocate huge pages to the static huge page pool.  When the
      compact_unevictable_allowed sysctl is 0, I cannot allocate hugepages
      after fragmenting memory.  When the value is set to 1, allocations
      succeed.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5bbe3547
    • V
      memcg: zap mem_cgroup_lookup() · adbe427b
      Vladimir Davydov 提交于
      mem_cgroup_lookup() is a wrapper around mem_cgroup_from_id(), which
      checks that id != 0 before issuing the function call.  Today, there is
      no point in this additional check apart from optimization, because there
      is no css with id <= 0, so that css_from_id, called by
      mem_cgroup_from_id, will return NULL for any id <= 0.
      
      Since mem_cgroup_from_id is only called from mem_cgroup_lookup, let us
      zap mem_cgroup_lookup, substituting calls to it with mem_cgroup_from_id
      and moving the check if id > 0 to css_from_id.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      adbe427b
  7. 15 4月, 2015 6 次提交