1. 08 Jan 2015 (1 commit)
  2. 04 Dec 2014 (1 commit)
  3. 24 Nov 2014 (2 commits)
    • uprobes, x86: Fix _TIF_UPROBE vs _TIF_NOTIFY_RESUME · 82975bc6
      Committed by Andy Lutomirski
      x86 calls do_notify_resume on paranoid returns if TIF_UPROBE is set but
      not on non-paranoid returns.  I suspect that this is a mistake and that
      the code only works because int3 is paranoid.
      
      Setting _TIF_NOTIFY_RESUME in the uprobe code was probably a workaround
      for the x86 bug.  With that bug fixed, we can remove _TIF_NOTIFY_RESUME
      from the uprobes code.
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82975bc6
    • sched: Provide update_curr callbacks for stop/idle scheduling classes · 90e362f4
      Committed by Thomas Gleixner
      Chris bisected a NULL pointer dereference in task_sched_runtime() to
      commit 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime()
      inconsistency'.
      
      Chris observed crashes in atop or other /proc walking programs when he
      started fork bombs on his machine.  He assumed that this is a new exit
      race, but that does not make any sense when looking at that commit.
      
      What's interesting is that the commit provides update_curr callbacks
      for all scheduling classes except stop_task and idle_task.
      
      While nothing can ever hit that via the clock_nanosleep() and
      clock_gettime() interfaces, which have been the target of the commit in
      question, the author obviously forgot that there are other code paths
      which invoke task_sched_runtime():
      
      do_task_stat()
       thread_group_cputime_adjusted()
         thread_group_cputime()
           task_cputime()
             task_sched_runtime()
              if (task_current(rq, p) && task_on_rq_queued(p)) {
                update_rq_clock(rq);
          p->sched_class->update_curr(rq);
              }
      
      If the stats are read for a stomp machine task, aka 'migration/N' and
      that task is current on its cpu, this will happily call the NULL pointer
      of stop_task->update_curr.  Ooops.
      
      Chris's observation that this happens faster when he runs the fork bomb
      makes sense, as the fork bomb kicks the migration threads more often, so
      the probability of hitting the issue increases.
      
      Add the missing update_curr callbacks to the scheduler classes stop_task
      and idle_task.  While idle tasks cannot be monitored via /proc we have
      other means to hit the idle case.
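
      A minimal sketch of the shape of the fix (not the verbatim patch; the
      other sched_class members are omitted, and idle_task gets the analogous
      no-op update_curr_idle): give the class a callback so the call in
      task_sched_runtime() always has a valid target.

      	static void update_curr_stop(struct rq *rq)
      	{
      		/* stop tasks have no meaningful runtime stats to update */
      	}

      	const struct sched_class stop_sched_class = {
      		/* other callbacks unchanged */
      		.update_curr	= update_curr_stop,
      	};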
      
      Fixes: 6e998916 'sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency'
      Reported-by: Chris Mason <clm@fb.com>
      Reported-and-tested-by: Borislav Petkov <bp@alien8.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      90e362f4
  4. 16 Nov 2014 (4 commits)
    • sched/cputime: Fix clock_nanosleep()/clock_gettime() inconsistency · 6e998916
      Committed by Stanislaw Gruszka
      Commit d670ec13 "posix-cpu-timers: Cure SMP wobbles" fixes one glibc
      test case in cost of breaking another one. After that commit, calling
      clock_nanosleep(TIMER_ABSTIME, X) and then clock_gettime(&Y) can result
      in Y time being smaller than X time.
      
      A reproducer/tester can be found further below; it can be compiled and run with:
      
      	gcc -o tst-cpuclock2 tst-cpuclock2.c -pthread
      	while ./tst-cpuclock2 ; do : ; done
      
      This reproducer, when running on a buggy kernel, will complain
      about "clock_gettime difference too small".
      
      The issue happens because, on start, thread_group_cputimer() initializes
      the cputimer's sum_exec_runtime with thread runtime that has not yet been
      accounted, and then that thread runtime is added to the running cputimer
      again on the scheduler tick, making its sum_exec_runtime bigger than the
      actual thread runtime.
      
      KOSAKI Motohiro posted a fix for this problem, but that patch was never
      applied: https://lkml.org/lkml/2013/5/26/191 .
      
      This patch takes a different approach to cure the problem. It calls
      update_curr() when the cputimer starts, which ensures we have updated
      stats for the running threads, and on the next scheduler tick we account
      only the runtime that elapsed since the cputimer started. That also
      ensures a consistent state between the CPU times of individual threads
      and the CPU time of the process composed of those threads.
      
      Full reproducer (tst-cpuclock2.c):
      
      	#define _GNU_SOURCE
      	#include <unistd.h>
      	#include <sys/syscall.h>
      	#include <stdio.h>
      	#include <time.h>
      	#include <pthread.h>
      	#include <stdint.h>
      	#include <inttypes.h>
      
      	/* Parameters for the Linux kernel ABI for CPU clocks.  */
      	#define CPUCLOCK_SCHED          2
      	#define MAKE_PROCESS_CPUCLOCK(pid, clock) \
      		((~(clockid_t) (pid) << 3) | (clockid_t) (clock))
      
      	static pthread_barrier_t barrier;
      
      	/* Help advance the clock.  */
      	static void *chew_cpu(void *arg)
      	{
      		pthread_barrier_wait(&barrier);
      		while (1) ;
      
      		return NULL;
      	}
      
      	/* Don't use the glibc wrapper.  */
      	static int do_nanosleep(int flags, const struct timespec *req)
      	{
      		clockid_t clock_id = MAKE_PROCESS_CPUCLOCK(0, CPUCLOCK_SCHED);
      
      		return syscall(SYS_clock_nanosleep, clock_id, flags, req, NULL);
      	}
      
      	static int64_t tsdiff(const struct timespec *before, const struct timespec *after)
      	{
      		int64_t before_i = before->tv_sec * 1000000000ULL + before->tv_nsec;
      		int64_t after_i = after->tv_sec * 1000000000ULL + after->tv_nsec;
      
      		return after_i - before_i;
      	}
      
      	int main(void)
      	{
      		int result = 0;
      		pthread_t th;
      
      		pthread_barrier_init(&barrier, NULL, 2);
      
      		if (pthread_create(&th, NULL, chew_cpu, NULL) != 0) {
      			perror("pthread_create");
      			return 1;
      		}
      
      		pthread_barrier_wait(&barrier);
      
      		/* The test.  */
      		struct timespec before, after, sleeptimeabs;
      		int64_t sleepdiff, diffabs;
      		const struct timespec sleeptime = {.tv_sec = 0,.tv_nsec = 100000000 };
      
      		/* The relative nanosleep.  Not sure why this is needed, but its presence
      		   seems to make it easier to reproduce the problem.  */
      		if (do_nanosleep(0, &sleeptime) != 0) {
      			perror("clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the current time.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &before) < 0) {
      			perror("clock_gettime[2]");
      			return 1;
      		}
      
      		/* Compute the absolute sleep time based on the current time.  */
      		uint64_t nsec = before.tv_nsec + sleeptime.tv_nsec;
      		sleeptimeabs.tv_sec = before.tv_sec + nsec / 1000000000;
      		sleeptimeabs.tv_nsec = nsec % 1000000000;
      
      		/* Sleep for the computed time.  */
      		if (do_nanosleep(TIMER_ABSTIME, &sleeptimeabs) != 0) {
      			perror("absolute clock_nanosleep");
      			return 1;
      		}
      
      		/* Get the time after the sleep.  */
      		if (clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &after) < 0) {
      			perror("clock_gettime[3]");
      			return 1;
      		}
      
      		/* The time after sleep should always be equal to or after the absolute sleep
      		   time passed to clock_nanosleep.  */
      		sleepdiff = tsdiff(&sleeptimeabs, &after);
      		if (sleepdiff < 0) {
      			printf("absolute clock_nanosleep woke too early: %" PRId64 "\n", sleepdiff);
      			result = 1;
      
      			printf("Before %llu.%09llu\n", before.tv_sec, before.tv_nsec);
      			printf("After  %llu.%09llu\n", after.tv_sec, after.tv_nsec);
      			printf("Sleep  %llu.%09llu\n", sleeptimeabs.tv_sec, sleeptimeabs.tv_nsec);
      		}
      
      		/* The difference between the timestamps taken before and after the
      		   clock_nanosleep call should be equal to or more than the duration of the
      		   sleep.  */
      		diffabs = tsdiff(&before, &after);
      		if (diffabs < sleeptime.tv_nsec) {
      			printf("clock_gettime difference too small: %" PRId64 "\n", diffabs);
      			result = 1;
      		}
      
      		pthread_cancel(th);
      
      		return result;
      	}
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141112155843.GA24803@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6e998916
    • sched/cputime: Fix cpu_timer_sample_group() double accounting · 23cfa361
      Committed by Peter Zijlstra
      While looking over the cpu-timer code I found that we appear to add
      the delta for the calling task twice, through:
      
        cpu_timer_sample_group()
          thread_group_cputimer()
            thread_group_cputime()
              times->sum_exec_runtime += task_sched_runtime();
      
          *sample = cputime.sum_exec_runtime + task_delta_exec();
      
      Which would make the sample run ahead, making the sleep short.
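
      A hedged sketch of the resulting sampling logic (helper names as used in
      the call chain above, not the verbatim hunk): the calling task's delta is
      already folded in by thread_group_cputime(), so the extra term is dropped.

      	/* before: the calling task's delta was added twice */
      	*sample = cputime.sum_exec_runtime + task_delta_exec(p);
      	/* after: thread_group_cputime() already includes that delta */
      	*sample = cputime.sum_exec_runtime;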
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/r/20141112113737.GI10476@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23cfa361
    • sched/numa: Avoid selecting oneself as swap target · 7af68335
      Committed by Peter Zijlstra
      Because the whole numa task selection stuff runs with preemption
      enabled (it's long and expensive) we can end up migrating and selecting
      oneself as a swap target. This doesn't really work out well -- we end
      up trying to acquire the same lock twice for the swap migrate -- so
      avoid this.
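
      A minimal sketch of the kind of guard described (variable names follow
      the task_numa_compare() code and are assumptions here): bail out when the
      candidate on the destination runqueue turns out to be ourselves.

      	cur = ACCESS_ONCE(dst_rq->curr);
      	/* with preemption on we may have been migrated here in the
      	 * meantime; swapping with ourselves would take the same lock twice */
      	if (cur == env->p)
      		goto unlock;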
      Reported-and-Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20141110100328.GF29390@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7af68335
    • perf: Fix corruption of sibling list with hotplug · 226424ee
      Committed by Mark Rutland
      When a CPU is hotplugged out, we call perf_remove_from_context() (via
      perf_event_exit_cpu()) to rip each CPU-bound event out of its PMU's cpu
      context, but leave siblings grouped together. Freeing of these events is
      left to the mercy of the usual refcounting.
      
      When a CPU-bound event's refcount drops to zero we cross-call to
      __perf_remove_from_context() to clean it up, detaching grouped siblings.
      
      This works when the relevant CPU is online, but will fail if the CPU is
      currently offline, and we won't detach the event from its siblings
      before freeing the event, leaving the sibling list corrupt. If the
      sibling list is later walked (e.g. because the CPU came online again
      before a remaining sibling's refcount drops to zero), we will walk the
      now corrupted siblings list, potentially dereferencing garbage values.
      
      Given that the events should never be scheduled again (as we removed
      them from their context), we can simply detach siblings when the CPU
      goes down in the first place. If the CPU comes back online, the
      redundant call to __perf_remove_from_context() is safe.
      Reported-by: Drew Richardson <drew.richardson@arm.com>
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: vincent.weaver@maine.edu
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1415203904-25308-2-git-send-email-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      226424ee
  5. 14 Nov 2014 (1 commit)
  6. 12 Nov 2014 (1 commit)
    • audit: keep inode pinned · 799b6014
      Committed by Miklos Szeredi
      Audit rules disappear when an inode they watch is evicted from the cache.
      This is likely not what we want.
      
      The guilty commit is "fsnotify: allow marks to not pin inodes in core",
      which didn't take into account that audit_tree adds watches with a zero
      mask.
      
      Adding any mask should fix this.
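
      A hedged sketch of the fix as described (field and flag names are
      assumptions): give the audit_tree mark a mask bit that is never reported
      to userspace, purely so fsnotify keeps the watched inode pinned in core.

      	/* a non-zero mask makes fsnotify pin the inode; FS_IN_IGNORED is
      	 * never delivered to userspace, so behaviour is otherwise unchanged */
      	chunk->mark.mask = FS_IN_IGNORED;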
      
      Fixes: 90b1e7a5 ("fsnotify: allow marks to not pin inodes in core")
      Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
      Cc: stable@vger.kernel.org # 2.6.36+
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      799b6014
  7. 11 Nov 2014 (2 commits)
    • tracing: Do not risk busy looping in buffer splice · 07906da7
      Committed by Rabin Vincent
      If the read loop in trace_buffers_splice_read() keeps failing due to
      memory allocation failures without reading even a single page then this
      function will keep busy looping.
      
      Remove the risk for that by exiting the function if memory allocation
      failures are seen.
      
      Link: http://lkml.kernel.org/r/1415309167-2373-2-git-send-email-rabin@rab.in
      Signed-off-by: Rabin Vincent <rabin@rab.in>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      07906da7
    • tracing: Do not busy wait in buffer splice · e30f53aa
      Committed by Rabin Vincent
      On a !PREEMPT kernel, attempting to use trace-cmd results in a soft
      lockup:
      
       # trace-cmd record -e raw_syscalls:* -F false
       NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [trace-cmd:61]
       ...
       Call Trace:
        [<ffffffff8105b580>] ? __wake_up_common+0x90/0x90
        [<ffffffff81092e25>] wait_on_pipe+0x35/0x40
        [<ffffffff810936e3>] tracing_buffers_splice_read+0x2e3/0x3c0
        [<ffffffff81093300>] ? tracing_stats_read+0x2a0/0x2a0
        [<ffffffff812d10ab>] ? _raw_spin_unlock+0x2b/0x40
        [<ffffffff810dc87b>] ? do_read_fault+0x21b/0x290
        [<ffffffff810de56a>] ? handle_mm_fault+0x2ba/0xbd0
        [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
        [<ffffffff810951e2>] ? trace_buffer_lock_reserve+0x22/0x60
        [<ffffffff81095c80>] ? trace_event_buffer_lock_reserve+0x40/0x80
        [<ffffffff8112415d>] do_splice_to+0x6d/0x90
        [<ffffffff81126971>] SyS_splice+0x7c1/0x800
        [<ffffffff812d1edd>] tracesys_phase2+0xd3/0xd8
      
      The problem is this: tracing_buffers_splice_read() calls
      ring_buffer_wait() to wait for data in the ring buffers.  The buffers
      are not empty so ring_buffer_wait() returns immediately.  But
      tracing_buffers_splice_read() calls ring_buffer_read_page() with full=1,
      meaning it only wants to read a full page.  When the full page is not
      available, tracing_buffers_splice_read() tries to wait again with
      ring_buffer_wait(), which again returns immediately, and so on.
      
      Fix this by adding a "full" argument to ring_buffer_wait() which will
      make ring_buffer_wait() wait until the writer has left the reader's
      page, i.e.  until full-page reads will succeed.
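
      A sketch of the interface change as described (the exact prototype is an
      assumption): ring_buffer_wait() grows a "full" flag, and only the splice
      path asks for full-page wakeups.

      	/* assumed new prototype: full == true means "wait until a full-page
      	 * read will succeed"; full == false keeps the old "any data"
      	 * behaviour used by the poll/read paths */
      	int ring_buffer_wait(struct ring_buffer *buffer, int cpu, bool full);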
      
      Link: http://lkml.kernel.org/r/1415645194-25379-1-git-send-email-rabin@rab.in
      
      Cc: stable@vger.kernel.org # 3.16+
      Fixes: b1169cc6 ("tracing: Remove mock up poll wait function")
      Signed-off-by: Rabin Vincent <rabin@rab.in>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      e30f53aa
  8. 10 Nov 2014 (1 commit)
    • sched/numa: Fix out of bounds read in sched_init_numa() · c123588b
      Committed by Andrey Ryabinin
      On latest mm + KASan patchset I've got this:
      
          ==================================================================
          BUG: AddressSanitizer: out of bounds access in sched_init_smp+0x3ba/0x62c at addr ffff88006d4bee6c
          =============================================================================
          BUG kmalloc-8 (Not tainted): kasan error
          -----------------------------------------------------------------------------
      
          Disabling lock debugging due to kernel taint
          INFO: Allocated in alloc_vfsmnt+0xb0/0x2c0 age=75 cpu=0 pid=0
           __slab_alloc+0x4b4/0x4f0
           __kmalloc_track_caller+0x15f/0x1e0
           kstrdup+0x44/0x90
           alloc_vfsmnt+0xb0/0x2c0
           vfs_kern_mount+0x35/0x190
           kern_mount_data+0x25/0x50
           pid_ns_prepare_proc+0x19/0x50
           alloc_pid+0x5e2/0x630
           copy_process.part.41+0xdf5/0x2aa0
           do_fork+0xf5/0x460
           kernel_thread+0x21/0x30
           rest_init+0x1e/0x90
           start_kernel+0x522/0x531
           x86_64_start_reservations+0x2a/0x2c
           x86_64_start_kernel+0x15b/0x16a
          INFO: Slab 0xffffea0001b52f80 objects=24 used=22 fp=0xffff88006d4befc0 flags=0x100000000004080
          INFO: Object 0xffff88006d4bed20 @offset=3360 fp=0xffff88006d4bee70
      
          Bytes b4 ffff88006d4bed10: 00 00 00 00 00 00 00 00 5a 5a 5a 5a 5a 5a 5a 5a  ........ZZZZZZZZ
          Object ffff88006d4bed20: 70 72 6f 63 00 6b 6b a5                          proc.kk.
          Redzone ffff88006d4bed28: cc cc cc cc cc cc cc cc                          ........
          Padding ffff88006d4bee68: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
          CPU: 0 PID: 1 Comm: swapper/0 Tainted: G    B          3.18.0-rc3-mm1+ #108
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
           ffff88006d4be000 0000000000000000 ffff88006d4bed20 ffff88006c86fd18
           ffffffff81cd0a59 0000000000000058 ffff88006d404240 ffff88006c86fd48
           ffffffff811fa3a8 ffff88006d404240 ffffea0001b52f80 ffff88006d4bed20
          Call Trace:
          dump_stack (lib/dump_stack.c:52)
          print_trailer (mm/slub.c:645)
          object_err (mm/slub.c:652)
          ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
          kasan_report_error (mm/kasan/report.c:102 mm/kasan/report.c:178)
          ? kasan_poison_shadow (mm/kasan/kasan.c:48)
          ? kasan_unpoison_shadow (mm/kasan/kasan.c:54)
          ? kasan_poison_shadow (mm/kasan/kasan.c:48)
          ? kasan_kmalloc (mm/kasan/kasan.c:311)
          __asan_load4 (mm/kasan/kasan.c:371)
          ? sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
          sched_init_smp (kernel/sched/core.c:6552 kernel/sched/core.c:7063)
          kernel_init_freeable (init/main.c:869 init/main.c:997)
          ? finish_task_switch (kernel/sched/sched.h:1036 kernel/sched/core.c:2248)
          ? rest_init (init/main.c:924)
          kernel_init (init/main.c:929)
          ? rest_init (init/main.c:924)
          ret_from_fork (arch/x86/kernel/entry_64.S:348)
          ? rest_init (init/main.c:924)
          Read of size 4 by task swapper/0:
          Memory state around the buggy address:
           ffff88006d4beb80: fc fc fc fc fc fc fc fc fc fc 00 fc fc fc fc fc
           ffff88006d4bec00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bec80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bed00: fc fc fc fc 00 fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bed80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
          >ffff88006d4bee00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc 04 fc
                                                                    ^
           ffff88006d4bee80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bef00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
           ffff88006d4bef80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
           ffff88006d4bf000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
           ffff88006d4bf080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
          ==================================================================
      
      A zero 'level' (e.g. on a non-NUMA system) causes an out-of-bounds
      access in this line:
      
           sched_max_numa_distance = sched_domains_numa_distance[level - 1];
      
      Fix this by exiting from sched_init_numa() earlier.
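
      A minimal sketch of the early exit described above (its exact placement
      within sched_init_numa() is an assumption):

      	/* no NUMA distance beyond the local node was found (level == 0),
      	 * so sched_domains_numa_distance[level - 1] must not be read */
      	if (!level)
      		return;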
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Fixes: 9942f79b ("sched/numa: Export info needed for NUMA balancing on complex topologies")
      Cc: peterz@infradead.org
      Link: http://lkml.kernel.org/r/1415372020-1871-1-git-send-email-a.ryabinin@samsung.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c123588b
  9. 09 Nov 2014 (1 commit)
  10. 04 Nov 2014 (1 commit)
  11. 31 Oct 2014 (2 commits)
    • tracing/syscalls: Ignore numbers outside NR_syscalls' range · 086ba77a
      Committed by Rabin Vincent
      ARM has some private syscalls (for example, set_tls(2)) which lie
      outside the range of NR_syscalls.  If any of these are called while
      syscall tracing is being performed, out-of-bounds array access will
      occur in the ftrace and perf sys_{enter,exit} handlers.
      
       # trace-cmd record -e raw_syscalls:* true && trace-cmd report
       ...
       true-653   [000]   384.675777: sys_enter:            NR 192 (0, 1000, 3, 4000022, ffffffff, 0)
       true-653   [000]   384.675812: sys_exit:             NR 192 = 1995915264
       true-653   [000]   384.675971: sys_enter:            NR 983045 (76f74480, 76f74000, 76f74b28, 76f74480, 76f76f74, 1)
       true-653   [000]   384.675988: sys_exit:             NR 983045 = 0
       ...
      
       # trace-cmd record -e syscalls:* true
       [   17.289329] Unable to handle kernel paging request at virtual address aaaaaace
       [   17.289590] pgd = 9e71c000
       [   17.289696] [aaaaaace] *pgd=00000000
       [   17.289985] Internal error: Oops: 5 [#1] PREEMPT SMP ARM
       [   17.290169] Modules linked in:
       [   17.290391] CPU: 0 PID: 704 Comm: true Not tainted 3.18.0-rc2+ #21
       [   17.290585] task: 9f4dab00 ti: 9e710000 task.ti: 9e710000
       [   17.290747] PC is at ftrace_syscall_enter+0x48/0x1f8
       [   17.290866] LR is at syscall_trace_enter+0x124/0x184
      
      Fix this by ignoring out-of-NR_syscalls-bounds syscall numbers.
      
      Commit cd0980fc "tracing: Check invalid syscall nr while tracing syscalls"
      added the check for less than zero, but it should have also checked
      for greater than NR_syscalls.
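
      A minimal sketch of the resulting check in the ftrace/perf
      sys_enter/sys_exit handlers (helper placement is an assumption;
      syscall_get_nr() is the usual asm/syscall.h accessor):

      	int syscall_nr = syscall_get_nr(current, regs);

      	/* ignore private/out-of-table numbers such as ARM's 983045 above */
      	if (syscall_nr < 0 || syscall_nr >= NR_syscalls)
      		return;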
      
      Link: http://lkml.kernel.org/p/1414620418-29472-1-git-send-email-rabin@rab.in
      
      Fixes: cd0980fc "tracing: Check invalid syscall nr while tracing syscalls"
      Cc: stable@vger.kernel.org # 2.6.33+
      Signed-off-by: Rabin Vincent <rabin@rab.in>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      086ba77a
    • audit: AUDIT_FEATURE_CHANGE message format missing delimiting space · 897f1acb
      Committed by Richard Guy Briggs
      Add a space between subj= and feature= fields to make them parsable.
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      897f1acb
  12. 30 Oct 2014 (2 commits)
    • kernel/kmod: fix use-after-free of the sub_info structure · 0baf2a4d
      Committed by Martin Schwidefsky
      Found this in the message log on a s390 system:
      
          BUG kmalloc-192 (Not tainted): Poison overwritten
          Disabling lock debugging due to kernel taint
          INFO: 0x00000000684761f4-0x00000000684761f7. First byte 0xff instead of 0x6b
          INFO: Allocated in call_usermodehelper_setup+0x70/0x128 age=71 cpu=2 pid=648
           __slab_alloc.isra.47.constprop.56+0x5f6/0x658
           kmem_cache_alloc_trace+0x106/0x408
           call_usermodehelper_setup+0x70/0x128
           call_usermodehelper+0x62/0x90
           cgroup_release_agent+0x178/0x1c0
           process_one_work+0x36e/0x680
           worker_thread+0x2f0/0x4f8
           kthread+0x10a/0x120
           kernel_thread_starter+0x6/0xc
           kernel_thread_starter+0x0/0xc
          INFO: Freed in call_usermodehelper_exec+0x110/0x1b8 age=71 cpu=2 pid=648
           __slab_free+0x94/0x560
           kfree+0x364/0x3e0
           call_usermodehelper_exec+0x110/0x1b8
           cgroup_release_agent+0x178/0x1c0
           process_one_work+0x36e/0x680
           worker_thread+0x2f0/0x4f8
           kthread+0x10a/0x120
           kernel_thread_starter+0x6/0xc
           kernel_thread_starter+0x0/0xc
      
      There is a use-after-free bug on the subprocess_info structure allocated
      by the user mode helper.  In case do_execve() returns with an error,
      ____call_usermodehelper() stores the error code in sub_info->retval, but
      sub_info may already have been freed.
      
      Regarding UMH_NO_WAIT, the sub_info structure can be freed by
      __call_usermodehelper() before the worker thread returns from
      do_execve(), allowing memory corruption when do_execve() failed after
      exec_mmap() is called.
      
      Regarding UMH_WAIT_EXEC, the call to umh_complete() allows
      call_usermodehelper_exec() to continue which then frees sub_info.
      
      To fix this race the code needs to make sure that the call to
      call_usermodehelper_freeinfo() is always done after the last store to
      sub_info->retval.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0baf2a4d
    • gcov: add ARM64 to GCOV_PROFILE_ALL · f601de20
      Committed by Riku Voipio
      Following up on the ARM testing of gcov, it turns out gcov on ARM64 works
      fine as well.  The only change needed is adding ARM64 to the Kconfig depends.
      
      Tested with qemu and mach-virt
      Signed-off-by: Riku Voipio <riku.voipio@linaro.org>
      Acked-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f601de20
  13. 29 Oct 2014 (1 commit)
    • rcu: Make rcu_barrier() understand about missing rcuo kthreads · d7e29933
      Committed by Paul E. McKenney
      Commit 35ce7f29 (rcu: Create rcuo kthreads only for onlined CPUs)
      avoids creating rcuo kthreads for CPUs that never come online.  This
      fixes a bug in many instances of firmware: Instead of lying about their
      age, these systems instead lie about the number of CPUs that they have.
      Before commit 35ce7f29, this could result in huge numbers of useless
      rcuo kthreads being created.
      
      It appears that experience indicates that I should have told the
      people suffering from this problem to fix their broken firmware, but
      I instead produced what turned out to be a partial fix.   The missing
      piece supplied by this commit makes sure that rcu_barrier() knows not to
      post callbacks for no-CBs CPUs that have not yet come online, because
      otherwise rcu_barrier() will hang on systems having firmware that lies
      about the number of CPUs.
      
      It is tempting to simply have rcu_barrier() refuse to post a callback on
      any no-CBs CPU that does not have an rcuo kthread.  This unfortunately
      does not work because rcu_barrier() is required to wait for all pending
      callbacks.  It is therefore required to wait even for those callbacks
      that cannot possibly be invoked.  Even if doing so hangs the system.
      
      Given that posting a callback to a no-CBs CPU that does not yet have an
      rcuo kthread can hang rcu_barrier(), it is tempting to report an error
      in this case.  Unfortunately, this will result in false positives at
      boot time, when it is perfectly legal to post callbacks to the boot CPU
      before the scheduler has started, in other words, before it is legal
      to invoke rcu_barrier().
      
      So this commit instead has rcu_barrier() avoid posting callbacks to
      CPUs having neither rcuo kthread nor pending callbacks, and has it
      complain bitterly if it finds CPUs having no rcuo kthread but some
      pending callbacks.  And when rcu_barrier() does find CPUs having no rcuo
      kthread but pending callbacks, as noted earlier, it has no choice but
      to hang indefinitely.
      Reported-by: Yanko Kaneti <yaneti@declera.com>
      Reported-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Reported-by: Meelis Roos <mroos@linux.ee>
      Reported-by: Eric B Munson <emunson@akamai.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Eric B Munson <emunson@akamai.com>
      Tested-by: Jay Vosburgh <jay.vosburgh@canonical.com>
      Tested-by: Yanko Kaneti <yaneti@declera.com>
      Tested-by: Kevin Fenzi <kevin@scrye.com>
      Tested-by: Meelis Roos <mroos@linux.ee>
      d7e29933
  14. 28 Oct 2014 (11 commits)
    • perf: Fix and clean up initialization of pmu::event_idx · c719f560
      Committed by Peter Zijlstra
      Andy reported that the current state of event_idx is rather confused.
      So remove all but the x86_pmu implementation and change the default to
      return 0 (the safe option).
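
      A sketch of the safe default described (the helper name in the perf core
      is an assumption):

      	static int perf_event_idx_default(struct perf_event *event)
      	{
      		/* 0 == "no useful index"; only the x86 pmu overrides this */
      		return 0;
      	}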
      Reported-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Cody P Schafer <cody@linux.vnet.ibm.com>
      Cc: Cody P Schafer <dev@codyps.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hendrik Brueckner <brueckner@linux.vnet.ibm.com>
      Cc: Himangi Saraogi <himangi774@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: sukadev@linux.vnet.ibm.com <sukadev@linux.vnet.ibm.com>
      Cc: Thomas Huth <thuth@linux.vnet.ibm.com>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: linux390@de.ibm.com
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: linux-s390@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c719f560
    • sched/dl: Fix preemption checks · f3a7e1a9
      Committed by Kirill Tkhai
      1) switched_to_dl() check is wrong. We reschedule only
         if rq->curr is a deadline task, and we do not reschedule
         if it's a lower priority task. But we must always
         preempt a task of other classes.
      
      2) dl_task_timer():
         Policy does not change in case of priority inheritance.
         rt_mutex_setprio() changes prio, while policy remains old.
      
      So we lose some balancing logic in dl_task_timer() and
      switched_to_dl() when we check policy instead of priority. Boosted
      task may be rq->curr.
      
      (I didn't change switched_from_dl() because no check is necessary
      there at all).
      
      I've looked at this place (switched_to_dl) several times and even fixed
      this function, but only found this now...  I suppose some performance
      tests may work better after this.
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413909356.19914.128.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f3a7e1a9
    • sched: stop the unbound recursion in preempt_schedule_context() · 009f60e2
      Committed by Oleg Nesterov
      preempt_schedule_context() does preempt_enable_notrace() at the end
      and this can call the same function again; exception_exit() is heavy
      and it is quite possible that need-resched is true again.
      
      1. Change this code to dec preempt_count() and check need_resched()
         by hand.
      
      2. As Linus suggested, we can use the PREEMPT_ACTIVE bit and avoid
         the enable/disable dance around __schedule(). But in this case
         we need to move into sched/core.c.
      
      3. Cosmetic, but x86 forgets to declare this function. This doesn't
         really matter because it is only called by asm helpers, still it
         makes sense to add the declaration into asm/preempt.h to match
         preempt_schedule().
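
      A hedged sketch of the reworked function after points 1) and 2) above
      (simplified, not the verbatim kernel code):

      	asmlinkage __visible void __sched notrace preempt_schedule_context(void)
      	{
      		enum ctx_state prev_ctx;

      		if (likely(!preemptible()))
      			return;

      		do {
      			__preempt_count_add(PREEMPT_ACTIVE);
      			/* exception_exit() may set need-resched again; the loop
      			 * handles that instead of recursing through
      			 * preempt_enable_notrace() */
      			prev_ctx = exception_enter();
      			__schedule();
      			exception_exit(prev_ctx);
      			__preempt_count_sub(PREEMPT_ACTIVE);
      			barrier();
      		} while (need_resched());
      	}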
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Anvin <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/20141005202322.GB27962@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      009f60e2
    • sched/fair: Fix division by zero sysctl_numa_balancing_scan_size · 64192658
      Committed by Kirill Tkhai
      The file /proc/sys/kernel/numa_balancing_scan_size_mb allows zero to be
      written.

      This bash command reproduces the problem:
      
      $ while :; do echo 0 > /proc/sys/kernel/numa_balancing_scan_size_mb; \
      	   echo 256 > /proc/sys/kernel/numa_balancing_scan_size_mb; done
      
      	divide error: 0000 [#1] SMP
      	Modules linked in:
      	CPU: 0 PID: 24112 Comm: bash Not tainted 3.17.0+ #8
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
      	task: ffff88013c852600 ti: ffff880037a68000 task.ti: ffff880037a68000
      	RIP: 0010:[<ffffffff81074191>]  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP: 0000:ffff880037a6bce0  EFLAGS: 00010246
      	RAX: 0000000000000a00 RBX: 00000000000003e8 RCX: 0000000000000000
      	RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff88013c852600
      	RBP: ffff880037a6bcf0 R08: 0000000000000001 R09: 0000000000015c90
      	R10: ffff880239bf6c00 R11: 0000000000000016 R12: 0000000000003fff
      	R13: ffff88013c852600 R14: ffffea0008d1b000 R15: 0000000000000003
      	FS:  00007f12bb048700(0000) GS:ffff88007da00000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      	CR2: 0000000001505678 CR3: 0000000234770000 CR4: 00000000000006f0
      	Stack:
      	 ffff88013c852600 0000000000003fff ffff880037a6bd18 ffffffff810741d1
      	 ffff88013c852600 0000000000003fff 000000000002bfff ffff880037a6bda8
      	 ffffffff81077ef7 ffffea0008a56d40 0000000000000001 0000000000000001
      	Call Trace:
      	 [<ffffffff810741d1>] task_scan_max+0x11/0x40
      	 [<ffffffff81077ef7>] task_numa_fault+0x1f7/0xae0
      	 [<ffffffff8115a896>] ? migrate_misplaced_page+0x276/0x300
      	 [<ffffffff81134a4d>] handle_mm_fault+0x62d/0xba0
      	 [<ffffffff8103e2f1>] __do_page_fault+0x191/0x510
      	 [<ffffffff81030122>] ? native_smp_send_reschedule+0x42/0x60
      	 [<ffffffff8106dc00>] ? check_preempt_curr+0x80/0xa0
      	 [<ffffffff8107092c>] ? wake_up_new_task+0x11c/0x1a0
      	 [<ffffffff8104887d>] ? do_fork+0x14d/0x340
      	 [<ffffffff811799bb>] ? get_unused_fd_flags+0x2b/0x30
      	 [<ffffffff811799df>] ? __fd_install+0x1f/0x60
      	 [<ffffffff8103e67c>] do_page_fault+0xc/0x10
      	 [<ffffffff8150d322>] page_fault+0x22/0x30
      	RIP  [<ffffffff81074191>] task_scan_min+0x21/0x50
      	RSP <ffff880037a6bce0>
      	---[ end trace 9a826d16936c04de ]---
      
      Also fix a race in task_scan_min() (it depends on compiler behaviour).
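
      A hedged sketch of the sysctl table entry with a lower bound (the 'one'
      variable is the usual kernel/sysctl.c helper; exact placement is an
      assumption):

      	{
      		.procname	= "numa_balancing_scan_size_mb",
      		.data		= &sysctl_numa_balancing_scan_size,
      		.maxlen		= sizeof(unsigned int),
      		.mode		= 0644,
      		.proc_handler	= proc_dointvec_minmax,
      		.extra1		= &one,		/* scan size must be >= 1 MB */
      	},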
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Rik van Riel <riel@redhat.com>
      Link: http://lkml.kernel.org/r/1413455977.24793.78.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64192658
    • sched/fair: Care divide error in update_task_scan_period() · 2847c90e
      Committed by Yasuaki Ishimatsu
      While offlining a node by hot-removing memory, the following divide error
      occurs:
      
        divide error: 0000 [#1] SMP
        [...]
        Call Trace:
         [...] handle_mm_fault
         [...] ? try_to_wake_up
         [...] ? wake_up_state
         [...] __do_page_fault
         [...] ? do_futex
         [...] ? put_prev_entity
         [...] ? __switch_to
         [...] do_page_fault
         [...] page_fault
        [...]
        RIP  [<ffffffff810a7081>] task_numa_fault
         RSP <ffff88084eb2bcb0>
      
      The issue occurs as follows:
        1. When page fault occurs and page is allocated from node 1,
           task_struct->numa_faults_buffer_memory[] of node 1 is
           incremented and p->numa_faults_locality[] is also incremented
           as follows:
      
           o numa_faults_buffer_memory[]       o numa_faults_locality[]
                    NR_NUMA_HINT_FAULT_TYPES
                   |      0     |     1     |
           ----------------------------------  ----------------------
            node 0 |      0     |     0     |   remote |      0     |
            node 1 |      0     |     1     |   locale |      1     |
           ----------------------------------  ----------------------
      
        2. node 1 is offlined by hot removing memory.
      
        3. When page fault occurs, fault_types[] is calculated by using
           p->numa_faults_buffer_memory[] of all online nodes in
           task_numa_placement(). But node 1 was offlined in step 2. So
           the fault_types[] is calculated by using only
           p->numa_faults_buffer_memory[] of node 0. So both of fault_types[]
           are set to 0.
      
        4. The values(0) of fault_types[] pass to update_task_scan_period().
      
        5. numa_faults_locality[1] is set to 1. So the following division is
           calculated.
      
              static void update_task_scan_period(struct task_struct *p,
                                      unsigned long shared, unsigned long private){
              ...
                      ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared));
              }
      
        6. But both of private and shared are set to 0. So divide error
           occurs here.
      
      The divide error is a rare case because the trigger is node offlining.
      This patch always increments the denominator to avoid the divide error.
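
      A sketch of the described change in update_task_scan_period() (context as
      quoted in step 5 above):

      	/* private and shared can both be 0 after a node was offlined, so
      	 * keep the denominator strictly positive */
      	ratio = DIV_ROUND_UP(private * NUMA_PERIOD_SLOTS, (private + shared + 1));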
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/54475703.8000505@jp.fujitsu.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2847c90e
    • sched/numa: Fix unsafe get_task_struct() in task_numa_assign() · 1effd9f1
      Committed by Kirill Tkhai
      Unlocked access to dst_rq->curr in task_numa_compare() is racy.
      If the curr task is exiting, this may cause a use-after-free:
      
      task_numa_compare()                    do_exit()
          ...                                        current->flags |= PF_EXITING;
          ...                                    release_task()
          ...                                        ~~delayed_put_task_struct()~~
          ...                                    schedule()
          rcu_read_lock()                        ...
          cur = ACCESS_ONCE(dst_rq->curr)        ...
              ...                                rq->curr = next;
              ...                                    context_switch()
              ...                                        finish_task_switch()
              ...                                            put_task_struct()
              ...                                                __put_task_struct()
              ...                                                    free_task_struct()
              task_numa_assign()                                     ...
                  get_task_struct()                                  ...
      
      As noted by Oleg:
      
        <<The lockless get_task_struct(tsk) is only safe if tsk == current
          and didn't pass exit_notify(), or if this tsk was found on a rcu
          protected list (say, for_each_process() or find_task_by_vpid()).
          IOW, it is only safe if release_task() was not called before we
          take rcu_read_lock(), in this case we can rely on the fact that
          delayed_put_pid() can not drop the (potentially) last reference
          until rcu_read_unlock().
      
          And as Kirill pointed out task_numa_compare()->task_numa_assign()
          path does get_task_struct(dst_rq->curr) and this is not safe. The
          task_struct itself can't go away, but rcu_read_lock() can't save
          us from the final put_task_struct() in finish_task_switch(); this
          reference goes away without rcu gp>>
      
      The patch provides a simple check of the PF_EXITING flag. If it's not set,
      this guarantees that call_rcu() of delayed_put_task_struct() callback
      hasn't happened yet, so we can safely do get_task_struct() in
      task_numa_assign().
      
      Locked dst_rq->lock protects from concurrency with the last schedule().
      Reusing or unmapping of cur's memory may happen without it.
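
      A hedged sketch of the check in task_numa_compare() (simplified from the
      description above; variable names are assumptions):

      	rcu_read_lock();
      	cur = ACCESS_ONCE(dst_rq->curr);
      	/* if curr is exiting, release_task() may already have run and the
      	 * final put_task_struct() is not RCU-deferred, so taking a
      	 * reference here would be unsafe */
      	if (is_idle_task(cur) || (cur->flags & PF_EXITING))
      		cur = NULL;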
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1413962231.19914.130.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1effd9f1
    • sched/deadline: Fix races between rt_mutex_setprio() and dl_task_timer() · aee38ea9
      Committed by Juri Lelli
      dl_task_timer() is racy against several paths. Daniel noticed that
      the replenishment timer may experience a race condition against an
      enqueue_dl_entity() called from rt_mutex_setprio(). With his own
      words:
      
       rt_mutex_setprio() resets p->dl.dl_throttled. So the pattern is:
       start_dl_timer() throttled = 1, rt_mutex_setprio() throttled = 0,
       sched_switch() -> enqueue_task(), dl_task_timer-> enqueue_task()
       throttled is 0
      
      => BUG_ON(on_dl_rq(dl_se)) fires as the scheduling entity is already
      enqueued on the -deadline runqueue.
      
      As we do for the other races, we just bail out in the replenishment
      timer code.
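
      A hedged sketch of the bail-out in dl_task_timer() (the full set of
      conditions in the kernel differs; this only illustrates the
      !dl_throttled case described):

      	/* rt_mutex_setprio() cleared dl_throttled and the boost path already
      	 * enqueued the task; replenishing again would trip
      	 * BUG_ON(on_dl_rq(dl_se)) */
      	if (!dl_se->dl_throttled)
      		goto unlock;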
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: vincent@legout.info
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Michael Trimarchi <michael@amarulasolutions.com>
      Cc: Fabio Checconi <fchecconi@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414142198-18552-5-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      aee38ea9
    • sched/deadline: Don't replenish from a !SCHED_DEADLINE entity · 64be6f1f
      Committed by Juri Lelli
      In the deboost path, right after the dl_boosted flag has been
      reset, we can currently end up replenishing using -deadline
      parameters of a !SCHED_DEADLINE entity. This of course causes
      a bug, as those parameters are empty.
      
      In the case depicted above it is safe to simply bail out, as
      the deboosted task is going to be back to its original scheduling
      class anyway.
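
      A hedged sketch of the bail-out described (condition and placement in the
      deboost/enqueue path are assumptions):

      	/* the deboosted task is not SCHED_DEADLINE, so its dl parameters
      	 * are empty; let it fall back to its original scheduling class
      	 * instead of replenishing from garbage */
      	if (!dl_prio(p->normal_prio))
      		return;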
      Reported-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Tested-by: Daniel Wagner <daniel.wagner@bmw-carit.de>
      Signed-off-by: Juri Lelli <juri.lelli@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: vincent@legout.info
      Cc: Dario Faggioli <raistlin@linux.it>
      Cc: Michael Trimarchi <michael@amarulasolutions.com>
      Cc: Fabio Checconi <fchecconi@gmail.com>
      Link: http://lkml.kernel.org/r/1414142198-18552-4-git-send-email-juri.lelli@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      64be6f1f
    • sched: Fix race between task_group and sched_task_group · eeb61e53
      Committed by Kirill Tkhai
      The race may happen when somebody changes the task_group of a forking task.
      The child's cgroup is the same as the parent's after dup_task_struct()
      (it is just a memory copy). Also, cfs_rq and rt_rq are the same as the
      parent's.

      But if the parent changes its task_group before cgroup_post_fork() is
      called, we do not reflect this on the child. The child's cfs_rq and rt_rq
      remain the same, while the child's task_group changes in cgroup_post_fork().

      To fix this we introduce a fork() method, which calls sched_move_task()
      directly. This function changes sched_task_group appropriately (its logic
      also has no problem with freshly created tasks, so we don't need to
      introduce anything special; we can just use it).
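
      A minimal sketch of the fork() method described (callback and struct
      names follow the cpu cgroup controller; only the relevant field is shown):

      	static void cpu_cgroup_fork(struct task_struct *task)
      	{
      		/* pick up a task_group change made between dup_task_struct()
      		 * and cgroup_post_fork() */
      		sched_move_task(task);
      	}

      	struct cgroup_subsys cpu_cgrp_subsys = {
      		/* other callbacks unchanged */
      		.fork		= cpu_cgroup_fork,
      	};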
      
      Possibly, this solves Burke Libbey's problem: https://lkml.org/lkml/2014/10/24/456
      Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1414405105.19914.169.camel@tkhai
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      eeb61e53
    • bpf: split eBPF out of NET · f89b7755
      Committed by Alexei Starovoitov
      introduce two configs:
      - hidden CONFIG_BPF to select eBPF interpreter that classic socket filters
        depend on
      - visible CONFIG_BPF_SYSCALL (default off) that tracing and sockets can use
      
      that solves several problems:
      - tracing and others that wish to use eBPF don't need to depend on NET.
        They can use BPF_SYSCALL to allow loading from userspace or select BPF
        to use it directly from kernel in NET-less configs.
      - in 3.18 programs cannot be attached to events yet, so don't force it on
      - when the rest of eBPF infra is there in 3.19+, it's still useful to
        switch it off to minimize kernel size
      
      bloat-o-meter on x64 shows:
      add/remove: 0/60 grow/shrink: 0/2 up/down: 0/-15601 (-15601)
      
      tested with many different config combinations. Hopefully didn't miss anything.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f89b7755
    • PM / Sleep: fix recovery during resuming from hibernation · 94fb823f
      Committed by Imre Deak
      If a device's dev_pm_ops::freeze callback fails during the QUIESCE
      phase, we don't roll back things correctly by calling the thaw and
      complete callbacks. This could leave some devices in a suspended state
      in case of an error during resuming from hibernation.
      Signed-off-by: Imre Deak <imre.deak@intel.com>
      Cc: All applicable <stable@vger.kernel.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      94fb823f
  15. 26 Oct 2014 (2 commits)
  16. 25 Oct 2014 (4 commits)
    • clockevents: Prevent shift out of bounds · 10632008
      Committed by Thomas Gleixner
      Andrey reported that on a kernel with UBSan enabled he found:
      
           UBSan: Undefined behaviour in ../kernel/time/clockevents.c:75:34
      
           I guess it should be 1ULL here instead of 1U:
                  (!ismax || evt->mult <= (1U << evt->shift)))
      
      That's indeed the correct solution because shift might be 32.
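
      A standalone illustration of why the literal's type matters when the
      shift count can be 32 (plain C, not the kernel code):

      	#include <stdio.h>

      	int main(void)
      	{
      		unsigned int shift = 32;

      		/* 1U is 32 bits wide: shifting it by 32 is undefined behaviour */
      		/* unsigned long long bad = 1U << shift; */

      		/* 1ULL is 64 bits wide: shifting it by 32 is well defined */
      		unsigned long long good = 1ULL << shift;

      		printf("%llu\n", good);	/* prints 4294967296 */
      		return 0;
      	}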
      Reported-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      
      10632008
    • posix-timers: Fix stack info leak in timer_create() · 6891c450
      Committed by Mathias Krause
      If userland creates a timer without specifying a sigevent info, we'll
      create one ourselves, using a stack-local variable. In particular, we will
      use the timer ID as sival_int. But as sigev_value is a union containing
      a pointer and an int, that assignment will only partially initialize
      sigev_value on systems where the size of a pointer is bigger than the
      size of an int. On such systems we'll copy the uninitialized stack bytes
      from the timer_create() call to userland when the timer actually fires
      and we're going to deliver the signal.
      
      Initialize sigev_value with 0 to plug the stack info leak.
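
      A hedged sketch of the initialization described (variable names assumed
      from the timer_create() path):

      	/* zero the whole union so the pointer-sized half never carries
      	 * stale stack bytes to userland on 64-bit kernels */
      	memset(&event.sigev_value, 0, sizeof(event.sigev_value));
      	event.sigev_value.sival_int = new_timer->it_id;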
      
      Found in the PaX patch, written by the PaX Team.
      
      Fixes: 5a9fa730 ("posix-timers: kill ->it_sigev_signo and...")
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Brad Spengler <spender@grsecurity.net>
      Cc: PaX Team <pageexec@freemail.hu>
      Cc: <stable@vger.kernel.org>	# v2.6.28+
      Link: http://lkml.kernel.org/r/1412456799-32339-1-git-send-email-minipli@googlemail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      6891c450
    • ftrace: Fix checking of trampoline ftrace_ops in finding trampoline · 4fc40904
      Committed by Steven Rostedt (Red Hat)
      When modifying code, ftrace has several checks to make sure things
      are being done correctly. One of them is to make sure any code it
      modifies is exactly what it expects it to be before it modifies it.
      In order to do so with the new trampoline logic, it must be able
      to find out what trampoline a function is hooked to in order to
      see if the code that hooks to it is what's expected.
      
      The logic to find the trampoline from a record (accounting descriptor
      for a function that is hooked) needs to only look at the "old_hash"
      of an ops that is being modified. The old_hash is the list of functions
      an ops is hooked to before its update. This matters because a record
      would only be pointing to an ops that is being modified if it was
      already hooked before.
      
      Currently, it can pick a modified ops based on its new functions it
      will be hooked to, and this picks the wrong trampoline and causes
      the check to fail, disabling ftrace.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      
      ftrace: squash into ordering of ops for modification
      4fc40904
    • ftrace: Set ops->old_hash on modifying what an ops hooks to · 8252ecf3
      Committed by Steven Rostedt (Red Hat)
      The code that checks for trampolines when modifying function hooks
      tests against a modified ops "old_hash". But the ops old_hash pointer
      is not being updated before the changes are made, making it possible
      to not find the right hash to the callback and possibly causing
      ftrace to break in accounting and disable itself.
      
      Have the ops set its old_hash before the modifying takes place.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      8252ecf3
  17. 23 Oct 2014 (2 commits)
  18. 22 Oct 2014 (1 commit)