1. 01 September 2017: 2 commits
    • mm, uprobes: fix multiple free of ->uprobes_state.xol_area · 355627f5
      Committed by Eric Biggers
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
However, it was overlooked that this introduced a new error path before
      the new mm_struct's ->uprobes_state.xol_area has been set to NULL after
      being copied from the old mm_struct by the memcpy in dup_mm().  For a
      task that has previously hit a uprobe tracepoint, this resulted in the
      'struct xol_area' being freed multiple times if the task was killed at
      just the right time while forking.
      
      Fix it by setting ->uprobes_state.xol_area to NULL in mm_init() rather
      than in uprobe_dup_mmap().
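      
      For reference, a hedged sketch of that fix shape (the helper name and
      CONFIG guard are assumptions, not verbatim kernel source):
      
          /* Clear the xol_area pointer inherited from the parent mm as part
           * of mm_init(), so an early (e.g. SIGKILL-induced) error path can
           * never free the parent's 'struct xol_area' a second time. */
          static void mm_init_uprobes_state(struct mm_struct *mm)
          {
          #ifdef CONFIG_UPROBES
              mm->uprobes_state.xol_area = NULL;
          #endif
          }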
      
      With CONFIG_UPROBE_EVENTS=y, the bug can be reproduced by the same C
      program given by commit 2b7e8665 ("fork: fix incorrect fput of
      ->exe_file causing use-after-free"), provided that a uprobe tracepoint
      has been set on the fork_thread() function.  For example:
      
          $ gcc reproducer.c -o reproducer -lpthread
          $ nm reproducer | grep fork_thread
          0000000000400719 t fork_thread
          $ echo "p $PWD/reproducer:0x719" > /sys/kernel/debug/tracing/uprobe_events
          $ echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
          $ ./reproducer
      
      Here is the use-after-free reported by KASAN:
      
          BUG: KASAN: use-after-free in uprobe_clear_state+0x1c4/0x200
          Read of size 8 at addr ffff8800320a8b88 by task reproducer/198
      
          CPU: 1 PID: 198 Comm: reproducer Not tainted 4.13.0-rc7-00015-g36fde05f #255
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-20170228_101828-anatol 04/01/2014
          Call Trace:
           dump_stack+0xdb/0x185
           print_address_description+0x7e/0x290
           kasan_report+0x23b/0x350
           __asan_report_load8_noabort+0x19/0x20
           uprobe_clear_state+0x1c4/0x200
           mmput+0xd6/0x360
           do_exit+0x740/0x1670
           do_group_exit+0x13f/0x380
           get_signal+0x597/0x17d0
           do_signal+0x99/0x1df0
           exit_to_usermode_loop+0x166/0x1e0
           syscall_return_slowpath+0x258/0x2c0
           entry_SYSCALL_64_fastpath+0xbc/0xbe
      
          ...
      
          Allocated by task 199:
           save_stack_trace+0x1b/0x20
           kasan_kmalloc+0xfc/0x180
           kmem_cache_alloc_trace+0xf3/0x330
           __create_xol_area+0x10f/0x780
           uprobe_notify_resume+0x1674/0x2210
           exit_to_usermode_loop+0x150/0x1e0
           prepare_exit_to_usermode+0x14b/0x180
           retint_user+0x8/0x20
      
          Freed by task 199:
           save_stack_trace+0x1b/0x20
           kasan_slab_free+0xa8/0x1a0
           kfree+0xba/0x210
           uprobe_clear_state+0x151/0x200
           mmput+0xd6/0x360
           copy_process.part.8+0x605f/0x65d0
           _do_fork+0x1a5/0xbd0
           SyS_clone+0x19/0x20
           do_syscall_64+0x22f/0x660
           return_from_SYSCALL_64+0x0/0x7a
      
      Note: without KASAN, you may instead see a "Bad page state" message, or
      simply a general protection fault.
      
      Link: http://lkml.kernel.org/r/20170830033303.17927-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Reported-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>    [4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/kthread.c: kthread_worker: don't hog the cpu · 22cf8bc6
      Committed by Shaohua Li
      If the worker thread continuously gets work, it will hog the CPU and the
      RCU stall detector will complain.  Make it a good citizen by yielding
      between work items (see the sketch below).  This is triggered in a loop
      block device test.
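      
      The fix is tiny; as a hedged sketch, the worker loop in
      kthread_worker_fn() gains a voluntary reschedule point (exact placement
      assumed from the description):
      
          /* Let other threads and RCU grace periods make progress between
           * work items instead of monopolizing the CPU on a busy queue. */
          cond_resched();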
      
      Link: http://lkml.kernel.org/r/5de0a179b3184e1a2183fc503448b0269f24d75b.1503697127.git.shli@fb.com
      Signed-off-by: Shaohua Li <shli@fb.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 29 August 2017: 2 commits
    • perf/ftrace: Fix double traces of perf on ftrace:function · 75e83876
      Committed by Zhou Chengming
      When running perf on the ftrace:function tracepoint, there is a bug
      which can be reproduced by:
      
        perf record -e ftrace:function -a sleep 20 &
        perf record -e ftrace:function ls
        perf script
      
                    ls 10304 [005]   171.853235: ftrace:function: perf_output_begin
                    ls 10304 [005]   171.853237: ftrace:function: perf_output_begin
                    ls 10304 [005]   171.853239: ftrace:function: task_tgid_nr_ns
                    ls 10304 [005]   171.853240: ftrace:function: task_tgid_nr_ns
                    ls 10304 [005]   171.853242: ftrace:function: __task_pid_nr_ns
                    ls 10304 [005]   171.853244: ftrace:function: __task_pid_nr_ns
      
      We can see that all the function traces are doubled.
      
      The problem is caused by an inconsistency between the register
      function perf_ftrace_event_register() and the probe function
      perf_ftrace_function_call(). The former registers one probe
      for every perf_event, while the latter handles all perf_events
      on the current CPU. So when there are two perf_events on the
      current CPU, their traces are doubled.
      
      So this patch adds an extra "event" parameter to perf_tp_event(),
      which sends sample data only to this event when it is not NULL.
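      
      For reference, a hedged sketch of the resulting dispatch inside
      perf_tp_event() (helper names follow common usage in that file and are
      assumptions here):
      
          if (event) {
              /* deliver only to the event this probe was registered for */
              if (perf_tp_event_match(event, &data, regs))
                  perf_swevent_event(event, count, &data, regs);
          } else {
              /* legacy path: walk every event hanging off this tracepoint */
              hlist_for_each_entry_rcu(event, head, hlist_entry) {
                  if (perf_tp_event_match(event, &data, regs))
                      perf_swevent_event(event, count, &data, regs);
              }
          }
      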
      Signed-off-by: Zhou Chengming <zhouchengming1@huawei.com>
      Reviewed-by: Jiri Olsa <jolsa@kernel.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: huawei.libin@huawei.com
      Link: http://lkml.kernel.org/r/1503668977-12526-1-git-send-email-zhouchengming1@huawei.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/core: Fix potential double-fetch bug · f12f42ac
      Committed by Meng Xu
      While examining the kernel source code, I found a dangerous operation that
      could turn into a double-fetch situation (a race-condition bug) where the
      same userspace memory region is fetched twice into the kernel, with sanity
      checks after the first fetch but no checks after the second fetch.
      
        1. The first fetch happens in line 9573 get_user(size, &uattr->size).
      
        2. Subsequently the 'size' variable undergoes a few sanity checks and
           transformations (line 9577 to 9584).
      
        3. The second fetch happens in line 9610 copy_from_user(attr, uattr, size)
      
        4. Given that 'uattr' can be fully controlled in userspace, an attacker
           can race to overwrite 'uattr->size' with an arbitrary value (say,
           0xFFFFFFFF) after the first fetch but before the second fetch. The
           changed value will be copied to 'attr->size'.
      
        5. There are no further checks on 'attr->size' until the end of this
           function, and once the function returns, we lose the context to verify
           that 'attr->size' conforms to the sanity checks performed in step 2
           (line 9577 to 9584).
      
        6. My manual analysis shows that 'attr->size' is not used elsewhere
           later, so there is no working exploit against it right now. However,
           this could easily turn into an exploitable one if careless developers
           start to use 'attr->size' later.
      
      To fix this, override 'attr->size' after the second fetch with the value
      from the first fetch, regardless of what was actually copied in.
      
      In this way, it is assured that 'attr->size' is consistent with the checks
      performed after the first fetch.
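      
      In code, the fix amounts to one assignment after the second fetch; a
      hedged sketch of perf_copy_attr(), with the intermediate checks elided:
      
          u32 size;
      
          if (get_user(size, &uattr->size))        /* first fetch */
              return -EFAULT;
          /* ... 'size' is sanity-checked and clamped here ... */
          if (copy_from_user(attr, uattr, size))   /* second fetch, may race */
              return -EFAULT;
          attr->size = size;  /* keep the checked value, not the raced one */
      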
      Signed-off-by: Meng Xu <mengxu.gatech@gmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: alexander.shishkin@linux.intel.com
      Cc: meng.xu@gatech.edu
      Cc: sanidhya@gatech.edu
      Cc: taesoo@gatech.edu
      Link: http://lkml.kernel.org/r/1503522470-35531-1-git-send-email-meng.xu@gatech.edu
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 28 August 2017: 1 commit
    • Minor page waitqueue cleanups · 3510ca20
      Committed by Linus Torvalds
      Tim Chen and Kan Liang have been battling a customer load that shows
      extremely long page wakeup lists.  The cause seems to be constant NUMA
      migration of a hot page that is shared across a lot of threads, but the
      actual root cause for the exact behavior has not been found.
      
      Tim has a patch that batches the wait list traversal at wakeup time, so
      that we at least don't get long uninterruptible cases where we traverse
      and wake up thousands of processes and get nasty latency spikes.  That
      is likely 4.14 material, but we're still discussing the page waitqueue
      specific parts of it.
      
      In the meantime, I've tried to look at making the page wait queues less
      expensive, and failing miserably.  If you have thousands of threads
      waiting for the same page, it will be painful.  We'll need to try to
      figure out the NUMA balancing issue some day, in addition to avoiding
      the excessive spinlock hold times.
      
      That said, having tried to rewrite the page wait queues, I can at least
      fix up some of the braindamage in the current situation. In particular:
      
       (a) we don't want to continue walking the page wait list if the bit
           we're waiting for already got set again (which seems to be one of
           the patterns of the bad load).  That makes no progress and just
           causes pointless cache pollution chasing the pointers.
      
       (b) we don't want to put the non-locking waiters always on the front of
           the queue, and the locking waiters always on the back.  Not only is
           that unfair, it means that we wake up thousands of reading threads
           that will just end up being blocked by the writer later anyway.
      
      Also add a comment about the layout of 'struct wait_page_key' - there is
      an external user of it in the cachefiles code, which means that it has to
      match the layout of 'struct wait_bit_key' in its first two members.  It
      so happens to match, because 'struct page *' and 'unsigned long *' end
      up having the same values simply because the page flags are the first
      member in struct page.
      
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 26 August 2017: 2 commits
    • time: Fix ktime_get_raw() incorrect base accumulation · 0bcdc098
      Committed by John Stultz
In commit fc6eead7 ("time: Clean up CLOCK_MONOTONIC_RAW time
      handling"), the following code got mistakenly added to the update of the
      raw timekeeper:
      
       /* Update the monotonic raw base */
       seconds = tk->raw_sec;
       nsec = (u32)(tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift);
       tk->tkr_raw.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);
      
      Which adds the raw_sec value and the shifted down raw xtime_nsec to the
      base value.
      
      But the read function adds the shifted-down tk->tkr_raw.xtime_nsec value
      a second time.  The result is that ktime_get_raw() users (which are
      all internal users) see the raw time move faster than it should (at a
      rate that varies with the current size of tkr_raw.xtime_nsec), which has
      resulted in at least problems with graphics rendering performance.
      
      The change tried to match the monotonic base update logic:
      
       seconds = (u64)(tk->xtime_sec + tk->wall_to_monotonic.tv_sec);
       nsec = (u32) tk->wall_to_monotonic.tv_nsec;
       tk->tkr_mono.base = ns_to_ktime(seconds * NSEC_PER_SEC + nsec);
      
      Which adds the wall_to_monotonic.tv_nsec value, but not the
      tk->tkr_mono.xtime_nsec value to the base.
      
      To fix this, simplify the tkr_raw.base accumulation to only accumulate the
      raw_sec portion, and do not include the tkr_raw.xtime_nsec portion, which
      will be added at read time.
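      
      A hedged sketch of the corrected accumulation:
      
          /* Only the seconds portion goes into the base; tkr_raw.xtime_nsec
           * is added by the reader, so including it here counted it twice. */
          tk->tkr_raw.base = ns_to_ktime(tk->raw_sec * NSEC_PER_SEC);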
      
      Fixes: fc6eead7 ("time: Clean up CLOCK_MONOTONIC_RAW time handling")
      Reported-and-tested-by: Chris Wilson <chris@chris-wilson.co.uk>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Stephen Boyd <stephen.boyd@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Link: http://lkml.kernel.org/r/1503701824-1645-1-git-send-email-john.stultz@linaro.org
    • fork: fix incorrect fput of ->exe_file causing use-after-free · 2b7e8665
      Committed by Eric Biggers
      Commit 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for
      write killable") made it possible to kill a forking task while it is
      waiting to acquire its ->mmap_sem for write, in dup_mmap().
      
However, it was overlooked that this introduced a new error path before
      a reference is taken on the mm_struct's ->exe_file.  Since the
      ->exe_file of the new mm_struct was already set to the old ->exe_file by
      the memcpy() in dup_mm(), it was possible for the mmput() in the error
      path of dup_mm() to drop a reference to ->exe_file which was never
      taken.
      
      This caused the struct file to later be freed prematurely.
      
      Fix it by updating mm_init() to NULL out the ->exe_file, in the same
      place it clears other things like the list of mmaps.
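      
      As a hedged sketch, the added initialization looks like this (whether a
      plain assignment or an RCU-aware initializer is used is an assumption):
      
          /* dup_mm()'s memcpy copied the parent's ->exe_file pointer without
           * taking a reference; clear it before any error path can mmput() */
          RCU_INIT_POINTER(mm->exe_file, NULL);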
      
      This bug was found by syzkaller.  It can be reproduced using the
      following C program:
      
          #define _GNU_SOURCE
          #include <pthread.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/wait.h>
          #include <unistd.h>
      
          static void *mmap_thread(void *_arg)
          {
              for (;;) {
                  mmap(NULL, 0x1000000, PROT_READ,
                       MAP_POPULATE|MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
              }
          }
      
          static void *fork_thread(void *_arg)
          {
              usleep(rand() % 10000);
              fork();
          }
      
          int main(void)
          {
              fork();
              fork();
              fork();
              for (;;) {
                  if (fork() == 0) {
                      pthread_t t;
      
                      pthread_create(&t, NULL, mmap_thread, NULL);
                      pthread_create(&t, NULL, fork_thread, NULL);
                      usleep(rand() % 10000);
                      syscall(__NR_exit_group, 0);
                  }
                  wait(NULL);
              }
          }
      
      No special kernel config options are needed.  It usually causes a NULL
      pointer dereference in __remove_shared_vm_struct() during exit, or in
      dup_mmap() (which is usually inlined into copy_process()) during fork.
      Both are due to a vm_area_struct's ->vm_file being used after it's
      already been freed.
      
      Google Bug Id: 64772007
      
      Link: http://lkml.kernel.org/r/20170823211408.31198-1-ebiggers3@gmail.com
      Fixes: 7c051267 ("mm, fork: make dup_mmap wait for mmap_sem for write killable")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Tested-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[v4.7+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 25 August 2017: 2 commits
    • perf/core: Fix group {cpu,task} validation · 64aee2a9
      Committed by Mark Rutland
      Regardless of which events form a group, it does not make sense for the
      events to target different tasks and/or CPUs, as this leaves the group
      inconsistent and impossible to schedule. The core perf code assumes that
these are consistent across (successfully initialised) groups.
      
      Core perf code only verifies this when moving SW events into a HW
      context. Thus, we can violate this requirement for pure SW groups and
      pure HW groups, unless the relevant PMU driver happens to perform this
      verification itself. These mismatched groups subsequently wreak havoc
      elsewhere.
      
      For example, we handle watchpoints as SW events, and reserve watchpoint
      HW on a per-CPU basis at pmu::event_init() time to ensure that any event
      that is initialised is guaranteed to have a slot at pmu::add() time.
      However, the core code only checks the group leader's cpu filter (via
      event_filter_match()), and can thus install follower events onto CPUs
violating their (mismatched) CPU filters, potentially installing them
      into a CPU without sufficient reserved slots.
      
      This can be triggered with the below test case, resulting in warnings
      from arch backends.
      
        #define _GNU_SOURCE
        #include <linux/hw_breakpoint.h>
        #include <linux/perf_event.h>
        #include <sched.h>
        #include <stdio.h>
        #include <sys/prctl.h>
        #include <sys/syscall.h>
        #include <unistd.h>
      
        static int perf_event_open(struct perf_event_attr *attr, pid_t pid, int cpu,
      			   int group_fd, unsigned long flags)
        {
      	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
        }
      
        char watched_char;
      
        struct perf_event_attr wp_attr = {
      	.type = PERF_TYPE_BREAKPOINT,
      	.bp_type = HW_BREAKPOINT_RW,
      	.bp_addr = (unsigned long)&watched_char,
      	.bp_len = 1,
      	.size = sizeof(wp_attr),
        };
      
        int main(int argc, char *argv[])
        {
      	int leader, ret;
      	cpu_set_t cpus;
      
      	/*
      	 * Force use of CPU0 to ensure our CPU0-bound events get scheduled.
      	 */
      	CPU_ZERO(&cpus);
      	CPU_SET(0, &cpus);
      	ret = sched_setaffinity(0, sizeof(cpus), &cpus);
      	if (ret) {
      		printf("Unable to set cpu affinity\n");
      		return 1;
      	}
      
      	/* open leader event, bound to this task, CPU0 only */
      	leader = perf_event_open(&wp_attr, 0, 0, -1, 0);
      	if (leader < 0) {
      		printf("Couldn't open leader: %d\n", leader);
      		return 1;
      	}
      
      	/*
      	 * Open a follower event that is bound to the same task, but a
      	 * different CPU. This means that the group should never be possible to
      	 * schedule.
      	 */
      	ret = perf_event_open(&wp_attr, 0, 1, leader, 0);
      	if (ret < 0) {
      		printf("Couldn't open mismatched follower: %d\n", ret);
      		return 1;
      	} else {
      		printf("Opened leader/follower with mismastched CPUs\n");
      	}
      
      	/*
      	 * Open as many independent events as we can, all bound to the same
      	 * task, CPU0 only.
      	 */
      	do {
      		ret = perf_event_open(&wp_attr, 0, 0, -1, 0);
      	} while (ret >= 0);
      
      	/*
	 * Force enable/disable all events to trigger the erroneous
      	 * installation of the follower event.
      	 */
      	printf("Opened all events. Toggling..\n");
      	for (;;) {
      		prctl(PR_TASK_PERF_EVENTS_DISABLE, 0, 0, 0, 0);
      		prctl(PR_TASK_PERF_EVENTS_ENABLE, 0, 0, 0, 0);
      	}
      
      	return 0;
        }
      
      Fix this by validating this requirement regardless of whether we're
      moving events.
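      
      A hedged sketch of the now-unconditional checks in perf_event_open()
      (the error label is an assumption):
      
          /* a sibling must target the same CPU as its group leader ... */
          if (group_leader->cpu != event->cpu)
              goto err_context;
          /* ... and the same task context (or both be per-CPU events) */
          if (group_leader->ctx->task != ctx->task)
              goto err_context;
      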
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zhou Chengming <zhouchengming1@huawei.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1498142498-15758-1-git-send-email-mark.rutland@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpuset: Fix incorrect memory_pressure control file mapping · 1c08c22c
      Committed by Waiman Long
      The memory_pressure control file was incorrectly set up without
      a private value (0, by default). As a result, this control
      file was treated like memory_migrate on read. By adding back the
      FILE_MEMORY_PRESSURE private value, the correct memory pressure value
      will be returned.
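      
      A hedged sketch of the restored cftype entry (field values per the
      description above):
      
          {
              .name = "memory_pressure",
              .read_u64 = cpuset_read_u64,
              .private = FILE_MEMORY_PRESSURE,  /* restored mapping */
          },
      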
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: 7dbdb199 ("cgroup: replace cftype->mode with CFTYPE_WORLD_WRITABLE")
      Cc: stable@vger.kernel.org # v4.4+
  6. 24 August 2017: 4 commits
    • tracing: Fix freeing of filter in create_filter() when set_str is false · 8b0db1a5
      Committed by Steven Rostedt (VMware)
      Performing the following task with kmemleak enabled:
      
       # cd /sys/kernel/tracing/events/irq/irq_handler_entry/
       # echo 'enable_event:kmem:kmalloc:3 if irq >' > trigger
       # echo 'enable_event:kmem:kmalloc:3 if irq > 31' > trigger
       # echo scan > /sys/kernel/debug/kmemleak
       # cat /sys/kernel/debug/kmemleak
      unreferenced object 0xffff8800b9290308 (size 32):
        comm "bash", pid 1114, jiffies 4294848451 (age 141.139s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81cef5aa>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff81357938>] kmem_cache_alloc_trace+0x158/0x290
          [<ffffffff81261c09>] create_filter_start.constprop.28+0x99/0x940
          [<ffffffff812639c9>] create_filter+0xa9/0x160
          [<ffffffff81263bdc>] create_event_filter+0xc/0x10
          [<ffffffff812655e5>] set_trigger_filter+0xe5/0x210
          [<ffffffff812660c4>] event_enable_trigger_func+0x324/0x490
          [<ffffffff812652e2>] event_trigger_write+0x1a2/0x260
          [<ffffffff8138cf87>] __vfs_write+0xd7/0x380
          [<ffffffff8138f421>] vfs_write+0x101/0x260
          [<ffffffff8139187b>] SyS_write+0xab/0x130
          [<ffffffff81cfd501>] entry_SYSCALL_64_fastpath+0x1f/0xbe
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      The function create_filter() is passed a 'filterp' pointer that gets
      allocated, and if "set_str" is true, it is up to the caller to free it, even
      on error. The problem is that the pointer is not freed by create_filter()
      when set_str is false. This is a bug, and it is not up to the caller to free
      the filter on error if it doesn't care about the string.
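      
      A hedged sketch of the fix at the end of create_filter():
      
          /* the caller only owns (and frees) the filter when it asked for
           * the error string; otherwise clean up here on failure */
          if (err && !set_str) {
              free_event_filter(filter);
              filter = NULL;
          }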
      
      Link: http://lkml.kernel.org/r/1502705898-27571-2-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 38b78eb8 ("tracing: Factorize filter creation")
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Tested-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracing: Fix kmemleak in tracing_map_array_free() · 475bb3c6
      Committed by Chunyu Hu
      kmemleak reported the leak below when I was clearing the hist
      trigger.  With this patch, the kmemleak report is gone.
      
      unreferenced object 0xffff94322b63d760 (size 32):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 01 00 00 04 00 00 00 08 00 00 00 ff 00 00 00  ................
          10 00 00 00 00 00 00 00 80 a8 7a f2 31 94 ff ff  ..........z.1...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e424cba>] kmem_cache_alloc_trace+0xca/0x1d0
          [<ffffffff9e377736>] tracing_map_array_alloc+0x26/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      unreferenced object 0xffff9431f27aa880 (size 128):
        comm "bash", pid 1522, jiffies 4403687962 (age 2442.311s)
        hex dump (first 32 bytes):
          00 00 8c 2a 32 94 ff ff 00 f0 8b 2a 32 94 ff ff  ...*2......*2...
          00 e0 8b 2a 32 94 ff ff 00 d0 8b 2a 32 94 ff ff  ...*2......*2...
        backtrace:
          [<ffffffff9e96c27a>] kmemleak_alloc+0x4a/0xa0
          [<ffffffff9e425348>] __kmalloc+0xe8/0x220
          [<ffffffff9e3777c1>] tracing_map_array_alloc+0xb1/0x140
          [<ffffffff9e261be0>] kretprobe_trampoline+0x0/0x50
          [<ffffffff9e38b935>] create_hist_data+0x535/0x750
          [<ffffffff9e38bd47>] event_hist_trigger_func+0x1f7/0x420
          [<ffffffff9e38893d>] event_trigger_write+0xfd/0x1a0
          [<ffffffff9e44dfc7>] __vfs_write+0x37/0x170
          [<ffffffff9e44f552>] vfs_write+0xb2/0x1b0
          [<ffffffff9e450b85>] SyS_write+0x55/0xc0
          [<ffffffff9e203857>] do_syscall_64+0x67/0x150
          [<ffffffff9e977ce7>] return_from_SYSCALL_64+0x0/0x6a
          [<ffffffffffffffff>] 0xffffffffffffffff
      
      Link: http://lkml.kernel.org/r/1502705898-27571-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 08d43a5f ("tracing: Add lock-free tracing_map")
      Signed-off-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • ftrace: Check for null ret_stack on profile function graph entry function · a8f0f9e4
      Committed by Steven Rostedt (VMware)
      There's a small race between the shutdown of function graph and the
      calling of the registered function graph entry callback. The callback
      must not reference the task's ret_stack without first checking that it
      is not NULL. Note, when a ret_stack is allocated for a task, it stays
      allocated until the task exits. The problem here is that function_graph
      is shut down and a new task is created which doesn't have its ret_stack
      allocated. But since some of the functions are still being traced, the
      callbacks can still be called.
      
      The normal function_graph code handles this, but starting with commit
      8861dd30 ("ftrace: Access ret_stack->subtime only in the function
      profiler") the profiler code references the ret_stack on function entry, but
      doesn't check if it is NULL first.
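      
      A hedged sketch of the added guard in the profiler's graph-entry
      callback:
      
          /* If function_graph shut down before this task got a ret_stack,
           * there is nothing to profile against; bail out safely. */
          if (!current->ret_stack)
              return 0;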
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196611
      
      Cc: stable@vger.kernel.org
      Fixes: 8861dd30 ("ftrace: Access ret_stack->subtime only in the function profiler")
      Reported-by: lilydjwg@gmail.com
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • timers: Fix excessive granularity of new timers after a nohz idle · 2fe59f50
      Committed by Nicholas Piggin
      When a timer base is idle, it is forwarded when a new timer is added
      to ensure that granularity does not become excessive. When not idle,
      the timer tick is expected to increment the base.
      
      However there are several problems:
      
      - If an existing timer is modified, the base is forwarded only after
        the index is calculated.
      
      - The base is not forwarded by add_timer_on.
      
      - There is a window after a timer is restarted from a nohz idle, after
        it is marked not-idle and before the timer tick on this CPU, where a
        timer may be added but the ancient base does not get forwarded.
      
      These result in excessive granularity (a 1-jiffy timeout can blow out
      to 100s of jiffies), which causes the RCU lockup detector to trigger,
      among other things.
      
      Fix this by keeping track of whether the timer base has been idle
      since it was last run or forwarded, and if so then forward it before
      adding a new timer.
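      
      A hedged sketch of the idea (the flag names are assumptions; the real
      patch threads this through the existing enqueue paths):
      
          static inline void forward_timer_base(struct timer_base *base)
          {
              unsigned long jnow = READ_ONCE(jiffies);
      
              /* only a base that has been idle can trail behind jiffies */
              if (!base->must_forward_clk)
                  return;
              base->must_forward_clk = base->is_idle;
      
              if ((long)(jnow - base->clk) < 2)
                  return;  /* close enough, nothing to forward */
      
              /* never run the wheel clock past the first pending expiry */
              if (time_after(base->next_expiry, jnow))
                  base->clk = jnow;
              else
                  base->clk = base->next_expiry;
          }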
      
      There is still a case where mod_timer optimises the case of a pending
      timer mod with the same expiry time, where the timer can see excessive
      granularity relative to the new, shorter interval. A comment is added,
      but it's not changed because it is an important fastpath for
      networking.
      
      This has been tested and found to fix the RCU softlockup messages.
      
      Testing was also done with tracing to measure requested versus
      achieved wakeup latencies for all non-deferrable timers in an idle
      system (with no lockup watchdogs running). Wakeup latency relative to
      absolute latency is calculated (note this suffers from round-up skew
      at low absolute times) and analysed:
      
                   max     avg      std
      upstream   506.0    1.20     4.68
      patched      2.0    1.08     0.15
      
      The bug was noticed due to the lockup detector Kconfig changes
      dropping it out of people's .configs, resulting in larger base
      clk skew.  When the lockup detectors are enabled, no CPU can go idle for
      longer than 4 seconds, which limits the granularity errors.
      Sub-optimal timer behaviour is observable on a smaller scale in that
      case:
      
      	     max     avg      std
      upstream     9.0    1.05     0.19
      patched      2.0    1.04     0.11
      
      Fixes: a683f390 ("timers: Forward the wheel clock whenever possible")
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Tested-by: David Miller <davem@davemloft.net>
      Cc: dzickus@redhat.com
      Cc: sfr@canb.auug.org.au
      Cc: mpe@ellerman.id.au
      Cc: Stephen Boyd <sboyd@codeaurora.org>
      Cc: linuxarm@huawei.com
      Cc: abdhalee@linux.vnet.ibm.com
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: akpm@linux-foundation.org
      Cc: paulmck@linux.vnet.ibm.com
      Cc: torvalds@linux-foundation.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
  7. 23 August 2017: 1 commit
    • bpf: fix map value attribute for hash of maps · 33ba43ed
      Committed by Daniel Borkmann
      Currently, iproute2's BPF ELF loader works fine with array of maps
      when retrieving the fd from a pinned node and doing a selfcheck
      against the provided map attributes from the object file, but we
      fail to do the same for hash of maps and thus refuse to get the
      map from pinned node.
      
      Reason is that when allocating hash of maps, fd_htab_map_alloc() will
      set the value size to sizeof(void *), and any user space map creation
      requests are forced to set 4 bytes as value size. Thus, selfcheck
      will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
      object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
      returns the value size used to create the map.
      
      Fix it by handling it the same way as we do for array of maps, which
      means that we leave value size at 4 bytes and in the allocation phase
      round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
      in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
      calling into htab_map_update_elem() with the pointer of the map
      pointer as value. Unlike array of maps, where we just xchg(), we're
      using the generic htab_map_update_elem() callback also used from helper
      calls, which already publishes the key/value on return, so we need
      to ensure we memcpy() the right size.
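      
      A hedged sketch of the copy-size adjustment in alloc_htab_elem() (the
      fd-map predicate name is an assumption):
      
          u32 size = htab->map.value_size;  /* stays 4 bytes user-visibly */
      
          /* hash-of-maps stores a 'struct bpf_map *' internally, so the slot
           * was rounded up to 8 bytes at allocation time; copy all of it */
          if (is_fd_htab(htab))
              size = round_up(size, 8);
          memcpy(l_new->key + round_up(key_size, 8), value, size);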
      
      Fixes: bcc6b1b7 ("bpf: Add hash of maps support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  8. 22 August 2017: 1 commit
    • pids: make task_tgid_nr_ns() safe · dd1c1f2f
      Committed by Oleg Nesterov
      This was reported many times, and this was even mentioned in commit
      52ee2dfd ("pids: refactor vnr/nr_ns helpers to make them safe") but
      somehow nobody bothered to fix the obvious problem: task_tgid_nr_ns() is
      not safe because task->group_leader points to nowhere after the exiting
      task passes exit_notify(); rcu_read_lock() cannot help.
      
      We really need to change __unhash_process() to nullify group_leader,
      parent, and real_parent, but this needs some cleanups.  Until then we
      can turn task_tgid_nr_ns() into another user of __task_pid_nr_ns() and
      fix the problem.
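      
      A hedged sketch of the resulting helper (routing through
      __task_pid_nr_ns() as described; the pid-type constant is an
      assumption):
      
          static inline pid_t task_tgid_nr_ns(struct task_struct *tsk,
                                              struct pid_namespace *ns)
          {
              /* resolve the tgid from the pid links under RCU instead of
               * dereferencing the possibly-stale task->group_leader */
              return __task_pid_nr_ns(tsk, __PIDTYPE_TGID, ns);
          }
      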
      Reported-by: Troy Kensinger <tkensinger@google.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 20 August 2017: 1 commit
  10. 19 August 2017: 2 commits
    • signal: don't remove SIGNAL_UNKILLABLE for traced tasks. · eb61b591
      Committed by Jamie Iles
      When forcing a signal, SIGNAL_UNKILLABLE is removed to prevent recursive
      faults, but this is undesirable when tracing.  For example, when
      debugging an init process (whether global or in a namespace), hitting a
      breakpoint raises SIGTRAP, and forcing that SIGTRAP strips
      SIGNAL_UNKILLABLE.  Everything continues fine, but once debugging has
      finished, the init process is left killable, which is unlikely to be
      what the user expects, resulting in either an accidentally killed init
      or an init that stops reaping zombies.
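      
      A hedged sketch of the fix in force_sig_info(): only strip the flag
      when the task is not being traced.
      
          if (action->sa.sa_handler == SIG_DFL && !t->ptrace)
              t->signal->flags &= ~SIGNAL_UNKILLABLE;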
      
      Link: http://lkml.kernel.org/r/20170815112806.10728-1-jamie.iles@oracle.com
      Signed-off-by: Jamie Iles <jamie.iles@oracle.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmod: fix wait on recursive loop · 2ba293c9
      Committed by Luis R. Rodriguez
      Recursive loops with module loading were previously handled in kmod by
      restricting the number of modprobe calls to 50 and if that limit was
      breached request_module() would return an error and a user would see the
      following on their kernel dmesg:
      
        request_module: runaway loop modprobe binfmt-464c
        Starting init:/sbin/init exists but couldn't execute it (error -8)
      
      This issue could happen, for instance, when a 64-bit kernel boots a
      32-bit userspace on some architectures and has no 32-bit binary format
      handlers.  This is visible, for instance, when a CONFIG_MODULES enabled
      64-bit MIPS kernel boots into an o32 root filesystem and the binfmt
      handler for o32 binaries is not built-in.
      
      After commit 6d7964a7 ("kmod: throttle kmod thread limit") we now
      don't have any visible signs of an error and the kernel just waits for
      the loop to end somehow.
      
      Although this *particular* recursive loop could also be addressed by
      doing a sanity check on search_binary_handler() and disallowing a
      modular binfmt to be required for modprobe, a generic solution for any
      recursive kernel kmod issues is still needed.
      
      This should catch these loops.  We can investigate each loop and address
      each one separately as they come in; this, however, puts a stopgap in
      place for them, as before.
      
      Link: http://lkml.kernel.org/r/20170809234635.13443-3-mcgrof@kernel.org
      Fixes: 6d7964a7 ("kmod: throttle kmod thread limit")
      Signed-off-by: Luis R. Rodriguez <mcgrof@kernel.org>
      Reported-by: Matt Redfearn <matt.redfearn@imgtec.com>
      Tested-by: Matt Redfearn <matt.redfearn@imgtec.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Colin Ian King <colin.king@canonical.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Daniel Mentz <danielmentz@google.com>
      Cc: David Binderman <dcb314@hotmail.com>
      Cc: Dmitry Torokhov <dmitry.torokhov@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jessica Yu <jeyu@redhat.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Michal Marek <mmarek@suse.com>
      Cc: Miroslav Benes <mbenes@suse.cz>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Shuah Khan <shuah@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 18 August 2017: 4 commits
    • kernel/watchdog: Prevent false positives with turbo modes · 7edaeb68
      Committed by Thomas Gleixner
      The hardlockup detector on x86 uses a performance counter based on unhalted
      CPU cycles and a periodic hrtimer. The hrtimer period is about 2/5 of the
      performance counter period, so the hrtimer should fire 2-3 times before the
      performance counter NMI fires. The NMI code checks whether the hrtimer
fired since the last invocation. If not, it assumes a hard lockup.
      
      The calculation of those periods is based on the nominal CPU
      frequency. Turbo modes increase the CPU clock frequency and therefore
      shorten the period of the perf/NMI watchdog. With extreme Turbo-modes (3x
      nominal frequency) the perf/NMI period is shorter than the hrtimer period
      which leads to false positives.
      
      A simple fix would be to shorten the hrtimer period, but that comes with
      the side effect of more frequent hrtimer and softlockup thread wakeups,
      which is not desired.
      
      Implement a low pass filter, which checks the perf/NMI period against
      kernel time. If the perf/NMI fires before 4/5 of the watchdog period has
      elapsed then the event is ignored and postponed to the next perf/NMI.
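      
      A hedged sketch of the filter (variable names are assumptions):
      
          static bool watchdog_check_timestamp(void)
          {
              ktime_t delta, now = ktime_get_mono_fast_ns();
      
              delta = now - __this_cpu_read(last_timestamp);
              if (delta < watchdog_hrtimer_sample_threshold)
                  return false;  /* too early: turbo artifact, not a lockup */
              __this_cpu_write(last_timestamp, now);
              return true;       /* a genuine watchdog period has elapsed */
          }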
      
      That solves the problem and avoids the overhead of shorter hrtimer periods
      and more frequent softlockup thread wakeups.
      
      Fixes: 58687acb ("lockup_detector: Combine nmi_watchdog and softlockup detector")
      Reported-and-tested-by: Kan Liang <Kan.liang@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: dzickus@redhat.com
      Cc: prarit@redhat.com
      Cc: ak@linux.intel.com
      Cc: babu.moger@oracle.com
      Cc: peterz@infradead.org
      Cc: eranian@google.com
      Cc: acme@redhat.com
      Cc: stable@vger.kernel.org
      Cc: atomlin@redhat.com
      Cc: akpm@linux-foundation.org
      Cc: torvalds@linux-foundation.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1708150931310.1886@nanos
    • genirq: Restore trigger settings in irq_modify_status() · e8f24189
      Committed by Marc Zyngier
      irq_modify_status() starts by clearing the trigger settings from
      irq_data before applying the new settings, but doesn't restore them,
      leaving them at IRQ_TYPE_NONE.
      
      That's pretty confusing to the potential request_irq() that could
      follow. Instead, snapshot the settings before clearing them, and restore
      them if the irq_modify_status() invocation was not changing the trigger.
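      
      A hedged sketch of the snapshot-and-restore logic:
      
          unsigned long trigger, tmp;
      
          trigger = irqd_get_trigger_type(&desc->irq_data);  /* snapshot */
      
          /* ... existing code clears IRQD_TRIGGER_MASK along with the other
           * status bits and applies the caller's set/clear flags ... */
      
          tmp = irqd_get_trigger_type(&desc->irq_data);
          if (tmp != IRQ_TYPE_NONE)
              trigger = tmp;  /* the caller changed the trigger on purpose */
          irqd_set_trigger_type(&desc->irq_data, trigger);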
      
      Fixes: 1e2a7d78 ("irqdomain: Don't set type when mapping an IRQ")
      Reported-and-tested-by: jeffy <jeffy.chen@rock-chips.com>
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jon Hunter <jonathanh@nvidia.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170818095345.12378-1-marc.zyngier@arm.com
    • cpufreq: schedutil: Always process remote callback with slow switching · c49cbc19
      Committed by Viresh Kumar
      The frequency update from the utilization update handlers can be divided
      into two parts:
      
      (A) Finding the next frequency
      (B) Updating the frequency
      
      While any CPU can do (A), (B) can be restricted to a group of CPUs only,
      depending on the current platform.
      
      For platforms where fast cpufreq switching is possible, both (A) and (B)
      are always done from the same CPU and that CPU should be capable of
      changing the frequency of the target CPU.
      
      But for platforms where fast cpufreq switching isn't possible, after
      doing (A) we wake up a kthread which will eventually do (B). This
      kthread is already bound to the right set of CPUs, i.e. only those which
      can change the frequency of CPUs of a cpufreq policy. And so any CPU
      can actually do (A) in this case, as the frequency is updated from the
      right set of CPUs only.
      
      Check cpufreq_can_do_remote_dvfs() only for the fast switching case.
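      
      A hedged sketch of that check in schedutil's update path (function and
      field names are assumptions):
      
          /* A remote callback only needs hardware support when this CPU
           * would write the frequency itself (fast switching); the kthread
           * path is already bound to the right CPUs. */
          if (sg_policy->policy->fast_switch_enabled &&
              !cpufreq_can_do_remote_dvfs(sg_policy->policy))
              return false;
      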
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • cpufreq: schedutil: Don't restrict kthread to related_cpus unnecessarily · e2cabe48
      Committed by Viresh Kumar
      Utilization update callbacks are now processed remotely, even on the
      CPUs that don't share cpufreq policy with the target CPU (if
      dvfs_possible_from_any_cpu flag is set).
      
      But in non-fast switch paths, the frequency is changed only from one of
      policy->related_cpus. This happens because the kthread which does the
      actual update is bound to a subset of CPUs (i.e. related_cpus).
      
      Allow frequency to be remotely updated as well (i.e. call
      __cpufreq_driver_target()) if dvfs_possible_from_any_cpu flag is set.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  12. 16 August 2017: 3 commits
    • bpf: fix bpf_trace_printk on 32 bit archs · 88a5c690
      Committed by Daniel Borkmann
      James reported that on MIPS32 bpf_trace_printk() is currently
      broken while MIPS64 works fine:
      
        bpf_trace_printk() uses conditional operators to attempt to
        pass different types to __trace_printk() depending on the
        format operators. This doesn't work as intended on 32-bit
        architectures where u32 and long are passed differently to
        u64, since the result of C conditional operators follows the
        "usual arithmetic conversions" rules, such that the values
        passed to __trace_printk() will always be u64 [causing issues
        later in the va_list handling for vscnprintf()].
      
        For example the samples/bpf/tracex5 test printed lines like
        below on MIPS32, where the fd and buf have come from the u64
        fd argument, and the size from the buf argument:
      
          [...] 1180.941542: 0x00000001: write(fd=1, buf=  (null), size=6258688)
      
        Instead of this:
      
          [...] 1625.616026: 0x00000001: write(fd=1, buf=009e4000, size=512)
      
      One way to get it working is to expand various combinations
      of argument types into 8 different combinations for 32 bit
      and 64 bit kernels. The fix was tested by James on MIPS32 and MIPS64
      as well, confirming that it resolves the issue.
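      
      The underlying C rule is easy to demonstrate in isolation; this hedged
      user-space snippet (not kernel code) shows the conditional operator
      promoting both branches to the 64-bit type:
      
          #include <stdint.h>
          #include <stdio.h>
      
          int main(void)
          {
              uint32_t small = 42;
              uint64_t big = 42;
              int pick_small = 1;
      
              /* "usual arithmetic conversions": the result type of ?: is
               * the common type of both branches, u64 here, either way */
              printf("%zu\n", sizeof(pick_small ? small : big)); /* 8 */
              return 0;
          }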
      
      Fixes: 9c959c86 ("tracing: Allow BPF programs to call bpf_trace_printk()")
      Reported-by: James Hogan <james.hogan@imgtec.com>
      Tested-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • audit: Receive unmount event · b5fed474
      Committed by Jan Kara
      Although audit_watch_handle_event() can handle the FS_UNMOUNT event, it
      is not part of the AUDIT_FS_WATCH mask and thus such an event never gets
      to audit_watch_handle_event(). Thus fsnotify marks are deleted by the
      fsnotify subsystem on unmount without audit being notified, which leads
      to a strange state of existing audit rules with dead fsnotify marks.
      
      Add FS_UNMOUNT to the mask of events to be received so that audit can
      clean up its state accordingly.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
    • audit: Fix use after free in audit_remove_watch_rule() · d76036ab
      Committed by Jan Kara
      audit_remove_watch_rule() drops the watch's reference to its parent but
      then continues to work with it.  That is not safe, as the parent can get
      freed once we drop our reference.  The following is a trivial reproducer:
      
      mount -o loop image /mnt
      touch /mnt/file
      auditctl -w /mnt/file -p wax
      umount /mnt
      auditctl -D
      <crash in fsnotify_destroy_mark()>
      
      Grab our own reference in audit_remove_watch_rule() earlier to make sure
      the mark does not get freed under us.
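      
      A hedged sketch of the fix shape:
      
          fsnotify_get_mark(&parent->mark);  /* pin the parent ourselves */
          /* ... drop the watch, which may release the parent's last
           * other reference ... */
          fsnotify_put_mark(&parent->mark);  /* done touching the parent */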
      
      CC: stable@vger.kernel.org
      Reported-by: Tony Jones <tonyj@suse.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Tested-by: Tony Jones <tonyj@suse.de>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
  13. 11 August 2017: 2 commits
    • mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Committed by Nadav Amit
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that the Linux TLB batching mechanism suffers from various
      races.  Races caused by batching during reclamation were recently
      handled by Mel, and this patch set deals with the others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with a weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was
      confirmed by adding an assertion to check that tlb_flush_pending is not
      set by two threads, adding artificial latency in change_protection_range(),
      and using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
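      
      A hedged sketch of the interface change (the flag becomes a counter so
      each batching thread can mark its own flush as pending):
      
          static inline bool mm_tlb_flush_pending(struct mm_struct *mm)
          {
              return atomic_read(&mm->tlb_flush_pending) > 0;
          }
      
          static inline void inc_tlb_flush_pending(struct mm_struct *mm)
          {
              atomic_inc(&mm->tlb_flush_pending);
          }
      
          static inline void dec_tlb_flush_pending(struct mm_struct *mm)
          {
              atomic_dec(&mm->tlb_flush_pending);
          }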
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and change_protection_range")
      Signed-off-by: Nadav Amit <namit@vmware.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix global NR_SLAB_.*CLAIMABLE counter reads · d507e2eb
      Committed by Johannes Weiner
      As Tetsuo points out:
       "Commit 385386cf ("mm: vmstat: move slab statistics from zone to
        node counters") broke "Slab:" field of /proc/meminfo . It shows nearly
        0kB"
      
      In addition to /proc/meminfo, this problem also affects the slab
      counters OOM/allocation failure info dumps, can cause early -ENOMEM from
      overcommit protection, and miscalculate image size requirements during
      suspend-to-disk.
      
      This is because the patch in question switched the slab counters from
      the zone level to the node level, but forgot to update the global
      accessor functions to read the aggregate node data instead of the
      aggregate zone data.
      
      Use global_node_page_state() to access the global slab counters.
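      
      A hedged sketch of the accessor fix:
      
          /* slab counters are per-node vmstat items now, so the global
           * readers must sum node state, not zone state */
          unsigned long slab_pages =
              global_node_page_state(NR_SLAB_RECLAIMABLE) +
              global_node_page_state(NR_SLAB_UNRECLAIMABLE);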
      
      Fixes: 385386cf ("mm: vmstat: move slab statistics from zone to node counters")
      Link: http://lkml.kernel.org/r/20170801134256.5400-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Stefan Agner <stefan@agner.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 10 August 2017: 4 commits
    • perf/core: Fix time on IOC_ENABLE · 9b231d9f
      Committed by Peter Zijlstra
      Vince reported that when we do IOC_ENABLE/IOC_DISABLE while the task
      is in a SIGSTOP'ed state, the timestamps go wobbly.
      
      It turns out we indeed fail to correctly account time while in 'OFF'
      state and doing IOC_ENABLE without getting scheduled in exposes the
      problem.
      
      Further thinking about this problem, it occurred to me that we can
      suffer a similar fate when we migrate an uncore event between CPUs.
      The perf_event_install() on the 'new' CPU will do add_event_to_ctx(),
      which will reset all the time stamps, causing a subsequent
      update_event_times() to overwrite the total_time_* fields with smaller
      values.
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf/x86: Fix RDPMC vs. mm_struct tracking · bfe33492
      Committed by Peter Zijlstra
      Vince reported the following rdpmc() testcase failure:
      
       > Failing test case:
       >
       >	fd=perf_event_open();
       >	addr=mmap(fd);
       >	exec()  // without closing or unmapping the event
       >	fd=perf_event_open();
       >	addr=mmap(fd);
       >	rdpmc()	// GPFs due to rdpmc being disabled
      
      The problem is of course that exec() plays tricks with what is
      current->mm, only destroying the old mappings after having
      installed the new mm.
      
      Fix this confusion by passing along vma->vm_mm instead of relying on
      current->mm.
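      
      A hedged sketch of the x86 side of the change (details elided):
      
          /* the mmap callbacks now receive the vma's mm explicitly, passed
           * as vma->vm_mm by the core, so rdpmc accounting lands on the
           * mapping's mm and not on whatever current->mm is mid-exec() */
          static void x86_pmu_event_mapped(struct perf_event *event,
                                           struct mm_struct *mm)
          {
              if (!(event->hw.flags & PERF_X86_EVENT_RDPMC_ALLOWED))
                  return;
      
              atomic_inc(&mm->context.perf_rdpmc_allowed);
          }
      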
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Tested-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Andy Lutomirski <luto@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: 1e0fb9ec ("perf: Add pmu callbacks to track event mapping and unmapping")
      Link: http://lkml.kernel.org/r/20170802173930.cstykcqefmqt7jau@hirez.programming.kicks-ass.net
      [ Minor cleanups. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • cpufreq: Return 0 from ->fast_switch() on errors · 209887e6
      Committed by Viresh Kumar
      CPUFREQ_ENTRY_INVALID is a special symbol which is used to specify that
      an entry in the cpufreq table is invalid. But using it outside of the
      scope of the cpufreq table looks a bit incorrect.
      
      If we need to represent an invalid frequency, we can write it as 0
      instead.  Note that it is already done that way for the return value of
      the ->get() callback.
      
      Let's do the same for ->fast_switch() and not use CPUFREQ_ENTRY_INVALID
      outside of the scope of cpufreq table.
      
      Also update the comment over cpufreq_driver_fast_switch() to clearly
      mention what this returns.
      
      None of the drivers return CPUFREQ_ENTRY_INVALID as of now from
      ->fast_switch() callback and so we don't need to update any of those.
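      
      A hedged sketch of a caller under the new convention (schedutil-style;
      names are assumptions):
      
          /* 0 now uniformly signals "switch failed", mirroring ->get() */
          next_freq = cpufreq_driver_fast_switch(policy, next_freq);
          if (!next_freq)
              return;
          policy->cur = next_freq;
      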
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
    • futex: Remove unnecessary warning from get_futex_key · 48fb6f4d
      Committed by Mel Gorman
      Commit 65d8fc77 ("futex: Remove requirement for lock_page() in
      get_futex_key()") removed an unnecessary lock_page() with the
      side-effect that page->mapping needed to be treated very carefully.
      
      Two defensive warnings were added in case any assumption had been
      missed; the first warning assumed a correct application would not alter
      a mapping backing a futex key.  Since merging, it has not triggered for
      any unexpected case, but Mark Rutland reported the following bug
      triggering due to the first warning.
      
        kernel BUG at kernel/futex.c:679!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 3695 Comm: syz-executor1 Not tainted 4.13.0-rc3-00020-g307fec773ba3 #3
        Hardware name: linux,dummy-virt (DT)
        task: ffff80001e271780 task.stack: ffff000010908000
        PC is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        LR is at get_futex_key+0x6a4/0xcf0 kernel/futex.c:679
        pc : [<ffff00000821ac14>] lr : [<ffff00000821ac14>] pstate: 80000145
      
      The fact that it's a bug instead of a warning was due to an unrelated
      arm64 problem, but the warning itself triggered because the underlying
      mapping changed.
      
      This is an application issue, but from a kernel perspective it's a
      recoverable situation and the warning is unnecessary, so this patch
      removes the warning.  The warning may potentially be triggered with the
      following test program from Mark, although it may be necessary to adjust
      NR_FUTEX_THREADS to be a value smaller than the number of CPUs in the
      system.
      
          #include <linux/futex.h>
          #include <pthread.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/mman.h>
          #include <sys/syscall.h>
          #include <sys/time.h>
          #include <unistd.h>
      
          #define NR_FUTEX_THREADS 16
          pthread_t threads[NR_FUTEX_THREADS];
      
          void *mem;
      
          #define MEM_PROT  (PROT_READ | PROT_WRITE)
          #define MEM_SIZE  65536
      
          static int futex_wrapper(int *uaddr, int op, int val,
                                   const struct timespec *timeout,
                                   int *uaddr2, int val3)
          {
              /* return the syscall result so the wrapper matches its type */
              return syscall(SYS_futex, uaddr, op, val, timeout, uaddr2, val3);
          }
      
          void *poll_futex(void *unused)
          {
              for (;;) {
                  futex_wrapper(mem, FUTEX_CMP_REQUEUE_PI, 1, NULL, mem + 4, 1);
              }
          }
      
          int main(int argc, char *argv[])
          {
              int i;
      
              mem = mmap(NULL, MEM_SIZE, MEM_PROT,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      
              printf("Mapping @ %p\n", mem);
      
              printf("Creating futex threads...\n");
      
              for (i = 0; i < NR_FUTEX_THREADS; i++)
                  pthread_create(&threads[i], NULL, poll_futex, NULL);
      
              printf("Flipping mapping...\n");
              for (;;) {
                  mmap(mem, MEM_SIZE, MEM_PROT,
                       MAP_FIXED | MAP_SHARED | MAP_ANONYMOUS, -1, 0);
              }
      
              return 0;
          }
      Reported-and-tested-by: Mark Rutland <mark.rutland@arm.com>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: stable@vger.kernel.org # 4.7+
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48fb6f4d
  15. 07 Aug, 2017 1 commit
  16. 03 Aug, 2017 5 commits
    • D
      cpuset: fix a deadlock due to incomplete patching of cpusets_enabled() · 89affbf5
      Committed by Dima Zavin
      In codepaths that use the begin/retry interface for reading
      mems_allowed_seq with irqs disabled, there exists a race condition that
      stalls the patch process after only modifying a subset of the
      static_branch call sites.
      
      This problem manifested itself as a deadlock in the slub allocator,
      inside get_any_partial().  The loop reads the mems_allowed_seq value (via
      read_mems_allowed_begin), performs the defrag operation, and then
      verifies the consistency of mems_allowed via read_mems_allowed_retry
      and the cookie returned by xxx_begin.
      
      The issue here is that both begin and retry first check if cpusets are
      enabled via the cpusets_enabled() static branch.  This branch can be
      rewritten dynamically (via cpuset_inc) if a new cpuset is created.  The
      x86 jump label code fully synchronizes across all CPUs for every entry
      it rewrites.  If it rewrites only one of the callsites (specifically the
      one in read_mems_allowed_retry) and then waits for the
      smp_call_function(do_sync_core) to complete while a CPU is inside the
      begin/retry section with IRQs off and the mems_allowed value is changed,
      we can hang.
      
      This is because begin() will always return 0 (since it wasn't patched
      yet) while retry() will test the 0 against the actual value of the seq
      counter.
      
      The fix is to use two different static keys: one for begin
      (pre_enable_key) and one for retry (enable_key).  In cpuset_inc(), we
      first bump the pre_enable key to ensure that read_mems_allowed_begin()
      always returns a valid seqcount while cpusets are being enabled.
      Similarly, when disabling cpusets via cpuset_dec(), we first ensure that
      callers of read_mems_allowed_retry() will start ignoring the seqcount
      value before we let read_mems_allowed_begin() return 0.
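
      A sketch of the resulting begin/retry pair (condensed from the
      description above; the key and helper names follow the changelog and
      may differ slightly from the merged code):

          static inline unsigned int read_mems_allowed_begin(void)
          {
              /* patched first on enable, so a live retry never sees a stale 0 */
              if (!static_branch_unlikely(&cpusets_pre_enable_key))
                  return 0;
              return read_seqcount_begin(&current->mems_allowed_seq);
          }

          static inline bool read_mems_allowed_retry(unsigned int seq)
          {
              /* patched second on enable, first on disable */
              if (!static_branch_unlikely(&cpusets_enabled_key))
                  return false;
              return read_seqcount_retry(&current->mems_allowed_seq, seq);
          }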
      
      The relevant stack traces of the two stuck threads:
      
        CPU: 1 PID: 1415 Comm: mkdir Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8817f9c28000 task.stack: ffffc9000ffa4000
        RIP: smp_call_function_many+0x1f9/0x260
        Call Trace:
          smp_call_function+0x3b/0x70
          on_each_cpu+0x2f/0x90
          text_poke_bp+0x87/0xd0
          arch_jump_label_transform+0x93/0x100
          __jump_label_update+0x77/0x90
          jump_label_update+0xaa/0xc0
          static_key_slow_inc+0x9e/0xb0
          cpuset_css_online+0x70/0x2e0
          online_css+0x2c/0xa0
          cgroup_apply_control_enable+0x27f/0x3d0
          cgroup_mkdir+0x2b7/0x420
          kernfs_iop_mkdir+0x5a/0x80
          vfs_mkdir+0xf6/0x1a0
          SyS_mkdir+0xb7/0xe0
          entry_SYSCALL_64_fastpath+0x18/0xad
      
        ...
      
        CPU: 2 PID: 1 Comm: init Tainted: G L  4.9.36-00104-g540c51286237 #4
        Hardware name: Default string Default string/Hardware, BIOS 4.29.1-20170526215256 05/26/2017
        task: ffff8818087c0000 task.stack: ffffc90000030000
        RIP: int3+0x39/0x70
        Call Trace:
          <#DB> ? ___slab_alloc+0x28b/0x5a0
          <EOE> ? copy_process.part.40+0xf7/0x1de0
          __slab_alloc.isra.80+0x54/0x90
          copy_process.part.40+0xf7/0x1de0
          copy_process.part.40+0xf7/0x1de0
          kmem_cache_alloc_node+0x8a/0x280
          copy_process.part.40+0xf7/0x1de0
          _do_fork+0xe7/0x6c0
          _raw_spin_unlock_irq+0x2d/0x60
          trace_hardirqs_on_caller+0x136/0x1d0
          entry_SYSCALL_64_fastpath+0x5/0xad
          do_syscall_64+0x27/0x350
          SyS_clone+0x19/0x20
          do_syscall_64+0x60/0x350
          entry_SYSCALL64_slow_path+0x25/0x25
      
      Link: http://lkml.kernel.org/r/20170731040113.14197-1-dmitriyz@waymo.com
      Fixes: 46e700ab ("mm, page_alloc: remove unnecessary taking of a seqlock when cpusets are disabled")
      Signed-off-by: Dima Zavin <dmitriyz@waymo.com>
      Reported-by: Cliff Spradlin <cspradlin@waymo.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      89affbf5
    • K
      pid: kill pidhash_size in pidhash_init() · 27e37d84
      Committed by Kefeng Wang
      After commit 3d375d78 ("mm: update callers to use HASH_ZERO flag"),
      drop unused pidhash_size in pidhash_init().
      
      Link: http://lkml.kernel.org/r/1500389267-49222-1-git-send-email-wangkefeng.wang@huawei.com
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: Pavel Tatashin <Pasha.Tatashin@Oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27e37d84
    • S
      ring-buffer: Have ring_buffer_alloc_read_page() return error on offline CPU · a7e52ad7
      Committed by Steven Rostedt (VMware)
      Chunyu Hu reported:
        "per_cpu trace directories and files are created for all possible cpus,
         but only the cpus which have ever been on-lined have their own per cpu
         ring buffer (allocated by cpuhp threads). In trace_buffers_open, the
         open handler for the trace file 'trace_pipe_raw' always tries to access
         fields of ring_buffer_per_cpu, and would panic with a NULL pointer.
      
         Align the behavior of trace_pipe_raw with trace_pipe, which returns -ENODEV
         when opening it if that cpu does not have a trace ring buffer.
      
         Reproduce:
         cat /sys/kernel/debug/tracing/per_cpu/cpu31/trace_pipe_raw
         (cpu31 was never on-lined; this is a 16-core x86_64 box)
      
         Tested with:
         1) boot with maxcpus=14, read trace_pipe_raw of cpu15.
            Got -ENODEV.
         2) online cpu15, read trace_pipe_raw of cpu15.
            Got the raw trace data.
      
         Call trace:
         [ 5760.950995] RIP: 0010:ring_buffer_alloc_read_page+0x32/0xe0
         [ 5760.961678]  tracing_buffers_read+0x1f6/0x230
         [ 5760.962695]  __vfs_read+0x37/0x160
         [ 5760.963498]  ? __vfs_read+0x5/0x160
         [ 5760.964339]  ? security_file_permission+0x9d/0xc0
         [ 5760.965451]  ? __vfs_read+0x5/0x160
         [ 5760.966280]  vfs_read+0x8c/0x130
         [ 5760.967070]  SyS_read+0x55/0xc0
         [ 5760.967779]  do_syscall_64+0x67/0x150
         [ 5760.968687]  entry_SYSCALL64_slow_path+0x25/0x25"
      
      This was introduced by the addition of the feature to reuse reader pages
      instead of re-allocating them. The problem is that the allocation of a
      reader page (which is per-CPU) does not check whether the cpu is online
      and set up for the ring buffer.
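
      The shape of the fix is roughly the following (a sketch; the exact
      error plumbing through the callers may differ from the patch):

          void *ring_buffer_alloc_read_page(struct ring_buffer *buffer, int cpu)
          {
              /* refuse CPUs that never came online and have no per-cpu buffer */
              if (!cpumask_test_cpu(cpu, buffer->cpumask))
                  return ERR_PTR(-ENODEV);

              /* ... existing reuse/allocation path continues here ... */
          }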
      
      Link: http://lkml.kernel.org/r/1500880866-1177-1-git-send-email-chuhu@redhat.com
      
      Cc: stable@vger.kernel.org
      Fixes: 73a757e6 ("ring-buffer: Return reader page back into existing ring buffer")
      Reported-by: Chunyu Hu <chuhu@redhat.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      a7e52ad7
    • D
      tracing: Missing error code in tracer_alloc_buffers() · 147d88e0
      Committed by Dan Carpenter
      If ring_buffer_alloc() or one of the next couple of function calls fails,
      we should return -ENOMEM, but the current code returns success.
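
      The gist of the fix (a sketch; the variable and label names are guesses
      at the surrounding tracer_alloc_buffers() code, not quoted from the
      patch):

          buffer = ring_buffer_alloc(size, rb_flags);
          if (!buffer) {
              ret = -ENOMEM; /* previously fell through with ret still 0 */
              goto out_free;
          }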
      
      Link: http://lkml.kernel.org/r/20170801110201.ajdkct7vwzixahvx@mwanda
      
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: stable@vger.kernel.org
      Fixes: b32614c0 ('tracing/rb: Convert to hotplug state machine')
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      147d88e0
    • S
      tracing: Call clear_boot_tracer() at lateinit_sync · 4bb0f0e7
      Committed by Steven Rostedt (VMware)
      The clear_boot_tracer function is used to reset the default_bootup_tracer
      string to prevent it from being accessed after boot, as it originally points
      to init data. But since clear_boot_tracer() is called via the
      init_lateinit() call, it races with the initcall for registering the hwlat
      tracer. If someone adds "ftrace=hwlat" to the kernel command line, depending
      on how the linker sets up the text, the saved command line may be cleared,
      and the hwlat tracer is never initialized.
      
      Simply have clear_boot_tracer() be called by initcall_lateinit_sync(), as
      that is for tasks to be called after lateinit.
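
      In initcall terms, the change amounts to one line (a sketch;
      late_initcall()/late_initcall_sync() are the standard kernel spellings
      of "lateinit" and "lateinit_sync"):

          -late_initcall(clear_boot_tracer);
          +late_initcall_sync(clear_boot_tracer);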
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=196551
      
      Cc: stable@vger.kernel.org
      Fixes: e7c15cd8 ("tracing: Added hardware latency tracer")
      Reported-by: Zamir SUN <sztsian@gmail.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      4bb0f0e7
  17. 01 Aug, 2017 2 commits
    • V
      sched: cpufreq: Allow remote cpufreq callbacks · 674e7541
      Committed by Viresh Kumar
      With Android UI and benchmarks, the latency of the cpufreq response to
      certain scheduling events can become very critical. Currently, callbacks
      into cpufreq governors are only made from the scheduler if the target
      CPU of the event is the same as the current CPU. This means there are
      certain situations where a target CPU may not run the cpufreq governor
      for some time.
      
      One testcase to show this behavior is where a task starts running on
      CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
      system is configured such that the new tasks should receive maximum
      demand initially, this should result in CPU0 increasing frequency
      immediately. But because of the above-mentioned limitation, this does
      not occur.
      
      This patch updates the scheduler core to call the cpufreq callbacks for
      remote CPUs as well.
      
      The schedutil, ondemand and conservative governors are updated to
      process cpufreq utilization update hooks called for remote CPUs where
      the remote CPU is managed by the cpufreq policy of the local CPU.
      
      The intel_pstate driver is updated to always reject remote callbacks.
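
      A condensed sketch of the check a governor hook can apply (the helper
      name is hypothetical; the patch threads this decision through the
      utilization-update callbacks):

          static bool gov_can_handle_remote_update(struct cpufreq_policy *policy)
          {
              /* accept a remote callback only if this CPU shares the policy */
              return cpumask_test_cpu(smp_processor_id(), policy->cpus);
          }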
      
      This was tested with a couple of use cases (Android: hackbench,
      recentfling, galleryfling, vellamo; Ubuntu: hackbench) on an ARM hikey
      board (64-bit octa-core, single policy). Only galleryfling showed minor
      improvements, while the others didn't show much deviation.
      
      The reason is that this patch only targets a corner case, where all of
      the following must be true to improve performance, and that doesn't
      happen too often with these tests:
      
      - Task is migrated to another CPU.
      - The task has high demand, and should take the target CPU to higher
        OPPs.
      - And the target CPU doesn't call into the cpufreq governor until the
        next tick.
      
      Based on initial work from Steve Muckle.
      Signed-off-by: Viresh Kumar <viresh.kumar@linaro.org>
      Acked-by: Saravana Kannan <skannan@codeaurora.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      674e7541
    • M
      timers: Fix overflow in get_next_timer_interrupt · 34f41c03
      Committed by Matija Glavinic Pecotic
      For example, with HZ=100, a timer 430 jiffies in the future, and a
      32-bit unsigned int, the unsigned-int multiplication on the right-hand
      side of the expression overflows, which results in wrong values being
      returned.
      
      Type-cast the multiplier to 64 bit to avoid that issue.
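
      The overflow is easy to demonstrate in userspace (an illustrative
      program, not from the patch; the TICK_NSEC value assumes HZ=100):

          #include <stdint.h>
          #include <stdio.h>

          #define TICK_NSEC 10000000U /* 10 ms per jiffy at HZ=100 */

          int main(void)
          {
              uint32_t delta = 430; /* jiffies until the timer expires */

              /* 430 * 10^7 = 4.3e9 does not fit in 32 bits and wraps... */
              uint64_t wrong = delta * TICK_NSEC;
              /* ...so the multiplier must be widened first, as in the fix */
              uint64_t right = (uint64_t)delta * TICK_NSEC;

              printf("wrong=%llu right=%llu\n",
                     (unsigned long long)wrong, (unsigned long long)right);
              return 0;
          }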
      
      Fixes: 46c8f0b0 ("timers: Fix get_next_timer_interrupt() computation")
      Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nokia.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nokia.com>
      Cc: khilman@baylibre.com
      Cc: akpm@linux-foundation.org
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/a7900f04-2a21-c9fd-67be-ab334d459ee5@nokia.com
      34f41c03
  18. 31 Jul, 2017 1 commit
    • A
      PM / CPU: replace raw_notifier with atomic_notifier · 313c8c16
      Committed by Alex Shi
      This patch replaces an rwlock and a raw notifier with an atomic
      notifier protected by a spin_lock and RCU.
      
      The main reason for this change is a 'scheduling while atomic' bug seen
      with RT kernels on ARM/ARM64. On ARM/ARM64, the rwlock
      cpu_pm_notifier_lock in cpu_pm_enter/exit() causes a potential
      schedule after IRQs are disabled in the idle call chain:
      
      cpu_startup_entry
        cpu_idle_loop
          local_irq_disable()
          cpuidle_idle_call
            call_cpuidle
              cpuidle_enter
                cpuidle_enter_state
                  ->enter :arm_enter_idle_state
                    cpu_pm_enter/exit
                      CPU_PM_CPU_IDLE_ENTER
                        read_lock(&cpu_pm_notifier_lock); <-- sleep in idle
                           __rt_spin_lock();
                              schedule();
      
      The kernel panic is here:
      [    4.609601] BUG: scheduling while atomic: swapper/1/0/0x00000002
      [    4.609608] [<ffff0000086fae70>] arm_enter_idle_state+0x18/0x70
      [    4.609614] Modules linked in:
      [    4.609615] [<ffff0000086f9298>] cpuidle_enter_state+0xf0/0x218
      [    4.609620] [<ffff0000086f93f8>] cpuidle_enter+0x18/0x20
      [    4.609626] Preemption disabled at:
      [    4.609627] [<ffff0000080fa234>] call_cpuidle+0x24/0x40
      [    4.609635] [<ffff000008882fa4>] schedule_preempt_disabled+0x1c/0x28
      [    4.609639] [<ffff0000080fa49c>] cpu_startup_entry+0x154/0x1f8
      [    4.609645] [<ffff00000808e004>] secondary_start_kernel+0x15c/0x1a0
      
      Daniel Lezcano said this notification is needed on arm/arm64 platforms.
      Sebastian suggested using an atomic_notifier instead of the rwlock, which
      not only removes the sleeping in idle but also improves latency.
      
      Tony Lindgren found a misuse: rcu_read_lock() was used after
      rcu_idle_enter(). Paul McKenney suggested trying RCU_NONIDLE.
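
      The core of the conversion, roughly (a sketch based on this changelog;
      RCU_NONIDLE is shown the way Paul suggested, and the merged code may
      differ in detail):

          static ATOMIC_NOTIFIER_HEAD(cpu_pm_notifier_chain);

          static int cpu_pm_notify(enum cpu_pm_event event)
          {
              int ret;

              /* atomic chains walk under RCU; tell RCU we are briefly non-idle */
              RCU_NONIDLE(ret = atomic_notifier_call_chain(&cpu_pm_notifier_chain,
                                                           event, NULL));
              return notifier_to_errno(ret);
          }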
      Signed-off-by: Alex Shi <alex.shi@linaro.org>
      Tested-by: Tony Lindgren <tony@atomide.com>
      Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      [ rjw: Subject & changelog ]
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      313c8c16