1. 15 January 2016, 1 commit
    • genirq: Validate action before dereferencing it in handle_irq_event_percpu() · 570540d5
      Committed by Thomas Gleixner
      commit 71f64340 changed the handling of irq_desc->action from
      
      CPU 0                   CPU 1
      free_irq()              lock(desc)
        lock(desc)            handle_edge_irq()
                              if (desc->action) {
                                handle_irq_event()
                                  action = desc->action
                                  unlock(desc)
        desc->action = NULL       handle_irq_event_percpu(desc, action)
                                    action->xxx
      to
      
      CPU 0                   CPU 1
      free_irq()              lock(desc)
        lock(desc)            handle_edge_irq()
                              if (desc->action) {
                                handle_irq_event()
                                  unlock(desc)
        desc->action = NULL       handle_irq_event_percpu(desc, action)
                                    action = desc->action
                                    action->xxx
      
      So if free_irq() manages to set the action to NULL between the unlock and
      the readout, we happily dereference a NULL pointer.
      
      We could simply revert 71f64340, but we want to preserve the better code
      generation. A simple solution is to change the action loop from a do {} while
      to a while {} loop.
      
      This is safe because we either see a valid desc->action or NULL. If the action
      is about to be removed it is still valid as free_irq() is blocked on
      synchronize_irq().
      
      CPU 0                   CPU 1
      free_irq()              lock(desc)
        lock(desc)            handle_edge_irq()
                                handle_irq_event(desc)
                                  set(INPROGRESS)
                                  unlock(desc)
                                  handle_irq_event_percpu(desc)
                                  action = desc->action
        desc->action = NULL           while (action) {
                                        action->xxx
                                        ...
                                        action = action->next;
        synchronize_irq()
          while(INPROGRESS);      lock(desc)
                                  clr(INPROGRESS)
      free(action)
      
      That's basically the same mechanism as we have for shared
      interrupts. action->next can become NULL while handle_irq_event_percpu()
      runs. Either it sees the action or NULL. It does not matter, because action
      itself cannot go away before the interrupt in progress flag has been cleared.
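
      A simplified sketch of the resulting loop shape (the real function also
      handles threaded handlers, stats and note_interrupt(); this only shows the
      while-instead-of-do/while point):

      irqreturn_t handle_irq_event_percpu(struct irq_desc *desc)
      {
      	irqreturn_t retval = IRQ_NONE;
      	unsigned int irq = desc->irq_data.irq;
      	struct irqaction *action = desc->action;

      	/*
      	 * desc->action is either a valid action or NULL here: free_irq()
      	 * cannot complete while INPROGRESS is set, it spins in
      	 * synchronize_irq().  A do {} while () loop would dereference a
      	 * NULL action; a while () loop just falls through.
      	 */
      	while (action) {
      		retval |= action->handler(irq, action->dev_id);
      		action = action->next;
      	}

      	return retval;
      }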
      
      Fixes: commit 71f64340 "genirq: Remove the second parameter from handle_irq_event_percpu()"
      Reported-by: zyjzyj2000@gmail.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Huang Shijie <shijie.huang@arm.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1601131224190.3575@nanos
  2. 08 January 2016, 1 commit
  3. 06 January 2016, 7 commits
    • perf/core: Collapse more IPI loops · 7b648018
      Committed by Peter Zijlstra
      This patch collapses the two 'hard' cases, which are
      perf_event_{dis,en}able().
      
      I cannot seem to convince myself the current code is correct.
      
      So starting with perf_event_disable(); we don't strictly need to test
      for event->state == ACTIVE, ctx->is_active is enough. If the event is
      not scheduled while the ctx is, __perf_event_disable() still does the
      right thing. It's a little less efficient to IPI in that case,
      but over-all simpler.
      
      For perf_event_enable(); the same goes, but I think that's actually
      broken in its current form. The current condition is: ctx->is_active
      && event->state == OFF, that means it doesn't do anything when
      !ctx->active && event->state == OFF. This is wrong, it should still
      mark the event INACTIVE in that case, otherwise we'll still not try
      and schedule the event once the context becomes active again.
      
      This patch implements the two functions using the new
      event_function_call() and does away with the tricky event->state
      tests.
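
      Roughly the shape this gives, sketched from the description above (the
      exact locking around event_function_call() may differ in the real patch):

      void perf_event_disable(struct perf_event *event)
      {
      	struct perf_event_context *ctx = perf_event_ctx_lock(event);

      	/* the IPI vs. current-context decision now lives in one place */
      	event_function_call(event, __perf_event_disable, NULL);

      	perf_event_ctx_unlock(event, ctx);
      }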
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Alexander Shishkin <alexander.shishkin@intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task() · 0905f04e
      Committed by Yuyang Du
      If a newly created task is selected to go to a different CPU in fork
      balance when it wakes up the first time, its load averages should
      not be removed from the source CPU, since they were never added to
      it in the first place. The same also applies to a never-used group entity.
      
      Fix it in remove_entity_load_avg(): when entity's last_update_time
      is 0, simply return. This should precisely identify the case in
      question, because in other migrations, the last_update_time is set
      to 0 after remove_entity_load_avg().
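
      A sketch of the described early return (the surrounding removal logic is
      elided; the field names are the ones used by the load tracking code as
      far as I recall):

      void remove_entity_load_avg(struct sched_entity *se)
      {
      	struct cfs_rq *cfs_rq = cfs_rq_of(se);

      	/*
      	 * Newly created task or never used group entity: nothing was ever
      	 * attached to the source cfs_rq, so there is nothing to remove.
      	 */
      	if (!se->avg.last_update_time)
      		return;

      	/* ... existing code: account the removal on the source cfs_rq ... */
      }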
      Reported-by: Steve Muckle <steve.muckle@linaro.org>
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      [peterz: cfs_rq_last_update_time]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Patrick Bellasi <patrick.bellasi@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20151216233427.GJ28098@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/deadline: Fix the earliest_dl.next logic · 7d92de3a
      Committed by Wanpeng Li
      earliest_dl.next should cache the deadline of the earliest ready task that
      is also enqueued in the pushable rbtree, as the pull algorithm uses this
      information to find candidates for migration: if the earliest_dl.next
      deadline of the source rq is earlier than the earliest_dl.curr deadline of
      the destination rq, the task from the source rq can be pulled.
      
      However, the current implementation only guarantees that earliest_dl.next
      is the deadline of the next ready task instead of the next pushable task,
      which can result in taking both rqs' locks and finding nothing to migrate
      because of affinity constraints. In addition, the current logic doesn't
      update the next candidate for pushing in pick_next_task_dl(), even if the
      running task is never eligible.

      This patch fixes both problems by updating earliest_dl.next when a
      pushable dl task is enqueued/dequeued, similar to what we already do for
      RT.
      Tested-by: Luca Abeni <luca.abeni@unitn.it>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449135730-27202-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Reset task's lockless wake-queues on fork() · 093e5840
      Committed by Sebastian Andrzej Siewior
      In the following commit:
      
        76751049 ("sched: Implement lockless wake-queues")
      
      we gained lockless wake-queues.
      
      The -RT kernel managed to lock itself up with those. There could be multiple
      attempts to enqueue task X for a wakeup _even_ if task X is already
      running.

      The reason is that task X could be runnable but not yet on a CPU. If the
      task performing the wakeup did not leave the CPU, it could perform
      multiple wakeups.

      With the proper timing, task X could be running and enqueued for a
      wakeup. If this happens while X is performing a fork(), then its
      child will have a non-NULL `wake_q` member copied.
      
      This is not a problem as long as the child task does not participate in
      lockless wakeups :)
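
      The fix itself is a one-liner in the fork path; a sketch (the exact
      placement within sched_fork() is an assumption here):

      	/*
      	 * The child may have copied a non-NULL wake_q.next from a parent
      	 * that was concurrently enqueued for a lockless wakeup.  Reset it
      	 * so the child does not appear to be queued already.
      	 */
      	p->wake_q.next = NULL;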
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dbueso@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 76751049 ("sched: Implement lockless wake-queues")
      Link: http://lkml.kernel.org/r/20151221171710.GA5499@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Fix multiplication overflow on 32-bit systems · 9e0e83a1
      Committed by Andrey Ryabinin
      Make 'r' a 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
      on 32-bit systems:
      
      	UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
      	signed integer overflow:
      	87950 * 47742 cannot be represented in type 'int'
      
      The most likely effect of this bug is bad load average numbers
      resulting in weird scheduling. It's also likely that this can
      persist for a longer time - until the system goes idle for
      a long time so that all load avg numbers get reset.
      
      [ This is the CFS load average metric, not the procfs output, which
        is separate. ]
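
      The change itself, sketched (the surrounding update_cfs_rq_load_avg() code
      and field names are recalled from that era and may not match exactly):

      	if (atomic_long_read(&cfs_rq->removed_load_avg)) {
      		/* s64 instead of long: 'r * LOAD_AVG_MAX' must not overflow on 32-bit */
      		s64 r = atomic_long_xchg(&cfs_rq->removed_load_avg, 0);

      		sa->load_avg = max_t(long, sa->load_avg - r, 0);
      		sa->load_sum = max_t(s64, sa->load_sum - r * LOAD_AVG_MAX, 0);
      	}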
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 9d89c257 ("sched/fair: Rewrite runnable load and utilization average tracking")
      Link: http://lkml.kernel.org/r/1450097243-30137-1-git-send-email-aryabinin@virtuozzo.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix race in swevent hash · 12ca6ad2
      Committed by Peter Zijlstra
      There's a race on CPU unplug where we free the swevent hash array
      while it can still have events on it. This will result in a
      use-after-free which is BAD.
      
      Simply do not free the hash array on unplug. This leaves the thing
      around and no use-after-free takes place.
      
      When the last swevent dies, we do a for_each_possible_cpu() iteration
      anyway to clean these up, at which time we'll free it, so no leakage
      will occur.
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • perf: Fix race in perf_event_exec() · c1274499
      Committed by Peter Zijlstra
      I managed to tickle this warning:
      
        [ 2338.884942] ------------[ cut here ]------------
        [ 2338.890112] WARNING: CPU: 13 PID: 35162 at ../kernel/events/core.c:2702 task_ctx_sched_out+0x6b/0x80()
        [ 2338.900504] Modules linked in:
        [ 2338.903933] CPU: 13 PID: 35162 Comm: bash Not tainted 4.4.0-rc4-dirty #244
        [ 2338.911610] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
        [ 2338.923071]  ffffffff81f1468e ffff8807c6457cb8 ffffffff815c680c 0000000000000000
        [ 2338.931382]  ffff8807c6457cf0 ffffffff810c8a56 ffffe8ffff8c1bd0 ffff8808132ed400
        [ 2338.939678]  0000000000000286 ffff880813170380 ffff8808132ed400 ffff8807c6457d00
        [ 2338.947987] Call Trace:
        [ 2338.950726]  [<ffffffff815c680c>] dump_stack+0x4e/0x82
        [ 2338.956474]  [<ffffffff810c8a56>] warn_slowpath_common+0x86/0xc0
        [ 2338.963195]  [<ffffffff810c8b4a>] warn_slowpath_null+0x1a/0x20
        [ 2338.969720]  [<ffffffff811a49cb>] task_ctx_sched_out+0x6b/0x80
        [ 2338.976244]  [<ffffffff811a62d2>] perf_event_exec+0xe2/0x180
        [ 2338.982575]  [<ffffffff8121fb6f>] setup_new_exec+0x6f/0x1b0
        [ 2338.988810]  [<ffffffff8126de83>] load_elf_binary+0x393/0x1660
        [ 2338.995339]  [<ffffffff811dc772>] ? get_user_pages+0x52/0x60
        [ 2339.001669]  [<ffffffff8121e297>] search_binary_handler+0x97/0x200
        [ 2339.008581]  [<ffffffff8121f8b3>] do_execveat_common.isra.33+0x543/0x6e0
        [ 2339.016072]  [<ffffffff8121fcea>] SyS_execve+0x3a/0x50
        [ 2339.021819]  [<ffffffff819fc165>] stub_execve+0x5/0x5
        [ 2339.027469]  [<ffffffff819fbeb2>] ? entry_SYSCALL_64_fastpath+0x12/0x71
        [ 2339.034860] ---[ end trace ee1337c59a0ddeac ]---
      
      Which is a WARN_ON_ONCE() indicating that cpuctx->task_ctx is not
      what we expected it to be.
      
      This is because context switches can swap the task_struct::perf_event_ctxp[]
      pointer around. Therefore you have to either disable preemption when looking
      at current, or hold ctx->lock.
      
      Fix perf_event_enable_on_exec(): it loads current->perf_event_ctxp[]
      before disabling interrupts, so a preemption in the right place
      can swap contexts around and leave us using the wrong one.
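
      The gist of the fix, sketched with details simplified (the helper layout
      is an approximation; the point is reading the ctx pointer only after
      interrupts are off):

      static void perf_event_enable_on_exec(int ctxn)
      {
      	struct perf_event_context *ctx;
      	unsigned long flags;

      	local_irq_save(flags);
      	/*
      	 * Look up current->perf_event_ctxp[] only with IRQs disabled, so a
      	 * context switch cannot swap the context pointer under us.
      	 */
      	ctx = current->perf_event_ctxp[ctxn];
      	if (!ctx || !ctx->nr_events)
      		goto out;

      	/* ... enable the events and reschedule the context ... */
      out:
      	local_irq_restore(flags);
      }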
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vince Weaver <vincent.weaver@maine.edu>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Link: http://lkml.kernel.org/r/20151210195740.GG6357@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 05 January 2016, 1 commit
    • tracing: Fix setting of start_index in find_next() · f36d1be2
      Committed by Qiu Peiyang
      When we do cat /sys/kernel/debug/tracing/printk_formats, we hit a kernel
      panic in t_show().
      
      general protection fault: 0000 [#1] PREEMPT SMP
      CPU: 0 PID: 2957 Comm: sh Tainted: G W  O 3.14.55-x86_64-01062-gd4acdc7 #2
      RIP: 0010:[<ffffffff811375b2>]
       [<ffffffff811375b2>] t_show+0x22/0xe0
      RSP: 0000:ffff88002b4ebe80  EFLAGS: 00010246
      RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000004
      RDX: 0000000000000004 RSI: ffffffff81fd26a6 RDI: ffff880032f9f7b1
      RBP: ffff88002b4ebe98 R08: 0000000000001000 R09: 000000000000ffec
      R10: 0000000000000000 R11: 000000000000000f R12: ffff880004d9b6c0
      R13: 7365725f6d706400 R14: ffff880004d9b6c0 R15: ffffffff82020570
      FS:  0000000000000000(0000) GS:ffff88003aa00000(0063) knlGS:00000000f776bc40
      CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
      CR2: 00000000f6c02ff0 CR3: 000000002c2b3000 CR4: 00000000001007f0
      Call Trace:
       [<ffffffff811dc076>] seq_read+0x2f6/0x3e0
       [<ffffffff811b749b>] vfs_read+0x9b/0x160
       [<ffffffff811b7f69>] SyS_read+0x49/0xb0
       [<ffffffff81a3a4b9>] ia32_do_call+0x13/0x13
       ---[ end trace 5bd9eb630614861e ]---
      Kernel panic - not syncing: Fatal exception
      
      The first time find_next() calls find_next_mod_format(), it should
      iterate trace_bprintk_fmt_list to find the first print format of the
      module. However, in the current code start_index is smaller than *pos
      at first, so the code does not iterate the list. The later container_of()
      then computes a wrong address from the former v, which makes mod_fmt a
      meaningless object, and so is the returned mod_fmt->fmt.
      
      This patch fixes it by correcting the start_index. After the fix, the
      first time find_next_mod_format() is called, start_index is equal to
      *pos, and the code iterates trace_bprintk_fmt_list to get the right
      module printk format, and therefore the right mod_fmt->fmt.
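
      In code terms the correction amounts to something like the following in
      find_next() (a sketch; the variable names are assumptions based on the
      changelog):

      	/*
      	 * Past the built-in __trace_bprintk_fmt and __tracepoint_str
      	 * entries: account for both before searching modules, so that
      	 * start_index equals *pos on the first call.
      	 */
      	start_index += last_index;

      	return find_next_mod_format(start_index, v, fmt, pos);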
      
      Link: http://lkml.kernel.org/r/5684B900.9000309@intel.com
      
      Cc: stable@vger.kernel.org # 3.12+
      Fixes: 102c9323 "tracing: Add __tracepoint_string() to export string pointers"
      Signed-off-by: Qiu Peiyang <peiyangx.qiu@intel.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  5. 29 December 2015, 1 commit
  6. 21 December 2015, 1 commit
  7. 20 December 2015, 7 commits
  8. 19 December 2015, 4 commits
    • clocksource: Make clocksource validation work for all clocksources · 1f45f1f3
      Committed by Yang Yingliang
      The clocksource validation which makes sure that the newly read value
      is not smaller than the last value only works if the clocksource mask
      is 64bit, i.e. the counter is 64bit wide. But we want to use that
      mechanism also for clocksources which are less than 64bit wide.
      
      So instead of checking whether bit 63 is set, we check whether the
      most significant bit of the clocksource mask is set in the delta
      result. If it is set, we return 0.
      
      [ tglx: Simplified the implementation, added a comment and massaged
        	the commit message ]
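
      The resulting check is a small helper along these lines (a sketch; the
      kernel's actual cycle types and naming may differ):

      static inline u64 clocksource_delta(u64 now, u64 last, u64 mask)
      {
      	u64 ret = (now - last) & mask;

      	/*
      	 * If the MSB of the counter mask is set in the result, the new
      	 * readout is behind the previous one: treat the delta as 0
      	 * instead of returning a huge bogus value.
      	 */
      	return ret & ~(mask >> 1) ? 0 : ret;
      }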
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Cc: <linux-arm-kernel@lists.infradead.org>
      Link: http://lkml.kernel.org/r/56349607.6070708@huawei.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • kexec: Fix race between panic() and crash_kexec() · 7bbee5ca
      Committed by Hidehiro Kawai
      Currently, panic() and crash_kexec() can be called at the same time.
      For example (x86 case):
      
      CPU 0:
        oops_end()
          crash_kexec()
            mutex_trylock() // acquired
              nmi_shootdown_cpus() // stop other CPUs
      
      CPU 1:
        panic()
          crash_kexec()
            mutex_trylock() // failed to acquire
          smp_send_stop() // stop other CPUs
          infinite loop
      
      If CPU 1 calls smp_send_stop() before nmi_shootdown_cpus(), kdump
      fails.
      
      In another case:
      
      CPU 0:
        oops_end()
          crash_kexec()
            mutex_trylock() // acquired
              <NMI>
              io_check_error()
                panic()
                  crash_kexec()
                    mutex_trylock() // failed to acquire
                  infinite loop
      
      Clearly, this is an undesirable result.
      
      To fix this problem, this patch changes crash_kexec() to exclude others
      by using the panic_cpu atomic.
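
      The resulting exclusion looks roughly like this (a sketch; the "no owner"
      constant is spelled PANIC_CPU_INVALID here, the original patch may spell
      it differently):

      void crash_kexec(struct pt_regs *regs)
      {
      	int old_cpu, this_cpu;

      	/*
      	 * Only one CPU is allowed to execute the crash_kexec() path, and
      	 * it is the CPU that owns panic_cpu.  A racing panic() or
      	 * crash_kexec() on another CPU simply returns here instead of
      	 * stopping us halfway through.
      	 */
      	this_cpu = raw_smp_processor_id();
      	old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
      	if (old_cpu == PANIC_CPU_INVALID) {
      		__crash_kexec(regs);		/* the old crash_kexec() body */
      		atomic_set(&panic_cpu, PANIC_CPU_INVALID);
      	}
      }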
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Minfei Huang <mnfhuang@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: x86-ml <x86@kernel.org>
      Link: http://lkml.kernel.org/r/20151210014630.25437.94161.stgit@softrs
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • panic, x86: Allow CPUs to save registers even if looping in NMI context · 58c5661f
      Committed by Hidehiro Kawai
      Currently, kdump_nmi_shootdown_cpus(), a subroutine of crash_kexec(),
      sends an NMI IPI to CPUs which haven't called panic() to stop them,
      save their register information and do some cleanups for crash dumping.
      However, if such a CPU is infinitely looping in NMI context, we fail to
      save its register information into the crash dump.
      
      For example, this can happen when unknown NMIs are broadcast to all
      CPUs as follows:
      
        CPU 0                             CPU 1
        ===========================       ==========================
        receive an unknown NMI
        unknown_nmi_error()
          panic()                         receive an unknown NMI
            spin_trylock(&panic_lock)     unknown_nmi_error()
            crash_kexec()                   panic()
                                              spin_trylock(&panic_lock)
                                              panic_smp_self_stop()
                                                infinite loop
              kdump_nmi_shootdown_cpus()
                issue NMI IPI -----------> blocked until IRET
                                                infinite loop...
      
      Here, since CPU 1 is in NMI context, the second NMI from CPU 0 is
      blocked until CPU 1 executes IRET. However, CPU 1 never executes IRET,
      so the NMI is not handled and the callback function to save registers is
      never called.
      
      In practice, this can happen on some servers which broadcast NMIs to all
      CPUs when the NMI button is pushed.
      
      To save registers in this case, we need to:
      
        a) Return from NMI handler instead of looping infinitely
        or
        b) Call the callback function directly from the infinite loop
      
      Inherently, a) is risky because NMI is also used to prevent corrupted
      data from being propagated to devices.  So, we chose b).
      
      This patch does the following:
      
      1. Move the infinite looping of CPUs which haven't called panic() in NMI
         context (actually done by panic_smp_self_stop()) outside of panic() so
         that we can refer to pt_regs. Please note that panic_smp_self_stop() is
         still used for normal context.

      2. Call the callback of kdump_nmi_shootdown_cpus() directly to save
         registers and do some cleanups after setting waiting_for_crash_ipi, which
         is used for counting down the number of CPUs which handled the callback.
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Jiang Liu <jiang.liu@linux.intel.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: lkml <linux-kernel@vger.kernel.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Stefan Lippers-Hollmann <s.l-h@gmx.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Link: http://lkml.kernel.org/r/20151210014628.25437.75256.stgit@softrs
      [ Cleanup comments, fixup formatting. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • panic, x86: Fix re-entrance problem due to panic on NMI · 1717f209
      Committed by Hidehiro Kawai
      If a panic on NMI happens just after panic() on the same CPU, panic() is
      called recursively and the kernel stalls after failing to acquire
      panic_lock.
      
      To avoid this problem, don't call panic() in NMI context if we've
      already entered panic().
      
      For that, introduce the nmi_panic() macro to reduce code duplication. In
      the case of panic on NMI, don't return from NMI handlers if another CPU
      has already panicked.
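
      A simplified sketch of the idea (the macro body and the "no owner" value
      are paraphrased, not copied from the patch):

      /* CPU currently handling panic(); "invalid" means nobody owns it yet */
      atomic_t panic_cpu = ATOMIC_INIT(PANIC_CPU_INVALID);

      #define nmi_panic(fmt, ...)						\
      do {									\
      	int this_cpu = raw_smp_processor_id();				\
      	int old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID,	\
      				     this_cpu);				\
      									\
      	/* First CPU to get here owns the panic; NMI re-entry and	\
      	   other CPUs fall through and return from their handlers. */	\
      	if (old_cpu == PANIC_CPU_INVALID)				\
      		panic(fmt, ##__VA_ARGS__);				\
      } while (0)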
      Signed-off-by: Hidehiro Kawai <hidehiro.kawai.ez@hitachi.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Gobinda Charan Maji <gobinda.cemk07@gmail.com>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Javi Merino <javi.merino@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: kexec@lists.infradead.org
      Cc: linux-doc@vger.kernel.org
      Cc: lkml <linux-kernel@vger.kernel.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Nicolas Iooss <nicolas.iooss_linux@m4x.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Link: http://lkml.kernel.org/r/20151210014626.25437.13302.stgit@softrs
      [ Cleanup comments, fixup formatting. ]
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  9. 18 December 2015, 1 commit
  10. 17 December 2015, 4 commits
    • timekeeping: Cap adjustments so they don't exceed the maxadj value · ec02b076
      Committed by John Stultz
      It has occasionally been noted that users have seen
      confusing warnings like:
      
          Adjusting tsc more than 11% (5941981 vs 7759439)
      
      We try to limit the maximum total adjustment to 11% (10% tick
      adjustment + 0.5% frequency adjustment). But this is done by
      bounding the requested adjustment values, while the internal
      steering that is done by tracking the error between what was
      requested and what was applied does not have any such limits.

      This is usually not problematic, but in some cases there is a risk
      that an adjustment could cause the clocksource mult value to
      overflow, so it's an indication that things are outside of what is
      expected.
      
      It turns out most of the reports of this 11% warning are on systems
      using chrony, which utilizes the adjtimex() ADJ_TICK interface
      (which allows a +-10% adjustment). The original rationale for
      ADJ_TICK is unclear to me, but my assumption is that it was originally
      added to allow broken systems to get a big constant correction at boot
      (see the adjtimex userspace package for an example) which would allow
      the system to work with ntpd's 0.5% adjustment limit.
      
      Chrony uses ADJ_TICK to make very aggressive short-term corrections
      (usually right at startup), which push us close enough to the max
      bound that a few late ticks can cause the internal steering to push
      past the max adjust value (tripping the warning).
      
      Thus this patch adds some extra logic to enforce the max adjustment
      cap in the internal steering.
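
      Conceptually, the added logic caps the internally steered correction as
      well, something like the following (a sketch of the idea, not the exact
      code or field names):

      	s64 limit = tk->tkr_mono.clock->maxadj;
      	s64 base  = tk->tkr_mono.clock->mult;
      	s64 cur   = tk->tkr_mono.mult;

      	/* Keep the applied mult within base +/- maxadj (the "11%" bound) */
      	if (cur + mult_adj > base + limit)
      		mult_adj = base + limit - cur;
      	else if (cur + mult_adj < base - limit)
      		mult_adj = base - limit - cur;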
      
      Note: This has the potential to slow corrections when the ADJ_TICK
      value is furthest away from the default value. So it would be good to
      get some testing from folks using chrony, to make sure we don't
      cause any troubles there.
      
      Cc: Miroslav Lichvar <mlichvar@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Prarit Bhargava <prarit@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Tested-by: Miroslav Lichvar <mlichvar@redhat.com>
      Reported-by: Andy Lutomirski <luto@kernel.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • ntp: Fix second_overflow's input parameter type to be 64bits · c7963487
      Committed by DengChao
      The function "second_overflow" uses "unsign long"
      as its input parameter type which will overflow after
      year 2106 on 32bit systems.
      
      Thus this patch replaces it with time64_t type.
      
      While the 64-bit division is expensive, "next_ntp_leap_sec"
      has been calculated already, so we can just re-use it in the
      TIME_INS/DEL cases, allowing one expensive division per
      leapsecond instead of re-doing the division once a second after
      the leap flag has been set.
      Signed-off-by: DengChao <chao.deng@linaro.org>
      [jstultz: Tweaked commit message]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • ntp: Change time_reftime to time64_t and utilize 64bit __ktime_get_real_seconds · 0af86465
      Committed by DengChao
      The type of the static variable "time_reftime" and the call to
      get_seconds() in the ntp code are both not y2038 safe.

      So change the type of time_reftime to time64_t and replace
      get_seconds() with __ktime_get_real_seconds().
      
      The local variable "secs" in ntp_update_offset() represents the
      seconds between now and the last ntp adjustment; it seems impossible
      that this interval will exceed 68 years, so its type is kept as
      "long".
      Reviewed-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: DengChao <chao.deng@linaro.org>
      [jstultz: Tweaked commit message]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • timekeeping: Provide internal function __ktime_get_real_seconds · dee36654
      Committed by DengChao
      In order to fix Y2038 issues in the ntp code we will need to replace
      get_seconds() with ktime_get_real_seconds(). But as the ntp code uses
      the timekeeping lock, which is also used by ktime_get_real_seconds(),
      we need a version without locking.
      Add a new function, __ktime_get_real_seconds(), in timekeeping to
      do this.
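
      The new helper is essentially a lock-free read of the timekeeper's
      seconds field (a sketch):

      /*
       * __ktime_get_real_seconds - wall clock seconds without taking the
       * timekeeping lock.  For callers (like the ntp code) that already run
       * under that lock and therefore must not take it again.
       */
      time64_t __ktime_get_real_seconds(void)
      {
      	struct timekeeper *tk = &tk_core.timekeeper;

      	return tk->xtime_sec;
      }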
      Reviewed-by: John Stultz <john.stultz@linaro.org>
      Signed-off-by: DengChao <chao.deng@linaro.org>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  11. 16 December 2015, 3 commits
  12. 14 December 2015, 3 commits
    • genirq: Free irq_desc with rcu · 425a5072
      Committed by Thomas Gleixner
      The new VMD device driver needs to iterate over a list of
      "demultiplexing" interrupts. Protecting that list with a lock is not
      possible because the list is also required in code paths which hold the
      irq descriptor lock. Therefore the demultiplexing interrupt handler
      would create a lock inversion scenario if it called a demux handler
      with the list protection lock held.

      A solution for this is to free the irq descriptor via RCU, so the
      list can be walked with the rcu read lock held.
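
      A sketch of the RCU-deferred free (assuming an rcu_head embedded in
      struct irq_desc; the surrounding bookkeeping is elided):

      static void delayed_free_desc(struct rcu_head *rhp)
      {
      	struct irq_desc *desc = container_of(rhp, struct irq_desc, rcu);

      	free_masks(desc);
      	kfree(desc);
      }

      static void free_desc(unsigned int irq)
      {
      	struct irq_desc *desc = irq_to_desc(irq);

      	/* ... unregister, remove from the sparse irq tree ... */

      	/*
      	 * The descriptor may still be walked under rcu_read_lock() by the
      	 * demultiplex list users, so defer the actual free to a grace
      	 * period.
      	 */
      	call_rcu(&desc->rcu, delayed_free_desc);
      }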
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Keith Busch <keith.busch@intel.com>
    • genirq: Prevent chip buslock deadlock · abc7e40c
      Committed by Thomas Gleixner
      If an interrupt chip utilizes chip->buslock then free_irq() can
      deadlock in the following way:
      
      CPU0				CPU1
      				interrupt(X) (Shared or spurious)
      free_irq(X)			interrupt_thread(X)
      chip_bus_lock(X)
      				   irq_finalize_oneshot(X)
      				     chip_bus_lock(X)
      synchronize_irq(X)
      	
      synchronize_irq() waits for the interrupt thread to complete,
      i.e. forever.
      
      Solution is simple: Drop chip_bus_lock() before calling
      synchronize_irq() as we do with the irq_desc lock. There is nothing to
      be protected after the point where irq_desc lock has been released.
      
      This adds chip_bus_lock/unlock() to the remove_irq() code path, but
      that's actually correct in the case where remove_irq() is called on
      such an interrupt. The current users of remove_irq() are not affected
      as none of those interrupts is on a chip which requires buslock.
      Reported-by: Fredrik Markström <fredrik.markstrom@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
    • sched/wait: Fix the signal handling fix · dfd01f02
      Committed by Peter Zijlstra
      Jan Stancek reported that I wrecked things for him by fixing things for
      Vladimir :/
      
      His report was due to an UNINTERRUPTIBLE wait getting -EINTR, which
      should not be possible; however, my previous patch made this possible by
      unconditionally checking signal_pending().

      We cannot use current->state as was done previously, because it can be
      changed by the instruction right after the store to that variable. We
      must instead pass the initial state along and use that.
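
      The bit-wait helpers therefore grow a 'mode' argument and test that
      instead of current->state, along these lines (a sketch):

      __sched int bit_wait(struct wait_bit_key *word, int mode)
      {
      	schedule();
      	/*
      	 * Only report -EINTR when the caller's wait was interruptible;
      	 * an UNINTERRUPTIBLE waiter must never see it.
      	 */
      	if (signal_pending_state(mode, current))
      		return -EINTR;
      	return 0;
      }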
      
      Fixes: 68985633 ("sched/wait: Fix signal handling in bit wait helpers")
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Reported-by: Chris Mason <clm@fb.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Vladimir Murzin <vladimir.murzin@arm.com>
      Tested-by: Chris Mason <clm@fb.com>
      Reviewed-by: Paul Turner <pjt@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: tglx@linutronix.de
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: hpa@zytor.com
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 13 December 2015, 1 commit
    • kernel: remove stop_machine() Kconfig dependency · 86fffe4a
      Committed by Chris Wilson
      Currently the full stop_machine() routine is only enabled on SMP if
      module unloading is enabled, or if the CPUs are hotpluggable.  This
      leads to configurations where stop_machine() is broken as it will then
      only run the callback on the local CPU with irqs disabled, and not stop
      the other CPUs or run the callback on them.
      
      For example, this breaks MTRR setup on x86 in certain configs since
      ea8596bb ("kprobes/x86: Remove unused text_poke_smp() and
      text_poke_smp_batch() functions") as the MTRR is only established on the
      boot CPU.
      
      This patch removes the Kconfig option for STOP_MACHINE and uses the SMP
      and HOTPLUG_CPU config options to compile the correct stop_machine() for
      the architecture, removing the false dependency on MODULE_UNLOAD in the
      process.
      
      Link: https://lkml.org/lkml/2014/10/8/124
      References: https://bugs.freedesktop.org/show_bug.cgi?id=84794
      Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Pranith Kumar <bobby.prani@gmail.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Iulia Manda <iulia.manda21@gmail.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 11 December 2015, 2 commits
    • time: Verify time values in adjtimex ADJ_SETOFFSET to avoid overflow · 37cf4dc3
      Committed by John Stultz
      For adjtimex()'s ADJ_SETOFFSET, make sure the tv_usec value is
      sane. We might multiply it later, which can cause an overflow
      and undefined behavior.
      
      This patch introduces new helper functions to simplify the
      checking code and adds comments to clarify what is considered valid.
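
      One of those helpers might look roughly like this (a sketch; the helper
      name and exact bounds follow the changelog's intent, not necessarily the
      final patch):

      static inline bool timeval_inject_offset_valid(const struct timeval *tv)
      {
      	/* the usec part must be a pure fraction of a second */
      	if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC)
      		return false;
      	/* the seconds part must survive conversion to a ktime_t */
      	if (tv->tv_sec > KTIME_SEC_MAX || tv->tv_sec < -KTIME_SEC_MAX)
      		return false;
      	return true;
      }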
      
      Originally this patch was by Sasha Levin, but I've basically
      rewritten it, so he should get credit for finding the issue
      and I should get the blame for any mistakes made since.
      
      Also, credit to Richard Cochran for the phrasing used in the
      comment for what is considered valid here.
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: John Stultz <john.stultz@linaro.org>
    • ntp: Verify offset doesn't overflow in ntp_update_offset · 52d189f1
      Committed by Sasha Levin
      We need to make sure that the offset is valid before manipulating it,
      otherwise it might overflow on the multiplication.
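
      A sketch of the kind of bounds check this adds in ntp_update_offset()
      (hedged; the exact limits used in the patch may differ):

      	if (!(time_status & STA_NANO)) {
      		/* bound before scaling so offset * NSEC_PER_USEC cannot overflow */
      		offset = clamp_t(long, offset, -USEC_PER_SEC, USEC_PER_SEC);
      		offset *= NSEC_PER_USEC;
      	}

      	/* and in any case keep the phase error within the sane NTP range */
      	offset = clamp_t(long, offset, -MAXPHASE, MAXPHASE);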
      
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Richard Cochran <richardcochran@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      [jstultz: Reworked one of the checks so it makes more sense]
      Signed-off-by: John Stultz <john.stultz@linaro.org>
  15. 08 December 2015, 3 commits