1. 09 4月, 2015 1 次提交
    • L
      Copy the kernel module data from user space in chunks · 3afe9f84
      Linus Torvalds 提交于
      Unlike most (all?) other copies from user space, kernel module loading
      is almost unlimited in size.  So we do a potentially huge
      "copy_from_user()" when we copy the module data from user space to the
      kernel buffer, which can be a latency concern when preemption is
      disabled (or voluntary).
      
      Also, because 'copy_from_user()' clears the tail of the kernel buffer on
      failures, even a *failed* copy can end up wasting a lot of time.
      
      Normally neither of these are concerns in real life, but they do trigger
      when doing stress-testing with trinity.  Running in a VM seems to add
      its own overheadm causing trinity module load testing to even trigger
      the watchdog.
      
      The simple fix is to just chunk up the module loading, so that it never
      tries to copy insanely big areas in one go.  That bounds the latency,
      and also the amount of (unnecessarily, in this case) cleared memory for
      the failure case.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3afe9f84
  2. 26 3月, 2015 1 次提交
    • M
      mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Mel Gorman 提交于
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reported-by: NDave Chinner <david@fromorbit.com>
      Tested-by: NDave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      074c2381
  3. 23 3月, 2015 4 次提交
    • P
      timers/tick/broadcast-hrtimer: Fix suspicious RCU usage in idle loop · a127d2bc
      Preeti U Murthy 提交于
      The hrtimer mode of broadcast queues hrtimers in the idle entry
      path so as to wakeup cpus in deep idle states. The associated
      call graph is :
      
      	cpuidle_idle_call()
      	|____ clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_ENTER, ....))
      	     |_____tick_broadcast_set_event()
      		   |____clockevents_program_event()
      			|____bc_set_next()
      
      The hrtimer_{start/cancel} functions call into tracing which uses RCU.
      But it is not legal to call into RCU in cpuidle because it is one of the
      quiescent states. Hence protect this region with RCU_NONIDLE which informs
      RCU that the cpu is momentarily non-idle.
      
      As an aside it is helpful to point out that the clock event device that is
      programmed here is not a per-cpu clock device; it is a
      pseudo clock device, used by the broadcast framework alone.
      The per-cpu clock device programming never goes through bc_set_next().
      Signed-off-by: NPreeti U Murthy <preeti@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: linuxppc-dev@ozlabs.org
      Cc: mpe@ellerman.id.au
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/r/20150318104705.17763.56668.stgit@preeti.in.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a127d2bc
    • P
      lockdep: Fix the module unload key range freeing logic · 35a9393c
      Peter Zijlstra 提交于
      Module unload calls lockdep_free_key_range(), which removes entries
      from the data structures. Most of the lockdep code OTOH assumes the
      data structures are append only; in specific see the comments in
      add_lock_to_list() and look_up_lock_class().
      
      Clearly this has only worked by accident; make it work proper. The
      actual scenario to make it go boom would involve the memory freed by
      the module unlock being re-allocated and re-used for a lock inside of
      a rcu-sched grace period. This is a very unlikely scenario, still
      better plug the hole.
      
      Use RCU list iteration in all places and ammend the comments.
      
      Change lockdep_free_key_range() to issue a sync_sched() between
      removal from the lists and returning -- which results in the memory
      being freed. Further ensure the callers are placed correctly and
      comment the requirements.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Tsyvarev <tsyvarev@ispras.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      35a9393c
    • B
      sched: Fix RLIMIT_RTTIME when PI-boosting to RT · 746db944
      Brian Silverman 提交于
      When non-realtime tasks get priority-inheritance boosted to a realtime
      scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
      counter used for checking this (the same one used for SCHED_RR
      timeslices) was not getting reset. This meant that tasks running with a
      non-realtime scheduling class which are repeatedly boosted to a realtime
      one, but never block while they are running realtime, eventually hit the
      timeout without ever running for a time over the limit. This patch
      resets the realtime timeslice counter when un-PI-boosting from an RT to
      a non-RT scheduling class.
      
      I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
      mutex which induces priority boosting and spins while boosted that gets
      killed by a SIGXCPU on non-fixed kernels but doesn't with this patch
      applied. It happens much faster with a CONFIG_PREEMPT_RT kernel, and
      does happen eventually with PREEMPT_VOLUNTARY kernels.
      Signed-off-by: NBrian Silverman <brian@peloton-tech.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: austin@peloton-tech.com
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      746db944
    • P
      perf: Fix irq_work 'tail' recursion · d525211f
      Peter Zijlstra 提交于
      Vince reported a watchdog lockup like:
      
      	[<ffffffff8115e114>] perf_tp_event+0xc4/0x210
      	[<ffffffff810b4f8a>] perf_trace_lock+0x12a/0x160
      	[<ffffffff810b7f10>] lock_release+0x130/0x260
      	[<ffffffff816c7474>] _raw_spin_unlock_irqrestore+0x24/0x40
      	[<ffffffff8107bb4d>] do_send_sig_info+0x5d/0x80
      	[<ffffffff811f69df>] send_sigio_to_task+0x12f/0x1a0
      	[<ffffffff811f71ce>] send_sigio+0xae/0x100
      	[<ffffffff811f72b7>] kill_fasync+0x97/0xf0
      	[<ffffffff8115d0b4>] perf_event_wakeup+0xd4/0xf0
      	[<ffffffff8115d103>] perf_pending_event+0x33/0x60
      	[<ffffffff8114e3fc>] irq_work_run_list+0x4c/0x80
      	[<ffffffff8114e448>] irq_work_run+0x18/0x40
      	[<ffffffff810196af>] smp_trace_irq_work_interrupt+0x3f/0xc0
      	[<ffffffff816c99bd>] trace_irq_work_interrupt+0x6d/0x80
      
      Which is caused by an irq_work generating new irq_work and therefore
      not allowing forward progress.
      
      This happens because processing the perf irq_work triggers another
      perf event (tracepoint stuff) which in turn generates an irq_work ad
      infinitum.
      
      Avoid this by raising the recursion counter in the irq_work -- which
      effectively disables all software events (including tracepoints) from
      actually triggering again.
      Reported-by: NVince Weaver <vincent.weaver@maine.edu>
      Tested-by: NVince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20150219170311.GH21418@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d525211f
  4. 18 3月, 2015 1 次提交
  5. 17 3月, 2015 1 次提交
    • P
      livepatch: Fix subtle race with coming and going modules · 8cb2c2dc
      Petr Mladek 提交于
      There is a notifier that handles live patches for coming and going modules.
      It takes klp_mutex lock to avoid races with coming and going patches but
      it does not keep the lock all the time. Therefore the following races are
      possible:
      
        1. The notifier is called sometime in STATE_MODULE_COMING. The module
           is visible by find_module() in this state all the time. It means that
           new patch can be registered and enabled even before the notifier is
           called. It might create wrong order of stacked patches, see below
           for an example.
      
         2. New patch could still see the module in the GOING state even after
            the notifier has been called. It will try to initialize the related
            object structures but the module could disappear at any time. There
            will stay mess in the structures. It might even cause an invalid
            memory access.
      
      This patch solves the problem by adding a boolean variable into struct module.
      The value is true after the coming and before the going handler is called.
      New patches need to be applied when the value is true and they need to ignore
      the module when the value is false.
      
      Note that we need to know state of all modules on the system. The races are
      related to new patches. Therefore we do not know what modules will get
      patched.
      
      Also note that we could not simply ignore going modules. The code from the
      module could be called even in the GOING state until mod->exit() finishes.
      If we start supporting patches with semantic changes between function
      calls, we need to apply new patches to any still usable code.
      See below for an example.
      
      Finally note that the patch solves only the situation when a new patch is
      registered. There are no such problems when the patch is being removed.
      It does not matter who disable the patch first, whether the normal
      disable_patch() or the module notifier. There is nothing to do
      once the patch is disabled.
      
      Alternative solutions:
      ======================
      
      + reject new patches when a patched module is coming or going; this is ugly
      
      + wait with adding new patch until the module leaves the COMING and GOING
        states; this might be dangerous and complicated; we would need to release
        kgr_lock in the middle of the patch registration to avoid a deadlock
        with the coming and going handlers; also we might need a waitqueue for
        each module which seems to be even bigger overhead than the boolean
      
      + stop modules from entering COMING and GOING states; wait until modules
        leave these states when they are already there; looks complicated; we would
        need to ignore the module that asked to stop the others to avoid a deadlock;
        also it is unclear what to do when two modules asked to stop others and
        both are in COMING state (situation when two new patches are applied)
      
      + always register/enable new patches and fix up the potential mess (registered
        patches order) in klp_module_init(); this is nasty and prone to regressions
        in the future development
      
      + add another MODULE_STATE where the kallsyms are visible but the module is not
        used yet; this looks too complex; the module states are checked on "many"
        locations
      
      Example of patch stacking breakage:
      ===================================
      
      The notifier could _not_ _simply_ ignore already initialized module objects.
      For example, let's have three patches (P1, P2, P3) for functions a() and b()
      where a() is from vmcore and b() is from a module M. Something like:
      
      	a()	b()
      P1	a1()	b1()
      P2	a2()	b2()
      P3	a3()	b3(3)
      
      If you load the module M after all patches are registered and enabled.
      The ftrace ops for function a() and b() has listed the functions in this
      order:
      
      	ops_a->func_stack -> list(a3,a2,a1)
      	ops_b->func_stack -> list(b3,b2,b1)
      
      , so the pointer to b3() is the first and will be used.
      
      Then you might have the following scenario. Let's start with state when patches
      P1 and P2 are registered and enabled but the module M is not loaded. Then ftrace
      ops for b() does not exist. Then we get into the following race:
      
      CPU0					CPU1
      
      load_module(M)
      
        complete_formation()
      
        mod->state = MODULE_STATE_COMING;
        mutex_unlock(&module_mutex);
      
      					klp_register_patch(P3);
      					klp_enable_patch(P3);
      
      					# STATE 1
      
        klp_module_notify(M)
          klp_module_notify_coming(P1);
          klp_module_notify_coming(P2);
          klp_module_notify_coming(P3);
      
      					# STATE 2
      
      The ftrace ops for a() and b() then looks:
      
        STATE1:
      
      	ops_a->func_stack -> list(a3,a2,a1);
      	ops_b->func_stack -> list(b3);
      
        STATE2:
      	ops_a->func_stack -> list(a3,a2,a1);
      	ops_b->func_stack -> list(b2,b1,b3);
      
      therefore, b2() is used for the module but a3() is used for vmcore
      because they were the last added.
      
      Example of the race with going modules:
      =======================================
      
      CPU0					CPU1
      
      delete_module()  #SYSCALL
      
         try_stop_module()
           mod->state = MODULE_STATE_GOING;
      
         mutex_unlock(&module_mutex);
      
      					klp_register_patch()
      					klp_enable_patch()
      
      					#save place to switch universe
      
      					b()     # from module that is going
      					  a()   # from core (patched)
      
         mod->exit();
      
      Note that the function b() can be called until we call mod->exit().
      
      If we do not apply patch against b() because it is in MODULE_STATE_GOING,
      it will call patched a() with modified semantic and things might get wrong.
      
      [jpoimboe@redhat.com: use one boolean instead of two]
      Signed-off-by: NPetr Mladek <pmladek@suse.cz>
      Acked-by: NJosh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NJiri Kosina <jkosina@suse.cz>
      8cb2c2dc
  6. 13 3月, 2015 2 次提交
    • L
      perf: Fix context leak in put_event() · d415a7f1
      Leon Yu 提交于
      Commit:
      
        a83fe28e ("perf: Fix put_event() ctx lock")
      
      changed the locking logic in put_event() by replacing mutex_lock_nested()
      with perf_event_ctx_lock_nested(), but didn't fix the subsequent
      mutex_unlock() with a correct counterpart, perf_event_ctx_unlock().
      
      Contexts are thus leaked as a result of incremented refcount
      in perf_event_ctx_lock_nested().
      Signed-off-by: NLeon Yu <chianglungyu@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Fixes: a83fe28e ("perf: Fix put_event() ctx lock")
      Link: http://lkml.kernel.org/r/1424954613-5034-1-git-send-email-chianglungyu@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      d415a7f1
    • A
      kasan, module, vmalloc: rework shadow allocation for modules · a5af5aa8
      Andrey Ryabinin 提交于
      Current approach in handling shadow memory for modules is broken.
      
      Shadow memory could be freed only after memory shadow corresponds it is no
      longer used.  vfree() called from interrupt context could use memory its
      freeing to store 'struct llist_node' in it:
      
          void vfree(const void *addr)
          {
          ...
              if (unlikely(in_interrupt())) {
                  struct vfree_deferred *p = this_cpu_ptr(&vfree_deferred);
                  if (llist_add((struct llist_node *)addr, &p->list))
                          schedule_work(&p->wq);
      
      Later this list node used in free_work() which actually frees memory.
      Currently module_memfree() called in interrupt context will free shadow
      before freeing module's memory which could provoke kernel crash.
      
      So shadow memory should be freed after module's memory.  However, such
      deallocation order could race with kasan_module_alloc() in module_alloc().
      
      Free shadow right before releasing vm area.  At this point vfree()'d
      memory is not used anymore and yet not available for other allocations.
      New VM_KASAN flag used to indicate that vm area has dynamically allocated
      shadow memory so kasan frees shadow only if it was previously allocated.
      Signed-off-by: NAndrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: NRusty Russell <rusty@rustcorp.com.au>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5af5aa8
  7. 09 3月, 2015 3 次提交
    • S
      ftrace: Fix ftrace enable ordering of sysctl ftrace_enabled · 524a3868
      Steven Rostedt (Red Hat) 提交于
      Some archs (specifically PowerPC), are sensitive with the ordering of
      the enabling of the calls to function tracing and setting of the
      function to use to be traced.
      
      That is, update_ftrace_function() sets what function the ftrace_caller
      trampoline should call. Some archs require this to be set before
      calling ftrace_run_update_code().
      
      Another bug was discovered, that ftrace_startup_sysctl() called
      ftrace_run_update_code() directly. If the function the ftrace_caller
      trampoline changes, then it will not be updated. Instead a call
      to ftrace_startup_enable() should be called because it tests to see
      if the callback changed since the code was disabled, and will
      tell the arch to update appropriately. Most archs do not need this
      notification, but PowerPC does.
      
      The problem could be seen by the following commands:
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo function > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # cat /sys/kernel/debug/tracing/trace
      
      The trace will show that function tracing was not active.
      
      Cc: stable@vger.kernel.org # 2.6.27+
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      524a3868
    • P
      ftrace: Fix en(dis)able graph caller when en(dis)abling record via sysctl · 1619dc3f
      Pratyush Anand 提交于
      When ftrace is enabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_START_FUNC_RET command to ftrace_run_update_code(). Similarly, when
      ftrace is disabled globally through the proc interface, we must check if
      ftrace_graph_active is set. If it is set, then we should also pass the
      FTRACE_STOP_FUNC_RET command to ftrace_run_update_code().
      
      Consider the following situation.
      
       # echo 0 > /proc/sys/kernel/ftrace_enabled
      
      After this ftrace_enabled = 0.
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      Since ftrace_enabled = 0, ftrace_enable_ftrace_graph_caller() is never
      called.
      
       # echo 1 > /proc/sys/kernel/ftrace_enabled
      
      Now ftrace_enabled will be set to true, but still
      ftrace_enable_ftrace_graph_caller() will not be called, which is not
      desired.
      
      Further if we execute the following after this:
        # echo nop > /sys/kernel/debug/tracing/current_tracer
      
      Now since ftrace_enabled is set it will call
      ftrace_disable_ftrace_graph_caller(), which causes a kernel warning on
      the ARM platform.
      
      On the ARM platform, when ftrace_enable_ftrace_graph_caller() is called,
      it checks whether the old instruction is a nop or not. If it's not a nop,
      then it returns an error. If it is a nop then it replaces instruction at
      that address with a branch to ftrace_graph_caller.
      ftrace_disable_ftrace_graph_caller() behaves just the opposite. Therefore,
      if generic ftrace code ever calls either ftrace_enable_ftrace_graph_caller()
      or ftrace_disable_ftrace_graph_caller() consecutively two times in a row,
      then it will return an error, which will cause the generic ftrace code to
      raise a warning.
      
      Note, x86 does not have an issue with this because the architecture
      specific code for ftrace_enable_ftrace_graph_caller() and
      ftrace_disable_ftrace_graph_caller() does not check the previous state,
      and calling either of these functions twice in a row has no ill effect.
      
      Link: http://lkml.kernel.org/r/e4fbe64cdac0dd0e86a3bf914b0f83c0b419f146.1425666454.git.panand@redhat.com
      
      Cc: stable@vger.kernel.org # 2.6.31+
      Signed-off-by: NPratyush Anand <panand@redhat.com>
      [
        removed extra if (ftrace_start_up) and defined ftrace_graph_active as 0
        if CONFIG_FUNCTION_GRAPH_TRACER is not set.
      ]
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      1619dc3f
    • S
      ftrace: Clear REGS_EN and TRAMP_EN flags on disabling record via sysctl · b24d443b
      Steven Rostedt (Red Hat) 提交于
      When /proc/sys/kernel/ftrace_enabled is set to zero, all function
      tracing is disabled. But the records that represent the functions
      still hold information about the ftrace_ops that are hooked to them.
      
      ftrace_ops may request "REGS" (have a full set of pt_regs passed to
      the callback), or "TRAMP" (the ops has its own trampoline to use).
      When the record is updated to represent the state of the ops hooked
      to it, it sets "REGS_EN" and/or "TRAMP_EN" to state that the callback
      points to the correct trampoline (REGS has its own trampoline).
      
      When ftrace_enabled is set to zero, all ftrace locations are a nop,
      so they do not point to any trampoline. But the _EN flags are still
      set. This can cause the accounting to go wrong when ftrace_enabled
      is cleared and an ops that has a trampoline is registered or unregistered.
      
      For example, the following will cause ftrace to crash:
      
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
       # echo 0 > /proc/sys/kernel/ftrace_enabled
       # echo nop > /sys/kernel/debug/tracing/current_tracer
       # echo 1 > /proc/sys/kernel/ftrace_enabled
       # echo function_graph > /sys/kernel/debug/tracing/current_tracer
      
      As function_graph uses a trampoline, when ftrace_enabled is set to zero
      the updates to the record are not done. When enabling function_graph
      again, the record will still have the TRAMP_EN flag set, and it will
      look for an op that has a trampoline other than the function_graph
      ops, and fail to find one.
      
      Cc: stable@vger.kernel.org # 3.17+
      Reported-by: NPratyush Anand <panand@redhat.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      b24d443b
  8. 07 3月, 2015 1 次提交
  9. 06 3月, 2015 2 次提交
  10. 05 3月, 2015 2 次提交
    • T
      workqueue: fix hang involving racing cancel[_delayed]_work_sync()'s for PREEMPT_NONE · 8603e1b3
      Tejun Heo 提交于
      cancel[_delayed]_work_sync() are implemented using
      __cancel_work_timer() which grabs the PENDING bit using
      try_to_grab_pending() and then flushes the work item with PENDING set
      to prevent the on-going execution of the work item from requeueing
      itself.
      
      try_to_grab_pending() can always grab PENDING bit without blocking
      except when someone else is doing the above flushing during
      cancelation.  In that case, try_to_grab_pending() returns -ENOENT.  In
      this case, __cancel_work_timer() currently invokes flush_work().  The
      assumption is that the completion of the work item is what the other
      canceling task would be waiting for too and thus waiting for the same
      condition and retrying should allow forward progress without excessive
      busy looping
      
      Unfortunately, this doesn't work if preemption is disabled or the
      latter task has real time priority.  Let's say task A just got woken
      up from flush_work() by the completion of the target work item.  If,
      before task A starts executing, task B gets scheduled and invokes
      __cancel_work_timer() on the same work item, its try_to_grab_pending()
      will return -ENOENT as the work item is still being canceled by task A
      and flush_work() will also immediately return false as the work item
      is no longer executing.  This puts task B in a busy loop possibly
      preventing task A from executing and clearing the canceling state on
      the work item leading to a hang.
      
      task A			task B			worker
      
      						executing work
      __cancel_work_timer()
        try_to_grab_pending()
        set work CANCELING
        flush_work()
          block for work completion
      						completion, wakes up A
      			__cancel_work_timer()
      			while (forever) {
      			  try_to_grab_pending()
      			    -ENOENT as work is being canceled
      			  flush_work()
      			    false as work is no longer executing
      			}
      
      This patch removes the possible hang by updating __cancel_work_timer()
      to explicitly wait for clearing of CANCELING rather than invoking
      flush_work() after try_to_grab_pending() fails with -ENOENT.
      
      Link: http://lkml.kernel.org/g/20150206171156.GA8942@axis.com
      
      v3: bit_waitqueue() can't be used for work items defined in vmalloc
          area.  Switched to custom wake function which matches the target
          work item and exclusive wait and wakeup.
      
      v2: v1 used wake_up() on bit_waitqueue() which leads to NULL deref if
          the target bit waitqueue has wait_bit_queue's on it.  Use
          DEFINE_WAIT_BIT() and __wake_up_bit() instead.  Reported by Tomeu
          Vizoso.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Reported-by: NRabin Vincent <rabin.vincent@axis.com>
      Cc: Tomeu Vizoso <tomeu.vizoso@gmail.com>
      Cc: stable@vger.kernel.org
      Tested-by: NJesper Nilsson <jesper.nilsson@axis.com>
      Tested-by: NRabin Vincent <rabin.vincent@axis.com>
      8603e1b3
    • R
      genirq / PM: Add flag for shared NO_SUSPEND interrupt lines · 17f48034
      Rafael J. Wysocki 提交于
      It currently is required that all users of NO_SUSPEND interrupt
      lines pass the IRQF_NO_SUSPEND flag when requesting the IRQ or the
      WARN_ON_ONCE() in irq_pm_install_action() will trigger.  That is
      done to warn about situations in which unprepared interrupt handlers
      may be run unnecessarily for suspended devices and may attempt to
      access those devices by mistake.  However, it may cause drivers
      that have no technical reasons for using IRQF_NO_SUSPEND to set
      that flag just because they happen to share the interrupt line
      with something like a timer.
      
      Moreover, the generic handling of wakeup interrupts introduced by
      commit 9ce7a258 (genirq: Simplify wakeup mechanism) only works
      for IRQs without any NO_SUSPEND users, so the drivers of wakeup
      devices needing to use shared NO_SUSPEND interrupt lines for
      signaling system wakeup generally have to detect wakeup in their
      interrupt handlers.  Thus if they happen to share an interrupt line
      with a NO_SUSPEND user, they also need to request that their
      interrupt handlers be run after suspend_device_irqs().
      
      In both cases the reason for using IRQF_NO_SUSPEND is not because
      the driver in question has a genuine need to run its interrupt
      handler after suspend_device_irqs(), but because it happens to
      share the line with some other NO_SUSPEND user.  Otherwise, the
      driver would do without IRQF_NO_SUSPEND just fine.
      
      To make it possible to specify that condition explicitly, introduce
      a new IRQ action handler flag for shared IRQs, IRQF_COND_SUSPEND,
      that, when set, will indicate to the IRQ core that the interrupt
      user is generally fine with suspending the IRQ, but it also can
      tolerate handler invocations after suspend_device_irqs() and, in
      particular, it is capable of detecting system wakeup and triggering
      it as appropriate from its interrupt handler.
      
      That will allow us to work around a problem with a shared timer
      interrupt line on at91 platforms.
      
      Link: http://marc.info/?l=linux-kernel&m=142252777602084&w=2
      Link: http://marc.info/?t=142252775300011&r=1&w=2
      Link: https://lkml.org/lkml/2014/12/15/552Reported-by: NBoris Brezillon <boris.brezillon@free-electrons.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      17f48034
  11. 03 3月, 2015 5 次提交
  12. 01 3月, 2015 3 次提交
  13. 23 2月, 2015 1 次提交
  14. 20 2月, 2015 8 次提交
    • C
      debug: prevent entering debug mode on panic/exception. · 5516fd7b
      Colin Cross 提交于
      On non-developer devices, kgdb prevents the device from rebooting
      after a panic.
      
      Incase of panics and exceptions, to allow the device to reboot, prevent
      entering debug mode to avoid getting stuck waiting for the user to
      interact with debugger.
      
      To avoid entering the debugger on panic/exception without any extra
      configuration, panic_timeout is being used which can be set via
      /proc/sys/kernel/panic at run time and CONFIG_PANIC_TIMEOUT sets the
      default value.
      
      Setting panic_timeout indicates that the user requested machine to
      perform unattended reboot after panic. We dont want to get stuck waiting
      for the user input incase of panic.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: kgdb-bugreport@lists.sourceforge.net
      Cc: linux-kernel@vger.kernel.org
      Cc: Android Kernel Team <kernel-team@android.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Signed-off-by: NColin Cross <ccross@android.com>
      [Kiran: Added context to commit message.
      panic_timeout is used instead of break_on_panic and
      break_on_exception to honor CONFIG_PANIC_TIMEOUT
      Modified the commit as per community feedback]
      Signed-off-by: NKiran Raparthy <kiran.kumar@linaro.org>
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      5516fd7b
    • D
      kdb: Const qualifier for kdb_getstr's prompt argument · 32d375f6
      Daniel Thompson 提交于
      All current callers of kdb_getstr() can pass constant pointers via the
      prompt argument. This patch adds a const qualification to make explicit
      the fact that this is safe.
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      32d375f6
    • D
      kdb: Provide forward search at more prompt · fb6daa75
      Daniel Thompson 提交于
      Currently kdb allows the output of comamnds to be filtered using the
      | grep feature. This is useful but does not permit the output emitted
      shortly after a string match to be examined without wading through the
      entire unfiltered output of the command. Such a feature is particularly
      useful to navigate function traces because these traces often have a
      useful trigger string *before* the point of interest.
      
      This patch reuses the existing filtering logic to introduce a simple
      forward search to kdb that can be triggered from the more prompt.
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      fb6daa75
    • D
      kdb: Fix a prompt management bug when using | grep · ab08e464
      Daniel Thompson 提交于
      Currently when the "| grep" feature is used to filter the output of a
      command then the prompt is not displayed for the subsequent command.
      Likewise any characters typed by the user are also not echoed to the
      display. This rather disconcerting problem eventually corrects itself
      when the user presses Enter and the kdb_grepping_flag is cleared as
      kdb_parse() tries to make sense of whatever they typed.
      
      This patch resolves the problem by moving the clearing of this flag
      from the middle of command processing to the beginning.
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      ab08e464
    • D
      kdb: Remove stack dump when entering kgdb due to NMI · 54543881
      Daniel Thompson 提交于
      Issuing a stack dump feels ergonomically wrong when entering due to NMI.
      
      Entering due to NMI is normally a reaction to a user request, either the
      NMI button on a server or a "magic knock" on a UART. Therefore the
      backtrace behaviour on entry due to NMI should be like SysRq-g (no stack
      dump) rather than like oops.
      
      Note also that the stack dump does not offer any information that
      cannot be trivial retrieved using the 'bt' command.
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      54543881
    • D
      kdb: Avoid printing KERN_ levels to consoles · f7d4ca8b
      Daniel Thompson 提交于
      Currently when kdb traps printk messages then the raw log level prefix
      (consisting of '\001' followed by a numeral) does not get stripped off
      before the message is issued to the various I/O handlers supported by
      kdb. This causes annoying visual noise as well as causing problems
      grepping for ^. It is also a change of behaviour compared to normal usage
      of printk() usage. For example <SysRq>-h ends up with different output to
      that of kdb's "sr h".
      
      This patch addresses the problem by stripping log levels from messages
      before they are issued to the I/O handlers. printk() which can also
      act as an i/o handler in some cases is special cased; if the caller
      provided a log level then the prefix will be preserved when sent to
      printk().
      
      The addition of non-printable characters to the output of kdb commands is a
      regression, albeit and extremely elderly one, introduced by commit
      04d2c8c8 ("printk: convert the format for KERN_<LEVEL> to a 2 byte
      pattern"). Note also that this patch does *not* restore the original
      behaviour from v3.5. Instead it makes printk() from within a kdb command
      display the message without any prefix (i.e. like printk() normally does).
      Signed-off-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Cc: Joe Perches <joe@perches.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      f7d4ca8b
    • J
      kdb: Fix off by one error in kdb_cpu() · df0036d1
      Jason Wessel 提交于
      There was a follow on replacement patch against the prior
      "kgdb: Timeout if secondary CPUs ignore the roundup".
      
      See: https://lkml.org/lkml/2015/1/7/442
      
      This patch is the delta vs the patch that was committed upstream:
        * Fix an off-by-one error in kdb_cpu().
        * Replace NR_CPUS with CONFIG_NR_CPUS to tell checkpatch that we
          really want a static limit.
        * Removed the "KGDB: " prefix from the pr_crit() in debug_core.c
          (kgdb-next contains a patch which introduced pr_fmt() to this file
          to the tag will now be applied automatically).
      
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      df0036d1
    • J
      kdb: fix incorrect counts in KDB summary command output · 14675592
      Jay Lan 提交于
      The output of KDB 'summary' command should report MemTotal, MemFree
      and Buffers output in kB. Current codes report in unit of pages.
      
      A define of K(x) as
      is defined in the code, but not used.
      
      This patch would apply the define to convert the values to kB.
      Please include me on Cc on replies. I do not subscribe to linux-kernel.
      Signed-off-by: NJay Lan <jlan@sgi.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NJason Wessel <jason.wessel@windriver.com>
      14675592
  15. 18 2月, 2015 5 次提交