1. 08 Nov 2011, 4 commits
    • ftrace: Fix hash record accounting bug · d4d34b98
      Steven Rostedt authored
      If the set_ftrace_filter is cleared by writing just whitespace to
      it, then the filter hash refcounts will be decremented but not
      updated. This causes two bugs:
      
      1) No functions will be enabled for tracing when they all should be
      
      2) If the user clears the set_ftrace_filter twice, it will crash ftrace:
      
      ------------[ cut here ]------------
      WARNING: at /home/rostedt/work/git/linux-trace.git/kernel/trace/ftrace.c:1384 __ftrace_hash_rec_update.part.27+0x157/0x1a7()
      Modules linked in:
      Pid: 2330, comm: bash Not tainted 3.1.0-test+ #32
      Call Trace:
       [<ffffffff81051828>] warn_slowpath_common+0x83/0x9b
       [<ffffffff8105185a>] warn_slowpath_null+0x1a/0x1c
       [<ffffffff810ba362>] __ftrace_hash_rec_update.part.27+0x157/0x1a7
       [<ffffffff810ba6e8>] ? ftrace_regex_release+0xa7/0x10f
       [<ffffffff8111bdfe>] ? kfree+0xe5/0x115
       [<ffffffff810ba51e>] ftrace_hash_move+0x2e/0x151
       [<ffffffff810ba6fb>] ftrace_regex_release+0xba/0x10f
       [<ffffffff8112e49a>] fput+0xfd/0x1c2
       [<ffffffff8112b54c>] filp_close+0x6d/0x78
       [<ffffffff8113a92d>] sys_dup3+0x197/0x1c1
       [<ffffffff8113a9a6>] sys_dup2+0x4f/0x54
       [<ffffffff8150cac2>] system_call_fastpath+0x16/0x1b
      ---[ end trace 77a3a7ee73794a02 ]---
      
      Link: http://lkml.kernel.org/r/20111101141420.GA4918@debian
      Reported-by: Rabin Vincent <rabin@rab.in>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • jump_label: jump_label_inc may return before the code is patched · c8452afb
      Gleb Natapov authored
      If cpu A calls jump_label_inc() just after atomic_add_return() is
      called by cpu B, atomic_inc_not_zero() will return a value greater
      than zero and jump_label_inc() will return to its caller before
      jump_label_update() finishes its job on cpu B.
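
      A hedged sketch of the fixed logic (names follow the 3.1-era jump
      label API; details may differ from the actual patch). The key point
      is that the count is only incremented after jump_label_update()
      completes, so a concurrent caller's fast path cannot succeed until
      the code is patched:

      void jump_label_inc(struct jump_label_key *key)
      {
              /* fast path only succeeds once patching has completed */
              if (atomic_inc_not_zero(&key->enabled))
                      return;

              jump_label_lock();      /* waits for an in-flight update */
              if (atomic_read(&key->enabled) == 0)
                      jump_label_update(key, JUMP_LABEL_ENABLE);
              atomic_inc(&key->enabled);      /* publish after patching */
              jump_label_unlock();
      }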
      
      Link: http://lkml.kernel.org/r/20111018175551.GH17571@redhat.com
      
      Cc: stable@vger.kernel.org
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jason Baron <jbaron@redhat.com>
      Signed-off-by: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace: Remove force undef config value left for testing · 8ee3c92b
      Steven Rostedt authored
      A forced undef of a config value was used for testing and was
      accidentally left in during the final commit. This causes x86 to
      run slower than needed while running function tracing, and also
      causes the function graph selftest to fail when DYNAMIC_FTRACE
      is not set. This is because the code in MCOUNT expects the ftrace
      code to be processed with the config value set, but that value
      happened to be forced unset.
      
      The forced config option was left in by:
          commit 6331c28c
          ftrace: Fix dynamic selftest failure on some archs
      
      Link: http://lkml.kernel.org/r/20111102150255.GA6973@debian
      
      Cc: stable@vger.kernel.org
      Reported-by: Rabin Vincent <rabin@rab.in>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • lockdep: Show subclass in pretty print of lockdep output · e5e78d08
      Steven Rostedt authored
      The pretty print of the lockdep debug splat uses just the lock name
      to show how the locking scenario happens. But when it comes to
      nested locks, the output becomes confusing, which defeats the point
      of pretty printing the lock scenario.
      
      Without displaying the subclass info, we get the following output:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(slock-AF_INET);
                                      lock(slock-AF_INET);
                                      lock(slock-AF_INET);
         lock(slock-AF_INET);
      
        *** DEADLOCK ***
      
      The above looks more like an A->A locking bug than an A->B, B->A
      deadlock. By adding the subclass to the output, we can see what
      really happened:
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(slock-AF_INET);
                                      lock(slock-AF_INET/1);
                                      lock(slock-AF_INET);
         lock(slock-AF_INET/1);
      
        *** DEADLOCK ***
      
      This bug was discovered while tracking down a real bug caught by lockdep.
      
      Link: http://lkml.kernel.org/r/20111025202049.GB25043@hostway.ca
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Simon Kirby <sim@hostway.ca>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  2. 05 Nov 2011, 1 commit
    • tracing: Add boiler plate for subsystem filter · 49aa2951
      Steven Rostedt authored
      The system filter can be used to set multiple event filters that
      exist within the system. But currently it displays the last filter
      written, which does not necessarily correspond to the filters within
      the system. The system filter itself is not used to filter any
      events; it is just a means to set the filters of the events within
      it.
      
      Because this causes an ambiguous state, where the system filter
      shows a filter string but the events within the system have
      different strings, it is best to just show a boiler plate:
      
       ### global filter ###
       # Use this to set filters for multiple events.
       # Only events with the given fields will be affected.
       # If no events are modified, an error message will be displayed here.
      
      If an error occurs while writing to the system filter, the system
      filter will replace the boiler plate with the error message as it
      currently does.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  3. 03 Nov 2011, 1 commit
  4. 01 Nov 2011, 1 commit
  5. 14 Oct 2011, 2 commits
  6. 11 Oct 2011, 3 commits
    • tracing: Do not allocate buffer for trace_marker · d696b58c
      Steven Rostedt authored
      When doing intense tracing, the kmalloc inside trace_marker can
      introduce side effects to what is being traced.
      
      As trace_marker() is used by userspace to inject data into the
      kernel ring buffer, it needs to do so with the least amount
      of intrusion to the operations of the kernel or the user space
      application.
      
      As the ring buffer is designed to write directly into the buffer
      without the need for a temporary buffer, and userspace already
      went through the hassle of knowing how big the write will be,
      we can simply pin the userspace pages and write the data directly
      into the buffer. This reduces the impact of tracing via trace_marker
      tremendously!
      
      Thanks to Peter Zijlstra and Thomas Gleixner for pointing out the
      use of get_user_pages_fast() and kmap_atomic().
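
      A hedged sketch of the approach (the function name is illustrative,
      not the real tracing_mark_write(), and it assumes the write does not
      cross a page boundary; the real code also writes event headers and
      handles page-crossing and error unwinding):

      static ssize_t mark_write_pinned(struct ring_buffer *buffer,
                                       const char __user *ubuf, size_t cnt)
      {
              struct ring_buffer_event *event;
              struct page *page;
              void *map;
              int ret;

              /* pin the user page so it cannot be reclaimed under us */
              ret = get_user_pages_fast((unsigned long)ubuf, 1, 0, &page);
              if (ret < 1)
                      return -EFAULT;

              /* reserve space in the ring buffer and copy straight into it */
              event = ring_buffer_lock_reserve(buffer, cnt);
              if (!event) {
                      put_page(page);
                      return -EBADF;
              }
              map = kmap_atomic(page);        /* short-lived atomic mapping */
              memcpy(ring_buffer_event_data(event),
                     map + offset_in_page(ubuf), cnt);
              kunmap_atomic(map);

              ring_buffer_unlock_commit(buffer, event);
              put_page(page);
              return cnt;
      }
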
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Warn on output if the function tracer was found corrupted · e0a413f6
      Steven Rostedt authored
      As the function tracer is very intrusive, lots of self checks are
      performed on the tracer, and if something is found to be strange
      it will shut itself down, keeping it from corrupting the rest of
      the kernel. This shutdown may still allow functions to be traced,
      as the tracing only stops new modifications from happening. Trying
      to stop the function tracer itself can cause more harm, as it
      requires code modification.

      Although a WARN_ON() is executed, a user may not notice it. To help
      the user see that something isn't right with the tracing of the
      system, a big warning is added to the output of the tracer that
      lets the user know that their data may be incomplete.
      Reported-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace/kprobes: Fix not to delete probes if in use · 02ca1521
      Masami Hiramatsu authored
      Fix kprobe-tracer not to delete a probe if the probe is in use.
      In that case, the delete operation will return -EBUSY.
      
      This bug can cause a kernel panic if enabled probes are deleted
      during perf record.
      
      (Add some probes on functions)
      sh-4.2# perf probe --del probe:\*
      sh-4.2# exit
      (kernel panic)
      
      This was originally reported on the fedora bugzilla:
      
       https://bugzilla.redhat.com/show_bug.cgi?id=742383
      
      I've also checked that this problem doesn't happen on
      tracepoints when removing a module, because the perf event
      locks the target module.
      
      $ sudo ./perf record -e xfs:\* -aR sh
      sh-4.2# rmmod xfs
      ERROR: Module xfs is in use
      sh-4.2# exit
      [ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.203 MB perf.data (~8862 samples) ]
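
      A minimal sketch of the guard this fix adds (hedged: simplified, and
      the helper names stand in for however the 3.1-era kprobe-tracer code
      tracks that the event is still used by perf or ftrace):

      static int unregister_trace_probe(struct trace_probe *tp)
      {
              /* refuse to delete while the probe event is still in use */
              if (trace_probe_is_enabled(tp))
                      return -EBUSY;

              __unregister_trace_probe(tp);
              list_del(&tp->list);
              unregister_probe_event(tp);
              return 0;
      }
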
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/20111004104438.14591.6553.stgit@fedora15
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  7. 30 Sep 2011, 2 commits
    • posix-cpu-timers: Cure SMP wobbles · d670ec13
      Peter Zijlstra authored
      David reported:
      
        Attached below is a watered-down version of rt/tst-cpuclock2.c from
        GLIBC.  Just build it with "gcc -o test test.c -lpthread -lrt" or
        similar.
      
        Run it several times, and you will see cases where the main thread
        will measure a process clock difference before and after the nanosleep
        which is smaller than the cpu-burner thread's individual thread clock
        difference.  This doesn't make any sense since the cpu-burner thread
        is part of the top-level process's thread group.
      
        I've reproduced this on both x86-64 and sparc64 (using both 32-bit and
        64-bit binaries).
      
        For example:
      
        [davem@boricha build-x86_64-linux]$ ./test
        process: before(0.001221967) after(0.498624371) diff(497402404)
        thread:  before(0.000081692) after(0.498316431) diff(498234739)
        self:    before(0.001223521) after(0.001240219) diff(16698)
        [davem@boricha build-x86_64-linux]$ 
      
        The diff of 'process' should always be >= the diff of 'thread'.
      
        I make sure to wrap the 'thread' clock measurements the most tightly
        around the nanosleep() call, and that the 'process' clock measurements
        are the outer-most ones.
      
        ---
        #include <unistd.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <time.h>
        #include <fcntl.h>
        #include <string.h>
        #include <errno.h>
        #include <pthread.h>
      
        static pthread_barrier_t barrier;
      
        static void *chew_cpu(void *arg)
        {
      	  pthread_barrier_wait(&barrier);
      	  while (1)
      		  __asm__ __volatile__("" : : : "memory");
      	  return NULL;
        }
      
        int main(void)
        {
      	  clockid_t process_clock, my_thread_clock, th_clock;
      	  struct timespec process_before, process_after;
      	  struct timespec me_before, me_after;
      	  struct timespec th_before, th_after;
      	  struct timespec sleeptime;
      	  unsigned long diff;
      	  pthread_t th;
      	  int err;
      
      	  err = clock_getcpuclockid(0, &process_clock);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(pthread_self(), &my_thread_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_init(&barrier, NULL, 2);
      	  err = pthread_create(&th, NULL, chew_cpu, NULL);
      	  if (err)
      		  return 1;
      
      	  err = pthread_getcpuclockid(th, &th_clock);
      	  if (err)
      		  return 1;
      
      	  pthread_barrier_wait(&barrier);
      
      	  err = clock_gettime(process_clock, &process_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_before);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(th_clock, &th_before);
      	  if (err)
      		  return 1;
      
      	  sleeptime.tv_sec = 0;
      	  sleeptime.tv_nsec = 500000000;
      	  nanosleep(&sleeptime, NULL);
      
      	  err = clock_gettime(th_clock, &th_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(my_thread_clock, &me_after);
      	  if (err)
      		  return 1;
      
      	  err = clock_gettime(process_clock, &process_after);
      	  if (err)
      		  return 1;
      
      	  diff = process_after.tv_nsec - process_before.tv_nsec;
      	  printf("process: before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 process_before.tv_sec, process_before.tv_nsec,
      		 process_after.tv_sec, process_after.tv_nsec, diff);
      	  diff = th_after.tv_nsec - th_before.tv_nsec;
      	  printf("thread:  before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 th_before.tv_sec, th_before.tv_nsec,
      		 th_after.tv_sec, th_after.tv_nsec, diff);
      	  diff = me_after.tv_nsec - me_before.tv_nsec;
      	  printf("self:    before(%lu.%.9lu) after(%lu.%.9lu) diff(%lu)\n",
      		 me_before.tv_sec, me_before.tv_nsec,
      		 me_after.tv_sec, me_after.tv_nsec, diff);
      
      	  return 0;
        }
      
      This is due to us using p->se.sum_exec_runtime in
      thread_group_cputime() where we iterate the thread group and sum all
      data. This does not take time since the last schedule operation (tick
      or otherwise) into account. We can cure this by using
      task_sched_runtime() at the cost of having to take locks.
      
      This also means we can (and must) do away with
      thread_group_sched_runtime() since the modified thread_group_cputime()
      is now more accurate and would deadlock when called from
      thread_group_sched_runtime().
      
      Aside from that, it makes the function safe on 32-bit systems. The
      old code added t->se.sum_exec_runtime unprotected; sum_exec_runtime
      is a 64-bit value and could be changed on another cpu at the same
      time.
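
      A hedged sketch of the key change in thread_group_cputime()
      (simplified from the 3.1-era code; the callers' locking that keeps
      the thread list stable is omitted):

      void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
      {
              struct signal_struct *sig = tsk->signal;
              struct task_struct *t = tsk;

              times->utime = sig->utime;
              times->stime = sig->stime;
              times->sum_exec_runtime = sig->sum_sched_runtime;

              do {
                      times->utime = cputime_add(times->utime, t->utime);
                      times->stime = cputime_add(times->stime, t->stime);
                      /*
                       * Was: times->sum_exec_runtime += t->se.sum_exec_runtime;
                       * task_sched_runtime() takes the runqueue lock and adds
                       * the time since the last tick, fixing both the wobble
                       * and the unlocked 64-bit read on 32-bit systems.
                       */
                      times->sum_exec_runtime += task_sched_runtime(t);
              } while_each_thread(tsk, t);
      }
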
      Reported-by: David Miller <davem@davemloft.net>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Link: http://lkml.kernel.org/r/1314874459.7945.22.camel@twins
      Tested-by: David Miller <davem@davemloft.net>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • Resource: fix wrong resource window calculation · 47ea91b4
      Ram Pai authored
      __find_resource() incorrectly returns a resource window which overlaps
      an existing allocated window.  This happens when the parent's
      resource-window spans 0x00000000 to 0xffffffff and is entirely allocated
      to all its children resource-windows.
      
      __find_resource() looks for gaps in resource allocation among the
      children resource windows.  When it encounters the last child window it
      blindly tries the range next to the one allocated to the last child.
      Since the last child's window ends at 0xffffffff, the calculation
      overflows, leading the algorithm to believe that any window in the
      range 0x00000000 to 0xffffffff is available for allocation.  This
      leads to a conflicting window allocation.
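
      A small user-space illustration of the overflow (hedged: this only
      demonstrates the wrap-around arithmetic, it is not the resource.c
      code):

      #include <stdint.h>
      #include <stdio.h>

      int main(void)
      {
              uint32_t last_child_end = 0xffffffffu;    /* window ends at the top */
              uint32_t next_start = last_child_end + 1; /* wraps to 0x00000000 */

              printf("next candidate start: 0x%08x\n", next_start);

              /* overflow-safe form: detect the wrap before trusting the value */
              if (next_start <= last_child_end)
                      printf("wrapped: no space above the last child window\n");
              return 0;
      }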
      
      Michal Ludvig reported this issue seen on his platform.  The following
      patch fixes the problem and has been verified by Michal.  I believe this
      bug has been there for ages.  It got exposed by git commit 2bbc6942
      ("PCI : ability to relocate assigned pci-resources")
      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
      Tested-by: Michal Ludvig <mludvig@logix.net.nz>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 26 Sep 2011, 2 commits
  9. 22 Sep 2011, 1 commit
  10. 20 Sep 2011, 3 commits
  11. 19 Sep 2011, 1 commit
  12. 18 Sep 2011, 2 commits
  13. 15 Sep 2011, 1 commit
  14. 12 Sep 2011, 1 commit
  15. 31 Aug 2011, 4 commits
    • perf_event: Fix broken calc_timer_values() · 7f310a5d
      Eric B Munson authored
      We detected a serious issue with PERF_SAMPLE_READ and
      timing information when events were being multiplexed.

      Samples would have time_running > time_enabled. That
      was easy to reproduce with a libpfm4 example (run 3
      times to cause multiplexing on Core 2):
      
       $ syst_smpl -e uops_retired:freq=1 &
       $ syst_smpl -e uops_retired:freq=1 &
       $ syst_smpl -e uops_retired:freq=1 &
       IIP:0x0000000040062d ... PERIOD:2355332948 ENA=40144625315 RUN=60014875184
       syst_smpl: WARNING: time_running > time_enabled
      	63277537998 uops_retired:freq=1 , scaled
      
      The bug was not present in kernels up to (and including) 3.0. It
      turns out the bug was introduced by the following commit:
      
      commit c4794295
      
          events: Move lockless timer calculation into helper function
      
      The parameters of the function got reversed, yet the call sites
      were not updated to reflect the change. That led to time_running
      and time_enabled being swapped. That had no effect when there was
      no multiplexing, because in that case time_running = time_enabled,
      but it would show up in any other scenario.
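
      A minimal user-space demonstration of this bug class (hedged: names
      and values are illustrative, not the perf code):

      #include <stdio.h>

      /* the helper's parameter order was changed to (running, enabled)... */
      static void calc_values(unsigned long long *running,
                              unsigned long long *enabled)
      {
              *enabled = 40144625315ULL;      /* ENA from the sample above */
              *running = 20000000000ULL;
      }

      int main(void)
      {
              unsigned long long enabled, running;

              /* ...but the call site still passes (&enabled, &running),
               * so the two values come back swapped */
              calc_values(&enabled, &running);
              printf("ENA=%llu RUN=%llu%s\n", enabled, running,
                     running > enabled ? " (time_running > time_enabled!)" : "");
              return 0;
      }
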
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110829124112.GA4828@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • trace: Add ring buffer stats to measure rate of events · c64e148a
      Vaibhav Nagarnaik authored
      The stats file under per_cpu folder provides the number of entries,
      overruns and other statistics about the CPU ring buffer. However, the
      numbers do not provide any indication of how full the ring buffer is in
      bytes compared to the overall size in bytes. Also, it is helpful to know
      the rate at which the cpu buffer is filling up.
      
      This patch adds an entry "bytes: " in the printed stats for the
      per-cpu ring buffer, which provides the actual bytes consumed in
      the ring buffer. This field includes the number of bytes used by
      recorded events and the padding bytes added when moving the tail
      pointer to the next page.
      
      It also adds the following time stamps:
      "oldest event ts:" - the oldest timestamp in the ring buffer
      "now ts:"  - the timestamp at the time of reading
      
      The field "now ts" provides a consistent time snapshot to the userspace
      when being read. This is read from the same trace clock used by tracing
      event timestamps.
      
      Together, these values provide the rate at which the buffer is filling
      up, from the formula:
      bytes / (now_ts - oldest_event_ts)
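
      A hedged user-space sketch of computing that rate from the stats
      file (assumes debugfs is mounted at /sys/kernel/debug and that the
      three fields appear exactly as named above):

      #include <stdio.h>

      int main(void)
      {
              FILE *f = fopen("/sys/kernel/debug/tracing/per_cpu/cpu0/stats", "r");
              double bytes = 0, oldest = 0, now = 0, v;
              char line[256];

              if (!f)
                      return 1;
              while (fgets(line, sizeof(line), f)) {
                      if (sscanf(line, "bytes: %lf", &v) == 1)
                              bytes = v;
                      else if (sscanf(line, "oldest event ts: %lf", &v) == 1)
                              oldest = v;
                      else if (sscanf(line, "now ts: %lf", &v) == 1)
                              now = v;
              }
              fclose(f);
              if (now > oldest)
                      printf("fill rate: %.2f bytes/s\n", bytes / (now - oldest));
              return 0;
      }
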
      Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: David Sharp <dhsharp@google.com>
      Link: http://lkml.kernel.org/r/1313531179-9323-3-git-send-email-vnagarnaik@google.com
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • trace: Add a new readonly entry to report total buffer size · f81ab074
      Vaibhav Nagarnaik authored
      The current file "buffer_size_kb" reports the size of the per-cpu
      buffer and not the overall memory allocated, which could be
      misleading. A new file "buffer_total_size_kb" adds up all the
      enabled CPU buffer sizes and reports it. This is a read-only entry.
      Signed-off-by: Vaibhav Nagarnaik <vnagarnaik@google.com>
      Cc: Michael Rubin <mrubin@google.com>
      Cc: David Sharp <dhsharp@google.com>
      Link: http://lkml.kernel.org/r/1313531179-9323-2-git-send-email-vnagarnaik@google.com
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Add preempt disable for filter self test · 86b6ef21
      Steven Rostedt authored
      The self testing for event filters does not really need preemption
      disabled, as there are no races at the time of testing, but the
      functions it calls use rcu_dereference_sched(), which will complain
      if preemption is not disabled.
      
      Cc: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  16. 29 Aug 2011, 4 commits
    • perf events: Fix slow and broken cgroup context switch code · a8d757ef
      Stephane Eranian authored
      The current cgroup context switch code was incorrect, leading
      to bogus counts. Furthermore, as soon as there was an active
      cgroup event on a CPU, the context switch cost on that CPU
      would increase by a significant amount, as demonstrated by a
      simple ping/pong example:
      
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10684.51 ctxsw/s
      
      Now start a cgroup perf stat:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
      $ ./pong
       Both processes pinned to CPU1, running for 10s
       6674.61 ctxsw/s
      
      That's a 37% penalty.
      
      Note that pong is not even in the monitored cgroup.
      
      The results shown by perf stat are bogus:
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 100
      
       Performance counter stats for 'sleep 100':
      
       CPU1 <not counted> cycles   test
       CPU1 16,984,189,138 cycles  #    0.000 GHz
      
      The second 'cycles' event should report a count @ CPU clock
      (here 2.4GHz) as it is counting across all cgroups.
      
      The patch below fixes the bogus accounting and bypasses any
      cgroup switches in case the outgoing and incoming tasks are
      in the same cgroup.
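
      A hedged sketch of the bypass (perf_cgroup_from_task() is the
      3.1-era helper; the surrounding function names are illustrative,
      and the real patch reworks the sched_out/sched_in paths):

      static void perf_cgroup_switch_check(struct task_struct *prev,
                                           struct task_struct *next)
      {
              /* same perf cgroup: no PMU switch needed */
              if (perf_cgroup_from_task(prev) == perf_cgroup_from_task(next))
                      return;

              do_cgroup_switch(prev, next);   /* the expensive path */
      }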
      
      With this patch the same test now yields:
       $ ./pong
       Both processes pinned to CPU1, running for 10s
       10775.30 ctxsw/s
      
      Start perf stat with cgroup:
      
       $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
      Run pong outside the cgroup:
       $ /pong
       Both processes pinned to CPU1, running for 10s
       10687.80 ctxsw/s
      
      The penalty is now less than 2%.
      
      And the results for perf stat are correct:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 <not counted> cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
      Now perf stat reports the correct counts for
      the non-cgroup event.
      
      If we run pong inside the cgroup, then we also get the
      correct counts:
      
      $ perf stat -e cycles,cycles -A -a -G test  -C 1 -- sleep 10
      
       Performance counter stats for 'sleep 10':
      
       CPU1 22,297,726,205 cycles test #    0.000 GHz
       CPU1 23,933,981,448 cycles      #    0.000 GHz
      
            10.001457237 seconds time elapsed
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/20110825135803.GA4697@quad
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix a memory leak in __sdt_free() · feff8fa0
      WANG Cong authored
      This patch fixes the following memory leak:
      
      unreferenced object 0xffff880107266800 (size 512):
        comm "sched-powersave", pid 3718, jiffies 4323097853 (age 27495.450s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
          [<ffffffff81133940>] create_object+0x187/0x28b
          [<ffffffff814ac103>] kmemleak_alloc+0x73/0x98
          [<ffffffff811232ba>] __kmalloc_node+0x104/0x159
          [<ffffffff81044b98>] kzalloc_node.clone.97+0x15/0x17
          [<ffffffff8104cb90>] build_sched_domains+0xb7/0x7f3
          [<ffffffff8104d4df>] partition_sched_domains+0x1db/0x24a
          [<ffffffff8109ee4a>] do_rebuild_sched_domains+0x3b/0x47
          [<ffffffff810a00c7>] rebuild_sched_domains+0x10/0x12
          [<ffffffff8104d5ba>] sched_power_savings_store+0x6c/0x7b
          [<ffffffff8104d5df>] sched_mc_power_savings_store+0x16/0x18
          [<ffffffff8131322c>] sysdev_class_store+0x20/0x22
          [<ffffffff81193876>] sysfs_write_file+0x108/0x144
          [<ffffffff81135b10>] vfs_write+0xaf/0x102
          [<ffffffff81135d23>] sys_write+0x4d/0x74
          [<ffffffff814c8a42>] system_call_fastpath+0x16/0x1b
          [<ffffffffffffffff>] 0xffffffffffffffff
      Signed-off-by: WANG Cong <amwang@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org # 3.0
      Link: http://lkml.kernel.org/r/1313671017-4112-1-git-send-email-amwang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Move blk_schedule_flush_plug() out of __schedule() · 9c40cef2
      Thomas Gleixner authored
      There is no real reason to run blk_schedule_flush_plug() with
      interrupts and preemption disabled.
      
      Move it into schedule() and call it when the task is going voluntarily
      to sleep. There might be false positives when the task is woken
      between that call and actually scheduling, but that's not really
      different from being woken immediately after switching away.
      
      This fixes a deadlock in the scheduler where the
      blk_schedule_flush_plug() callchain enables interrupts and thereby
      allows a wakeup of the task that is about to go to sleep.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@kernel.org # 2.6.39+
      Link: http://lkml.kernel.org/n/tip-dwfxtra7yg1b5r65m32ywtct@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Separate the scheduler entry for preemption · c259e01a
      Thomas Gleixner authored
      Block-IO and workqueues call into notifier functions from the
      scheduler core code with interrupts and preemption disabled. These
      calls should be made before entering the scheduler core.
      
      To simplify this, separate the scheduler core code into
      __schedule(). __schedule() is directly called from the places which
      set PREEMPT_ACTIVE and from schedule(). This allows us to add the
      work checks into schedule(), so they are only called when a task
      voluntarily goes to sleep.
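
      A hedged sketch of the resulting structure (close to the 3.1-era
      code but simplified; __schedule() contains the former schedule()
      body):

      static inline void sched_submit_work(struct task_struct *tsk)
      {
              if (!tsk->state)
                      return;         /* not going to sleep voluntarily */
              /*
               * If we are going to sleep and have plugged IO queued,
               * submit it now to avoid the deadlock described in the
               * previous commit.
               */
              if (blk_needs_flush_plug(tsk))
                      blk_schedule_flush_plug(tsk);
      }

      asmlinkage void __sched schedule(void)
      {
              struct task_struct *tsk = current;

              sched_submit_work(tsk); /* interrupts/preemption still enabled */
              __schedule();           /* scheduler core */
      }
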
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: stable@kernel.org # 2.6.39+
      Link: http://lkml.kernel.org/r/20110622174918.813258321@linutronix.de
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  17. 27 Aug 2011, 1 commit
  18. 26 Aug 2011, 2 commits
    • kernel/printk: do not turn off bootconsole in printk_late_init() if keep_bootcon · 4c30c6f5
      Nishanth Aravamudan authored
      It seems that 7bf69395 ("console: allow to retain boot console via
      boot option keep_bootcon") doesn't always achieve what it aims to,
      as when printk_late_init() runs it unconditionally turns off all
      boot consoles.
      With this patch, I am able to see more messages on the boot console in
      KVM guests than I can without, when keep_bootcon is specified.
      
      I think it is appropriate for the relevant -stable trees.  However, it's
      more of an annoyance than a serious bug (ideally you don't need to keep
      the boot console around as console handover should be working -- I was
      encountering a situation where the console handover wasn't working and
      not having the boot console available meant I couldn't see why).
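
      A hedged sketch of the fix in printk_late_init() (the keep_bootcon
      guard is the point; the rest of the function is elided):

      static int __init printk_late_init(void)
      {
              struct console *con;

              for_each_console(con) {
                      /* honor keep_bootcon: leave boot consoles registered */
                      if (!keep_bootcon && con->flags & CON_BOOT)
                              unregister_console(con);
              }
              return 0;
      }
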
      Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: Greg KH <gregkh@suse.de>
      Acked-by: Fabio M. Di Nitto <fdinitto@redhat.com>
      Cc: <stable@kernel.org>		[2.6.39.x, 3.0.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Add a personality to report 2.6.x version numbers · be27425d
      Andi Kleen authored
      I ran into a couple of programs which broke with the new Linux 3.0
      version.  Some of those were binary only.  I tried to use LD_PRELOAD to
      work around it, but it was quite difficult and in one case impossible
      because of a mix of 32bit and 64bit executables.
      
      For example, all kinds of management software from HP don't work
      unless we pretend to run a 2.6 kernel.
      
        $ uname -a
        Linux svivoipvnx001 3.0.0-08107-g97cd98f #1062 SMP Fri Aug 12 18:11:45 CEST 2011 i686 i686 i386 GNU/Linux
      
        $ hpacucli ctrl all show
      
        Error: No controllers detected.
      
        $ rpm -qf /usr/sbin/hpacucli
        hpacucli-8.75-12.0
      
      Another notable case is that Python now reports "linux3" from
      sys.platform(); which in turn can break things that were checking
      sys.platform() == "linux2":
      
        https://bugzilla.mozilla.org/show_bug.cgi?id=664564
      
      It seems pretty clear to me that it's a bug in the apps that are
      using '==' instead of .startswith(), but this allows us to unbreak
      broken programs.
      
      This patch adds a UNAME26 personality that makes the kernel report a
      2.6.40+x version number instead.  The x is the x in 3.x.
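
      A hedged illustration of the mapping (a user-space demo of the
      arithmetic only, not the kernel's personality code):

      #include <stdio.h>

      /* UNAME26 reports 3.x as 2.6.(40 + x) */
      static void uname26(int major, int minor, char *buf, int len)
      {
              if (major == 3)
                      snprintf(buf, len, "2.6.%d", 40 + minor);
              else
                      snprintf(buf, len, "%d.%d", major, minor);
      }

      int main(void)
      {
              char rel[16];

              uname26(3, 1, rel, sizeof(rel));
              printf("%s\n", rel);    /* prints 2.6.41 */
              return 0;
      }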
      
      I know this is somewhat ugly, but I didn't find a better workaround, and
      compatibility to existing programs is important.
      
      Some programs also read /proc/sys/kernel/osrelease.  This can be
      worked around in user space with mount --bind (and a mount
      namespace).
      
      To use:
      
        wget ftp://ftp.kernel.org/pub/linux/kernel/people/ak/uname26/uname26.c
        gcc -o uname26 uname26.c
        ./uname26 program
      Signed-off-by: Andi Kleen <ak@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 24 Aug 2011, 1 commit
    • Revert "irq: Always set IRQF_ONESHOT if no primary handler is specified" · 69dd3d8e
      Linus Torvalds authored
      This reverts commit f3637a5f.
      
      It turns out that this breaks several drivers, one example being OMAP
      boards which use the on-board OMAP UARTs and the omap-serial driver that
      will not boot to userspace after the commit.
      
      Paul Walmsley reports that enabling CONFIG_DEBUG_SHIRQ reveals 'IRQ
      handler type mismatch' errors:
      
        IRQ handler type mismatch for IRQ 74
        current handler: serial idle
        ...
      
      and the reason is that setting IRQF_ONESHOT will now result in those
      interrupt handlers having different IRQF flags, and thus being
      unsharable.  So the commit log in the reverted commit:
      
                                  "Since it is required for those users and
          there is no difference for others it makes sense to add this flag
          unconditionally."
      
      is simply not true: there may not be any difference in the actions
      taken at irq time, but there is a *big* difference in how irq
      management tests this flag (see __setup_irq() in kernel/irq/manage.c).
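
      A hedged illustration of the check that trips (shaped after the
      flag-mismatch test in __setup_irq(); not a verbatim quote):

      /*
       * Shared handlers must agree on their IRQF flags; forcing
       * IRQF_ONESHOT onto some requesters makes them unsharable
       * with handlers registered without it.
       */
      if (!((old->flags & new->flags) & IRQF_SHARED) ||
          ((old->flags ^ new->flags) & IRQF_ONESHOT))
              goto mismatch;  /* "IRQ handler type mismatch" */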
      
      One solution may be to stop verifying IRQF_ONESHOT in __setup_irq(), but
      right now the safe course of action is to revert the change.  Let's
      revisit this in a later merge window.
      Reported-by: Paul Walmsley <paul@pwsan.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Requested-by: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  20. 20 Aug 2011, 3 commits