1. 11 July 2008 (3 commits)
    • ftrace: trace schedule · 1e16c0a0
      Committed by Steven Rostedt
      After the sched_clock code has been removed from sched.c we can now trace
      the scheduler. The scheduler has a lot of functions that would be worth
      tracing.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: define function trace nop · 001b6767
      Committed by Steven Rostedt
      When CONFIG_FTRACE is not enabled, tracing_start_function_trace
      and tracing_stop_function_trace should be nops.
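      
      A minimal sketch of the usual pattern (the inline-stub shape is an
      assumption, not a quote from the patch):
      
      	#ifdef CONFIG_FTRACE
      	void tracing_start_function_trace(void);
      	void tracing_stop_function_trace(void);
      	#else
      	/* ftrace compiled out: the calls must compile away to nothing */
      	static inline void tracing_start_function_trace(void) { }
      	static inline void tracing_stop_function_trace(void) { }
      	#endif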
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: move sched_switch enable after markers · 007c05d4
      Committed by Steven Rostedt
      We now have two markers that fire on sched_switch: one records the
      context switch and the other records task wake-ups. Currently we
      enable the tracing first and then set the markers, which produces
      some confusing traces:
      
      # tracer: sched_switch
      #
      #           TASK-PID   CPU#    TIMESTAMP  FUNCTION
      #              | |      |          |         |
             trace-cmd-3973  [00]   115.834817:   3973:120:R   +     3:  0:S
             trace-cmd-3973  [01]   115.834910:   3973:120:R   +     6:  0:S
             trace-cmd-3973  [02]   115.834910:   3973:120:R   +     9:  0:S
             trace-cmd-3973  [03]   115.834910:   3973:120:R   +    12:  0:S
             trace-cmd-3973  [02]   115.834910:   3973:120:R   +     9:  0:S
                <idle>-0     [02]   115.834910:      0:140:R ==>  3973:120:R
      
      Here we see that trace-cmd with PID 3973 wakes up task 9, but the
      next line shows the idle task doing a context switch to task 3973.
      
      Moving the enabling of tracing to _after_ the markers are set
      produces much saner output (see the sketch below the trace):
      
      # tracer: sched_switch
      #
      #           TASK-PID   CPU#    TIMESTAMP  FUNCTION
      #              | |      |          |         |
                <idle>-0     [02]  7922.634225:      0:140:R ==>  4790:120:R
             trace-cmd-4789  [03]  7922.634225:      0:140:R   +  4790:120:R
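      
      A minimal sketch of the new ordering (the function names here are
      illustrative, not the actual ones from the patch):
      
      	static void start_sched_switch_trace(struct trace_array *tr)
      	{
      		/* hook the context-switch and wakeup markers first ... */
      		sched_switch_register_markers(tr);
      		/* ... and only then let events into the ring buffer */
      		tracer_enabled = 1;
      	}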
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 03 July 2008 (1 commit)
  3. 01 July 2008 (1 commit)
    • mmiotrace broken in linux-next (8-bit writes only) · 3e61e0c9
      Committed by Pekka Paalanen
      The moment mmiotrace is enabled, I hit a NULL deref in:
      
      IP: [<ffffffff80256e71>] __trace_special+0x17c/0x23a
      Call Trace:
       [<ffffffff802573cc>] ftrace_special+0x6f/0x9a
       [<ffffffff8023e3e4>] down+0x19/0x4a
       [<ffffffff80228adc>] acquire_console_sem+0x42/0x58
       [<ffffffff8035d273>] con_flush_chars+0x28/0x43
       [<ffffffff80354a70>] write_chan+0x22e/0x334
       [<ffffffff802244e9>] ? default_wake_function+0x0/0xf
       [<ffffffff8035236d>] tty_write+0x195/0x228
       [<ffffffff80354842>] ? write_chan+0x0/0x334
       [<ffffffff8027c23a>] vfs_write+0xae/0x137
       [<ffffffff8027c6e3>] sys_write+0x47/0x70
       [<ffffffff8020b1db>] system_call_after_swapgs+0x7b/0x80
      
      which means 'entry' in __trace_special() is NULL.
      
      [ mingo@elte.hu: that ftrace_special() was a leftover. ]
      Signed-off-by: Pekka Paalanen <pq@iki.fi>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Cc: proski@gnu.org
      Cc: "Vegard Nossum" <vegard.nossum@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 24 June 2008 (4 commits)
    • kgdb: sparse fix · aabdc3b8
      Committed by Jason Wessel
      - Fix warning reported by sparse
      kernel/kgdb.c:1502:6: warning: symbol 'kgdb_console_write' was not declared.
      	Should it be static?
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
    • ftrace: avoid modifying kprobe'd records · f22f9a89
      Committed by Abhishek Sagar
      Avoid modifying the mcount call-site if there is a kprobe installed
      on it. These records are not marked as failed, however, which allows
      the filter rules on them to remain up-to-date. Whenever the kprobe on
      the corresponding record is removed, the record gets updated as
      normal.
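      
      A sketch of the idea, assuming a lookup in the kprobe hash via
      get_kprobe() (the real patch differs in detail):
      
      	#include <linux/kprobes.h>
      
      	/* leave a call-site alone while a kprobe owns it */
      	static int ftrace_safe_to_modify(struct dyn_ftrace *rec)
      	{
      		if (get_kprobe((void *)rec->ip))
      			return 0;	/* skip, but don't mark as failed */
      		return 1;
      	}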
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: freeze kprobe'd records · ecea656d
      Committed by Abhishek Sagar
      Let records identified as being kprobe'd be marked as "frozen". The
      trouble with records which have a kprobe installed on their mcount
      call-site is that they don't get updated. So if such a function which
      is currently being traced gets its tracing disabled due to a new
      filter rule (or because it was added to the notrace list), it won't
      be updated and will continue being traced. This patch allows scanning
      of all frozen records during tracing to check if they should be
      traced.
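      
      Roughly, with a hypothetical flag bit (the value below is made up):
      
      	#define FTRACE_FL_FROZEN	(1 << 6)	/* hypothetical bit */
      
      	/* freeze records that a kprobe currently owns; frozen records
      	   are re-checked against the filters while tracing runs */
      	static void ftrace_freeze_record(struct dyn_ftrace *rec)
      	{
      		if (get_kprobe((void *)rec->ip))
      			rec->flags |= FTRACE_FL_FROZEN;
      	}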
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: store mcount address in rec->ip · 395a59d0
      Committed by Abhishek Sagar
      Record the address of the mcount call-site. Currently all archs except sparc64
      record the address of the instruction following the mcount call-site. Some
      general cleanups are entailed. Storing mcount addresses in rec->ip enables
      looking them up in the kprobe hash table later on to check if they're kprobe'd.
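      
      Conceptually, on an arch where mcount is reached by a call
      instruction (the size below is an x86-like assumption):
      
      	#define MCOUNT_INSN_SIZE	5	/* size of "call mcount" */
      
      	/* mcount sees the return address; back up to the call-site
      	   itself before storing it in rec->ip */
      	static unsigned long mcount_call_site(unsigned long ret_addr)
      	{
      		return ret_addr - MCOUNT_INSN_SIZE;
      	}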
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Cc: davem@davemloft.net
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 23 June 2008 (1 commit)
    • futexes: fix fault handling in futex_lock_pi · 1b7558e4
      Committed by Thomas Gleixner
      This patch addresses a very sporadic pi-futex related failure in
      highly threaded java apps on large SMP systems.
      
      David Holmes reported that the pi_state consistency check in
      lookup_pi_state triggered with his test application. This means that
      the kernel internal pi_state and the user space futex variable are out
      of sync. First we assumed that this was user space data corruption,
      but deeper investigation revealed that the problem happened because
      the pi-futex code does not handle a fault in the futex_lock_pi path
      when the user space variable needs to be fixed up.
      
      The fault happens when a fork has mapped the anon memory which
      contains the futex read-only for COW, or when the page got swapped
      out, exactly between the unlock of the futex and the return of either
      the new futex owner or the task which was the expected owner but
      failed to acquire the kernel internal rtmutex. The current
      futex_lock_pi() code drops out with an inconsistent state in case it
      faults, and returns -EFAULT to user space. User space has no way to
      fix up that state.
      
      When we wrote this code we thought that we could not drop the hash
      bucket lock at this point to handle the fault.
      
      After analysing the code again it turned out to be wrong because there
      are only two tasks involved which might modify the pi_state and the
      user space variable:
      
       - the task which acquired the rtmutex
       - the pending owner of the pi_state which did not get the rtmutex
      
      Both tasks drop into the fixup_pi_state() function before returning to
      user space. The first task which acquired the hash bucket lock faults
      in the fixup of the user space variable, drops the spinlock and calls
      futex_handle_fault() to fault in the page. Now the second task can
      acquire the hash bucket lock and try to fix up the user space
      variable as well. It either faults as well, or it succeeds because
      the first task already faulted the page in.
      
      One caveat is to avoid a double fixup. After returning from the fault
      handling we reacquire the hash bucket lock and check whether the
      pi_state owner has been modified already.
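      
      A rough sketch of the retry path (names approximate the 2.6.26 futex
      code; this is not a verbatim excerpt from the patch):
      
      	for (;;) {
      		ret = fixup_pi_state_owner(uaddr, &q, curr);
      		if (ret != -EFAULT)
      			break;				/* fixed up, or a real error */
      		spin_unlock(q.lock_ptr);		/* can't fault with it held */
      		ret = futex_handle_fault((unsigned long)uaddr, fshared, attempt++);
      		spin_lock(q.lock_ptr);
      		if (ret)
      			break;				/* fault handling failed */
      		if (q.pi_state->owner == curr)
      			break;				/* other task already fixed it up */
      	}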
      Reported-by: David Holmes <david.holmes@sun.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Holmes <david.holmes@sun.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
       kernel/futex.c |   93 ++++++++++++++++++++++++++++++++++++++++++++-------------
       1 file changed, 73 insertions(+), 20 deletions(-)
  6. 20 June 2008 (3 commits)
    • sched: refactor wait_for_completion_timeout() · ea71a546
      Committed by Oleg Nesterov
      Simplify the code and fix the boundary condition of
      wait_for_completion_timeout(,0).
      
      We can kill the first __remove_wait_queue() as well.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
    • sched: fix wait_for_completion_timeout() spurious failure under heavy load · bb10ed09
      Committed by Roland Dreier
      It seems that the current implementation of wait_for_completion_timeout()
      has a small problem under very high load for the common pattern:
      
      	if (!wait_for_completion_timeout(&done, timeout))
      		/* handle failure */
      
      because the implementation very roughly does (lots of code deleted to
      show the basic flow):
      
      	static inline long __sched
      	do_wait_for_common(struct completion *x, long timeout, int state)
      	{
      		if (x->done)
      			return timeout;
      
      		do {
      			timeout = schedule_timeout(timeout);
      
      			if (!timeout)
      				return timeout;
      
      		} while (!x->done);
      
      		return timeout;
      	}
      
      so if the system is very busy and x->done is not set when
      do_wait_for_common() is entered, it is possible that the first call to
      schedule_timeout() returns 0 because the task doing wait_for_completion
      doesn't get rescheduled for a long time, even if it is woken up early
      enough.
      
      In this case, wait_for_completion_timeout() returns 0 without even
      checking x->done again, and the code above falls into its failure case
      purely for scheduler reasons, even if the hardware event or whatever was
      being waited for happened early enough.
      
      It would make sense to add an extra test to do_wait_for() in the timeout
      case and return 1 if x->done is actually set.
      
      A quick audit (not exhaustive) of wait_for_completion_timeout() callers
      seems to indicate that no one actually cares about the return value in
      the success case -- they just test for 0 (timed out) versus non-zero
      (wait succeeded).
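      
      The suggested extra test, sketched against the flow shown above:
      
      	do {
      		timeout = schedule_timeout(timeout);
      
      		if (!timeout)
      			/* re-check: the completion may have arrived just in
      			   time, so report success rather than a timeout */
      			return x->done ? 1 : 0;
      	} while (!x->done);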
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt: dont stop the period timer when there are tasks wanting to run · 8a8cde16
      Committed by Peter Zijlstra
      So if the group ever gets throttled, it will never wake up again.
      Reported-by: N"Daniel K." <dk@uw.no>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: NDaniel K. <dk@uw.no>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8a8cde16
  7. 19 June 2008 (9 commits)
    • sched, delay accounting: fix incorrect delay time when constantly waiting on runqueue · d4abc238
      Committed by Bharath Ravi
      This patch corrects the incorrect value of per-process run-queue wait
      time reported by delay statistics. The anomaly was due to the
      following reason. When a process leaves the CPU and immediately
      starts waiting for the CPU on the runqueue (which means it remains in
      the TASK_RUNNING state), the time of re-entry into the run-queue is
      never recorded. Due to this, the waiting time on the runqueue from
      this point of re-entry up to the next time it hits the CPU is not
      accounted for. This is solved by recording the time of re-entry of a
      process leaving the CPU in the sched_info_depart() function IF the
      process will go back to waiting on the run-queue. This IF condition
      is verified by checking whether the process is still in the
      TASK_RUNNING state.
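      
      The fix amounts to something like this (a sketch; sched_info_queued()
      is the existing helper that stamps the enqueue time):
      
      	static inline void sched_info_depart(struct task_struct *t)
      	{
      		unsigned long long delta = task_rq(t)->clock -
      					t->sched_info.last_arrival;
      
      		t->sched_info.cpu_time += delta;
      
      		/* still runnable: the task goes straight back to waiting
      		   on the runqueue, so record the re-entry time now */
      		if (t->state == TASK_RUNNING)
      			sched_info_queued(t);
      	}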
      
      The patch was tested on 2.6.26-rc6 using two simple CPU hog programs.
      The values noted prior to the fix did not account for the time spent on
      the runqueue waiting. After the fix, the correct values were reported
      back to user space.
      Signed-off-by: Bharath Ravi <bharathravi1@gmail.com>
      Signed-off-by: Madhava K R <madhavakr@gmail.com>
      Cc: dhaval@linux.vnet.ibm.com
      Cc: vatsa@in.ibm.com
      Cc: balbir@in.ibm.com
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression · 9c106c11
      Committed by Jason Wessel
      The touch_nmi_watchdog() routine on x86 ultimately calls
      touch_softlockup_watchdog().  The problem is that to touch the
      softlockup watchdog, the cpu_clock code has to be called which could
      involve multiple cpu locks and can lead to a hard hang if one of the
      locks is held by a processor that is not going to return anytime soon
      (such as could be the case with kgdb or perhaps even with some other
      kind of exception).
      
      This patch causes the public version of the
      touch_softlockup_watchdog() to defer the cpu clock access to a later
      point.
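      
      A minimal sketch of the deferral (close in spirit to the fix, but
      treat it as an approximation):
      
      	void touch_softlockup_watchdog(void)
      	{
      		/* no cpu_clock() here: just mark the timestamp stale and
      		   let softlockup_tick() refresh it from a safe context */
      		__raw_get_cpu_var(touch_timestamp) = 0;
      	}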
      
      The test case for this problem is to use the following kernel config
      options:
      
      CONFIG_KGDB_TESTS=y
      CONFIG_KGDB_TESTS_ON_BOOT=y
      CONFIG_KGDB_TESTS_BOOT_STRING="V1F100I100000"
      
      It should be noted that kgdb test suite and these options were not
      available until 2.6.26-rc2, so it was necessary to patch the kgdb
      test suite during the bisection.
      
      I would consider this patch a regression fix because the problem first
      appeared in commit 27ec4407 when some
      logic was added to try to periodically sync the clocks.  It was
      possible to work around this particular problem by simply not
      performing the sync anytime the system was in a critical context.
      This was ok until commit 3e51f33f,
      which added config option CONFIG_HAVE_UNSTABLE_SCHED_CLOCK and some
      multi-cpu locks to sync the clocks.  It became clear that accessing
      this code from an NMI was the source of the lockups.  Avoiding the
      access to the low level clock code from code inside the NMI
      processing also fixed the problem with the 27ec44... commit.
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • rcupreempt: remove export of rcu_batches_completed_bh · afd38009
      Committed by Steven Rostedt
      In rcupreempt, rcu_batches_completed_bh is defined as a static inline in
      the header file. This does not need to be exported, and not only that,
      this breaks my PPC build.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: paulus@samba.org
      Cc: linuxppc-dev@ozlabs.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • cpuset: limit the input of cpuset.sched_relax_domain_level · 30e0e178
      Committed by Li Zefan
      We allow the inputs to be [-1 ... SD_LV_MAX), and return -EINVAL
      for inputs outside this range.
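      
      In code form, the check is essentially (a sketch):
      
      	static int update_relax_domain_level(struct cpuset *cs, s64 val)
      	{
      		if (val < -1 || val >= SD_LV_MAX)
      			return -EINVAL;		/* reject out-of-range input */
      
      		if (val != cs->relax_domain_level) {
      			cs->relax_domain_level = val;
      			rebuild_sched_domains();
      		}
      		return 0;
      	}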
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Acked-by: Paul Jackson <pj@sgi.com>
      Acked-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: CPU hotplug events must not destroy scheduler domains created by the cpusets · f18f982a
      Committed by Max Krasnyansky
      The first issue is not related to the cpusets: we're simply leaking
      doms_cur. It's allocated in arch_init_sched_domains(), which is called
      for every hotplug event, so we just keep reallocating doms_cur without
      freeing it. I introduced a free_sched_domains() function that cleans
      things up.
      
      The second issue is that sched domains created by the cpusets are
      completely destroyed by CPU hotplug events: for every hotplug event
      the scheduler attaches all CPUs to the NULL domain and then puts them
      all into a single domain, thereby destroying the domains created by
      the cpusets (partition_sched_domains). The solution is simple: when
      cpusets are enabled, the scheduler should not create the default
      domain and should instead let the cpusets do that, which is exactly
      what the patch does.
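      
      A sketch of the leak fix (the cpuset-aware domain setup is omitted;
      names follow the description above):
      
      	/* free the doms_cur array that arch_init_sched_domains()
      	   allocated on the previous hotplug event */
      	static void free_sched_domains(void)
      	{
      		ndoms_cur = 0;
      		if (doms_cur != &fallback_doms)
      			kfree(doms_cur);
      		doms_cur = &fallback_doms;
      	}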
      Signed-off-by: Max Krasnyansky <maxk@qualcomm.com>
      Cc: pj@sgi.com
      Cc: menage@google.com
      Cc: rostedt@goodmis.org
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: rt-group: fix RR buglet · 15a8641e
      Committed by Peter Zijlstra
      In tick_task_rt() we first call update_curr_rt(), which can dequeue
      a runqueue due to it running out of runtime, and then we try to
      requeue the task if it has also exhausted its RR quota. Obviously,
      requeueing something that is no longer on the runqueue will not have
      the expected result.
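      
      The fix boils down to a guard along these lines (a sketch):
      
      	/* update_curr_rt() may have thrown the task off the runqueue
      	   via throttling; only requeue it if it is still queued */
      	if (p->se.on_rq)
      		requeue_task_rt(rq, p);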
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Daniel K. <dk@uw.no>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt-group: heirarchy aware throttle · ad2a3f13
      Committed by Peter Zijlstra
      The bandwidth throttle code dequeues a group when it runs out of quota, and
      re-queues it once the period rolls over and the quota gets refreshed.
      
      Sadly it failed to take the hierarchy into consideration. Share more
      of the enqueue/dequeue code with regular task operations.
      
      Also, some operations like sched_setscheduler() can dequeue/enqueue
      tasks that are in throttled runqueues; we should not inadvertently
      re-enqueue empty runqueues, so check for that.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Daniel K. <dk@uw.no>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: rt-group: fix hierarchy · 7ea56616
      Committed by Peter Zijlstra
      Don't re-set the entity's runqueue to the wrong rq after we've set it
      to the right one.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Daniel K. <dk@uw.no>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: NULL pointer dereference while setting sched_rt_period_us · 49307fd6
      Committed by Dario Faggioli
      When CONFIG_RT_GROUP_SCHED and CONFIG_CGROUP_SCHED are enabled, with:
      
       echo 10000 > /proc/sys/kernel/sched_rt_period_us
      
      We get this:
      
       BUG: unable to handle kernel NULL pointer dereference at 0000008c
       [  947.682233] IP: [<c0216b72>] __rt_schedulable+0x12/0x160
       [  947.683123] *pde = 00000000
       [  947.683782] Oops: 0000 [#1]
       [  947.684307] Modules linked in:
       [  947.684308]
       [  947.684308] Pid: 2359, comm: bash Not tainted (2.6.26-rc6 #8)
       [  947.684308] EIP: 0060:[<c0216b72>] EFLAGS: 00000246 CPU: 0
       [  947.684308] EIP is at __rt_schedulable+0x12/0x160
       [  947.684308] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
       [  947.684308] ESI: c0521db4 EDI: 00000001 EBP: c6cc9f00 ESP: c6cc9ed0
       [  947.684308]  DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
       [  947.684308] Process bash (pid: 2359, ti=c6cc8000 task=c7a54f00 task.ti=c6cc8000)
       [  947.684308] Stack: c0222790 00000000 080f8c08 c0521db4 c6cc9f00 00000001 00000000 00000000
       [  947.684308]        c6cc9f9c 00000000 c0521db4 00000001 c6cc9f28 c0216d40 00000000 00000000
       [  947.684308]        c6cc9f9c 000f4240 000e7ef0 ffffffff c0521db4 c79dfb60 c6cc9f58 c02af2cc
       [  947.684308] Call Trace:
       [  947.684308]  [<c0222790>] ? do_proc_dointvec_conv+0x0/0x50
       [  947.684308]  [<c0216d40>] ? sched_rt_handler+0x80/0x110
       [  947.684308]  [<c02af2cc>] ? proc_sys_call_handler+0x9c/0xb0
       [  947.684308]  [<c02af2fa>] ? proc_sys_write+0x1a/0x20
       [  947.684308]  [<c0273c36>] ? vfs_write+0x96/0x160
       [  947.684308]  [<c02af2e0>] ? proc_sys_write+0x0/0x20
       [  947.684308]  [<c027423d>] ? sys_write+0x3d/0x70
       [  947.684308]  [<c0202ef5>] ? sysenter_past_esp+0x6a/0x91
       [  947.684308]  =======================
       [  947.684308] Code: 24 04 e8 62 b1 0e 00 89 c7 89 f8 8b 5d f4 8b 75
       f8 8b 7d fc 89 ec 5d c3 90 55 89 e5 57 56 53 83 ec 24 89 45 ec 89 55 e4
       89 4d e8 <8b> b8 8c 00 00 00 85 ff 0f 84 c9 00 00 00 8b 57 24 39 55 e8
       8b
       [  947.684308] EIP: [<c0216b72>] __rt_schedulable+0x12/0x160 SS:ESP  0068:c6cc9ed0
      
      We think the following patch solves the issue.
      Signed-off-by: Dario Faggioli <raistlin@linux.it>
      Signed-off-by: Michael Trimarchi <trimarchimichael@yahoo.it>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 17 June 2008 (1 commit)
  9. 16 June 2008 (1 commit)
  10. 14 June 2008 (1 commit)
  11. 13 June 2008 (1 commit)
  12. 12 June 2008 (3 commits)
    • sched: 64-bit: fix arithmetics overflow · 7a232e03
      Committed by Lai Jiangshan
      (overflow means weight >= 2^32 here, because inv_weight = 2^32/weight)
      
      The weight of a cfs_rq is the sum of the weights of the entities
      queued on it, so it will overflow when there are too many entities.
      
      Although overflow occurs very rarely, it breaks fairness when it does.
      64-bit systems have more memory than 32-bit systems and can usually
      create more processes, so overflow may occur more frequently there.
      
      This patch guarantees fairness when overflow happens on 64-bit
      systems. Thanks to compiler optimization, it changes nothing on
      32-bit systems.
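      
      To see why, recall the inverse-weight trick (constants as in the
      scheduler of that era; the helper name here is made up):
      
      	#if BITS_PER_LONG == 32
      	# define WMULT_CONST	(~0UL)
      	#else
      	# define WMULT_CONST	(1UL << 32)
      	#endif
      
      	/* once the summed weight reaches 2^32 on 64-bit, inv_weight
      	   truncates to 0 and every scaled delta collapses with it */
      	static unsigned long calc_inv_weight(unsigned long weight)
      	{
      		return WMULT_CONST / weight;
      	}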
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fair group: fix overflow(was: fix divide by zero) · 2e084786
      Committed by Lai Jiangshan
      I found a bug which can be reproduced this way (linux-2.6.26-rc5,
      x86-64; use 2^32, 2^33, ..., 2^63 as the shares value):
      
      # mkdir /dev/cpuctl
      # mount -t cgroup -o cpu cpuctl /dev/cpuctl
      # cd /dev/cpuctl
      # mkdir sub
      # echo 0x8000000000000000 > sub/cpu.shares
      # echo $$ > sub/tasks
      oops here! divide by zero.
      
      This is because do_div() expects the second parameter to be 32 bits,
      but unsigned long is 64 bits on x86_64.
      
      Peter Zijlstra pointed out that the sane thing to do is to limit the
      shares value to something smaller instead of using an even more
      expensive divide.
      
      Also, I found another bug where the shares value is too large:
      
      pid1 and pid2 are set affinity to cpu#0
      pid1 is attached to cg1 and pid2 is attached to cg2
      
      if cg1/cpu.shares = 1024 cg2/cpu.shares = 2000000000
      then pid2 got 100% usage of cpu, and pid1 0%
      
      if cg1/cpu.shares = 1024 cg2/cpu.shares = 20000000000
      then pid2 got 0% usage of cpu, and pid1 100%
      
      And since the weight of a cfs_rq is the sum of the weights of the
      entities queued on it, the shares value should be limited to a
      smaller value.
      
      I think that (1UL << 18) is a good limit value:
      
      1) it's not too large: we can create a lot of groups before overflow
      2) it's several times the weight value for nice=-19 (not too small)
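      
      The resulting clamp, roughly (a sketch; MIN_SHARES matches the
      scheduler's existing minimum):
      
      	#define MIN_SHARES	2
      	#define MAX_SHARES	(1UL << 18)
      
      	/* clamp user-supplied shares before weights are derived from them */
      	if (shares < MIN_SHARES)
      		shares = MIN_SHARES;
      	else if (shares > MAX_SHARES)
      		shares = MAX_SHARES;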
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: fix printout · 20764ff1
      Committed by Jiri Slaby
      Do not print loglevel before "entries of %ld bytes". Move it to the previous
      pr_info.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Cc: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  13. 10 June 2008 (7 commits)
    • ftrace: disable tracing when current_tracer is set to "none" · 2b1bce17
      Committed by Ankita Garg
      Found that in spite of setting the current_tracer to "none", trace
      data from the previous tracer type continued to be collected. The
      patch below fixes this and causes tracing to be disabled when the
      "none" type is selected.
      
      Compile and boot tested the patch for functionality.
      Signed-off-by: Ankita Garg <ankita@in.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: sched_clock() lockdep fix · 040ec23d
      Committed by Ingo Molnar
      Sitsofe Wheeler bisected a problem, where lockdep warns about itself
      and turns itself off, to the following commit:
      
      > commit c6531cce
      > Author: Ingo Molnar <mingo@elte.hu>
      > Date:   Mon May 12 21:21:14 2008 +0200
      >
      >     sched: do not trace sched_clock
      
      Do not use raw irq flags in cpu_clock(), as that causes lockdep to
      lose track of the true state of the IRQ flag.
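      
      In other words, something like this (a sketch; sched_clock_cpu()
      stands in for the per-cpu clock read):
      
      	unsigned long long cpu_clock(int cpu)
      	{
      		unsigned long long clock;
      		unsigned long flags;
      
      		/* local_irq_save() keeps lockdep informed; the raw variant
      		   hides the disable and makes lockdep distrust itself */
      		local_irq_save(flags);
      		clock = sched_clock_cpu(cpu);
      		local_irq_restore(flags);
      
      		return clock;
      	}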
      Reported-and-bisected-by: Sitsofe Wheeler <sitsofe@yahoo.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • ftrace: prevent freeing of all failed updates · 34078a5e
      Committed by Abhishek Sagar
      Steven Rostedt wrote:
      > If we unload a module and reload it, will it ever get converted again?
      
      The intent was always to filter core kernel functions to prevent their freeing.
      Here's a fix which should allow re-recording of module call-sites.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: add debugfs entry 'failures' · eb9a7bf0
      Committed by Abhishek Sagar
      Identify functions whose mcount call-site updates failed. This can
      help us track functions which ftrace shouldn't fiddle with and which
      are thus not being traced. If there is no race with any external
      agent that is modifying the mcount call-site, this file displays no
      entries (the normal case).
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: remove ftrace_ip_converted() · 1d74f2a0
      Committed by Abhishek Sagar
      Remove the unneeded function ftrace_ip_converted().
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: prevent freeing of all failed updates · 0eb96701
      Committed by Abhishek Sagar
      Prevent freeing of records which cause problems and correspond to
      functions from core kernel text. A new flag, FTRACE_FL_CONVERTED, is
      used to mark a record as "converted". All other records are patched
      lazily to NOPs. Failed records now also remain on the ftrace_hash
      table. Each invocation of ftrace_record_ip now checks whether the
      traced function has ever been recorded (including past failures) and
      doesn't re-record it again.
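      
      Sketched with a hypothetical flag bit (the value below is made up):
      
      	#define FTRACE_FL_CONVERTED	(1 << 5)	/* hypothetical bit */
      
      	/* only a successfully patched record is marked converted and
      	   is therefore eligible for freeing on module unload */
      	if (ftrace_code_disable(rec))
      		rec->flags |= FTRACE_FL_CONVERTED;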
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: fix TASK_WAKEKILL vs SIGKILL race · 16882c1e
      Committed by Oleg Nesterov
      schedule() has the special "TASK_INTERRUPTIBLE && signal_pending()" case,
      this allows us to do
      
      	current->state = TASK_INTERRUPTIBLE;
      	schedule();
      
      without fear to sleep with pending signal.
      
      However, the code like
      
      	current->state = TASK_KILLABLE;
      	schedule();
      
      is not right, schedule() doesn't take TASK_WAKEKILL into account. This means
      that mutex_lock_killable(), wait_for_completion_killable(), down_killable(),
      schedule_timeout_killable() can miss SIGKILL (and btw the second SIGKILL has
      no effect).
      
      Introduce the new helper, signal_pending_state(), and change schedule() to
      use it. Hopefully it will have more users, that is why the task's state is
      passed separately.
      
      Note the "__TASK_STOPPED | __TASK_TRACED" check in
      signal_pending_state(). This is needed to preserve the current
      behaviour (ptrace_notify). I hope this check will be removed soon, but
      this (afaics good) change needs separate discussion.
      
      The fast path is "(state & (INTERRUPTIBLE | WAKEKILL)) + signal_pending(p)",
      basically the same as what schedule() does now. However, this patch of
      course bloats schedule().
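      
      The helper ends up looking essentially like this:
      
      	static inline int signal_pending_state(long state, struct task_struct *p)
      	{
      		if (!(state & (TASK_INTERRUPTIBLE | TASK_WAKEKILL)))
      			return 0;
      		if (!signal_pending(p))
      			return 0;
      
      		/* TASK_KILLABLE sleeps are woken only by fatal signals */
      		return (state & TASK_INTERRUPTIBLE) || __fatal_signal_pending(p);
      	}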
      Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  14. 07 June 2008 (1 commit)
  15. 02 June 2008 (2 commits)
    • ftrace: user update and disable dynamic ftrace daemon · ad90c0e3
      Committed by Steven Rostedt
      In dynamic ftrace, the mcount function starts off pointing to a stub
      function that just returns.
      
      On start up, the call to the stub is modified to point to a "record_ip"
      function. The job of the record_ip function is to add the function to
      a pre-allocated hash list. If the function is already there, it simply is
      ignored, otherwise it is added to the list.
      
      Later, a ftraced daemon wakes up and calls kstop_machine if any functions
      have been recorded, and changes the calls to the recorded functions to
      a simple nop.  If no functions were recorded, the daemon goes back to sleep.
      
      The daemon wakes up once a second to see if it needs to update any newly
      recorded functions into nops.  Usually it does not, but if a lot of code
      has been executed for the first time in the kernel, the ftraced daemon
      will call kstop_machine to update those into nops.
      
      The problem currently is that there's no way to stop the daemon from
      doing this, and it can cause unneeded latencies (800us, which for some
      is bothersome).
      
      This patch adds a new file, /debugfs/tracing/ftraced_enabled. Reading
      this file returns "enabled\n" if the daemon is active and "disabled\n"
      when it is not running. To disable the daemon, the user can echo "0"
      or "disable" into this file, and "1" or "enable" to re-enable it.
      
      Since the daemon is used to convert the functions into nops to increase
      the performance of the system, I also added that anytime something is
      written into the ftraced_enabled file, kstop_machine will run if there
      are new functions that have been detected that need to be converted.
      
      This way the user can disable the daemon but still control the
      conversion of the mcount calls to nops by simply running
      
        "echo 0 > /debugfs/tracing/ftraced_enabled"
      
      when they need to do more conversions.
      
      To see the number of converted functions:
      
        "cat /debugfs/tracing/dyn_ftrace_total_info"
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: distinguish kretprobe'd functions in trace logs · 76094a2c
      Committed by Abhishek Sagar
      Tracing functions via ftrace which have a kretprobe installed on them
      can produce misleading output in their trace logs. E.g., consider the
      correct trace of the following sequence:
      
      do_IRQ()
      {
      ~
        irq_enter();
      ~
      }
      
      Trace log (sample):
      <idle>-0     [00] 4154504455.781616: irq_enter <- do_IRQ
      
      But if irq_enter() has a kretprobe installed on it, the return address
      stored on the stack at each invocation is modified to divert the
      return to a kprobe trampoline function called kretprobe_trampoline().
      So with this, the trace would (currently) look like:
      
      <idle>-0     [00] 4154504455.781616: irq_enter <- kretprobe_trampoline
      
      Now this is quite misleading to the end user, as it suggests something
      that didn't actually happen. So, just to avoid such misinterpretations,
      the inlined patch aims to output such a log as:
      
      <idle>-0     [00] 4154504455.781616: irq_enter <- [unknown/kretprobe'd]
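      
      A sketch of the check in the trace output path (helper names
      approximate the ftrace code of that time):
      
      	/* the real return site was stolen by the kretprobe, so don't
      	   pretend the caller was the trampoline */
      	if (parent_ip == (unsigned long)&kretprobe_trampoline)
      		trace_seq_puts(s, " <- [unknown/kretprobe'd]");
      	else
      		seq_print_ip_sym(s, parent_ip, sym_flags);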
      Signed-off-by: Abhishek Sagar <sagar.abhishek@gmail.com>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 01 June 2008 (1 commit)
    • capabilities: remain source compatible with 32-bit raw legacy capability support. · ca05a99a
      Committed by Andrew G. Morgan
      Source code out there hard-codes a notion of what the
      _LINUX_CAPABILITY_VERSION #define means in terms of the semantics of the
      raw capability system calls capget() and capset().  It's unfortunate,
      but true.
      
      Since the confusing header file has been in a released kernel, there is
      software that is erroneously using 64-bit capabilities with the
      semantics of 32-bit capabilities.  These recently compiled programs
      may suffer memory corruption when sys_getcap() overwrites more memory
      than they are coded to expect, and may raise added capabilities when
      using sys_capset().
      
      As such, this patch does a number of things to clean up the situation
      for all. It
      
        1. forces the _LINUX_CAPABILITY_VERSION define to always retain its
           legacy value.
      
        2. adopts a new #define strategy for the kernel's internal
           implementation of the preferred magic.
      
        3. deprecates v2 capability magic in favor of a new (v3) magic
           number. The functionality of v3 is entirely equivalent to v2,
           the only difference being that the v2 magic causes the kernel
           to log a "deprecated" warning so the admin can find applications
           that may be using v2 inappropriately.
      
      [User space code continues to be encouraged to use the libcap API which
      protects the application from details like this.  libcap-2.10 is the first
      to support v3 capabilities.]
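      
      For reference, the version magics involved look like this (values as
      found in the capability headers of that era):
      
      	#define _LINUX_CAPABILITY_VERSION_1	0x19980330	/* legacy 32-bit */
      	#define _LINUX_CAPABILITY_VERSION_2	0x20071026	/* deprecated */
      	#define _LINUX_CAPABILITY_VERSION_3	0x20080522	/* same semantics as v2 */
      
      	/* source compatibility: the bare define keeps its legacy value */
      	#define _LINUX_CAPABILITY_VERSION	_LINUX_CAPABILITY_VERSION_1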
      
      Fixes issue reported in https://bugzilla.redhat.com/show_bug.cgi?id=447518.
      Thanks to Bojan Smojver for the report.
      
      [akpm@linux-foundation.org: s/depreciate/deprecate/g]
      [akpm@linux-foundation.org: be robust about put_user size]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Andrew G. Morgan <morgan@kernel.org>
      Cc: Serge E. Hallyn <serue@us.ibm.com>
      Cc: Bojan Smojver <bojan@rexursive.com>
      Cc: stable@kernel.org
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Chris Wright <chrisw@sous-sol.org>