- 13 Oct, 2010 1 commit
-
-
Committed by Steven Rostedt
Time stamps for the ring buffer are created by the difference between two events. Each page of the ring buffer holds a full 64 bit timestamp. Each event has a 27 bit delta stamp from the last event. The unit of time is nanoseconds, so 27 bits can hold ~134 milliseconds. If two events happen more than 134 milliseconds apart, a time extend is inserted to add more bits for the delta. The time extend has 59 bits, which is good for ~18 years.

Currently the time extend is committed separately from the event. If an event is discarded before it is committed, due to filtering, the time extend still exists. If all events are being filtered, then after ~134 milliseconds a new time extend will be added to the buffer.

This can only happen till the end of the page. Since each page holds a full timestamp, there is no reason to add a time extend to the beginning of a page. Time extends can only fill a page that has actual data at the beginning, so there is no fear that time extends will fill more than a page without any data.

When reading an event, a loop is made to skip over time extends since they are only used to maintain the time stamp and are never given to the caller. As a paranoid check to prevent the loop running forever, with the knowledge that time extends may only fill a page, a check is made that tests the iteration of the loop, and if the iteration is more than the number of time extends that can fit in a page, a warning is printed and the ring buffer is disabled (all of ftrace is also disabled with it).

There is another event type that is called a TIMESTAMP which can hold 64 bits of data in the theoretical case that two events happen 18 years apart. This code has not been implemented, but the name of this event exists, as well as the structure for it. The size of a TIMESTAMP is 16 bytes, whereas a time extend is only 8 bytes.

The macro used to calculate how many time extends can fit on a page used the TIMESTAMP size instead of the time extend size, cutting the amount in half.

The following test case can easily trigger the warning since we only need to have half the page filled with time extends to trigger the warning:

  # cd /sys/kernel/debug/tracing/
  # echo function > current_tracer
  # echo 'common_pid < 0' > events/ftrace/function/filter
  # echo > trace
  # echo 1 > trace_marker
  # sleep 120
  # cat trace

This enables the function tracer, sets the filter to only trace functions where the process id is negative (no events), clears the trace buffer to ensure that we have nothing in the buffer, writes to trace_marker to add an event to the beginning of a page, sleeps for 2 minutes (only 35 seconds is probably needed, but this guarantees the bug), and finally reads the trace, which triggers the bug.

This patch fixes the typo and prevents the false positive of that warning.

Reported-by: Hans J. Koch <hjk@linutronix.de>
Tested-by: Hans J. Koch <hjk@linutronix.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stable Kernel <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
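The arithmetic in the commit above can be checked with a few lines of C. This is only a sketch: the macro and function names are hypothetical (they are not the kernel's), and the 4096-byte data area used in the note below is an assumption, since a real ring buffer page also carries a header.

```c
#include <stdint.h>

/* Illustrative constants taken from the commit text; names are hypothetical. */
#define DELTA_BITS        27u   /* per-event delta field, in bits */
#define EXTEND_BITS       59u   /* delta bits available with a time extend */
#define TIME_EXTEND_SIZE  8u    /* bytes per time extend */
#define TIMESTAMP_SIZE    16u   /* bytes per (unimplemented) TIMESTAMP event */

/* Largest delta, in nanoseconds, that fits in the given number of bits. */
static uint64_t max_delta_ns(unsigned bits)
{
    return (UINT64_C(1) << bits) - 1;
}

/* How many fixed-size records fit in a data area of page_bytes bytes. */
static unsigned records_per_page(unsigned page_bytes, unsigned record_bytes)
{
    return page_bytes / record_bytes;
}
```

With an assumed 4096-byte data area, records_per_page(4096, TIMESTAMP_SIZE) gives 256 while records_per_page(4096, TIME_EXTEND_SIZE) gives 512, which is the halving described above. At roughly 134 ms per time extend, 256 of them take about 34 seconds, in line with the "35 seconds" estimate in the commit message.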
-
- 12 Oct, 2010 1 commit
-
-
Committed by John Blackwood
perf events: repair incorrect use of copy_from_user

This makes perf_event_period() return 0 instead of -EFAULT on success.

Signed-off-by: John Blackwood <john.blackwood@ccur.com>
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100928220311.GA18145@tsunami.ccur.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
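For context, copy_from_user() returns the number of bytes that could not be copied, so 0 means success. The sketch below shows the usual calling pattern in plain userspace C. fake_copy_from_user() and read_period() are hypothetical stand-ins written for this illustration, not the kernel functions, and a NULL source is used here to simulate a faulting user pointer.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Userspace stand-in for the kernel helper (hypothetical): returns the
 * number of bytes NOT copied, so 0 means success. A NULL source simulates
 * a faulting user pointer.
 */
static unsigned long fake_copy_from_user(void *to, const void *from,
                                         unsigned long n)
{
    if (from == NULL)
        return n;
    memcpy(to, from, n);
    return 0;
}

/* Return 0 on success and -EFAULT only when the copy was incomplete. */
static int read_period(unsigned long long *value, const void *arg)
{
    if (fake_copy_from_user(value, arg, sizeof(*value)))
        return -EFAULT;
    return 0;
}
```

Testing the return value with the opposite sense would report -EFAULT exactly when the copy succeeded, which matches the symptom named in the commit subject.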
-
- 08 Oct, 2010 1 commit
-
-
Committed by Eric Dumazet
When proc_doulongvec_minmax() is used with an array of longs, and no min/max check is requested (.extra1 or .extra2 being NULL), we dereference a NULL pointer for the second element of the array.

Noticed while doing some changes in the network stack for the "16TB problem".

Fix is to not change the min & max pointers in __do_proc_doulongvec_minmax(), so that all elements of the vector share a single min/max limit, like proc_dointvec_minmax().

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Americo Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
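The idea of the fix above, one optional min/max pair shared by every element, can be sketched as follows. count_in_range() is a hypothetical helper written for this illustration and is not the kernel's __do_proc_doulongvec_minmax().

```c
#include <stddef.h>

/*
 * Count how many of the n values fall inside the optional range.
 * min and max may each be NULL, meaning "no bound on that side".
 * The same two pointers are used for every element; they are never
 * advanced, so a NULL bound is never dereferenced.
 */
static size_t count_in_range(const unsigned long *vals, size_t n,
                             const unsigned long *min,
                             const unsigned long *max)
{
    size_t ok = 0;

    for (size_t i = 0; i < n; i++) {
        if ((min && vals[i] < *min) || (max && vals[i] > *max))
            continue;   /* out of range: skip this element */
        ok++;
    }
    return ok;
}
```

Because the bound pointers are only tested and read, never incremented, the second and later elements are handled exactly like the first.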
-
- 07 Oct, 2010 1 commit
-
-
Committed by Andi Kleen
The original hwpoison code added a new siginfo field si_addr_lsb to pass the granularity of the fault address to user space. Unfortunately this field was never copied to user space. Fix this here.

I added explicit checks for the MCEERR codes to avoid having to patch all potential callers to initialize the field.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
-
- 06 Oct, 2010 1 commit
-
-
Committed by Linus Torvalds
With all the recent module loading cleanups, we've minimized the code that sits under module_mutex, fixing various deadlocks and making it possible to do most of the module loading in parallel.

However, that whole conversion totally missed the rather obscure code that adds a new module to the list for BUG() handling. That code was doubly obscure because (a) the code itself lives in lib/bug.c (for dubious reasons) and (b) it gets called from the architecture-specific "module_finalize()" rather than from generic code.

Calling it from arch-specific code makes no sense whatsoever to begin with, and is now actively wrong since that code isn't protected by the module loading lock any more.

So this commit moves the "module_bug_{finalize,cleanup}()" calls away from the arch-specific code, and into the generic code - and in the process protects it with the module_mutex so that the list operations are now safe.

Future fixups:
 - move the module list handling code into kernel/module.c where it belongs.
 - get rid of 'module_bug_list' and just use the regular list of modules (called 'modules' - imagine that) that we already create and maintain for other reasons.

Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Adrian Bunk <bunk@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 02 Oct, 2010 1 commit
-
-
Committed by Ira W. Snyder
The kfifo_dma family of functions use sg_mark_end() on the last element in their scatterlist. This forces use of a fresh scatterlist for each DMA operation, which makes recycling a single scatterlist impossible.

Change the behavior of the kfifo_dma functions to match the usage of the dma_map_sg function. This means that users must respect the returned nents value. The sample code is updated to reflect the change.

This bug is trivial to cause: call kfifo_dma_in_prepare() such that it prepares a scatterlist with a single entry comprising the whole fifo. This is the case when you map the entirety of a newly created empty fifo. This causes the setup_sgl() function to mark the first scatterlist entry as the end of the chain, no matter what comes after it.

Afterwards, add and remove some data from the fifo such that another call to kfifo_dma_in_prepare() will create two scatterlist entries. It returns nents=2. However, due to the previous sg_mark_end() call, sg_is_last() will now return true for the first scatterlist element. This causes the sample code to print a single scatterlist element when it should print two.

By removing the call to sg_mark_end(), we make the API as similar as possible to the DMA mapping API. All users are required to respect the returned nents.

Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
Cc: Stefani Seibold <stefani@seibold.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 Sep, 2010 1 commit
-
-
Committed by Andrea Arcangeli
The below bug in fork led to the rmap walk finding the parent huge-pmd twice instead of just once, because the anon_vma_chain objects of the child vma still point to the vma->vm_mm of the parent.

The patch fixes it by making the rmap walk accurate during fork. It's not a big deal normally, but it is worth being accurate considering the cost is the same.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Acked-by: Johannes Weiner <jweiner@redhat.com>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 21 Sep, 2010 5 commits
-
-
Committed by Steven Rostedt
Add a tracepoint that shows the priority of a task being boosted via priority inheritance.

Cc: Gregory Haskins <ghaskins@novell.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
Committed by Steven Rostedt
If a high priority task is waking up on a CPU that is running a lower priority task that is bound to a CPU, see if we can move the high RT task to another CPU first. Note, if all other CPUs are running higher priority tasks than the CPU-bound current task, then it will be preempted regardless.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.888922071@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Steven Rostedt
When first working on the RT scheduler design, we concentrated on keeping all CPUs running RT tasks instead of having multiple RT tasks on a single CPU waiting for the migration thread to move them. Instead we take a more proactive stance and push or pull RT tasks from one CPU to another on wakeup or scheduling.

When an RT task wakes up on a CPU that is running another RT task, instead of preempting it and killing the cache of the running RT task, we look to see if we can migrate the RT task that is waking up, even if the RT task waking up is of higher priority.

This may sound a bit odd, but RT tasks should be limited in migration by the user anyway. But in practice, people do not do this, which causes high prio RT tasks to bounce around the CPUs. This becomes even worse when we have priority inheritance, because a high prio task can block on a lower prio task and boost its priority. When the lower prio task wakes up the high prio task, if it happens to be on the same CPU it will migrate off of it.

But in reality, the above does not happen much either, because on the wake up of the lower prio task, which has already been boosted, if it was on the same CPU as the higher prio task, it would then migrate off of it. But anyway, we do not want to migrate them either.

To examine the scheduling, I created a test program and examined it under kernelshark. The test program created CPU * 2 threads, where each thread had a different priority. The program takes different options. The option used in this change log was whether to use priority inheritance mutexes or not.

All threads did the following loop:

  static void grab_lock(long id, int iter, int l)
  {
          ftrace_write("thread %ld iter %d, taking lock %d\n", id, iter, l);
          pthread_mutex_lock(&locks[l]);
          ftrace_write("thread %ld iter %d, took lock %d\n", id, iter, l);
          busy_loop(nr_tasks - id);
          ftrace_write("thread %ld iter %d, unlock lock %d\n", id, iter, l);
          pthread_mutex_unlock(&locks[l]);
  }

  void *start_task(void *id)
  {
          [...]
          while (!done) {
                  for (l = 0; l < nr_locks; l++) {
                          grab_lock(id, i, l);
                          ftrace_write("thread %ld iter %d sleeping\n", id, i);
                          ms_sleep(id);
                  }
                  i++;
          }
          [...]
  }

The busy_loop(ms) keeps the CPU spinning for ms milliseconds. The ms_sleep(ms) sleeps for ms milliseconds. The ftrace_write() writes to the ftrace buffer to help analyze via ftrace.

The higher the id, the higher the prio, the shorter it does the busy loop, but the longer it sleeps. This is usually the case with RT tasks: the lower priority tasks usually run longer than higher priority tasks.

At the end of the test, it records the number of loops each thread took, as well as the number of voluntary preemptions, non-voluntary preemptions, and number of migrations each thread took, taking the information from /proc/$$/sched and /proc/$$/status.

Running this on a 4 CPU processor, the results without changes to the kernel looked like this:

  Task      vol   nonvol   migrated   iterations
  ----      ---   ------   --------   ----------
    0:       53     3220       1470           98
    1:      562      773        724           98
    2:      752      933       1375           98
    3:      749       39        697           98
    4:      758        5        515           98
    5:      764        2        679           99
    6:      761        2        535           99
    7:      757        3        346           99
  total:   5156     4977       6341          787

Each thread, regardless of priority, migrated a few hundred times. The higher priority tasks were a little better but still took quite an impact.

By letting higher priority tasks bump the lower prio task from the CPU, things changed a bit:

  Task      vol   nonvol   migrated   iterations
  ----      ---   ------   --------   ----------
    0:       37     2835       1937           98
    1:      666     1821       1865           98
    2:      654     1003       1385           98
    3:      664      635        973           99
    4:      698      197        352           99
    5:      703      101        159           99
    6:      708        1         75           99
    7:      713        1          2           99
  total:   4843     6594       6748          789

The total # of migrations did not change (several runs showed the difference all within the noise). But we now see a dramatic improvement to the higher priority tasks. (kernelshark showed that the watchdog timer bumped the highest priority task to give it the 2 count. This was actually consistent with every run.)

Notice that the # of iterations did not change either.

The above was with priority inheritance mutexes. That is, when the higher priority task blocked on a lower priority task, the lower priority task would inherit the priority of the higher priority task (which shows why task 6 was bumped so many times). When not using priority inheritance mutexes, the current kernel shows this:

  Task      vol   nonvol   migrated   iterations
  ----      ---   ------   --------   ----------
    0:       56     3101       1892           95
    1:      594      713        937           95
    2:      625      188        618           95
    3:      628        4        491           96
    4:      640        7        468           96
    5:      631        2        501           96
    6:      641        1        466           96
    7:      643        2        497           96
  total:   4458     4018       5870          765

Not much changed with or without priority inheritance mutexes. But if we let the high priority task bump lower priority tasks on wakeup we see:

  Task      vol   nonvol   migrated   iterations
  ----      ---   ------   --------   ----------
    0:      115     3439       2782           98
    1:      633     1354       1583           99
    2:      652      919       1218           99
    3:      645      713        934           99
    4:      690        3          3           99
    5:      694        1          4           99
    6:      720        3          4           99
    7:      747        0          1          100

Which shows an even bigger change. The big difference between task 3 and task 4 is because we have only 4 CPUs on the machine, causing the 4 highest prio tasks to always have preference.

Although I did not measure cache misses, and I'm sure there would be little to measure since the test was not data intensive, I could imagine large improvements for higher priority tasks when dealing with lower priority tasks. Thus, I'm satisfied with making the change and agreeing with what Gregory Haskins argued a few years ago when we first had this discussion.

One final note. All tasks in the above tests were RT tasks. Any RT task will always preempt a non RT task that is running on the CPU the RT task wants to run on.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Gregory Haskins <ghaskins@novell.com>
LKML-Reference: <20100921024138.605460343@goodmis.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Venkatesh Pallipadi
The scheduler uses cache_nice_tries as an indicator to do cache_hot and active load balance when normal load balance fails. Currently, this value is changed on any failed load balance attempt. That ends up being not so nice to workloads that enter/exit idle often, as they do more frequent new_idle balance and that pretty soon results in cache hot tasks being pulled in.

Making cache_nice_tries ignore failed new_idle balance seems to make better sense. With that, only the failed load balance in periodic load balance gets accounted, and the rate of accumulation of cache_nice_tries will not depend on idle entry/exit (short running sleep-wakeup kind of tasks). This reduces movement of cache_hot tasks.

schedstat diff (after-before) excerpt from a workload that has a frequent and short wakeup-idle pattern (:2 in the cpu col below refers to NEWIDLE idx). This snapshot was across ~400 seconds.

Without this change:

  domainstats: domain0
   cpu     cnt     bln    fld        imb   gain  hgain  nobusyq  nobusyg
   0:2  306487  219575  73167  110069413  44583  19070     1172   218403
   1:2  292139  194853  81421  120893383  50745  21902     1259   193594
   2:2  283166  174607  91359  129699642  54931  23688     1287   173320
   3:2  273998  161788  93991  132757146  57122  24351     1366   160422
   4:2  289851  215692  62190   83398383  36377  13680      851   214841
   5:2  316312  222146  77605  117582154  49948  20281      988   221158
   6:2  297172  195596  83623  122133390  52801  21301      929   194667
   7:2  283391  178078  86378  126622761  55122  22239      928   177150
   8:2  297655  210359  72995  110246694  45798  19777     1125   209234
   9:2  297357  202011  79363  119753474  50953  22088     1089   200922
  10:2  278797  178703  83180  122514385  52969  22726     1128   177575
  11:2  272661  167669  86978  127342327  55857  24342     1195   166474
  12:2  293039  204031  73211  110282059  47285  19651      948   203083
  13:2  289502  196762  76803  114712942  49339  20547     1016   195746
  14:2  264446  169609  78292  115715605  50459  21017      982   168627
  15:2  260968  163660  80142  116811793  51483  21281     1064   162596

With this change:

  domainstats: domain0
   cpu     cnt     bln    fld        imb   gain  hgain  nobusyq  nobusyg
   0:2  272347  187380  77455  105420270  24975      1      953   186427
   1:2  267276  172360  86234  116242264  28087      6     1028   171332
   2:2  259769  156777  93281  123243134  30555      1     1043   155734
   3:2  250870  143129  97627  127370868  32026      6     1188   141941
   4:2  248422  177116  64096   78261112  22202      2      757   176359
   5:2  275595  180683  84950  116075022  29400      6      778   179905
   6:2  262418  162609  88944  119256898  31056      4      817   161792
   7:2  252204  147946  92646  122388300  32879      4      824   147122
   8:2  262335  172239  81631  110477214  26599      4      864   171375
   9:2  261563  164775  88016  117203621  28331      3      849   163926
  10:2  243389  140949  93379  121353071  29585      2      909   140040
  11:2  242795  134651  98310  124768957  30895      2     1016   133635
  12:2  255234  166622  79843  104696912  26483      4      746   165876
  13:2  244944  151595  83855  109808099  27787      3      801   150794
  14:2  241301  140982  89935  116954383  30403      6      845   140137
  15:2  232271  128564  92821  119185207  31207      4     1416   127148

Signed-off-by: Venkatesh Pallipadi <venki@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1284167957-3675-1-git-send-email-venki@google.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Suresh Siddha
There's a situation where the nohz balancer will try to wake itself:

cpu-x, which is idle and is also the ilb_cpu, gets a scheduler tick during idle, and the nohz_kick_needed() check in trigger_load_balance() looks at rq_x->nr_running, which might not be zero (because someone is waking a task on this rq, etc.). This leads to cpu-x sending a kick to itself.

And this can cause a lockup.

Avoid this by not marking ourself eligible for kicking.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 17 Sep, 2010 1 commit
-
-
Committed by Matt Helsley
Hardware breakpoints can't be registered within pid namespaces because tsk->pid is passed rather than the pid in the current namespace.

(See https://bugzilla.kernel.org/show_bug.cgi?id=17281 )

This is a quick fix demonstrating the problem, but it is not the best method of solving it, since passing pids internally is not the best way to avoid pid namespace bugs. Subsequent patches will show a better solution.

Much thanks to Frederic Weisbecker <fweisbec@gmail.com> for doing the bulk of the work finding this bug.

Reported-by: Robin Green <greenrd@greenrd.org>
Signed-off-by: Matt Helsley <matthltc@us.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Prasad <prasad@linux.vnet.ibm.com>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
Cc: 2.6.33-2.6.35 <stable@kernel.org>
LKML-Reference: <f63454af09fb1915717251570423eb9ddd338340.1284407762.git.matthltc@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
-
- 16 Sep, 2010 1 commit
-
-
Committed by Heiko Carstens
With 710390d9 "sched: Optimize branch hint in context_switch()" the branch hint logic within context_switch() got inverted. In fact the hints "if (likely(!mm))" and "if (likely(!prev->mm))" mean that it is likely that the previous and next task are kernel threads. That assumption is certainly counterintuitive, but Tim has shown that at least with his workload this is true.

Nevertheless the truth is: it depends on the current workload. So just remove the annotations, which also improves readability.

Reported-by: Tim Blechmann <tim@klingt.org>
Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Mike Galbraith <efault@gmx.de>
LKML-Reference: <20100916124225.GA2209@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
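As background for the commit above, branch hints of this kind are thin wrappers around the compiler's __builtin_expect() and only influence code layout, never the result. The sketch below assumes GCC or Clang; the two needs_mm_switch_*() functions are hypothetical stand-ins for the check in context_switch(), not kernel code.

```c
#include <stddef.h>

/* Same shape as the kernel's hint macros; requires GCC or Clang. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical check with a hint: "mm is probably NULL (kernel thread)". */
static int needs_mm_switch_hinted(const void *mm)
{
    if (likely(!mm))
        return 0;       /* kernel thread: no address space to switch to */
    return 1;
}

/* The same check without any annotation, as after the commit. */
static int needs_mm_switch_plain(const void *mm)
{
    if (!mm)
        return 0;
    return 1;
}
```

Both versions return the same values for every input, so dropping the annotation changes only how the compiler arranges the branches.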
-
- 15 Sep, 2010 2 commits
-
-
Committed by Stanislaw Gruszka
A 32-bit variable can overflow during the multiplication in the task_times() and thread_group_times() functions. When the overflow happens, the scaled utime value becomes erroneously small and the scaled stime becomes erroneously big.

Reported here:

  https://bugzilla.redhat.com/show_bug.cgi?id=633037
  https://bugzilla.kernel.org/show_bug.cgi?id=16559

Reported-by: Michael Chapman <redhat-bugzilla@very.puzzling.org>
Reported-by: Ciriaco Garcia de Celis <sysman@etherpilot.com>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
Cc: <stable@kernel.org> # 2.6.32.19+ (partially) and 2.6.33+
LKML-Reference: <20100914143513.GB8415@redhat.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
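The kind of overflow described above can be reproduced in userspace. This is a sketch, not the kernel code: the two functions are hypothetical, and they assume the scaling has the form rtime * utime / total with 32-bit inputs.

```c
#include <stdint.h>

/*
 * Overflow-prone form: the product is computed in 32 bits and wraps
 * modulo 2^32 before it is widened.
 */
static uint32_t scale_utime_32(uint32_t rtime, uint32_t utime, uint32_t total)
{
    uint64_t temp = (uint64_t)(rtime * utime);  /* cast comes too late */

    return (uint32_t)(temp / total);
}

/* Safe form: widen one operand first so the product is computed in 64 bits. */
static uint32_t scale_utime_64(uint32_t rtime, uint32_t utime, uint32_t total)
{
    uint64_t temp = rtime;

    temp *= utime;
    return (uint32_t)(temp / total);
}
```

For example, with rtime = 100000, utime = 60000 and total = 100000, the 64-bit version yields 60000, while the 32-bit version yields 17050 because 6,000,000,000 does not fit in 32 bits. That is the "erroneously small" utime from the commit message.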
-
Committed by H. Peter Anvin
compat_alloc_user_space() expects the caller to independently call access_ok() to verify the returned area. A missing call could introduce problems on some architectures.

This patch incorporates the access_ok() check into compat_alloc_user_space() and also adds a sanity check on the length. The existing compat_alloc_user_space() implementations are renamed arch_compat_alloc_user_space() and are used as part of the implementation of the new global function.

This patch assumes NULL will cause __get_user()/__put_user() to either fail or access userspace on all architectures. This should be followed by checking the return value of compat_alloc_user_space() for NULL in the callers, at which time the access_ok() in the callers can also be removed.

Reported-by: Ben Hawkes <hawkes@sota.gen.nz>
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Chris Metcalf <cmetcalf@tilera.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Tony Luck <tony.luck@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: James Bottomley <jejb@parisc-linux.org>
Cc: Kyle McMartin <kyle@mcmartin.ca>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: <stable@kernel.org>
-
- 14 Sep, 2010 2 commits
-
-
Committed by Mathieu Desnoyers
Fix incorrect handling of the following case:

  INTERACTIVE
  INTERACTIVE_SOMETHING_ELSE

The comparison only checks up to each element's length.

Changelog since v1:
 - Embellish using some Rostedtisms.

[ mingo: ^^ == smaller and cleaner ]

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
Cc: <stable@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tony Lindgren <tony@atomide.com>
LKML-Reference: <20100913214700.GB16118@Krystal>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
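The failure mode above, where a comparison stops at the length of each table entry, can be shown with a small lookup. This is a sketch: the names[] table and both lookup functions are hypothetical and are not taken from the patch.

```c
#include <string.h>

/* Hypothetical feature table in which one name is a prefix of another. */
static const char *const names[] = {
    "INTERACTIVE",
    "INTERACTIVE_SOMETHING_ELSE",
};
#define N_NAMES (sizeof(names) / sizeof(names[0]))

/* Prefix-only match: compares just strlen(names[i]) characters. */
static int find_prefix(const char *input)
{
    for (size_t i = 0; i < N_NAMES; i++)
        if (strncmp(input, names[i], strlen(names[i])) == 0)
            return (int)i;
    return -1;
}

/* Whole-string match: the input must equal the table entry exactly. */
static int find_exact(const char *input)
{
    for (size_t i = 0; i < N_NAMES; i++)
        if (strcmp(input, names[i]) == 0)
            return (int)i;
    return -1;
}
```

With the prefix-only match, the input "INTERACTIVE_SOMETHING_ELSE" resolves to the shorter entry at index 0. The whole-string match picks index 1.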
-
Committed by Ingo Molnar
Mathieu reported bad latencies with make -j10 kind of kbuild workloads - which is mostly caused by us scheduling with a too coarse granularity.

Reduce the minimum granularity some more, to make sure we can meet the latency target.

I got the following results (make -j10 kbuild load, average of 3 runs):

  vanilla:
    maximum latency: 38278.9 µs
    average latency: 7730.1 µs

  patched:
    maximum latency: 22702.1 µs
    average latency: 6684.8 µs

Mathieu also measured it:

 |
 | * wakeup-latency.c (SIGEV_THREAD) with make -j10
 |
 | - Mainline 2.6.35.2 kernel
 |
 | maximum latency: 45762.1 µs
 | average latency: 7348.6 µs
 |
 | - With only Peter's smaller min_gran (shown below):
 |
 | maximum latency: 29100.6 µs
 | average latency: 6684.1 µs
 |

Reported-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Suggested-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <AANLkTi=8m4g01wZPacySoF7U0PevTNVgJoZZrHiUD-pN@mail.gmail.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 13 Sep, 2010 1 commit
-
-
Committed by Tejun Heo
Update copyright notice and add Documentation/workqueue.txt.

Randy Dunlap, Dave Chinner: misc fixes.

Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-By: Florian Mickler <florian@mickler.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
-
- 12 Sep, 2010 1 commit
-
-
Committed by Rafael J. Wysocki
hibernate_preallocate_memory() has a problem: it calls preallocate_image_memory() with an argument that may be greater than the total number of available non-highmem memory pages. If that's the case, the OOM condition is guaranteed to trigger, which in turn can cause a significant slowdown during hibernation.

To avoid that, make preallocate_image_memory() adjust its argument before calling preallocate_image_pages(), so that the total number of saveable non-highmem pages left is not less than the minimum size of a hibernation image. Change hibernate_preallocate_memory() to try to allocate from highmem if the number of pages allocated by preallocate_image_memory() is too low.

Modify free_unnecessary_pages() to take all possible memory allocation patterns into account.

Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
Tested-by: M. Vefa Bicakci <bicave@superonline.com>
-
- 11 Sep, 2010 1 commit
-
-
Committed by mark gross
Correct some pr_debug() misuse and add a stronger parameter check to pm_qos_write() for the ASCII hex value case. Thanks to Dan Carpenter for pointing out the problem!

Signed-off-by: mark gross <markgross@thegnar.org>
Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
-
- 10 Sep, 2010 11 commits
-
-
Committed by Heiko Carstens
Just got my 6 way machine to a state where cpu 0 is in an endless loop within __smp_call_function_single. All other cpus are idle.

The call trace on cpu 0 looks like this:

  __smp_call_function_single
  scheduler_tick
  update_process_times
  tick_sched_timer
  __run_hrtimer
  hrtimer_interrupt
  clock_comparator_work
  do_extint
  ext_int_handler
  ----> timer irq
  cpu_idle

__smp_call_function_single() got called from nohz_balancer_kick() (inlined) with the remote cpu being 1, wait being 0 and the per cpu variable remote_sched_softirq_cb (call_single_data) of the current cpu (0).

Then it loops forever when it tries to grab the lock of the call_single_data, since it is already locked and enqueued on cpu 0.

My theory of how this could have happened: for some reason the scheduler decided to call __smp_call_function_single() on its own cpu, and sent an IPI to itself. The interrupt stays pending since IRQs are disabled. If the hypervisor then schedules the cpu away, it might happen that upon rescheduling both the IPI and the timer IRQ are pending. When interrupts are enabled again, it depends which one gets delivered first. If the timer interrupt gets delivered first, we end up with the local deadlock as seen in the call trace above.

Let's make __smp_call_function_single() check if the target cpu is the current cpu and execute the function immediately, just like smp_call_function_single() does. That should prevent at least the scenario described here.

It might also be that the scheduler is not supposed to call __smp_call_function_single() with the remote cpu being the current cpu, but that is a different issue.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Acked-by: Jens Axboe <jaxboe@fusionio.com>
Cc: Venkatesh Pallipadi <venki@google.com>
Cc: Suresh Siddha <suresh.b.siddha@intel.com>
LKML-Reference: <20100910114729.GB2827@osiris.boeblingen.de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
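The "run it immediately when the target is the current cpu" idea from the commit above can be modelled in a few lines of userspace C. Everything here is hypothetical (the pending_call structure, call_single() and bump() are not kernel code), and the queue is reduced to a single slot for illustration.

```c
typedef void (*call_fn)(void *info);

/* Hypothetical one-slot stand-in for a per-cpu call queue entry. */
struct pending_call {
    call_fn func;
    void *info;
    int queued;
};

/*
 * Returns 1 if func ran immediately (target is the calling cpu),
 * or 0 if the call was queued for a remote cpu to pick up later.
 */
static int call_single(int target_cpu, int this_cpu,
                       struct pending_call *slot, call_fn func, void *info)
{
    if (target_cpu == this_cpu) {
        func(info);     /* local target: no queueing, no self-IPI */
        return 1;
    }
    slot->func = func;
    slot->info = info;
    slot->queued = 1;
    return 0;
}

/* Example callback: increments the int that info points to. */
static void bump(void *info)
{
    ++*(int *)info;
}
```

In this model a local call never touches the slot, so there is nothing left locked or enqueued for the caller to wait on.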
-
Committed by Chris Wright
Be sure to avoid entering t_show() with FTRACE_ITER_HASH set without having properly started the iterator to iterate the hash. This case is degenerate and, as discovered by Robert Swiecki, can cause t_hash_show() to misuse a pointer. This causes a NULL ptr deref with possible security implications. Tracked as CVE-2010-3079.

Cc: Robert Swiecki <swiecki@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
Committed by Hugh Dickins
Please revert 2.6.36-rc commit d2997b10 "hibernation: freeze swap at hibernation". It complicated matters by adding a second swap allocation path, just for hibernation, without in any way fixing the issue that it was intended to address: page reclaim after fixing the hibernation image might free swap from a page already imaged as swapcache, letting its swap be reallocated to store a different page of the image, resulting in data corruption if the imaged page were freed as clean then swapped back in. Pages freed to si->swap_map were still in danger of being reallocated by the alternative allocation path.

I guess it inadvertently fixed slow SSD swap allocation for hibernation, as reported by Nigel Cunningham, by missing out the discards that occur on the usual swap allocation path; but that was unintentional, and needs a separate fix.

Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
Cc: Ondrej Zary <linux@rainbow-software.org>
Cc: Andrea Gelmini <andrea.gelmini@gmail.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Nigel Cunningham <nigel@tuxonice.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Jerome Marchand
gid_t is an unsigned int. If group_info contains a gid greater than INT_MAX, the groups_search() function may look on the wrong side of the search tree.

This solves some unfair "permission denied" problems.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
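The sign problem described above can be reproduced with a small binary search over 32-bit unsigned ids. This is a sketch, not the kernel's groups_search(): both functions are hypothetical, and the subtracting variant relies on the usual two's-complement conversion from uint32_t to int32_t (implementation-defined in C, but what GCC and Clang do).

```c
#include <stdint.h>

/*
 * Subtraction-based comparison: the unsigned difference is reinterpreted
 * as a signed value, so ids more than INT32_MAX apart compare the wrong way.
 */
static int search_by_subtraction(const uint32_t *list, unsigned n, uint32_t grp)
{
    unsigned left = 0, right = n;

    while (left < right) {
        unsigned mid = (left + right) / 2;
        int32_t cmp = (int32_t)(grp - list[mid]);

        if (cmp > 0)
            left = mid + 1;
        else if (cmp < 0)
            right = mid;
        else
            return 1;
    }
    return 0;
}

/* Direct unsigned comparison: correct for the whole 32-bit range. */
static int search_by_compare(const uint32_t *list, unsigned n, uint32_t grp)
{
    unsigned left = 0, right = n;

    while (left < right) {
        unsigned mid = (left + right) / 2;

        if (grp > list[mid])
            left = mid + 1;
        else if (grp < list[mid])
            right = mid;
        else
            return 1;
    }
    return 0;
}
```

With the sorted list {10, 3000000000}, looking up 10 fails in the subtracting version, because 10 - 3000000000 wraps to a positive signed value and the search moves right. The direct comparison finds it.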
-
Committed by Michael S. Tsirkin
Add cgroup_attach_task_all()

The existing cgroup_attach_task_current_cg() API is called by a thread to attach another thread to all of its cgroups; this is unsuitable for cases where a privileged task wants to attach itself to the cgroups of a less privileged one, since the call must be made from the context of the target task.

This patch adds a more generic cgroup_attach_task_all() API that allows both the source task and to-be-moved task to be specified. cgroup_attach_task_current_cg() becomes a specialization of the more generic new function.

[menage@google.com: rewrote changelog]
[akpm@linux-foundation.org: address reviewer comments]
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Tested-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Ben Blum <bblum@google.com>
Cc: Sridhar Samudrala <sri@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Peter Oberparleiter
The gcov-kernel infrastructure expects that each object file is loaded only once. This may not be true, e.g. when loading multiple kernel modules which are linked to the same object file. As a result, loading such kernel modules will result in incorrect gcov results while unloading will cause a null-pointer dereference.

This patch fixes these problems by changing the gcov-kernel infrastructure so that multiple profiling data sets can be associated with one debugfs entry. It applies to 2.6.36-rc1.

Signed-off-by: Peter Oberparleiter <oberpar@linux.vnet.ibm.com>
Reported-by: Werner Spies <werner.spies@thalesgroup.com>
Cc: <stable@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Heiko Carstens
On top of the SMT and MC scheduling domains this adds the BOOK scheduling domain. This is useful for NUMA-like machines which do not have an interface which tells which piece of memory is attached to which node or where the hardware performs striping.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100831082844.253053798@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Heiko Carstens
Merge and simplify the two cpu_to_core_group variants so that the resulting function follows the same pattern as cpu_to_phys_group.

Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20100831082843.953617555@de.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Suresh Siddha
Currently sched_avg_update() (which updates rt_avg stats in the rq) is called from scale_rt_power() (in the load balance context), which doesn't take rq->lock. Fix it by moving sched_avg_update() to the more appropriate update_cpu_load(), where the CFS load gets updated as well.

Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1282596171.2694.3.camel@sbsiddha-MOBL3>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Peter Zijlstra
Since we have UP_PREPARE, we should also have UP_CANCELED.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: paulus <paulus@samba.org>
LKML-Reference: <new-submission>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
Committed by Li Zefan
Commit 1c024eca (perf, trace: Optimize tracepoints by using per-tracepoint-per-cpu hlist to track events) caused a module refcount leak.

Reported-And-Tested-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <4C7E1F12.8030304@cn.fujitsu.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 08 Sep, 2010 3 commits
-
-
Committed by Masami Hiramatsu
Check whether the argument name is invalid (not a C-like symbol name). This keeps the event format simple.

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113912.22882.62313.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
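A check of this kind can be sketched as below. This is an illustration rather than the code from the patch: the function name `sketch_is_c_like_name` is hypothetical, and the sketch assumes "C-like symbol name" means a letter or underscore followed by letters, digits, or underscores.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Returns true if name looks like a C identifier: it must start with a
 * letter or '_' and continue with letters, digits, or '_' only.
 * Empty strings and NULL are rejected.
 */
bool sketch_is_c_like_name(const char *name)
{
    if (name == NULL)
        return false;

    if (!isalpha((unsigned char)*name) && *name != '_')
        return false;

    while (*++name != '\0') {
        if (!isalnum((unsigned char)*name) && *name != '_')
            return false;
    }
    return true;
}
```

Under this rule names such as "arg1" or "_ip" pass, while "%ax", "1st", and the empty string are rejected.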
-
Committed by Masami Hiramatsu
Set an "argN" name for each argument automatically if it has no specified name. Since a dynamic trace event (kprobe_events) accepts special characters in its arguments, its format can show those special characters (e.g. '$', '%', '+'). However, perf cannot parse such a format because those characters (especially '%') corrupt it. This patch sets an "argN" name for each argument whose name the user omitted.

E.g.

    # echo 'p do_fork %ax IP=%ip $stack' > tracing/kprobe_events
    # cat tracing/kprobe_events
    p:kprobes/p_do_fork_0 do_fork arg1=%ax IP=%ip arg3=$stack

Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113906.22882.59312.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
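The naming rule in the example above can be modelled in a few lines. This is a sketch under assumptions, not the kernel parser: `sketch_arg_name` is a hypothetical helper, and it assumes a name is whatever precedes the first '=' in an argument, with the 1-based position used when no name is given.

```c
#include <stdio.h>
#include <string.h>

/*
 * Writes the name for the argument at zero-based position index into
 * out. "NAME=FETCHARG" yields NAME; a bare fetch argument such as
 * "%ax" yields "argN", where N is the 1-based position.
 */
void sketch_arg_name(const char *spec, int index, char *out, size_t out_len)
{
    const char *eq = strchr(spec, '=');

    if (out_len == 0)
        return;

    if (eq != NULL && eq != spec) {
        /* Copy the user-supplied name, truncating if it is too long. */
        size_t len = (size_t)(eq - spec);

        if (len >= out_len)
            len = out_len - 1;
        memcpy(out, spec, len);
        out[len] = '\0';
    } else {
        /* No name given: fall back to the positional name. */
        snprintf(out, out_len, "arg%d", index + 1);
    }
}
```

Applied to the three arguments from the example (%ax, IP=%ip, $stack), this gives arg1, IP, and arg3, matching the names shown in the kprobe_events output.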
-
Committed by Masami Hiramatsu
Fix a memory leak which happens when a field name conflicts with others. In the error case, free_trace_probe() frees all arguments up to nr_args, so this patch increments nr_args at the beginning of the loop instead of the end.

Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
LKML-Reference: <20100827113846.22882.12670.stgit@ltc236.sdl.hitachi.co.jp>
Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
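Why the position of the increment matters can be shown with a small user-space model. Everything here is hypothetical (`struct sketch_probe`, `sketch_parse_args`, `sketch_free_probe`); it only mirrors the idea that the cleanup routine frees entries below nr_args, so an entry allocated before the counter is bumped would be missed on an error return.

```c
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_ARGS 8

/* Hypothetical probe: a counted array of heap-allocated names. */
struct sketch_probe {
    int nr_args;
    char *names[SKETCH_MAX_ARGS];
};

/* Portable replacement for strdup(), which is not in ISO C99. */
static char *sketch_strdup(const char *s)
{
    char *copy = malloc(strlen(s) + 1);

    if (copy != NULL)
        strcpy(copy, s);
    return copy;
}

/* Frees names[0 .. nr_args-1] and returns how many slots were visited. */
int sketch_free_probe(struct sketch_probe *p)
{
    int freed = 0;

    for (int i = 0; i < p->nr_args; i++) {
        free(p->names[i]);
        p->names[i] = NULL;
        freed++;
    }
    p->nr_args = 0;
    return freed;
}

/* Returns 0 on success, -1 on allocation failure or a duplicate name. */
int sketch_parse_args(struct sketch_probe *p, const char *const *names, int n)
{
    p->nr_args = 0;
    for (int i = 0; i < n && i < SKETCH_MAX_ARGS; i++) {
        /*
         * Count the slot first, so that an error return below still
         * leaves it within reach of sketch_free_probe().
         */
        p->nr_args++;
        p->names[i] = sketch_strdup(names[i]);
        if (p->names[i] == NULL)
            return -1;

        for (int j = 0; j < i; j++) {
            if (strcmp(p->names[j], p->names[i]) == 0)
                return -1; /* names[i] is allocated but rejected */
        }
    }
    return 0;
}
```

If the increment were moved to the end of the loop body, a conflict on the third name would return with nr_args equal to 2 and the third allocation would never be freed.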
-
- 09 Sep, 2010 1 commit
-
-
Committed by Steven Rostedt
Reading the file set_ftrace_filter does three things:

1) shows whether or not filters are set for the function tracer
2) shows what functions are set for the function tracer
3) shows what triggers are set on any functions

3 is independent from 1 and 2.

The way this file currently works is that it is a state machine, and as you read it, it may change state. But this assumption breaks when you use lseek() on the file. The state machine gets out of sync and t_show() may use the wrong pointer and cause a kernel oops.

Luckily, this will only kill the app that does the lseek, but the app dies while holding a mutex. This prevents anyone else from using the set_ftrace_filter file (or any other function tracing file for that matter).

A real fix for this is to rewrite the code, but that is too much for a -rc release or stable. This patch simply disables llseek on the set_ftrace_filter file for now, and we can do the proper fix for the next major release.

Reported-by: Robert Swiecki <swiecki@google.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: Tavis Ormandy <taviso@google.com>
Cc: Eugene Teo <eugene@redhat.com>
Cc: vendor-sec@lst.de
Cc: <stable@kernel.org>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
-
- 08 Sep, 2010 1 commit
-
-
Committed by Christian Dietrich
The CONFIG_SMP ifdef isn't necessary at this point, because it is checked in an outer ifdef level already and has no effect here. Cleanup only, no functional effect.

Signed-off-by: Christian Dietrich <qy03fugy@stud.informatik.uni-erlangen.de>
Cc: vamos-dev@i4.informatik.uni-erlangen.de
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Tejun Heo <tj@kernel.org>
LKML-Reference: <7a3a39ef3f765a4473cb026b1f204059568a7098.1283782701.git.qy03fugy@stud.informatik.uni-erlangen.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 05 Sep, 2010 2 commits
-
-
Committed by Michael S. Tsirkin
The cgroup_attach_task_current_cg API that we have upstream is backwards: we really need an API to attach the current task to the cgroups of another process A.

In our case (vhost), a privileged user wants to attach its task to the cgroups of a less privileged one, but the API makes us run it in the other task's context, and this fails.

So let's make the API generic and just pass in 'from' and 'to' tasks. Add an inline wrapper for cgroup_attach_task_current_cg to avoid breaking bisect.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Paul Menage <menage@google.com>
-
Committed by Andi Kleen
No real bugs I believe, just some dead code.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: andi@firstfloor.org
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-
- 03 Sep, 2010 1 commit
-
-
Committed by Randy Dunlap
Fix kernel-doc notation in linux/mutex.h and kernel/mutex.c, then add these 2 files to the kernel-locking docbook as the Mutex API reference chapter. Add one API function to mutex-design.txt and correct a typo in that file.

Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
LKML-Reference: <20100902154816.6cc2f9ad.randy.dunlap@oracle.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
-