1. 23 Feb 2011 (3 commits)
    • sched: Fix the group_imb logic · 866ab43e
      Peter Zijlstra committed
      On a 2*6*2 machine something like:
      
       taskset -c 3-11 bash -c 'for ((i=0;i<9;i++)) do while :; do :; done & done'
      
      _should_ result in 9 busy CPUs, each running 1 task.
      
      However it didn't quite work reliably, most of the time one cpu of the
      second socket (6-11) would be idle and one cpu of the first socket
      (0-5) would have two tasks on it.
      
      The group_imb logic is supposed to deal with this and detect when a
      particular group is imbalanced (like in our case, 0-2 are idle but 3-5
      will have 4 tasks on it).
      
      The detection phase needed a tweak: it was too weak, requiring a
      difference of more than two average task weights between the idle and
      busy cpus in the group, which doesn't trigger for our test case. Relax
      it to trigger on a difference of one or more average task weights
      between cpus.
      
      Once the detection phase worked, it was then defeated by the f_b_g()
      tests trying to avoid ping-pongs. In particular, this_load >= max_load
      triggered because the pulling cpu (the (first) idle cpu on the
      second socket, say 6) would find this_load to be 5 and max_load to be
      4 (there'd be 5 tasks running on our socket and only 4 on the other
      socket).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
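
      A hedged sketch of the relaxed detection described above, not the
      literal patch; the variable names approximate the load-balancer
      statistics code of that era (update_sg_lb_stats()):

        /* Flag the group as imbalanced once the load spread between its
         * busiest and idlest cpu reaches one average task weight, provided
         * the busiest cpu actually runs more than one task. */
        if ((max_cpu_load - min_cpu_load) >= avg_load_per_task &&
            max_nr_running > 1)
                sgs->group_imb = 1;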
    • sched: Clean up some f_b_g() comments · cc57aa8f
      Peter Zijlstra committed
      The existing comment tends to grow stale (as it already has); split it
      up and place the pieces near the actual tests.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Clean up remnants of sd_idle · c186fafe
      Peter Zijlstra committed
      With the wholesale removal of the sd_idle SMT logic we can clean up
      some more.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Nikhil Rao <ncrao@google.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Mike Galbraith <efault@gmx.de>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 17 Feb 2011 (3 commits)
  3. 16 Feb 2011 (1 commit)
  4. 14 Feb 2011 (1 commit)
  5. 12 Feb 2011 (2 commits)
    • timer debug: Hide kernel addresses via %pK in /proc/timer_list · f5903085
      Kees Cook committed
      In the continuing effort to avoid kernel addresses leaking to
      unprivileged users, this patch switches to %pK for
      /proc/timer_list reporting.
      Signed-off-by: Kees Cook <kees.cook@canonical.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      Cc: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110212032125.GA23571@outflux.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
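
      A minimal sketch of the %p to %pK switch; /proc/timer_list actually
      uses its own SEQ_printf() helper, so the plain seq_printf() below is
      illustrative only:

        /* %pK honours the kptr_restrict sysctl: privileged readers see the
         * real pointer, unprivileged ones see a zeroed value. */
        seq_printf(m, "  .base:       %pK\n", base);   /* was: %p */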
    • ptrace: use safer wake up on ptrace_detach() · 01e05e9a
      Tejun Heo committed
      The wake_up_process() call in ptrace_detach() is spurious and not
      interlocked with the tracee state.  IOW, the tracee could be running or
      sleeping in any place in the kernel by the time wake_up_process() is
      called.  This can lead to the tracee waking up unexpectedly which can be
      dangerous.
      
      The wake_up is spurious and should be removed but for now reduce its
      toxicity by only waking up if the tracee is in TRACED or STOPPED state.
      
      This bug can possibly be used as an attack vector.  I don't think it
      will take too much effort to come up with an attack which triggers oops
      somewhere.  Most sleeps are wrapped in condition test loops and should
      be safe but we have quite a number of places where sleep and wakeup
      conditions are expected to be interlocked.  Although the window of
      opportunity is tiny, ptrace can be used by non-privileged users and with
      some loading the window can definitely be extended and exploited.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
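
      A hedged sketch of the safer wakeup, assuming it replaces the
      unconditional wake_up_process(child) in ptrace_detach():

        /* Only kick the tracee if it is actually stopped or traced; a
         * tracee sleeping elsewhere in the kernel is left alone. */
        wake_up_state(child, TASK_TRACED | TASK_STOPPED);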
  6. 11 Feb 2011 (2 commits)
  7. 10 Feb 2011 (1 commit)
  8. 08 Feb 2011 (3 commits)
  9. 04 Feb 2011 (1 commit)
  10. 03 Feb 2011 (10 commits)
    • tracing: Replace syscall_meta_data struct array with pointer array · 3d56e331
      Steven Rostedt committed
      Currently the syscall_meta structures for the syscall tracepoints are
      placed in the __syscall_metadata section, and at link time, the linker
      makes one large array of all these syscall metadata structures. On boot
      up, this array is read (much like the initcall sections) and the syscall
      data is processed.
      
      The problem is that there is no guarantee that gcc will place complex
      structures nicely together in an array format. Two structures in the
      same file may be placed awkwardly, because gcc has no clue that they
      are supposed to be in an array.
      
      A hack was previously used to force the alignment to 4, to pack the
      structures together. But this caused alignment issues with other
      architectures (sparc).
      
      Instead of packing the structures into an array, the structures' addresses
      are now put into the __syscall_metadata section. As pointers always have
      natural alignment, gcc should always pack them tightly together
      (otherwise initcall, extable, etc would also fail).
      
      By having the pointers to the structures in the section, we can still
      iterate the trace_events without causing unnecessary alignment problems
      with other architectures, or depending on the current behaviour of
      gcc that will likely change in the future just to tick us kernel developers
      off a little more.
      
      The __syscall_metadata section is also moved into the .init.data section
      as it is now only needed at boot up.
      Suggested-by: David Miller <davem@davemloft.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
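
      A hedged sketch of the pointer-array idea; the macro name and exact
      attribute spelling below are illustrative, not the kernel's:

        /* The metadata struct itself lives in ordinary data; only its
         * address goes into the __syscall_metadata section, so the section
         * becomes a tightly packed array of naturally aligned pointers. */
        #define SYSCALL_METADATA_PTR(name)                                 \
                static struct syscall_metadata __syscall_meta_##name;      \
                static struct syscall_metadata *__p_syscall_meta_##name    \
                        __attribute__((section("__syscall_metadata"))) =   \
                        &__syscall_meta_##name;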
    • tracepoints: Fix section alignment using pointer array · 65498646
      Mathieu Desnoyers committed
      Make the tracepoints robust against compiler changes by not relying on
      compiler-specific behavior with respect to structure alignment.
      Implement an approach proposed by David Miller:
      use an array of const pointers to refer to the individual structures, and export
      this pointer array through the linker script rather than the structures per se.
      It will consume 32 extra bytes per tracepoint (24 for structure padding and 8
      for the pointers), but the result is less likely to break due to compiler changes.
      
      History:
      
      commit 7e066fb8 ("tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()")
      added the aligned(32) type and variable attribute to the tracepoint structures
      to deal with gcc happily aligning statically defined structures on 32-byte
      multiples.
      
      One attempt was to use an 8-byte alignment for tracepoint structures by applying
      both the variable and type attribute to tracepoint structures definitions and
      declarations. It worked fine with gcc 4.5.1, but broke with gcc 4.4.4 and 4.4.5.
      
      The reason is that the "aligned" attribute only specifies the _minimum_
      alignment for a structure, leaving both the compiler and the linker free
      to align on larger multiples. Because tracepoint.c expects the structures
      to be placed as an array within each section, up-alignment causes
      NULL-pointer exceptions due to the extra unexpected padding.
      
      (this patch applies on top of -tip)
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      LKML-Reference: <20110126222622.GA10794@Krystal>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
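
      A hedged sketch of how boot code walks such a section of pointers; the
      section symbols and the handler name are placeholders:

        extern struct tracepoint * const __start___tracepoints_ptrs[];
        extern struct tracepoint * const __stop___tracepoints_ptrs[];

        static void walk_tracepoints(void)
        {
                struct tracepoint * const *p;

                /* Each entry is just a pointer, so iteration never depends
                 * on how the compiler padded the structures themselves. */
                for (p = __start___tracepoints_ptrs;
                     p < __stop___tracepoints_ptrs; p++)
                        handle_tracepoint(*p);   /* placeholder handler */
        }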
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Mike Galbraith committed
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing the
      caller of yield_to() to accelerate another thread in its thread group or
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target to be selected.  We can rely on pick_next_entity to keep things
      fair, so no one can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
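
      A hedged usage sketch; the caller and target names are hypothetical,
      and the bool return (whether the yield took effect) is assumed from the
      description above:

        /* E.g. a virtualization host boosting the vcpu thread that holds a
         * lock instead of spinning on it.  'false' = hint only, do not force
         * preemption of the current task. */
        static void boost_lock_holder(struct task_struct *vcpu_task)
        {
                if (!yield_to(vcpu_task, false))
                        cpu_relax();    /* target not eligible right now */
        }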
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Rik van Riel committed
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processen in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
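
      A hedged sketch of the selection order only; fair_enough() is a
      placeholder for the real eligibility test (wakeup_preempt_entity) and
      the yield/skip handling is simplified:

        static struct sched_entity *pick_buddy(struct cfs_rq *cfs_rq,
                                               struct sched_entity *leftmost)
        {
                struct sched_entity *se = leftmost;

                /* 1) yield: if leftmost asked to be skipped, consider the
                 *    second-leftmost entity instead (elided here). */
                /* 2) last: prefer giving the CPU back to a preempted task. */
                if (cfs_rq->last && fair_enough(cfs_rq->last, leftmost))
                        se = cfs_rq->last;
                /* 3) next: checked last, so it can override the yield hint. */
                if (cfs_rq->next && fair_enough(cfs_rq->next, leftmost))
                        se = cfs_rq->next;

                return se;
        }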
    • sched: Limit the scope of clear_buddies · 2c13c919
      Rik van Riel committed
      The clear_buddies function does not seem to play well with the concept
      of hierarchical runqueues.  In the following tree, task groups are
      represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
      
           (nl)
          /    \
         G(nl)  G
         / \     \
       T(l) T(n)  T
      
      This situation can arise when a task is woken up T(n), and the previously
      running task T(l) is marked last.
      
      When clear_buddies is called from either T(l) or T(n), the next and last
      buddies of the group G(nl) will be cleared.  This is not the desired
      result, since we would like to be able to find the other type of buddy
      in many cases.
      
      This is especially a worry when implementing yield_task_fair through the
      buddy system.
      
      The fix is simple: only clear the buddy type that the task itself
      is indicated to be.  As an added bonus, we stop walking up the tree
      when the buddy has already been cleared or pointed elsewhere.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
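
      A hedged sketch of the narrowed clearing, shown for the 'next' buddy;
      the 'last' case is symmetric:

        static void __clear_buddies_next(struct sched_entity *se)
        {
                for_each_sched_entity(se) {
                        struct cfs_rq *cfs_rq = cfs_rq_of(se);

                        /* Stop as soon as a level no longer points at us, so
                         * buddies set for other sub-trees stay intact. */
                        if (cfs_rq->next != se)
                                break;
                        cfs_rq->next = NULL;
                }
        }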
    • sched: Check the right ->nr_running in yield_task_fair() · 725e7580
      Rik van Riel committed
      With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
      Yielding to a task from another cfs_rq may be worthwhile, since
      a process calling yield typically cannot use the CPU right now.
      
      Therefore, we want to check the per-cpu nr_running, not the
      cgroup local one.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
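
      A hedged sketch of the check; the surrounding yield logic is elided:

        static void yield_task_fair(struct rq *rq)
        {
                /* With FAIR_GROUP_SCHED the group-local cfs_rq->nr_running
                 * can be 1 even though other tasks are runnable on this cpu,
                 * so test the per-cpu count instead of the cgroup-local one. */
                if (unlikely(rq->nr_running == 1))
                        return;

                /* ... yield via the buddy mechanism ... */
        }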
    • sched: Fix update_curr_rt() · 06c3bc65
      Peter Zijlstra committed
      cpu_stopper_thread()
        migration_cpu_stop()
          __migrate_task()
            deactivate_task()
              dequeue_task()
                dequeue_task_rq()
                  update_curr_rt()
      
      Will call update_curr_rt() on rq->curr, which at that time is
      rq->stop. The problem is that rq->stop.prio matches an RT prio, so the
      code falsely assumes it's an rt_sched_class task.
      Reported-Debugged-Tested-Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Cc: stable@kernel.org # .37
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
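
      A hedged sketch of the fix, assuming an early bail-out at the top of
      update_curr_rt():

        static void update_curr_rt(struct rq *rq)
        {
                struct task_struct *curr = rq->curr;

                /* rq->stop has an RT prio but is not an rt_sched_class task;
                 * check the class, not the prio, to avoid the false positive. */
                if (curr->sched_class != &rt_sched_class)
                        return;

                /* ... update RT runtime accounting for a genuine RT task ... */
        }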
    • perf: Fix reading in perf_event_read() · 542e72fc
      Peter Zijlstra committed
      It is quite possible for the event to have been disabled between
      perf_event_read() sending the IPI and the CPU servicing the IPI and
      calling __perf_event_read(), hence revalidate the state.
      Reported-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
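
      A hedged sketch of the revalidation on the IPI-servicing CPU; the real
      function also updates context timestamps, which is elided here:

        static void __perf_event_read(void *info)
        {
                struct perf_event *event = info;

                /* The event may have been disabled between perf_event_read()
                 * sending the IPI and us running; re-check before touching
                 * the PMU. */
                if (event->state != PERF_EVENT_STATE_ACTIVE)
                        return;

                event->pmu->read(event);
        }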
    • tracing: Replace trace_event struct array with pointer array · e4a9ea5e
      Steven Rostedt committed
      Currently the trace_event structures are placed in the _ftrace_events
      section, and at link time, the linker makes one large array of all
      the trace_event structures. On boot up, this array is read (much like
      the initcall sections) and the events are processed.
      
      The problem is that there is no guarantee that gcc will place complex
      structures nicely together in an array format. Two structures in the
      same file may be placed awkwardly, because gcc has no clue that they
      are supposed to be in an array.
      
      A hack was previously used to force the alignment to 4, to pack the
      structures together. But this caused alignment issues with other
      architectures (sparc).
      
      Instead of packing the structures into an array, the structures' addresses
      are now put into the _ftrace_events section. As pointers always have
      natural alignment, gcc should always pack them tightly together
      (otherwise initcall, extable, etc would also fail).
      
      By having the pointers to the structures in the section, we can still
      iterate the trace_events without causing unnecessary alignment problems
      with other architectures, or depending on the current behaviour of
      gcc that will likely change in the future just to tick us kernel developers
      off a little more.
      
      The _ftrace_events section is also moved into the .init.data section
      as it is now only needed at boot up.
      Suggested-by: David Miller <davem@davemloft.net>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • genirq: Prevent irq storm on migration · f1a06390
      Thomas Gleixner committed
      move_native_irq() masks and unmasks the interrupt line
      unconditionally, but the interrupt line might be masked due to a
      threaded oneshot handler in progress. Unmasking the line in that case
      can lead to interrupt storms. Observed on PREEMPT_RT.
      
      Originally-from: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
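
      A hedged sketch of the idea; the mask/unmask helpers below are
      placeholders for the chip callbacks of that release, not the exact API:

        /* Remember whether the line was already masked (e.g. by a threaded
         * ONESHOT handler) and only mask/unmask it ourselves when it was
         * not, so we never unmask a line that someone else still expects to
         * stay masked. */
        bool masked = desc->status & IRQ_MASKED;

        if (!masked)
                mask_irq_line(desc);            /* placeholder for chip mask */
        move_masked_irq(irq);
        if (!masked)
                unmask_irq_line(desc);          /* placeholder for chip unmask */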
  11. 31 Jan 2011 (4 commits)
  12. 28 Jan 2011 (1 commit)
    • perf: Fix alloc_callchain_buffers() · 88d4f0db
      Eric Dumazet committed
      Commit 927c7a9e ("perf: Fix race in callchains") introduced
      a mismatch in the sizing of struct callchain_cpus_entries.
      
      nr_cpu_ids must be used instead of num_possible_cpus(), or we
      might get out-of-bounds memory accesses on some machines.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Stephane Eranian <eranian@google.com>
      CC: stable@kernel.org
      LKML-Reference: <1295980851.3588.351.camel@edumazet-laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
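
      A hedged sketch of the sizing fix; cpu ids need not be contiguous, so a
      per-cpu pointer array must span nr_cpu_ids slots:

        struct callchain_cpus_entries *entries;
        size_t size;

        /* was: sized with num_possible_cpus(), which undercounts when cpu
         * ids are sparse and lets entries->cpu_entries[cpu] index past the
         * end of the allocation. */
        size = offsetof(struct callchain_cpus_entries, cpu_entries[nr_cpu_ids]);
        entries = kzalloc(size, GFP_KERNEL);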
  13. 27 Jan 2011 (1 commit)
  14. 26 Jan 2011 (7 commits)