1. 30 April 2011, 2 commits
2. 19 April 2011, 1 commit
• next_pidmap: fix overflow condition · c78193e9
  Linus Torvalds authored
next_pidmap() just quietly accepted whatever 'last' pid was passed
in, which is not all that safe when one of the users is /proc.
      
Admittedly the proc code should do some sanity checking on the range
(and that will be the next commit), but that doesn't mean that the
helper functions should just do pidmap pointer arithmetic without
checking the range of their arguments.
      
So clamp 'last' to PID_MAX_LIMIT.  The fact that we then do "last+1"
doesn't really matter; the for-loop properly checks against the end of
the pidmap array (it's only the actual pointer arithmetic overflow
case we need to worry about, and going one bit beyond isn't going to
overflow).
      
      [ Use PID_MAX_LIMIT rather than pid_max as per Eric Biederman ]
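A sketch of the guard being described, with the rest of the scan elided
(signature as described above; details may differ from the final patch):

  int next_pidmap(struct pid_namespace *pid_ns, unsigned int last)
  {
          /* Clamp: reject anything at or past the largest possible pid,
           * so the pidmap pointer arithmetic on "last + 1" below cannot
           * overflow. */
          if (last >= PID_MAX_LIMIT)
                  return -1;

          /* ... the existing pidmap scan, unchanged ... */
  }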
Reported-by: Tavis Ormandy <taviso@cmpxchg8b.com>
Analyzed-by: Robert Święcki <robert@swiecki.net>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c78193e9
3. 18 April 2011, 1 commit
4. 16 April 2011, 2 commits
• block: make unplug timer trace event correspond to the schedule() unplug · 49cac01e
  Jens Axboe authored
It's a pretty close match to what we had before: the timer triggering
meant that nobody unplugged the plug in due time, and in the new
scheme this corresponds closely to what the schedule() unplug now is.
It's essentially the difference between an explicit unplug (IO unplug)
and an implicit unplug (timer unplug: we scheduled with pending IO
queued).
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      49cac01e
• block: let io_schedule() flush the plug inline · a237c1c5
  Jens Axboe authored
Linus correctly observes that the most important dispatch cases
are now done from kblockd, which isn't ideal for latency reasons.
The original reason for switching dispatches out-of-line was to
avoid too deep a stack, so by _only_ guarding the "accidental"
flush directly in schedule() with an offload to kblockd,
we should be able to get the best of both worlds.
      
      So add a blk_schedule_flush_plug() that offloads to kblockd,
      and only use that from the schedule() path.
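A sketch of what such a helper can look like, assuming a
blk_flush_plug_list(plug, from_schedule) primitive whose second argument
requests the kblockd offload:

  static inline void blk_schedule_flush_plug(struct task_struct *tsk)
  {
          struct blk_plug *plug = tsk->plug;

          /* true == "called from schedule()": punt the dispatch to
           * kblockd instead of adding stack depth right here */
          if (plug)
                  blk_flush_plug_list(plug, true);
  }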
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      a237c1c5
5. 15 April 2011, 1 commit
6. 13 April 2011, 1 commit
7. 12 April 2011, 4 commits
8. 11 April 2011, 3 commits
• sched: Fix erroneous all_pinned logic · b30aef17
  Ken Chen authored
The scheduler load balancer has specific code to deal with cases where the
system is unbalanced due to lots of unmovable tasks (for example because of
hard CPU affinity). In those situations, it excludes the busiest CPU that
has pinned tasks from load-balance consideration so that it can perform a
second load-balance pass on the rest of the system.

This all works as designed if there is only one cgroup in the system.

However, when we have multiple cgroups, this logic has false positives and
triggers multiple load-balance passes even though there are actually no
pinned tasks at all.

The reason it has false positives is that the all-pinned logic sits deep in
the lowest-level function, can_migrate_task(), and is too low level:

load_balance_fair() iterates over each task group and calls balance_tasks()
to migrate the target load. Along the way, balance_tasks() also sets an
all_pinned variable. Given that task groups are iterated, this all_pinned
variable ends up reflecting only the status of the last group scanned. A
task group can fail to migrate any load for a number of reasons, none of
them due to CPU affinity. Yet this status bit is propagated back up to the
higher-level load_balance(), which incorrectly concludes that no tasks
could be moved. It then kicks off the all-pinned logic and starts multiple
passes attempting to move load onto the puller CPU.

To fix this, move the all_pinned aggregation up to the iterator level.
This ensures that the status is aggregated over all task groups, not just
the last one in the list.
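A sketch of the aggregation at the iterator level (function names follow
the kernel of that era, arguments trimmed for brevity):

  *all_pinned = 1;        /* set once, before scanning any task group */

  for_each_leaf_cfs_rq(busiest, busiest_cfs_rq) {
          /* balance_tasks() now clears *all_pinned as soon as any
           * group contains a task allowed to run on this_cpu, so the
           * flag survives only if every group was truly pinned */
          rem_load -= balance_tasks(this_rq, this_cpu, busiest, rem_load,
                                    sd, idle, all_pinned, busiest_cfs_rq);
          if (rem_load <= 0)
                  break;
  }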
Signed-off-by: Ken Chen <kenchen@google.com>
Cc: stable@kernel.org
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/BANLkTi=ernzNawaR5tJZEsV_QVnfxqXmsQ@mail.gmail.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b30aef17
• sched: Fix sched-domain avg_load calculation · b0432d8f
  Ken Chen authored
In find_busiest_group(), the sched-domain avg_load isn't calculated at
all if there is a group imbalance within the domain. This causes an
erroneous imbalance calculation.

The reason is that calculate_imbalance() sees sds->avg_load == 0 and
dumps the entire sds->max_load into the imbalance variable, which is
later used to migrate the entire load from the busiest CPU to the
puller CPU.

This has two really bad effects:

1. a stampede of task migrations, with no way to break out of the bad
   state because of a positive feedback loop: large load delta ->
   heavier load migration -> larger imbalance, and the cycle goes on.

2. severe imbalance in CPU queue depth. This causes really long
   scheduling latency blips, which badly affect applications with
   tight latency requirements.

The fix is to have the kernel calculate the domain avg_load in both
cases. This ensures that the imbalance calculation is always sensible
and that the target is usually halfway between the busiest and the
puller CPU.
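A sketch of the fix in find_busiest_group(): compute the domain average
unconditionally, before any early exit for group imbalance (simplified
from the code of that era):

  /* always derive the domain average first */
  sds.avg_load = (SCHED_LOAD_SCALE * sds.total_load) / sds.total_pwr;

  /*
   * ... so that even when we take the group-imbalance path below,
   * calculate_imbalance() works against a sensible target instead of
   * treating avg_load == 0 as "migrate the entire max_load".
   */
  if (sds.group_imb)
          goto force_balance;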
Signed-off-by: Ken Chen <kenchen@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: <stable@kernel.org>
Link: http://lkml.kernel.org/r/20110408002322.3A0D812217F@elm.corp.google.com
Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b0432d8f
• perf_event: Fix cgrp event scheduling bug in perf_enable_on_exec() · e566b76e
  Stephane Eranian authored
There is a bug in perf_event_enable_on_exec() when cgroup events are
active on a CPU: the cgroup events may be scheduled twice, causing
event state corruption that may eventually lead to kernel panics.

The reason is that the function needs to first schedule out the cgroup
events, just like the per-thread events. The cgroup events are then
scheduled back in automatically from the perf_event_context_sched_in()
function.

The patch also adds a WARN_ON_ONCE() in perf_cgroup_switch() to catch
any bogus state.
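A sketch of the shape of the fix (helper names assumed from the perf
cgroup code of that era):

  static void perf_event_enable_on_exec(struct perf_event_context *ctx)
  {
          /* Schedule out cgroup events first, just like per-thread
           * events; they are scheduled back in later via
           * perf_event_context_sched_in(), so without this they would
           * be scheduled in twice. */
          perf_cgroup_sched_out(current);

          /* ... existing enable-on-exec processing ... */
  }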
Signed-off-by: Stephane Eranian <eranian@google.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/r/20110406005454.GA1062@quad
Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e566b76e
9. 09 April 2011, 1 commit
10. 05 April 2011, 7 commits
• sched: Clean up rebalance_domains() load-balance interval calculation · 49c022e6
  Peter Zijlstra authored
Instead of the possible multiple evaluation of num_online_cpus()
in rebalance_domains() that Linus reported, avoid it altogether
in the normal case, since it's implemented with a Hamming-weight
function over a cpu bitmask, which can be darn expensive for those
with big iron.

This also makes the code cleaner and smaller, and documents it better.
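A sketch of the approach (identifier names are assumptions): cache the
scaled limit once and refresh it only on CPU hotplug, instead of taking
the Hamming weight of the online mask on every rebalance:

  static unsigned long __read_mostly max_load_balance_interval = HZ/10;

  /* refreshed from the CPU hotplug callbacks */
  void update_max_interval(void)
  {
          max_load_balance_interval = HZ * num_online_cpus() / 10;
  }

  /* rebalance_domains() then reduces to a self-documenting clamp: */
  interval = clamp(interval, 1UL, max_load_balance_interval);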
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <1301991265.2225.12.camel@twins>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
      49c022e6
• kernel/signal.c: add kernel-doc notation to syscalls · 41c57892
  Randy Dunlap authored
      Add kernel-doc to syscalls in signal.c.
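For illustration, a kernel-doc block for a syscall has this shape (an
invented example, not a line from the patch):

  /**
   *  sys_kill - send a signal to a process
   *  @pid: the PID of the process
   *  @sig: signal to be sent
   */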
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      41c57892
• kernel/signal.c: fix typos and coding style · 5aba085e
  Randy Dunlap authored
      General coding style and comment fixes; no code changes:
      
 - Use multi-line-comment coding style (see the sketch after this list).
       - Put some function signatures completely on one line.
       - Hyphenate some words.
       - Spell Posix as POSIX.
       - Correct typos & spellos in some comments.
       - Drop trailing whitespace.
       - End sentences with periods.
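For reference, the multi-line comment style the first item refers to
(illustrative):

  /*
   * Kernel-style multi-line comments open with a lone slash-star,
   * continue with aligned asterisks, and close on their own line.
   */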
Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5aba085e
• jump label: Introduce static_branch() interface · d430d3d7
  Jason Baron authored
      Introduce:
      
      static __always_inline bool static_branch(struct jump_label_key *key);
      
      instead of the old JUMP_LABEL(key, label) macro.
      
      In this way, jump labels become really easy to use:
      
      Define:
      
              struct jump_label_key jump_key;
      
      Can be used as:
      
              if (static_branch(&jump_key))
                      do unlikely code
      
enable/disable via:
      
              jump_label_inc(&jump_key);
              jump_label_dec(&jump_key);
      
      that's it!
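Put together, a minimal self-contained sketch (do_trace_work() and the
enable/disable wrappers are stand-in names):

  #include <linux/jump_label.h>

  static struct jump_label_key trace_key;

  static void do_trace_work(void) { /* unlikely slow path */ }

  void hot_path(void)
  {
          if (static_branch(&trace_key))  /* a nop until enabled */
                  do_trace_work();
  }

  void enable_tracing(void)  { jump_label_inc(&trace_key); }
  void disable_tracing(void) { jump_label_dec(&trace_key); }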
      
      For the jump labels disabled case, the static_branch() becomes an
      atomic_read(), and jump_label_inc()/dec() are simply atomic_inc(),
      atomic_dec() operations. We show testing results for this change below.
      
      Thanks to H. Peter Anvin for suggesting the 'static_branch()' construct.
      
      Since we now require a 'struct jump_label_key *key', we can store a pointer into
      the jump table addresses. In this way, we can enable/disable jump labels, in
      basically constant time. This change allows us to completely remove the previous
      hashtable scheme. Thanks to Peter Zijlstra for this re-write.
      
      Testing:
      
      I ran a series of 'tbench 20' runs 5 times (with reboots) for 3
      configurations, where tracepoints were disabled.
      
      jump label configured in
      avg: 815.6
      
      jump label *not* configured in (using atomic reads)
      avg: 800.1
      
      jump label *not* configured in (regular reads)
      avg: 803.4
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20110316212947.GA8792@redhat.com>
Signed-off-by: Jason Baron <jbaron@redhat.com>
Suggested-by: H. Peter Anvin <hpa@linux.intel.com>
Tested-by: David Daney <ddaney@caviumnetworks.com>
Acked-by: Ralf Baechle <ralf@linux-mips.org>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      d430d3d7
• tracing: Avoid soft lockup in trace_pipe · ee5e51f5
  Jiri Olsa authored
Running the following commands:
      
        # enable the binary option
        echo 1 > ./options/bin
        # disable context info option
        echo 0 > ./options/context-info
        # tracing only events
        echo 1 > ./events/enable
        cat trace_pipe
      
while forcing the system to generate many tracing events causes a
lockup (on non-preemptive kernels) inside the tracing_read_pipe
function.

The issue is also easily reproduced by running the LTP stress test
(ftrace_stress_test.sh).
      
The reasons are:
 - the bin/hex/raw output functions for events are set to the
   trace_nop_print function, which prints nothing and returns
   the TRACE_TYPE_HANDLED value
 - the LOST EVENT trace does not handle trace_seq overflow

These reasons force the while loop in tracing_read_pipe never
to break.

The attached patch fixes handling of the lost-event trace, and
changes trace_nop_print to print the minimal info needed for
correct tracing_read_pipe processing.
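A sketch along the lines the changelog describes (simplified from the
trace output code): propagate trace_seq overflow instead of claiming
the line was handled.

  /* lost-event header: report overflow to the caller */
  if (iter->lost_events &&
      !trace_seq_printf(s, "CPU:%d [LOST %lu EVENTS]\n",
                        iter->cpu, iter->lost_events))
          return TRACE_TYPE_PARTIAL_LINE;

  /* trace_nop_print(): emit minimal info rather than nothing */
  if (!trace_seq_printf(&iter->seq, "type: %d\n", iter->ent->type))
          return TRACE_TYPE_PARTIAL_LINE;
  return TRACE_TYPE_HANDLED;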
      
v2 changes:
 - drop the cond_resched changes, superseded by the trace_nop_print
   changes
 - change WARN to WARN_ONCE and add info to make it possible to
   find the culprit
      
v3 changes:
 - make the patch comment more accurate
Signed-off-by: Jiri Olsa <jolsa@redhat.com>
LKML-Reference: <20110325110518.GC1922@jolsa.brq.redhat.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      ee5e51f5
• tracing: Print trace_bprintk() formats for modules too · 1813dc37
  Steven Rostedt authored
The file debugfs/tracing/printk_formats maps addresses to the formats
used by trace_bprintk(), so that userspace tools reading the ring
buffer directly can decode trace_bprintk events and recover the saved
format.

This is needed because trace_bprintk() does not store the format in
the buffer, only the address of the format, which is hidden in kernel
memory.
      
But currently it only exports trace_bprintk() formats from the kernel
core, not from modules. Module formats need to be exported as well.
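For illustration, the emitting side of one printk_formats entry can be
sketched like this (a seq_file show routine; the real code also escapes
newlines and tabs in the format string):

  static int t_show(struct seq_file *m, void *v)
  {
          const char **fmt = v;

          /* address first, then the string userspace maps it to */
          seq_printf(m, "0x%lx : \"%s\"\n", (unsigned long)*fmt, *fmt);
          return 0;
  }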
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      1813dc37
• tracing: Convert trace_printk() formats for module to const char * · 0588fa30
  Steven Rostedt authored
The trace_printk() formats for modules do not show up in the
debugfs/tracing/printk_formats file; only the formats for
trace_printk()s in the kernel core do.

To facilitate adding trace_printk() formats from modules to that
file as well, we need to convert the structure that holds the
formats from char fmt[] to const char *fmt, and allocate the
strings separately.
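A sketch of the structure change being described (field layout and the
registration helper are assumptions):

  struct trace_bprintk_fmt {
          struct list_head list;
          const char *fmt;        /* was: char fmt[]; the string is now
                                   * allocated separately */
  };

  /* registration then takes a private copy of the module's format */
  static int register_bprintk_fmt(struct trace_bprintk_fmt *tb_fmt,
                                  const char *fmt)
  {
          tb_fmt->fmt = kstrdup(fmt, GFP_KERNEL);
          return tb_fmt->fmt ? 0 : -ENOMEM;
  }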
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      0588fa30
11. 04 April 2011, 1 commit
12. 03 April 2011, 1 commit
13. 01 April 2011, 1 commit
14. 31 March 2011, 5 commits
15. 30 March 2011, 3 commits
16. 29 March 2011, 6 commits