1. 04 March 2011, 3 commits
    • sched: Allow users with sufficient RLIMIT_NICE to change from SCHED_IDLE policy · c02aa73b
      Committed by Darren Hart
      The current scheduler implementation returns -EPERM when trying to
      change from SCHED_IDLE to SCHED_OTHER or SCHED_BATCH. Since SCHED_IDLE
      is considered to be a nice 20 on steroids, changing to another policy
      should be allowed provided the RLIMIT_NICE is accounted for.
      
      This patch allows the following test-case to pass with RLIMIT_NICE=40,
      but still fail with RLIMIT_NICE=10 when the calling process is run
      from a typical shell (nice 0, or 20 in rlimit terms).
      
      #define _GNU_SOURCE	/* for SCHED_IDLE / SCHED_BATCH in <sched.h> */
      #include <stdio.h>
      #include <sched.h>
      
      int main(void)
      {
      	int ret;
      	struct sched_param sp;
      	sp.sched_priority = 0;
      
      	/* switch to SCHED_IDLE */
      	ret = sched_setscheduler(0, SCHED_IDLE, &sp);
      	printf("setscheduler IDLE: %d\n", ret);
      	if (ret) return ret;
      
      	/* switch back to SCHED_OTHER */
      	ret = sched_setscheduler(0, SCHED_OTHER, &sp);
      	printf("setscheduler OTHER: %d\n", ret);
      
      	return ret;
      }
      
       $ ulimit -e
       40
       $ ./test
       setscheduler IDLE: 0
       setscheduler OTHER: 0
      
       $ ulimit -e 10
       $ ulimit -e
       10
       $ ./test
       setscheduler IDLE: 0
       setscheduler OTHER: -1
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
      LKML-Reference: <4D657BEE.4040608@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c02aa73b
    • sched: Allow SCHED_BATCH to preempt SCHED_IDLE tasks · a2f5c9ab
      Committed by Darren Hart
      Perform the test for SCHED_IDLE before testing for SCHED_BATCH (and
      ensure idle tasks don't preempt idle tasks) so that the
      non-interactive, but still important, SCHED_BATCH tasks run in
      preference to the very low priority SCHED_IDLE tasks.
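      
      A minimal userspace sketch of the intended ordering (illustrative
      only; the policy constants are the standard uapi ones, but the
      helper below is not the kernel's check_preempt_wakeup() code):
      
      #define _GNU_SOURCE	/* SCHED_BATCH / SCHED_IDLE in <sched.h> */
      #include <sched.h>
      #include <stdbool.h>
      #include <stdio.h>
      
      /* Should a freshly woken task preempt the currently running one? */
      static bool wakeup_preempts(int curr_policy, int wakee_policy)
      {
      	/* Anything that is not idle preempts a SCHED_IDLE task ... */
      	if (curr_policy == SCHED_IDLE && wakee_policy != SCHED_IDLE)
      		return true;
      	/* ... but batch and idle wakers never preempt a normal task. */
      	if (wakee_policy == SCHED_BATCH || wakee_policy == SCHED_IDLE)
      		return false;
      	return true;
      }
      
      int main(void)
      {
      	printf("BATCH preempts IDLE: %d\n",
      	       wakeup_preempts(SCHED_IDLE, SCHED_BATCH));
      	printf("IDLE preempts BATCH: %d\n",
      	       wakeup_preempts(SCHED_BATCH, SCHED_IDLE));
      	return 0;
      }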
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Mike Galbraith <efault@gmx.de>
      Cc: Richard Purdie <richard.purdie@linuxfoundation.org>
      LKML-Reference: <1298408674-3130-2-git-send-email-dvhart@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      a2f5c9ab
    • sched: Fix sched rt group scheduling when hierachy is enabled · 0c3b9168
      Committed by Balbir Singh
      The current sched rt code is broken when it comes to hierarchical
      scheduling; this patch fixes two problems:
      
      1. It adds redundant enqueuing (harmless) when it finds a queue
         that has tasks enqueued, but the queue has no run time and is
         not throttled.
      
      2. The most important change is in sched_rt_rq_enqueue/dequeue.
         The code just picks the rt_rq belonging to the current cpu
         on which the period timer runs; the patch fixes it so that
         the correct rt_se is enqueued/dequeued.
      
      Tested with a simple hierarchy:
      
      /c/d, with c and d assigned similar runtimes of 50,000 and a
      while(1) loop running within "d". Both c and d get throttled.
      Without the patch, the task just stops running and never resumes
      (depending on where the sched_rt bandwidth timer runs). With the
      patch, the task is throttled and runs as expected.
      
      [ bharata: suggestions on how to pick the rt_se belonging to the
        rt_rq and the correct cpu ]
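      
      The enqueue-side fix has roughly the following shape (a simplified
      sketch based on the description above, not a verified copy of the
      patch; rq_of_rt_rq() and cpu_of() are existing sched.c helpers):
      
      	struct rq *rq = rq_of_rt_rq(rt_rq);
      
      	/* Pick the rt_se that belongs to this rt_rq's CPU, not to the
      	 * CPU the bandwidth period timer happens to be running on. */
      	rt_se = rt_rq->tg->rt_se[cpu_of(rq)];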
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Acked-by: Bharata B Rao <bharata@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <20110303113435.GA2868@balbir.in.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0c3b9168
  2. 03 March 2011, 1 commit
    • blktrace: Remove blk_fill_rwbs_rq. · 2d3a8497
      Committed by Tao Ma
      If we enable trace events to trace block actions, we use
      blk_fill_rwbs_rq to analyze the corresponding actions
      in a request's cmd_flags, but it only looks at the lower 2 bits
      of them, so most of the other flags (e.g. REQ_SYNC) are missing.
      For example, with a sync write we get:
      write_test-2409  [001]   160.013869: block_rq_insert: 3,64 W 0 () 258135 + 8 [write_test]
      
      Since now we have integrated the flags of both bio and request,
      it is safe to pass rq->cmd_flags directly to blk_fill_rwbs and
      blk_fill_rwbs_rq isn't needed any more.
      
      With this patch, after a sync write we get:
      write_test-2417  [000]   226.603878: block_rq_insert: 3,64 WS 0 () 258135 + 8 [write_test]
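      
      The mechanical change at the tracepoint call sites is roughly the
      following (an illustrative sketch, not a verified diff):
      
      	blk_fill_rwbs(__entry->rwbs, rq->cmd_flags, blk_rq_bytes(rq));
      
      replacing the old blk_fill_rwbs_rq(__entry->rwbs, rq) wrapper, so
      the full cmd_flags word reaches blk_fill_rwbs() unchanged.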
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Acked-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      2d3a8497
  3. 26 February 2011, 2 commits
    • clockevents: Prevent oneshot mode when broadcast device is periodic · 3a142a06
      Committed by Thomas Gleixner
      When the per-cpu timer is marked CLOCK_EVT_FEAT_C3STOP, we can only
      switch into oneshot mode when the backup broadcast device supports
      oneshot mode as well. Otherwise we would try to switch the
      broadcast device into an unsupported mode unconditionally. This
      went unnoticed so far, as the currently available broadcast devices
      support oneshot mode. Seth unearthed this problem while debugging
      and working around an hpet-related BIOS wreckage.
      
      Add the necessary check to tick_is_oneshot_available().
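      
      The added logic amounts to something like this inside
      tick_is_oneshot_available() (a simplified sketch of the described
      check, not a verified copy; the broadcast query helper name is
      assumed here):
      
      	if (!dev || !(dev->features & CLOCK_EVT_FEAT_ONESHOT))
      		return 0;
      	if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
      		return 1;
      	/* The per-cpu device stops in deep C-states, so oneshot is only
      	 * safe if the broadcast device can run in oneshot mode too. */
      	return tick_broadcast_oneshot_available();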
      Reported-and-tested-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <alpine.LFD.2.00.1102252231200.2701@localhost6.localdomain6>
      Cc: stable@kernel.org # .21 ->
      3a142a06
    • sched: Clean up the IRQ_TIME_ACCOUNTING code · 544b4a1f
      Committed by Venkatesh Pallipadi
      Fix these warnings:
      
        lkml.org/lkml/2011/1/30/124
      
       kernel/sched.c:3719: warning: 'irqtime_account_idle_ticks' defined but not used
       kernel/sched.c:3720: warning: 'irqtime_account_process_tick' defined but not used
      
      In a cleaner way than:
      
       7e949870: sched: Add #ifdef around irq time accounting functions
      
      This patch will not have any functional impact.
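      
      The usual pattern for avoiding such "defined but not used" warnings
      without sprinkling #ifdefs over every caller looks like this (a
      generic sketch of the technique, not the exact hunk from this
      patch):
      
      	#ifdef CONFIG_IRQ_TIME_ACCOUNTING
      	static void irqtime_account_process_tick(struct task_struct *p,
      						 int user_tick, struct rq *rq)
      	{
      		/* real irq-time-aware accounting */
      	}
      	#else
      	static inline void irqtime_account_process_tick(struct task_struct *p,
      							int user_tick,
      							struct rq *rq) { }
      	#endif
      
      Callers can then invoke the function unconditionally; the empty
      inline stub compiles away when the option is disabled.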
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Cc: heiko.carstens@de.ibm.com
      Cc: a.p.zijlstra@chello.nl
      LKML-Reference: <1298675596-10992-1-git-send-email-venki@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      544b4a1f
  4. 25 February 2011, 1 commit
  5. 23 February 2011, 6 commits
  6. 19 February 2011, 2 commits
    • genirq: Disable the SHIRQ_DEBUG call in request_threaded_irq for now · 6d83f94d
      Committed by Thomas Gleixner
      With CONFIG_SHIRQ_DEBUG=y we call a newly installed interrupt handler
      in request_threaded_irq().
      
      The original implementation (commit a304e1b8) called the handler
      _BEFORE_ it was installed, but that caused problems with handlers
      calling disable_irq_nosync(). See commit 377bf1e4.
      
      It's braindead in the first place to call disable_irq_nosync in shared
      handlers, but ....
      
      Moving this call to after the handler has been installed looks
      innocent, but it is subtly broken on SMP.
      
      Interrupt handlers rely on the fact that the irq core prevents
      reentrancy.
      
      Now this debug call violates that promise because we run the handler
      w/o the IRQ_INPROGRESS protection - which we cannot apply here because
      that would result in a possibly forever masked interrupt line.
      
      A concurrent real hardware interrupt on a different CPU results in
      handler reentrancy and can lead to complete wreckage, which was
      unfortunately observed in reality and took a fricking long time to
      debug.
      
      Leave the code here for now. We want this debug feature, but that's
      not easy to fix. We really should get rid of those
      disable_irq_nosync() abusers and remove that function completely.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Anton Vorontsov <avorontsov@ru.mvista.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: stable@kernel.org # .28 -> .37
      6d83f94d
    • genirq: Prevent access beyond allocated_irqs bitmap · c1ee6264
      Committed by Thomas Gleixner
      Lars-Peter Clausen pointed out:
      
         I stumbled upon this while looking through the existing archs using
         SPARSE_IRQ.  Even with SPARSE_IRQ the NR_IRQS is still the upper
         limit for the number of IRQs.
      
         Both PXA and MMP set NR_IRQS to IRQ_BOARD_START, with
         IRQ_BOARD_START being the number of IRQs used by the core.
      
         In various machine files the nr_irqs field of the ARM machine
         definition struct is then set to "IRQ_BOARD_START + NR_BOARD_IRQS".
      
         As a result "nr_irqs" will be greater than NR_IRQS, which in turn
         causes the "allocated_irqs" bitmap in the core irq code to be
         accessed beyond its size, overwriting unrelated data.
      
      The core code really misses a sanity check there.
      
      This went unnoticed so far because, by chance, the compiler/linker
      places data behind that bitmap which gets initialized later on the
      affected platforms.
      
      So the obvious fix would be to add a sanity check in early_irq_init()
      and break all affected platforms. But that check would want to be
      backported to stable as well, which would require fixing all known
      problematic platforms and probably some not yet known ones as well.
      Lots of churn.
      
      A way simpler solution is to allocate a slightly larger bitmap and
      avoid the whole churn w/o breaking anything. Add a few warnings when
      an arch returns utter crap.
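      
      The idea looks roughly like this (a sketch derived from the
      description above, not a verified diff; the exact slack size is
      illustrative):
      
      	/* Oversize the bitmap so an over-large arch-supplied nr_irqs
      	 * cannot scribble past the end; warn and clamp instead. */
      	#define IRQ_BITMAP_BITS		(NR_IRQS + 8196)
      
      	static DECLARE_BITMAP(allocated_irqs, IRQ_BITMAP_BITS);
      
      	/* in early_irq_init() */
      	if (WARN_ON(nr_irqs > IRQ_BITMAP_BITS))
      		nr_irqs = IRQ_BITMAP_BITS;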
      Reported-by: Lars-Peter Clausen <lars@metafoo.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org # .37
      Cc: Haojian Zhuang <haojian.zhuang@marvell.com>
      Cc: Eric Miao <eric.y.miao@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      c1ee6264
  7. 17 February 2011, 3 commits
  8. 16 February 2011, 2 commits
  9. 14 February 2011, 1 commit
  10. 12 February 2011, 2 commits
    • timer debug: Hide kernel addresses via %pK in /proc/timer_list · f5903085
      Committed by Kees Cook
      In the continuing effort to avoid kernel addresses leaking to
      unprivileged users, this patch switches to %pK for
      /proc/timer_list reporting.
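      
      The change is a small switch at the print sites, roughly
      (illustrative; SEQ_printf is the local helper used by
      kernel/time/timer_list.c):
      
      	/* before: exposes the raw kernel address to every reader */
      	SEQ_printf(m, "  .base:       %p\n", base);
      
      	/* after: %pK honours the kptr_restrict sysctl and shows the
      	 * address as zeroes to unprivileged readers */
      	SEQ_printf(m, "  .base:       %pK\n", base);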
      Signed-off-by: Kees Cook <kees.cook@canonical.com>
      Cc: John Stultz <johnstul@us.ibm.com>
      Cc: Dan Rosenberg <drosenberg@vsecurity.com>
      Cc: Eugene Teo <eugeneteo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110212032125.GA23571@outflux.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f5903085
    • ptrace: use safer wake up on ptrace_detach() · 01e05e9a
      Committed by Tejun Heo
      The wake_up_process() call in ptrace_detach() is spurious and not
      interlocked with the tracee state.  IOW, the tracee could be running or
      sleeping in any place in the kernel by the time wake_up_process() is
      called.  This can lead to the tracee waking up unexpectedly which can be
      dangerous.
      
      The wake_up is spurious and should be removed but for now reduce its
      toxicity by only waking up if the tracee is in TRACED or STOPPED state.
      
      This bug can possibly be used as an attack vector.  I don't think it
      will take too much effort to come up with an attack which triggers oops
      somewhere.  Most sleeps are wrapped in condition test loops and should
      be safe but we have quite a number of places where sleep and wakeup
      conditions are expected to be interlocked.  Although the window of
      opportunity is tiny, ptrace can be used by non-privileged users and with
      some loading the window can definitely be extended and exploited.
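      
      The mitigation described above boils down to something like the
      following (a sketch, not a verified diff):
      
      	-	wake_up_process(child);
      	+	wake_up_state(child, TASK_TRACED | TASK_STOPPED);
      
      wake_up_state() only wakes the task if it is currently in one of
      the given states, so a tracee sleeping elsewhere in the kernel is
      left alone.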
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Roland McGrath <roland@redhat.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      01e05e9a
  11. 11 February 2011, 2 commits
  12. 10 February 2011, 1 commit
  13. 08 February 2011, 3 commits
  14. 04 February 2011, 1 commit
  15. 03 February 2011, 10 commits
    • tracing: Replace syscall_meta_data struct array with pointer array · 3d56e331
      Committed by Steven Rostedt
      Currently the syscall_meta structures for the syscall tracepoints are
      placed in the __syscall_metadata section, and at link time, the linker
      makes one large array of all these syscall metadata structures. On boot
      up, this array is read (much like the initcall sections) and the syscall
      data is processed.
      
      The problem is that there is no guarantee that gcc will place complex
      structures nicely together in an array format. Two structures in the
      same file may be placed awkwardly, because gcc has no clue that they
      are supposed to be in an array.
      
      A hack was previously used to force the alignment to 4, to pack the
      structures together. But this caused alignment issues with other
      architectures (sparc).
      
      Instead of packing the structures into an array, the structures'
      addresses are now put into the __syscall_metadata section. As
      pointers always have natural alignment, gcc should always pack them
      tightly together (otherwise initcall, extable, etc. would also fail).
      
      By having the pointers to the structures in the section, we can still
      iterate the trace_events without causing unnecessary alignment problems
      with other architectures, or depending on the current behaviour of
      gcc that will likely change in the future just to tick us kernel developers
      off a little more.
      
      The __syscall_metadata section is also moved into the .init.data section
      as it is now only needed at boot up.
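      
      The same trick can be demonstrated in plain userspace C. The sketch
      below is illustrative only (the names are made up; the GNU linker
      provides the __start_/__stop_ symbols for any section whose name is
      a valid C identifier):
      
      #include <stdio.h>
      
      struct meta { const char *name; int nb_args; };
      
      /* Emit a pointer to each metadata struct into a named section; the
       * pointers always pack tightly, the structs themselves might not. */
      #define DEFINE_META(sym, args)					\
      	static struct meta __meta_##sym = { #sym, args };		\
      	static struct meta *__ptr_##sym				\
      	__attribute__((section("my_metadata"), used)) = &__meta_##sym
      
      DEFINE_META(sys_read, 3);
      DEFINE_META(sys_open, 3);
      
      extern struct meta *__start_my_metadata[];
      extern struct meta *__stop_my_metadata[];
      
      int main(void)
      {
      	struct meta **p;
      
      	for (p = __start_my_metadata; p < __stop_my_metadata; p++)
      		printf("%s: %d args\n", (*p)->name, (*p)->nb_args);
      	return 0;
      }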
      Suggested-by: David Miller <davem@davemloft.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      3d56e331
    • tracepoints: Fix section alignment using pointer array · 65498646
      Committed by Mathieu Desnoyers
      Make the tracepoints more robust, making them solid enough to handle compiler
      changes by not relying on anything based on compiler-specific behavior with
      respect to structure alignment. Implement an approach proposed by David Miller:
      use an array of const pointers to refer to the individual structures, and export
      this pointer array through the linker script rather than the structures per se.
      It will consume 32 extra bytes per tracepoint (24 for structure padding
      and 8 for the pointers), but is less likely to break due to compiler
      changes.
      
      History:
      
      commit 7e066fb8 tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()
      added the aligned(32) type and variable attribute to the tracepoint structures
      to deal with gcc happily aligning statically defined structures on 32-byte
      multiples.
      
      One attempt was to use an 8-byte alignment for tracepoint structures by
      applying both the variable and type attribute to tracepoint structure
      definitions and declarations. It worked fine with gcc 4.5.1, but broke
      with gcc 4.4.4 and 4.4.5.
      
      The reason is that the "aligned" attribute only specifies the _minimum_
      alignment for a structure, leaving both the compiler and the linker free
      to align on larger multiples. Because tracepoint.c expects the structures
      to be placed as an array within each section, up-alignment causes
      NULL-pointer exceptions due to the extra unexpected padding.
      
      (this patch applies on top of -tip)
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      LKML-Reference: <20110126222622.GA10794@Krystal>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      65498646
    • sched: Add yield_to(task, preempt) functionality · d95f4122
      Committed by Mike Galbraith
      Currently only implemented for fair class tasks.
      
      Add a yield_to_task() method to the fair scheduling class, allowing the
      caller of yield_to() to accelerate another thread in its thread group /
      task group.
      
      Implemented via a scheduler hint, using cfs_rq->next to encourage the
      target to be selected.  We can rely on pick_next_entity to keep things
      fair, so no one can accelerate a thread that has already used its fair
      share of CPU time.
      
      This also means callers should only call yield_to when they really
      mean it.  Calling it too often can result in the scheduler just
      ignoring the hint.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095051.4ddb7738@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d95f4122
    • sched: Use a buddy to implement yield_task_fair() · ac53db59
      Committed by Rik van Riel
      Use the buddy mechanism to implement yield_task_fair.  This
      allows us to skip onto the next highest priority se at every
      level in the CFS tree, unless doing so would introduce gross
      unfairness in CPU time distribution.
      
      We order the buddy selection in pick_next_entity to check
      yield first, then last, then next.  We need next to be able
      to override yield, because it is possible for the "next" and
      "yield" task to be different processen in the same sub-tree
      of the CFS tree.  When they are, we need to go into that
      sub-tree regardless of the "yield" hint, and pick the correct
      entity once we get to the right level.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201095103.3a79e92a@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ac53db59
    • sched: Limit the scope of clear_buddies · 2c13c919
      Committed by Rik van Riel
      The clear_buddies function does not seem to play well with the concept
      of hierarchical runqueues.  In the following tree, task groups are
      represented by 'G', tasks by 'T', next by 'n' and last by 'l'.
      
           (nl)
          /    \
         G(nl)  G
         / \     \
       T(l) T(n)  T
      
      This situation can arise when a task T(n) is woken up and the previously
      running task T(l) is marked last.
      
      When clear_buddies is called from either T(l) or T(n), the next and last
      buddies of the group G(nl) will be cleared.  This is not the desired
      result, since we would like to be able to find the other type of buddy
      in many cases.
      
      This is especially a worry when implementing yield_task_fair through the
      buddy system.
      
      The fix is simple: only clear the buddy type that the task itself
      is indicated to be.  As an added bonus, we stop walking up the tree
      when the buddy has already been cleared or pointed elsewhere.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094837.6b0962a9@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2c13c919
    • sched: Check the right ->nr_running in yield_task_fair() · 725e7580
      Committed by Rik van Riel
      With CONFIG_FAIR_GROUP_SCHED, each task_group has its own cfs_rq.
      Yielding to a task from another cfs_rq may be worthwhile, since
      a process calling yield typically cannot use the CPU right now.
      
      Therefore, we want to check the per-cpu nr_running, not the
      cgroup-local one.
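      
      The fix is essentially a one-liner in yield_task_fair(), roughly (a
      sketch based on the description above, not a verified diff):
      
      	/* Nothing else runnable on this CPU at all? Yielding is
      	 * pointless then. */
      	if (unlikely(rq->nr_running == 1))
      		return;
      
      where the old code tested cfs_rq->nr_running, i.e. only the tasks
      in the caller's own group-local runqueue.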
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20110201094715.798c4f86@annuminas.surriel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      725e7580
    • sched: Fix update_curr_rt() · 06c3bc65
      Committed by Peter Zijlstra
      cpu_stopper_thread()
        migration_cpu_stop()
          __migrate_task()
            deactivate_task()
              dequeue_task()
                dequeue_task_rt()
                  update_curr_rt()
      
      This will call update_curr_rt() on rq->curr, which at that time is
      rq->stop. The problem is that rq->stop.prio matches an RT prio, so
      the code falsely assumes it is an rt_sched_class task.
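      
      The guard added for this is roughly (a sketch, not a verified
      diff):
      
      	/* in update_curr_rt() */
      	if (curr->sched_class != &rt_sched_class)
      		return;
      
      so the RT runtime accounting is skipped when "current" is the stop
      task rather than a genuine rt_sched_class task.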
      Reported-Debugged-Tested-Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Cc: stable@kernel.org # .37
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      06c3bc65
    • perf: Fix reading in perf_event_read() · 542e72fc
      Committed by Peter Zijlstra
      It is quite possible for the event to have been disabled between
      perf_event_read() sending the IPI and the CPU servicing the IPI and
      calling __perf_event_read(), hence revalidate the state.
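      
      The revalidation amounts to an early-out in __perf_event_read(),
      roughly (a sketch based on the description above, not a verified
      diff):
      
      	/* The event may have been disabled (and moved off this CPU)
      	 * between sending the IPI and running this handler. */
      	if (event->state != PERF_EVENT_STATE_ACTIVE)
      		return;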
      Reported-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      542e72fc
    • tracing: Replace trace_event struct array with pointer array · e4a9ea5e
      Committed by Steven Rostedt
      Currently the trace_event structures are placed in the _ftrace_events
      section, and at link time, the linker makes one large array of all
      the trace_event structures. On boot up, this array is read (much like
      the initcall sections) and the events are processed.
      
      The problem is that there is no guarantee that gcc will place complex
      structures nicely together in an array format. Two structures in the
      same file may be placed awkwardly, because gcc has no clue that they
      are supposed to be in an array.
      
      A hack was previously used to force the alignment to 4, to pack the
      structures together. But this caused alignment issues with other
      architectures (sparc).
      
      Instead of packing the structures into an array, the structures'
      addresses are now put into the _ftrace_event section. As pointers
      always have natural alignment, gcc should always pack them tightly
      together (otherwise initcall, extable, etc. would also fail).
      
      By having the pointers to the structures in the section, we can still
      iterate the trace_events without causing unnecessary alignment problems
      with other architectures, or depending on the current behaviour of
      gcc that will likely change in the future just to tick us kernel developers
      off a little more.
      
      The _ftrace_event section is also moved into the .init.data section
      as it is now only needed at boot up.
      Suggested-by: David Miller <davem@davemloft.net>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      e4a9ea5e
    • genirq: Prevent irq storm on migration · f1a06390
      Committed by Thomas Gleixner
      move_native_irq() masks and unmasks the interrupt line
      unconditionally, but the interrupt line might be masked due to a
      threaded oneshot handler in progress. Unmasking the line in that case
      can lead to interrupt storms. Observed on PREEMPT_RT.
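      
      The fix preserves the existing mask state instead of unmasking
      unconditionally, roughly (a simplified sketch of the described
      move_native_irq() change, not a verified copy):
      
      	bool masked = desc->status & IRQ_MASKED;
      
      	/* Only mask if the line is not already masked, e.g. by a
      	 * threaded oneshot handler that is still in progress ... */
      	if (!masked)
      		desc->irq_data.chip->irq_mask(&desc->irq_data);
      	move_masked_irq(irq);
      	/* ... and only unmask if we were the ones who masked it. */
      	if (!masked)
      		desc->irq_data.chip->irq_unmask(&desc->irq_data);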
      
      Originally-from: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
      f1a06390