1. 11 April 2008, 1 commit
    • asmlinkage_protect replaces prevent_tail_call · 54a01510
      Committed by Roland McGrath
      The prevent_tail_call() macro works around the problem of the compiler
      clobbering argument words on the stack, which for asmlinkage functions
      is the caller's (user's) struct pt_regs.  The tail/sibling-call
      optimization is not the only way that the compiler can decide to use
      stack argument words as scratch space, which we have to prevent.
      Other optimizations can do it too.
      
      Until we have new compiler support to make "asmlinkage" binding on the
      compiler's own use of the stack argument frame, we have to work around
      all the manifestations of this issue that crop up.
      
      More cases seem to be prevented by also keeping the incoming argument
      variables live at the end of the function.  This makes their original
      stack slots attractive places to leave those variables, so the compiler
      tends not to clobber them for something else.  It's still no guarantee,
      but it handles some observed cases that prevent_tail_call() did not
      (a sketch of the idea follows this entry).
      Signed-off-by: Roland McGrath <roland@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
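      To illustrate the technique: an empty inline asm that names the return
      value and the incoming arguments as operands forces GCC to keep those
      argument words live, so it will not reuse their stack slots.  This is a
      minimal sketch of the idea, not the exact macro from the commit; the
      two-argument variant and sys_example() are illustrative.

          /*
           * Sketch: keep the incoming argument words live at the end of an
           * asmlinkage function, so GCC cannot reuse their stack slots (the
           * caller's struct pt_regs) as scratch space.
           */
          #define asmlinkage_protect2(ret, arg1, arg2)                    \
                  __asm__ __volatile__ (""                                \
                          : "=r" (ret)                                    \
                          : "0" (ret), "g" (arg1), "g" (arg2))

          asmlinkage long sys_example(long a, long b)
          {
                  long ret = a + b;

                  asmlinkage_protect2(ret, a, b); /* a and b stay live here */
                  return ret;
          }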
  2. 05 April 2008, 1 commit
    • cgroups: add cgroup support for enabling controllers at boot time · 8bab8dde
      Committed by Paul Menage
      The effects of cgroup_disable=foo are:
      
      - foo isn't auto-mounted if you mount all cgroups in a single hierarchy
      - foo isn't visible as an individually mountable subsystem
      
      As a result there will only ever be one call to foo->create(), at init time;
      all processes will stay in this group, and the group will never be mounted on
      a visible hierarchy.  Any additional effects (e.g.  not allocating metadata)
      are up to the foo subsystem.
      
      This doesn't handle early_init subsystems (their "disabled" bit isn't
      set), but it could easily be extended to do so if any of the early_init
      subsystems wanted it - I think it would just involve some nastier
      parameter processing, since it would occur before the command-line
      argument parser has been run.  (A sketch of the regular, non-early
      option parsing follows this entry.)
      
      Hugh said:
      
        Ballpark figures, I'm trying to get this question out rather than
        processing the exact numbers: CONFIG_CGROUP_MEM_RES_CTLR adds 15% overhead
        to the affected paths, booting with cgroup_disable=memory cuts that back to
        1% overhead (due to slightly bigger struct page).
      
        I'm no expert on distros, they may have no interest whatever in
        CONFIG_CGROUP_MEM_RES_CTLR=y; and the rest of us can easily build with or
        without it, or apply the cgroup_disable=memory patches.
      
      UnixBench's execl test results on x86_64 were:

      == just after boot, without mounting any cgroup fs ==
      mem_cgroup=off : Execl Throughput       43.0     3150.1      732.6
      mem_cgroup=on  : Execl Throughput       43.0     2932.6      682.0
      ==
      
      [lizf@cn.fujitsu.com: fix boot option parsing]
      Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Paul Menage <menage@google.com>
      Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Hugh Dickins <hugh@veritas.com>
      Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
      Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
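      A minimal sketch of how such a boot option can be parsed, following
      the shape of the kernel's __setup() mechanism; the subsys[] iteration
      and exact lookup details are simplifying assumptions:

          /*
           * Sketch: parse "cgroup_disable=mem,cpu,..." from the kernel
           * command line and set each named subsystem's disabled bit.
           */
          static int __init cgroup_disable(char *str)
          {
                  struct cgroup_subsys *ss;
                  char *token;
                  int i;

                  while ((token = strsep(&str, ",")) != NULL) {
                          if (!*token)
                                  continue;
                          for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
                                  ss = subsys[i];
                                  if (!strcmp(token, ss->name)) {
                                          ss->disabled = 1;
                                          printk(KERN_INFO
                                                 "Disabling %s control group"
                                                 " subsystem\n", ss->name);
                                          break;
                                  }
                          }
                  }
                  return 1;
          }
          __setup("cgroup_disable=", cgroup_disable);

      Booting with cgroup_disable=memory then leaves the memory controller
      compiled in but inert, which is what produces the small residual
      overhead Hugh measured.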
  3. 03 April 2008, 1 commit
    • markers: use synchronize_sched() · 6496968e
      Committed by Mathieu Desnoyers
      Markers do not mix well with CONFIG_PREEMPT_RCU, because the marker
      code uses preempt_disable/enable() rather than rcu_read_lock/unlock()
      for minimal intrusiveness.  We would need call_sched and sched_barrier
      primitives.
      
      Currently, modifying (connecting and disconnecting) probes from
      markers requires RCU-style changes to the data structure: a new data
      structure is created, the pointer is switched atomically, a quiescent
      state is reached, and then the old data structure is freed.
      
      The quiescent state is reached once all the currently running
      preempt_disable regions are done running.  We use the call_rcu
      mechanism to execute kfree() after such a quiescent state has been
      reached.  However, the new CONFIG_PREEMPT_RCU version of call_rcu and
      rcu_barrier does not guarantee that all preempt_disable code regions
      have finished, hence the race.
      
      The "proper" way to do this is to use rcu_read_lock/unlock, but we don't
      want to use it to minimize intrusiveness on the traced system.  (we do
      not want the marker code to call into much of the OS code, because it
      would quickly restrict what can and cannot be instrumented, such as the
      scheduler).
      
      The temporary fix, until we get call_rcu_sched and rcu_barrier_sched
      in mainline, is to call synchronize_sched() before each call_rcu(), so
      we wait for the quiescent state in the system-call code path.  It will
      slow down batch marker enable/disable, but will make sure the race is
      gone (see the sketch after this entry).
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
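      A sketch of the update path described above; marker_entry,
      probe_closure and free_old_closure are hypothetical names, the point
      being only the synchronize_sched() inserted before call_rcu():

          /*
           * Sketch: publish the new probe data, then wait until every
           * currently running preempt_disable() region has finished
           * before queueing the old data for freeing.
           */
          static void marker_update_probe(struct marker_entry *entry,
                                          struct probe_closure *new)
          {
                  struct probe_closure *old = entry->closure;

                  rcu_assign_pointer(entry->closure, new);
                  /*
                   * Under CONFIG_PREEMPT_RCU, call_rcu() alone does not
                   * wait for preempt_disable() regions; synchronize_sched()
                   * does, which closes the race.
                   */
                  synchronize_sched();
                  call_rcu(&old->rcu, free_old_closure);
          }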
  4. 31 March 2008, 2 commits
  5. 29 March 2008, 2 commits
  6. 27 March 2008, 1 commit
  7. 26 March 2008, 4 commits
  8. 25 March 2008, 6 commits
  9. 21 March 2008, 5 commits
  10. 20 March 2008, 1 commit
  11. 19 March 2008, 6 commits
    • sched: retune wake granularity · 74e3cd7f
      Committed by Ingo Molnar
      Reduce the wake-up granularity for better interactivity.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: wakeup-buddy tasks are cache-hot · f540a608
      Committed by Ingo Molnar
      Wakeup-buddy tasks are cache-hot - this makes it a bit harder for the
      load-balancer to tear them apart (though still possible if the load is
      sufficiently asymmetric).  A sketch of the idea follows this entry.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
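      The change amounts to one extra test in the scheduler's cache-hotness
      check; a sketch assuming a task_hot()-style helper and the cfs_rq
      "next" buddy pointer (details simplified from the real function):

          /*
           * Sketch: a task that is its runqueue's wakeup buddy ("next")
           * is reported as cache-hot, making the load-balancer reluctant
           * to migrate it away from its partner.
           */
          static int task_hot(struct task_struct *p, u64 now)
          {
                  s64 delta;

                  /* Buddy candidates are cache-hot: */
                  if (&p->se == cfs_rq_of(&p->se)->next)
                          return 1;

                  delta = now - p->se.exec_start;
                  return delta < (s64)sysctl_sched_migration_cost;
          }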
    • sched: improve affine wakeups · 4ae7d5ce
      Committed by Ingo Molnar
      Improve affine wakeups.  Maintain the 'overlap' metric based on CFS's
      sum_exec_runtime - that is, the amount of time a task executes after
      it wakes up some other task.
      
      Use the 'overlap' for the wakeup decisions: if the 'overlap' is short,
      there is strong workload coupling between this task and the woken-up
      task.  If the 'overlap' is large, the workloads are decoupled and the
      scheduler will move them to separate CPUs more easily.  (A sketch of
      the metric follows this entry.)

      (Also slightly move the preempt check within try_to_wake_up() - this
      has no effect on functionality, but allows 'early wakeups' (for
      still-on-rq tasks) to be correctly accounted as well.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
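      A sketch of how such an 'overlap' metric can be kept and consulted;
      the helper names and the threshold are assumptions:

          /*
           * Sketch: remember where in our runtime we last woke somebody,
           * and measure how long we kept executing afterwards.
           */
          static inline void note_wakeup(struct sched_entity *se)
          {
                  se->last_wakeup = se->sum_exec_runtime;
          }

          static inline u64 wakeup_overlap(struct sched_entity *se)
          {
                  return se->sum_exec_runtime - se->last_wakeup;
          }

          /*
           * Affine-wakeup decision: a short overlap means waker and wakee
           * are tightly coupled, so prefer waking on the same CPU.
           */
          static int want_affine_wakeup(struct sched_entity *waker)
          {
                  return wakeup_overlap(waker) < sysctl_sched_migration_cost;
          }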
    • sched: clean up wakeup balancing, code flow · f4827386
      Committed by Ingo Molnar
      Clean up the code flow. No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: clean up wakeup balancing, rename variables · ac192d39
      Committed by Ingo Molnar
      Rename 'cpu' to 'prev_cpu'.  No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         09b31c44e9aff8666f72773dc433e2df  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: clean up wakeup balancing, move wake_affine() · 098fb9db
      Committed by Ingo Molnar
      Split out the affine-wakeup bits.
      
      No code changed:
      
      kernel/sched.o:
      
         text	   data	    bss	    dec	    hex	filename
        42521	   2858	    232	  45611	   b22b	sched.o.before
        42521	   2858	    232	  45611	   b22b	sched.o.after
      
      md5:
         9d76738f1272aa82f0b7affd2f51df6b  sched.o.before.asm
         09b31c44e9aff8666f72773dc433e2df  sched.o.after.asm
      
      (The md5s changed because stack slots changed and some registers get
      scheduled by gcc in a different order - but otherwise the before and
      after assembly is instruction-for-instruction equivalent.)
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 17 March 2008, 1 commit
  13. 15 March 2008, 7 commits
    • sched: simplify sched_slice() · 6a6029b8
      Committed by Ingo Molnar
      Use the existing calc_delta_mine() calculation for sched_slice().
      This saves a divide and simplifies the code, because we share it with
      the other /cfs_rq->load users (see the sketch after this entry).
      
      It also improves code size:
      
            text    data     bss     dec     hex filename
           42659    2740     144   45543    b1e7 sched.o.before
           42093    2740     144   44977    afb1 sched.o.after
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
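      The resulting function is short enough to sketch in full (assuming
      the usual CFS helpers __sched_period() and calc_delta_mine()):

          /*
           * Sketch: slice = period * se->load.weight / cfs_rq->load.weight,
           * expressed via calc_delta_mine() so the /cfs_rq->load divide is
           * shared with that helper's other users.
           */
          static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  return calc_delta_mine(__sched_period(cfs_rq->nr_running),
                                         se->load.weight, &cfs_rq->load);
          }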
    • sched: fix fair sleepers · e22ecef1
      Committed by Ingo Molnar
      Fair sleepers need to scale their latency target down by runqueue
      weight.  Otherwise busy systems will gain an ever larger sleep bonus
      (see the sketch after this entry).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
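      A sketch of the scaling when placing a sleeping entity; the exact
      constants, helper name and surrounding logic are assumptions about
      the fix's form:

          /*
           * Sketch: scale the sleeper credit by runqueue load instead of
           * granting a fixed sysctl_sched_latency, so busy runqueues hand
           * out proportionally smaller sleep bonuses.
           */
          static void place_sleeper(struct cfs_rq *cfs_rq, struct sched_entity *se)
          {
                  u64 vruntime = cfs_rq->min_vruntime;

                  /* before: vruntime -= sysctl_sched_latency; */
                  vruntime -= calc_delta_mine(sysctl_sched_latency,
                                              NICE_0_LOAD, &cfs_rq->load);

                  se->vruntime = max_vruntime(se->vruntime, vruntime);
          }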
    • sched: fix overload performance: buddy wakeups · aa2ac252
      Committed by Peter Zijlstra
      Currently we schedule to the leftmost task in the runqueue.  When the
      runtimes are very short because of some server/client ping-pong,
      especially in over-saturated workloads, this will cycle through all
      tasks, thrashing the cache.

      Reduce cache thrashing by keeping dependent tasks together, running
      newly woken tasks first.  However, by not running the leftmost task
      first we could starve tasks, because the wakee can gain unlimited
      runtime.

      Therefore we only run the wakee if it's within a small
      (wakeup_granularity) window of the leftmost task.  This preserves
      fairness, but does let server/client task groups alternate (see the
      sketch after this entry).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
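      A sketch of the pick logic under these rules; cfs_rq->next as the
      wakeup-buddy pointer follows the description, while the granularity
      check shown is simplified:

          /*
           * Sketch: prefer the wakeup buddy (cfs_rq->next) over the
           * leftmost task, but only while the buddy's vruntime lies
           * within the wakeup granularity of the leftmost task.
           */
          static struct sched_entity *
          pick_next(struct cfs_rq *cfs_rq, struct sched_entity *leftmost)
          {
                  struct sched_entity *next = cfs_rq->next;
                  s64 lead;

                  if (!next)
                          return leftmost;

                  lead = (s64)(next->vruntime - leftmost->vruntime);
                  if (lead < (s64)sysctl_sched_wakeup_granularity)
                          return next;

                  return leftmost;
          }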
    • sched: fix calc_delta_mine() · 27d11726
      Committed by Ingo Molnar
      lw->weight can be 0 for a short time during bootup (see the sketch
      after this entry).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
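      A hedged sketch of what guarding against a zero weight can look like;
      the "+ 1" rounding trick and the simplified body are assumptions
      about the fix's exact form:

          /*
           * Sketch: delta * weight / lw->weight via a cached inverse.
           * Dividing by (lw->weight + 1) keeps the inverse computation
           * safe while lw->weight is still 0 during bootup.
           */
          static unsigned long
          calc_delta_mine(unsigned long delta_exec, unsigned long weight,
                          struct load_weight *lw)
          {
                  u64 tmp;

                  if (unlikely(!lw->inv_weight))
                          lw->inv_weight = (WMULT_CONST - lw->weight / 2)
                                                  / (lw->weight + 1);

                  tmp = (u64)delta_exec * weight;
                  return (unsigned long)((tmp * lw->inv_weight) >> WMULT_SHIFT);
          }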
    • sched: fix update_load_add()/sub() · e89996ae
      Committed by Ingo Molnar
      Clear the cached inverse value when updating the load.  This is needed
      for calc_delta_mine() to work correctly when using the rq load (see
      the sketch after this entry).
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
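      The fix is small enough to sketch in full; clearing inv_weight forces
      calc_delta_mine() to lazily recompute it from the new weight:

          /* Sketch: any weight change invalidates the cached inverse. */
          static inline void update_load_add(struct load_weight *lw,
                                             unsigned long inc)
          {
                  lw->weight += inc;
                  lw->inv_weight = 0;
          }

          static inline void update_load_sub(struct load_weight *lw,
                                             unsigned long dec)
          {
                  lw->weight -= dec;
                  lw->inv_weight = 0;
          }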
    • sched: min_vruntime fix · 3fe69747
      Committed by Peter Zijlstra
      The current min_vruntime tracking is incorrect and will cause serious
      problems when we don't run the leftmost task for some reason.

      min_vruntime does two things: 1) it is used to determine a forward
      direction when the u64 vruntime wraps; 2) it is used to track the
      leftmost vruntime, from which newly enqueued tasks are positioned.

      The current logic advances min_vruntime whenever the current task's
      vruntime advances.  Because the current task may pass the leftmost
      task that is still waiting, we fail the second goal.  This causes new
      tasks to be placed too far ahead, which penalizes their runtime.

      Fix this by making min_vruntime the min_vruntime of the waiting tasks,
      tracking it in enqueue/dequeue, and comparing against the current
      task's vruntime to obtain the absolute minimum when placing new tasks
      (see the sketch after this entry).
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
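      A sketch of the tracking rule as described - take the minimum over
      the current task and the leftmost waiting task, and never move
      min_vruntime backwards (the helper shape and the
      __pick_first_entity() name are assumptions):

          /*
           * Sketch: min_vruntime covers both the running task and the
           * leftmost waiting task, and only ever advances monotonically
           * (monotonicity is what makes the u64 wrap detection work).
           */
          static void update_min_vruntime(struct cfs_rq *cfs_rq)
          {
                  struct sched_entity *leftmost = __pick_first_entity(cfs_rq);
                  u64 vruntime = cfs_rq->min_vruntime;

                  if (cfs_rq->curr)
                          vruntime = cfs_rq->curr->vruntime;

                  if (leftmost) {
                          if (cfs_rq->curr)
                                  vruntime = min_vruntime(vruntime,
                                                          leftmost->vruntime);
                          else
                                  vruntime = leftmost->vruntime;
                  }

                  cfs_rq->min_vruntime = max_vruntime(cfs_rq->min_vruntime,
                                                      vruntime);
          }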
    • sched: fix race in schedule() · 0e1f3483
      Committed by Hiroshi Shimamoto
      Fix a hard-to-trigger crash seen in the -rt kernel that also affects
      the vanilla scheduler.
      
      There is a race condition between schedule() and some dequeue/enqueue
      functions: rt_mutex_setprio(), __setscheduler() and sched_move_task().

      When scheduling to idle, idle_balance() is called to pull tasks from
      other busy processors, and it might drop the rq lock.  This means
      those three functions can observe on_rq=0 while running=1; the current
      task must still be put (put_prev_task) while it is running.
      
      Here is a possible scenario:
      
         CPU0                               CPU1
          |                              schedule()
          |                              ->deactivate_task()
          |                              ->idle_balance()
          |                              -->load_balance_newidle()
      rt_mutex_setprio()                     |
          |                              --->double_lock_balance()
          *get lock                          *rel lock
          * on_rq=0, running=1               |
          * sched_class is changed           |
          *rel lock                          *get lock
          :                                  |
                                             :
                                         ->put_prev_task_rt()
                                         ->pick_next_task_fair()
                                             => panic
      
      The current process on CPU1 (P1) is scheduling.  P1 is deactivated,
      and the scheduler looks for another process on other CPUs' runqueues,
      because CPU1 will be idle.  idle_balance(), load_balance_newidle() and
      double_lock_balance() are called, and double_lock_balance() could drop
      the rq lock.  Meanwhile, CPU0 is trying to boost the priority of P1.
      As a result of the boost, only P1's prio and sched_class are changed
      to RT; the sched entities of P1 and P1's group are never put.  This
      leaves the cfs_rq invalid, because it has a curr but no leaf; when
      pick_next_task_fair() is then called, the kernel panics.  (A sketch of
      the fix pattern follows this entry.)
      Signed-off-by: Hiroshi Shimamoto <h-shimamoto@ct.jp.nec.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
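      A sketch of the pattern such setters need, assuming the fix is to
      treat "running" (being rq->curr) independently of on_rq; the wrapper
      name change_task_class() is hypothetical:

          /*
           * Sketch: while schedule() is mid-flight a task can be running
           * (rq->curr) with on_rq=0, so dequeue and put_prev_task must be
           * decided separately - and undone symmetrically afterwards.
           */
          static void change_task_class(struct rq *rq, struct task_struct *p,
                                        const struct sched_class *class,
                                        int prio)
          {
                  int on_rq = p->se.on_rq;
                  int running = task_current(rq, p);

                  if (on_rq)
                          dequeue_task(rq, p, 0);
                  if (running)
                          p->sched_class->put_prev_task(rq, p);

                  p->sched_class = class;
                  p->prio = prio;

                  if (running)
                          p->sched_class->set_curr_task(rq);
                  if (on_rq)
                          enqueue_task(rq, p, 0);
          }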
  14. 13 March 2008, 1 commit
  15. 12 March 2008, 1 commit
    • Hibernation: Fix mark_nosave_pages() · a82f7119
      Committed by Rafael J. Wysocki
      There is a problem in the hibernation code that triggers on some NUMA
      systems on which pfn_valid() returns 'true' for some PFNs that don't
      belong to any zone.  Namely, there is a BUG_ON() in
      memory_bm_find_bit() that triggers for PFNs not belonging to any zone
      yet passing the pfn_valid() test.  On the affected systems it triggers
      when we mark PFNs reported by the platform as not saveable, because
      the PFNs in question belong to a region mapped directly using
      ioremap() (i.e. the ACPI data area) and they pass the pfn_valid()
      test.
      
      Modify memory_bm_find_bit() so that it returns an error if the given
      PFN doesn't belong to any zone, instead of crashing the kernel, and
      ignore its result in mark_nosave_pages() while marking the "nosave"
      memory regions (see the sketch after this entry).
      
      This doesn't affect the hibernation functionality, as we won't touch
      the PFNs in question anyway.
      
      http://bugzilla.kernel.org/show_bug.cgi?id=9966
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Signed-off-by: Len Brown <len.brown@intel.com>
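      A sketch of the "set if possible, otherwise ignore" wrapper this
      describes; the helper name mem_bm_set_bit_check() and the signature
      of memory_bm_find_bit() are assumptions based on the description:

          /*
           * Sketch: look the PFN up in the bitmap; if it falls outside
           * every zone, memory_bm_find_bit() now reports an error instead
           * of hitting a BUG_ON(), and the page is simply skipped -
           * hibernation never touches such PFNs anyway.
           */
          static void mem_bm_set_bit_check(struct memory_bitmap *bm,
                                           unsigned long pfn)
          {
                  void *addr;
                  unsigned int bit;

                  if (memory_bm_find_bit(bm, pfn, &addr, &bit) == 0)
                          set_bit(bit, addr);
          }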