1. 15 May 2017 (10 commits)
    • sched/topology: Refactor function build_overlap_sched_groups() · 8c033469
      Committed by Lauro Ramos Venancio
      Create functions build_group_from_child_sched_domain() and
      init_overlap_sched_group(). No functional change.
      Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8c033469
    • sched/clock: Print a warning recommending 'tsc=unstable' · 7708d5f0
      Committed by Peter Zijlstra
      With our switch to stable delayed until late_initcall(), the most
      likely cause of hitting mark_tsc_unstable() is the watchdog. The
      watchdog typically only triggers when creative BIOS'es fiddle with the
      TSC to hide SMI latency.
      
      Since the watchdog can only detect TSC fiddling after the fact all TSC
      clocks (including userspace GTOD) can already have reported funny
      values.
      
      The only way to fully avoid this is to manually mark the TSC unstable
      at boot. Suggest people do this on their broken systems.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7708d5f0
    • sched/clock: Use late_initcall() instead of sched_init_smp() · 2e44b7dd
      Committed by Peter Zijlstra
      Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
      after sched_init_smp(). Luckily it appears both acpi_processor and
      intel_idle (which has a similar check) are mandatory built-in.
      
      This means we can delay switching to stable until after these drivers
      have run (if they were modules, this would be impossible).
      
      Delay the stable switch to late_initcall() to allow these drivers to
      mark TSC unstable and avoid difficult stable->unstable transitions.
      Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2e44b7dd
    • cpuidle: Fix idle time tracking · f9fccdb9
      Committed by Peter Zijlstra
      Ville reported that on his Core2, which has TSC stop in idle, we would
      always report very short idle durations. He tracked this down to
      commit:
      
        e93e59ce ("cpuidle: Replace ktime_get() with local_clock()")
      
      which replaces ktime_get() with local_clock().
      
      Add a sched_clock_idle_wakeup_event() call, which will re-sync the
      clock with ktime_get_ns() when TSC is unstable and no-op otherwise.
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: e93e59ce ("cpuidle: Replace ktime_get() with local_clock()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f9fccdb9
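      To make the failure mode and the fix above concrete, here is a small
      standalone toy model (plain userspace C, not the kernel code; clock
      names and numbers are invented): a local clock that stops in idle
      reports near-zero residencies until a wakeup event re-syncs it against
      a clock that keeps running.

        /* toy_idle_clock.c -- build: gcc -O2 toy_idle_clock.c -o toy_idle_clock */
        #include <stdio.h>
        #include <stdint.h>

        static uint64_t tsc_ns;   /* a local clock that stops while the CPU idles */
        static uint64_t ktime_ns; /* a clock that keeps running across idle       */

        static void idle_for(uint64_t ns)
        {
                ktime_ns += ns;   /* only the always-running clock advances */
        }

        int main(void)
        {
                uint64_t t0, t1;

                /* without a wakeup re-sync the measured residency collapses */
                t0 = tsc_ns;
                idle_for(500000);
                t1 = tsc_ns;
                printf("no resync: measured idle = %llu ns\n",
                       (unsigned long long)(t1 - t0));

                /* start over with the clocks in sync, then let the wakeup
                 * event pull the local clock forward to the running clock */
                tsc_ns = ktime_ns;
                t0 = tsc_ns;
                idle_for(500000);
                tsc_ns = ktime_ns;        /* the wakeup-event re-sync */
                t1 = tsc_ns;
                printf("resync:    measured idle = %llu ns\n",
                       (unsigned long long)(t1 - t0));
                return 0;
        }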
    • sched/clock: Remove watchdog touching · 3067a33d
      Committed by Peter Zijlstra
      Commit:
      
        2bacec8c ("sched: touch softlockup watchdog after idling")
      
      introduced the touch_softlockup_watchdog_sched() call without
      justification, and I feel sched_clock management is not the right
      place for it; it should only be concerned with producing semi-coherent time.
      
      If this causes watchdog thingies, we can find a better place.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3067a33d
    • sched/clock: Remove unused argument to sched_clock_idle_wakeup_event() · ac1e843f
      Committed by Peter Zijlstra
      The argument to sched_clock_idle_wakeup_event() has not been used in a
      long time. Remove it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac1e843f
    • x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide stable sync points · b421b22b
      Committed by Peter Zijlstra
      Currently we keep sched_clock_tick() active for stable TSC in order to
      keep the per-CPU state semi up-to-date. The (obvious) problem is that
      by the time we detect TSC is borked, our per-CPU state is also borked.
      
      So hook into the clocksource watchdog and call a method after we've
      found it to still be stable.
      
      There's the obvious race where the TSC goes wonky between finding it
      stable and us running the callback, but closing that is too much work
      and not really worth it, since we're already detecting TSC wobbles
      after the fact, so we cannot, per definition, fully avoid funny clock
      values.
      
      And since the watchdog runs less often than the tick, this is also an
      optimization.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b421b22b
    • sched/clock: Initialize all per-CPU state before switching (back) to unstable · cf15ca8d
      Committed by Peter Zijlstra
      In preparation for not keeping the sched_clock_tick() active for
      stable TSC, we need to explicitly initialize all per-CPU state
      before switching back to unstable.
      
      Note: this patch loses the __gtod_offset calculation; it will be
      restored in the next one.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cf15ca8d
    • sched/cfs: Make util/load_avg more stable · 625ed2bf
      Committed by Vincent Guittot
      In the current implementation of load/util_avg, we assume that the
      ongoing time segment has fully elapsed, and util/load_sum is divided
      by LOAD_AVG_MAX, even if part of the time segment still remains to
      run. As a consequence, this remaining part is considered as idle time
      and generates unexpected variations of util_avg of a busy CPU in the
      range [1002..1024[ whereas util_avg should stay at 1023.
      
      In order to keep the metric stable, we should not consider the ongoing
      time segment when computing load/util_avg, but only the segments that
      have already fully elapsed. However, not considering the current time
      segment adds unwanted latency to the load/util_avg responsiveness,
      especially when the time is scaled instead of the contribution.
      
      Instead of waiting for the current time segment to have fully elapsed
      before accounting it in load/util_avg, we can already account the
      elapsed part but change the range used to compute load/util_avg
      accordingly.
      
      At the very beginning of a new time segment, the past segments have
      been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
      the current time segment, the max value becomes:
      
        LOAD_AVG_MAX*y + 1024(us)  (== LOAD_AVG_MAX)
      
      In fact, the max value is:
      
        LOAD_AVG_MAX*y + sa->period_contrib
      
      at any time in the time segment.
      
      Taking advantage of the fact that:
      
        LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024
      
      the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].
      
      As the elapsed part is already accounted in load/util_sum, we update
      the max value according to the current position in the time segment
      instead of removing its contribution.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      625ed2bf
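      A standalone numeric sketch of the divider change described above,
      assuming LOAD_AVG_MAX = 47742 and an idealized, truncation-free
      util_sum for a CPU that is 100% busy at full capacity (exact kernel
      values differ slightly because of integer decay):

        /* pelt_divider.c -- build: gcc -O2 pelt_divider.c -o pelt_divider */
        #include <stdio.h>

        #define LOAD_AVG_MAX   47742  /* maximum of the decayed series, in us */
        #define SCHED_CAPACITY 1024

        int main(void)
        {
                /* period_contrib: how far we are into the current 1024us segment */
                for (int contrib = 0; contrib < 1024; contrib += 341) {
                        long long util_sum = (long long)SCHED_CAPACITY *
                                             (LOAD_AVG_MAX - 1024 + contrib);

                        long long old_avg = util_sum / LOAD_AVG_MAX;
                        long long new_avg = util_sum /
                                            (LOAD_AVG_MAX - 1024 + contrib);

                        printf("contrib=%4d old util_avg=%4lld new util_avg=%4lld\n",
                               contrib, old_avg, new_avg);
                }
                return 0;
        }

      With the old divider the printed value swings from about 1002 up to
      1023 across the segment; the position-dependent divider keeps it flat.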
    • sched/core: Call __schedule() from do_idle() without enabling preemption · 8663effb
      Committed by Steven Rostedt (VMware)
      I finally got around to creating trampolines for dynamically allocated
      ftrace_ops using synchronize_rcu_tasks(). For users of the ftrace
      function hook callbacks, like perf, that allocate the ftrace_ops
      descriptor via kmalloc() and friends, ftrace was not able to optimize
      the functions being traced to use a trampoline because they would also
      need to be allocated dynamically. The problem is that they cannot be
      freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
      was preempted on the trampoline. That was before Paul McKenney
      implemented synchronize_rcu_tasks() that would make sure all tasks
      (except idle) have scheduled out or have entered user space.
      
      While testing this, I triggered this bug:
      
       BUG: unable to handle kernel paging request at ffffffffa0230077
       ...
       RIP: 0010:0xffffffffa0230077
       ...
       Call Trace:
        schedule+0x5/0xe0
        schedule_preempt_disabled+0x18/0x30
        do_idle+0x172/0x220
      
      What happened was that the idle task was preempted on the trampoline.
      As synchronize_rcu_tasks() ignores the idle thread, there's nothing
      that lets ftrace know that the idle task was preempted on a trampoline.
      
      The idle task shouldn't ever need to enable preemption. The idle task
      is simply a loop that calls schedule() or places the CPU into idle mode.
      In fact, having preemption enabled is inefficient, because it can
      happen when idle is just about to call schedule anyway, which would
      cause schedule to be called twice. Once for when the interrupt came in
      and was returning back to normal context, and then again in the normal
      path that the idle loop is running in, which would be pointless, as it
      had already scheduled.
      
      The only reason schedule_preempt_disabled() enables preemption is to be
      able to call sched_submit_work(), which requires preemption enabled. As
      this is a nop when the task is in the RUNNING state, and idle is always
      in the running state, there's no reason that idle needs to enable
      preemption. But that means it cannot use schedule_preempt_disabled(), as
      other callers of that function require calling sched_submit_work().
      
      Adding a new function, local to kernel/sched/, that allows idle to call
      the scheduler without enabling preemption fixes the
      synchronize_rcu_tasks() issue and removes the pointless spurious
      schedule calls caused by interrupts happening in the brief window where
      preemption is enabled just before idle calls schedule().
      
      Reviewed: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8663effb
  2. 27 April 2017 (1 commit)
  3. 21 April 2017 (1 commit)
    • rcu: Make non-preemptive schedule be Tasks RCU quiescent state · bcbfdd01
      Committed by Paul E. McKenney
      Currently, a call to schedule() acts as a Tasks RCU quiescent state
      only if a context switch actually takes place.  However, just the
      call to schedule() guarantees that the calling task has moved off of
      whatever tracing trampoline it might have been on previously.
      This commit therefore plumbs schedule()'s "preempt" parameter into
      rcu_note_context_switch(), which then records the Tasks RCU quiescent
      state, but only if this call to schedule() was -not- due to a preemption.
      
      To avoid adding overhead to the common-case context-switch path,
      this commit hides the rcu_note_context_switch() check under an existing
      non-common-case check.
      Suggested-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      bcbfdd01
  4. 18 April 2017 (1 commit)
  5. 14 April 2017 (4 commits)
    • sched/fair: Move the PELT constants into a generated header · 283e2ed3
      Committed by Peter Zijlstra
      Now that we have a tool to generate the PELT constants in C form,
      use its output as a separate header.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      283e2ed3
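      The generator itself is not part of this log; the following is a rough
      stand-in (assumed structure, not the in-tree tool) for the kind of
      program the commit refers to: it derives y from y^32 = 0.5 and emits a
      32-bit scaled y^n lookup table plus the converged maximum of the series.

        /* gen_pelt_consts.c -- build: gcc -O2 gen_pelt_consts.c -lm -o gen_pelt_consts */
        #include <stdio.h>
        #include <math.h>

        #define HALFLIFE 32    /* periods after which a contribution halves */
        #define PERIOD   1024  /* microseconds per PELT period */

        int main(void)
        {
                const double y = pow(0.5, 1.0 / HALFLIFE);
                long max = PERIOD, last = 0;

                printf("static const u32 runnable_avg_yN_inv[] = {");
                for (int n = 0; n < HALFLIFE; n++) {
                        if (!(n % 4))
                                printf("\n\t");
                        printf("0x%08x, ",
                               (unsigned int)(((1ULL << 32) - 1) * pow(y, n)));
                }
                printf("\n};\n");

                /* iterate max = 1024 + y*max until the truncated value converges */
                while (max != last) {
                        last = max;
                        max = PERIOD + (long)(y * max);
                }
                printf("#define LOAD_AVG_MAX %ld\n", max);
                return 0;
        }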
    • sched/fair: Increase PELT accuracy for small tasks · bb0bd044
      Committed by Peter Zijlstra
      We truncate (and lose) the lower 10 bits of runtime in
      ___update_load_avg(), which means there's a consistent bias to
      under-account tasks. This is especially significant for small tasks.
      
      Cure this by only forwarding last_update_time to the point we've
      actually accounted for, leaving the remainder for the next time.
      Reported-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bb0bd044
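      A standalone model of the truncation bias and of the remainder-carrying
      fix described above (1 us is treated as 1024 ns, as PELT does; names
      are illustrative, not the kernel's):

        /* pelt_truncation.c -- build: gcc -O2 pelt_truncation.c -o pelt_truncation */
        #include <stdio.h>
        #include <stdint.h>

        static uint64_t seen_old, seen_new;  /* microseconds accounted by each scheme */

        /* old scheme: drop the low 10 bits of the delta and forget the remainder */
        static void update_old(uint64_t *last, uint64_t now)
        {
                seen_old += (now - *last) >> 10;
                *last = now;
        }

        /* fixed scheme: advance last_update_time only by what was accounted,
         * so the sub-microsecond remainder is carried into the next update */
        static void update_new(uint64_t *last, uint64_t now)
        {
                uint64_t delta_us = (now - *last) >> 10;

                seen_new += delta_us;
                *last += delta_us << 10;
        }

        int main(void)
        {
                uint64_t last_old = 0, last_new = 0, now = 0;

                /* a small task that runs for 1500ns at a time, 1000 times */
                for (int i = 0; i < 1000; i++) {
                        now += 1500;
                        update_old(&last_old, now);
                        update_new(&last_new, now);
                }
                printf("ran ~%d us: old scheme saw %llu us, fixed scheme saw %llu us\n",
                       1500 * 1000 / 1024,
                       (unsigned long long)seen_old,
                       (unsigned long long)seen_new);
                return 0;
        }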
    • sched/fair: Fix comments · 3841cdc3
      Committed by Peter Zijlstra
      Historically our periods (or p) argument in PELT denoted the number of
      full periods (what is now d2). However recent patches have changed
      this to the total decay (previously p+1), leading to a confusing
      discrepancy between comments and code.
      
      Try and clarify things by making periods (in code) and p (in comments)
      be the same thing (again).
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3841cdc3
    • sched/fair: Fix corner case in __accumulate_sum() · 05296e75
      Committed by Peter Zijlstra
      Paul noticed that in the (periods >= LOAD_AVG_MAX_N) case in
      __accumulate_sum(), the returned contribution value (LOAD_AVG_MAX) is
      incorrect.
      
      This is because at this point, the decay_load() on the old state --
      the first step in accumulate_sum() -- will not have resulted in 0, and
      will therefore result in a sum larger than the maximum value of our
      series. Obviously broken.
      
      Note that:
      
        decay_load(LOAD_AVG_MAX, LOAD_AVG_MAX_N) = 47742 * (1/2)^(345/32) = ~27
      
      Not to mention that any further contribution from the d3 segment (our
      new period) would also push it over the maximum.
      
      Solve this by noting that we can write our c2 term:
      
      		    p
      	c2 = 1024 \Sum y^n
      		   n=1
      
      In terms of our maximum value:
      
      		    inf		      inf	  p
      	max = 1024 \Sum y^n = 1024 ( \Sum y^n + \Sum y^n + y^0 )
      		    n=0		      n=p+1	 n=1
      
      Further note that:
      
                 inf              inf            inf
              ( \Sum y^n ) y^p = \Sum y^(n+p) = \Sum y^n
                 n=0              n=0            n=p
      
      Combined that gives us:
      
      		    p
      	c2 = 1024 \Sum y^n
      		   n=1
      
      		     inf        inf
      	   = 1024 ( \Sum y^n - \Sum y^n - y^0 )
      		     n=0        n=p+1
      
      	   = max - (max y^(p+1)) - 1024
      
      Further simplify things by dealing with p=0 early on.
      Reported-by: Paul Turner <pjt@google.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Yuyang Du <yuyang.du@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Fixes: a481db34 ("sched/fair: Optimize ___update_sched_avg()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      05296e75
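      The identity used for c2 above can be checked numerically; here is a
      minimal standalone sketch, using the exact real-valued maximum
      max = 1024 / (1 - y) rather than the kernel's integer LOAD_AVG_MAX:

        /* c2_check.c -- build: gcc -O2 c2_check.c -lm -o c2_check */
        #include <stdio.h>
        #include <math.h>

        int main(void)
        {
                const double y = pow(0.5, 1.0 / 32.0);
                const double max = 1024.0 / (1.0 - y);

                for (int p = 1; p <= 345; p += 86) {
                        double direct = 0.0;    /* 1024 * \Sum_{n=1..p} y^n */
                        for (int n = 1; n <= p; n++)
                                direct += 1024.0 * pow(y, n);

                        /* closed form: max - max*y^(p+1) - 1024 */
                        double closed = max - max * pow(y, p + 1) - 1024.0;

                        printf("p=%3d direct=%10.3f closed=%10.3f\n",
                               p, direct, closed);
                }
                return 0;
        }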
  6. 13 April 2017 (1 commit)
  7. 11 April 2017 (1 commit)
  8. 04 April 2017 (3 commits)
    • sched,tracing: Update trace_sched_pi_setprio() · b91473ff
      Committed by Peter Zijlstra
      Pass the PI donor task, instead of a numerical priority.
      
      Numerical priorities are not sufficient to describe state ever since
      SCHED_DEADLINE.
      
      Annotate all sched tracepoints that are currently broken; fixing them
      will bork userspace. *hate*.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.353599881@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      b91473ff
    • sched/rtmutex: Refactor rt_mutex_setprio() · acd58620
      Committed by Peter Zijlstra
      With the introduction of SCHED_DEADLINE the whole notion that priority
      is a single number is gone, therefore the @prio argument to
      rt_mutex_setprio() doesn't make sense anymore.
      
      So rework the code to pass a pi_task instead.
      
      Note this also fixes a problem with pi_top_task caching; previously we
      would not set the pointer (call rt_mutex_update_top_task) if the
      priority didn't change, which could lead to a stale pointer.
      
      As for the XXX, I think it's fine to use pi_task->prio, because if it
      differs from waiter->prio, a PI chain update is imminent.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: xlpang@redhat.com
      Cc: rostedt@goodmis.org
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      acd58620
    • sched/rtmutex/deadline: Fix a PI crash for deadline tasks · e96a7705
      Committed by Xunlei Pang
      A crash happened while I was playing with deadline PI rtmutex.
      
          BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
          IP: [<ffffffff810eeb8f>] rt_mutex_get_top_task+0x1f/0x30
          PGD 232a75067 PUD 230947067 PMD 0
          Oops: 0000 [#1] SMP
          CPU: 1 PID: 10994 Comm: a.out Not tainted
      
          Call Trace:
          [<ffffffff810b658c>] enqueue_task+0x2c/0x80
          [<ffffffff810ba763>] activate_task+0x23/0x30
          [<ffffffff810d0ab5>] pull_dl_task+0x1d5/0x260
          [<ffffffff810d0be6>] pre_schedule_dl+0x16/0x20
          [<ffffffff8164e783>] __schedule+0xd3/0x900
          [<ffffffff8164efd9>] schedule+0x29/0x70
          [<ffffffff8165035b>] __rt_mutex_slowlock+0x4b/0xc0
          [<ffffffff81650501>] rt_mutex_slowlock+0xd1/0x190
          [<ffffffff810eeb33>] rt_mutex_timed_lock+0x53/0x60
          [<ffffffff810ecbfc>] futex_lock_pi.isra.18+0x28c/0x390
          [<ffffffff810ed8b0>] do_futex+0x190/0x5b0
          [<ffffffff810edd50>] SyS_futex+0x80/0x180
      
      This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
      are only protected by pi_lock when operating on pi waiters, while
      rt_mutex_get_top_task() will access them with the rq lock held but
      without holding pi_lock.
      
      In order to tackle it, we introduce a new "pi_top_task" pointer
      cached in task_struct, and add a new rt_mutex_update_top_task()
      to update its value; it can be called by rt_mutex_setprio(),
      which holds both the owner's pi_lock and the rq lock. Thus
      "pi_top_task" can be safely accessed by enqueue_task_dl() under
      the rq lock.
      
      Originally-From: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: juri.lelli@arm.com
      Cc: bigeasy@linutronix.de
      Cc: mathieu.desnoyers@efficios.com
      Cc: jdesfossez@efficios.com
      Cc: bristot@redhat.com
      Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e96a7705
  9. 30 March 2017 (2 commits)
    • sched/fair: Optimize ___update_sched_avg() · a481db34
      Committed by Yuyang Du
      The main PELT function ___update_load_avg(), which implements the
      accumulation and progression of the geometric average series, is
      implemented along the following lines for the scenario where the time
      delta spans all 3 possible sections (see figure below):
      
        1. add the remainder of the last incomplete period
        2. decay old sum
        3. accumulate new sum in full periods since last_update_time
        4. accumulate the current incomplete period
        5. update averages
      
      Or:
      
                  d1          d2           d3
                  ^           ^            ^
                  |           |            |
                |<->|<----------------->|<--->|
        ... |---x---|------| ... |------|-----x (now)
      
        load_sum' = (load_sum + weight * scale * d1) * y^(p+1) +	(1,2)
      
                                              p
      	      weight * scale * 1024 * \Sum y^n +		(3)
                                             n=1
      
      	      weight * scale * d3 * y^0				(4)
      
        load_avg' = load_sum' / LOAD_AVG_MAX				(5)
      
      Where:
      
       d1 - is the delta part completing the remainder of the last
            incomplete period,
       d2 - is the delta part spanning complete periods, and
       d3 - is the delta part starting the current incomplete period.
      
      We can simplify the code in two steps; the first step is to separate
      the first term into new and old parts like:
      
        (load_sum + weight * scale * d1) * y^(p+1) = load_sum * y^(p+1) +
      					       weight * scale * d1 * y^(p+1)
      
      Once we've done that, it's easy to see that all new terms carry the
      common factors:
      
        weight * scale
      
      If we factor those out, we arrive at the form:
      
        load_sum' = load_sum * y^(p+1) +
      
      	      weight * scale * (d1 * y^(p+1) +
      
      					 p
      			        1024 * \Sum y^n +
      					n=1
      
      				d3 * y^0)
      
      Which results in a simpler, smaller and faster implementation.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: matt@codeblueprint.co.uk
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1486935863-25251-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a481db34
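      A floating-point sketch of the accumulation step in the factored form
      derived above (weight and scale folded into a single factor w, the
      delta assumed to cross at least one period boundary; an illustration,
      not the kernel implementation):

        /* pelt_step.c -- build: gcc -O2 pelt_step.c -lm -o pelt_step */
        #include <stdio.h>
        #include <math.h>

        /* contrib: microseconds already elapsed in the current period at the
         * last update; delta: microseconds to account now */
        static double pelt_step(double sum, double w,
                                unsigned int contrib, unsigned int delta)
        {
                const double y = pow(0.5, 1.0 / 32.0);

                unsigned int d1 = 1024 - contrib;      /* finishes the old period */
                unsigned int p  = (delta - d1) / 1024; /* full periods in between */
                unsigned int d3 = (delta - d1) % 1024; /* starts the new period   */

                double series = 0.0;                   /* 1024 * \Sum_{n=1..p} y^n */
                for (unsigned int n = 1; n <= p; n++)
                        series += 1024.0 * pow(y, n);

                /* load_sum' = load_sum*y^(p+1) + w*(d1*y^(p+1) + series + d3) */
                return sum * pow(y, p + 1) +
                       w * (d1 * pow(y, p + 1) + series + d3);
        }

        int main(void)
        {
                /* an entity that was 300us into a period and then ran 5000us */
                printf("new sum = %.2f\n", pelt_step(10000.0, 1.0, 300, 5000));
                return 0;
        }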
    • sched/fair: Explicitly generate __update_load_avg() instances · 0ccb977f
      Committed by Peter Zijlstra
      The __update_load_avg() function is an __always_inline because it's
      used with constant propagation to generate different variants of the
      code without having to duplicate it (which would be prone to bugs).
      
      Explicitly instantiate the 3 variants.
      
      Note that most of this is called from rather hot paths, so reducing
      branches is good.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0ccb977f
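      The pattern described above, in miniature (illustrative names, not the
      kernel's): one always-inline worker whose boolean arguments are
      compile-time constants at every call site, plus explicit out-of-line
      entry points, so each instance is compiled with its dead branches
      removed.

        /* explicit_instances.c -- build: gcc -O2 explicit_instances.c -o explicit_instances */
        #include <stdio.h>

        static inline __attribute__((always_inline))
        int update_common(int sum, int delta, int has_load, int has_util)
        {
                if (has_load)
                        sum += delta;      /* load accounting branch */
                if (has_util)
                        sum += delta / 2;  /* util accounting branch */
                return sum;
        }

        /* explicit instances: the constants propagate, dead branches disappear */
        int update_none(int sum, int delta) { return update_common(sum, delta, 0, 0); }
        int update_load(int sum, int delta) { return update_common(sum, delta, 1, 0); }
        int update_both(int sum, int delta) { return update_common(sum, delta, 1, 1); }

        int main(void)
        {
                printf("%d %d %d\n", update_none(0, 10),
                       update_load(0, 10), update_both(0, 10));
                return 0;
        }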
  10. 27 March 2017 (2 commits)
    • sched/clock: Fix broken stable to unstable transfer · 7b09cc5a
      Committed by Pavel Tatashin
      When it is determined that the clock is actually unstable, and
      we switch from stable to unstable, the __clear_sched_clock_stable()
      function is eventually called.
      
      In this function we set gtod_offset so the following holds true:
      
        sched_clock() + raw_offset == ktime_get_ns() + gtod_offset
      
      But instead of getting the latest timestamps, we use the last values
      from scd, so instead of sched_clock() we use scd->tick_raw, and
      instead of ktime_get_ns() we use scd->tick_gtod.
      
      However, later, when gtod_offset is used in sched_clock_local(), we do
      not add it to scd->tick_gtod when calculating the correct clock value
      used to determine the min/max clock boundaries.
      
      This can result in tick granularity sched_clock() values, so fix it.
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: hpa@zytor.com
      Fixes: 5680d809 ("sched/clock: Provide better clock continuity")
      Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7b09cc5a
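      A toy model of the offset bookkeeping and of the clamp being fixed
      (field names follow the commit text, the values are invented; this is
      not the sched_clock code itself):

        /* clock_offsets.c -- build: gcc -O2 clock_offsets.c -o clock_offsets */
        #include <stdio.h>
        #include <stdint.h>

        #define TICK_NSEC 1000000ULL  /* a 1ms tick, for illustration */

        int main(void)
        {
                /* values saved at the last tick */
                uint64_t scd_tick_raw  = 7000000000ULL; /* sched_clock() at the tick  */
                uint64_t scd_tick_gtod = 5000000000ULL; /* ktime_get_ns() at the tick */
                uint64_t raw_offset    = 100000ULL;

                /* chosen so that sched_clock() + raw_offset == ktime_get_ns() + gtod_offset */
                uint64_t gtod_offset = scd_tick_raw + raw_offset - scd_tick_gtod;

                /* a raw delta since the tick gives a candidate clock value ... */
                uint64_t clock = scd_tick_raw + raw_offset + 250000ULL;

                /* ... which must stay within one tick of the GTOD-based clock;
                 * the fix is that gtod_offset has to be part of these bounds */
                uint64_t min_clock = scd_tick_gtod + gtod_offset;
                uint64_t max_clock = scd_tick_gtod + gtod_offset + TICK_NSEC;

                if (clock < min_clock)
                        clock = min_clock;
                if (clock > max_clock)
                        clock = max_clock;

                printf("gtod_offset=%llu clock=%llu\n",
                       (unsigned long long)gtod_offset,
                       (unsigned long long)clock);
                return 0;
        }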
    • sched/fair: Prefer sibiling only if local group is under-utilized · 05b40e05
      Committed by Srikar Dronamraju
      If the child domain prefers tasks to go to siblings, the local group
      could end up pulling tasks to itself even if the local group is almost
      as loaded as the source group.
      
      Let's assume a 4-core, SMT-2 machine running a 5-thread ebizzy workload.
      Every time the local group has capacity and the source group has at
      least 2 threads, the local group tries to pull a task. This causes the
      threads to constantly move between different cores. This is even more
      pronounced if the cores have more threads, as in Power 8 SMT-8 mode.
      
      Fix this by only allowing the local group to pull a task if the source
      group has more tasks than the local group.
      
      Here are the relevant perf stat numbers from a 22-core, SMT-8 Power 8 machine.
      
      Without patch:
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,440      context-switches          #    0.001 K/sec                    ( +-  1.26% )
                     366      cpu-migrations            #    0.000 K/sec                    ( +-  5.58% )
                   3,933      page-faults               #    0.002 K/sec                    ( +- 11.08% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   6,287      context-switches          #    0.001 K/sec                    ( +-  3.65% )
                   3,776      cpu-migrations            #    0.001 K/sec                    ( +-  4.84% )
                   5,702      page-faults               #    0.001 K/sec                    ( +-  9.36% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   8,776      context-switches          #    0.001 K/sec                    ( +-  0.73% )
                   2,790      cpu-migrations            #    0.000 K/sec                    ( +-  0.98% )
                  10,540      page-faults               #    0.001 K/sec                    ( +-  3.12% )
      
      With patch:
      
       Performance counter stats for 'ebizzy -t 22 -S 100' (5 runs):
      
                   1,133      context-switches          #    0.001 K/sec                    ( +-  4.72% )
                     123      cpu-migrations            #    0.000 K/sec                    ( +-  3.42% )
                   3,858      page-faults               #    0.002 K/sec                    ( +-  8.52% )
      
       Performance counter stats for 'ebizzy -t 48 -S 100' (5 runs):
      
                   2,169      context-switches          #    0.000 K/sec                    ( +-  6.19% )
                     189      cpu-migrations            #    0.000 K/sec                    ( +- 12.75% )
                   5,917      page-faults               #    0.001 K/sec                    ( +-  8.09% )
      
       Performance counter stats for 'ebizzy -t 96 -S 100' (5 runs):
      
                   5,333      context-switches          #    0.001 K/sec                    ( +-  5.91% )
                     506      cpu-migrations            #    0.000 K/sec                    ( +-  3.35% )
                  10,792      page-faults               #    0.001 K/sec                    ( +-  7.75% )
      
      These numbers show that CPU migrations are reduced significantly in these workloads.
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/1490205470-10249-1-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      05b40e05
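      A minimal sketch of the pull rule as stated in the message above
      (hypothetical struct and function names; the in-tree condition may be
      expressed differently):

        /* prefer_sibling.c -- build: gcc -O2 prefer_sibling.c -o prefer_sibling */
        #include <stdio.h>
        #include <stdbool.h>

        struct toy_group_stats { unsigned int sum_nr_running; };

        /* with SD_PREFER_SIBLING, only treat the candidate group as worth
         * pulling from when it runs more tasks than the local group does;
         * pulling from an equally loaded sibling just bounces tasks around */
        static bool worth_pulling(const struct toy_group_stats *source,
                                  const struct toy_group_stats *local)
        {
                return source->sum_nr_running > local->sum_nr_running;
        }

        int main(void)
        {
                struct toy_group_stats local     = { .sum_nr_running = 2 };
                struct toy_group_stats src_equal = { .sum_nr_running = 2 };
                struct toy_group_stats src_busy  = { .sum_nr_running = 3 };

                printf("equal sibling: pull=%d, busier sibling: pull=%d\n",
                       worth_pulling(&src_equal, &local),
                       worth_pulling(&src_busy, &local));
                return 0;
        }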
  11. 24 March 2017 (2 commits)
  12. 23 March 2017 (5 commits)
    • sched/fair: Fix FTQ noise bench regression · bc427898
      Committed by Vincent Guittot
      A regression of the FTQ noise has been reported by Ying Huang,
      on the following hardware:
      
        8 threads Intel(R) Core(TM)i7-4770 CPU @ 3.40GHz with 8G memory
      
      ... which was caused by this commit:
      
        commit 4e516076 ("sched/fair: Propagate asynchrous detach")
      
      The only part of the patch that can increase the noise is the update
      of blocked load of group entity in update_blocked_averages().
      
      We can optimize this call and skip the update of group entity if its load
      and utilization are already null and there is no pending propagation of load
      in the task group.
      
      This optimization partly restores the noise score. A more aggressive
      optimization was tried but showed a worse score.
      
      Reported-by: ying.huang@linux.intel.com
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dietmar.eggemann@arm.com
      Cc: ying.huang@intel.com
      Fixes: 4e516076 ("sched/fair: Propagate asynchrous detach")
      Link: http://lkml.kernel.org/r/1489758442-2877-1-git-send-email-vincent.guittot@linaro.org
      [ Fixed typos, improved layout. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bc427898
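      A toy version of the skip test described above (illustrative field
      names, not the kernel's): a group entity whose load and utilization
      have already decayed to zero, and whose group has nothing pending to
      propagate, can be skipped.

        /* skip_blocked.c -- build: gcc -O2 skip_blocked.c -o skip_blocked */
        #include <stdio.h>
        #include <stdbool.h>

        struct toy_avg    { unsigned long load_avg, util_avg; };
        struct toy_cfs_rq { unsigned long propagate_avg; };

        static bool can_skip_blocked_update(const struct toy_avg *se_avg,
                                            const struct toy_cfs_rq *group)
        {
                /* nothing to decay and nothing pending: the update is a no-op */
                return !se_avg->load_avg && !se_avg->util_avg &&
                       !group->propagate_avg;
        }

        int main(void)
        {
                struct toy_avg    idle_se = { 0, 0 }, busy_se = { 512, 300 };
                struct toy_cfs_rq quiet   = { 0 },    pending = { 1 };

                printf("%d %d %d\n",
                       can_skip_blocked_update(&idle_se, &quiet),    /* 1: skip   */
                       can_skip_blocked_update(&busy_se, &quiet),    /* 0: update */
                       can_skip_blocked_update(&idle_se, &pending)); /* 0: update */
                return 0;
        }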
    • sched/core: Fix rq lock pinning warning after call balance callbacks · d7921a5d
      Committed by Wanpeng Li
      This can be reproduced by running rt-migrate-test:
      
       WARNING: CPU: 2 PID: 2195 at kernel/locking/lockdep.c:3670 lock_unpin_lock()
       unpinning an unpinned lock
       ...
       Call Trace:
        dump_stack()
        __warn()
        warn_slowpath_fmt()
        lock_unpin_lock()
        __balance_callback()
        __schedule()
        schedule()
        futex_wait_queue_me()
        futex_wait()
        do_futex()
        SyS_futex()
        do_syscall_64()
        entry_SYSCALL64_slow_path()
      
      Revert the rq_lock_irqsave() usage here; the whole point of
      balance_callback() was to allow dropping rq->lock.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 8a8c69c3 ("sched/core: Add rq->lock wrappers")
      Link: http://lkml.kernel.org/r/1489718719-3951-1-git-send-email-wanpeng.li@hotmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d7921a5d
    • sched/clock, x86/perf: Fix "perf test tsc" · 698eff63
      Committed by Peter Zijlstra
      People reported that commit:
      
        5680d809 ("sched/clock: Provide better clock continuity")
      
      broke "perf test tsc".
      
      That commit added another offset to the reported clock value; so
      take that into account when computing the provided offset values.
      Reported-by: Adrian Hunter <adrian.hunter@intel.com>
      Reported-by: Arnaldo Carvalho de Melo <acme@kernel.org>
      Tested-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 5680d809 ("sched/clock: Provide better clock continuity")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      698eff63
    • sched/clock: Fix clear_sched_clock_stable() preempt wobbly · 71fdb70e
      Committed by Peter Zijlstra
      Paul reported a problem with clear_sched_clock_stable(). Since we run
      all of __clear_sched_clock_stable() from workqueue context, there's a
      preempt problem.
      
      Solve it by only running the static_key_disable() from workqueue.
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: fweisbec@gmail.com
      Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71fdb70e
    • cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely · b7eaf1aa
      Committed by Rafael J. Wysocki
      The way the schedutil governor uses the PELT metric causes it to
      underestimate the CPU utilization in some cases.
      
      That can be easily demonstrated by running kernel compilation on
      a Sandy Bridge Intel processor, running turbostat in parallel with
      it and looking at the values written to the MSR_IA32_PERF_CTL
      register.  Namely, the expected result would be that when all CPUs
      were 100% busy, all of them would be requested to run in the maximum
      P-state, but observation shows that this clearly isn't the case.
      The CPUs run in the maximum P-state for a while and then are
      requested to run slower and go back to the maximum P-state after
      a while again.  That causes the actual frequency of the processor to
      visibly oscillate below the sustainable maximum in a jittery fashion
      which clearly is not desirable.
      
      That has been attributed to CPU utilization metric updates on task
      migration that cause the total utilization value for the CPU to be
      reduced by the utilization of the migrated task.  If that happens,
      the schedutil governor may see a CPU utilization reduction and will
      attempt to reduce the CPU frequency accordingly right away.  That
      may be premature, though, for example if the system is generally
      busy and there are other runnable tasks waiting to be run on that
      CPU already.
      
      This is unlikely to be an issue on systems where cpufreq policies are
      shared between multiple CPUs, because in those cases the policy
      utilization is computed as the maximum of the CPU utilization values
      over the whole policy and if that turns out to be low, reducing the
      frequency for the policy most likely is a good idea anyway.  On
      systems with one CPU per policy, however, it may affect performance
      adversely and even lead to increased energy consumption in some cases.
      
      On those systems it may be addressed by taking another utilization
      metric into consideration, like whether or not the CPU whose
      frequency is about to be reduced has been idle recently, because if
      that's not the case, the CPU is likely to be busy in the near future
      and its frequency should not be reduced.
      
      To that end, use the counter of idle calls in the timekeeping code.
      Namely, make the schedutil governor look at that counter for the
      current CPU every time before its frequency is about to be reduced.
      If the counter has not changed since the previous iteration of the
      governor computations for that CPU, the CPU has been busy for all
      that time and its frequency should not be decreased, so if the new
      frequency would be lower than the one set previously, the governor
      will skip the frequency update.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Viresh Kumar <viresh.kumar@linaro.org>
      Reviewed-by: Joel Fernandes <joelaf@google.com>
      b7eaf1aa
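      A toy model of that decision (plain userspace C with invented names;
      the real governor samples the idle-calls counter from the timekeeping
      code): remember the previously seen idle-call count per CPU and refuse
      to lower the frequency while it has not moved.

        /* busy_cpu_freq.c -- build: gcc -O2 busy_cpu_freq.c -o busy_cpu_freq */
        #include <stdio.h>
        #include <stdint.h>
        #include <stdbool.h>

        struct toy_sg_cpu {
                uint64_t saved_idle_calls;  /* snapshot from the previous evaluation */
                unsigned int cur_freq;
        };

        static unsigned int pick_freq(struct toy_sg_cpu *sg, uint64_t idle_calls,
                                      unsigned int wanted_freq)
        {
                bool busy = (idle_calls == sg->saved_idle_calls);

                sg->saved_idle_calls = idle_calls;

                /* the CPU never went idle since last time: don't ramp it down */
                if (busy && wanted_freq < sg->cur_freq)
                        return sg->cur_freq;

                sg->cur_freq = wanted_freq;
                return wanted_freq;
        }

        int main(void)
        {
                struct toy_sg_cpu sg = { .saved_idle_calls = 0, .cur_freq = 2000 };

                /* utilization briefly dips while the CPU stays busy: hold the freq */
                printf("%u\n", pick_freq(&sg, 10, 2000)); /* 2000                     */
                printf("%u\n", pick_freq(&sg, 10, 1200)); /* 2000: no idle in between */
                printf("%u\n", pick_freq(&sg, 11, 1200)); /* 1200: the CPU went idle  */
                return 0;
        }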
  13. 21 March 2017 (1 commit)
  14. 16 March 2017 (6 commits)