1. 15 May 2017, 17 commits
    • sched/topology: Move comment about asymmetric node setups · c20e1ea4
      Authored by Lauro Ramos Venancio
      Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lwang@redhat.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1492717903-5195-4-git-send-email-lvenanci@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c20e1ea4
    • sched/topology: Optimize build_group_mask() · f32d782e
      Authored by Lauro Ramos Venancio
      The group mask is always used in intersection with the group CPUs. So,
      when building the group mask, we don't have to care about CPUs that are
      not part of the group.
      Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lwang@redhat.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1492717903-5195-2-git-send-email-lvenanci@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f32d782e
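      For illustration, a minimal sketch of build_group_mask() after this change,
      iterating only the CPUs spanned by the group (helper names such as
      sched_group_cpus() and sched_group_mask() are as assumed for that kernel
      version; treat this as approximate, not the literal patch):

        static void
        build_group_mask(struct sched_domain *sd, struct sched_group *sg)
        {
                const struct cpumask *sg_span = sched_group_cpus(sg);
                struct sd_data *sdd = sd->private;
                struct sched_domain *sibling;
                int i;

                /* Only CPUs that are part of the group can end up in its mask. */
                for_each_cpu(i, sg_span) {
                        sibling = *per_cpu_ptr(sdd->sd, i);
                        if (!cpumask_test_cpu(i, sched_domain_span(sibling)))
                                continue;
                        cpumask_set_cpu(i, sched_group_mask(sg));
                }
        }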
    • sched/topology: Verify the first group matches the child domain · a420b063
      Authored by Peter Zijlstra
      We want sched_groups to be sibling child domains (or individual CPUs
      when there are no child domains). Furthermore, since the first group
      of a domain should include the CPU of that domain, the first group of
      each domain should match the child domain.
      
      Verify this is indeed so.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a420b063
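      One plausible shape for such a check, as a sketch only (the exact placement
      and wording in kernel/sched/topology.c may differ):

        /* The first group of a domain should span exactly its child domain. */
        if (sd->child &&
            !cpumask_equal(sched_group_cpus(sd->groups),
                           sched_domain_span(sd->child)))
                pr_err("BUG: first group does not match the child domain span\n");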
    • sched/debug: Print the scheduler topology group mask · b0151c25
      Authored by Peter Zijlstra
      In order to determine the balance_cpu (for should_we_balance()) we need
      the sched_group_mask() for overlapping domains.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b0151c25
    • sched/topology: Simplify build_overlap_sched_groups() · 91eaed0d
      Authored by Peter Zijlstra
      Now that the first group will always be the previous domain of this
      @cpu, this can be simplified.
      
      In fact, writing the code now removed should've been a big clue I was
      doing it wrong :/
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      91eaed0d
    • sched/topology: Fix building of overlapping sched-groups · 0372dd27
      Authored by Peter Zijlstra
      When building the overlapping groups, we very obviously should start
      with the previous domain of _this_ @cpu, not CPU-0.
      
      This can be readily demonstrated with a topology like:
      
        node   0   1   2   3
          0:  10  20  30  20
          1:  20  10  20  30
          2:  30  20  10  20
          3:  20  30  20  10
      
      Where (for example) CPU1 ends up generating the following nonsensical groups:
      
        [] CPU1 attaching sched-domain:
        []  domain 0: span 0-2 level NUMA
        []   groups: 1 2 0
        []   domain 1: span 0-3 level NUMA
        []    groups: 1-3 (cpu_capacity = 3072) 0-1,3 (cpu_capacity = 3072)
      
      Where the fact that domain 1 doesn't include a group with span 0-2 is
      the obvious fail.
      
      With patch this looks like:
      
        [] CPU1 attaching sched-domain:
        []  domain 0: span 0-2 level NUMA
        []   groups: 1 0 2
        []   domain 1: span 0-3 level NUMA
        []    groups: 0-2 (cpu_capacity = 3072) 0,2-3 (cpu_capacity = 3072)
      Debugged-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Cc: stable@vger.kernel.org
      Fixes: e3589f6c ("sched: Allow for overlapping sched_domain spans")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      0372dd27
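      In code terms the fix amounts to starting the group-building walk at this
      @cpu rather than at the lowest CPU of the span, roughly as sketched below
      (the loop body is elided into a comment; names such as covered are
      assumptions about the surrounding function):

        /*
         * Walk the domain span starting at @cpu and wrapping around, so the
         * first group generated is the one that contains @cpu itself.
         */
        for_each_cpu_wrap(i, sched_domain_span(sd), cpu) {
                if (cpumask_test_cpu(i, covered))
                        continue;

                /* ... build and link the overlap group for CPU i ... */
        }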
    • sched/fair, cpumask: Export for_each_cpu_wrap() · c743f0a5
      Authored by Peter Zijlstra
      More users for for_each_cpu_wrap() have appeared. Promote the construct
      to generic cpumask interface.
      
      The implementation is slightly modified to reduce arguments.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Lauro Ramos Venancio <lvenanci@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: lwang@redhat.com
      Link: http://lkml.kernel.org/r/20170414122005.o35me2h5nowqkxbv@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c743f0a5
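      A small usage sketch of the now-generic interface (the mask and starting
      CPU below are only examples):

        int cpu;

        /*
         * Visit every CPU in cpu_online_mask exactly once, beginning at the
         * current CPU and wrapping around to the start of the mask.
         */
        for_each_cpu_wrap(cpu, cpu_online_mask, smp_processor_id())
                pr_info("visiting CPU %d\n", cpu);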
    • sched/topology: Refactor function build_overlap_sched_groups() · 8c033469
      Authored by Lauro Ramos Venancio
      Create functions build_group_from_child_sched_domain() and
      init_overlap_sched_group(). No functional change.
      Signed-off-by: Lauro Ramos Venancio <lvenanci@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1492091769-19879-2-git-send-email-lvenanci@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8c033469
    • sched/clock: Print a warning recommending 'tsc=unstable' · 7708d5f0
      Authored by Peter Zijlstra
      With our switch to stable delayed until late_initcall(), the most
      likely cause of hitting mark_tsc_unstable() is the watchdog. The
      watchdog typically only triggers when creative BIOS'es fiddle with the
      TSC to hide SMI latency.
      
      Since the watchdog can only detect TSC fiddling after the fact, all TSC
      clocks (including userspace GTOD) can already have reported funny
      values.
      
      The only way to fully avoid this, is manually marking the TSC unstable
      at boot. Suggest people do this on their broken systems.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7708d5f0
    • sched/clock: Use late_initcall() instead of sched_init_smp() · 2e44b7dd
      Authored by Peter Zijlstra
      Core2 marks its TSC unstable in ACPI Processor Idle, which is probed
      after sched_init_smp(). Luckily it appears both acpi_processor and
      intel_idle (which has a similar check) are mandatory built-in.
      
      This means we can delay switching to stable until after these drivers
      have run (if they were modules, this would be impossible).
      
      Delay the stable switch to late_initcall() to allow these drivers to
      mark TSC unstable and avoid difficult stable->unstable transitions.
      Reported-by: Lofstedt, Marta <marta.lofstedt@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2e44b7dd
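      The switch-over itself uses the ordinary initcall machinery; a minimal
      sketch, assuming a helper named sched_clock_init_late() (the body here is
      only indicative):

        static int __init sched_clock_init_late(void)
        {
                /*
                 * By now the built-in acpi_processor / intel_idle probing has
                 * run, so any mark_tsc_unstable() calls have already happened
                 * and the stable/unstable decision can safely be made.
                 */
                /* ... switch sched_clock to its final (stable) mode ... */
                return 0;
        }
        late_initcall(sched_clock_init_late);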
    • cpuidle: Fix idle time tracking · f9fccdb9
      Authored by Peter Zijlstra
      Ville reported that on his Core2, which has TSC stop in idle, we would
      always report very short idle durations. He tracked this down to
      commit:
      
        e93e59ce ("cpuidle: Replace ktime_get() with local_clock()")
      
      which replaces ktime_get() with local_clock().
      
      Add a sched_clock_idle_wakeup_event() call, which will re-sync the
      clock with ktime_get_ns() when TSC is unstable and no-op otherwise.
      Reported-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Tested-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Fixes: e93e59ce ("cpuidle: Replace ktime_get() with local_clock()")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f9fccdb9
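      A sketch of where the call lands in the cpuidle exit path (placement and
      surrounding code are approximate):

        /* in cpuidle_enter_state(), after leaving the idle state */
        time_end = ns_to_ktime(local_clock());

        /*
         * Re-sync sched_clock with ktime_get_ns() if the TSC stopped in idle
         * and is marked unstable; this is a no-op on stable-TSC systems.
         */
        sched_clock_idle_wakeup_event();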
    • sched/clock: Remove watchdog touching · 3067a33d
      Authored by Peter Zijlstra
      Commit:
      
        2bacec8c ("sched: touch softlockup watchdog after idling")
      
      introduced the touch_softlockup_watchdog_sched() call without
      justification and I feel sched_clock management is not the right
      place; it should only be concerned with producing semi-coherent time.
      
      If this causes watchdog thingies, we can find a better place.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3067a33d
    • sched/clock: Remove unused argument to sched_clock_idle_wakeup_event() · ac1e843f
      Authored by Peter Zijlstra
      The argument to sched_clock_idle_wakeup_event() has not been used in a
      long time. Remove it.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac1e843f
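      The interface change, assuming the old parameter was the usual unused
      u64 delta (a sketch of the prototypes only):

        /* before */
        extern void sched_clock_idle_wakeup_event(u64 delta_ns);

        /* after */
        extern void sched_clock_idle_wakeup_event(void);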
    • x86/tsc, sched/clock, clocksource: Use clocksource watchdog to provide stable sync points · b421b22b
      Authored by Peter Zijlstra
      Currently we keep sched_clock_tick() active for stable TSC in order to
      keep the per-CPU state semi up-to-date. The (obvious) problem is that
      by the time we detect TSC is borked, our per-CPU state is also borked.
      
      So hook into the clocksource watchdog and call a method after we've
      found it to still be stable.
      
      There's the obvious race where the TSC goes wonky between finding it
      stable and us running the callback, but closing that is too much work
      and not really worth it, since we're already detecting TSC wobbles
      after the fact, so we cannot, per definition, fully avoid funny clock
      values.
      
      And since the watchdog runs less often than the tick, this is also an
      optimization.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b421b22b
    • sched/clock: Initialize all per-CPU state before switching (back) to unstable · cf15ca8d
      Authored by Peter Zijlstra
      In preparation for not keeping the sched_clock_tick() active for
      stable TSC, we need to explicitly initialize all per-CPU state
      before switching back to unstable.
      
      Note: this patch loses the __gtod_offset calculation; it will be
      restored in the next one.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cf15ca8d
    • sched/cfs: Make util/load_avg more stable · 625ed2bf
      Authored by Vincent Guittot
      In the current implementation of load/util_avg, we assume that the
      ongoing time segment has fully elapsed, and util/load_sum is divided
      by LOAD_AVG_MAX, even if part of the time segment still remains to
      run. As a consequence, this remaining part is considered as idle time
      and generates unexpected variations of util_avg of a busy CPU in the
      range [1002..1024[ whereas util_avg should stay at 1023.
      
      In order to keep the metric stable, we should not consider the ongoing
      time segment when computing load/util_avg but only the segments that
      have already fully elapsed. But not considering the current time
      segment adds unwanted latency to the load/util_avg responsiveness,
      especially when the time is scaled instead of the contribution.
      
      Instead of waiting for the current time segment to have fully elapsed
      before accounting it in load/util_avg, we can already account the
      elapsed part but change the range used to compute load/util_avg
      accordingly.
      
      At the very beginning of a new time segment, the past segments have
      been decayed and the max value is LOAD_AVG_MAX*y. At the very end of
      the current time segment, the max value becomes:
      
        LOAD_AVG_MAX*y + 1024(us)  (== LOAD_AVG_MAX)
      
      In fact, the max value is:
      
        LOAD_AVG_MAX*y + sa->period_contrib
      
      at any time in the time segment.
      
      Taking advantage of the fact that:
      
        LOAD_AVG_MAX*y == LOAD_AVG_MAX-1024
      
      the range becomes [0..LOAD_AVG_MAX-1024+sa->period_contrib].
      
      As the elapsed part is already accounted in load/util_sum, we update
      the max value according to the current position in the time segment
      instead of removing its contribution.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: pjt@google.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1493188076-2767-1-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      625ed2bf
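      In code this boils down to dividing by a range that tracks the current
      position in the segment instead of by LOAD_AVG_MAX; a sketch in terms of
      the PELT fields of that era (approximate, not the literal patch):

        /* adjusted averaging range inside ___update_load_avg() */
        u32 divider = LOAD_AVG_MAX - 1024 + sa->period_contrib;

        sa->load_avg = div_u64(weight * sa->load_sum, divider);
        sa->util_avg = sa->util_sum / divider;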
    • sched/core: Call __schedule() from do_idle() without enabling preemption · 8663effb
      Authored by Steven Rostedt (VMware)
      I finally got around to creating trampolines for dynamically allocated
      ftrace_ops by using synchronize_rcu_tasks(). For users of the ftrace
      function hook callbacks, like perf, that allocate the ftrace_ops
      descriptor via kmalloc() and friends, ftrace was not able to optimize
      the functions being traced to use a trampoline because they would also
      need to be allocated dynamically. The problem is that they cannot be
      freed when CONFIG_PREEMPT is set, as there's no way to tell if a task
      was preempted on the trampoline. That was before Paul McKenney
      implemented synchronize_rcu_tasks() that would make sure all tasks
      (except idle) have scheduled out or have entered user space.
      
      While testing this, I triggered this bug:
      
       BUG: unable to handle kernel paging request at ffffffffa0230077
       ...
       RIP: 0010:0xffffffffa0230077
       ...
       Call Trace:
        schedule+0x5/0xe0
        schedule_preempt_disabled+0x18/0x30
        do_idle+0x172/0x220
      
      What happened was that the idle task was preempted on the trampoline.
      As synchronize_rcu_tasks() ignores the idle thread, there's nothing
      that lets ftrace know that the idle task was preempted on a trampoline.
      
      The idle task shouldn't need to ever enable preemption. The idle task
      is simply a loop that calls schedule or places the cpu into idle mode.
      In fact, having preemption enabled is inefficient, because it can
      happen when idle is just about to call schedule anyway, which would
      cause schedule to be called twice. Once for when the interrupt came in
      and was returning back to normal context, and then again in the normal
      path that the idle loop is running in, which would be pointless, as it
      had already scheduled.
      
      The only reason schedule_preempt_disabled() enables preemption is to be
      able to call sched_submit_work(), which requires preemption enabled. As
      this is a nop when the task is in the RUNNING state, and idle is always
      in the running state, there's no reason that idle needs to enable
      preemption. But that means it cannot use schedule_preempt_disabled() as
      other callers of that function require calling sched_submit_work().
      
      Adding a new function local to kernel/sched/ that allows idle to call
      the scheduler without enabling preemption, fixes the
      synchronize_rcu_tasks() issue, as well as removes the pointless spurious
      schedule calls caused by interrupts happening in the brief window where
      preemption is enabled just before it calls schedule.
      
      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8663effb
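      A sketch of the kind of idle-only scheduler entry point this adds to
      kernel/sched/core.c (the name schedule_idle() and the WARN are assumptions
      about the exact implementation):

        /*
         * Scheduler entry for the idle task only: unlike schedule(), it never
         * enables preemption and never calls sched_submit_work(), which is a
         * no-op for a task in the TASK_RUNNING state anyway.
         */
        void __sched schedule_idle(void)
        {
                WARN_ON_ONCE(current->state);
                do {
                        __schedule(false);
                } while (need_resched());
        }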
  2. 13 May 2017, 2 commits
  3. 10 May 2017, 1 commit
    • perf/callchain: Force USER_DS when invoking perf_callchain_user() · 88b0193d
      Authored by Will Deacon
      Perf can generate and record a user callchain in response to a synchronous
      request, such as a tracepoint firing. If this happens under set_fs(KERNEL_DS),
      then we can end up walking the user stack (and dereferencing/saving whatever we
      find there) without the protections usually afforded by checks such as
      access_ok.
      
      Rather than play whack-a-mole with each architecture's stack unwinding
      implementation, fix the root of the problem by ensuring that we force USER_DS
      when invoking perf_callchain_user from the perf core.
      Reported-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Jiri Olsa <jolsa@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      88b0193d
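      The core-side fix is conceptually a save/force/restore of the address
      limit around the user unwind; a sketch of what that looks like in
      get_perf_callchain() (approximate):

        mm_segment_t fs = get_fs();

        /* Make access_ok()-style checks apply to the user stack walk. */
        set_fs(USER_DS);
        perf_callchain_user(&ctx, regs);
        set_fs(fs);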
  4. 09 May 2017, 15 commits
  5. 06 May 2017, 1 commit
    • ACPI / sleep: Ignore spurious SCI wakeups from suspend-to-idle · eed4d47e
      Authored by Rafael J. Wysocki
      The ACPI SCI (System Control Interrupt) is set up as a wakeup IRQ
      during suspend-to-idle transitions and, consequently, any events
      signaled through it wake up the system from that state.  However,
      on some systems some of the events signaled via the ACPI SCI while
      suspended to idle should not cause the system to wake up.  In fact,
      quite often they should just be discarded.
      
      Arguably, systems should not resume entirely on such events, but in
      order to decide which events really should cause the system to resume
      and which are spurious, it is necessary to resume up to the point
      when ACPI SCIs are actually handled and processed, which is after
      executing dpm_resume_noirq() in the system resume path.
      
      For these reasons, add a loop around freeze_enter() in which the
      platforms can process events signaled via multiplexed IRQ lines
      like the ACPI SCI and add suspend-to-idle hooks that can be
      used for this purpose to struct platform_freeze_ops.
      
      In the ACPI case, the ->wake hook is used for checking if the SCI
      has triggered while suspended and deferring the interrupt-induced
      system wakeup until the events signaled through it are actually
      processed sufficiently to decide whether or not the system should
      resume.  In turn, the ->sync hook allows all of the relevant event
      queues to be flushed so as to prevent events from being missed due
      to race conditions.
      
      In addition to that, some ACPI code processing wakeup events needs
      to be modified to use the "hard" version of wakeup triggers, so that
      it will cause a system resume to happen on device-induced wakeup
      events even if the "soft" mechanism to prevent the system from
      suspending is not enabled (that also helps to catch device-induced
      wakeup events occurring during suspend transitions in progress).
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      eed4d47e
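      A much-simplified sketch of the loop described above (the ->wake and
      ->sync hooks come from the text; the loop structure and the wakeup check
      are assumptions):

        do {
                freeze_enter();

                /* Let the platform (ACPI) decide whether the SCI was a real wakeup... */
                if (freeze_ops && freeze_ops->wake)
                        freeze_ops->wake();

                /* ...and flush its event queues so a genuine wakeup is not missed. */
                if (freeze_ops && freeze_ops->sync)
                        freeze_ops->sync();
        } while (!pm_wakeup_pending());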
  6. 05 May 2017, 1 commit
  7. 04 May 2017, 3 commits