1. 14 5月, 2018 1 次提交
    • M
      sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
      Mel Gorman 提交于
      Threads share an address space and each can change the protections of the
      same address space to trap NUMA faults. This is redundant and potentially
      counter-productive as any thread doing the update will suffice. Potentially
      only one thread is required but that thread may be idle or it may not have
      any locality concerns and pick an unsuitable scan rate.
      
      This patch uses independent scan period but they are staggered based on
      the number of address space users when the thread is created.  The intent
      is that threads will avoid scanning at the same time and have a chance
      to adapt their scan rate later if necessary. This reduces the total scan
      activity early in the lifetime of the threads.
      
      The different in headline performance across a range of machines and
      workloads is marginal but the system CPU usage is reduced as well as overall
      scan activity.  The following is the time reported by NAS Parallel Benchmark
      using unbound openmp threads and a D size class:
      
      			      4.17.0-rc1             4.17.0-rc1
      				 vanilla           stagger-v1r1
      	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
      	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
      	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
      	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
      	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
      	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
      	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
      	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
      
      Note it's not a universal win but we have no prior knowledge of which
      thread matters but the number of threads created often exceeds the size
      of the node when the threads are not bound. However, there is a reducation
      of overall system CPU usage:
      
      				    4.17.0-rc1             4.17.0-rc1
      				       vanilla           stagger-v1r1
      	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
      	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
      	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
      	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
      	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
      	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
      	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
      	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
      
      NUMA scan activity is also reduced:
      
      	NUMA alloc local               1042828     1342670
      	NUMA base PTE updates        140481138    93577468
      	NUMA huge PMD updates           272171      180766
      	NUMA page range updates      279832690   186129660
      	NUMA hint faults               1395972     1193897
      	NUMA hint local faults          877925      855053
      	NUMA hint local percent             62          71
      	NUMA pages migrated           12057909     9158023
      
      Similar observations are made for other thread-intensive workloads. System
      CPU usage is lower even though the headline gains in performance tend to be
      small. For example, specjbb 2005 shows almost no difference in performance
      but scan activity is reduced by a third on a 4-socket box. I didn't find
      a workload (thread intensive or otherwise) that suffered badly.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      13784475
  2. 05 4月, 2018 1 次提交
  3. 20 3月, 2018 1 次提交
    • P
      sched/cpufreq/schedutil: Use util_est for OPP selection · a07630b8
      Patrick Bellasi 提交于
      When schedutil looks at the CPU utilization, the current PELT value for
      that CPU is returned straight away. In certain scenarios this can have
      undesired side effects and delays on frequency selection.
      
      For example, since the task utilization is decayed at wakeup time, a
      long sleeping big task newly enqueued does not add immediately a
      significant contribution to the target CPU. This introduces some latency
      before schedutil will be able to detect the best frequency required by
      that task.
      
      Moreover, the PELT signal build-up time is a function of the current
      frequency, because of the scale invariant load tracking support. Thus,
      starting from a lower frequency, the utilization build-up time will
      increase even more and further delays the selection of the actual
      frequency which better serves the task requirements.
      
      In order to reduce these kind of latencies, we integrate the usage
      of the CPU's estimated utilization in the sugov_get_util function.
      
      This allows to properly consider the expected utilization of a CPU which,
      for example, has just got a big task running after a long sleep period.
      Ultimately this allows to select the best frequency to run a task
      right after its wake-up.
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Acked-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@android.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20180309095245.11071-4-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a07630b8
  4. 09 3月, 2018 5 次提交
  5. 04 3月, 2018 1 次提交
    • I
      sched/headers: Simplify and clean up header usage in the scheduler · 325ea10c
      Ingo Molnar 提交于
      Do the following cleanups and simplifications:
      
       - sched/sched.h already includes <asm/paravirt.h>, so no need to
         include it in sched/core.c again.
      
       - order the <linux/sched/*.h> headers alphabetically
      
       - add all <linux/sched/*.h> headers to kernel/sched/sched.h
      
       - remove all unnecessary includes from the .c files that
         are already included in kernel/sched/sched.h.
      
      Finally, make all scheduler .c files use a single common header:
      
        #include "sched.h"
      
      ... which now contains a union of the relied upon headers.
      
      This makes the various .c files easier to read and easier to handle.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      325ea10c
  6. 03 3月, 2018 1 次提交
    • I
      sched: Clean up and harmonize the coding style of the scheduler code base · 97fb7a0a
      Ingo Molnar 提交于
      A good number of small style inconsistencies have accumulated
      in the scheduler core, so do a pass over them to harmonize
      all these details:
      
       - fix speling in comments,
      
       - use curly braces for multi-line statements,
      
       - remove unnecessary parentheses from integer literals,
      
       - capitalize consistently,
      
       - remove stray newlines,
      
       - add comments where necessary,
      
       - remove invalid/unnecessary comments,
      
       - align structure definitions and other data types vertically,
      
       - add missing newlines for increased readability,
      
       - fix vertical tabulation where it's misaligned,
      
       - harmonize preprocessor conditional block labeling
         and vertical alignment,
      
       - remove line-breaks where they uglify the code,
      
       - add newline after local variable definitions,
      
      No change in functionality:
      
        md5:
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.before.asm
           1191fa0a890cfa8132156d2959d7e9e2  built-in.o.after.asm
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      97fb7a0a
  7. 21 2月, 2018 2 次提交
  8. 06 2月, 2018 1 次提交
  9. 10 1月, 2018 7 次提交
  10. 09 11月, 2017 1 次提交
    • P
      sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds · 765cc3a4
      Patrick Bellasi 提交于
      When the kernel is compiled with !CONFIG_SCHED_DEBUG support, we expect that
      all SCHED_FEAT are turned into compile time constants being propagated
      to support compiler optimizations.
      
      Specifically, we expect that code blocks like this:
      
         if (sched_feat(FEATURE_NAME) [&& <other_conditions>]) {
      	/* FEATURE CODE */
         }
      
      are turned into dead-code in case FEATURE_NAME defaults to FALSE, and thus
      being removed by the compiler from the finale image.
      
      For this mechanism to properly work it's required for the compiler to
      have full access, from each translation unit, to whatever is the value
      defined by the sched_feat macro. This macro is defined as:
      
         #define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))
      
      and thus, the compiler can optimize that code only if the value of
      sysctl_sched_features is visible within each translation unit.
      
      Since:
      
         029632fb ("sched: Make separate sched*.c translation units")
      
      the scheduler code has been split into separate translation units
      however the definition of sysctl_sched_features is part of
      kernel/sched/core.c while, for all the other scheduler modules, it is
      visible only via kernel/sched/sched.h as an:
      
         extern const_debug unsigned int sysctl_sched_features
      
      Unfortunately, an extern reference does not allow the compiler to apply
      constants propagation. Thus, on !CONFIG_SCHED_DEBUG kernel we still end up
      with code to load a memory reference and (eventually) doing an unconditional
      jump of a chunk of code.
      
      This mechanism is unavoidable when sched_features can be turned on and off at
      run-time. However, this is not the case for "production" kernels compiled with
      !CONFIG_SCHED_DEBUG. In this case, sysctl_sched_features is just a constant value
      which cannot be changed at run-time and thus memory loads and jumps can be
      avoided altogether.
      
      This patch fixes the case of !CONFIG_SCHED_DEBUG kernel by declaring a local version
      of the sysctl_sched_features constant for each translation unit. This will
      ultimately allow the compiler to perform constants propagation and dead-code
      pruning.
      
      Tests have been done, with !CONFIG_SCHED_DEBUG on a v4.14-rc8 with and without
      the patch, by running 30 iterations of:
      
         perf bench sched messaging --pipe --thread --group 4 --loop 50000
      
      on a 40 cores Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz using the
      powersave governor to rule out variations due to frequency scaling.
      
      Statistics on the reported completion time:
      
                         count     mean       std     min       99%     max
        v4.14-rc8         30.0  15.7831  0.176032  15.442  16.01226  16.014
        v4.14-rc8+patch   30.0  15.5033  0.189681  15.232  15.93938  15.962
      
      ... show a 1.8% speedup on average completion time and 0.5% speedup in the
      99 percentile.
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NChris Redpath <chris.redpath@arm.com>
      Reviewed-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Reviewed-by: NBrendan Jackman <brendan.jackman@arm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Link: http://lkml.kernel.org/r/20171108184101.16006-1-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      765cc3a4
  11. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  12. 10 10月, 2017 3 次提交
    • S
      sched/rt: Simplify the IPI based RT balancing logic · 4bdced5c
      Steven Rostedt (Red Hat) 提交于
      When a CPU lowers its priority (schedules out a high priority task for a
      lower priority one), a check is made to see if any other CPU has overloaded
      RT tasks (more than one). It checks the rto_mask to determine this and if so
      it will request to pull one of those tasks to itself if the non running RT
      task is of higher priority than the new priority of the next task to run on
      the current CPU.
      
      When we deal with large number of CPUs, the original pull logic suffered
      from large lock contention on a single CPU run queue, which caused a huge
      latency across all CPUs. This was caused by only having one CPU having
      overloaded RT tasks and a bunch of other CPUs lowering their priority. To
      solve this issue, commit:
      
        b6366f04 ("sched/rt: Use IPI to trigger RT task push migration instead of pulling")
      
      changed the way to request a pull. Instead of grabbing the lock of the
      overloaded CPU's runqueue, it simply sent an IPI to that CPU to do the work.
      
      Although the IPI logic worked very well in removing the large latency build
      up, it still could suffer from a large number of IPIs being sent to a single
      CPU. On a 80 CPU box, I measured over 200us of processing IPIs. Worse yet,
      when I tested this on a 120 CPU box, with a stress test that had lots of
      RT tasks scheduling on all CPUs, it actually triggered the hard lockup
      detector! One CPU had so many IPIs sent to it, and due to the restart
      mechanism that is triggered when the source run queue has a priority status
      change, the CPU spent minutes! processing the IPIs.
      
      Thinking about this further, I realized there's no reason for each run queue
      to send its own IPI. As all CPUs with overloaded tasks must be scanned
      regardless if there's one or many CPUs lowering their priority, because
      there's no current way to find the CPU with the highest priority task that
      can schedule to one of these CPUs, there really only needs to be one IPI
      being sent around at a time.
      
      This greatly simplifies the code!
      
      The new approach is to have each root domain have its own irq work, as the
      rto_mask is per root domain. The root domain has the following fields
      attached to it:
      
        rto_push_work	 - the irq work to process each CPU set in rto_mask
        rto_lock	 - the lock to protect some of the other rto fields
        rto_loop_start - an atomic that keeps contention down on rto_lock
      		    the first CPU scheduling in a lower priority task
      		    is the one to kick off the process.
        rto_loop_next	 - an atomic that gets incremented for each CPU that
      		    schedules in a lower priority task.
        rto_loop	 - a variable protected by rto_lock that is used to
      		    compare against rto_loop_next
        rto_cpu	 - The cpu to send the next IPI to, also protected by
      		    the rto_lock.
      
      When a CPU schedules in a lower priority task and wants to make sure
      overloaded CPUs know about it. It increments the rto_loop_next. Then it
      atomically sets rto_loop_start with a cmpxchg. If the old value is not "0",
      then it is done, as another CPU is kicking off the IPI loop. If the old
      value is "0", then it will take the rto_lock to synchronize with a possible
      IPI being sent around to the overloaded CPUs.
      
      If rto_cpu is greater than or equal to nr_cpu_ids, then there's either no
      IPI being sent around, or one is about to finish. Then rto_cpu is set to the
      first CPU in rto_mask and an IPI is sent to that CPU. If there's no CPUs set
      in rto_mask, then there's nothing to be done.
      
      When the CPU receives the IPI, it will first try to push any RT tasks that is
      queued on the CPU but can't run because a higher priority RT task is
      currently running on that CPU.
      
      Then it takes the rto_lock and looks for the next CPU in the rto_mask. If it
      finds one, it simply sends an IPI to that CPU and the process continues.
      
      If there's no more CPUs in the rto_mask, then rto_loop is compared with
      rto_loop_next. If they match, everything is done and the process is over. If
      they do not match, then a CPU scheduled in a lower priority task as the IPI
      was being passed around, and the process needs to start again. The first CPU
      in rto_mask is sent the IPI.
      
      This change removes this duplication of work in the IPI logic, and greatly
      lowers the latency caused by the IPIs. This removed the lockup happening on
      the 120 CPU machine. It also simplifies the code tremendously. What else
      could anyone ask for?
      
      Thanks to Peter Zijlstra for simplifying the rto_loop_start atomic logic and
      supplying me with the rto_start_trylock() and rto_start_unlock() helper
      functions.
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Scott Wood <swood@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170424114732.1aac6dc4@gandalf.local.homeSigned-off-by: NIngo Molnar <mingo@kernel.org>
      4bdced5c
    • P
      sched/deadline: Rename __dl_clear() to __dl_sub() · 8c0944ce
      Peter Zijlstra 提交于
      __dl_sub() is more meaningful as a name, and is more consistent
      with the naming of the dual function (__dl_add()).
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NLuca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1504778971-13573-4-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8c0944ce
    • L
      sched/headers: Remove duplicate prototype of __dl_clear_params() · e964d350
      luca abeni 提交于
      Signed-off-by: Nluca abeni <luca.abeni@santannapisa.it>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDaniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1504778971-13573-2-git-send-email-luca.abeni@santannapisa.itSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e964d350
  13. 30 9月, 2017 4 次提交
    • P
      sched/fair: Propagate an effective runnable_load_avg · 1ea6c46a
      Peter Zijlstra 提交于
      The load balancer uses runnable_load_avg as load indicator. For
      !cgroup this is:
      
        runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq
      
      That is, a direct sum of all runnable tasks on that runqueue. As
      opposed to load_avg, which is a sum of all tasks on the runqueue,
      which includes a blocked component.
      
      However, in the cgroup case, this comes apart since the group entities
      are always runnable, even if most of their constituent entities are
      blocked.
      
      Therefore introduce a runnable_weight which for task entities is the
      same as the regular weight, but for group entities is a fraction of
      the entity weight and represents the runnable part of the group
      runqueue.
      
      Then propagate this load through the PELT hierarchy to arrive at an
      effective runnable load avgerage -- which we should not confuse with
      the canonical runnable load average.
      Suggested-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1ea6c46a
    • P
      sched/fair: Rewrite PELT migration propagation · 0e2d2aaa
      Peter Zijlstra 提交于
      When an entity migrates in (or out) of a runqueue, we need to add (or
      remove) its contribution from the entire PELT hierarchy, because even
      non-runnable entities are included in the load average sums.
      
      In order to do this we have some propagation logic that updates the
      PELT tree, however the way it 'propagates' the runnable (or load)
      change is (more or less):
      
                           tg->weight * grq->avg.load_avg
        ge->avg.load_avg = ------------------------------
                                     tg->load_avg
      
      But that is the expression for ge->weight, and per the definition of
      load_avg:
      
        ge->avg.load_avg := ge->weight * ge->avg.runnable_avg
      
      That destroys the runnable_avg (by setting it to 1) we wanted to
      propagate.
      
      Instead directly propagate runnable_sum.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      0e2d2aaa
    • P
      sched/fair: Rewrite cfs_rq->removed_*avg · 2a2f5d4e
      Peter Zijlstra 提交于
      Since on wakeup migration we don't hold the rq->lock for the old CPU
      we cannot update its state. Instead we add the removed 'load' to an
      atomic variable and have the next update on that CPU collect and
      process it.
      
      Currently we have 2 atomic variables; which already have the issue
      that they can be read out-of-sync. Also, two atomic ops on a single
      cacheline is already more expensive than an uncontended lock.
      
      Since we want to add more, convert the thing over to an explicit
      cacheline with a lock in.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      2a2f5d4e
    • V
      sched/fair: Use reweight_entity() for set_user_nice() · 9059393e
      Vincent Guittot 提交于
      Now that we directly change load_avg and propagate that change into
      the sums, sys_nice() and co should do the same, otherwise its possible
      to confuse load accounting when we migrate near the weight change.
      Fixes-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      [ Added changelog, fixed the call condition. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20170517095045.GA8420@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9059393e
  14. 25 9月, 2017 1 次提交
  15. 12 9月, 2017 1 次提交
  16. 09 9月, 2017 2 次提交
  17. 29 8月, 2017 1 次提交
    • Y
      smp: Avoid using two cache lines for struct call_single_data · 966a9671
      Ying Huang 提交于
      struct call_single_data is used in IPIs to transfer information between
      CPUs.  Its size is bigger than sizeof(unsigned long) and less than
      cache line size.  Currently it is not allocated with any explicit alignment
      requirements.  This makes it possible for allocated call_single_data to
      cross two cache lines, which results in double the number of the cache lines
      that need to be transferred among CPUs.
      
      This can be fixed by requiring call_single_data to be aligned with the
      size of call_single_data. Currently the size of call_single_data is the
      power of 2.  If we add new fields to call_single_data, we may need to
      add padding to make sure the size of new definition is the power of 2
      as well.
      
      Fortunately, this is enforced by GCC, which will report bad sizes.
      
      To set alignment requirements of call_single_data to the size of
      call_single_data, a struct definition and a typedef is used.
      
      To test the effect of the patch, I used the vm-scalability multiple
      thread swap test case (swap-w-seq-mt).  The test will create multiple
      threads and each thread will eat memory until all RAM and part of swap
      is used, so that huge number of IPIs are triggered when unmapping
      memory.  In the test, the throughput of memory writing improves ~5%
      compared with misaligned call_single_data, because of faster IPIs.
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Signed-off-by: NHuang, Ying <ying.huang@intel.com>
      [ Add call_single_data_t and align with size of call_single_data. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/87bmnqd6lz.fsf@yhuang-mobile.sh.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      966a9671
  18. 25 8月, 2017 1 次提交
  19. 01 8月, 2017 1 次提交
    • V
      sched: cpufreq: Allow remote cpufreq callbacks · 674e7541
      Viresh Kumar 提交于
      With Android UI and benchmarks the latency of cpufreq response to
      certain scheduling events can become very critical. Currently, callbacks
      into cpufreq governors are only made from the scheduler if the target
      CPU of the event is the same as the current CPU. This means there are
      certain situations where a target CPU may not run the cpufreq governor
      for some time.
      
      One testcase to show this behavior is where a task starts running on
      CPU0, then a new task is also spawned on CPU0 by a task on CPU1. If the
      system is configured such that the new tasks should receive maximum
      demand initially, this should result in CPU0 increasing frequency
      immediately. But because of the above mentioned limitation though, this
      does not occur.
      
      This patch updates the scheduler core to call the cpufreq callbacks for
      remote CPUs as well.
      
      The schedutil, ondemand and conservative governors are updated to
      process cpufreq utilization update hooks called for remote CPUs where
      the remote CPU is managed by the cpufreq policy of the local CPU.
      
      The intel_pstate driver is updated to always reject remote callbacks.
      
      This is tested with couple of usecases (Android: hackbench, recentfling,
      galleryfling, vellamo, Ubuntu: hackbench) on ARM hikey board (64 bit
      octa-core, single policy). Only galleryfling showed minor improvements,
      while others didn't had much deviation.
      
      The reason being that this patch only targets a corner case, where
      following are required to be true to improve performance and that
      doesn't happen too often with these tests:
      
      - Task is migrated to another CPU.
      - The task has high demand, and should take the target CPU to higher
        OPPs.
      - And the target CPU doesn't call into the cpufreq governor until the
        next tick.
      
      Based on initial work from Steve Muckle.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Acked-by: NSaravana Kannan <skannan@codeaurora.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      674e7541
  20. 23 6月, 2017 2 次提交
  21. 20 6月, 2017 1 次提交
    • I
      sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well · 6d3aed3d
      Ingo Molnar 提交于
      This definition of SCHED_WARN_ON():
      
       #define SCHED_WARN_ON(x)        ((void)(x))
      
      is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
      has no return value, so it cannot be used in conditionals.
      
      Fix it.
      
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6d3aed3d
  22. 08 6月, 2017 1 次提交