1. 21 February 2018 (10 commits)
    • rcu: Use wrapper for lockdep asserts · a32e01ee
      Committed by Matthew Wilcox
      Commits c0b334c5 and ea9b0c8a introduced new sparse warnings
      by accessing rcu_node->lock directly and ignoring the __private
      marker.  Introduce a new wrapper and use it.  Also fix a similar problem
      in srcutree.c introduced by a3883df3.
      Signed-off-by: Matthew Wilcox <mawilcox@microsoft.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
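
      For context, a wrapper of this kind can be a one-line macro that routes
      the assertion through ACCESS_PRIVATE(), so that sparse no longer warns
      about direct use of the __private ->lock field.  A minimal sketch (the
      wrapper name mirrors the raw_lockdep_assert_held_rcu_node() helper this
      commit describes; treat the exact form as illustrative):

        /* Assert that an rcu_node's ->lock is held, without upsetting sparse. */
        #define raw_lockdep_assert_held_rcu_node(p) \
                lockdep_assert_held(&ACCESS_PRIVATE(p, lock))

        /* Callers then write: */
        raw_lockdep_assert_held_rcu_node(rnp);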
    • rcu: Remove redundant nxttail index macro define · 65518db8
      Committed by Liu, Changcheng
      RCU's nxttail has been optimized to be a rcu_segcblist, which is
      a multi-tailed linked list with macros defined for the indexes for
      each tail.  The indexes have been defined in linux/rcu_segcblist.h,
      so this commit removes the redundant definitions in kernel/rcu/tree.h.
      Signed-off-by: Liu Changcheng <changcheng.liu@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
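
      For reference, the segment indexes that make the tree.h copies redundant
      live in include/linux/rcu_segcblist.h and look roughly like this (shown
      as an illustrative sketch of that header, not a verbatim quote):

        #define RCU_DONE_TAIL        0  /* Also RCU_WAIT head. */
        #define RCU_WAIT_TAIL        1  /* Also RCU_NEXT_READY head. */
        #define RCU_NEXT_READY_TAIL  2  /* Also RCU_NEXT head. */
        #define RCU_NEXT_TAIL        3
        #define RCU_CBLIST_NSEGS     4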
    • rcu: Consolidate rcu.h #ifdefs · bfbd767d
      Committed by Paul E. McKenney
      The kernel/rcu/rcu.h file has a pair of consecutive #ifdefs on
      CONFIG_TINY_RCU, so this commit consolidates them, thus saving a few
      lines of code.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
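
      The change is purely mechanical; schematically it is the following
      transformation (the declarations are placeholders, not the actual
      rcu.h contents):

        /* Before: two back-to-back blocks guarded by the same option. */
        #ifdef CONFIG_TINY_RCU
        /* ... first group of declarations ... */
        #endif /* #ifdef CONFIG_TINY_RCU */
        #ifdef CONFIG_TINY_RCU
        /* ... second group of declarations ... */
        #endif /* #ifdef CONFIG_TINY_RCU */

        /* After: one block with the same contents. */
        #ifdef CONFIG_TINY_RCU
        /* ... first group of declarations ... */
        /* ... second group of declarations ... */
        #endif /* #ifdef CONFIG_TINY_RCU */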
    • rcu: More clearly identify grace-period kthread stack dump · d07aee2c
      Committed by Paul E. McKenney
      It is not always obvious that the stack dump from a starved grace-period
      kthread isn't instead that of a CPU stalling the current grace period.
      This commit therefore adds a pr_err() flagging these dumps.
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
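
      The idea is simply to print an explicit banner immediately before the
      kthread's stack is dumped, so the dump cannot be mistaken for a
      stalled-CPU backtrace.  A minimal sketch, assuming the usual rcu_state
      pointer and an illustrative message string:

        if (rsp->gp_kthread) {
                pr_err("RCU grace-period kthread stack dump:\n");
                sched_show_task(rsp->gp_kthread);
        }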
    • rcu: Remove obsolete force-quiescent-state statistics for debugfs · d62df573
      Committed by Paul E. McKenney
      The debugfs interface displayed statistics on RCU-pending checks but
      this interface has since been removed.  This commit therefore removes the
      no-longer-used rcu_state structure's ->n_force_qs_lh and ->n_force_qs_ngp
      fields along with their updates.  (Though the ->n_force_qs_ngp field
      was actually not used at all, embarrassingly enough.)
      
      If this information proves necessary in the future, the corresponding
      event traces will be added.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Remove obsolete __rcu_pending() statistics for debugfs · 01c495f7
      Committed by Paul E. McKenney
      The debugfs interface displayed statistics on RCU-pending checks
      but this interface has since been removed.  This commit therefore
      removes the no-longer-used rcu_data structure's ->n_rcu_pending,
      ->n_rp_core_needs_qs, ->n_rp_report_qs, ->n_rp_cb_ready,
      ->n_rp_cpu_needs_gp, ->n_rp_gp_completed, ->n_rp_gp_started,
      ->n_rp_nocb_defer_wakeup, and ->n_rp_need_nothing fields along with
      their updates.
      
      If this information proves necessary in the future, the corresponding
      event traces will be added.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Remove obsolete callback-invocation statistics for debugfs · 62df63e0
      Committed by Paul E. McKenney
      The debugfs interface displayed statistics on RCU callback invocation but
      this interface has since been removed.  This commit therefore removes the
      no-longer-used rcu_data structure's ->n_cbs_invoked and ->n_nocbs_invoked
      fields along with their updates.
      
      If this information proves necessary in the future, the corresponding
      event traces will be added.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Remove obsolete boost statistics for debugfs · bec06785
      Committed by Paul E. McKenney
      The debugfs interface displayed statistics on RCU priority boosting,
      but this interface has since been removed.  This commit therefore
      removes the no-longer-used rcu_data structure's ->n_tasks_boosted,
      ->n_exp_boosts, and ->n_normal_boosts fields and their updates.
      
      If this information proves necessary in the future, the corresponding
      event traces will be added.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Call touch_nmi_watchdog() while printing stall warnings · 3caa973b
      Committed by Tejun Heo
      When an RCU stall warning triggers, it can print out a lot of messages
      while holding spinlocks.  If the console device is slow (e.g. an
      actual or IPMI serial console), it may end up triggering the NMI
      hardlockup watchdog, as in the following.
      
      *** CPU printking while holding RCU spinlock
      
        PID: 4149739  TASK: ffff881a46baa880  CPU: 13  COMMAND: "CPUThreadPool8"
         #0 [ffff881fff945e48] crash_nmi_callback at ffffffff8103f7d0
         #1 [ffff881fff945e58] nmi_handle at ffffffff81020653
         #2 [ffff881fff945eb0] default_do_nmi at ffffffff81020c36
         #3 [ffff881fff945ed0] do_nmi at ffffffff81020d32
         #4 [ffff881fff945ef0] end_repeat_nmi at ffffffff81956a7e
            [exception RIP: io_serial_in+21]
            RIP: ffffffff81630e55  RSP: ffff881fff943b88  RFLAGS: 00000002
            RAX: 000000000000ca00  RBX: ffffffff8230e188  RCX: 0000000000000000
            RDX: 00000000000002fd  RSI: 0000000000000005  RDI: ffffffff8230e188
            RBP: ffff881fff943bb0   R8: 0000000000000000   R9: ffffffff820cb3c4
            R10: 0000000000000019  R11: 0000000000002000  R12: 00000000000026e1
            R13: 0000000000000020  R14: ffffffff820cd398  R15: 0000000000000035
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0000
        --- <NMI exception stack> ---
         #5 [ffff881fff943b88] io_serial_in at ffffffff81630e55
         #6 [ffff881fff943b90] wait_for_xmitr at ffffffff8163175c
         #7 [ffff881fff943bb8] serial8250_console_putchar at ffffffff816317dc
         #8 [ffff881fff943bd8] uart_console_write at ffffffff8162ac00
         #9 [ffff881fff943c08] serial8250_console_write at ffffffff81634691
        #10 [ffff881fff943c80] univ8250_console_write at ffffffff8162f7c2
        #11 [ffff881fff943c90] console_unlock at ffffffff810dfc55
        #12 [ffff881fff943cf0] vprintk_emit at ffffffff810dffb5
        #13 [ffff881fff943d50] vprintk_default at ffffffff810e01bf
        #14 [ffff881fff943d60] vprintk_func at ffffffff810e1127
        #15 [ffff881fff943d70] printk at ffffffff8119a8a4
        #16 [ffff881fff943dd0] print_cpu_stall_info at ffffffff810eb78c
        #17 [ffff881fff943e88] rcu_check_callbacks at ffffffff810ef133
        #18 [ffff881fff943ee8] update_process_times at ffffffff810f3497
        #19 [ffff881fff943f10] tick_sched_timer at ffffffff81103037
        #20 [ffff881fff943f38] __hrtimer_run_queues at ffffffff810f3f38
        #21 [ffff881fff943f88] hrtimer_interrupt at ffffffff810f442b
      
      *** CPU triggering the hardlockup watchdog
      
        PID: 4149709  TASK: ffff88010f88c380  CPU: 26  COMMAND: "CPUThreadPool35"
         #0 [ffff883fff1059d0] machine_kexec at ffffffff8104a874
         #1 [ffff883fff105a30] __crash_kexec at ffffffff811116cc
         #2 [ffff883fff105af0] __crash_kexec at ffffffff81111795
         #3 [ffff883fff105b08] panic at ffffffff8119a6ae
         #4 [ffff883fff105b98] watchdog_overflow_callback at ffffffff81135dbd
         #5 [ffff883fff105bb0] __perf_event_overflow at ffffffff81186866
         #6 [ffff883fff105be8] perf_event_overflow at ffffffff81192bc4
         #7 [ffff883fff105bf8] intel_pmu_handle_irq at ffffffff8100b265
         #8 [ffff883fff105df8] perf_event_nmi_handler at ffffffff8100489f
         #9 [ffff883fff105e58] nmi_handle at ffffffff81020653
        #10 [ffff883fff105eb0] default_do_nmi at ffffffff81020b94
        #11 [ffff883fff105ed0] do_nmi at ffffffff81020d32
        #12 [ffff883fff105ef0] end_repeat_nmi at ffffffff81956a7e
            [exception RIP: queued_spin_lock_slowpath+248]
            RIP: ffffffff810da958  RSP: ffff883fff103e68  RFLAGS: 00000046
            RAX: 0000000000000000  RBX: 0000000000000046  RCX: 00000000006d0000
            RDX: ffff883fff49a950  RSI: 0000000000d10101  RDI: ffffffff81e54300
            RBP: ffff883fff103e80   R8: ffff883fff11a950   R9: 0000000000000000
            R10: 000000000e5873ba  R11: 000000000000010f  R12: ffffffff81e54300
            R13: 0000000000000000  R14: ffff88010f88c380  R15: ffffffff81e54300
            ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
        --- <NMI exception stack> ---
        #13 [ffff883fff103e68] queued_spin_lock_slowpath at ffffffff810da958
        #14 [ffff883fff103e70] _raw_spin_lock_irqsave at ffffffff8195550b
        #15 [ffff883fff103e88] rcu_check_callbacks at ffffffff810eed18
        #16 [ffff883fff103ee8] update_process_times at ffffffff810f3497
        #17 [ffff883fff103f10] tick_sched_timer at ffffffff81103037
        #18 [ffff883fff103f38] __hrtimer_run_queues at ffffffff810f3f38
        #19 [ffff883fff103f88] hrtimer_interrupt at ffffffff810f442b
        --- <IRQ stack> ---
      
      Avoid spuriously triggering the NMI hardlockup watchdog by touching it
      from the print functions.  show_state_filter() shares the same problem
      and solution.
      
      v2: Relocate the comment to where it belongs.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
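
      The fix itself is small: pet the NMI watchdog from the stall-print
      helpers so that a slow console cannot starve it.  A sketch of the shape
      of the change (function body abridged, signature as in kernels of that
      era; treat as illustrative):

        static void print_cpu_stall_info(struct rcu_state *rsp, int cpu)
        {
                /*
                 * A slow console can keep this path busy long enough to trip
                 * the hardlockup detector, so reset it while printing.
                 */
                touch_nmi_watchdog();

                /* ... existing pr_err() reporting for this CPU ... */
        }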
    • rcu: Fix CPU offload boot message when no CPUs are offloaded · 3016611e
      Committed by Paul E. McKenney
      In CONFIG_RCU_NOCB_CPU=y kernels, if the boot parameters indicate that
      none of the CPUs should in fact be offloaded, the following somewhat
      obtuse message appears:
      
      	Offload RCU callbacks from CPUs: .
      
      This commit therefore makes the message at least grammatically correct
      in this case:
      
      	Offload RCU callbacks from CPUs: (none)
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
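
      A sketch of the corrected message logic, assuming the offloaded CPUs are
      tracked in the existing rcu_nocb_mask cpumask (format strings are
      illustrative):

        if (cpumask_empty(rcu_nocb_mask))
                pr_info("\tOffload RCU callbacks from CPUs: (none).\n");
        else
                pr_info("\tOffload RCU callbacks from CPUs: %*pbl.\n",
                        cpumask_pr_args(rcu_nocb_mask));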
  2. 16 February 2018 (2 commits)
    • sched/isolation: Eliminate NO_HZ_FULL_ALL · a7c8655b
      Committed by Paul E. McKenney
      Commit 6f1982fe ("sched/isolation: Handle the nohz_full= parameter")
      broke CONFIG_NO_HZ_FULL_ALL=y kernels.  This breakage is due to the code
      under CONFIG_NO_HZ_FULL_ALL failing to invoke the shiny new housekeeping
      functions.  This means that rcutorture scenario TREE04 now emits RCU CPU
      stall warnings due to the RCU grace-period kthreads not being awakened
      at a time of their choosing, or perhaps even not at all:
      
      [   27.731422] rcu_bh kthread starved for 21001 jiffies! g18446744073709551369 c18446744073709551368 f0x0 RCU_GP_WAIT_FQS(3) ->state=0x402 ->cpu=3
      [   27.731423] rcu_bh          I14936     9      2 0x80080000
      [   27.731435] Call Trace:
      [   27.731440]  __schedule+0x31a/0x6d0
      [   27.731442]  schedule+0x31/0x80
      [   27.731446]  schedule_timeout+0x15a/0x320
      [   27.731453]  ? call_timer_fn+0x130/0x130
      [   27.731457]  rcu_gp_kthread+0x66c/0xea0
      [   27.731458]  ? rcu_gp_kthread+0x66c/0xea0
      
      Because no one has complained about CONFIG_NO_HZ_FULL_ALL=y being broken,
      I hypothesize that no one is in fact using it, other than rcutorture.
      This commit therefore eliminates CONFIG_NO_HZ_FULL_ALL and updates
      rcutorture's config files to instead use the nohz_full= kernel parameter
      to put the desired CPUs into nohz_full mode.
      
      Fixes: 6f1982fe ("sched/isolation: Handle the nohz_full= parameter")
      Reported-by: kernel test robot <xiaolong.ye@intel.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Wanpeng Li <kernellwp@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
    • rcu: Remove unnecessary spinlock in rcu_boot_init_percpu_data() · 398953e6
      Committed by Lihao Liang
      Since rcu_boot_init_percpu_data() is only called at boot time,
      there is no data race and the spinlock is not needed.
      Signed-off-by: Lihao Liang <lianglihao@huawei.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  3. 12 February 2018 (1 commit)
    • vfs: do bulk POLL* -> EPOLL* replacement · a9a08845
      Committed by Linus Torvalds
      This is the mindless scripted replacement of kernel use of POLL*
      variables as described by Al, done by this script:
      
          for V in IN OUT PRI ERR RDNORM RDBAND WRNORM WRBAND HUP RDHUP NVAL MSG; do
              L=`git grep -l -w POLL$V | grep -v '^t' | grep -v /um/ | grep -v '^sa' | grep -v '/poll.h$'|grep -v '^D'`
              for f in $L; do sed -i "-es/^\([^\"]*\)\(\<POLL$V\>\)/\\1E\\2/" $f; done
          done
      
      with de-mangling cleanups yet to come.
      
      NOTE! On almost all architectures, the EPOLL* constants have the same
      values as the POLL* constants do.  But the keyword here is "almost".
      For various bad reasons they aren't the same, and epoll() doesn't
      actually work quite correctly in some cases due to this on Sparc et al.
      
      The next patch from Al will sort out the final differences, and we
      should be all done.
      Scripted-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 08 February 2018 (2 commits)
  5. 07 February 2018 (16 commits)
  6. 06 February 2018 (9 commits)
    • bpf: sockmap, fix leaking maps with attached but not detached progs · 3d9e9526
      Committed by John Fastabend
      When a program is attached to a map we increment the program refcnt
      to ensure that the program is not removed while it is potentially
      being referenced from sockmap side. However, if this same program
      also references the map (this is a reasonably common pattern in
      my programs) then the verifier will also increment the maps refcnt
      from the verifier. This is to ensure the map doesn't get garbage
      collected while the program has a reference to it.
      
      So we are left in a state where the map holds the refcnt on the
      program stopping it from being removed and releasing the map refcnt.
      And vice versa the program holds a refcnt on the map stopping it
      from releasing the refcnt on the prog.
      
      All this is fine as long as users detach the program while the
      map fd is still around. But, if the user omits this detach command
      we are left with a dangling map we can no longer release.
      
      To resolve this, when the map fd is released, decrement the program
      references and remove any reference from the map to the program.
      This fixes the issue with possibly dangling map and creates a
      user side API constraint. That is, the map fd must be held open
      for programs to be attached to a map.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: sockmap, add sock close() hook to remove socks · 1aa12bdf
      Committed by John Fastabend
      The selftests test_maps program was leaving dangling BPF sockmap
      programs around because not all psock elements were removed from
      the map. The elements in turn hold a reference on the BPF program
      they are attached to causing BPF programs to stay open even after
      test_maps has completed.
      
      The original intent was that sk_state_change() would be called
      when TCP socks went through TCP_CLOSE state. However, because
      socks may be in SOCK_DEAD state or the sock may be a listening
      socket the event is not always triggered.
      
      To resolve this use the ULP infrastructure and register our own
      proto close() handler. This fixes the above case.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Reported-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • sched/fair: Use a recently used CPU as an idle candidate and the basis for SIS · 32e839dd
      Committed by Mel Gorman
      The select_idle_sibling() (SIS) rewrite in commit:
      
        10e2f1ac ("sched/core: Rewrite and improve select_idle_siblings()")
      
      ... replaced a domain iteration with a search that broadly speaking
      does a wrapped walk of the scheduler domain sharing a last-level-cache.
      
      While this had a number of improvements, one consequence is that two tasks
      that share a waker/wakee relationship push each other around a socket. Even
      though two tasks may be active, all cores are evenly used. This is great from
      a search perspective and spreads a load across individual cores, but it has
      adverse consequences for cpufreq. As each CPU has relatively low utilisation,
      cpufreq may decide the utilisation is too low to use a higher P-state and
      overall computation throughput suffers.
      
      While individual cpufreq and cpuidle drivers may compensate by artificially
      boosting P-state (at c0) or avoiding lower C-states (during idle), it does
      not help if hardware-based cpufreq (e.g. HWP) is used.
      
      This patch tracks a recently used CPU based on what CPU a task was running
      on when it last was a waker, or a CPU it was recently using when a task is a
      wakee. During SIS, the recently used CPU is used as a target if it's still
      allowed by the task and is idle.
      
      The benefit may be non-obvious so consider an example of two tasks
      communicating back and forth. Task A may be an application doing IO where
      task B is a kworker or kthread like journald. Task A may issue IO, wake
      B and B wakes up A on completion.  With the existing scheme this may look
      like the following (potentially different IDs if SMT is in use but a similar
      principle applies).
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 2)
       A (cpu 2)	wake	B (wakes on cpu 3)
       etc.
      
      A careful reader may wonder why CPU 0 was not idle when B wakes A the
      first time and it's simply due to the fact that A can be rescheduled to
      another CPU and the pattern is that prev == target when B tries to wakeup A
      and the information about CPU 0 has been lost.
      
      With this patch, the pattern is more likely to be:
      
       A (cpu 0)	wake	B (wakes on cpu 1)
       B (cpu 1)	wake	A (wakes on cpu 0)
       A (cpu 0)	wake	B (wakes on cpu 1)
       etc
      
      i.e. two communicating tasks are more likely to use just two cores instead
      of all available cores sharing a LLC.
      
      The most dramatic speedup was noticed on dbench using the XFS filesystem on
      UMA as clients interact heavily with workqueues in that configuration. Note
      that a similar speedup is not observed on ext4 as the wakeup pattern
      is different:
      
                                4.15.0-rc9             4.15.0-rc9
                                 waprev-v1        biasancestor-v1
       Hmean      1      287.54 (   0.00%)      817.01 ( 184.14%)
       Hmean      2     1268.12 (   0.00%)     1781.24 (  40.46%)
       Hmean      4     1739.68 (   0.00%)     1594.47 (  -8.35%)
       Hmean      8     2464.12 (   0.00%)     2479.56 (   0.63%)
       Hmean     64     1455.57 (   0.00%)     1434.68 (  -1.44%)
      
      The results can be less dramatic on NUMA where automatic balancing interferes
      with the test. It's also known that network benchmarks running on localhost
      also benefit quite a bit from this patch (roughly 10% on netperf RR for UDP
      and TCP depending on the machine). Hackbench also sees small improvements
      (6-11% depending on machine and thread count). The facebook schbench was also
      tested but in most cases showed little or no difference in wakeup latencies.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-5-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
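
      The core of the change in select_idle_sibling() is roughly the following
      check (field and helper names follow mainline conventions of that era,
      e.g. p->recent_used_cpu and p->cpus_allowed; treat this as a sketch, not
      the exact diff):

        /* Check a recently used CPU as a potential idle candidate. */
        recent_used_cpu = p->recent_used_cpu;
        if (recent_used_cpu != prev &&
            recent_used_cpu != target &&
            cpus_share_cache(recent_used_cpu, target) &&
            idle_cpu(recent_used_cpu) &&
            cpumask_test_cpu(recent_used_cpu, &p->cpus_allowed)) {
                /*
                 * Replace recent_used_cpu with prev as it is a potential
                 * candidate for the next wake.
                 */
                p->recent_used_cpu = prev;
                return recent_used_cpu;
        }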
    • sched/fair: Do not migrate if the prev_cpu is idle · 806486c3
      Committed by Mel Gorman
      wake_affine_idle() prefers to move a task to the current CPU if the
      wakeup is due to an interrupt. The expectation is that the interrupt
      data is cache hot and relevant to the waking task as well as avoiding
      a search. However, there is no way to determine if there was cache hot
      data on the previous CPU that may exceed the interrupt data. Furthermore,
      round-robin delivery of interrupts can migrate tasks around a socket where
      each CPU is under-utilised.  This can interact badly with cpufreq which
      makes decisions based on per-cpu data. It has been observed on machines
      with HWP that p-states are not boosted to their maximum levels even though
      the workload is latency and throughput sensitive.
      
      This patch uses the previous CPU for the task if it's idle and cache-affine
      with the current CPU even if the current CPU is idle due to the wakeup
      being related to the interrupt. This reduces migrations at the cost of
      the interrupt data not being cache hot when the task wakes.
      
      A variety of workloads were tested on various machines and no adverse
      impact was noticed that was outside noise. dbench on ext4 on UMA showed
      roughly 10% reduction in the number of CPU migrations and it is a case
      where interrupts are frequent for IO completions. In most cases, the
      difference in performance is quite small but variability is often
      reduced. For example, this is the result for pgbench running on a UMA
      machine with different numbers of clients.
      
                                4.15.0-rc9             4.15.0-rc9
                                  baseline              waprev-v1
       Hmean     1     22096.28 (   0.00%)    22734.86 (   2.89%)
       Hmean     4     74633.42 (   0.00%)    75496.77 (   1.16%)
       Hmean     7    115017.50 (   0.00%)   113030.81 (  -1.73%)
       Hmean     12   126209.63 (   0.00%)   126613.40 (   0.32%)
       Hmean     16   131886.91 (   0.00%)   130844.35 (  -0.79%)
       Stddev    1       636.38 (   0.00%)      417.11 (  34.46%)
       Stddev    4       614.64 (   0.00%)      583.24 (   5.11%)
       Stddev    7       542.46 (   0.00%)      435.45 (  19.73%)
       Stddev    12      173.93 (   0.00%)      171.50 (   1.40%)
       Stddev    16      671.42 (   0.00%)      680.30 (  -1.32%)
       CoeffVar  1         2.88 (   0.00%)        1.83 (  36.26%)
      
      Note that the difference in performance is marginal but for low utilisation,
      there is less variability.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-4-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
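
      Taken together with the wake_affine*()-returns-a-CPU-id restructuring in
      the next entry, the decision ends up reading roughly as follows (a
      sketch, not the exact mainline code):

        static int wake_affine_idle(int this_cpu, int prev_cpu, int sync)
        {
                /*
                 * If this_cpu is idle, the wakeup is likely from interrupt
                 * context; only pull the task when the two CPUs share cache,
                 * and prefer prev_cpu when it is also idle.
                 */
                if (idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
                        return idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

                if (sync && cpu_rq(this_cpu)->nr_running == 1)
                        return this_cpu;

                return nr_cpumask_bits;  /* No affine preference. */
        }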
    • sched/fair: Restructure wake_affine*() to return a CPU id · 3b76c4a3
      Committed by Mel Gorman
      This is a preparation patch that has wake_affine*() return a CPU ID instead of
      a boolean. The intent is to allow the wake_affine() helpers to be avoided
      if a decision is already made. This patch has no functional change.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/fair: Remove unnecessary parameters from wake_affine_idle() · 89a55f56
      Committed by Mel Gorman
      wake_affine_idle() takes parameters it never uses, so clean it up.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180130104555.4125-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/rt: Make update_curr_rt() more accurate · e7ad2031
      Committed by Wen Yang
      rq->clock_task may be updated between the two calls of
      rq_clock_task() in update_curr_rt(). Calling rq_clock_task() only
      once makes it more accurate and efficient, taking update_curr() as
      reference.
      Signed-off-by: Wen Yang <wen.yang99@zte.com.cn>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: zhong.weidong@zte.com.cn
      Link: http://lkml.kernel.org/r/1517800721-42092-1-git-send-email-wen.yang99@zte.com.cn
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
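
      A sketch of the reworked accounting: read the task clock once and reuse
      it both for the delta computation and for the new exec_start (statistics
      and bandwidth updates abridged; treat as illustrative of the shape of
      the change):

        static void update_curr_rt(struct rq *rq)
        {
                struct task_struct *curr = rq->curr;
                u64 now = rq_clock_task(rq);
                u64 delta_exec;

                if (curr->sched_class != &rt_sched_class)
                        return;

                delta_exec = now - curr->se.exec_start;
                if (unlikely((s64)delta_exec <= 0))
                        return;

                /* ... runtime statistics ... */

                curr->se.exec_start = now;

                /* ... RT bandwidth accounting ... */
        }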
    • sched/rt: Up the root domain ref count when passing it around via IPIs · 364f5665
      Committed by Steven Rostedt (VMware)
      When issuing an IPI RT push, where an IPI is sent to each CPU that has more
      than one RT task scheduled on it, it references the root domain's rto_mask,
      which contains all the CPUs within the root domain that have more than one RT
      task in the runnable state. The problem is, after the IPIs are initiated, the
      rq->lock is released. This means that the root domain that is associated to
      the run queue could be freed while the IPIs are going around.
      
      Add a sched_get_rd() and a sched_put_rd() that will increment and decrement
      the root domain's ref count respectively. This way when initiating the IPIs,
      the scheduler will up the root domain's ref count before releasing the
      rq->lock, ensuring that the root domain does not go away until the IPI round
      is complete.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
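
      The two helpers are thin wrappers around the root domain's existing
      reference count; roughly (assuming free_rootdomain() remains the
      RCU-deferred destructor, as in mainline of that era):

        void sched_get_rd(struct root_domain *rd)
        {
                atomic_inc(&rd->refcount);
        }

        void sched_put_rd(struct root_domain *rd)
        {
                if (!atomic_dec_and_test(&rd->refcount))
                        return;

                call_rcu_sched(&rd->rcu, free_rootdomain);
        }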
    • sched/rt: Use container_of() to get root domain in rto_push_irq_work_func() · ad0f1d9d
      Committed by Steven Rostedt (VMware)
      When the rto_push_irq_work_func() is called, it looks at the RT overloaded
      bitmask in the root domain via the runqueue (rq->rd). The problem is that
      during CPU up and down, nothing here stops rq->rd from changing between
      taking the rq->rd->rto_lock and releasing it. That means the lock that is
      released is not the same lock that was taken.
      
      Instead of using this_rq()->rd to get the root domain, as the irq work is
      part of the root domain, we can simply get the root domain from the irq work
      that is passed to the routine:
      
       container_of(work, struct root_domain, rto_push_work)
      
      This keeps the root domain consistent.
      Reported-by: Pavan Kondeti <pkondeti@codeaurora.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 4bdced5c ("sched/rt: Simplify the IPI based RT balancing logic")
      Link: http://lkml.kernel.org/r/CAEU1=PkiHO35Dzna8EQqNSKW1fr1y1zRQ5y66X117MG06sQtNA@mail.gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
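
      In the irq-work handler this amounts to deriving the root domain from
      the work item itself rather than dereferencing this_rq()->rd; a sketch:

        void rto_push_irq_work_func(struct irq_work *work)
        {
                struct root_domain *rd =
                        container_of(work, struct root_domain, rto_push_work);

                /* ... take rd->rto_lock and walk rd->rto_mask as before ... */
        }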