1. 06 March 2019, 1 commit
    • sched/wake_q: Fix wakeup ordering for wake_q · 653a1dbc
      Peter Zijlstra authored
      [ Upstream commit 4c4e3731564c8945ac5ac90fc2a1e1f21cb79c92 ]
      
      Notably, cmpxchg() does not provide ordering when it fails; however,
      wake_q_add() requires ordering in this specific case too. Without it,
      a concurrent wakeup could fail to observe our prior state.
      
      Andrea Parri provided:
      
        C wake_up_q-wake_q_add
      
        {
      	int next = 0;
      	int y = 0;
        }
      
        P0(int *next, int *y)
        {
      	int r0;
      
      	/* in wake_up_q() */
      
      	WRITE_ONCE(*next, 1);   /* node->next = NULL */
      	smp_mb();               /* implied by wake_up_process() */
      	r0 = READ_ONCE(*y);
        }
      
        P1(int *next, int *y)
        {
      	int r1;
      
      	/* in wake_q_add() */
      
      	WRITE_ONCE(*y, 1);      /* wake_cond = true */
      	smp_mb__before_atomic();
      	r1 = cmpxchg_relaxed(next, 1, 2);
        }
      
        exists (0:r0=0 /\ 1:r1=0)
      
        This "exists" clause cannot be satisfied according to the LKMM:
      
        Test wake_up_q-wake_q_add Allowed
        States 3
        0:r0=0; 1:r1=1;
        0:r0=1; 1:r1=0;
        0:r0=1; 1:r1=1;
        No
        Witnesses
        Positive: 0 Negative: 3
        Condition exists (0:r0=0 /\ 1:r1=0)
        Observation wake_up_q-wake_q_add Never 0 3
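      
      The fix itself is small; below is a minimal sketch of the resulting
      wake_q_add() (abbreviated and illustrative rather than the exact upstream
      diff): an explicit smp_mb__before_atomic() is paired with a relaxed
      cmpxchg() so the required ordering holds even when the cmpxchg() fails.
      
      	static void wake_q_add(struct wake_q_head *head, struct task_struct *task)
      	{
      		struct wake_q_node *node = &task->wake_q;
      
      		/*
      		 * Order our prior state (e.g. wake_cond = true) against the
      		 * cmpxchg() below; a failed cmpxchg() alone provides no
      		 * ordering, so issue the barrier explicitly.
      		 */
      		smp_mb__before_atomic();
      		if (cmpxchg_relaxed(&node->next, NULL, WAKE_Q_TAIL))
      			return; /* already queued; the concurrent waker sees our state */
      
      		get_task_struct(task);
      
      		/* append the node at the tail of the wake list */
      		*head->lastp = node;
      		head->lastp = &node->next;
      	}
      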
      Reported-by: Yongji Xie <elohimes@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 20 December 2018, 1 commit
  3. 06 December 2018, 1 commit
    • sched/smt: Make sched_smt_present track topology · a2c09481
      Peter Zijlstra (Intel) authored
      commit c5511d03ec090980732e929c318a7a6374b5550e upstream
      
      Currently the 'sched_smt_present' static key is enabled when SMT topology
      is observed at CPU bringup, but it is never disabled. However, there is
      demand to also disable the key when the topology changes such that there
      is no SMT present anymore.
      
      Implement this by making the key count the number of cores that have SMT
      enabled.
      
      In particular, the SMT topology bits are set before interrupts are enabled
      and, similarly, are cleared after interrupts are disabled for the last time
      and the CPU dies.
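      
      A hedged sketch of what such counting can look like in the CPU hotplug
      callbacks (the placement in sched_cpu_activate()/sched_cpu_deactivate()
      and the exact sibling-count condition are assumptions drawn from the
      description above, not a verbatim copy of the change):
      
      	#ifdef CONFIG_SCHED_SMT
      	/* in sched_cpu_activate(): this CPU brings its core up to two online siblings */
      	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
      		static_branch_inc_cpuslocked(&sched_smt_present);
      	#endif
      
      	#ifdef CONFIG_SCHED_SMT
      	/* in sched_cpu_deactivate(): the core drops back to a single online sibling */
      	if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
      		static_branch_dec_cpuslocked(&sched_smt_present);
      	#endif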
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 27 November 2018, 1 commit
    • sched/core: Take the hotplug lock in sched_init_smp() · b1e814e4
      Valentin Schneider authored
      [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]
      
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:
      
        commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      Calls to build_sched_domains() after sched_init_smp() will hold the
      hotplug lock; it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations", so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
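      
      The shape of the fix is roughly the following (a sketch of the relevant
      part of sched_init_smp(), not the verbatim upstream diff):
      
      	/* in sched_init_smp() */
      	cpus_read_lock();
      	mutex_lock(&sched_domains_mutex);
      	sched_init_domains(cpu_active_mask);
      	mutex_unlock(&sched_domains_mutex);
      	cpus_read_unlock();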
      Reported-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 02 October 2018, 1 commit
    • sched/numa: Pass destination CPU as a parameter to migrate_task_rq · 1327237a
      Srikar Dronamraju authored
      This additional parameter (new_cpu) is used later to identify whether the
      task migration is across nodes.
      
      No functional change.
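      
      For reference, the interface change amounts to threading the destination
      CPU through the sched_class callback, roughly as sketched below
      (abbreviated):
      
      	/* the sched_class callback gains the destination CPU ... */
      	void (*migrate_task_rq)(struct task_struct *p, int new_cpu);
      
      	/* ... and set_task_cpu() passes it along */
      	if (p->sched_class->migrate_task_rq)
      		p->sched_class->migrate_task_rq(p, new_cpu);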
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203353  200668   -1.32036
      1     328205  321791   -1.95427
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     214384  204848   -4.44809
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188553  188098   -0.241311
      1     196273  200351   2.07772
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     57581.2  58145.9  0.980702
      1     103468   103798   0.318939
      
      Brings out the variance between different specjbb2005 runs.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,941,377      13,912,183
      migrations                1,157,323       1,155,931
      faults                    382,175         367,139
      cache-misses              54,993,823,500  54,240,196,814
      sched:sched_move_numa     2,005           1,571
      sched:sched_stick_numa    14              9
      sched:sched_swap_numa     529             463
      migrate:mm_migrate_pages  1,573           703
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        67099   50155
      numa_hint_faults_local  58456   45264
      numa_hit                240416  239652
      numa_huge_pte_updates   18      36
      numa_interleave         65      68
      numa_local              240339  239576
      numa_other              77      76
      numa_pages_migrated     1574    680
      numa_pte_updates        77182   71146
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,176,453       3,156,720
      migrations                30,238          30,354
      faults                    87,869          97,261
      cache-misses              12,544,479,391  12,400,026,826
      sched:sched_move_numa     23              4
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     6               1
      migrate:mm_migrate_pages  10              20
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        236     272
      numa_hint_faults_local  201     186
      numa_hit                72293   71362
      numa_huge_pte_updates   0       0
      numa_interleave         26      23
      numa_local              72233   71299
      numa_other              60      63
      numa_pages_migrated     8       2
      numa_pte_updates        0       0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,478,820    8,606,824
      migrations                171,323      155,352
      faults                    307,499      301,409
      cache-misses              240,353,599  157,759,224
      sched:sched_move_numa     214          168
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     4            3
      migrate:mm_migrate_pages  89           125
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        5301    4650
      numa_hint_faults_local  4745    3946
      numa_hit                92943   90489
      numa_huge_pte_updates   0       0
      numa_interleave         899     892
      numa_local              92345   90034
      numa_other              598     455
      numa_pages_migrated     88      124
      numa_pte_updates        5505    4818
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,066,172   2,113,167
      migrations                11,076      10,533
      faults                    149,544     142,727
      cache-misses              10,398,067  5,594,192
      sched:sched_move_numa     43          10
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           0
      migrate:mm_migrate_pages  6           6
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        3552    744
      numa_hint_faults_local  3347    584
      numa_hit                25611   25551
      numa_huge_pte_updates   0       0
      numa_interleave         213     263
      numa_local              25583   25302
      numa_other              28      249
      numa_pages_migrated     6       6
      numa_pte_updates        3535    744
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        99,358,136       101,227,352
      migrations                4,041,607        4,151,829
      faults                    749,653          745,233
      cache-misses              225,562,543,251  224,669,561,766
      sched:sched_move_numa     771              617
      sched:sched_stick_numa    14               2
      sched:sched_swap_numa     204              187
      migrate:mm_migrate_pages  1,180            316
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        27409   24195
      numa_hint_faults_local  20677   21639
      numa_hit                239988  238331
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              239983  238331
      numa_other              5       0
      numa_pages_migrated     1016    204
      numa_pte_updates        27916   24561
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        60,899,307      62,738,978
      migrations                544,668         562,702
      faults                    270,834         228,465
      cache-misses              74,543,455,635  75,778,067,952
      sched:sched_move_numa     735             648
      sched:sched_stick_numa    25              13
      sched:sched_swap_numa     174             137
      migrate:mm_migrate_pages  816             733
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        11059   10281
      numa_hint_faults_local  4733    3242
      numa_hit                41384   36338
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              41383   36338
      numa_other              1       0
      numa_pages_migrated     815     706
      numa_pte_updates        11323   10176
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-3-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 04 August 2018, 1 commit
    • signal: Add calculate_sigpending() · 088fe47c
      Eric W. Biederman authored
      Add a function, calculate_sigpending(), to test whether any signals are
      pending for a new task immediately following fork.  Signals have to
      happen either before or after fork.  Today our practice is to push
      all of the signals to before the fork, but that has the downside that
      frequent or periodic signals can make fork take much longer than
      normal or prevent it from completing entirely.
      
      So we need to move the signals that we can to after the fork to prevent that.
      
      This updates the code to set TIF_SIGPENDING on a new task if there
      are signals or other activities that have moved so that they appear
      to happen after the fork.
      
      As the code today restarts if it sees any such activity this won't
      immediately have an effect, as there will be no reason for it
      to set TIF_SIGPENDING immediately after the fork.
      
      Adding calculate_sigpending means the code in fork can safely be
      changed to not always restart if a signal is pending.
      
      The new calculate_sigpending function sets sigpending if there
      are pending bits in jobctl, pending signals, the freezer needs
      to freeze the new task or the live kernel patching framework
      need the new thread to take the slow path to userspace.
      
      I have verified that setting TIF_SIGPENDING does make a new process
      take the slow path to userspace before it executes its first userspace
      instruction.
      
      I have looked at the callers of signal_wake_up and the code paths
      setting TIF_SIGPENDING and I don't see anything else that needs to be
      handled.  The code probably doesn't need to set TIF_SIGPENDING for
      kernel live patching, as it uses a separate thread flag as well.  But
      at this point it seems safer to reuse the recalc_sigpending logic and get
      the kernel live patching folks to sort out their story later.
      
      V2: I have moved the test into schedule_tail where siglock can
          be grabbed and recalc_sigpending can be reused directly.
          Further as the last action of setting up a new task this
          guarantees that TIF_SIGPENDING will be properly set in the
          new process.
      
          The helper calculate_sigpending takes the siglock and
          unconditionally sets TIF_SIGPENDING and lets recalc_sigpending
          clear TIF_SIGPENDING if it is unnecessary.  This allows reusing
          the existing code and keeps maintenance of the conditions simple.
      
          Oleg Nesterov <oleg@redhat.com>  suggested the movement
          and pointed out the need to take siglock if this code
          was going to be called while the new task is discoverable.
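      
      For reference, the helper ends up roughly like this (a sketch following
      the description above; the exact body may differ), and it is called from
      schedule_tail() as the last step of setting up the new task:
      
      	void calculate_sigpending(void)
      	{
      		/*
      		 * Have any signals or other users of TIF_SIGPENDING been
      		 * delayed until after fork?  Set the flag unconditionally
      		 * and let recalc_sigpending() clear it if nothing is pending.
      		 */
      		spin_lock_irq(&current->sighand->siglock);
      		set_tsk_thread_flag(current, TIF_SIGPENDING);
      		recalc_sigpending();
      		spin_unlock_irq(&current->sighand->siglock);
      	}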
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      088fe47c
  7. 31 7月, 2018 1 次提交
    • J
      tracing: Centralize preemptirq tracepoints and unify their usage · c3bc8fd6
      Joel Fernandes (Google) 提交于
      This patch detaches the preemptirq tracepoints from the tracers and
      keeps them separate.
      
      Advantages:
      * Lockdep and the irqsoff tracer can now run in parallel since they no
      longer have their own calls.
      
      * This unifies the use case of adding hooks to the irqs-off/irqs-on and
      preempt-off/preempt-on events.
        Three users of the events exist:
        - Lockdep
        - irqsoff and preemptoff tracers
        - irqs and preempt trace events
      
      The unification cleans up several ifdefs and makes the code in the preempt
      and irqsoff tracers simpler. It gets rid of all the horrific
      ifdeferry around PROVE_LOCKING and makes configuration of the different
      users of the tracepoints easier and more understandable. It also gets rid
      of the time_* function calls from the lockdep hooks used to call into
      the preemptirq tracer, which are not needed anymore. The negative delta in
      lines of code in this patch is quite large too.
      
      In the patch we introduce a new CONFIG option PREEMPTIRQ_TRACEPOINTS
      as a single point for registering probes onto the tracepoints. With
      this,
      the web of config options for preempt/irq toggle tracepoints and its
      users becomes:
      
       PREEMPT_TRACER   PREEMPTIRQ_EVENTS  IRQSOFF_TRACER PROVE_LOCKING
             |                 |     \         |           |
             \    (selects)    /      \        \ (selects) /
            TRACE_PREEMPT_TOGGLE       ----> TRACE_IRQFLAGS
                            \                  /
                             \ (depends on)   /
                           PREEMPTIRQ_TRACEPOINTS
      
      Other than the performance tests mentioned in the previous patch, I also
      ran the locking API test suite and verified that all test cases pass.
      
      I also injected issues by not registering the lockdep probes onto the
      tracepoints, and the resulting failures confirm that the probes are
      indeed doing the work.
      
      This series + lockdep probes not registered (just to inject errors):
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:FAILED|FAILED|  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:FAILED|FAILED|  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      With this series + lockdep probes registered, all locking tests pass:
      
      [    0.000000]      hard-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]      soft-irqs-on + irq-safe-A/21:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]        sirq-safe-A => hirqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/12:  ok  |  ok  |  ok  |
      [    0.000000]          hard-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]          soft-safe-A + irqs-on/21:  ok  |  ok  |  ok  |
      [    0.000000]     hard-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      [    0.000000]     soft-safe-A + unsafe-B #1/123:  ok  |  ok  |  ok  |
      
      Link: http://lkml.kernel.org/r/20180730222423.196630-4-joel@joelfernandes.org
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Namhyung Kim <namhyung@kernel.org>
      Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  8. 25 July 2018, 3 commits
  9. 20 July 2018, 1 commit
  10. 17 July 2018, 2 commits
  11. 16 July 2018, 4 commits
  12. 03 July 2018, 2 commits
    • kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      Peter Zijlstra authored
      Gaurav reports that commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      isn't working for him, because of the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
      handling is buggered: the TASK_DEAD thing is done with preemption
      disabled, whereas the current code can still complete early on
      preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit. Promote TASK_PARKED to special
      state to avoid the store-store issue on task->state leading to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      But in addition, add wait_task_inactive() to kthread_park() to ensure
      the task really is PARKED when we return from kthread_park(). This
      avoids the whole 'kthread still gets migrated' nonsense -- although it
      would be really good to get this done differently.
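      
      The relevant tail of kthread_park() then looks roughly like this
      (sketch, abbreviated):
      
      	set_bit(KTHREAD_SHOULD_PARK, &kthread->flags);
      	if (k != current) {
      		wake_up_process(k);
      		/*
      		 * Wait for __kthread_parkme() to complete(); the kthread is
      		 * then in TASK_PARKED and about to call schedule().
      		 */
      		wait_for_completion(&kthread->parked);
      		/*
      		 * Now wait for that schedule() to finish, so the kthread has
      		 * really stopped running before kthread_park() returns.
      		 */
      		WARN_ON_ONCE(!wait_task_inactive(k, TASK_PARKED));
      	}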
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/nohz: Skip remote tick on idle task entirely · d9c0ffca
      Frederic Weisbecker authored
      Some people have reported that the warning in sched_tick_remote()
      occasionally triggers, especially under RCU-torture pressure:
      
      	WARNING: CPU: 11 PID: 906 at kernel/sched/core.c:3138 sched_tick_remote+0xb6/0xc0
      	Modules linked in:
      	CPU: 11 PID: 906 Comm: kworker/u32:3 Not tainted 4.18.0-rc2+ #1
      	Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
      	Workqueue: events_unbound sched_tick_remote
      	RIP: 0010:sched_tick_remote+0xb6/0xc0
      	Code: e8 0f 06 b8 00 c6 03 00 fb eb 9d 8b 43 04 85 c0 75 8d 48 8b 83 e0 0a 00 00 48 85 c0 75 81 eb 88 48 89 df e8 bc fe ff ff eb aa <0f> 0b eb c5 66 0f 1f 44 00 00 bf 17 00 00 00 e8 b6 2e fe ff 0f b6
      	Call Trace:
      	 process_one_work+0x1df/0x3b0
      	 worker_thread+0x44/0x3d0
      	 kthread+0xf3/0x130
      	 ? set_worker_desc+0xb0/0xb0
      	 ? kthread_create_worker_on_cpu+0x70/0x70
      	 ret_from_fork+0x35/0x40
      
      This happens when the remote tick applies to an idle task. Usually the
      idle_cpu() check avoids that, but it is performed before we lock the
      runqueue and is therefore racy. It was intended to be that way in
      order to avoid useless runqueue locking, since the idle task's tick
      callback is a no-op.
      
      Now if the racy check slips out of our hands and we end up remotely
      ticking an idle task, the empty task_tick_idle() is harmless. Still
      it won't pass the WARN_ON_ONCE() test that ensures rq_clock_task() is
      not too far from curr->se.exec_start because update_curr_idle() doesn't
      update the exec_start value like other scheduler policies. Hence the
      reported false positive.
      
      So let's add another check, while the rq is locked, to make sure we
      don't remote-tick an idle task. The lockless idle_cpu() check still
      applies, to avoid unnecessary rq lock contention.
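      
      The added check sits under the runqueue lock, roughly as follows (a
      sketch of the relevant part of sched_tick_remote()):
      
      	rq_lock_irq(rq, &rf);
      	curr = rq->curr;
      
      	/*
      	 * The lockless idle_cpu() check above is racy; re-check under the
      	 * lock and skip idle tasks, whose exec_start is never updated.
      	 */
      	if (is_idle_task(curr))
      		goto out_unlock;
      
      	update_rq_clock(rq);
      	delta = rq_clock_task(rq) - curr->se.exec_start;
      	WARN_ON_ONCE(delta > (u64)NSEC_PER_SEC * 3);
      	curr->sched_class->task_tick(rq, curr, 0);
      
      out_unlock:
      	rq_unlock_irq(rq, &rf);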
      Reported-by: Jacek Tomaka <jacekt@dug.com>
      Reported-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
      Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1530203381-31234-1-git-send-email-frederic@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  13. 21 June 2018, 1 commit
  14. 15 June 2018, 1 commit
    • sched/core / kcov: avoid kcov_area during task switch · 0ed557aa
      Mark Rutland authored
      During a context switch, we first switch_mm() to the next task's mm,
      then switch_to() that new task.  This means that vmalloc'd regions which
      had previously been faulted in can transiently disappear in the context
      of the prev task.
      
      Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
      during this window, and as the fault handling code is instrumented, this
      results in a recursive fault.
      
      We must avoid accessing any kcov_area during this window.  We can do so
      with a new flag in kcov_mode, set prior to switching the mm, and cleared
      once the new task is live.  Since task_struct::kcov_mode isn't always a
      specific enum kcov_mode value, this is made an unsigned int.
      
      The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
      which are empty for !CONFIG_KCOV kernels.
      
      The code uses macros because I can't use static inline functions without a
      circular include dependency between <linux/sched.h> and <linux/kcov.h>,
      since the definition of task_struct uses things defined in <linux/kcov.h>.
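      
      The helpers end up roughly as below (a sketch; the flag name and value
      are illustrative assumptions based on the description above):
      
      	#ifdef CONFIG_KCOV
      	#define KCOV_IN_CTXSW	(1 << 30)	/* outside the enum kcov_mode values */
      
      	#define kcov_prepare_switch(t)			\
      	do {						\
      		(t)->kcov_mode |= KCOV_IN_CTXSW;	\
      	} while (0)
      
      	#define kcov_finish_switch(t)			\
      	do {						\
      		(t)->kcov_mode &= ~KCOV_IN_CTXSW;	\
      	} while (0)
      	#else
      	#define kcov_prepare_switch(t)	do { } while (0)
      	#define kcov_finish_switch(t)	do { } while (0)
      	#endif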
      
      Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.com
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  15. 06 June 2018, 1 commit
    • rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers authored
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartable sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reads of the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up to date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
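      
      A minimal user-space illustration of that fast path (hedged: it assumes
      kernel headers that provide __NR_rseq and <linux/rseq.h>, and a libc that
      has not already registered rseq for the thread; the signature value is
      arbitrary and chosen by the caller):
      
      	#define _GNU_SOURCE
      	#include <linux/rseq.h>
      	#include <sys/syscall.h>
      	#include <unistd.h>
      	#include <stdio.h>
      
      	#define MY_RSEQ_SIG	0x53053053	/* caller-chosen signature */
      
      	static __thread struct rseq rseq_area __attribute__((aligned(32)));
      
      	int main(void)
      	{
      		/* register the per-thread rseq area (flags = 0) */
      		if (syscall(__NR_rseq, &rseq_area, sizeof(rseq_area), 0, MY_RSEQ_SIG))
      			return 1;
      
      		/* the kernel keeps cpu_id current across preemption/migration */
      		printf("running on CPU %u\n", rseq_area.cpu_id);
      		return 0;
      	}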
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over the current mechanisms available to read
      the CPU number, and has the following benefits over alternative
      approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdf
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
  16. 31 May 2018, 2 commits
    • sched/core: Require cpu_active() in select_task_rq(), for user tasks · 7af443ee
      Paul Burton authored
      select_task_rq() is used in a few paths to select the CPU upon which a
      thread should be run - for example it is used by try_to_wake_up() & by
      fork or exec balancing. As-is it allows use of any online CPU that is
      present in the task's cpus_allowed mask.
      
      This presents a problem because there is a period whilst CPUs are
      brought online where a CPU is marked online, but is not yet fully
      initialized - ie. the period where CPUHP_AP_ONLINE_IDLE <= state <
      CPUHP_ONLINE. Usually we don't run any user tasks during this window,
      but there are corner cases where this can happen. An example observed
      is:
      
        - Some user task A, running on CPU X, forks to create task B.
      
        - sched_fork() calls __set_task_cpu() with cpu=X, setting task B's
          task_struct::cpu field to X.
      
        - CPU X is offlined.
      
        - Task A, currently somewhere between the __set_task_cpu() in
          copy_process() and the call to wake_up_new_task(), is migrated to
          CPU Y by migrate_tasks() when CPU X is offlined.
      
        - CPU X is onlined, but still in the CPUHP_AP_ONLINE_IDLE state. The
          scheduler is now active on CPU X, but there are no user tasks on
          the runqueue.
      
        - Task A runs on CPU Y & reaches wake_up_new_task(). This calls
          select_task_rq() with cpu=X, taken from task B's task_struct,
          and select_task_rq() allows CPU X to be returned.
      
        - Task A enqueues task B on CPU X's runqueue, via activate_task() &
          enqueue_task().
      
        - CPU X now has a user task on its runqueue before it has reached the
          CPUHP_ONLINE state.
      
      In most cases, the user tasks that schedule on the newly onlined CPU
      have no idea that anything went wrong, but one case observed to be
      problematic is if the task goes on to invoke the sched_setaffinity
      syscall. The newly onlined CPU reaches the CPUHP_AP_ONLINE_IDLE state
      before the CPU that brought it online calls stop_machine_unpark(). This
      means that for a portion of the window of time between
      CPUHP_AP_ONLINE_IDLE & CPUHP_ONLINE the newly onlined CPU's struct
      cpu_stopper has its enabled field set to false. If a user thread is
      executed on the CPU during this window and it invokes sched_setaffinity
      with a CPU mask that does not include the CPU it's running on, then when
      __set_cpus_allowed_ptr() calls stop_one_cpu() intending to invoke
      migration_cpu_stop() and perform the actual migration away from the CPU
      it will simply return -ENOENT rather than calling migration_cpu_stop().
      We then return from the sched_setaffinity syscall back to the user task
      that is now running on a CPU which it just asked not to run on, and
      which is not present in its cpus_allowed mask.
      
      This patch resolves the problem by having select_task_rq() enforce that
      user tasks run on CPUs that are active - the same requirement that
      select_fallback_rq() already enforces. This should ensure that newly
      onlined CPUs reach the CPUHP_AP_ACTIVE state before being able to
      schedule user tasks, and also implies that bringup_wait_for_ap() will
      have called stop_machine_unpark() which resolves the sched_setaffinity
      issue above.
      
      I haven't yet investigated them, but it may be of interest to review
      whether any of the actions performed by hotplug states between
      CPUHP_AP_ONLINE_IDLE & CPUHP_AP_ACTIVE could have similar unintended
      effects on user tasks that might schedule before they are reached, which
      might widen the scope of the problem from just affecting the behaviour
      of sched_setaffinity.
      Signed-off-by: Paul Burton <paul.burton@mips.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20180526154648.11635-2-paul.burton@mips.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Fix rules for running on online && !active CPUs · 175f0e25
      Peter Zijlstra authored
      As already enforced by the WARN() in __set_cpus_allowed_ptr(), the rules
      for running on an online && !active CPU are stricter than just being a
      kthread: you need to be a per-CPU kthread.
      
      If you're not strictly per-CPU, you have better CPUs to run on and
      don't need the partially booted one to get your work done.
      
      The exception is to allow smpboot threads to bootstrap the CPU itself
      and get kernel 'services' initialized before we allow userspace on it.
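      
      Taken together with the previous commit, the resulting rule can be
      sketched as a helper along these lines (illustrative only; the helper
      name is made up and the field names are assumptions):
      
      	/* sketch: may this task run on this CPU right now? */
      	static bool task_cpu_usable(struct task_struct *p, int cpu)
      	{
      		if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
      			return false;
      
      		/* strictly per-CPU kthreads may use online && !active CPUs */
      		if ((p->flags & PF_KTHREAD) && p->nr_cpus_allowed == 1)
      			return cpu_online(cpu);
      
      		/* everything else, including user tasks, needs an active CPU */
      		return cpu_active(cpu);
      	}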
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 955dbdf4 ("sched: Allow migrating kthreads into online but inactive CPUs")
      Link: http://lkml.kernel.org/r/20170725165821.cejhb7v2s3kecems@hirez.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 16 May 2018, 1 commit
  18. 14 May 2018, 2 commits
    • sched/core: Distinguish between idle_cpu() calls based on desired effect, introduce available_idle_cpu() · 943d355d
      Rohit Jain authored
      
      In the following commit:
      
        247f2f6f ("sched/core: Don't schedule threads on pre-empted vCPUs")
      
      ... we made idle_cpu() treat a CPU whose vCPU is not currently running as
      busy for the purpose of scheduling threads.
      
      However, the idle_cpu() function is used in other places for
      actually checking whether the state of the CPU is idle or not.
      
      Hence split the use of that function based on the desired return value,
      by introducing the available_idle_cpu() function.
      
      This fixes a (slight) regression in that initial vCPU commit: some
      code paths (like the load balancer) don't care and shouldn't care
      whether the vCPU is preempted or not, they just want to know whether
      there are any tasks on the CPU.
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: linux-kernel@vger.kernel.org
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525883988-10356-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/numa: Stagger NUMA balancing scan periods for new threads · 13784475
      Mel Gorman authored
      Threads share an address space and each can change the protections of the
      same address space to trap NUMA faults. This is redundant and potentially
      counter-productive as any thread doing the update will suffice. Potentially
      only one thread is required but that thread may be idle or it may not have
      any locality concerns and pick an unsuitable scan rate.
      
      This patch uses independent scan periods, but they are staggered based on
      the number of address-space users when the thread is created.  The intent
      is that threads will avoid scanning at the same time and have a chance
      to adapt their scan rate later if necessary. This reduces the total scan
      activity early in the lifetime of the threads.
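      
      Conceptually, the staggering looks like the sketch below (names and
      scaling are chosen for illustration only, not the exact upstream code):
      the new thread's first scan is pushed out in proportion to the number of
      existing users of the mm.
      
      	/* at thread creation, only when the address space is shared */
      	if (clone_flags & CLONE_VM) {
      		unsigned int users = atomic_read(&mm->mm_users);
      
      		/* stagger the first scan: later threads start scanning later */
      		p->node_stamp = (u64)p->numa_scan_period * users * NSEC_PER_MSEC;
      	}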
      
      The difference in headline performance across a range of machines and
      workloads is marginal, but system CPU usage is reduced, as is overall
      scan activity.  The following is the time reported by the NAS Parallel
      Benchmark using unbound OpenMP threads and a D size class:
      
      			      4.17.0-rc1             4.17.0-rc1
      				 vanilla           stagger-v1r1
      	Time bt.D      442.77 (   0.00%)      419.70 (   5.21%)
      	Time cg.D      171.90 (   0.00%)      180.85 (  -5.21%)
      	Time ep.D       33.10 (   0.00%)       32.90 (   0.60%)
      	Time is.D        9.59 (   0.00%)        9.42 (   1.77%)
      	Time lu.D      306.75 (   0.00%)      304.65 (   0.68%)
      	Time mg.D       54.56 (   0.00%)       52.38 (   4.00%)
      	Time sp.D     1020.03 (   0.00%)      903.77 (  11.40%)
      	Time ua.D      400.58 (   0.00%)      386.49 (   3.52%)
      
      Note that it's not a universal win: we have no prior knowledge of which
      thread matters, and the number of threads created often exceeds the size
      of the node when the threads are not bound. However, there is a reduction
      in overall system CPU usage:
      
      				    4.17.0-rc1             4.17.0-rc1
      				       vanilla           stagger-v1r1
      	sys-time-bt.D         48.78 (   0.00%)       48.22 (   1.15%)
      	sys-time-cg.D         25.31 (   0.00%)       26.63 (  -5.22%)
      	sys-time-ep.D          1.65 (   0.00%)        0.62 (  62.42%)
      	sys-time-is.D         40.05 (   0.00%)       24.45 (  38.95%)
      	sys-time-lu.D         37.55 (   0.00%)       29.02 (  22.72%)
      	sys-time-mg.D         47.52 (   0.00%)       34.92 (  26.52%)
      	sys-time-sp.D        119.01 (   0.00%)      109.05 (   8.37%)
      	sys-time-ua.D         51.52 (   0.00%)       45.13 (  12.40%)
      
      NUMA scan activity is also reduced:
      
      	NUMA alloc local               1042828     1342670
      	NUMA base PTE updates        140481138    93577468
      	NUMA huge PMD updates           272171      180766
      	NUMA page range updates      279832690   186129660
      	NUMA hint faults               1395972     1193897
      	NUMA hint local faults          877925      855053
      	NUMA hint local percent             62          71
      	NUMA pages migrated           12057909     9158023
      
      Similar observations are made for other thread-intensive workloads. System
      CPU usage is lower even though the headline gains in performance tend to be
      small. For example, specjbb 2005 shows almost no difference in performance
      but scan activity is reduced by a third on a 4-socket box. I didn't find
      a workload (thread intensive or otherwise) that suffered badly.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/20180504154109.mvrha2qo5wdl65vr@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  19. 05 May 2018, 1 commit
  20. 04 May 2018, 2 commits
    • sched/core: Don't schedule threads on pre-empted vCPUs · 247f2f6f
      Rohit Jain authored
      In paravirt configurations today, spinlocks figure out whether a vCPU is
      running to determine whether or not the spinlock should bother spinning. We
      can use the same logic to prioritize CPUs when scheduling threads. If a
      vCPU has been pre-empted, it will incur the extra cost of VMENTER and
      the time until it actually gets to run on the host CPU. If we have
      other vCPUs which are actually running on the host CPU and idle, we
      should schedule threads there.
      
      Performance numbers:
      
      Note: With patch is referred to as Paravirt in the following and without
      patch is referred to as Base.
      
      1) When only 1 VM is running:
      
          a) Hackbench test on KVM 8 vCPUs, 10,000 loops (lower is better):
      
      	+-------+-----------------+----------------+
      	|Number |Paravirt         |Base            |
      	|of     +---------+-------+-------+--------+
      	|Threads|Average  |Std Dev|Average| Std Dev|
      	+-------+---------+-------+-------+--------+
      	|1      |1.817    |0.076  |1.721  | 0.067  |
      	|2      |3.467    |0.120  |3.468  | 0.074  |
      	|4      |6.266    |0.035  |6.314  | 0.068  |
      	|8      |11.437   |0.105  |11.418 | 0.132  |
      	|16     |21.862   |0.167  |22.161 | 0.129  |
      	|25     |33.341   |0.326  |33.692 | 0.147  |
      	+-------+---------+-------+-------+--------+
      
      2) When two VMs are running with same CPU affinities:
      
          a) tbench test on VM 8 cpus
      
          Base:
      
      	VM1:
      
      	Throughput 220.59 MB/sec   1 clients  1 procs  max_latency=12.872 ms
      	Throughput 448.716 MB/sec  2 clients  2 procs  max_latency=7.555 ms
      	Throughput 861.009 MB/sec  4 clients  4 procs  max_latency=49.501 ms
      	Throughput 1261.81 MB/sec  7 clients  7 procs  max_latency=76.990 ms
      
      	VM2:
      
      	Throughput 219.937 MB/sec  1 clients  1 procs  max_latency=12.517 ms
      	Throughput 470.99 MB/sec   2 clients  2 procs  max_latency=12.419 ms
      	Throughput 841.299 MB/sec  4 clients  4 procs  max_latency=37.043 ms
      	Throughput 1240.78 MB/sec  7 clients  7 procs  max_latency=77.489 ms
      
          Paravirt:
      
      	VM1:
      
      	Throughput 222.572 MB/sec  1 clients  1 procs  max_latency=7.057 ms
      	Throughput 485.993 MB/sec  2 clients  2 procs  max_latency=26.049 ms
      	Throughput 947.095 MB/sec  4 clients  4 procs  max_latency=45.338 ms
      	Throughput 1364.26 MB/sec  7 clients  7 procs  max_latency=145.124 ms
      
      	VM2:
      
      	Throughput 224.128 MB/sec  1 clients  1 procs  max_latency=4.564 ms
      	Throughput 501.878 MB/sec  2 clients  2 procs  max_latency=11.061 ms
      	Throughput 965.455 MB/sec  4 clients  4 procs  max_latency=45.370 ms
      	Throughput 1359.08 MB/sec  7 clients  7 procs  max_latency=168.053 ms
      
          b) Hackbench with 4 fd 1,000,000 loops
      
      	+-------+--------------------------------------+----------------------------------------+
      	|Number |Paravirt                              |Base                                    |
      	|of     +----------+--------+---------+--------+----------+--------+---------+----------+
      	|Threads|Average1  |Std Dev1|Average2 | Std Dev|Average1  |Std Dev1|Average2 | Std Dev 2|
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      	|  1    | 3.748    | 0.620  | 3.576   | 0.432  | 4.006    | 0.395  | 3.446   | 0.787    |
      	+-------+----------+--------+---------+--------+----------+--------+---------+----------+
      
          Note that this test was run just to show the interference effect
          over-subscription can have on the baseline.
      
          c) schbench results with 2 message groups on 8 vCPU VMs
      
      	+-----------+-------+---------------+--------------+------------+
      	|           |       | Paravirt      | Base         |            |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|           |Threads| VM1   | VM2   |  VM1  | VM2  |%Improvement|
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    1  | 52    | 53    |  58   | 54   |  +6.25%    |
      	|75.0000th  |    1  | 69    | 61    |  83   | 59   |  +8.45%    |
      	|90.0000th  |    1  | 80    | 80    |  89   | 83   |  +6.98%    |
      	|95.0000th  |    1  | 83    | 83    |  93   | 87   |  +7.78%    |
      	|*99.0000th |    1  | 92    | 94    |  99   | 97   |  +5.10%    |
      	|99.5000th  |    1  | 95    | 100   |  102  | 103  |  +4.88%    |
      	|99.9000th  |    1  | 107   | 123   |  105  | 203  |  +25.32%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    2  | 56    | 62    |  67   | 59   |  +6.35%    |
      	|75.0000th  |    2  | 69    | 75    |  80   | 71   |  +4.64%    |
      	|90.0000th  |    2  | 80    | 82    |  90   | 81   |  +5.26%    |
      	|95.0000th  |    2  | 85    | 87    |  97   | 91   |  +8.51%    |
      	|*99.0000th |    2  | 98    | 99    |  107  | 109  |  +8.79%    |
      	|99.5000th  |    2  | 107   | 105   |  109  | 116  |  +5.78%    |
      	|99.9000th  |    2  | 9968  | 609   |  875  | 3116 | -165.02%   |
      	+-----------+-------+-------+-------+-------+------+------------+
      	|50.0000th  |    4  | 78    | 77    |  78   | 79   |  +1.27%    |
      	|75.0000th  |    4  | 98    | 106   |  100  | 104  |   0.00%    |
      	|90.0000th  |    4  | 987   | 1001  |  995  | 1015 |  +1.09%    |
      	|95.0000th  |    4  | 4136  | 5368  |  5752 | 5192 |  +13.16%   |
      	|*99.0000th |    4  | 11632 | 11344 |  11024| 10736|  -5.59%    |
      	|99.5000th  |    4  | 12624 | 13040 |  12720| 12144|  -3.22%    |
      	|99.9000th  |    4  | 13168 | 18912 |  14992| 17824|  +2.24%    |
      	+-----------+-------+-------+-------+-------+------+------------+
      
          Note: Improvement is measured for (VM1+VM2)
      Signed-off-by: Rohit Jain <rohit.k.jain@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dhaval.giani@oracle.com
      Cc: matt@codeblueprint.co.uk
      Cc: steven.sistare@oracle.com
      Cc: subhra.mazumdar@oracle.com
      Link: http://lkml.kernel.org/r/1525294330-7759-1-git-send-email-rohit.k.jain@oracle.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Introduce set_special_state() · b5bf9a90
      Peter Zijlstra authored
      Gaurav reported a perceived problem with TASK_PARKED, which turned out
      to be a broken wait-loop pattern in __kthread_parkme(), but the
      reported issue can (and does) in fact happen for states that do not do
      condition based sleeps.
      
      When the 'current->state = TASK_RUNNING' store of a previous
      (concurrent) try_to_wake_up() collides with the setting of a 'special'
      sleep state, we can lose the sleep state.
      
      Normal condition-based wait-loops are immune to this problem, but
      sleep states that are not condition-based are subject to it.
      
      There already is a fix for TASK_DEAD. Abstract that and also apply it
      to TASK_STOPPED and TASK_TRACED, both of which also lack a
      condition-based wait-loop.
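      
      The abstraction is a macro that serializes the state store against the
      pi_lock that try_to_wake_up() takes, roughly (a sketch):
      
      	#define set_special_state(state_value)					\
      	do {									\
      		unsigned long flags; /* may shadow */				\
      		raw_spin_lock_irqsave(&current->pi_lock, flags);		\
      		current->state = (state_value);					\
      		raw_spin_unlock_irqrestore(&current->pi_lock, flags);		\
      	} while (0)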
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  21. 03 May 2018, 1 commit
    • kthread, sched/wait: Fix kthread_parkme() completion issue · 85f1abe0
      Peter Zijlstra authored
      Even with the wait-loop fixed, there is a further issue with
      kthread_parkme(). Upon hotplug, when we do takedown_cpu(),
      smpboot_park_threads() can return before all those threads are in fact
      blocked, due to the placement of the complete() in __kthread_parkme().
      
      When that happens, sched_cpu_dying() -> migrate_tasks() can end up
      migrating such a still runnable task onto another CPU.
      
      Normally the task will have hit schedule() and gone to sleep by the
      time we do kthread_unpark(), which will then do __kthread_bind() to
      re-bind the task to the correct CPU.
      
      However, when we lose the initial TASK_PARKED store to the concurrent
      wakeup issue described previously, do the complete() and get migrated, it
      is possible to either:
      
       - observe kthread_unpark()'s clearing of SHOULD_PARK and terminate
         the park and set TASK_RUNNING, or
      
       - __kthread_bind()'s wait_task_inactive() to observe the competing
         TASK_RUNNING store.
      
      Either way the WARN() in __kthread_bind() will trigger and fail to
      correctly set the CPU affinity.
      
      Fix this by only issuing the complete() when the kthread has scheduled
      out. This does away with all the icky 'still running' nonsense.
      
      The alternative is to promote TASK_PARKED to a special state; this
      guarantees wait_task_inactive() cannot observe a 'stale' TASK_RUNNING
      and we'll end up doing the right thing, but it preserves the whole
      icky business of potentially migrating the still-runnable thing.
      Reported-by: Gaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  22. 06 April 2018, 1 commit
  23. 05 April 2018, 1 commit
  24. 03 April 2018, 1 commit
  25. 27 March 2018, 1 commit
  26. 09 March 2018, 4 commits
  27. 04 March 2018, 1 commit