1. 06 Dec, 2018 1 commit
    • J
      x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation · b07fc04c
      Jiri Kosina committed
      commit 53c613fe6349994f023245519265999eed75957f upstream
      
      STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
      (once enabled) prevents cross-hyperthread control of decisions made by
      indirect branch predictors.
      
      Enable this feature if
      
      - the CPU is vulnerable to spectre v2
      - the CPU supports SMT and has SMT siblings online
      - spectre_v2 mitigation autoselection is enabled (default)
      
      After some previous discussion, this leaves STIBP on all the time, as wrmsr
      on crossing kernel boundary is a no-no. This could perhaps later be a bit
      more optimized (like disabling it in NOHZ, experiment with disabling it in
      idle, etc) if needed.
      
      Note that the synchronization of the mask manipulation via newly added
      spec_ctrl_mutex is currently not strictly needed, as the only updater is
      already being serialized by cpu_add_remove_lock, but let's make this a
      little bit more future-proof.
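
The three conditions listed above amount to a plain conjunction. A minimal user-space sketch follows; the struct and function names are hypothetical and are not taken from the kernel source:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of the three inputs named in the commit message. */
struct stibp_inputs {
    bool cpu_vulnerable_to_spectre_v2;
    bool smt_siblings_online;
    bool spectre_v2_autoselect;   /* default mitigation selection */
};

/* Returns true only when all three conditions hold. */
bool stibp_should_enable(const struct stibp_inputs *in)
{
    assert(in != NULL);
    return in->cpu_vulnerable_to_spectre_v2 &&
           in->smt_siblings_online &&
           in->spectre_v2_autoselect;
}
```

Since SMT siblings can come and go with CPU hotplug, a predicate like this would have to be re-evaluated on each such change, which is where the serialization mentioned above comes in.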
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Woodhouse, David" <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: "Schaufler, Casey" <casey.schaufler@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 01 Dec, 2018 3 commits
    • P
      rcu: Make need_resched() respond to urgent RCU-QS needs · 016a8fc5
      Paul E. McKenney committed
      commit 92aa39e9dc77481b90cbef25e547d66cab901496 upstream.
      
      The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
      need for an RCU quiescent state from the force-quiescent-state processing
      within the grace-period kthread to context switches and to cond_resched().
      Unfortunately, such urgent needs are not communicated to need_resched(),
      which is sometimes used to decide when to invoke cond_resched(), for
      but one example, within the KVM vcpu_run() function.  As of v4.15, this
      can result in synchronize_sched() being delayed by up to ten seconds,
      which can be problematic, to say nothing of annoying.
      
      This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
      rcu_check_callbacks(), which is invoked from the scheduling-clock
      interrupt handler.  If the current task is not an idle task and is
      not executing in usermode, a context switch is forced, and either way,
      the rcu_dynticks.rcu_urgent_qs variable is set to false.  If the current
      task is an idle task, then RCU's dyntick-idle code will detect the
      quiescent state, so no further action is required.  Similarly, if the
      task is executing in usermode, other code in rcu_check_callbacks() and
      its called functions will report the corresponding quiescent state.
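
The decision described above can be sketched in user-space C. This is only an illustration of the control flow in the text; the struct and the function name are hypothetical stand-ins, not the kernel's code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-CPU rcu_dynticks state. */
struct rcu_cpu_state {
    bool rcu_urgent_qs;
};

/*
 * Sketch of the check made from the scheduling-clock interrupt.
 * Returns true if a context switch should be forced.
 */
bool check_urgent_qs(struct rcu_cpu_state *st, bool is_idle_task,
                     bool in_usermode)
{
    bool force_resched = false;

    assert(st != NULL);
    if (st->rcu_urgent_qs) {
        /* Idle and usermode quiescent states are reported elsewhere. */
        if (!is_idle_task && !in_usermode)
            force_resched = true;
        /* Either way, the urgent request is cleared. */
        st->rcu_urgent_qs = false;
    }
    return force_resched;
}
```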
      Reported-by: Marius Hillenbrand <mhillenb@amazon.de>
      Reported-by: David Woodhouse <dwmw2@infradead.org>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Backported to make patch apply cleanly on older versions. ]
      Tested-by: Marius Hillenbrand <mhillenb@amazon.de>
      Cc: <stable@vger.kernel.org> # 4.12.x - 4.19.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • P
      kdb: Use strscpy with destination buffer size · 2bc40f89
      Prarit Bhargava committed
      [ Upstream commit c2b94c72d93d0929f48157eef128c4f9d2e603ce ]
      
      gcc 8.1.0 warns with:
      
      kernel/debug/kdb/kdb_support.c: In function ‘kallsyms_symbol_next’:
      kernel/debug/kdb/kdb_support.c:239:4: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
           strncpy(prefix_name, name, strlen(name)+1);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/debug/kdb/kdb_support.c:239:31: note: length computed here
      
      Use strscpy() with the destination buffer size, and use ellipses when
      displaying truncated symbols.
      
      v2: Use strscpy()
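
The difference from the warned-about strncpy() call is that the bound comes from the destination, not the source. The sketch below is a user-space stand-in modeled on strscpy() semantics (always NUL-terminate, report truncation). Both helper names are hypothetical, and the way the ellipsis is placed is an assumption rather than what kdb does:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most size - 1 bytes and always NUL-terminate.
 * Returns the number of bytes copied, or -1 if src did not fit
 * (the kernel helper returns -E2BIG in that case).
 */
long bounded_copy(char *dst, const char *src, size_t size)
{
    size_t len;

    assert(dst != NULL && src != NULL);
    if (size == 0)
        return -1;
    len = strlen(src);
    if (len >= size) {
        memcpy(dst, src, size - 1);
        dst[size - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, len + 1);   /* includes the NUL */
    return (long)len;
}

/* Hypothetical helper: mark a truncated name with a trailing "...". */
void copy_name_with_ellipsis(char *dst, const char *src, size_t size)
{
    if (bounded_copy(dst, src, size) < 0 && size > 3)
        memcpy(dst + size - 4, "...", 4);   /* overwrites the tail, keeps the NUL */
}
```

With an 8-byte buffer, a long symbol name is cut to four characters followed by the ellipsis, while a name that fits is copied unchanged.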
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Toppins <jtoppins@redhat.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: kgdb-bugreport@lists.sourceforge.net
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • P
      sched/fair: Fix cpu_util_wake() for 'execl' type workloads · 08fbd4e0
      Patrick Bellasi committed
      [ Upstream commit c469933e772132aad040bd6a2adc8edf9ad6f825 ]
      
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple, it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
                         cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that function can be called only from the
      wakeup path. If instead the task is already enqueued, we end up with a
      utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This will wrongly assume a reduced spare capacity on the current CPU and
      increase the chances to migrate the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by always discounting the task's estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark of the bug report, executed on a
      dual-socket, 40-CPU Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher is better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which corresponds to a 15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      While we are at it, let's also improve the existing documentation.
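
To make the discounting concrete, here is a small sketch under a deliberately flat model: utilization and capacity are plain integers, and every name is illustrative rather than the kernel's. It shows why failing to subtract an already-enqueued task's estimate makes the current CPU look fuller than it is:

```c
#include <assert.h>
#include <stdbool.h>

/* Estimated CPU utilization with the task's own contribution removed. */
unsigned long cpu_util_without_sketch(unsigned long cpu_util_est,
                                      unsigned long task_util_est,
                                      bool task_enqueued_on_cpu)
{
    if (!task_enqueued_on_cpu)
        return cpu_util_est;      /* nothing of the task to discount */
    if (task_util_est >= cpu_util_est)
        return 0;                 /* clamp instead of underflowing */
    return cpu_util_est - task_util_est;
}

/* Capacity left on a CPU for a given utilization. */
unsigned long spare_capacity_sketch(unsigned long capacity, unsigned long util)
{
    return util >= capacity ? 0 : capacity - util;
}

/* Worked example with made-up numbers. */
void cpu_util_without_example(void)
{
    /* CPU at 600 of 1024, of which the exec'ing task accounts for 200. */
    assert(spare_capacity_sketch(1024,
               cpu_util_without_sketch(600, 200, true)) == 624);
    /* Without the discount, the same CPU appears to have less room. */
    assert(spare_capacity_sketch(1024, 600) == 424);
}
```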
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths)
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 27 Nov, 2018 2 commits
    • V
      sched/core: Take the hotplug lock in sched_init_smp() · b1e814e4
      Valentin Schneider committed
      [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]
      
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:
      
        commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      Calls to build_sched_domains() post sched_init_smp() will hold the
      hotplug lock, it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations" so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
      Reported-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • D
      bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv · 1a7ccf42
      Daniel Borkmann committed
      [ Upstream commit 28c2fae726bf5003cd209b0d5910a642af98316f ]
      
      While dbecd738 ("bpf: get kernel symbol addresses via syscall")
      zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
      from unprivileged users, commit 815581c1 ("bpf: get JITed image
      lengths of functions via syscall") forgot about doing so and therefore
      returns the number of elements of the user-supplied buffer, which is
      incorrect. It also needs to indicate an info.nr_jited_func_lens of zero.
      
      Fixes: 815581c1 ("bpf: get JITed image lengths of functions via syscall")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 23 Nov, 2018 1 commit
  5. 21 Nov, 2018 3 commits
  6. 14 Nov, 2018 15 commits
  7. 20 Oct, 2018 2 commits
  8. 18 Oct, 2018 2 commits
    • S
      tracing: Use trace_clock_local() for looping in preemptirq_delay_test.c · 12ad0cb2
      Steven Rostedt (VMware) committed
      The preemptirq_delay_test module is used by the ftrace selftest code that
      tests the latency tracers. The problem is that it uses ktime for the delay
      loop and then checks the tracer to see whether the delay loop was caught,
      but the tracer uses trace_clock_local(), which uses various other clocks to
      measure the latency. As ktime counts clock cycles and the code then
      converts them to nanoseconds, rounding errors occur, and the preemptirq
      latency tests fail due to being off by 1 (the test expects to see a delay
      of 500000 us, but the delay is only 499999 us). This is caused by a
      rounding error in ktime (which is totally legit). The purpose of the
      test is to see if it can catch the delay, not to test the accuracy between
      trace_clock_local() and ktime_get(). Best to compare apples to apples, and
      have the delay loop use the same clock as the latency tracer does.
      
      Cc: stable@vger.kernel.org
      Fixes: f96e8577 ("lib: Add module for testing preemptoff/irqsoff latency tracers")
      Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • M
      tracepoint: Fix tracepoint array element size mismatch · 9c0be3f6
      Mathieu Desnoyers committed
      Commit 46e0c9be ("kernel: tracepoints: add support for relative
      references") changes the layout of the __tracepoints_ptrs section on
      architectures supporting relative references. However, it does so
      without turning struct tracepoint * const into const int elsewhere in
      the tracepoint code, which has the following side-effect:
      
      Setting mod->num_tracepoints is done by module.c:
      
          mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
                                               sizeof(*mod->tracepoints_ptrs),
                                               &mod->num_tracepoints);
      
      Basically, since sizeof(*mod->tracepoints_ptrs) is a pointer size
      (rather than sizeof(int)), num_tracepoints is erroneously set to half the
      size it should be on 64-bit arch. So a module with an odd number of
      tracepoints misses the last tracepoint due to the effect of integer
      division.
      
      So in the module going notifier:
      
              for_each_tracepoint_range(mod->tracepoints_ptrs,
                      mod->tracepoints_ptrs + mod->num_tracepoints,
                      tp_module_going_check_quiescent, NULL);
      
      the expression (mod->tracepoints_ptrs + mod->num_tracepoints) actually
      evaluates to something within the bounds of the array, but misses the
      last tracepoint if the number of tracepoints is odd on 64-bit arch.
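
The arithmetic can be checked with a short sketch. It assumes 4-byte section entries and 8-byte pointers, as on a typical 64-bit architecture with relative references; the function names are made up for the illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole elements of elem_size bytes in a section (integer division). */
size_t section_count(size_t section_bytes, size_t elem_size)
{
    assert(elem_size != 0);
    return section_bytes / elem_size;
}

/*
 * Real entries (real_size bytes each) spanned when walking `count`
 * elements whose size is wrongly assumed to be assumed_size bytes.
 */
size_t entries_covered(size_t count, size_t assumed_size, size_t real_size)
{
    assert(real_size != 0);
    return count * assumed_size / real_size;
}
```

For a module with 5 tracepoints the section holds 20 bytes. Dividing by the pointer size gives `section_count(20, 8) == 2`, and walking 2 pointer-sized elements spans `entries_covered(2, 8, 4) == 4` real entries, so the fifth is never visited. With an even count such as 6 (24 bytes) the walk happens to span all 6 entries.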
      
      Fix this by introducing a new typedef: tracepoint_ptr_t, which
      is either "const int" on architectures that have PREL32 relocations,
      or "struct tracepoint * const" on architectures that do not have
      this feature.
      
      Also provide a new tracepoint_ptr_deref() static inline to
      encapsulate dereferencing this type rather than duplicate code and
      ugly ifdefs within the for_each_tracepoint_range() implementation.
      
      This issue appears in 4.19-rc kernels, and should ideally be fixed
      before the end of the rc cycle.
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Link: http://lkml.kernel.org/r/20181013191050.22389-1-mathieu.desnoyers@efficios.com
      Link: http://lkml.kernel.org/r/20180704083651.24360-7-ard.biesheuvel@linaro.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morris <james.morris@microsoft.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  9. 16 Oct, 2018 1 commit
  10. 11 Oct, 2018 2 commits
    • P
      sched/fair: Fix throttle_list starvation with low CFS quota · baa9be4f
      Phil Auld committed
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve.  Essentially, the same X processes will get pulled
      off the list, given CPU time and then, when expired, get put back on the
      head of the list where distribute_cfs_runtime will give runtime to the same
      set of processes leaving the rest.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • B
      xsk: do not call synchronize_net() under RCU read lock · cee27167
      Björn Töpel committed
      The XSKMAP update and delete functions called synchronize_net(), which
      can sleep. It is not allowed to sleep during an RCU read section.
      
      Instead we need to make sure that the sock sk_destruct (xsk_destruct)
      function is asynchronously called after an RCU grace period. Setting
      the SOCK_RCU_FREE flag for XDP sockets takes care of this.
      
      Fixes: fbfc504a ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  11. 06 Oct, 2018 1 commit
    • J
      bpf: 32-bit RSH verification must truncate input before the ALU op · b799207e
      Jann Horn committed
      When I wrote commit 468f6eaf ("bpf: fix 32-bit ALU op verification"), I
      assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
      is sufficient to just truncate the output to 32 bits; and so I just moved
      the register size coercion that used to be at the start of the function to
      the end of the function.
      
      That assumption is true for almost every op, but not for 32-bit right
      shifts, because those can propagate information towards the least
      significant bit. Fix it by always truncating inputs for 32-bit ops to 32
      bits.
      
      Also get rid of the coerce_reg_to_size() after the ALU op, since that has
      no effect.
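
A worked example shows why right shifts are the exception. The two functions below are illustrative models of the two orderings, not verifier code; shift counts are assumed to be below 32:

```c
#include <assert.h>
#include <stdint.h>

/* Flawed model: shift the full 64-bit value, then truncate the result. */
uint32_t rsh32_truncate_after(uint64_t reg, unsigned int shift)
{
    assert(shift < 32);
    return (uint32_t)(reg >> shift);
}

/* Correct model: truncate the input to 32 bits, then shift. */
uint32_t rsh32_truncate_before(uint64_t reg, unsigned int shift)
{
    assert(shift < 32);
    return (uint32_t)reg >> shift;
}
```

With only bit 32 set (`0x100000000`) and a shift of 1, truncating afterwards yields `0x80000000` because the upper bit has moved into the low word, whereas a real 32-bit operation sees an input of 0 and produces 0. When the upper 32 bits are already zero, both orderings agree.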
      
      Fixes: 468f6eaf ("bpf: fix 32-bit ALU op verification")
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  12. 05 Oct, 2018 1 commit
    • T
      cgroup: Fix dom_cgrp propagation when enabling threaded mode · 479adb89
      Tejun Heo committed
      A cgroup which is already a threaded domain may be converted into a
      threaded cgroup if the prerequisite conditions are met.  When this
      happens, all threaded descendants should also have their ->dom_cgrp
      updated to the new threaded domain cgroup.  Unfortunately, this
      propagation was missing, leading to the following failure.
      
        # cd /sys/fs/cgroup/unified
        # cat cgroup.subtree_control    # show that no controllers are enabled
      
        # mkdir -p mycgrp/a/b/c
        # echo threaded > mycgrp/a/b/cgroup.type
      
        At this point, the hierarchy looks as follows:
      
            mycgrp [d]
                a [dt]
                    b [t]
                        c [inv]
      
        Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):
      
        # echo threaded > mycgrp/a/cgroup.type
      
        By this point, we now have a hierarchy that looks as follows:
      
            mycgrp [dt]
                a [t]
                    b [t]
                        c [inv]
      
        But, when we try to convert the node "c" from "domain invalid" to
        "threaded", we get ENOTSUP on the write():
      
        # echo threaded > mycgrp/a/b/c/cgroup.type
        sh: echo: write error: Operation not supported
      
      This patch fixes the problem by
      
      * Moving the open-coded ->dom_cgrp save and restoration in
        cgroup_enable_threaded() into cgroup_{save|restore}_control() so
        that multiple cgroups can be handled.
      
      * Updating all threaded descendants' ->dom_cgrp to point to the new
        dom_cgrp when enabling threaded mode.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-and-tested-by: "Michael Kerrisk (man-pages)" <mtk.manpages@gmail.com>
      Reported-by: Amin Jamali <ajamali@pivotal.io>
      Reported-by: Joao De Almeida Pereira <jpereira@pivotal.io>
      Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
      Fixes: 454000ad ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
      Cc: stable@vger.kernel.org # v4.14+
  13. 03 Oct, 2018 1 commit
    • G
      locking/ww_mutex: Fix runtime warning in the WW mutex selftest · e4a02ed2
      Guenter Roeck committed
      If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image
      in an arm64 virtual machine results in the following
      traceback if 8 CPUs are enabled:
      
        DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
        WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0
        ...
        Call trace:
         __mutex_unlock_slowpath()
         ww_mutex_unlock()
         test_cycle_work()
         process_one_work()
         worker_thread()
         kthread()
         ret_from_fork()
      
      If requesting b_mutex fails with -EDEADLK, the error variable
      is reassigned to the return value from calling ww_mutex_lock
      on a_mutex again. If this call fails, a_mutex is not locked.
      It is, however, unconditionally unlocked subsequently, causing
      the reported warning. Fix the problem by using two error variables.
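
The two-variable pattern can be sketched with scripted fake locks in user space. Everything here is a hypothetical stand-in (the real selftest uses ww_mutex and an acquire context), and the slow-path lock on b is simply assumed to succeed:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_EDEADLK 35   /* matches the -35 in the selftest output */

struct fake_mutex {
    bool locked;
    int next_result;        /* scripted result of the next lock attempt */
};

int fake_lock(struct fake_mutex *m)
{
    int ret = m->next_result;

    if (ret == 0)
        m->locked = true;
    return ret;
}

/* Returns false when asked to unlock a mutex that is not held. */
bool fake_unlock(struct fake_mutex *m)
{
    bool was_locked = m->locked;

    m->locked = false;
    return was_locked;
}

/*
 * Lock a, then b. On deadlock, back off, take b, and retry a.
 * erra tracks a separately from err, so a is unlocked only when held.
 * Returns true if no bogus unlock happened.
 */
bool lock_pair_sketch(struct fake_mutex *a, struct fake_mutex *b,
                      int a_retry_result, int *result)
{
    bool ok = true;
    int erra, err;

    assert(a != NULL && b != NULL && result != NULL);
    erra = fake_lock(a);
    err = fake_lock(b);
    if (err == -SKETCH_EDEADLK) {
        err = 0;
        if (erra == 0 && !fake_unlock(a))
            ok = false;
        b->next_result = 0;           /* assumed: slow path succeeds */
        fake_lock(b);
        a->next_result = a_retry_result;
        erra = fake_lock(a);          /* kept apart from err */
    }
    if (err == 0 && !fake_unlock(b))
        ok = false;
    if (erra == 0 && !fake_unlock(a))
        ok = false;
    *result = err ? err : erra;
    return ok;
}
```

When the retry on a fails, the function still reports the error through `result`, but it never unlocks a mutex it does not hold, which is the condition the warning was flagging.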
      
      With this change, the selftest still fails as follows:
      
        cyclic deadlock not resolved, ret[7/8] = -35
      
      However, the traceback is gone.
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Fixes: d1b42b80 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks")
      Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  14. 02 Oct, 2018 5 commits
    • R
      bpf: don't accept cgroup local storage with zero value size · b0584ea6
      Roman Gushchin committed
      Explicitly forbid creating cgroup local storage maps with zero value
      size, as it makes no sense and might even cause a panic.
      
      Reported-by: syzbot+18628320d3b14a5c459c@syzkaller.appspotmail.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • M
      sched/numa: Migrate pages to local nodes quicker early in the lifetime of a task · 37355bdc
      Mel Gorman committed
      Automatic NUMA Balancing uses a multi-stage pass to decide whether a page
      should migrate to a local node. This filter avoids excessive ping-ponging
      if a page is shared or used by threads that migrate cross-node frequently.
      
      Threads inherit both page tables and the preferred node ID from the
      parent. This means that threads can trigger hinting faults earlier than
      a new task which delays scanning for a number of seconds. As it can be
      load balanced very early in its lifetime there can be an unnecessary delay
      before it starts migrating thread-local data. This patch migrates private
      pages faster early in the lifetime of a thread using the sequence counter
      as an identifier of new tasks.
      
      With this patch applied, STREAM performance is the same as 4.17 even though
      processes are not spread cross-node prematurely. Other workloads showed
      a mix of minor gains and losses. This is somewhat expected, as most workloads
      are not very sensitive to the starting conditions of a process.
      
                               4.19.0-rc5             4.19.0-rc5                 4.17.0
                               numab-v1r1       fastmigrate-v1r1                vanilla
      MB/sec copy     43298.52 (   0.00%)    47335.46 (   9.32%)    47219.24 (   9.06%)
      MB/sec scale    30115.06 (   0.00%)    32568.12 (   8.15%)    32527.56 (   8.01%)
      MB/sec add      32825.12 (   0.00%)    36078.94 (   9.91%)    35928.02 (   9.45%)
      MB/sec triad    32549.52 (   0.00%)    35935.94 (  10.40%)    35969.88 (  10.51%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-3-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Avoid task migration for small NUMA improvement · 6fd98e77
      Srikar Dronamraju committed
      If NUMA improvement from the task migration is going to be very
      minimal, then avoid task migration.
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     198512  205910   3.72673
      1     313559  318491   1.57291
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev     Current  %Change
      8     74761.9  74935.9  0.232739
      1     214874   226796   5.54837
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     180536  189780   5.12031
      1     210281  205695   -2.18089
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     56511.4  60370    6.828
      1     104899   108100   3.05151
      
      1/7 cases is regressing. Looking at the events, migrate_pages seems
      to vary the most, especially in the regressing case. Also, some
      amount of variance is expected between different runs of
      Specjbb2005.
      
      Some events stats before and after applying the patch.
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,818,546      13,801,554
      migrations                1,149,960       1,151,541
      faults                    385,583         433,246
      cache-misses              55,259,546,768  55,168,691,835
      sched:sched_move_numa     2,257           2,551
      sched:sched_stick_numa    9               24
      sched:sched_swap_numa     512             904
      migrate:mm_migrate_pages  2,225           1,571
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        72692   113682
      numa_hint_faults_local  62270   102163
      numa_hit                238762  240181
      numa_huge_pte_updates   48      36
      numa_interleave         75      64
      numa_local              238676  240103
      numa_other              86      78
      numa_pages_migrated     2225    1564
      numa_pte_updates        98557   134080
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,173,490       3,079,150
      migrations                36,966          31,455
      faults                    108,776         99,081
      cache-misses              12,200,075,320  11,588,126,740
      sched:sched_move_numa     1,264           1
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     0               0
      migrate:mm_migrate_pages  899             36
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        21109   430
      numa_hint_faults_local  17120   77
      numa_hit                72934   71277
      numa_huge_pte_updates   42      0
      numa_interleave         33      22
      numa_local              72866   71218
      numa_other              68      59
      numa_pages_migrated     915     23
      numa_pte_updates        42326   0
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,312,022    8,707,565
      migrations                231,705      171,342
      faults                    310,242      310,820
      cache-misses              402,324,573  136,115,400
      sched:sched_move_numa     193          215
      sched:sched_stick_numa    0            6
      sched:sched_swap_numa     3            24
      migrate:mm_migrate_pages  93           162
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11838   8985
      numa_hint_faults_local  11216   8154
      numa_hit                90689   93819
      numa_huge_pte_updates   0       0
      numa_interleave         1579    882
      numa_local              89634   93496
      numa_other              1055    323
      numa_pages_migrated     92      169
      numa_pte_updates        12109   9217
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before      After
      cs                        2,170,481   2,152,072
      migrations                10,126      10,704
      faults                    160,962     164,376
      cache-misses              10,834,845  3,818,437
      sched:sched_move_numa     10          16
      sched:sched_stick_numa    0           0
      sched:sched_swap_numa     0           7
      migrate:mm_migrate_pages  2           199
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        403     2248
      numa_hint_faults_local  358     1666
      numa_hit                25898   25704
      numa_huge_pte_updates   0       0
      numa_interleave         207     200
      numa_local              25860   25679
      numa_other              38      25
      numa_pages_migrated     2       197
      numa_pte_updates        400     2234
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        110,339,633      93,330,595
      migrations                4,139,812        4,122,061
      faults                    863,622          865,979
      cache-misses              231,838,045,660  225,395,083,479
      sched:sched_move_numa     2,196            2,372
      sched:sched_stick_numa    33               24
      sched:sched_swap_numa     544              769
      migrate:mm_migrate_pages  2,469            1,677
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        85748   91638
      numa_hint_faults_local  66831   78096
      numa_hit                242213  242225
      numa_huge_pte_updates   0       0
      numa_interleave         0       2
      numa_local              242211  242219
      numa_other              2       6
      numa_pages_migrated     2376    1515
      numa_pte_updates        86233   92274
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        59,331,057      51,487,271
      migrations                552,019         537,170
      faults                    266,586         256,921
      cache-misses              73,796,312,990  70,073,831,187
      sched:sched_move_numa     981             576
      sched:sched_stick_numa    54              24
      sched:sched_swap_numa     286             327
      migrate:mm_migrate_pages  713             726
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        14807   12000
      numa_hint_faults_local  5738    5024
      numa_hit                36230   36470
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36228   36465
      numa_other              2       5
      numa_pages_migrated     703     726
      numa_pte_updates        14742   11930
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-7-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • M
      sched/numa: Limit the conditions where scan period is reset · 05cbdf4f
      Mel Gorman committed
      migrate_task_rq_fair() resets the scan rate for NUMA balancing on every
      cross-node migration. In the event of excessive load balancing due to
      saturation, this may result in the scan rate being pegged at maximum and
      further overloading the machine.
      
      This patch resets the scan period only if NUMA balancing is active, a
      preferred node has been selected, and the task is being migrated away from
      the preferred node, as these migrations are the most harmful. For example,
      a migration to the preferred
      node does not justify a faster scan rate. Similarly, a migration between two
      nodes that are not preferred is probably bouncing due to over-saturation of
      the machine.  In that case, scanning faster and trapping more NUMA faults
      will further overload the machine.
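
The conditions described above can be written as a single predicate. The following is a simplified Python model built only from this description, not the kernel implementation; the function name and its parameters are hypothetical.

```python
from typing import Optional


def should_reset_scan_period(numa_balancing_active: bool,
                             preferred_node: Optional[int],
                             src_node: int,
                             dst_node: int) -> bool:
    """Decide whether a migration should reset the NUMA scan period."""
    if not numa_balancing_active:
        return False
    if preferred_node is None:
        # No preferred node selected yet.
        return False
    if src_node == dst_node:
        # Not a cross-node migration.
        return False
    # Reset only when the task is leaving its preferred node. Moves to the
    # preferred node, or between two non-preferred nodes, are ignored.
    return src_node == preferred_node


print(should_reset_scan_period(True, 0, 0, 1))  # leaving preferred node
print(should_reset_scan_period(True, 0, 1, 2))  # bouncing between others
```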
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     203370  205332   0.964744
      1     328431  319785   -2.63252
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     206070  206585   0.249915
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188386  189162   0.41192
      1     201566  213760   6.04963
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     59157.4  58736.8  -0.710985
      1     105495   105419   -0.0720413
      
      Some event stats before and after applying the patch:
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,825,492      14,285,708
      migrations                1,152,509       1,180,621
      faults                    371,948         339,114
      cache-misses              55,654,206,041  55,205,631,894
      sched:sched_move_numa     1,856           843
      sched:sched_stick_numa    4               6
      sched:sched_swap_numa     428             219
      migrate:mm_migrate_pages  898             365
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        57146   26907
      numa_hint_faults_local  51612   24279
      numa_hit                238164  239771
      numa_huge_pte_updates   16      0
      numa_interleave         63      68
      numa_local              238085  239688
      numa_other              79      83
      numa_pages_migrated     883     363
      numa_pte_updates        67540   27415
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,288,525       3,202,779
      migrations                38,652          37,186
      faults                    111,678         106,076
      cache-misses              12,111,197,376  12,024,873,744
      sched:sched_move_numa     900             931
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     5               1
      migrate:mm_migrate_pages  714             637
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        18572   17409
      numa_hint_faults_local  14850   14367
      numa_hit                73197   73953
      numa_huge_pte_updates   11      20
      numa_interleave         25      25
      numa_local              73138   73892
      numa_other              59      61
      numa_pages_migrated     712     668
      numa_pte_updates        24021   27276
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,451,543    8,474,013
      migrations                202,804      254,934
      faults                    310,024      320,506
      cache-misses              253,522,507  110,580,458
      sched:sched_move_numa     213          725
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     2            7
      migrate:mm_migrate_pages  88           145
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        11830   22797
      numa_hint_faults_local  11301   21539
      numa_hit                90038   89308
      numa_huge_pte_updates   0       0
      numa_interleave         855     865
      numa_local              89796   88955
      numa_other              242     353
      numa_pages_migrated     88      149
      numa_pte_updates        12039   22930
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,049,153  2,195,628
      migrations                11,405     11,179
      faults                    162,309    149,656
      cache-misses              7,203,343  8,117,515
      sched:sched_move_numa     22         49
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  1          5
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        1693    3577
      numa_hint_faults_local  1669    3476
      numa_hit                25177   26142
      numa_huge_pte_updates   0       0
      numa_interleave         194     358
      numa_local              24993   26042
      numa_other              184     100
      numa_pages_migrated     1       5
      numa_pte_updates        1577    3587
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        94,515,937       100,602,296
      migrations                4,203,554        4,135,630
      faults                    832,697          789,256
      cache-misses              226,248,698,331  226,160,621,058
      sched:sched_move_numa     1,730            1,366
      sched:sched_stick_numa    14               16
      sched:sched_swap_numa     432              374
      migrate:mm_migrate_pages  1,398            1,350
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        80079   47857
      numa_hint_faults_local  68620   39768
      numa_hit                241187  240165
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              241186  240165
      numa_other              1       0
      numa_pages_migrated     1347    1224
      numa_pte_updates        80729   48354
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        63,704,961      58,515,496
      migrations                573,404         564,845
      faults                    230,878         245,807
      cache-misses              76,568,222,781  73,603,757,976
      sched:sched_move_numa     509             996
      sched:sched_stick_numa    31              10
      sched:sched_swap_numa     182             193
      migrate:mm_migrate_pages  541             646
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        8501    13422
      numa_hint_faults_local  2960    5619
      numa_hit                35526   36118
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              35526   36116
      numa_other              0       2
      numa_pages_migrated     539     616
      numa_pte_updates        8433    13374
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-5-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • S
      sched/numa: Reset scan rate whenever task moves across nodes · 3f9672ba
      Srikar Dronamraju committed
      Currently the task scan rate is reset when the NUMA balancer migrates
      the task to a different node. If the NUMA balancer initiates a swap, the
      reset applies only to the task that initiates the swap. Similarly, no scan
      rate reset is done if the task is migrated across nodes by the traditional
      load balancer.
      
      Instead, move the scan reset to migrate_task_rq. This ensures that a task
      moved out of its preferred node either gets back to its preferred node
      quickly or finds a new preferred node. Doing so is fair to all tasks
      migrating across nodes.
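
The effect of moving the reset into one common migration hook can be illustrated with a small Python model. It is only a sketch of the idea: the `Task` class, the function names, and the period values are hypothetical and do not mirror the kernel's data structures.

```python
from dataclasses import dataclass

SCAN_PERIOD_START = 1000  # hypothetical starting scan period (ms)


@dataclass
class Task:
    name: str
    node: int
    scan_period: int = 60000  # hypothetical backed-off scan period (ms)


def migrate_task(task: Task, dst_node: int) -> None:
    """Single migration hook shared by every migration path."""
    if dst_node != task.node:
        # Cross-node move: scan faster so placement is re-evaluated soon.
        task.scan_period = SCAN_PERIOD_START
    task.node = dst_node


def numa_swap(a: Task, b: Task) -> None:
    """Swap two tasks across nodes; both go through the common hook."""
    a_node, b_node = a.node, b.node
    migrate_task(a, b_node)
    migrate_task(b, a_node)


def load_balance_move(task: Task, dst_node: int) -> None:
    """Load-balancer move; it uses the same hook as the NUMA paths."""
    migrate_task(task, dst_node)


a, b = Task("a", node=0), Task("b", node=1)
numa_swap(a, b)
print(a, b)
```

Because the reset lives in the shared hook, both tasks in a swap and tasks moved by the load balancer get the same treatment.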
      
      Specjbb2005 results (8 warehouses)
      Higher bops are better
      
      2 Socket - 2  Node Haswell - X86
      JVMS  Prev    Current  %Change
      4     200668  203370   1.3465
      1     321791  328431   2.06345
      
      2 Socket - 4 Node Power8 - PowerNV
      JVMS  Prev    Current  %Change
      1     204848  206070   0.59654
      
      2 Socket - 2  Node Power9 - PowerNV
      JVMS  Prev    Current  %Change
      4     188098  188386   0.153112
      1     200351  201566   0.606436
      
      4 Socket - 4  Node Power7 - PowerVM
      JVMS  Prev     Current  %Change
      8     58145.9  59157.4  1.73959
      1     103798   105495   1.63491
      
      Some event stats before and after applying the patch:
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        13,912,183      13,825,492
      migrations                1,155,931       1,152,509
      faults                    367,139         371,948
      cache-misses              54,240,196,814  55,654,206,041
      sched:sched_move_numa     1,571           1,856
      sched:sched_stick_numa    9               4
      sched:sched_swap_numa     463             428
      migrate:mm_migrate_pages  703             898
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        50155   57146
      numa_hint_faults_local  45264   51612
      numa_hit                239652  238164
      numa_huge_pte_updates   36      16
      numa_interleave         68      63
      numa_local              239576  238085
      numa_other              76      79
      numa_pages_migrated     680     883
      numa_pte_updates        71146   67540
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                     Before          After
      cs                        3,156,720       3,288,525
      migrations                30,354          38,652
      faults                    97,261          111,678
      cache-misses              12,400,026,826  12,111,197,376
      sched:sched_move_numa     4               900
      sched:sched_stick_numa    0               0
      sched:sched_swap_numa     1               5
      migrate:mm_migrate_pages  20              714
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Haswell - X86
      Event                   Before  After
      numa_hint_faults        272     18572
      numa_hint_faults_local  186     14850
      numa_hit                71362   73197
      numa_huge_pte_updates   0       11
      numa_interleave         23      25
      numa_local              71299   73138
      numa_other              63      59
      numa_pages_migrated     2       712
      numa_pte_updates        0       24021
      
      perf stats 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before       After
      cs                        8,606,824    8,451,543
      migrations                155,352      202,804
      faults                    301,409      310,024
      cache-misses              157,759,224  253,522,507
      sched:sched_move_numa     168          213
      sched:sched_stick_numa    0            0
      sched:sched_swap_numa     3            2
      migrate:mm_migrate_pages  125          88
      
      vmstat 8th warehouse Multi JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        4650    11830
      numa_hint_faults_local  3946    11301
      numa_hit                90489   90038
      numa_huge_pte_updates   0       0
      numa_interleave         892     855
      numa_local              90034   89796
      numa_other              455     242
      numa_pages_migrated     124     88
      numa_pte_updates        4818    12039
      
      perf stats 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                     Before     After
      cs                        2,113,167  2,049,153
      migrations                10,533     11,405
      faults                    142,727    162,309
      cache-misses              5,594,192  7,203,343
      sched:sched_move_numa     10         22
      sched:sched_stick_numa    0          0
      sched:sched_swap_numa     0          0
      migrate:mm_migrate_pages  6          1
      
      vmstat 8th warehouse Single JVM 2 Socket - 2  Node Power9 - PowerNV
      Event                   Before  After
      numa_hint_faults        744     1693
      numa_hint_faults_local  584     1669
      numa_hit                25551   25177
      numa_huge_pte_updates   0       0
      numa_interleave         263     194
      numa_local              25302   24993
      numa_other              249     184
      numa_pages_migrated     6       1
      numa_pte_updates        744     1577
      
      perf stats 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before           After
      cs                        101,227,352      94,515,937
      migrations                4,151,829        4,203,554
      faults                    745,233          832,697
      cache-misses              224,669,561,766  226,248,698,331
      sched:sched_move_numa     617              1,730
      sched:sched_stick_numa    2                14
      sched:sched_swap_numa     187              432
      migrate:mm_migrate_pages  316              1,398
      
      vmstat 8th warehouse Multi JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        24195   80079
      numa_hint_faults_local  21639   68620
      numa_hit                238331  241187
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              238331  241186
      numa_other              0       1
      numa_pages_migrated     204     1347
      numa_pte_updates        24561   80729
      
      perf stats 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                     Before          After
      cs                        62,738,978      63,704,961
      migrations                562,702         573,404
      faults                    228,465         230,878
      cache-misses              75,778,067,952  76,568,222,781
      sched:sched_move_numa     648             509
      sched:sched_stick_numa    13              31
      sched:sched_swap_numa     137             182
      migrate:mm_migrate_pages  733             541
      
      vmstat 8th warehouse Single JVM 4 Socket - 4  Node Power7 - PowerVM
      Event                   Before  After
      numa_hint_faults        10281   8501
      numa_hint_faults_local  3242    2960
      numa_hit                36338   35526
      numa_huge_pte_updates   0       0
      numa_interleave         0       0
      numa_local              36338   35526
      numa_other              0       0
      numa_pages_migrated     706     539
      numa_pte_updates        10176   8433
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1537552141-27815-4-git-send-email-srikar@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>