1. 06 December 2018, 9 commits
    • function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack · 39237432
      Submitted by Steven Rostedt (VMware)
      commit 39eb456dacb543de90d3bc6a8e0ac5cf51ac475e upstream.
      
      Currently, the depth of the ret_stack is determined by curr_ret_stack index.
      The issue is that there's a race between setting of the curr_ret_stack and
      calling of the callback attached to the return of the function.
      
      Commit 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling
      trace return callback") moved the calling of the callback to after the
      setting of the curr_ret_stack, even stating that it was safe to do so, when
      in fact, it was the reason there was a barrier() there (yes, I should have
      commented that barrier()).
      
      Not only does the curr_ret_stack keep track of the current call graph depth,
      it also keeps the ret_stack content from being overwritten by new data.
      
      The function profiler uses the "subtime" variable of the ret_stack structure,
      and by moving curr_ret_stack early, interrupts are allowed to reuse the same
      structure while it is still in use, corrupting the data and breaking the profiler.
      
      To fix this, there need to be two variables to handle the call stack depth
      and the pointer to where the ret_stack is being used, as they need to change
      at two different locations.
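      
      A rough sketch of that two-variable scheme (structure and field names here
      are simplified assumptions for illustration, not the actual kernel patch):
      the depth may drop before the return callback runs, but the index that
      reserves the ret_stack entry is released only afterwards.
      
         /* Simplified illustration of the two-variable scheme. */
         struct shadow_stack {
             int curr_ret_stack;   /* index of the ret_stack slot in use */
             int curr_ret_depth;   /* call-graph depth reported to callbacks */
         };
      
         static void graph_return_sketch(struct shadow_stack *s,
                                         void (*retcb)(int depth))
         {
             int index = s->curr_ret_stack;
      
             s->curr_ret_depth--;           /* depth can be adjusted early */
             barrier();
             retcb(s->curr_ret_depth);      /* callback still owns ret_stack[index] */
             barrier();
             s->curr_ret_stack = index - 1; /* only now may the slot be reused */
         }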
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • function_graph: Make ftrace_push_return_trace() static · 72c33b23
      Submitted by Steven Rostedt (VMware)
      commit d125f3f8 upstream.
      
      As all architectures now call function_graph_enter() to do the entry work,
      no architecture should ever call ftrace_push_return_trace(). Make it static.
      
      This is needed to prepare for a fix of a design bug on how the curr_ret_stack
      is used.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • function_graph: Create function_graph_enter() to consolidate architecture code · 67d7bec3
      Submitted by Steven Rostedt (VMware)
      commit 8114865f upstream.
      
      Currently all the architectures do basically the same thing to prepare the
      function graph tracer on entry to a function. This code can be pulled into a
      generic location, which will then allow the function graph tracer to be
      fixed as well as extended.
      
      Create a new function graph helper function_graph_enter() that will call the
      hook function (ftrace_graph_entry) and the shadow stack operation
      (ftrace_push_return_trace), and remove the need of the architecture code to
      manage the shadow stack.
      
      This is needed to prepare for a fix of a design bug on how the curr_ret_stack
      is used.
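      
      As a rough sketch (argument lists and error handling are approximated, not
      the upstream code), the consolidated helper boils down to calling the entry
      hook and, only if it accepts the function, doing the shadow stack push:
      
         int function_graph_enter_sketch(unsigned long ret, unsigned long func,
                                         unsigned long frame_pointer,
                                         unsigned long *retp)
         {
             struct ftrace_graph_ent trace;
      
             trace.func = func;
             trace.depth = current->curr_ret_stack + 1;
      
             /* Only trace if the attached entry hook expects to. */
             if (!ftrace_graph_entry(&trace))
                 return -EBUSY;
      
             /* Shadow stack operation, no longer open-coded per architecture. */
             return ftrace_push_return_trace(ret, func, frame_pointer, retp);
         }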
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • ptrace: Remove unused ptrace_may_access_sched() and MODE_IBRS · aecb9969
      Submitted by Thomas Gleixner
      commit 46f7ecb1 upstream
      
      The IBPB control code in x86 removed the usage. Remove the functionality
      which was introduced for this.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.559149393@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation: Rework SMT state change · f55e301e
      Submitted by Thomas Gleixner
      commit a74cfffb03b73d41e08f84c2e5c87dec0ce3db9f upstream
      
      arch_smt_update() is only called when the sysfs SMT control knob is
      changed. This means that when SMT is enabled in the sysfs control knob the
      system is considered to have SMT active even if all siblings are offline.
      
      To allow fine-grained control of the speculation mitigations, the actual SMT
      state is more interesting than the fact that siblings could be enabled.
      
      Rework the code, so arch_smt_update() is invoked from each individual CPU
      hotplug function, and simplify the update function while at it.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.521974984@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/smt: Expose sched_smt_present static key · 340693ee
      Submitted by Thomas Gleixner
      commit 321a874a upstream
      
      Make the scheduler's 'sched_smt_present' static key globally available, so
      it can be used in the x86 speculation control code.
      
      Provide a query function and a stub for the CONFIG_SMP=n case.
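      
      A sketch of what the query plus stub can look like (the config guard and
      exact declarations follow the changelog description and are approximations
      of the tree):
      
         #ifdef CONFIG_SMP
         extern struct static_key_false sched_smt_present;
      
         static __always_inline bool sched_smt_active(void)
         {
             return static_branch_likely(&sched_smt_present);
         }
         #else
         static inline bool sched_smt_active(void) { return false; }
         #endif
      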
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.430168326@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • sched/smt: Make sched_smt_present track topology · a2c09481
      Submitted by Peter Zijlstra (Intel)
      commit c5511d03ec090980732e929c318a7a6374b5550e upstream
      
      Currently the 'sched_smt_present' static key is enabled when SMT topology is
      observed at CPU bringup, but it is never disabled. However, there is demand
      to also disable the key when the topology changes such that there is no SMT
      present anymore.
      
      Implement this by making the key count the number of cores that have SMT
      enabled.
      
      In particular, the SMT topology bits are set before interrupts are enabled
      and, similarly, are cleared after interrupts are disabled for the last time
      and the CPU dies.
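      
      A sketch of the counting approach (the helper names below are invented for
      illustration; cpu_smt_mask() and the static-branch helpers are the real
      kernel APIs):
      
         /* CPU online path: a core gains SMT when its sibling mask first
          * reaches two online CPUs. */
         static void smt_present_inc_sketch(int cpu)
         {
             if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
                 static_branch_inc_cpuslocked(&sched_smt_present);
         }
      
         /* CPU offline path, after interrupts are disabled for the last time. */
         static void smt_present_dec_sketch(int cpu)
         {
             if (cpumask_weight(cpu_smt_mask(cpu)) == 2)
                 static_branch_dec_cpuslocked(&sched_smt_present);
         }
      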
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185004.246110444@linutronix.de
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation: Apply IBPB more strictly to avoid cross-process data leak · cacd9385
      Submitted by Jiri Kosina
      commit dbfe2953 upstream
      
      Currently, IBPB is only issued in cases when switching into a non-dumpable
      process, the rationale being to protect such 'important and security
      sensitive' processes (such as GPG) from data leaking into a different
      userspace process via spectre v2.
      
      This is however completely insufficient to provide proper userspace-to-userspace
      spectrev2 protection, as any process can poison branch buffers before being
      scheduled out, and the newly scheduled process immediately becomes a spectrev2
      victim.
      
      In order to minimize the performance impact (for usecases that do require
      spectrev2 protection), issue the barrier only in cases when switching between
      processes where the victim can't be ptraced by the potential attacker (as in
      such cases, the attacker doesn't have to bother with branch buffers at all).
      
      [ tglx: Split up PTRACE_MODE_NOACCESS_CHK into PTRACE_MODE_SCHED and
        PTRACE_MODE_IBPB to be able to do ptrace() context tracking reasonably
        fine-grained ]
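      
      The decision described above boils down to something like the sketch below
      (task_may_ptrace() is an assumed helper used only for illustration;
      indirect_branch_prediction_barrier() is the existing x86 helper):
      
         /* Sketch of the policy, not the actual context-switch code. */
         static void cond_ibpb_sketch(struct task_struct *prev,
                                      struct task_struct *next)
         {
             /* If the incoming task could ptrace the outgoing one anyway,
              * poisoned branch buffers buy the attacker nothing, so the
              * (expensive) barrier can be skipped in that case. */
             if (!task_may_ptrace(next, prev))
                 indirect_branch_prediction_barrier();
         }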
      
      Fixes: 18bf3c3e ("x86/speculation: Use Indirect Branch Prediction Barrier in context switch")
      Originally-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc:  "WoodhouseDavid" <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc:  "SchauflerCasey" <casey.schaufler@intel.com>
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251437340.15880@cbobk.fhfr.pm
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
    • x86/speculation: Enable cross-hyperthread spectre v2 STIBP mitigation · b07fc04c
      Submitted by Jiri Kosina
      commit 53c613fe upstream
      
      STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
      (once enabled) prevents cross-hyperthread control of decisions made by
      indirect branch predictors.
      
      Enable this feature if
      
      - the CPU is vulnerable to spectre v2
      - the CPU supports SMT and has SMT siblings online
      - spectre_v2 mitigation autoselection is enabled (default)
      
      After some previous discussion, this leaves STIBP on all the time, as wrmsr
      on crossing kernel boundary is a no-no. This could perhaps later be a bit
      more optimized (like disabling it in NOHZ, experiment with disabling it in
      idle, etc) if needed.
      
      Note that the synchronization of the mask manipulation via newly added
      spec_ctrl_mutex is currently not strictly needed, as the only updater is
      already being serialized by cpu_add_remove_lock, but let's make this a
      little bit more future-proof.
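      
      Put together, the enable decision reduces to roughly the sketch below (names
      simplified; the online-sibling check is expressed via sched_smt_active(),
      which may not be where the actual code performs it):
      
         static bool stibp_needed_sketch(void)
         {
             if (spectre_v2_enabled == SPECTRE_V2_NONE)  /* mitigation disabled */
                 return false;
             if (!boot_cpu_has(X86_FEATURE_STIBP))       /* no ucode/CPU support */
                 return false;
             return sched_smt_active();                  /* SMT siblings online */
         }
      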
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc:  "WoodhouseDavid" <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc:  "SchauflerCasey" <casey.schaufler@intel.com>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  2. 01 December 2018, 3 commits
    • rcu: Make need_resched() respond to urgent RCU-QS needs · 016a8fc5
      Submitted by Paul E. McKenney
      commit 92aa39e9 upstream.
      
      The per-CPU rcu_dynticks.rcu_urgent_qs variable communicates an urgent
      need for an RCU quiescent state from the force-quiescent-state processing
      within the grace-period kthread to context switches and to cond_resched().
      Unfortunately, such urgent needs are not communicated to need_resched(),
      which is sometimes used to decide when to invoke cond_resched(), for
      but one example, within the KVM vcpu_run() function.  As of v4.15, this
      can result in synchronize_sched() being delayed by up to ten seconds,
      which can be problematic, to say nothing of annoying.
      
      This commit therefore checks rcu_dynticks.rcu_urgent_qs from within
      rcu_check_callbacks(), which is invoked from the scheduling-clock
      interrupt handler.  If the current task is not an idle task and is
      not executing in usermode, a context switch is forced, and either way,
      the rcu_dynticks.rcu_urgent_qs variable is set to false.  If the current
      task is an idle task, then RCU's dyntick-idle code will detect the
      quiescent state, so no further action is required.  Similarly, if the
      task is executing in usermode, other code in rcu_check_callbacks() and
      its called functions will report the corresponding quiescent state.
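      
      A simplified sketch of the check described above (the per-CPU variable and
      helpers follow the changelog; treat the exact placement and names as
      assumptions):
      
         /* Invoked from the scheduling-clock interrupt path (sketch only). */
         static void rcu_urgent_qs_check_sketch(int user)
         {
             if (!this_cpu_read(rcu_dynticks.rcu_urgent_qs))
                 return;
      
             /* Idle tasks and usermode execution are already quiescent states. */
             if (!is_idle_task(current) && !user) {
                 set_tsk_need_resched(current);  /* makes need_resched() true */
                 set_preempt_need_resched();
             }
             this_cpu_write(rcu_dynticks.rcu_urgent_qs, false);
         }
      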
      Reported-by: Marius Hillenbrand <mhillenb@amazon.de>
      Reported-by: David Woodhouse <dwmw2@infradead.org>
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      [ paulmck: Backported to make patch apply cleanly on older versions. ]
      Tested-by: Marius Hillenbrand <mhillenb@amazon.de>
      Cc: <stable@vger.kernel.org> # 4.12.x - 4.19.x
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • kdb: Use strscpy with destination buffer size · 2bc40f89
      Submitted by Prarit Bhargava
      [ Upstream commit c2b94c72 ]
      
      gcc 8.1.0 warns with:
      
      kernel/debug/kdb/kdb_support.c: In function ‘kallsyms_symbol_next’:
      kernel/debug/kdb/kdb_support.c:239:4: warning: ‘strncpy’ specified bound depends on the length of the source argument [-Wstringop-overflow=]
           strncpy(prefix_name, name, strlen(name)+1);
           ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      kernel/debug/kdb/kdb_support.c:239:31: note: length computed here
      
      Use strscpy() with the destination buffer size, and use ellipses when
      displaying truncated symbols.
      
      v2: Use strscpy()
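      
      A sketch of the pattern (the buffer size and the printing helper are
      illustrative assumptions, not the exact kdb code):
      
         char prefix_name[128];   /* destination size assumed for illustration */
      
         /* Bound the copy by the destination buffer, not by strlen(name);
          * strscpy() returns -E2BIG when the source had to be truncated,
          * which the display code can mark with an ellipsis. */
         if (strscpy(prefix_name, name, sizeof(prefix_name)) < 0)
             kdb_printf("%s...", prefix_name);
      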
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Cc: Jonathan Toppins <jtoppins@redhat.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
      Cc: Daniel Thompson <daniel.thompson@linaro.org>
      Cc: kgdb-bugreport@lists.sourceforge.net
      Reviewed-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Daniel Thompson <daniel.thompson@linaro.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • sched/fair: Fix cpu_util_wake() for 'execl' type workloads · 08fbd4e0
      Submitted by Patrick Bellasi
      [ Upstream commit c469933e ]
      
      A ~10% regression has been reported for UnixBench's execl throughput
      test by Aaron Lu and Ye Xiaolong:
      
        https://lkml.org/lkml/2018/10/30/765
      
      That test is pretty simple, it does a "recursive" execve() syscall on the
      same binary. Starting from the syscall, this sequence is possible:
      
         do_execve()
           do_execveat_common()
             __do_execve_file()
               sched_exec()
                 select_task_rq_fair()          <==| Task already enqueued
                   find_idlest_cpu()
                     find_idlest_group()
                       capacity_spare_wake()    <==| Functions not called from
      		   cpu_util_wake()           | the wakeup path
      
      which means we can end up calling cpu_util_wake() not only from the
      "wakeup path", as its name would suggest. Indeed, the task doing an
      execve() syscall is already enqueued on the CPU we want to get the
      cpu_util_wake() for.
      
      The estimated utilization for a CPU computed in cpu_util_wake() was
      written under the assumption that the function can be called only from the
      wakeup path. If instead the task is already enqueued, we end up with a
      utilization which does not remove the current task's contribution from
      the estimated utilization of the CPU.
      This wrongly assumes a reduced spare capacity on the current CPU and
      increases the chances of migrating the task on execve.
      
      The regression is tracked down to:
      
       commit d519329f ("sched/fair: Update util_est only on util_avg updates")
      
      because in that patch we turn on by default the UTIL_EST sched feature.
      However, the real issue is introduced by:
      
       commit f9be3e59 ("sched/fair: Use util_est in LB and WU paths")
      
      Let's fix this by ensuring to always discount the task estimated
      utilization from the CPU's estimated utilization when the task is also
      the current one. The same benchmark from the bug report, executed on a
      dual-socket, 40-CPU Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz machine,
      reports these "Execl Throughput" figures (higher is better):
      
         mainline     : 48136.5 lps
         mainline+fix : 55376.5 lps
      
      which correspond to a 15% speedup.
      
      Moreover, since {cpu_util,capacity_spare}_wake() are not really only
      used from the wakeup path, let's remove this ambiguity by using a better
      matching name: {cpu_util,capacity_spare}_without().
      
      While we are at it, let's also improve the existing documentation.
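      
      The discounting idea can be sketched as follows (cpu_util() and
      task_util_est() stand in for the fair-class helpers; this illustrates the
      intent of the fix, not the patch itself):
      
         static unsigned long cpu_util_without_sketch(int cpu, struct task_struct *p)
         {
             /* Current utilization of the CPU; if p runs here, its own
              * contribution is still included in this number. */
             unsigned long util = cpu_util(cpu);
      
             /* Discount p's estimated utilization when asking about the CPU it
              * currently occupies, e.g. from sched_exec(), so the spare capacity
              * is not underestimated. */
             if (cpu == task_cpu(p))
                 util -= min(util, task_util_est(p));
      
             return util;
         }
      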
      Reported-by: Aaron Lu <aaron.lu@intel.com>
      Reported-by: Ye Xiaolong <xiaolong.ye@intel.com>
      Tested-by: Aaron Lu <aaron.lu@intel.com>
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Fixes: f9be3e59 (sched/fair: Use util_est in LB and WU paths)
      Link: https://lore.kernel.org/lkml/20181025093100.GB13236@e110439-lin/
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 27 November 2018, 2 commits
    • sched/core: Take the hotplug lock in sched_init_smp() · b1e814e4
      Submitted by Valentin Schneider
      [ Upstream commit 40fa3780 ]
      
      When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
      for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
      Juno or HiKey960), we get the following report:
      
       [    0.748225] Call trace:
       [    0.750685]  lockdep_assert_cpus_held+0x30/0x40
       [    0.755236]  static_key_enable_cpuslocked+0x20/0xc8
       [    0.760137]  build_sched_domains+0x1034/0x1108
       [    0.764601]  sched_init_domains+0x68/0x90
       [    0.768628]  sched_init_smp+0x30/0x80
       [    0.772309]  kernel_init_freeable+0x278/0x51c
       [    0.776685]  kernel_init+0x10/0x108
       [    0.780190]  ret_from_fork+0x10/0x18
      
      The static_key in question is 'sched_asym_cpucapacity' introduced by
      commit:
      
        df054e84 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
      
      In this particular case, we enable it because smp_prepare_cpus() will
      end up fetching the capacity-dmips-mhz entry from the devicetree,
      so we already have some asymmetry detected when entering sched_init_smp().
      
      This didn't get detected in tip/sched/core because we were missing:
      
        commit cb538267 ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")
      
      Calls to build_sched_domains() after sched_init_smp() will hold the
      hotplug lock; it just so happens that this very first call is a
      special case. As stated by a comment in sched_init_smp(), "There's no
      userspace yet to cause hotplug operations", so this is a harmless
      warning.
      
      However, to both respect the semantics of underlying
      callees and make lockdep happy, take the hotplug lock in
      sched_init_smp(). This also satisfies the comment atop
      sched_init_domains() that says "Callers must hold the hotplug lock".
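      
      The fix amounts to wrapping the first domain build in the hotplug read
      lock, roughly as sketched below (a simplified excerpt-style illustration,
      not the literal diff):
      
         /* In sched_init_smp(): take the hotplug lock around the first domain
          * build so the _cpuslocked() callees see it held. */
         cpus_read_lock();
         mutex_lock(&sched_domains_mutex);
         sched_init_domains(cpu_active_mask);
         mutex_unlock(&sched_domains_mutex);
         cpus_read_unlock();
      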
      Reported-by: Sudeep Holla <sudeep.holla@arm.com>
      Tested-by: Sudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar.Eggemann@arm.com
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: morten.rasmussen@arm.com
      Cc: quentin.perret@arm.com
      Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • bpf: fix bpf_prog_get_info_by_fd to return 0 func_lens for unpriv · 1a7ccf42
      Submitted by Daniel Borkmann
      [ Upstream commit 28c2fae7 ]
      
      While dbecd738 ("bpf: get kernel symbol addresses via syscall")
      zeroed info.nr_jited_ksyms in bpf_prog_get_info_by_fd() for queries
      from unprivileged users, commit 815581c1 ("bpf: get JITed image
      lengths of functions via syscall") forgot to do so and therefore
      returns the number of elements of the user-provided buffer, which is
      incorrect. It also needs to indicate an info.nr_jited_func_lens of zero.
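      
      A sketch of the unprivileged path in bpf_prog_get_info_by_fd() after the
      fix (simplified; treat the exact field set as an approximation of the tree):
      
         if (!capable(CAP_SYS_ADMIN)) {
             /* Unprivileged callers get no JIT details at all. */
             info.jited_prog_len = 0;
             info.xlated_prog_len = 0;
             info.nr_jited_ksyms = 0;
             info.nr_jited_func_lens = 0;  /* the missing piece this fix adds */
             goto done;
         }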
      
      Fixes: 815581c1 ("bpf: get JITed image lengths of functions via syscall")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 23 November 2018, 1 commit
  5. 21 November 2018, 3 commits
  6. 14 November 2018, 15 commits
  7. 20 October 2018, 2 commits
  8. 18 October 2018, 2 commits
    • tracing: Use trace_clock_local() for looping in preemptirq_delay_test.c · 12ad0cb2
      Submitted by Steven Rostedt (VMware)
      The preemptirq_delay_test module is used for the ftrace selftest code that
      tests the latency tracers. The problem is that it uses ktime for the delay
      loop, and then checks the tracer to see if the delay loop is caught, but the
      tracer uses trace_clock_local() which uses various different other clocks to
      measure the latency. As ktime uses the clock cycles, and the code then
      converts that to nanoseconds, it causes rounding errors, and the preemptirq
      latency tests are failing due to being off by 1 (it expects to see a delay
      of 500000 us, but the delay is only 499999 us). This is happening due to a
      rounding error in the ktime (which is totally legit). The purpose of the
      test is to see if it can catch the delay, not to test the accuracy between
      trace_clock_local() and ktime_get(). Best to use apples to apples, and have
      the delay loop use the same clock as the latency tracer does.
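      
      A sketch of a delay loop that uses the tracer's clock (simplified from the
      module; the exact conversion and loop termination are assumptions):
      
         static void busy_wait_sketch(unsigned long usecs)
         {
             u64 start = trace_clock_local();   /* same clock the tracer uses */
             u64 end = start + usecs * 1000;    /* microseconds -> nanoseconds */
      
             while (trace_clock_local() < end)
                 cpu_relax();
         }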
      
      Cc: stable@vger.kernel.org
      Fixes: f96e8577 ("lib: Add module for testing preemptoff/irqsoff latency tracers")
      Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
    • tracepoint: Fix tracepoint array element size mismatch · 9c0be3f6
      Submitted by Mathieu Desnoyers
      commit 46e0c9be ("kernel: tracepoints: add support for relative
      references") changes the layout of the __tracepoint_ptrs section on
      architectures supporting relative references. However, it does so
      without turning struct tracepoint * const into const int elsewhere in
      the tracepoint code, which has the following side-effect:
      
      Setting mod->num_tracepoints is done by module.c:
      
          mod->tracepoints_ptrs = section_objs(info, "__tracepoints_ptrs",
                                               sizeof(*mod->tracepoints_ptrs),
                                               &mod->num_tracepoints);
      
      Basically, since sizeof(*mod->tracepoints_ptrs) is a pointer size
      (rather than sizeof(int)), num_tracepoints is erroneously set to half the
      size it should be on a 64-bit arch. So a module with an odd number of
      tracepoints misses the last tracepoint due to the effect of integer
      division.
      
      So in the module going notifier:
      
              for_each_tracepoint_range(mod->tracepoints_ptrs,
                      mod->tracepoints_ptrs + mod->num_tracepoints,
                      tp_module_going_check_quiescent, NULL);
      
      the expression (mod->tracepoints_ptrs + mod->num_tracepoints) actually
      evaluates to something within the bounds of the array, but misses the
      last tracepoint if the number of tracepoints is odd on a 64-bit arch.
      
      Fix this by introducing a new typedef: tracepoint_ptr_t, which
      is either "const int" on architectures that have PREL32 relocations,
      or "struct tracepoint * const" on architectures that does not have
      this feature.
      
      Also provide a new tracepoint_ptr_deref() static inline to
      encapsulate dereferencing this type rather than duplicating code and
      ugly ifdefs within the for_each_tracepoint_range() implementation.
      
      This issue appears in 4.19-rc kernels, and should ideally be fixed
      before the end of the rc cycle.
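      
      The shape of the new type and its dereference helper can be sketched as
      follows (an approximation of the header change; offset_to_ptr() is the
      existing helper for PREL32 relative references):
      
         #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
         typedef const int tracepoint_ptr_t;           /* 32-bit relative reference */
         #else
         typedef struct tracepoint * const tracepoint_ptr_t;
         #endif
      
         static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
         {
         #ifdef CONFIG_HAVE_ARCH_PREL32_RELOCATIONS
             return offset_to_ptr(p);
         #else
             return *p;
         #endif
         }
      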
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Jessica Yu <jeyu@kernel.org>
      Link: http://lkml.kernel.org/r/20181013191050.22389-1-mathieu.desnoyers@efficios.com
      Link: http://lkml.kernel.org/r/20180704083651.24360-7-ard.biesheuvel@linaro.org
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: James Morris <james.morris@microsoft.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Nicolas Pitre <nico@linaro.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
  9. 16 October 2018, 1 commit
  10. 11 October 2018, 2 commits
    • sched/fair: Fix throttle_list starvation with low CFS quota · baa9be4f
      Submitted by Phil Auld
      With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
      distribute_cfs_runtime may not empty the throttled_list before it runs
      out of runtime to distribute. In that case, due to the change from
      c06f04c7 to put throttled entries at the head of the list, later entries
      on the list will starve. Essentially, the same X processes will get pulled
      off the list, given CPU time, and then, when expired, get put back on the
      head of the list, where distribute_cfs_runtime will again give runtime to the
      same set of processes, leaving the rest to starve.
      
      Fix the issue by setting a bit in struct cfs_bandwidth when
      distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
      decide to put the throttled entry on the tail or the head of the list.  The
      bit is set/cleared by the callers of distribute_cfs_runtime while they hold
      cfs_bandwidth->lock.
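      
      The resulting queueing decision in throttle_cfs_rq() can be sketched as
      below (the flag name distribute_running is an assumption based on the
      description above):
      
         if (cfs_b->distribute_running)
             /* Head: the in-flight distribute pass will not revisit us. */
             list_add_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);
         else
             /* Tail: keep FIFO order so later runqueues are not starved. */
             list_add_tail_rcu(&cfs_rq->throttled_list, &cfs_b->throttled_cfs_rq);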
      
      This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
      the live system. In some cases you can simply look at the throttled list and
      see the later entries are not changing:
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -976050
          2     ffff90b56cb2cc00  -484925
          3     ffff90b56cb2bc00  -658814
          4     ffff90b56cb2ba00  -275365
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
        crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
          1     ffff90b56cb2d200  -994147
          2     ffff90b56cb2cc00  -306051
          3     ffff90b56cb2bc00  -961321
          4     ffff90b56cb2ba00  -24490
          5     ffff90b166a45600  -135138
          6     ffff90b56cb2da00  -282505
          7     ffff90b56cb2e000  -148065
          8     ffff90b56cb2fa00  -872591
          9     ffff90b56cb2c000  -84687
         10     ffff90b56cb2f000  -87237
         11     ffff90b166a40a00  -164582
      
      Sometimes it is easier to see by finding a process getting starved and looking
      at the sched_info:
      
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
        crash> task ffff8eb765994500 sched_info
        PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
          sched_info = {
            pcount = 8,
            run_delay = 697094208,
            last_arrival = 240260125039,
            last_queued = 240260327513
          },
      Signed-off-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
      Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
      Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • xsk: do not call synchronize_net() under RCU read lock · cee27167
      Submitted by Björn Töpel
      The XSKMAP update and delete functions called synchronize_net(), which
      can sleep. It is not allowed to sleep during an RCU read section.
      
      Instead we need to make sure that the sock sk_destruct (xsk_destruct)
      function is asynchronously called after an RCU grace period. Setting
      the SOCK_RCU_FREE flag for XDP sockets takes care of this.
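      
      The core of the fix is a one-flag change at socket creation time, roughly
      (sketched from the description; the surrounding AF_XDP create path is
      assumed):
      
         /* Mark the XDP socket so its sk_destruct runs after an RCU grace
          * period, removing the need for synchronize_net() in map update and
          * delete paths that may run under rcu_read_lock(). */
         sock_set_flag(sk, SOCK_RCU_FREE);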
      
      Fixes: fbfc504a ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>