1. 26 2月, 2019 1 次提交
    • L
      Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds 提交于
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinked into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      
      Do let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53a41cb7
  2. 04 2月, 2019 4 次提交
    • A
      sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · c546951d
      Andrea Parri 提交于
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
      this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
      with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
      Signed-off-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c546951d
    • V
      sched/fair: Update scale invariance of PELT · 23127296
      Vincent Guittot 提交于
      The current implementation of load tracking invariance scales the
      contribution with current frequency and uarch performance (only for
      utilization) of the CPU. One main result of this formula is that the
      figures are capped by current capacity of CPU. Another one is that the
      load_avg is not invariant because not scaled with uarch.
      
      The util_avg of a periodic task that runs r time slots every p time slots
      varies in the range :
      
          U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
      
      with U is the max util_avg value = SCHED_CAPACITY_SCALE
      
      At a lower capacity, the range becomes:
      
          U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
      
      with C reflecting the compute capacity ratio between current capacity and
      max capacity.
      
      so C tries to compensate changes in (1-y^r') but it can't be accurate.
      
      Instead of scaling the contribution value of PELT algo, we should scale the
      running time. The PELT signal aims to track the amount of computation of
      tasks and/or rq so it seems more correct to scale the running time to
      reflect the effective amount of computation done since the last update.
      
      In order to be fully invariant, we need to apply the same amount of
      running time and idle time whatever the current capacity. Because running
      at lower capacity implies that the task will run longer, we have to ensure
      that the same amount of idle time will be applied when system becomes idle
      and no idle time has been "stolen". But reaching the maximum utilization
      value (SCHED_CAPACITY_SCALE) means that the task is seen as an
      always-running task whatever the capacity of the CPU (even at max compute
      capacity). In this case, we can discard this "stolen" idle times which
      becomes meaningless.
      
      In order to achieve this time scaling, a new clock_pelt is created per rq.
      The increase of this clock scales with current capacity when something
      is running on rq and synchronizes with clock_task when rq is idle. With
      this mechanism, we ensure the same running and idle time whatever the
      current capacity. This also enables to simplify the pelt algorithm by
      removing all references of uarch and frequency and applying the same
      contribution to utilization and loads. Furthermore, the scaling is done
      only once per update of clock (update_rq_clock_task()) instead of during
      each update of sched_entities and cfs/rt/dl_rq of the rq like the current
      implementation. This is interesting when cgroup are involved as shown in
      the results below:
      
      On a hikey (octo Arm64 platform).
      Performance cpufreq governor and only shallowest c-state to remove variance
      generated by those power features so we only track the impact of pelt algo.
      
      each test runs 16 times:
      
      	./perf bench sched pipe
      	(higher is better)
      	kernel	tip/sched/core     + patch
      	        ops/seconds        ops/seconds         diff
      	cgroup
      	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
      	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
      	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%
      
      	hackbench -l 1000
      	(lower is better)
      	kernel	tip/sched/core     + patch
      	        duration(sec)      duration(sec)        diff
      	cgroup
      	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
      	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
      	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%
      
      Then, the responsiveness of PELT is improved when CPU is not running at max
      capacity with this new algorithm. I have put below some examples of
      duration to reach some typical load values according to the capacity of the
      CPU with current implementation and with this patch. These values has been
      computed based on the geometric series and the half period value:
      
        Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
        972 (95%)    138ms         not reachable            276ms
        486 (47.5%)  30ms          138ms                     60ms
        256 (25%)    13ms           32ms                     26ms
      
      On my hikey (octo Arm64 platform) with schedutil governor, the time to
      reach max OPP when starting from a null utilization, decreases from 223ms
      with current scale invariance down to 121ms with the new algorithm.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      23127296
    • E
      sched/core: Convert task_struct.stack_refcount to refcount_t · f0b89d39
      Elena Reshetova 提交于
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.stack_refcount is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.stack_refcount it might make a difference
      in following places:
      
       - try_get_task_stack(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_task_stack(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: NHans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f0b89d39
    • E
      sched/core: Convert task_struct.usage to refcount_t · ec1d2819
      Elena Reshetova 提交于
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.usage is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.usage it might make a difference
      in following places:
      
       - put_task_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: NHans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ec1d2819
  3. 02 2月, 2019 1 次提交
    • J
      x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner 提交于
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
      e6d42931
  4. 30 1月, 2019 2 次提交
    • W
      x86/speculation: Add PR_SPEC_DISABLE_NOEXEC · 71368af9
      Waiman Long 提交于
      With the default SPEC_STORE_BYPASS_SECCOMP/SPEC_STORE_BYPASS_PRCTL mode,
      the TIF_SSBD bit will be inherited when a new task is fork'ed or cloned.
      It will also remain when a new program is execve'ed.
      
      Only certain class of applications (like Java) that can run on behalf of
      multiple users on a single thread will require disabling speculative store
      bypass for security purposes. Those applications will call prctl(2) at
      startup time to disable SSB. They won't rely on the fact the SSB might have
      been disabled. Other applications that don't need SSBD will just move on
      without checking if SSBD has been turned on or not.
      
      The fact that the TIF_SSBD is inherited across execve(2) boundary will
      cause performance of applications that don't need SSBD but their
      predecessors have SSBD on to be unwittingly impacted especially if they
      write to memory a lot.
      
      To remedy this problem, a new PR_SPEC_DISABLE_NOEXEC argument for the
      PR_SET_SPECULATION_CTRL option of prctl(2) is added to allow applications
      to specify that the SSBD feature bit on the task structure should be
      cleared whenever a new program is being execve'ed.
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: https://lkml.kernel.org/r/1547676096-3281-1-git-send-email-longman@redhat.com
      71368af9
    • T
      sched: Remove stale PF_MUTEX_TESTER bit · 15917dc0
      Thomas Gleixner 提交于
      The RTMUTEX tester was removed long ago but the PF bit stayed
      around. Remove it and free up the space.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      15917dc0
  5. 12 1月, 2019 1 次提交
    • T
      umh: add exit routine for UMH process · 73ab1cb2
      Taehee Yoo 提交于
      A UMH process which is created by the fork_usermode_blob() such as
      bpfilter needs to release members of the umh_info when process is
      terminated.
      But the do_exit() does not release members of the umh_info. hence module
      which uses UMH needs own code to detect whether UMH process is
      terminated or not.
      But this implementation needs extra code for checking the status of
      UMH process. it eventually makes the code more complex.
      
      The new PF_UMH flag is added and it is used to identify UMH processes.
      The exit_umh() does not release members of the umh_info.
      Hence umh_info->cleanup callback should release both members of the
      umh_info and the private data.
      Suggested-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73ab1cb2
  6. 09 1月, 2019 1 次提交
  7. 03 12月, 2018 1 次提交
    • I
      sched: Fix various typos in comments · dfcb245e
      Ingo Molnar 提交于
      Go over the scheduler source code and fix common typos
      in comments - and a typo in an actual variable name.
      
      No change in functionality intended.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      dfcb245e
  8. 28 11月, 2018 2 次提交
    • T
      x86/speculation: Add prctl() control for indirect branch speculation · 9137bb27
      Thomas Gleixner 提交于
      Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
      PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
      indirect branch speculation via STIBP and IBPB.
      
      Invocations:
       Check indirect branch speculation status with
       - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
      
       Enable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
      
       Disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
      
       Force disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
      
      See Documentation/userspace-api/spec_ctrl.rst.
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de
      9137bb27
    • S
      function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack · 39eb456d
      Steven Rostedt (VMware) 提交于
      Currently, the depth of the ret_stack is determined by curr_ret_stack index.
      The issue is that there's a race between setting of the curr_ret_stack and
      calling of the callback attached to the return of the function.
      
      Commit 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling
      trace return callback") moved the calling of the callback to after the
      setting of the curr_ret_stack, even stating that it was safe to do so, when
      in fact, it was the reason there was a barrier() there (yes, I should have
      commented that barrier()).
      
      Not only does the curr_ret_stack keep track of the current call graph depth,
      it also keeps the ret_stack content from being overwritten by new data.
      
      The function profiler, uses the "subtime" variable of ret_stack structure
      and by moving the curr_ret_stack, it allows for interrupts to use the same
      structure it was using, corrupting the data, and breaking the profiler.
      
      To fix this, there needs to be two variables to handle the call stack depth
      and the pointer to where the ret_stack is being used, as they need to change
      at two different locations.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      39eb456d
  9. 23 11月, 2018 1 次提交
    • B
      x86/resctrl: Rename the config option INTEL_RDT to RESCTRL · 6fe07ce3
      Babu Moger 提交于
      The resource control feature is supported by both Intel and AMD. So,
      rename CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL.
      
      Now CONFIG_RESCTRL will be used for both Intel and AMD to enable
      Resource Control support. Update the texts in config and condition
      accordingly.
      
       [ bp: Simplify Kconfig text. ]
      Signed-off-by: NBabu Moger <babu.moger@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <linux-doc@vger.kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: <qianyue.zj@alibaba-inc.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Rian Hunter <rian@alum.mit.edu>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: <xiaochen.shen@intel.com>
      Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com
      6fe07ce3
  10. 13 11月, 2018 1 次提交
    • P
      rcu: Speed up expedited GPs when interrupting RCU reader · 05f41571
      Paul E. McKenney 提交于
      In PREEMPT kernels, an expedited grace period might send an IPI to a
      CPU that is executing an RCU read-side critical section.  In that case,
      it would be nice if the rcu_read_unlock() directly interacted with the
      RCU core code to immediately report the quiescent state.  And this does
      happen in the case where the reader has been preempted.  But it would
      also be a nice performance optimization if immediate reporting also
      happened in the preemption-free case.
      
      This commit therefore adds an ->exp_hint field to the task_struct structure's
      ->rcu_read_unlock_special field.  The IPI handler sets this hint when
      it has interrupted an RCU read-side critical section, and this causes
      the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
      which, if preemption is enabled, reports the quiescent state immediately.
      If preemption is disabled, then the report is required to be deferred
      until preemption (or bottom halves or interrupts or whatever) is re-enabled.
      
      Because this is a hint, it does nothing for more complicated cases.  For
      example, if the IPI interrupts an RCU reader, but interrupts are disabled
      across the rcu_read_unlock(), but another rcu_read_lock() is executed
      before interrupts are re-enabled, the hint will already have been cleared.
      If you do crazy things like this, reporting will be deferred until some
      later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
      Reported-by: NJoel Fernandes <joel@joelfernandes.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      Acked-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      05f41571
  11. 27 10月, 2018 2 次提交
  12. 03 10月, 2018 1 次提交
    • E
      signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
      Eric W. Biederman 提交于
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
      So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
      traditional name for the userspace definition.  While the version that
      is used internally to the kernel and ultimately will not be padded to
      128 bytes is called kernel_siginfo.
      
      The definition of struct kernel_siginfo I have put in include/signal_types.h
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
      
      To make it easy to verify the change kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      ae7795bc
  13. 05 9月, 2018 2 次提交
    • A
      fs/proc: Show STACKLEAK metrics in the /proc file system · c8d12627
      Alexander Popov 提交于
      Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
      tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
      shows the maximum kernel stack consumption for the current and previous
      syscalls. Although this information is not precise, it can be useful for
      estimating the STACKLEAK performance impact for your workloads.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NAlexander Popov <alex.popov@linux.com>
      Tested-by: NLaura Abbott <labbott@redhat.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      c8d12627
    • A
      x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls · afaef01c
      Alexander Popov 提交于
      The STACKLEAK feature (initially developed by PaX Team) has the following
      benefits:
      
      1. Reduces the information that can be revealed through kernel stack leak
         bugs. The idea of erasing the thread stack at the end of syscalls is
         similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
         crypto, which all comply with FDP_RIP.2 (Full Residual Information
         Protection) of the Common Criteria standard.
      
      2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
         CVE-2010-2963). That kind of bugs should be killed by improving C
         compilers in future, which might take a long time.
      
      This commit introduces the code filling the used part of the kernel
      stack with a poison value before returning to userspace. Full
      STACKLEAK feature also contains the gcc plugin which comes in a
      separate commit.
      
      The STACKLEAK feature is ported from grsecurity/PaX. More information at:
        https://grsecurity.net/
        https://pax.grsecurity.net/
      
      This code is modified from Brad Spengler/PaX Team's code in the last
      public patch of grsecurity/PaX based on our understanding of the code.
      Changes or omissions from the original code are ours and don't reflect
      the original grsecurity/PaX code.
      
      Performance impact:
      
      Hardware: Intel Core i7-4770, 16 GB RAM
      
      Test #1: building the Linux kernel on a single core
              0.91% slowdown
      
      Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
              4.2% slowdown
      
      So the STACKLEAK description in Kconfig includes: "The tradeoff is the
      performance impact: on a single CPU system kernel compilation sees a 1%
      slowdown, other systems and workloads may vary and you are advised to
      test this feature on your expected workload before deploying it".
      Signed-off-by: NAlexander Popov <alex.popov@linux.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      afaef01c
  14. 03 9月, 2018 1 次提交
    • J
      x86/fault: BUG() when uaccess helpers fault on kernel addresses · 9da3f2b7
      Jann Horn 提交于
      There have been multiple kernel vulnerabilities that permitted userspace to
      pass completely unchecked pointers through to userspace accessors:
      
       - the waitid() bug - commit 96ca579a ("waitid(): Add missing
         access_ok() checks")
       - the sg/bsg read/write APIs
       - the infiniband read/write APIs
      
      These don't happen all that often, but when they do happen, it is hard to
      test for them properly; and it is probably also hard to discover them with
      fuzzing. Even when an unmapped kernel address is supplied to such buggy
      code, it just returns -EFAULT instead of doing a proper BUG() or at least
      WARN().
      
      Try to make such misbehaving code a bit more visible by refusing to do a
      fixup in the pagefault handler code when a userspace accessor causes a #PF
      on a kernel address and the current context isn't whitelisted.
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: dvyukov@google.com
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
      9da3f2b7
  15. 31 8月, 2018 1 次提交
  16. 23 8月, 2018 1 次提交
    • D
      kernel/hung_task.c: allow to set checking interval separately from timeout · a2e51445
      Dmitry Vyukov 提交于
      Currently task hung checking interval is equal to timeout, as the result
      hung is detected anywhere between timeout and 2*timeout.  This is fine for
      most interactive environments, but this hurts automated testing setups
      (syzbot).  In an automated setup we need to strictly order CPU lockup <
      RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
      is not detected as task hung and task hung is not detected as silent
      machine loss.  The large variance in task hung detection timeout requires
      setting silent machine loss timeout to a very large value (e.g.  if task
      hung is 3 mins, then silent loss need to be set to ~7 mins).  The
      additional 3 minutes significantly reduce testing efficiency because
      usually we crash kernel within a minute, and this can add hours to bug
      localization process as it needs to do dozens of tests.
      
      Allow setting checking interval separately from timeout.  This allows to
      set timeout to, say, 3 minutes, but checking interval to 10 secs.
      
      The interval is controlled via a new hung_task_check_interval_secs sysctl,
      similar to the existing hung_task_timeout_secs sysctl.  The default value
      of 0 results in the current behavior: checking interval is equal to
      timeout.
      
      [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
      Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2e51445
  17. 18 8月, 2018 3 次提交
    • K
      mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai 提交于
      Introduce new config option, which is used to replace repeating
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
      memcg+kmem related code, so let's keep the defines more clearly.
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • M
      memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko 提交于
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is is OOM.
      This is behavior is inconsistent with !memcg case where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to the users space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent from same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior is hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The later
      might allow runaways but those should be really unlikely unless somebody
      misconfigures the system.  E.g.  allowing to migrate tasks away from the
      memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt 提交于
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
  18. 25 7月, 2018 1 次提交
  19. 21 7月, 2018 3 次提交
    • E
      pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman 提交于
      Everywhere except in the pid array we distinguish between a tasks pid and
      a tasks tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to it's fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      6883f81a
    • E
      pids: Move the pgrp and session pid pointers from task_struct to signal_struct · 2c470475
      Eric W. Biederman 提交于
      To access these fields the code always has to go to group leader so
      going to signal struct is no loss and is actually a fundamental simplification.
      
      This saves a little bit of memory by only allocating the pid pointer array
      once instead of once for every thread, and even better this removes a
      few potential races caused by the fact that group_leader can be changed
      by de_thread, while signal_struct can not.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      2c470475
    • E
      pids: Compute task_tgid using signal->leader_pid · 7a36094d
      Eric W. Biederman 提交于
      The cost is the the same and this removes the need
      to worry about complications that come from de_thread
      and group_leader changing.
      
      __task_pid_nr_ns has been updated to take advantage of this change.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      7a36094d
  20. 17 7月, 2018 1 次提交
  21. 09 7月, 2018 1 次提交
    • J
      blkcg: add generic throttling mechanism · d09d8df3
      Josef Bacik 提交于
      Since IO can be issued from literally anywhere it's almost impossible to
      do throttling without having some sort of adverse effect somewhere else
      in the system because of locking or other dependencies.  The best way to
      solve this is to do the throttling when we know we aren't holding any
      other kernel resources.  Do this by tracking throttling in a per-blkg
      basis, and if we require throttling flag the task that it needs to check
      before it returns to user space and possibly sleep there.
      
      This is to address the case where a process is doing work that is
      generating IO that can't be throttled, whether that is directly with a
      lot of REQ_META IO, or indirectly by allocating so much memory that it
      is swamping the disk with REQ_SWAP.  We can't use task_add_work as we
      don't want to induce a memory allocation in the IO path, so simply
      saving the request queue in the task and flagging it to do the
      notify_resume thing achieves the same result without the overhead of a
      memory allocation.
      Signed-off-by: NJosef Bacik <jbacik@fb.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d09d8df3
  22. 03 7月, 2018 1 次提交
    • P
      kthread, sched/core: Fix kthread_parkme() (again...) · 1cef1150
      Peter Zijlstra 提交于
      Gaurav reports that commit:
      
        85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      
      isn't working for him. Because of the following race:
      
      > controller Thread                               CPUHP Thread
      > takedown_cpu
      > kthread_park
      > kthread_parkme
      > Set KTHREAD_SHOULD_PARK
      >                                                 smpboot_thread_fn
      >                                                 set Task interruptible
      >
      >
      > wake_up_process
      >  if (!(p->state & state))
      >                 goto out;
      >
      >                                                 Kthread_parkme
      >                                                 SET TASK_PARKED
      >                                                 schedule
      >                                                 raw_spin_lock(&rq->lock)
      > ttwu_remote
      > waiting for __task_rq_lock
      >                                                 context_switch
      >
      >                                                 finish_lock_switch
      >
      >
      >
      >                                                 Case TASK_PARKED
      >                                                 kthread_park_complete
      >
      >
      > SET Running
      
      Furthermore, Oleg noticed that the whole scheduler TASK_PARKED
      handling is buggered because the TASK_DEAD thing is done with
      preemption disabled, the current code can still complete early on
      preemption :/
      
      So basically revert that earlier fix and go with a variant of the
      alternative mentioned in the commit. Promote TASK_PARKED to special
      state to avoid the store-store issue on task->state leading to the
      WARN in kthread_unpark() -> __kthread_bind().
      
      But in addition, add wait_task_inactive() to kthread_park() to ensure
      the task really is PARKED when we return from kthread_park(). This
      avoids the whole kthread still gets migrated nonsense -- although it
      would be really good to get this done differently.
      Reported-by: NGaurav Kohli <gkohli@codeaurora.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 85f1abe0 ("kthread, sched/wait: Fix kthread_parkme() completion issue")
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      1cef1150
  23. 23 6月, 2018 1 次提交
    • W
      rseq: Avoid infinite recursion when delivering SIGSEGV · 784e0300
      Will Deacon 提交于
      When delivering a signal to a task that is using rseq, we call into
      __rseq_handle_notify_resume() so that the registers pushed in the
      sigframe are updated to reflect the state of the restartable sequence
      (for example, ensuring that the signal returns to the abort handler if
      necessary).
      
      However, if the rseq management fails due to an unrecoverable fault when
      accessing userspace or certain combinations of RSEQ_CS_* flags, then we
      will attempt to deliver a SIGSEGV. This has the potential for infinite
      recursion if the rseq code continuously fails on signal delivery.
      
      Avoid this problem by using force_sigsegv() instead of force_sig(), which
      is explicitly designed to reset the SEGV handler to SIG_DFL in the case
      of a recursive fault. In doing so, remove rseq_signal_deliver() from the
      internal rseq API and have an optional struct ksignal * parameter to
      rseq_handle_notify_resume() instead.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: peterz@infradead.org
      Cc: paulmck@linux.vnet.ibm.com
      Cc: boqun.feng@gmail.com
      Link: https://lkml.kernel.org/r/1529664307-983-1-git-send-email-will.deacon@arm.com
      784e0300
  24. 21 6月, 2018 1 次提交
    • M
      rseq/cleanup: Do not abort rseq c.s. in child on fork() · 9a789fcf
      Mathieu Desnoyers 提交于
      Considering that we explicitly forbid system calls in rseq critical
      sections, it is not valid to issue a fork or clone system call within a
      rseq critical section, so rseq_fork() is not required to restart an
      active rseq c.s. in the child process.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kselftest@vger.kernel.org
      Link: https://lore.kernel.org/lkml/20180619133230.4087-4-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9a789fcf
  25. 15 6月, 2018 1 次提交
    • M
      sched/core / kcov: avoid kcov_area during task switch · 0ed557aa
      Mark Rutland 提交于
      During a context switch, we first switch_mm() to the next task's mm,
      then switch_to() that new task.  This means that vmalloc'd regions which
      had previously been faulted in can transiently disappear in the context
      of the prev task.
      
      Functions instrumented by KCOV may try to access a vmalloc'd kcov_area
      during this window, and as the fault handling code is instrumented, this
      results in a recursive fault.
      
      We must avoid accessing any kcov_area during this window.  We can do so
      with a new flag in kcov_mode, set prior to switching the mm, and cleared
      once the new task is live.  Since task_struct::kcov_mode isn't always a
      specific enum kcov_mode value, this is made an unsigned int.
      
      The manipulation is hidden behind kcov_{prepare,finish}_switch() helpers,
      which are empty for !CONFIG_KCOV kernels.
      
      The code uses macros because I can't use static inline functions without a
      circular include dependency between <linux/sched.h> and <linux/kcov.h>,
      since the definition of task_struct uses things defined in <linux/kcov.h>
      
      Link: http://lkml.kernel.org/r/20180504135535.53744-4-mark.rutland@arm.comSigned-off-by: NMark Rutland <mark.rutland@arm.com>
      Acked-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ed557aa
  26. 14 6月, 2018 1 次提交
    • L
      Kbuild: rename CC_STACKPROTECTOR[_STRONG] config variables · 050e9baa
      Linus Torvalds 提交于
      The changes to automatically test for working stack protector compiler
      support in the Kconfig files removed the special STACKPROTECTOR_AUTO
      option that picked the strongest stack protector that the compiler
      supported.
      
      That was all a nice cleanup - it makes no sense to have the AUTO case
      now that the Kconfig phase can just determine the compiler support
      directly.
      
      HOWEVER.
      
      It also meant that doing "make oldconfig" would now _disable_ the strong
      stackprotector if you had AUTO enabled, because in a legacy config file,
      the sane stack protector configuration would look like
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_NONE is not set
        # CONFIG_CC_STACKPROTECTOR_REGULAR is not set
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_STACKPROTECTOR_AUTO=y
      
      and when you ran this through "make oldconfig" with the Kbuild changes,
      it would ask you about the regular CONFIG_CC_STACKPROTECTOR (that had
      been renamed from CONFIG_CC_STACKPROTECTOR_REGULAR to just
      CONFIG_CC_STACKPROTECTOR), but it would think that the STRONG version
      used to be disabled (because it was really enabled by AUTO), and would
      disable it in the new config, resulting in:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_CC_STACKPROTECTOR=y
        # CONFIG_CC_STACKPROTECTOR_STRONG is not set
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      That's dangerously subtle - people could suddenly find themselves with
      the weaker stack protector setup without even realizing.
      
      The solution here is to just rename not just the old RECULAR stack
      protector option, but also the strong one.  This does that by just
      removing the CC_ prefix entirely for the user choices, because it really
      is not about the compiler support (the compiler support now instead
      automatially impacts _visibility_ of the options to users).
      
      This results in "make oldconfig" actually asking the user for their
      choice, so that we don't have any silent subtle security model changes.
      The end result would generally look like this:
      
        CONFIG_HAVE_CC_STACKPROTECTOR=y
        CONFIG_CC_HAS_STACKPROTECTOR_NONE=y
        CONFIG_STACKPROTECTOR=y
        CONFIG_STACKPROTECTOR_STRONG=y
        CONFIG_CC_HAS_SANE_STACKPROTECTOR=y
      
      where the "CC_" versions really are about internal compiler
      infrastructure, not the user selections.
      Acked-by: NMasahiro Yamada <yamada.masahiro@socionext.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      050e9baa
  27. 06 6月, 2018 1 次提交
    • M
      rseq: Introduce restartable sequences system call · d7822b1e
      Mathieu Desnoyers 提交于
      Expose a new system call allowing each thread to register one userspace
      memory area to be used as an ABI between kernel and user-space for two
      purposes: user-space restartable sequences and quick access to read the
      current CPU number value from user-space.
      
      * Restartable sequences (per-cpu atomics)
      
      Restartables sequences allow user-space to perform update operations on
      per-cpu data without requiring heavy-weight atomic operations.
      
      The restartable critical sections (percpu atomics) work has been started
      by Paul Turner and Andrew Hunter. It lets the kernel handle restart of
      critical sections. [1] [2] The re-implementation proposed here brings a
      few simplifications to the ABI which facilitates porting to other
      architectures and speeds up the user-space fast path.
      
      Here are benchmarks of various rseq use-cases.
      
      Test hardware:
      
      arm32: ARMv7 Processor rev 4 (v7l) "Cubietruck", 2-core
      x86-64: Intel E5-2630 v3@2.40GHz, 16-core, hyperthreading
      
      The following benchmarks were all performed on a single thread.
      
      * Per-CPU statistic counter increment
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                344.0                 31.4          11.0
      x86-64:                15.3                  2.0           7.7
      
      * LTTng-UST: write event 32-bit header, 32-bit payload into tracer
                   per-cpu buffer
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:               2502.0                 2250.0         1.1
      x86-64:               117.4                   98.0         1.2
      
      * liburcu percpu: lock-unlock pair, dereference, read/compare word
      
                      getcpu+atomic (ns/op)    rseq (ns/op)    speedup
      arm32:                751.0                 128.5          5.8
      x86-64:                53.4                  28.6          1.9
      
      * jemalloc memory allocator adapted to use rseq
      
      Using rseq with per-cpu memory pools in jemalloc at Facebook (based on
      rseq 2016 implementation):
      
      The production workload response-time has 1-2% gain avg. latency, and
      the P99 overall latency drops by 2-3%.
      
      * Reading the current CPU number
      
      Speeding up reading the current CPU number on which the caller thread is
      running is done by keeping the current CPU number up do date within the
      cpu_id field of the memory area registered by the thread. This is done
      by making scheduler preemption set the TIF_NOTIFY_RESUME flag on the
      current thread. Upon return to user-space, a notify-resume handler
      updates the current CPU value within the registered user-space memory
      area. User-space can then read the current CPU number directly from
      memory.
      
      Keeping the current cpu id in a memory area shared between kernel and
      user-space is an improvement over current mechanisms available to read
      the current CPU number, which has the following benefits over
      alternative approaches:
      
      - 35x speedup on ARM vs system call through glibc
      - 20x speedup on x86 compared to calling glibc, which calls vdso
        executing a "lsl" instruction,
      - 14x speedup on x86 compared to inlined "lsl" instruction,
      - Unlike vdso approaches, this cpu_id value can be read from an inline
        assembly, which makes it a useful building block for restartable
        sequences.
      - The approach of reading the cpu id through memory mapping shared
        between kernel and user-space is portable (e.g. ARM), which is not the
        case for the lsl-based x86 vdso.
      
      On x86, yet another possible approach would be to use the gs segment
      selector to point to user-space per-cpu data. This approach performs
      similarly to the cpu id cache, but it has two disadvantages: it is
      not portable, and it is incompatible with existing applications already
      using the gs segment selector for other purposes.
      
      Benchmarking various approaches for reading the current CPU number:
      
      ARMv7 Processor rev 4 (v7l)
      Machine model: Cubietruck
      - Baseline (empty loop):                                    8.4 ns
      - Read CPU from rseq cpu_id:                               16.7 ns
      - Read CPU from rseq cpu_id (lazy register):               19.8 ns
      - glibc 2.19-0ubuntu6.6 getcpu:                           301.8 ns
      - getcpu system call:                                     234.9 ns
      
      x86-64 Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz:
      - Baseline (empty loop):                                    0.8 ns
      - Read CPU from rseq cpu_id:                                0.8 ns
      - Read CPU from rseq cpu_id (lazy register):                0.8 ns
      - Read using gs segment selector:                           0.8 ns
      - "lsl" inline assembly:                                   13.0 ns
      - glibc 2.19-0ubuntu6 getcpu:                              16.6 ns
      - getcpu system call:                                      53.9 ns
      
      - Speed (benchmark taken on v8 of patchset)
      
      Running 10 runs of hackbench -l 100000 seems to indicate, contrary to
      expectations, that enabling CONFIG_RSEQ slightly accelerates the
      scheduler:
      
      Configuration: 2 sockets * 8-core Intel(R) Xeon(R) CPU E5-2630 v3 @
      2.40GHz (directly on hardware, hyperthreading disabled in BIOS, energy
      saving disabled in BIOS, turboboost disabled in BIOS, cpuidle.off=1
      kernel parameter), with a Linux v4.6 defconfig+localyesconfig,
      restartable sequences series applied.
      
      * CONFIG_RSEQ=n
      
      avg.:      41.37 s
      std.dev.:   0.36 s
      
      * CONFIG_RSEQ=y
      
      avg.:      40.46 s
      std.dev.:   0.33 s
      
      - Size
      
      On x86-64, between CONFIG_RSEQ=n/y, the text size increase of vmlinux is
      567 bytes, and the data size increase of vmlinux is 5696 bytes.
      
      [1] https://lwn.net/Articles/650333/
      [2] http://www.linuxplumbersconf.org/2013/ocw/system/presentations/1695/original/LPC%20-%20PerCpu%20Atomics.pdfSigned-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-api@vger.kernel.org
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20151027235635.16059.11630.stgit@pjt-glaptop.roam.corp.google.com
      Link: http://lkml.kernel.org/r/20150624222609.6116.86035.stgit@kitami.mtv.corp.google.com
      Link: https://lkml.kernel.org/r/20180602124408.8430-3-mathieu.desnoyers@efficios.com
      d7822b1e
  28. 25 5月, 2018 1 次提交
  29. 16 5月, 2018 1 次提交