1. 25 6月, 2019 4 次提交
    • P
      sched/uclamp: Extend sched_setattr() to support utilization clamping · a509a7cd
      Patrick Bellasi 提交于
      The SCHED_DEADLINE scheduling class provides an advanced and formal
      model to define tasks requirements that can translate into proper
      decisions for both task placements and frequencies selections. Other
      classes have a more simplified model based on the POSIX concept of
      priorities.
      
      Such a simple priority based model however does not allow to exploit
      most advanced features of the Linux scheduler like, for example, driving
      frequencies selection via the schedutil cpufreq governor. However, also
      for non SCHED_DEADLINE tasks, it's still interesting to define tasks
      properties to support scheduler decisions.
      
      Utilization clamping exposes to user-space a new set of per-task
      attributes the scheduler can use as hints about the expected/required
      utilization for a task. This allows to implement a "proactive" per-task
      frequency control policy, a more advanced policy than the current one
      based just on "passive" measured task utilization. For example, it's
      possible to boost interactive tasks (e.g. to get better performance) or
      cap background tasks (e.g. to be more energy/thermal efficient).
      
      Introduce a new API to set utilization clamping values for a specified
      task by extending sched_setattr(), a syscall which already allows to
      define task specific properties for different scheduling classes. A new
      pair of attributes allows to specify a minimum and maximum utilization
      the scheduler can consider for a task.
      
      Do that by validating the required clamp values before and then applying
      the required changes using _the_ same pattern already in use for
      __setscheduler(). This ensures that the task is re-enqueued with the new
      clamp values.
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a509a7cd
    • P
      sched/uclamp: Add system default clamps · e8f14172
      Patrick Bellasi 提交于
      Tasks without a user-defined clamp value are considered not clamped
      and by default their utilization can have any value in the
      [0..SCHED_CAPACITY_SCALE] range.
      
      Tasks with a user-defined clamp value are allowed to request any value
      in that range, and the required clamp is unconditionally enforced.
      However, a "System Management Software" could be interested in limiting
      the range of clamp values allowed for all tasks.
      
      Add a privileged interface to define a system default configuration via:
      
        /proc/sys/kernel/sched_uclamp_util_{min,max}
      
      which works as an unconditional clamp range restriction for all tasks.
      
      With the default configuration, the full SCHED_CAPACITY_SCALE range of
      values is allowed for each clamp index. Otherwise, the task-specific
      clamp is capped by the corresponding system default value.
      
      Do that by tracking, for each task, the "effective" clamp value and
      bucket the task has been refcounted in at enqueue time. This
      allows to lazy aggregate "requested" and "system default" values at
      enqueue time and simplifies refcounting updates at dequeue time.
      
      The cached bucket ids are used to avoid (relatively) more expensive
      integer divisions every time a task is enqueued.
      
      An active flag is used to report when the "effective" value is valid and
      thus the task is actually refcounted in the corresponding rq's bucket.
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e8f14172
    • P
      sched/uclamp: Add CPU's clamp buckets refcounting · 69842cba
      Patrick Bellasi 提交于
      Utilization clamping allows to clamp the CPU's utilization within a
      [util_min, util_max] range, depending on the set of RUNNABLE tasks on
      that CPU. Each task references two "clamp buckets" defining its minimum
      and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
      bucket is active if there is at least one RUNNABLE tasks enqueued on
      that CPU and refcounting that bucket.
      
      When a task is {en,de}queued {on,from} a rq, the set of active clamp
      buckets on that CPU can change. If the set of active clamp buckets
      changes for a CPU a new "aggregated" clamp value is computed for that
      CPU. This is because each clamp bucket enforces a different utilization
      clamp value.
      
      Clamp values are always MAX aggregated for both util_min and util_max.
      This ensures that no task can affect the performance of other
      co-scheduled tasks which are more boosted (i.e. with higher util_min
      clamp) or less capped (i.e. with higher util_max clamp).
      
      A task has:
         task_struct::uclamp[clamp_id]::bucket_id
      to track the "bucket index" of the CPU's clamp bucket it refcounts while
      enqueued, for each clamp index (clamp_id).
      
      A runqueue has:
         rq::uclamp[clamp_id]::bucket[bucket_id].tasks
      to track how many RUNNABLE tasks on that CPU refcount each
      clamp bucket (bucket_id) of a clamp index (clamp_id).
      It also has a:
         rq::uclamp[clamp_id]::bucket[bucket_id].value
      to track the clamp value of each clamp bucket (bucket_id) of a clamp
      index (clamp_id).
      
      The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
      needed to find a new MAX aggregated clamp value for a clamp_id. This
      operation is required only when it's dequeued the last task of a clamp
      bucket tracking the current MAX aggregated clamp value. In this case,
      the CPU is either entering IDLE or going to schedule a less boosted or
      more clamped task.
      The expected number of different clamp values configured at build time
      is small enough to fit the full unordered array into a single cache
      line, for configurations of up to 7 buckets.
      
      Add to struct rq the basic data structures required to refcount the
      number of RUNNABLE tasks for each clamp bucket. Add also the max
      aggregation required to update the rq's clamp value at each
      enqueue/dequeue event.
      
      Use a simple linear mapping of clamp values into clamp buckets.
      Pre-compute and cache bucket_id to avoid integer divisions at
      enqueue/dequeue time.
      Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      69842cba
    • Q
      sched/debug: Add a new sched_trace_*() helper functions · 3c93a0c0
      Qais Yousef 提交于
      The new functions allow modules to access internal data structures of
      unexported struct cfs_rq and struct rq to extract important information
      from the tracepoints to be introduced in later patches.
      
      While at it fix alphabetical order of struct declarations in sched.h
      Signed-off-by: NQais Yousef <qais.yousef@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
      Link: https://lkml.kernel.org/r/20190604111459.2862-3-qais.yousef@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      3c93a0c0
  2. 03 6月, 2019 1 次提交
  3. 15 5月, 2019 1 次提交
  4. 20 4月, 2019 1 次提交
    • R
      cgroup: cgroup v2 freezer · 76f969e8
      Roman Gushchin 提交于
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen tasks to another cgroup.
      
      This differs cgroup v2 freezer from cgroup v1 freezer, which mostly
      tried to imitate the system-wide freezer. However uninterruptible
      sleep is fine when all tasks are going to be frozen (hibernation case),
      it's not the acceptable state for some subset of the system.
      
      Cgroup v2 freezer is not supporting freezing kthreads.
      If a non-root cgroup contains kthread, the cgroup still can be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH is not working because non-fatal signal delivery
      is blocked in frozen state.
      
      There are some interface differences between cgroup v1 and cgroup v2
      freezer too, which are required to conform the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      No-objection-from-me-by: NOleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
  5. 19 4月, 2019 1 次提交
    • M
      rseq: Remove superfluous rseq_len from task_struct · 83b0b15b
      Mathieu Desnoyers 提交于
      The rseq system call, when invoked with flags of "0" or
      "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
      be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
      specified in uapi linux/rseq.h.
      
      Expecting a fixed size for rseq_len is a design choice that ensures
      multiple libraries and application defining __rseq_abi in the same
      process agree on its exact size.
      
      Considering that this size is and will always be the same value, there
      is no point in saving this value within task_struct rseq_len. Remove
      this field from task_struct.
      
      No change in functionality intended.
      Signed-off-by: NMathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190305194755.2602-3-mathieu.desnoyers@efficios.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      83b0b15b
  6. 06 3月, 2019 2 次提交
    • A
      mm/cma: add PF flag to force non cma alloc · d7fefcc8
      Aneesh Kumar K.V 提交于
      Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
      region", v8.
      
      ppc64 uses the CMA area for the allocation of guest page table (hash
      page table).  We won't be able to start guest if we fail to allocate
      hash page table.  We have observed hash table allocation failure because
      we failed to migrate pages out of CMA region because they were pinned.
      This happen when we are using VFIO.  VFIO on ppc64 pins the entire guest
      RAM.  If the guest RAM pages get allocated out of CMA region, we won't
      be able to migrate those pages.  The pages are also pinned for the
      lifetime of the guest.
      
      Currently we support migration of non-compound pages.  With THP and with
      the addition of hugetlb migration we can end up allocating compound
      pages from CMA region.  This patch series add support for migrating
      compound pages.
      
      This patch (of 4):
      
      Add PF_MEMALLOC_NOCMA which make sure any allocation in that context is
      marked non-movable and hence cannot be satisfied by CMA region.
      
      This is useful with get_user_pages_longterm where we want to take a page
      pin by migrating pages from CMA region.  Marking the section
      PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
      later.
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.comSigned-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d7fefcc8
    • M
      mm, compaction: capture a page under direct compaction · 5e1f0f09
      Mel Gorman 提交于
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected but the devil is in the details.  A
      closer examination indicates that base page fault latency is reduced but
      latency of huge pages is increased as it takes creater care to succeed.
      Part of the "problem" is that allocation success rates are close to 100%
      even when under pressure and compaction gets harder
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      And scan rates are reduced as expected by 6% for the migration scanner
      and 29% for the free scanner indicating that there is less redundant
      work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5e1f0f09
  7. 26 2月, 2019 1 次提交
    • L
      Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds 提交于
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinked into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      
      Do let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53a41cb7
  8. 04 2月, 2019 5 次提交
    • A
      sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · c546951d
      Andrea Parri 提交于
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
      this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
      with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
      Signed-off-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c546951d
    • V
      sched/fair: Update scale invariance of PELT · 23127296
      Vincent Guittot 提交于
      The current implementation of load tracking invariance scales the
      contribution with current frequency and uarch performance (only for
      utilization) of the CPU. One main result of this formula is that the
      figures are capped by current capacity of CPU. Another one is that the
      load_avg is not invariant because not scaled with uarch.
      
      The util_avg of a periodic task that runs r time slots every p time slots
      varies in the range :
      
          U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
      
      with U is the max util_avg value = SCHED_CAPACITY_SCALE
      
      At a lower capacity, the range becomes:
      
          U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
      
      with C reflecting the compute capacity ratio between current capacity and
      max capacity.
      
      so C tries to compensate changes in (1-y^r') but it can't be accurate.
      
      Instead of scaling the contribution value of PELT algo, we should scale the
      running time. The PELT signal aims to track the amount of computation of
      tasks and/or rq so it seems more correct to scale the running time to
      reflect the effective amount of computation done since the last update.
      
      In order to be fully invariant, we need to apply the same amount of
      running time and idle time whatever the current capacity. Because running
      at lower capacity implies that the task will run longer, we have to ensure
      that the same amount of idle time will be applied when system becomes idle
      and no idle time has been "stolen". But reaching the maximum utilization
      value (SCHED_CAPACITY_SCALE) means that the task is seen as an
      always-running task whatever the capacity of the CPU (even at max compute
      capacity). In this case, we can discard this "stolen" idle times which
      becomes meaningless.
      
      In order to achieve this time scaling, a new clock_pelt is created per rq.
      The increase of this clock scales with current capacity when something
      is running on rq and synchronizes with clock_task when rq is idle. With
      this mechanism, we ensure the same running and idle time whatever the
      current capacity. This also enables to simplify the pelt algorithm by
      removing all references of uarch and frequency and applying the same
      contribution to utilization and loads. Furthermore, the scaling is done
      only once per update of clock (update_rq_clock_task()) instead of during
      each update of sched_entities and cfs/rt/dl_rq of the rq like the current
      implementation. This is interesting when cgroup are involved as shown in
      the results below:
      
      On a hikey (octo Arm64 platform).
      Performance cpufreq governor and only shallowest c-state to remove variance
      generated by those power features so we only track the impact of pelt algo.
      
      each test runs 16 times:
      
      	./perf bench sched pipe
      	(higher is better)
      	kernel	tip/sched/core     + patch
      	        ops/seconds        ops/seconds         diff
      	cgroup
      	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
      	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
      	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%
      
      	hackbench -l 1000
      	(lower is better)
      	kernel	tip/sched/core     + patch
      	        duration(sec)      duration(sec)        diff
      	cgroup
      	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
      	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
      	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%
      
      Then, the responsiveness of PELT is improved when CPU is not running at max
      capacity with this new algorithm. I have put below some examples of
      duration to reach some typical load values according to the capacity of the
      CPU with current implementation and with this patch. These values has been
      computed based on the geometric series and the half period value:
      
        Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
        972 (95%)    138ms         not reachable            276ms
        486 (47.5%)  30ms          138ms                     60ms
        256 (25%)    13ms           32ms                     26ms
      
      On my hikey (octo Arm64 platform) with schedutil governor, the time to
      reach max OPP when starting from a null utilization, decreases from 223ms
      with current scale invariance down to 121ms with the new algorithm.
      Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      23127296
    • E
      sched/core: Convert task_struct.stack_refcount to refcount_t · f0b89d39
      Elena Reshetova 提交于
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.stack_refcount is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.stack_refcount it might make a difference
      in following places:
      
       - try_get_task_stack(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_task_stack(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: NHans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      f0b89d39
    • E
      sched/core: Convert task_struct.usage to refcount_t · ec1d2819
      Elena Reshetova 提交于
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situation and be exploitable.
      
      The variable task_struct.usage is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.usage it might make a difference
      in following places:
      
       - put_task_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NElena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NDavid Windsor <dwindsor@gmail.com>
      Reviewed-by: NHans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      ec1d2819
    • R
      audit: remove audit_context when CONFIG_ AUDIT and not AUDITSYSCALL · 5f3d544f
      Richard Guy Briggs 提交于
      Remove audit_context from struct task_struct and struct audit_buffer
      when CONFIG_AUDIT is enabled but CONFIG_AUDITSYSCALL is not.
      
      Also, audit_log_name() (and supporting inode and fcaps functions) should
      have been put back in auditsc.c when soft and hard link logging was
      normalized since it is only used by syscall auditing.
      
      See github issue https://github.com/linux-audit/audit-kernel/issues/105Signed-off-by: NRichard Guy Briggs <rgb@redhat.com>
      Signed-off-by: NPaul Moore <paul@paul-moore.com>
      5f3d544f
  9. 02 2月, 2019 1 次提交
    • J
      x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner 提交于
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
      e6d42931
  10. 30 1月, 2019 2 次提交
    • W
      x86/speculation: Add PR_SPEC_DISABLE_NOEXEC · 71368af9
      Waiman Long 提交于
      With the default SPEC_STORE_BYPASS_SECCOMP/SPEC_STORE_BYPASS_PRCTL mode,
      the TIF_SSBD bit will be inherited when a new task is fork'ed or cloned.
      It will also remain when a new program is execve'ed.
      
      Only certain class of applications (like Java) that can run on behalf of
      multiple users on a single thread will require disabling speculative store
      bypass for security purposes. Those applications will call prctl(2) at
      startup time to disable SSB. They won't rely on the fact the SSB might have
      been disabled. Other applications that don't need SSBD will just move on
      without checking if SSBD has been turned on or not.
      
      The fact that the TIF_SSBD is inherited across execve(2) boundary will
      cause performance of applications that don't need SSBD but their
      predecessors have SSBD on to be unwittingly impacted especially if they
      write to memory a lot.
      
      To remedy this problem, a new PR_SPEC_DISABLE_NOEXEC argument for the
      PR_SET_SPECULATION_CTRL option of prctl(2) is added to allow applications
      to specify that the SSBD feature bit on the task structure should be
      cleared whenever a new program is being execve'ed.
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NWaiman Long <longman@redhat.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: https://lkml.kernel.org/r/1547676096-3281-1-git-send-email-longman@redhat.com
      71368af9
    • T
      sched: Remove stale PF_MUTEX_TESTER bit · 15917dc0
      Thomas Gleixner 提交于
      The RTMUTEX tester was removed long ago but the PF bit stayed
      around. Remove it and free up the space.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      15917dc0
  11. 26 1月, 2019 1 次提交
  12. 12 1月, 2019 1 次提交
    • T
      umh: add exit routine for UMH process · 73ab1cb2
      Taehee Yoo 提交于
      A UMH process which is created by the fork_usermode_blob() such as
      bpfilter needs to release members of the umh_info when process is
      terminated.
      But the do_exit() does not release members of the umh_info. hence module
      which uses UMH needs own code to detect whether UMH process is
      terminated or not.
      But this implementation needs extra code for checking the status of
      UMH process. it eventually makes the code more complex.
      
      The new PF_UMH flag is added and it is used to identify UMH processes.
      The exit_umh() does not release members of the umh_info.
      Hence umh_info->cleanup callback should release both members of the
      umh_info and the private data.
      Suggested-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NTaehee Yoo <ap420073@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      73ab1cb2
  13. 09 1月, 2019 1 次提交
  14. 03 12月, 2018 1 次提交
    • I
      sched: Fix various typos in comments · dfcb245e
      Ingo Molnar 提交于
      Go over the scheduler source code and fix common typos
      in comments - and a typo in an actual variable name.
      
      No change in functionality intended.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      dfcb245e
  15. 28 11月, 2018 2 次提交
    • T
      x86/speculation: Add prctl() control for indirect branch speculation · 9137bb27
      Thomas Gleixner 提交于
      Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
      PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
      indirect branch speculation via STIBP and IBPB.
      
      Invocations:
       Check indirect branch speculation status with
       - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
      
       Enable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
      
       Disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
      
       Force disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
      
      See Documentation/userspace-api/spec_ctrl.rst.
      Signed-off-by: NTim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de
      9137bb27
    • S
      function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack · 39eb456d
      Steven Rostedt (VMware) 提交于
      Currently, the depth of the ret_stack is determined by curr_ret_stack index.
      The issue is that there's a race between setting of the curr_ret_stack and
      calling of the callback attached to the return of the function.
      
      Commit 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling
      trace return callback") moved the calling of the callback to after the
      setting of the curr_ret_stack, even stating that it was safe to do so, when
      in fact, it was the reason there was a barrier() there (yes, I should have
      commented that barrier()).
      
      Not only does the curr_ret_stack keep track of the current call graph depth,
      it also keeps the ret_stack content from being overwritten by new data.
      
      The function profiler, uses the "subtime" variable of ret_stack structure
      and by moving the curr_ret_stack, it allows for interrupts to use the same
      structure it was using, corrupting the data, and breaking the profiler.
      
      To fix this, there needs to be two variables to handle the call stack depth
      and the pointer to where the ret_stack is being used, as they need to change
      at two different locations.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: NMasami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
      39eb456d
  16. 23 11月, 2018 1 次提交
    • B
      x86/resctrl: Rename the config option INTEL_RDT to RESCTRL · 6fe07ce3
      Babu Moger 提交于
      The resource control feature is supported by both Intel and AMD. So,
      rename CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL.
      
      Now CONFIG_RESCTRL will be used for both Intel and AMD to enable
      Resource Control support. Update the texts in config and condition
      accordingly.
      
       [ bp: Simplify Kconfig text. ]
      Signed-off-by: NBabu Moger <babu.moger@amd.com>
      Signed-off-by: NBorislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <linux-doc@vger.kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: <qianyue.zj@alibaba-inc.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Rian Hunter <rian@alum.mit.edu>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: <xiaochen.shen@intel.com>
      Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com
      6fe07ce3
  17. 13 11月, 2018 1 次提交
    • P
      rcu: Speed up expedited GPs when interrupting RCU reader · 05f41571
      Paul E. McKenney 提交于
      In PREEMPT kernels, an expedited grace period might send an IPI to a
      CPU that is executing an RCU read-side critical section.  In that case,
      it would be nice if the rcu_read_unlock() directly interacted with the
      RCU core code to immediately report the quiescent state.  And this does
      happen in the case where the reader has been preempted.  But it would
      also be a nice performance optimization if immediate reporting also
      happened in the preemption-free case.
      
      This commit therefore adds an ->exp_hint field to the task_struct structure's
      ->rcu_read_unlock_special field.  The IPI handler sets this hint when
      it has interrupted an RCU read-side critical section, and this causes
      the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
      which, if preemption is enabled, reports the quiescent state immediately.
      If preemption is disabled, then the report is required to be deferred
      until preemption (or bottom halves or interrupts or whatever) is re-enabled.
      
      Because this is a hint, it does nothing for more complicated cases.  For
      example, if the IPI interrupts an RCU reader, but interrupts are disabled
      across the rcu_read_unlock(), but another rcu_read_lock() is executed
      before interrupts are re-enabled, the hint will already have been cleared.
      If you do crazy things like this, reporting will be deferred until some
      later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
      Reported-by: NJoel Fernandes <joel@joelfernandes.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>
      Acked-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
      05f41571
  18. 27 10月, 2018 2 次提交
  19. 03 10月, 2018 1 次提交
    • E
      signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
      Eric W. Biederman 提交于
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
      So split siginfo in two; kernel_siginfo and siginfo.  Keeping the
      traditional name for the userspace definition.  While the version that
      is used internally to the kernel and ultimately will not be padded to
      128 bytes is called kernel_siginfo.
      
      The definition of struct kernel_siginfo I have put in include/signal_types.h
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
      
      To make it easy to verify the change kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      ae7795bc
  20. 05 9月, 2018 2 次提交
    • A
      fs/proc: Show STACKLEAK metrics in the /proc file system · c8d12627
      Alexander Popov 提交于
      Introduce CONFIG_STACKLEAK_METRICS providing STACKLEAK information about
      tasks via the /proc file system. In particular, /proc/<pid>/stack_depth
      shows the maximum kernel stack consumption for the current and previous
      syscalls. Although this information is not precise, it can be useful for
      estimating the STACKLEAK performance impact for your workloads.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NAlexander Popov <alex.popov@linux.com>
      Tested-by: NLaura Abbott <labbott@redhat.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      c8d12627
    • A
      x86/entry: Add STACKLEAK erasing the kernel stack at the end of syscalls · afaef01c
      Alexander Popov 提交于
      The STACKLEAK feature (initially developed by PaX Team) has the following
      benefits:
      
      1. Reduces the information that can be revealed through kernel stack leak
         bugs. The idea of erasing the thread stack at the end of syscalls is
         similar to CONFIG_PAGE_POISONING and memzero_explicit() in kernel
         crypto, which all comply with FDP_RIP.2 (Full Residual Information
         Protection) of the Common Criteria standard.
      
      2. Blocks some uninitialized stack variable attacks (e.g. CVE-2017-17712,
         CVE-2010-2963). That kind of bugs should be killed by improving C
         compilers in future, which might take a long time.
      
      This commit introduces the code filling the used part of the kernel
      stack with a poison value before returning to userspace. Full
      STACKLEAK feature also contains the gcc plugin which comes in a
      separate commit.
      
      The STACKLEAK feature is ported from grsecurity/PaX. More information at:
        https://grsecurity.net/
        https://pax.grsecurity.net/
      
      This code is modified from Brad Spengler/PaX Team's code in the last
      public patch of grsecurity/PaX based on our understanding of the code.
      Changes or omissions from the original code are ours and don't reflect
      the original grsecurity/PaX code.
      
      Performance impact:
      
      Hardware: Intel Core i7-4770, 16 GB RAM
      
      Test #1: building the Linux kernel on a single core
              0.91% slowdown
      
      Test #2: hackbench -s 4096 -l 2000 -g 15 -f 25 -P
              4.2% slowdown
      
      So the STACKLEAK description in Kconfig includes: "The tradeoff is the
      performance impact: on a single CPU system kernel compilation sees a 1%
      slowdown, other systems and workloads may vary and you are advised to
      test this feature on your expected workload before deploying it".
      Signed-off-by: NAlexander Popov <alex.popov@linux.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDave Hansen <dave.hansen@linux.intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      afaef01c
  21. 03 9月, 2018 1 次提交
    • J
      x86/fault: BUG() when uaccess helpers fault on kernel addresses · 9da3f2b7
      Jann Horn 提交于
      There have been multiple kernel vulnerabilities that permitted userspace to
      pass completely unchecked pointers through to userspace accessors:
      
       - the waitid() bug - commit 96ca579a ("waitid(): Add missing
         access_ok() checks")
       - the sg/bsg read/write APIs
       - the infiniband read/write APIs
      
      These don't happen all that often, but when they do happen, it is hard to
      test for them properly; and it is probably also hard to discover them with
      fuzzing. Even when an unmapped kernel address is supplied to such buggy
      code, it just returns -EFAULT instead of doing a proper BUG() or at least
      WARN().
      
      Try to make such misbehaving code a bit more visible by refusing to do a
      fixup in the pagefault handler code when a userspace accessor causes a #PF
      on a kernel address and the current context isn't whitelisted.
      Signed-off-by: NJann Horn <jannh@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Tested-by: NKees Cook <keescook@chromium.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: kernel-hardening@lists.openwall.com
      Cc: dvyukov@google.com
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: Borislav Petkov <bp@alien8.de>
      Link: https://lkml.kernel.org/r/20180828201421.157735-7-jannh@google.com
      9da3f2b7
  22. 31 8月, 2018 1 次提交
  23. 23 8月, 2018 1 次提交
    • D
      kernel/hung_task.c: allow to set checking interval separately from timeout · a2e51445
      Dmitry Vyukov 提交于
      Currently task hung checking interval is equal to timeout, as the result
      hung is detected anywhere between timeout and 2*timeout.  This is fine for
      most interactive environments, but this hurts automated testing setups
      (syzbot).  In an automated setup we need to strictly order CPU lockup <
      RCU stall < workqueue lockup < task hung < silent loss, so that RCU stall
      is not detected as task hung and task hung is not detected as silent
      machine loss.  The large variance in task hung detection timeout requires
      setting silent machine loss timeout to a very large value (e.g.  if task
      hung is 3 mins, then silent loss need to be set to ~7 mins).  The
      additional 3 minutes significantly reduce testing efficiency because
      usually we crash kernel within a minute, and this can add hours to bug
      localization process as it needs to do dozens of tests.
      
      Allow setting checking interval separately from timeout.  This allows to
      set timeout to, say, 3 minutes, but checking interval to 10 secs.
      
      The interval is controlled via a new hung_task_check_interval_secs sysctl,
      similar to the existing hung_task_timeout_secs sysctl.  The default value
      of 0 results in the current behavior: checking interval is equal to
      timeout.
      
      [akpm@linux-foundation.org: update hung_task_timeout_max's comment]
      Link: http://lkml.kernel.org/r/20180611111004.203513-1-dvyukov@google.comSigned-off-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a2e51445
  24. 18 8月, 2018 3 次提交
    • K
      mm: introduce CONFIG_MEMCG_KMEM as combination of CONFIG_MEMCG && !CONFIG_SLOB · 84c07d11
      Kirill Tkhai 提交于
      Introduce new config option, which is used to replace repeating
      CONFIG_MEMCG && !CONFIG_SLOB pattern.  Next patches add a little more
      memcg+kmem related code, so let's keep the defines more clearly.
      
      Link: http://lkml.kernel.org/r/153063053670.1818.15013136946600481138.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NVladimir Davydov <vdavydov.dev@gmail.com>
      Tested-by: NShakeel Butt <shakeelb@google.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Li RongQing <lirongqing@baidu.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Matthias Kaehlcke <mka@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Sahitya Tummala <stummala@codeaurora.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Waiman Long <longman@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84c07d11
    • M
      memcg, oom: move out_of_memory back to the charge path · 29ef680a
      Michal Hocko 提交于
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") has changed the ENOMEM semantic of memcg charges.
      Rather than invoking the oom killer from the charging context it delays
      the oom killer to the page fault path (pagefault_out_of_memory).  This
      in turn means that many users (e.g.  slab or g-u-p) will get ENOMEM when
      the corresponding memcg hits the hard limit and the memcg is is OOM.
      This is behavior is inconsistent with !memcg case where the oom killer
      is invoked from the allocation context and the allocator keeps retrying
      until it succeeds.
      
      The difference in the behavior is user visible.  mmap(MAP_POPULATE)
      might result in not fully populated ranges while the mmap return code
      doesn't tell that to the userspace.  Random syscalls might fail with
      ENOMEM etc.
      
      The primary motivation of the different memcg oom semantic was the
      deadlock avoidance.  Things have changed since then, though.  We have an
      async oom teardown by the oom reaper now and so we do not have to rely
      on the victim to tear down its memory anymore.  Therefore we can return
      to the original semantic as long as the memcg oom killer is not handed
      over to the users space.
      
      There is still one thing to be careful about here though.  If the oom
      killer is not able to make any forward progress - e.g.  because there is
      no eligible task to kill - then we have to bail out of the charge path
      to prevent from same class of deadlocks.  We have basically two options
      here.  Either we fail the charge with ENOMEM or force the charge and
      allow overcharge.  The first option has been considered more harmful
      than useful because rare inconsistencies in the ENOMEM behavior is hard
      to test for and error prone.  Basically the same reason why the page
      allocator doesn't fail allocations under such conditions.  The later
      might allow runaways but those should be really unlikely unless somebody
      misconfigures the system.  E.g.  allowing to migrate tasks away from the
      memcg to a different unlimited memcg with move_charge_at_immigrate
      disabled.
      
      Link: http://lkml.kernel.org/r/20180628151101.25307-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NGreg Thelen <gthelen@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      29ef680a
    • S
      fs: fsnotify: account fsnotify metadata to kmemcg · d46eb14b
      Shakeel Butt 提交于
      Patch series "Directed kmem charging", v8.
      
      The Linux kernel's memory cgroup allows limiting the memory usage of the
      jobs running on the system to provide isolation between the jobs.  All
      the kernel memory allocated in the context of the job and marked with
      __GFP_ACCOUNT will also be included in the memory usage and be limited
      by the job's limit.
      
      The kernel memory can only be charged to the memcg of the process in
      whose context kernel memory was allocated.  However there are cases
      where the allocated kernel memory should be charged to the memcg
      different from the current processes's memcg.  This patch series
      contains two such concrete use-cases i.e.  fsnotify and buffer_head.
      
      The fsnotify event objects can consume a lot of system memory for large
      or unlimited queues if there is either no or slow listener.  The events
      are allocated in the context of the event producer.  However they should
      be charged to the event consumer.  Similarly the buffer_head objects can
      be allocated in a memcg different from the memcg of the page for which
      buffer_head objects are being allocated.
      
      To solve this issue, this patch series introduces mechanism to charge
      kernel memory to a given memcg.  In case of fsnotify events, the memcg
      of the consumer can be used for charging and for buffer_head, the memcg
      of the page can be charged.  For directed charging, the caller can use
      the scope API memalloc_[un]use_memcg() to specify the memcg to charge
      for all the __GFP_ACCOUNT allocations within the scope.
      
      This patch (of 2):
      
      A lot of memory can be consumed by the events generated for the huge or
      unlimited queues if there is either no or slow listener.  This can cause
      system level memory pressure or OOMs.  So, it's better to account the
      fsnotify kmem caches to the memcg of the listener.
      
      However the listener can be in a different memcg than the memcg of the
      producer and these allocations happen in the context of the event
      producer.  This patch introduces remote memcg charging API which the
      producer can use to charge the allocations to the memcg of the listener.
      
      There are seven fsnotify kmem caches and among them allocations from
      dnotify_struct_cache, dnotify_mark_cache, fanotify_mark_cache and
      inotify_inode_mark_cachep happens in the context of syscall from the
      listener.  So, SLAB_ACCOUNT is enough for these caches.
      
      The objects from fsnotify_mark_connector_cachep are not accounted as
      they are small compared to the notification mark or events and it is
      unclear whom to account connector to since it is shared by all events
      attached to the inode.
      
      The allocations from the event caches happen in the context of the event
      producer.  For such caches we will need to remote charge the allocations
      to the listener's memcg.  Thus we save the memcg reference in the
      fsnotify_group structure of the listener.
      
      This patch has also moved the members of fsnotify_group to keep the size
      same, at least for 64 bit build, even with additional member by filling
      the holes.
      
      [shakeelb@google.com: use GFP_KERNEL_ACCOUNT rather than open-coding it]
        Link: http://lkml.kernel.org/r/20180702215439.211597-1-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20180627191250.209150-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d46eb14b
  25. 25 7月, 2018 1 次提交
  26. 21 7月, 2018 1 次提交
    • E
      pid: Implement PIDTYPE_TGID · 6883f81a
      Eric W. Biederman 提交于
      Everywhere except in the pid array we distinguish between a tasks pid and
      a tasks tgid (thread group id).  Even in the enumeration we want that
      distinction sometimes so we have added __PIDTYPE_TGID.  With leader_pid
      we almost have an implementation of PIDTYPE_TGID in struct signal_struct.
      
      Add PIDTYPE_TGID as a first class member of the pid_type enumeration and
      into the pids array.  Then remove the __PIDTYPE_TGID special case and the
      leader_pid in signal_struct.
      
      The net size increase is just an extra pointer added to struct pid and
      an extra pair of pointers of an hlist_node added to task_struct.
      
      The effect on code maintenance is the removal of a number of special
      cases today and the potential to remove many more special cases as
      PIDTYPE_TGID gets used to it's fullest.  The long term potential
      is allowing zombie thread group leaders to exit, which will remove
      a lot more special cases in the code.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      6883f81a