1. 25 Sep, 2019 1 commit
  2. 18 Sep, 2019 1 commit
  3. 01 Aug, 2019 1 commit
  4. 25 Jul, 2019 2 commits
  5. 25 Jun, 2019 4 commits
    • P
      sched/uclamp: Extend sched_setattr() to support utilization clamping · a509a7cd
      Patrick Bellasi committed
      The SCHED_DEADLINE scheduling class provides an advanced and formal
      model to define task requirements that can translate into proper
      decisions for both task placement and frequency selection. Other
      classes have a more simplified model based on the POSIX concept of
      priorities.
      
      Such a simple priority-based model, however, does not allow exploiting
      the most advanced features of the Linux scheduler, such as driving
      frequency selection via the schedutil cpufreq governor. Even for
      non-SCHED_DEADLINE tasks, it is still useful to define task
      properties to support scheduler decisions.
      
      Utilization clamping exposes to user-space a new set of per-task
      attributes the scheduler can use as hints about the expected/required
      utilization for a task. This allows implementing a "proactive" per-task
      frequency control policy, a more advanced policy than the current one
      based just on "passive" measured task utilization. For example, it's
      possible to boost interactive tasks (e.g. to get better performance) or
      cap background tasks (e.g. to be more energy/thermal efficient).
      
      Introduce a new API to set utilization clamping values for a specified
      task by extending sched_setattr(), a syscall which already allows
      defining task-specific properties for different scheduling classes. A new
      pair of attributes specifies the minimum and maximum utilization
      the scheduler can consider for a task.
      
      Do that by first validating the requested clamp values and then applying
      the required changes using the same pattern already in use for
      __setscheduler(). This ensures that the task is re-enqueued with the new
      clamp values.
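
A minimal sketch of what a caller might prepare for the extended sched_setattr(), written in Python with ctypes. The field layout, the flag values and the 56-byte size are assumptions taken from recollection of the uapi linux/sched.h header, not from this page; build_uclamp_attr() is a hypothetical helper, and the range check only mirrors the kind of validation the commit message describes.

```python
import ctypes

SCHED_CAPACITY_SCALE = 1024        # full-scale utilization (assumed value)
SCHED_FLAG_UTIL_CLAMP_MIN = 0x20   # assumed uapi flag values
SCHED_FLAG_UTIL_CLAMP_MAX = 0x40


class SchedAttr(ctypes.Structure):
    """Assumed layout of struct sched_attr including the two clamp fields."""
    _fields_ = [
        ("size", ctypes.c_uint32),
        ("sched_policy", ctypes.c_uint32),
        ("sched_flags", ctypes.c_uint64),
        ("sched_nice", ctypes.c_int32),
        ("sched_priority", ctypes.c_uint32),
        ("sched_runtime", ctypes.c_uint64),
        ("sched_deadline", ctypes.c_uint64),
        ("sched_period", ctypes.c_uint64),
        ("sched_util_min", ctypes.c_uint32),
        ("sched_util_max", ctypes.c_uint32),
    ]


def build_uclamp_attr(util_min, util_max):
    """Hypothetical helper: validate the clamp pair, then fill the struct."""
    if not 0 <= util_min <= util_max <= SCHED_CAPACITY_SCALE:
        raise ValueError("need 0 <= util_min <= util_max <= 1024")
    attr = SchedAttr()
    attr.size = ctypes.sizeof(SchedAttr)
    attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX
    attr.sched_util_min = util_min
    attr.sched_util_max = util_max
    return attr
```

Passing the structure to the kernel needs a raw syscall whose number is architecture-specific, so that step is left out here. Note that the other fields stay zero in this sketch, which a real caller would have to fill in for the policy it wants.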
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a509a7cd
    • P
      sched/uclamp: Add system default clamps · e8f14172
      Patrick Bellasi committed
      Tasks without a user-defined clamp value are considered not clamped
      and by default their utilization can have any value in the
      [0..SCHED_CAPACITY_SCALE] range.
      
      Tasks with a user-defined clamp value are allowed to request any value
      in that range, and the required clamp is unconditionally enforced.
      However, a "System Management Software" could be interested in limiting
      the range of clamp values allowed for all tasks.
      
      Add a privileged interface to define a system default configuration via:
      
        /proc/sys/kernel/sched_uclamp_util_{min,max}
      
      which works as an unconditional clamp range restriction for all tasks.
      
      With the default configuration, the full SCHED_CAPACITY_SCALE range of
      values is allowed for each clamp index. Otherwise, the task-specific
      clamp is capped by the corresponding system default value.
      
      Do that by tracking, for each task, the "effective" clamp value and
      bucket the task has been refcounted in at enqueue time. This
      allows lazy aggregation of "requested" and "system default" values at
      enqueue time and simplifies refcounting updates at dequeue time.
      
      The cached bucket ids are used to avoid (relatively) more expensive
      integer divisions every time a task is enqueued.
      
      An active flag is used to report when the "effective" value is valid and
      thus the task is actually refcounted in the corresponding rq's bucket.
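
The capping rule described above can be sketched as a small model. This is an illustration only: effective_clamps() and the dictionary keys are hypothetical names, and the model assumes that each requested value is simply limited by the system default for the same clamp index, as the message states.

```python
SCHED_CAPACITY_SCALE = 1024  # assumed full-scale value

# With the default configuration the whole range is allowed.
DEFAULT_LIMITS = {"min": SCHED_CAPACITY_SCALE, "max": SCHED_CAPACITY_SCALE}


def effective_clamps(requested, system_default=DEFAULT_LIMITS):
    """Return the effective clamp pair computed at enqueue time.

    Each task-specific value is capped by the system default value
    for the same clamp index (hypothetical model of the rule above).
    """
    return {idx: min(requested[idx], system_default[idx])
            for idx in ("min", "max")}
```

For example, with system defaults of 200 for util_min and 512 for util_max, a task requesting 300 and 900 would end up with 200 and 512 in this model.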
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e8f14172
    • P
      sched/uclamp: Add CPU's clamp buckets refcounting · 69842cba
      Patrick Bellasi committed
      Utilization clamping allows clamping the CPU's utilization within a
      [util_min, util_max] range, depending on the set of RUNNABLE tasks on
      that CPU. Each task references two "clamp buckets" defining its minimum
      and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
      bucket is active if there is at least one RUNNABLE task enqueued on
      that CPU and refcounting that bucket.
      
      When a task is {en,de}queued {on,from} a rq, the set of active clamp
      buckets on that CPU can change. If the set of active clamp buckets
      changes for a CPU a new "aggregated" clamp value is computed for that
      CPU. This is because each clamp bucket enforces a different utilization
      clamp value.
      
      Clamp values are always MAX aggregated for both util_min and util_max.
      This ensures that no task can affect the performance of other
      co-scheduled tasks which are more boosted (i.e. with higher util_min
      clamp) or less capped (i.e. with higher util_max clamp).
      
      A task has:
         task_struct::uclamp[clamp_id]::bucket_id
      to track the "bucket index" of the CPU's clamp bucket it refcounts while
      enqueued, for each clamp index (clamp_id).
      
      A runqueue has:
         rq::uclamp[clamp_id]::bucket[bucket_id].tasks
      to track how many RUNNABLE tasks on that CPU refcount each
      clamp bucket (bucket_id) of a clamp index (clamp_id).
      It also has a:
         rq::uclamp[clamp_id]::bucket[bucket_id].value
      to track the clamp value of each clamp bucket (bucket_id) of a clamp
      index (clamp_id).
      
      The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
      needed to find a new MAX aggregated clamp value for a clamp_id. This
      operation is required only when the last task of a clamp bucket
      tracking the current MAX aggregated clamp value is dequeued. In this
      case, the CPU is either entering IDLE or going to schedule a less
      boosted or more clamped task.
      The expected number of different clamp values configured at build time
      is small enough to fit the full unordered array into a single cache
      line, for configurations of up to 7 buckets.
      
      Add to struct rq the basic data structures required to refcount the
      number of RUNNABLE tasks for each clamp bucket. Add also the max
      aggregation required to update the rq's clamp value at each
      enqueue/dequeue event.
      
      Use a simple linear mapping of clamp values into clamp buckets.
      Pre-compute and cache bucket_id to avoid integer divisions at
      enqueue/dequeue time.
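
The refcounting and MAX aggregation described above can be illustrated with a simplified model for a single clamp index. It is a sketch under stated assumptions, not the kernel code: the bucket count of 5, the RqClamp class, and the choice to represent each bucket by the lowest clamp value that maps into it are all assumptions made for the example.

```python
SCHED_CAPACITY_SCALE = 1024
UCLAMP_BUCKETS = 5  # assumed build-time bucket count
# Linear mapping: width of one bucket, rounded to the closest integer.
BUCKET_DELTA = (SCHED_CAPACITY_SCALE + UCLAMP_BUCKETS // 2) // UCLAMP_BUCKETS


def bucket_id(clamp_value):
    """Map a clamp value to its bucket; computed once and cached per task."""
    return min(clamp_value // BUCKET_DELTA, UCLAMP_BUCKETS - 1)


class RqClamp:
    """Per-runqueue state for one clamp index (util_min or util_max)."""

    def __init__(self, idle_value):
        self.idle_value = idle_value        # value used with no RUNNABLE task
        self.tasks = [0] * UCLAMP_BUCKETS   # refcount per bucket
        self.value = idle_value             # current MAX aggregated clamp

    def enqueue(self, clamp_value):
        b = bucket_id(clamp_value)
        was_idle = sum(self.tasks) == 0
        self.tasks[b] += 1
        base = b * BUCKET_DELTA
        # MAX aggregation: a higher bucket raises the runqueue clamp.
        self.value = base if was_idle else max(self.value, base)
        return b  # the caller keeps this cached bucket index for dequeue

    def dequeue(self, b):
        self.tasks[b] -= 1
        # Rescan only when the bucket holding the current max became empty.
        if self.tasks[b] == 0 and b * BUCKET_DELTA >= self.value:
            self._rescan()

    def _rescan(self):
        for b in range(UCLAMP_BUCKETS - 1, -1, -1):
            if self.tasks[b]:
                self.value = b * BUCKET_DELTA
                return
        self.value = self.idle_value
```

In this model, dequeueing a task from any bucket below the current maximum costs only a decrement, which matches the point made above that the array scan is the uncommon case.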
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      69842cba
    • Q
      sched/debug: Add a new sched_trace_*() helper functions · 3c93a0c0
      Qais Yousef committed
      The new functions allow modules to access internal data structures of
      unexported struct cfs_rq and struct rq to extract important information
      from the tracepoints to be introduced in later patches.
      
      While at it fix alphabetical order of struct declarations in sched.h
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
      Link: https://lkml.kernel.org/r/20190604111459.2862-3-qais.yousef@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3c93a0c0
  6. 19 Jun, 2019 1 commit
    • D
      keys: Cache result of request_key*() temporarily in task_struct · 7743c48e
      David Howells committed
      If a filesystem uses keys to hold authentication tokens, then it needs a
      token for each VFS operation that might perform an authentication check -
      either by passing it to the server, or using it to perform a check based on
      authentication data cached locally.
      
      For open files this isn't a problem, since the key should be cached in the
      file struct since it represents the subject performing operations on that
      file descriptor.
      
      During pathwalk, however, there isn't anywhere to cache the key, except
      perhaps in the nameidata struct - but that isn't exposed to the
      filesystems.  Further, a pathwalk can incur a lot of operations, calling
      one or more of the following, for instance:
      
      	->lookup()
      	->permission()
      	->d_revalidate()
      	->d_automount()
      	->get_acl()
      	->getxattr()
      
      on each dentry/inode it encounters - and each one may need to call
      request_key().  And then, at the end of pathwalk, it will call the actual
      operation:
      
      	->mkdir()
      	->mknod()
      	->getattr()
      	->open()
      	...
      
      which may need to go and get the token again.
      
      However, it is very likely that all of the operations on a single
      dentry/inode - and quite possibly a sequence of them - will all want to use
      the same authentication token, which suggests that caching it would be a
      good idea.
      
      To this end:
      
       (1) Make it so that a positive result of request_key() and co. that didn't
           require upcalling to userspace is cached temporarily in task_struct.
      
       (2) The cache is 1 deep, so a new result displaces the old one.
      
       (3) The key is released by exit and by notify-resume.
      
       (4) The cache is cleared in a newly forked process.
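
The four points above can be sketched as a toy model. Everything here is illustrative: the Task class, the dictionary used as a keyring, and the request_key() signature are hypothetical stand-ins, not the kernel interfaces.

```python
class Task:
    """Toy stand-in for task_struct holding a one-entry key cache."""

    def __init__(self):
        self.cached_requested_key = None  # (description, key) or None

    def fork(self):
        # (4) A newly forked process starts with an empty cache.
        return Task()

    def release_cached_key(self):
        # (3) Called on exit and on notify-resume in the description above.
        self.cached_requested_key = None


def request_key(task, description, keyring, upcall):
    """Look up a key, consulting the task's one-deep cache first."""
    cached = task.cached_requested_key
    if cached is not None and cached[0] == description:
        return cached[1]
    key = keyring.get(description)
    if key is not None:
        # (1) Positive result without an upcall: cache it.
        # (2) The cache is one deep, so this displaces any older entry.
        task.cached_requested_key = (description, key)
        return key
    # Results that needed an upcall to userspace are not cached.
    return upcall(description)
```

During a pathwalk, repeated lookups for the same description would then hit the cache after the first search in this model.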
      Signed-off-by: David Howells <dhowells@redhat.com>
      7743c48e
  7. 15 Jun, 2019 2 commits
    • H
      processor: get rid of cpu_relax_yield · 4ecf0a43
      Heiko Carstens committed
      stop_machine is the only user left of cpu_relax_yield. Given that it
      now has special semantics which are tied to stop_machine, introduce a
      weak stop_machine_yield function which architectures can override, and
      get rid of the generic cpu_relax_yield implementation.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      4ecf0a43
    • M
      s390: improve wait logic of stop_machine · 38f2c691
      Martin Schwidefsky committed
      The stop_machine loop to advance the state machine and to wait for all
      affected CPUs to check-in calls cpu_relax_yield in a tight loop until
      the last missing CPUs acknowledged the state transition.
      
      On a virtual system where not all logical CPUs are backed by real CPUs
      all the time it can take a while for all CPUs to check-in. With the
      current definition of cpu_relax_yield a diagnose 0x44 is done which
      tells the hypervisor to schedule *some* other CPU. That can be any
      CPU and not necessarily one of the CPUs that need to run in order to
      advance the state machine. This can lead to a pretty bad diagnose 0x44
      storm until the last missing CPU finally checked-in.
      
      Replace the undirected cpu_relax_yield based on diagnose 0x44 with a
      directed yield. Each CPU in the wait loop will pick up the next CPU
      in the cpumask of stop_machine. The diagnose 0x9c is used to tell the
      hypervisor to run this next CPU instead of the current one. If there
      is only a limited number of real CPUs backing the virtual CPUs we
      end up with the real CPUs passed around in a round-robin fashion.
      
      [heiko.carstens@de.ibm.com]:
          Use cpumask_next_wrap as suggested by Peter Zijlstra.
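
How a waiting CPU might pick its yield target can be sketched as follows. This is a model only: next_cpu_wrap() is a hypothetical Python function imitating a wrap-around search through the stop_machine cpumask, and the cpumask is represented as a plain set of CPU numbers.

```python
def next_cpu_wrap(this_cpu, cpumask):
    """Return the next CPU in cpumask after this_cpu, wrapping around.

    The current CPU is never returned; None means there is no other
    CPU in the mask to yield to.
    """
    others = sorted(cpu for cpu in set(cpumask) if cpu != this_cpu)
    if not others:
        return None
    for cpu in others:
        if cpu > this_cpu:
            return cpu
    return others[0]  # wrapped past the highest CPU number
```

With a mask of {0, 1, 3}, CPU 1 would yield to CPU 3 and CPU 3 would yield to CPU 0 in this model, which is the round-robin hand-off the message describes.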
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      38f2c691
  8. 03 Jun, 2019 1 commit
  9. 26 May, 2019 1 commit
    • P
      rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() · 23634ebc
      Paul E. McKenney committed
      When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
      kthreads, a full and unconditional wakeup is required to initiate RCU
      core processing.  In contrast, when RCU core processing is carried
      out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
      situations where raise_softirq() does a full wakeup, but these do not
      occur with normal usage of rcu_read_unlock().
      
      The reason that full wakeups can be problematic is that the scheduler
      sometimes invokes rcu_read_unlock() with its pi or rq locks held,
      which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
      rcu_read_unlock() invokes the scheduler.  Scheduler invocations can happen
      in the following situations: (1) The just-ended reader has been subjected
      to RCU priority boosting, in which case rcu_read_unlock() must deboost,
      (2) Interrupts were disabled across the call to rcu_read_unlock(), so
      the quiescent state must be deferred, requiring a wakeup of the rcuc
      kthread corresponding to the current CPU.
      
      Now, the scheduler may hold one of its locks across rcu_read_unlock()
      only if preemption has been disabled across the entire RCU read-side
      critical section, which in the days prior to RCU flavor consolidation
      meant that rcu_read_unlock() never needed to do wakeups.  However, this
      is no longer the case for any but the first rcu_read_unlock() following a
      condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
      attention.  For example, an RCU read-side critical section might be
      preempted, but preemption might be disabled across the rcu_read_unlock().
      The rcu_read_unlock() must defer the quiescent state, and therefore
      leaves the task queued on its leaf rcu_node structure.  If a scheduler
      interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
      one of its locks held.  However, the preempted task is still queued, so
      rcu_read_unlock() will attempt to defer the quiescent state once more.
      When RCU core processing is carried out by RCU_SOFTIRQ, this works just
      fine: The raise_softirq() function simply sets a bit in a per-CPU mask
      and the RCU core processing will be undertaken upon return from interrupt.
      
      Not so when RCU core processing is carried out by the rcuc kthread: In this
      case, the required wakeup can result in deadlock.
      
      The initial solution to this problem was to use set_tsk_need_resched() and
      set_preempt_need_resched() to force a future context switch, which allows
      rcu_preempt_note_context_switch() to report the deferred quiescent state
      to RCU's core processing.  Unfortunately for expedited grace periods,
      there can be a significant delay between the call for a context switch
      and the actual context switch.
      
      This commit therefore introduces a ->deferred_qs flag to the task_struct
      structure's rcu_special structure.  This flag is initially false, and
      is set to true by the first call to rcu_read_unlock() requiring special
      attention, then finally reset back to false when the quiescent state is
      finally reported.  Then rcu_read_unlock() attempts full wakeups only when
      ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
      special attention.  Note that a chain of RCU readers linked by some other
      sort of reader may find that a later rcu_read_unlock() is once again able
      to do a full wakeup, courtesy of an intervening preemption:
      
      	rcu_read_lock();
      	/* preempted */
      	local_irq_disable();
      	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
      	rcu_read_lock();
      	local_irq_enable();
      	preempt_disable();
      	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
      	rcu_read_lock();
      	preempt_enable();
      	/* preempted, ->deferred_qs reset. */
      	local_irq_disable();
      	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
      
      Such linked RCU readers do not yet seem to appear in the Linux kernel, and
      it is probably best if they don't.  However, RCU needs to handle them, and
      some variations on this theme could make even raise_softirq() unsafe due to
      the possibility of its doing a full wakeup.  This commit therefore also
      avoids invoking raise_softirq() when the ->deferred_qs flag is set.
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      23634ebc
  10. 15 May, 2019 1 commit
  11. 20 Apr, 2019 1 commit
    • R
      cgroup: cgroup v2 freezer · 76f969e8
      Roman Gushchin committed
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen task to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
      mostly tried to imitate the system-wide freezer. While uninterruptible
      sleep is fine when all tasks are going to be frozen (hibernation case),
      it is not an acceptable state for some subset of the system.
      
      The cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH is not working because non-fatal signal delivery
      is blocked in frozen state.
      
      There are some interface differences between cgroup v1 and cgroup v2
      freezer too, which are required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
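
A minimal sketch of driving the interface described in points 1) to 3), assuming the control files accept "1"/"0" in cgroup.freeze and report a "frozen 0|1" line in cgroup.events. set_frozen() and is_frozen() are hypothetical helper names, and the cgroup directory path is supplied by the caller.

```python
from pathlib import Path


def set_frozen(cgroup_dir, frozen):
    """Write the desired state to cgroup.freeze (asynchronous request)."""
    Path(cgroup_dir, "cgroup.freeze").write_text("1" if frozen else "0")


def is_frozen(cgroup_dir):
    """Read the actual state from the "frozen" field of cgroup.events."""
    events = Path(cgroup_dir, "cgroup.events").read_text()
    for line in events.splitlines():
        key, _, value = line.partition(" ")
        if key == "frozen":
            return value.strip() == "1"
    return False
```

Because the interface is asynchronous, a caller would typically request the freeze and then poll or wait for a change notification on cgroup.events until is_frozen() reports the new state.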
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
  12. 19 Apr, 2019 1 commit
    • M
      rseq: Remove superfluous rseq_len from task_struct · 83b0b15b
      Mathieu Desnoyers committed
      The rseq system call, when invoked with flags of "0" or
      "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
      be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
      specified in uapi linux/rseq.h.
      
      Expecting a fixed size for rseq_len is a design choice that ensures
      multiple libraries and applications defining __rseq_abi in the same
      process agree on its exact size.
      
      Considering that this size is and will always be the same value, there
      is no point in saving this value within task_struct rseq_len. Remove
      this field from task_struct.
      
      No change in functionality intended.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190305194755.2602-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      83b0b15b
  13. 06 Mar, 2019 2 commits
    • A
      mm/cma: add PF flag to force non cma alloc · d7fefcc8
      Aneesh Kumar K.V committed
      Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
      region", v8.
      
      ppc64 uses the CMA area for the allocation of guest page table (hash
      page table).  We won't be able to start guest if we fail to allocate
      hash page table.  We have observed hash table allocation failure because
      we failed to migrate pages out of CMA region because they were pinned.
      This happens when we are using VFIO.  VFIO on ppc64 pins the entire guest
      RAM.  If the guest RAM pages get allocated out of CMA region, we won't
      be able to migrate those pages.  The pages are also pinned for the
      lifetime of the guest.
      
      Currently we support migration of non-compound pages.  With THP and with
      the addition of hugetlb migration we can end up allocating compound
      pages from CMA region.  This patch series adds support for migrating
      compound pages.
      
      This patch (of 4):
      
      Add PF_MEMALLOC_NOCMA which makes sure any allocation in that context is
      marked non-movable and hence cannot be satisfied by CMA region.
      
      This is useful with get_user_pages_longterm where we want to take a page
      pin by migrating pages from CMA region.  Marking the section
      PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
      later.
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7fefcc8
    • M
      mm, compaction: capture a page under direct compaction · 5e1f0f09
      Mel Gorman committed
      Compaction is inherently race-prone as a suitable page freed during
      compaction can be allocated by any parallel task.  This patch uses a
      capture_control structure to isolate a page immediately when it is freed
      by a direct compactor in the slow path of the page allocator.  The
      intent is to avoid redundant scanning.
      
                                           5.0.0-rc1              5.0.0-rc1
                                     selective-v3r17          capture-v3r19
      Amean     fault-both-1         0.00 (   0.00%)        0.00 *   0.00%*
      Amean     fault-both-3      2582.11 (   0.00%)     2563.68 (   0.71%)
      Amean     fault-both-5      4500.26 (   0.00%)     4233.52 (   5.93%)
      Amean     fault-both-7      5819.53 (   0.00%)     6333.65 (  -8.83%)
      Amean     fault-both-12     9321.18 (   0.00%)     9759.38 (  -4.70%)
      Amean     fault-both-18     9782.76 (   0.00%)    10338.76 (  -5.68%)
      Amean     fault-both-24    15272.81 (   0.00%)    13379.55 *  12.40%*
      Amean     fault-both-30    15121.34 (   0.00%)    16158.25 (  -6.86%)
      Amean     fault-both-32    18466.67 (   0.00%)    18971.21 (  -2.73%)
      
      Latency is only moderately affected but the devil is in the details.  A
      closer examination indicates that base page fault latency is reduced but
      latency of huge pages is increased as it takes greater care to succeed.
      Part of the "problem" is that allocation success rates are close to 100%
      even when under pressure and compaction gets harder.
      
                                      5.0.0-rc1              5.0.0-rc1
                                selective-v3r17          capture-v3r19
      Percentage huge-3        96.70 (   0.00%)       98.23 (   1.58%)
      Percentage huge-5        96.99 (   0.00%)       95.30 (  -1.75%)
      Percentage huge-7        94.19 (   0.00%)       97.24 (   3.24%)
      Percentage huge-12       94.95 (   0.00%)       97.35 (   2.53%)
      Percentage huge-18       96.74 (   0.00%)       97.30 (   0.58%)
      Percentage huge-24       97.07 (   0.00%)       97.55 (   0.50%)
      Percentage huge-30       95.69 (   0.00%)       98.50 (   2.95%)
      Percentage huge-32       96.70 (   0.00%)       99.27 (   2.65%)
      
      And scan rates are reduced as expected by 6% for the migration scanner
      and 29% for the free scanner indicating that there is less redundant
      work.
      
      Compaction migrate scanned    20815362    19573286
      Compaction free scanned       16352612    11510663
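
The quoted reductions can be checked from the two counter rows above. The short calculation below is an illustration added for the reader; reduction_pct() is a hypothetical helper, and the inputs are the scanned-page counts copied from those rows.

```python
def reduction_pct(before, after):
    """Percentage by which a counter dropped between two runs."""
    return 100.0 * (before - after) / before


# Counts from the "Compaction migrate scanned" and "free scanned" rows.
migrate = reduction_pct(20815362, 19573286)
free = reduction_pct(16352612, 11510663)

print(f"migration scanner: {migrate:.1f}% fewer pages scanned")
print(f"free scanner:      {free:.1f}% fewer pages scanned")
```

The results come out at roughly 6% and just under 30%, in line with the figures quoted in the message.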
      
      [mgorman@techsingularity.net: remove redundant check]
        Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
      Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e1f0f09
  14. 26 Feb, 2019 1 commit
    • L
      Revert "x86/fault: BUG() when uaccess helpers fault on kernel addresses" · 53a41cb7
      Linus Torvalds committed
      This reverts commit 9da3f2b7.
      
      It was well-intentioned, but wrong.  Overriding the exception tables for
      instructions for random reasons is just wrong, and that is what the new
      code did.
      
      It caused problems for tracing, and it caused problems for strncpy_from_user(),
      because the new checks made perfectly valid use cases break, rather than
      catch things that did bad things.
      
      Unchecked user space accesses are a problem, but that's not a reason to
      add invalid checks that then people have to work around with silly flags
      (in this case, that 'kernel_uaccess_faults_ok' flag, which is just an
      odd way to say "this commit was wrong" and was sprinkled into random
      places to hide the wrongness).
      
      The real fix to unchecked user space accesses is to get rid of the
      special "let's not check __get_user() and __put_user() at all" logic.
      Make __{get|put}_user() be just aliases to the regular {get|put}_user()
      functions, and make it impossible to access user space without having
      the proper checks in places.
      
      The raison d'être of the special double-underscore versions used to be
      that the range check was expensive, and if you did multiple user
      accesses, you'd do the range check up front (like the signal frame
      handling code, for example).  But SMAP (on x86) and PAN (on ARM) have
      made that optimization pointless, because the _real_ expense is the "set
      CPU flag to allow user space access".
      
      So let's not break the valid cases to catch invalid cases that shouldn't
      even exist.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      53a41cb7
  15. 04 Feb, 2019 5 commits
    • A
      sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock() · c546951d
      Andrea Parri authored
      move_queued_task() synchronizes with task_rq_lock() as follows:
      
      	move_queued_task()		task_rq_lock()
      
      	[S] ->on_rq = MIGRATING		[L] rq = task_rq()
      	WMB (__set_task_cpu())		ACQUIRE (rq->lock);
      	[S] ->cpu = new_cpu		[L] ->on_rq
      
      where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
      address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
      "[L] ->on_rq" by the ACQUIRE itself.
      
      Use READ_ONCE() to load ->cpu in task_rq() (cf. task_cpu()) to honor
      this address dependency.  Also, mark the accesses to ->cpu and ->on_rq
      with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
      Signed-off-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c546951d
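
As a rough userspace illustration of the marked accesses described in the commit above (c546951d), the sketch below defines simplified stand-ins for READ_ONCE()/WRITE_ONCE() built on volatile casts. The `DEMO_` macros, `struct demo_task`, the `DEMO_ON_RQ_*` values, and the release fence standing in for the WMB are all assumptions made for illustration. This is not the kernel's implementation, and it relies on the GCC/Clang `__typeof__` and `__atomic_thread_fence` extensions.

```c
/* Simplified stand-ins: force exactly one load/store that the compiler
 * may not tear, fuse, or re-read. They add no CPU ordering by themselves. */
#define DEMO_READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical miniature of the two task fields the commit talks about. */
struct demo_task {
    int cpu;
    int on_rq;
};

enum { DEMO_ON_RQ_QUEUED = 1, DEMO_ON_RQ_MIGRATING = 2 };

/* Writer side, mirroring the left column of the diagram:
 * [S] ->on_rq = MIGRATING; WMB; [S] ->cpu = new_cpu */
static void demo_move_task(struct demo_task *p, int new_cpu)
{
    DEMO_WRITE_ONCE(p->on_rq, DEMO_ON_RQ_MIGRATING);
    __atomic_thread_fence(__ATOMIC_RELEASE); /* assumed stand-in for the WMB */
    DEMO_WRITE_ONCE(p->cpu, new_cpu);
}

/* Reader side: a single marked load of ->cpu, as task_cpu() does. */
static int demo_task_cpu(struct demo_task *p)
{
    return DEMO_READ_ONCE(p->cpu);
}
```

A caller would pair `demo_move_task()` on one thread with `demo_task_cpu()` followed by a lock acquisition on another; the lock itself is left out of this sketch.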
    • V
      sched/fair: Update scale invariance of PELT · 23127296
      Vincent Guittot authored
      The current implementation of load tracking invariance scales the
      contribution with current frequency and uarch performance (only for
      utilization) of the CPU. One main result of this formula is that the
      figures are capped by current capacity of CPU. Another one is that the
      load_avg is not invariant because it is not scaled with uarch.
      
      The util_avg of a periodic task that runs r time slots every p time slots
      varies in the range:
      
          U * (1-y^r)/(1-y^p) * y^i < Utilization < U * (1-y^r)/(1-y^p)
      
      where U is the max util_avg value = SCHED_CAPACITY_SCALE
      
      At a lower capacity, the range becomes:
      
          U * C * (1-y^r')/(1-y^p) * y^i' < Utilization <  U * C * (1-y^r')/(1-y^p)
      
      with C reflecting the compute capacity ratio between current capacity and
      max capacity.
      
      So C tries to compensate for changes in (1-y^r'), but it can't be accurate.
      
      Instead of scaling the contribution value of PELT algo, we should scale the
      running time. The PELT signal aims to track the amount of computation of
      tasks and/or rq so it seems more correct to scale the running time to
      reflect the effective amount of computation done since the last update.
      
      In order to be fully invariant, we need to apply the same amount of
      running time and idle time whatever the current capacity. Because running
      at lower capacity implies that the task will run longer, we have to ensure
      that the same amount of idle time will be applied when system becomes idle
      and no idle time has been "stolen". But reaching the maximum utilization
      value (SCHED_CAPACITY_SCALE) means that the task is seen as an
      always-running task whatever the capacity of the CPU (even at max compute
      capacity). In this case, we can discard this "stolen" idle time, which
      becomes meaningless.
      
      In order to achieve this time scaling, a new clock_pelt is created per rq.
      The increase of this clock scales with current capacity when something
      is running on rq and synchronizes with clock_task when rq is idle. With
      this mechanism, we ensure the same running and idle time whatever the
      current capacity. This also makes it possible to simplify the pelt algorithm by
      removing all references of uarch and frequency and applying the same
      contribution to utilization and loads. Furthermore, the scaling is done
      only once per update of clock (update_rq_clock_task()) instead of during
      each update of sched_entities and cfs/rt/dl_rq of the rq like the current
      implementation. This is interesting when cgroups are involved as shown in
      the results below:
      
      On a hikey (octo Arm64 platform).
      Performance cpufreq governor and only shallowest c-state to remove variance
      generated by those power features so we only track the impact of pelt algo.
      
      each test runs 16 times:
      
      	./perf bench sched pipe
      	(higher is better)
      	kernel	tip/sched/core     + patch
      	        ops/seconds        ops/seconds         diff
      	cgroup
      	root    59652(+/- 0.18%)   59876(+/- 0.24%)    +0.38%
      	level1  55608(+/- 0.27%)   55923(+/- 0.24%)    +0.57%
      	level2  52115(+/- 0.29%)   52564(+/- 0.22%)    +0.86%
      
      	hackbench -l 1000
      	(lower is better)
      	kernel	tip/sched/core     + patch
      	        duration(sec)      duration(sec)        diff
      	cgroup
      	root    4.453(+/- 2.37%)   4.383(+/- 2.88%)     -1.57%
      	level1  4.859(+/- 8.50%)   4.830(+/- 7.07%)     -0.60%
      	level2  5.063(+/- 9.83%)   4.928(+/- 9.66%)     -2.66%
      
      Then, the responsiveness of PELT is improved when CPU is not running at max
      capacity with this new algorithm. I have put below some examples of
      duration to reach some typical load values according to the capacity of the
      CPU with current implementation and with this patch. These values have been
      computed based on the geometric series and the half period value:
      
        Util (%)     max capacity  half capacity(mainline)  half capacity(w/ patch)
        972 (95%)    138ms         not reachable            276ms
        486 (47.5%)  30ms          138ms                     60ms
        256 (25%)    13ms           32ms                     26ms
      
      On my hikey (octo Arm64 platform) with schedutil governor, the time to
      reach max OPP when starting from a null utilization, decreases from 223ms
      with current scale invariance down to 121ms with the new algorithm.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Morten.Rasmussen@arm.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: patrick.bellasi@arm.com
      Cc: pjt@google.com
      Cc: pkondeti@codeaurora.org
      Cc: quentin.perret@arm.com
      Cc: rjw@rjwysocki.net
      Cc: srinivas.pandruvada@linux.intel.com
      Cc: thara.gopinath@linaro.org
      Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      23127296
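
The responsiveness table in the commit above (23127296) can be approximately reproduced from the geometric series it mentions. The sketch below assumes the usual PELT half-life of 32 ms (y^32 = 0.5) and an always-running task starting from zero utilization; the function names, the `demo_` prefix, and the small bit-by-bit `demo_log2()` helper (used only to avoid linking against libm) are illustrative assumptions, not kernel code. The "mainline" model caps the reachable utilization at the current capacity, while the "patched" model keeps the full ceiling but lets the PELT clock advance at the capacity ratio.

```c
#define PELT_HALF_LIFE_MS    32.0    /* assumed: y^32 = 0.5 */
#define SCHED_CAPACITY_SCALE 1024.0

/* Bit-by-bit base-2 logarithm for x > 0 (avoids needing -lm). */
static double demo_log2(double x)
{
    double result = 0.0, frac = 0.5;

    while (x < 1.0)  { x *= 2.0; result -= 1.0; }
    while (x >= 2.0) { x /= 2.0; result += 1.0; }
    for (int i = 0; i < 40; i++) {
        x *= x;                        /* squaring doubles the logarithm */
        if (x >= 2.0) { x /= 2.0; result += frac; }
        frac /= 2.0;
    }
    return result;
}

/* Wall-clock ms for util to grow from 0 to target, where util(t) =
 * ceiling * (1 - y^t) and the PELT clock runs at clock_rate relative to
 * wall time. Returns a negative value if target is not reachable. */
static double demo_ms_to_reach(double target, double ceiling, double clock_rate)
{
    if (target >= ceiling)
        return -1.0;
    return -PELT_HALF_LIFE_MS * demo_log2(1.0 - target / ceiling) / clock_rate;
}

/* Old behaviour: contribution scaled, so the signal is capped by capacity. */
static double demo_ms_mainline(double target, double capacity_ratio)
{
    return demo_ms_to_reach(target, SCHED_CAPACITY_SCALE * capacity_ratio, 1.0);
}

/* New behaviour: time scaled, so the full range stays reachable. */
static double demo_ms_patched(double target, double capacity_ratio)
{
    return demo_ms_to_reach(target, SCHED_CAPACITY_SCALE, capacity_ratio);
}
```

For instance, `demo_ms_patched(972, 1.0)` comes out at roughly 138 ms and `demo_ms_mainline(972, 0.5)` reports "not reachable", in line with the table; the half-capacity figures agree with the table to within about a millisecond of rounding.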
    • E
      sched/core: Convert task_struct.stack_refcount to refcount_t · f0b89d39
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable task_struct.stack_refcount is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.stack_refcount it might make a difference
      in the following places:
      
       - try_get_task_stack(): increment in refcount_inc_not_zero() only
         guarantees control dependency on success vs. fully ordered
         atomic counterpart
       - put_task_stack(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-6-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f0b89d39
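
The reference-counter properties listed in the commit above (f0b89d39) can be sketched in userspace with C11 atomics. This is only an analogy: the `sketch_ref` type and its functions are hypothetical names, the saturation policy is a simplification, and the memory orderings chosen here (RELEASE on decrement plus an ACQUIRE fence on the final put) are an assumption of this sketch rather than a statement about what lib/refcount.c guarantees.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_REF_SATURATED UINT_MAX

struct sketch_ref {
    atomic_uint count;
};

/* Counter starts at 1: the creator holds the first reference. */
static void sketch_ref_init(struct sketch_ref *r)
{
    atomic_init(&r->count, 1);
}

/* Take a reference unless the object is already dead (count == 0).
 * A saturated counter stays saturated instead of wrapping around. */
static bool sketch_ref_inc_not_zero(struct sketch_ref *r)
{
    unsigned int old = atomic_load_explicit(&r->count, memory_order_relaxed);

    do {
        if (old == 0)
            return false;
        if (old == SKETCH_REF_SATURATED)
            return true;
    } while (!atomic_compare_exchange_weak_explicit(&r->count, &old, old + 1,
                                                    memory_order_relaxed,
                                                    memory_order_relaxed));
    return true;
}

/* Drop a reference; returns true only for the caller that reached zero
 * and must therefore free the resource. Underflow is refused. */
static bool sketch_ref_dec_and_test(struct sketch_ref *r)
{
    unsigned int old = atomic_load_explicit(&r->count, memory_order_relaxed);

    do {
        if (old == 0 || old == SKETCH_REF_SATURATED)
            return false;
    } while (!atomic_compare_exchange_weak_explicit(&r->count, &old, old - 1,
                                                    memory_order_release,
                                                    memory_order_relaxed));
    if (old == 1) {
        atomic_thread_fence(memory_order_acquire);
        return true;
    }
    return false;
}
```

The point of the sketch is the contrast with a plain atomic counter: once the count hits zero no further increments succeed, and a counter that would overflow sticks at its maximum (leaking the object) rather than wrapping toward a premature free.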
    • E
      sched/core: Convert task_struct.usage to refcount_t · ec1d2819
      Elena Reshetova authored
      atomic_t variables are currently used to implement reference
      counters with the following properties:
      
       - counter is initialized to 1 using atomic_set()
       - a resource is freed upon counter reaching zero
       - once counter reaches zero, its further
         increments aren't allowed
       - counter schema uses basic atomic operations
         (set, inc, inc_not_zero, dec_and_test, etc.)
      
      Such atomic variables should be converted to a newly provided
      refcount_t type and API that prevents accidental counter overflows
      and underflows. This is important since overflows and underflows
      can lead to use-after-free situations and be exploitable.
      
      The variable task_struct.usage is used as pure reference counter.
      Convert it to refcount_t and fix up the operations.
      
      ** Important note for maintainers:
      
      Some functions from refcount_t API defined in lib/refcount.c
      have different memory ordering guarantees than their atomic
      counterparts.
      
      The full comparison can be seen in
      https://lkml.org/lkml/2017/11/15/57 and it is hopefully soon
      in state to be merged to the documentation tree.
      
      Normally the differences should not matter since refcount_t provides
      enough guarantees to satisfy the refcounting use cases, but in
      some rare cases it might matter.
      
      Please double check that you don't have some undocumented
      memory guarantees for this variable usage.
      
      For the task_struct.usage it might make a difference
      in the following places:
      
       - put_task_struct(): decrement in refcount_dec_and_test() only
         provides RELEASE ordering and control dependency on success
         vs. fully ordered atomic counterpart
      Suggested-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Elena Reshetova <elena.reshetova@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: David Windsor <dwindsor@gmail.com>
      Reviewed-by: Hans Liljestrand <ishkamiel@gmail.com>
      Reviewed-by: Andrea Parri <andrea.parri@amarulasolutions.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: viro@zeniv.linux.org.uk
      Link: https://lkml.kernel.org/r/1547814450-18902-5-git-send-email-elena.reshetova@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ec1d2819
    • R
      audit: remove audit_context when CONFIG_AUDIT and not AUDITSYSCALL · 5f3d544f
      Richard Guy Briggs authored
      Remove audit_context from struct task_struct and struct audit_buffer
      when CONFIG_AUDIT is enabled but CONFIG_AUDITSYSCALL is not.
      
      Also, audit_log_name() (and supporting inode and fcaps functions) should
      have been put back in auditsc.c when soft and hard link logging was
      normalized since it is only used by syscall auditing.
      
      See github issue https://github.com/linux-audit/audit-kernel/issues/105
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Paul Moore <paul@paul-moore.com>
      5f3d544f
  16. 02 Feb, 2019 1 commit
    • J
      x86/resctrl: Avoid confusion over the new X86_RESCTRL config · e6d42931
      Johannes Weiner authored
      "Resource Control" is a very broad term for this CPU feature, and a term
      that is also associated with containers, cgroups etc. This can easily
      cause confusion.
      
      Make the user prompt more specific. Match the config symbol name.
      
       [ bp: In the future, the corresponding ARM arch-specific code will be
         under ARM_CPU_RESCTRL and the arch-agnostic bits will be carved out
         under the CPU_RESCTRL umbrella symbol. ]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Babu Moger <Babu.Moger@amd.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: linux-doc@vger.kernel.org
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: x86-ml <x86@kernel.org>
      Link: https://lkml.kernel.org/r/20190130195621.GA30653@cmpxchg.org
      e6d42931
  17. 30 Jan, 2019 2 commits
    • W
      x86/speculation: Add PR_SPEC_DISABLE_NOEXEC · 71368af9
      Waiman Long authored
      With the default SPEC_STORE_BYPASS_SECCOMP/SPEC_STORE_BYPASS_PRCTL mode,
      the TIF_SSBD bit will be inherited when a new task is fork'ed or cloned.
      It will also remain when a new program is execve'ed.
      
      Only a certain class of applications (like Java) that can run on behalf of
      multiple users on a single thread will require disabling speculative store
      bypass for security purposes. Those applications will call prctl(2) at
      startup time to disable SSB. They won't rely on the fact the SSB might have
      been disabled. Other applications that don't need SSBD will just move on
      without checking if SSBD has been turned on or not.
      
      The fact that the TIF_SSBD is inherited across execve(2) boundary will
      cause performance of applications that don't need SSBD but their
      predecessors have SSBD on to be unwittingly impacted especially if they
      write to memory a lot.
      
      To remedy this problem, a new PR_SPEC_DISABLE_NOEXEC argument for the
      PR_SET_SPECULATION_CTRL option of prctl(2) is added to allow applications
      to specify that the SSBD feature bit on the task structure should be
      cleared whenever a new program is being execve'ed.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: linux-doc@vger.kernel.org
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: KarimAllah Ahmed <karahmed@amazon.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Link: https://lkml.kernel.org/r/1547676096-3281-1-git-send-email-longman@redhat.com
      71368af9
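
A minimal sketch of how an application might use the new argument described in the commit above (71368af9). The helper name `demo_disable_ssb_until_exec()` is hypothetical, and the `#ifndef` fallbacks only cover build environments whose headers predate these constants. Whether the call succeeds depends on the running kernel, the CPU, and the configured mitigation mode, so the return value should always be checked.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (values from linux/prctl.h). */
#ifndef PR_SET_SPECULATION_CTRL
#define PR_SET_SPECULATION_CTRL 53
#endif
#ifndef PR_SPEC_STORE_BYPASS
#define PR_SPEC_STORE_BYPASS 0
#endif
#ifndef PR_SPEC_DISABLE_NOEXEC
#define PR_SPEC_DISABLE_NOEXEC (1UL << 4)
#endif

/* Disable speculative store bypass for the calling task, but ask the
 * kernel to clear that state again when a new program is execve'ed.
 * Returns 0 on success, or -1 with errno set (for example when the
 * kernel or CPU does not support this control). */
static int demo_disable_ssb_until_exec(void)
{
    return prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_STORE_BYPASS,
                 PR_SPEC_DISABLE_NOEXEC, 0, 0);
}
```

An application of the kind the commit describes would call this once at startup, so that programs it later executes do not silently inherit the SSBD performance cost.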
    • T
      sched: Remove stale PF_MUTEX_TESTER bit · 15917dc0
      Thomas Gleixner authored
      The RTMUTEX tester was removed long ago but the PF bit stayed
      around. Remove it and free up the space.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      15917dc0
  18. 26 Jan, 2019 1 commit
  19. 12 Jan, 2019 1 commit
    • T
      umh: add exit routine for UMH process · 73ab1cb2
      Taehee Yoo authored
      A UMH process which is created by fork_usermode_blob(), such as
      bpfilter, needs to release members of the umh_info when the process is
      terminated.
      But do_exit() does not release members of the umh_info. Hence a module
      which uses UMH needs its own code to detect whether the UMH process has
      terminated.
      But this implementation needs extra code for checking the status of
      the UMH process, which eventually makes the code more complex.
      
      The new PF_UMH flag is added and it is used to identify UMH processes.
      The exit_umh() does not release members of the umh_info.
      Hence umh_info->cleanup callback should release both members of the
      umh_info and the private data.
      Suggested-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Taehee Yoo <ap420073@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      73ab1cb2
  20. 09 Jan, 2019 1 commit
  21. 03 Dec, 2018 1 commit
    • I
      sched: Fix various typos in comments · dfcb245e
      Ingo Molnar authored
      Go over the scheduler source code and fix common typos
      in comments - and a typo in an actual variable name.
      
      No change in functionality intended.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      dfcb245e
  22. 28 Nov, 2018 2 commits
    • T
      x86/speculation: Add prctl() control for indirect branch speculation · 9137bb27
      Thomas Gleixner authored
      Add the PR_SPEC_INDIRECT_BRANCH option for the PR_GET_SPECULATION_CTRL and
      PR_SET_SPECULATION_CTRL prctls to allow fine grained per task control of
      indirect branch speculation via STIBP and IBPB.
      
      Invocations:
       Check indirect branch speculation status with
       - prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
      
       Enable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_ENABLE, 0, 0);
      
       Disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_DISABLE, 0, 0);
      
       Force disable indirect branch speculation with
       - prctl(PR_SET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, PR_SPEC_FORCE_DISABLE, 0, 0);
      
      See Documentation/userspace-api/spec_ctrl.rst.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw@amazon.co.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Casey Schaufler <casey.schaufler@intel.com>
      Cc: Asit Mallick <asit.k.mallick@intel.com>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Jon Masters <jcm@redhat.com>
      Cc: Waiman Long <longman9394@gmail.com>
      Cc: Greg KH <gregkh@linuxfoundation.org>
      Cc: Dave Stewart <david.c.stewart@intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Link: https://lkml.kernel.org/r/20181125185005.866780996@linutronix.de
      9137bb27
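
The status query listed in the commit above (9137bb27) can be wrapped in a small C helper, shown here as a hedged sketch. The names `demo_indirect_branch_status()` and `demo_spec_ctrl_state()` are hypothetical, the `#ifndef` fallbacks exist only for older userspace headers, and the decoding follows the bit layout described in Documentation/userspace-api/spec_ctrl.rst as I understand it. On kernels or CPUs without this control the query returns -1.

```c
#include <sys/prctl.h>

/* Fallbacks for older userspace headers (values from linux/prctl.h). */
#ifndef PR_GET_SPECULATION_CTRL
#define PR_GET_SPECULATION_CTRL 52
#endif
#ifndef PR_SPEC_INDIRECT_BRANCH
#define PR_SPEC_INDIRECT_BRANCH 1
#endif
#ifndef PR_SPEC_PRCTL
#define PR_SPEC_PRCTL (1UL << 0)
#endif
#ifndef PR_SPEC_ENABLE
#define PR_SPEC_ENABLE (1UL << 1)
#endif
#ifndef PR_SPEC_DISABLE
#define PR_SPEC_DISABLE (1UL << 2)
#endif
#ifndef PR_SPEC_FORCE_DISABLE
#define PR_SPEC_FORCE_DISABLE (1UL << 3)
#endif

/* Query the indirect branch speculation state of the calling task.
 * Returns a bitmask of PR_SPEC_* flags, or -1 with errno set. */
static int demo_indirect_branch_status(void)
{
    return prctl(PR_GET_SPECULATION_CTRL, PR_SPEC_INDIRECT_BRANCH, 0, 0, 0);
}

/* Turn the returned bitmask into a short human-readable label. */
static const char *demo_spec_ctrl_state(int status)
{
    if (status < 0)
        return "unavailable";
    if (status == 0)
        return "not affected";
    if (status & PR_SPEC_FORCE_DISABLE)
        return "force-disabled";
    if (status & PR_SPEC_DISABLE)
        return "disabled";
    if (status & PR_SPEC_ENABLE)
        return "enabled";
    return "unknown";
}
```

A caller can additionally test `status & PR_SPEC_PRCTL` to see whether per-task control is available before attempting one of the PR_SET_SPECULATION_CTRL invocations.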
    • S
      function_graph: Use new curr_ret_depth to manage depth instead of curr_ret_stack · 39eb456d
      Steven Rostedt (VMware) authored
      Currently, the depth of the ret_stack is determined by curr_ret_stack index.
      The issue is that there's a race between setting of the curr_ret_stack and
      calling of the callback attached to the return of the function.
      
      Commit 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling
      trace return callback") moved the calling of the callback to after the
      setting of the curr_ret_stack, even stating that it was safe to do so, when
      in fact, it was the reason there was a barrier() there (yes, I should have
      commented that barrier()).
      
      Not only does the curr_ret_stack keep track of the current call graph depth,
      it also keeps the ret_stack content from being overwritten by new data.
      
      The function profiler uses the "subtime" variable of the ret_stack structure
      and by moving the curr_ret_stack, it allows for interrupts to use the same
      structure it was using, corrupting the data, and breaking the profiler.
      
      To fix this, there needs to be two variables to handle the call stack depth
      and the pointer to where the ret_stack is being used, as they need to change
      at two different locations.
      
      Cc: stable@kernel.org
      Fixes: 03274a3f ("tracing/fgraph: Adjust fgraph depth before calling trace return callback")
      Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      39eb456d
  23. 23 Nov, 2018 1 commit
    • B
      x86/resctrl: Rename the config option INTEL_RDT to RESCTRL · 6fe07ce3
      Babu Moger authored
      The resource control feature is supported by both Intel and AMD. So,
      rename CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL.
      
      Now CONFIG_RESCTRL will be used for both Intel and AMD to enable
      Resource Control support. Update the texts in config and condition
      accordingly.
      
       [ bp: Simplify Kconfig text. ]
      Signed-off-by: Babu Moger <babu.moger@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: "Chang S. Bae" <chang.seok.bae@intel.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Dmitry Safonov <dima@arista.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <linux-doc@vger.kernel.org>
      Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pu Wen <puwen@hygon.cn>
      Cc: <qianyue.zj@alibaba-inc.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Reinette Chatre <reinette.chatre@intel.com>
      Cc: Rian Hunter <rian@alum.mit.edu>
      Cc: Sherry Hurwitz <sherry.hurwitz@amd.com>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: <xiaochen.shen@intel.com>
      Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com
      6fe07ce3
  24. 13 Nov, 2018 1 commit
    • P
      rcu: Speed up expedited GPs when interrupting RCU reader · 05f41571
      Paul E. McKenney authored
      In PREEMPT kernels, an expedited grace period might send an IPI to a
      CPU that is executing an RCU read-side critical section.  In that case,
      it would be nice if the rcu_read_unlock() directly interacted with the
      RCU core code to immediately report the quiescent state.  And this does
      happen in the case where the reader has been preempted.  But it would
      also be a nice performance optimization if immediate reporting also
      happened in the preemption-free case.
      
      This commit therefore adds an ->exp_hint field to the task_struct structure's
      ->rcu_read_unlock_special field.  The IPI handler sets this hint when
      it has interrupted an RCU read-side critical section, and this causes
      the outermost rcu_read_unlock() call to invoke rcu_read_unlock_special(),
      which, if preemption is enabled, reports the quiescent state immediately.
      If preemption is disabled, then the report is required to be deferred
      until preemption (or bottom halves or interrupts or whatever) is re-enabled.
      
      Because this is a hint, it does nothing for more complicated cases.  For
      example, if the IPI interrupts an RCU reader, but interrupts are disabled
      across the rcu_read_unlock(), but another rcu_read_lock() is executed
      before interrupts are re-enabled, the hint will already have been cleared.
      If you do crazy things like this, reporting will be deferred until some
      later RCU_SOFTIRQ handler, context switch, cond_resched(), or similar.
      Reported-by: Joel Fernandes <joel@joelfernandes.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Acked-by: Joel Fernandes (Google) <joel@joelfernandes.org>
      05f41571
  25. 27 Oct, 2018 2 commits
  26. 03 Oct, 2018 1 commit
    • E
      signal: Distinguish between kernel_siginfo and siginfo · ae7795bc
      Eric W. Biederman authored
      Linus recently observed that if we did not worry about the padding
      member in struct siginfo it is only about 48 bytes, and 48 bytes is
      much nicer than 128 bytes for allocating on the stack and copying
      around in the kernel.
      
      The obvious thing of only adding the padding when userspace is
      including siginfo.h won't work as there are sigframe definitions in
      the kernel that embed struct siginfo.
      
      So split siginfo in two: kernel_siginfo and siginfo, keeping the
      traditional name for the userspace definition.  The version that
      is used internally to the kernel, and ultimately will not be padded to
      128 bytes, is called kernel_siginfo.
      
      The definition of struct kernel_siginfo I have put in include/signal_types.h
      
      A set of buildtime checks has been added to verify the two structures have
      the same field offsets.
      
      To make it easy to verify the change kernel_siginfo retains the same
      size as siginfo.  The reduction in size comes in a following change.
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
      ae7795bc
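
The build-time offset checks mentioned in the commit above (ae7795bc) can be illustrated with C11 `_Static_assert` and `offsetof`. The two structures below are hypothetical miniatures, not the real siginfo layouts, and the `DEMO_CHECK_OFFSET` macro is an assumption of this sketch rather than the macro the kernel uses. Unlike the state described in this commit, where kernel_siginfo still has the same size as siginfo, the demo already drops the padding from the internal structure to show what the split is aiming for.

```c
#include <stddef.h>

/* Hypothetical unpadded, kernel-internal view. */
struct demo_kernel_info {
    int signo;
    int err;
    int code;
};

/* Hypothetical userspace view: same leading fields, padded to 128 bytes. */
struct demo_user_info {
    int signo;
    int err;
    int code;
    char pad[128 - 3 * sizeof(int)];
};

/* Fail the build if a shared field sits at a different offset. */
#define DEMO_CHECK_OFFSET(field)                                      \
    _Static_assert(offsetof(struct demo_user_info, field) ==         \
                   offsetof(struct demo_kernel_info, field),         \
                   "field offset mismatch between the two layouts")

DEMO_CHECK_OFFSET(signo);
DEMO_CHECK_OFFSET(err);
DEMO_CHECK_OFFSET(code);
```

Because the checks run at compile time, reordering a field in only one of the two structures would stop the build instead of silently corrupting data copied between the layouts.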
  27. 05 Sep, 2018 1 commit