1. 03 June 2020, 1 commit
    • mm/writeback: replace PF_LESS_THROTTLE with PF_LOCAL_THROTTLE · a37b0715
      Committed by NeilBrown
      PF_LESS_THROTTLE exists for loop-back nfsd (and a similar need in the
      loop block driver and callers of prctl(PR_SET_IO_FLUSHER)), where a
      daemon needs to write to one bdi (the final bdi) in order to free up
      writes queued to another bdi (the client bdi).
      
      The daemon sets PF_LESS_THROTTLE and gets a larger allowance of dirty
      pages, so that it can still dirty pages after other processes have been
      throttled.  The purpose of this is to avoid deadlocks that happen when
      the PF_LESS_THROTTLE process must write for any dirty pages to be freed,
      but it is being throttled and cannot write.
      
      This approach was designed when all threads were blocked equally,
      independently of which device they were writing to, or how fast it was.
      Since that time the writeback algorithm has changed substantially with
      different threads getting different allowances based on non-trivial
      heuristics.  This means the simple "add 25%" heuristic is no longer
      reliable.
      
      The important issue is not that the daemon needs a *larger* dirty page
      allowance, but that it needs a *private* dirty page allowance, so that
      dirty pages for the "client" bdi that it is helping to clear (the bdi
      for an NFS filesystem or loop block device etc) do not affect the
      throttling of the daemon writing to the "final" bdi.
      
      This patch changes the heuristic so that the task is not throttled when
      the bdi it is writing to has a dirty page count below (or equal to) the
      free-run threshold for that bdi.  This ensures it will always be able to
      have some pages in flight, and so will not deadlock.
      
      In a steady-state, it is expected that PF_LOCAL_THROTTLE tasks might
      still be throttled by global threshold, but that is acceptable as it is
      only the deadlock state that is interesting for this flag.
      
      This approach of "only throttle when target bdi is busy" is consistent
      with the other use of PF_LESS_THROTTLE in current_may_throttle(), were
      it causes attention to be focussed only on the target bdi.
      
      So this patch
       - renames PF_LESS_THROTTLE to PF_LOCAL_THROTTLE,
       - removes the 25% bonus that that flag gives, and
       - If PF_LOCAL_THROTTLE is set, don't delay at all unless the
         global and the local free-run thresholds are exceeded.
      
      Note that previously realtime threads were treated the same as
      PF_LESS_THROTTLE threads.  This patch does *not* change the behaviour
      for real-time threads, so it is now different from the behaviour of nfsd
      and loop tasks.  I don't know what is wanted for realtime.
      
      [akpm@linux-foundation.org: coding style fixes]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Chuck Lever <chuck.lever@oracle.com>	[nfsd]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Trond Myklebust <trond.myklebust@hammerspace.com>
      Link: http://lkml.kernel.org/r/87ftbf7gs3.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a37b0715
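      A minimal standalone model of the new check described above (not the actual
      mm/page-writeback.c code; the flag bit and the shape of dirty_freerun_ceiling()
      below are assumptions made for illustration):
      
          #include <stdbool.h>
          #include <stdio.h>
          
          #define PF_LOCAL_THROTTLE 0x00100000u   /* stand-in bit, not the real kernel value */
          
          /* Midpoint between the background and hard dirty thresholds, mirroring
           * the shape of the kernel's dirty_freerun_ceiling() helper. */
          static unsigned long dirty_freerun_ceiling(unsigned long thresh,
                                                     unsigned long bg_thresh)
          {
                  return (thresh + bg_thresh) / 2;
          }
          
          /* Decide whether a task writing to a given bdi should be throttled. */
          static bool should_throttle(unsigned int task_flags, unsigned long wb_dirty,
                                      unsigned long wb_thresh, unsigned long wb_bg_thresh)
          {
                  if ((task_flags & PF_LOCAL_THROTTLE) &&
                      wb_dirty <= dirty_freerun_ceiling(wb_thresh, wb_bg_thresh))
                          return false;   /* the flusher's own bdi is idle: never block it */
                  return true;            /* otherwise normal throttling applies */
          }
          
          int main(void)
          {
                  printf("%d\n", should_throttle(PF_LOCAL_THROTTLE, 10, 100, 50)); /* 0 */
                  printf("%d\n", should_throttle(PF_LOCAL_THROTTLE, 90, 100, 50)); /* 1 */
                  printf("%d\n", should_throttle(0, 10, 100, 50));                 /* 1 */
                  return 0;
          }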
  2. 02 April 2020, 1 commit
    • signal: Extend exec_id to 64bits · d1e7fd64
      Committed by Eric W. Biederman
      Replace the 32bit exec_id with a 64bit exec_id to make it impossible
      to wrap the exec_id counter.  With care an attacker can cause exec_id
      wrap and send arbitrary signals to a newly exec'd parent.  This
      bypasses the signal sending checks if the parent changes their
      credentials during exec.
      
      The severity of this problem can be seen in that, in my limited testing
      of a 32bit exec_id, it can take as little as 19s to exec 65536 times,
      which means that it can take as little as 14 days to wrap a 32bit
      exec_id.  Adam Zabrocki has succeeded in wrapping the self_exe_id in 7
      days.  Even my slower timing is within the uptime of a typical server,
      which means self_exec_id is simply a speed bump today, and if exec
      gets noticeably faster self_exec_id won't even be a speed bump.
      
      Extending self_exec_id to 64bits introduces a problem on 32bit
      architectures where reading self_exec_id is no longer atomic and can
      take two read instructions.  This means that it is possible to hit
      a window where the read value of exec_id does not match the written
      value.  So with very lucky timing after this change this still
      remains exploitable.
      
      I have updated the update of exec_id on exec to use WRITE_ONCE
      and the read of exec_id in do_notify_parent to use READ_ONCE
      to make it clear that there is no locking between these two
      locations.
      
      Link: https://lore.kernel.org/kernel-hardening/20200324215049.GA3710@pi3.com.pl
      Fixes: 2.3.23pre2
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      d1e7fd64
  3. 21 March 2020, 2 commits
    • lockdep: Add hrtimer context tracing bits · 40db1739
      Committed by Sebastian Andrzej Siewior
      Set current->irq_config = 1 for hrtimers which are not marked to expire in
      hard interrupt context during hrtimer_init(). These timers will expire in
      softirq context on PREEMPT_RT.
      
      Setting this allows lockdep to differentiate these timers. If a timer is
      marked to expire in hard interrupt context then the timer callback is not
      supposed to acquire a regular spinlock (as opposed to a raw_spinlock) in
      the expiry callback.
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.534508206@linutronix.de
      40db1739
    • lockdep: Introduce wait-type checks · de8f5e4f
      Committed by Peter Zijlstra
      Extend lockdep to validate lock wait-type context.
      
      The current wait-types are:
      
      	LD_WAIT_FREE,		/* wait free, rcu etc.. */
      	LD_WAIT_SPIN,		/* spin loops, raw_spinlock_t etc.. */
      	LD_WAIT_CONFIG,		/* CONFIG_PREEMPT_LOCK, spinlock_t etc.. */
      	LD_WAIT_SLEEP,		/* sleeping locks, mutex_t etc.. */
      
      Where lockdep validates that the current lock (the one being acquired)
      fits in the current wait-context (as generated by the held stack).
      
      This ensures that there is no attempt to acquire mutexes while holding
      spinlocks, to acquire spinlocks while holding raw_spinlocks and so on. In
      other words, it's a fancier might_sleep().
      
      Obviously RCU made the entire ordeal more complex than a simple single
      value test because RCU can be acquired in (pretty much) any context and
      while it presents a context to nested locks it is not the same as it
      got acquired in.
      
      Therefore it's necessary to split the wait_type into two values, one
      representing the acquire (outer) and one representing the nested context
      (inner). For most 'normal' locks these two are the same.
      
      [ To make static initialization easier we have the rule that:
        .outer == INV means .outer == .inner; because INV == 0. ]
      
      It further means that it's required to find the minimal .inner of the held
      stack to compare against the outer of the new lock; because while 'normal'
      RCU presents a CONFIG type to nested locks, if it is taken while already
      holding a SPIN type it obviously doesn't relax the rules.
      
      Below is an example output generated by the trivial test code:
      
        raw_spin_lock(&foo);
        spin_lock(&bar);
        spin_unlock(&bar);
        raw_spin_unlock(&foo);
      
       [ BUG: Invalid wait context ]
       -----------------------------
       swapper/0/1 is trying to lock:
       ffffc90000013f20 (&bar){....}-{3:3}, at: kernel_init+0xdb/0x187
       other info that might help us debug this:
       1 lock held by swapper/0/1:
        #0: ffffc90000013ee0 (&foo){+.+.}-{2:2}, at: kernel_init+0xd1/0x187
      
      The way to read it is to look at the new -{n:m} part in the lock
      description; -{3:3} for the attempted lock, and try and match that up to
      the held locks, which in this case is the one: -{2:2}.
      
      This tells that the acquiring lock requires a more relaxed environment than
      presented by the lock stack.
      
      Currently only the normal locks and RCU are converted; the rest of the
      lockdep users default to .inner = INV, which is ignored. More conversions
      can be done when desired.
      
      The check for spinlock_t nesting is not enabled by default. It's a separate
      config option for now as there are known problems which are currently being
      addressed. The config option makes it possible to identify these problems
      and to verify that the solutions found are indeed solving them.
      
      The config switch will be removed and the checks will be permanently
      enabled once the vast majority of issues have been addressed.
      
      [ bigeasy: Move LD_WAIT_FREE,… out of CONFIG_LOCKDEP to avoid compile
      	   failure with CONFIG_DEBUG_SPINLOCK + !CONFIG_LOCKDEP]
      [ tglx: Add the config option ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20200321113242.427089655@linutronix.de
      de8f5e4f
  4. 20 March 2020, 1 commit
  5. 24 February 2020, 2 commits
  6. 26 January 2020, 1 commit
  7. 25 December 2019, 1 commit
    • rseq: Unregister rseq for clone CLONE_VM · 463f550f
      Committed by Mathieu Desnoyers
      It has been reported by Google that rseq is not behaving properly
      with respect to clone when CLONE_VM is used without CLONE_THREAD.
      
      It keeps the prior thread's rseq TLS registered when the TLS of the
      thread has moved, so the kernel can corrupt the TLS of the parent.
      
      The approach of clearing the per task-struct rseq registration
      on clone with CLONE_THREAD flag is incomplete. It does not cover
      the use-case of clone with CLONE_VM set, but without CLONE_THREAD.
      
      Here is the rationale for unregistering rseq on clone with CLONE_VM
      flag set:
      
      1) CLONE_THREAD requires CLONE_SIGHAND, which requires CLONE_VM to be
         set. Therefore, just checking for CLONE_VM covers all CLONE_THREAD
         uses. There is no point in checking for both CLONE_THREAD and
         CLONE_VM,
      
      2) There is the possibility of an unlikely scenario where CLONE_SETTLS
         is used without CLONE_VM. In order to be an issue, it would require
         that the rseq TLS is in a shared memory area.
      
         I do not plan on adding CLONE_SETTLS to the set of clone flags which
         unregister RSEQ, because it would require that we also unregister RSEQ
         on set_thread_area(2) and arch_prctl(2) ARCH_SET_FS for completeness.
         So rather than doing a partial solution, it appears better to let
         user-space explicitly perform rseq unregistration across clone if
         needed in scenarios where CLONE_VM is not set.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/20191211161713.4490-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      463f550f
  8. 05 December 2019, 1 commit
    • kcov: remote coverage support · eec028c9
      Committed by Andrey Konovalov
      Patch series " kcov: collect coverage from usb and vhost", v3.
      
      This patchset extends kcov to allow collecting coverage from backgound
      kernel threads.  This extension requires custom annotations for each of
      the places where coverage collection is desired.  This patchset
      implements this for hub events in the USB subsystem and for vhost
      workers.  See the first patch description for details about the kcov
      extension.  The other two patches apply this kcov extension to USB and
      vhost.
      
      Examples of other subsystems that might potentially benefit from this
      when custom annotations are added (the list is based on
      process_one_work() callers for bugs recently reported by syzbot):
      
      1. fs: writeback wb_workfn() worker,
      2. net: addrconf_dad_work()/addrconf_verify_work() workers,
      3. net: neigh_periodic_work() worker,
      4. net/p9: p9_write_work()/p9_read_work() workers,
      5. block: blk_mq_run_work_fn() worker.
      
      These patches have been used to enable coverage-guided USB fuzzing with
      syzkaller for the last few years, see the details here:
      
        https://github.com/google/syzkaller/blob/master/docs/linux/external_fuzzing_usb.md
      
      This patchset has been pushed to the public Linux kernel Gerrit
      instance:
      
        https://linux-review.googlesource.com/c/linux/kernel/git/torvalds/linux/+/1524
      
      This patch (of 3):
      
      Add background thread coverage collection ability to kcov.
      
      With KCOV_ENABLE coverage is collected only for syscalls that are issued
      from the current process.  With KCOV_REMOTE_ENABLE it's possible to
      collect coverage for arbitrary parts of the kernel code, provided that
      those parts are annotated with kcov_remote_start()/kcov_remote_stop().
      
      This makes it possible to collect coverage from two types of kernel background
      threads: the global ones, that are spawned during kernel boot in a
      limited number of instances (e.g.  one USB hub_event() worker thread is
      spawned per USB HCD); and the local ones, that are spawned when a user
      interacts with some kernel interface (e.g.  vhost workers).
      
      To enable collecting coverage from a global background thread, a unique
      global handle must be assigned and passed to the corresponding
      kcov_remote_start() call.  Then a userspace process can pass a list of
      such handles to the KCOV_REMOTE_ENABLE ioctl in the handles array field
      of the kcov_remote_arg struct.  This will attach the used kcov device to
      the code sections that are referenced by those handles.
      
      Since there might be many local background threads spawned from
      different userspace processes, we can't use a single global handle per
      annotation.  Instead, the userspace process passes a non-zero handle
      through the common_handle field of the kcov_remote_arg struct.  This
      common handle gets saved to the kcov_handle field in the current
      task_struct and needs to be passed to the newly spawned threads via
      custom annotations.  Those threads should in turn be annotated with
      kcov_remote_start()/kcov_remote_stop().
      
      Internally kcov stores handles as u64 integers.  The top byte of a
      handle is used to denote the id of a subsystem that this handle belongs
      to, and the lower 4 bytes are used to denote the id of a thread instance
      within that subsystem.  A reserved value 0 is used as a subsystem id for
      common handles as they don't belong to a particular subsystem.  The
      bytes 4-7 are currently reserved and must be zero.  In the future the
      number of bytes used for the subsystem or handle ids might be increased.
      
      When a particular userspace process collects coverage via a common
      handle, kcov will collect coverage for each code section that is
      annotated to use the common handle obtained as kcov_handle from the
      current task_struct.  However, non-common handles allow collecting
      coverage selectively from different subsystems.
      
      Link: http://lkml.kernel.org/r/e90e315426a384207edbec1d6aa89e43008e4caf.1572366574.git.andreyknvl@google.com
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: David Windsor <dwindsor@gmail.com>
      Cc: Elena Reshetova <elena.reshetova@intel.com>
      Cc: Anders Roxell <anders.roxell@linaro.org>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eec028c9
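      A rough userspace sketch of the flow described above; the struct layout and
      ioctl numbers below are reproduced from the kcov UAPI added by this patch and
      would normally come from <linux/kcov.h>, and the handle value is a placeholder:
      
          #include <fcntl.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <sys/ioctl.h>
          #include <sys/mman.h>
          #include <unistd.h>
          
          struct kcov_remote_arg {
                  uint32_t trace_mode;    /* KCOV_TRACE_PC */
                  uint32_t area_size;     /* coverage buffer size, in words */
                  uint32_t num_handles;   /* number of entries in handles[] */
                  uint64_t common_handle; /* 0: no local background threads here */
                  uint64_t handles[];
          };
          #define KCOV_INIT_TRACE    _IOR('c', 1, unsigned long)
          #define KCOV_DISABLE       _IO('c', 101)
          #define KCOV_REMOTE_ENABLE _IOW('c', 102, struct kcov_remote_arg)
          #define KCOV_TRACE_PC      0
          #define COVER_SIZE         (64 << 10)
          
          int main(void)
          {
                  int fd = open("/sys/kernel/debug/kcov", O_RDWR);
                  if (fd < 0) { perror("open kcov"); return 1; }
                  if (ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE)) { perror("init"); return 1; }
          
                  uint64_t *cover = mmap(NULL, COVER_SIZE * sizeof(uint64_t),
                                         PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                  if (cover == MAP_FAILED) { perror("mmap"); return 1; }
          
                  /* One global handle; its subsystem/instance encoding is described above. */
                  struct kcov_remote_arg *arg = calloc(1, sizeof(*arg) + sizeof(uint64_t));
                  arg->trace_mode = KCOV_TRACE_PC;
                  arg->area_size = COVER_SIZE;
                  arg->num_handles = 1;
                  arg->handles[0] = 0x0100000000000001ull;   /* placeholder handle value */
          
                  if (ioctl(fd, KCOV_REMOTE_ENABLE, arg)) { perror("remote enable"); return 1; }
                  sleep(2);                                  /* let the annotated code run */
                  printf("collected %llu PCs\n", (unsigned long long)cover[0]);
          
                  ioctl(fd, KCOV_DISABLE, 0);
                  return 0;
          }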
  9. 20 November 2019, 2 commits
  10. 13 November 2019, 1 commit
  11. 30 October 2019, 1 commit
    • io-wq: small threadpool implementation for io_uring · 771b53d0
      Committed by Jens Axboe
      This adds support for io-wq, a smaller and specialized thread pool
      implementation. This is meant to replace workqueues for io_uring. Among
      the reasons for this addition are:
      
      - We can assign memory context smarter and more persistently if we
        manage the lifetime of threads.
      
      - We can drop various work-arounds we have in io_uring, like the
        async_list.
      
      - We can implement hashed work insertion, to manage concurrency of
        buffered writes without needing a) an extra workqueue, or b)
        needlessly making the concurrency of said workqueue very low
        which hurts performance of multiple buffered file writers.
      
      - We can implement cancel through signals, for cancelling
        interruptible work like read/write (or send/recv) to/from sockets.
      
      - We need the above cancel for being able to assign and use file tables
        from a process.
      
      - We can implement a more thorough cancel operation in general.
      
      - We need it to move towards a syslet/threadlet model for even faster
        async execution. For that we need to take ownership of the used
        threads.
      
      This list is just off the top of my head. Performance should be the
      same, or better, at least that's what I've seen in my testing. io-wq
      supports basic NUMA functionality, setting up a pool per node.
      
      io-wq hooks up to the scheduler schedule in/out just like workqueue
      and uses that to drive the need for more/less workers.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      771b53d0
  12. 29 October 2019, 3 commits
  13. 17 October 2019, 1 commit
    • arm64: entry.S: Do not preempt from IRQ before all cpufeatures are enabled · 19c95f26
      Committed by Julien Thierry
      Preempting from IRQ-return means that the task has its PSTATE saved
      on the stack, which will get restored when the task is resumed and does
      the actual IRQ return.
      
      However, enabling some CPU features requires modifying the PSTATE. This
      means that, if a task was scheduled out during an IRQ-return before all
      CPU features are enabled, the task might restore a PSTATE that does not
      include the feature enablement changes once scheduled back in.
      
      * Task 1:
      
      PAN == 0 ---|                          |---------------
                  |                          |<- return from IRQ, PSTATE.PAN = 0
                  | <- IRQ                   |
                  +--------+ <- preempt()  +--
                                           ^
                                           |
                                           reschedule Task 1, PSTATE.PAN == 1
      * Init:
              --------------------+------------------------
                                  ^
                                  |
                                  enable_cpu_features
                                  set PSTATE.PAN on all CPUs
      
      Worse than this, since PSTATE is untouched when task switching is done,
      a task missing the new bits in PSTATE might affect another task, if both
      do direct calls to schedule() (outside of IRQ/exception contexts).
      
      Fix this by preventing preemption on IRQ-return until features are
      enabled on all CPUs.
      
      This way the only PSTATE values that are saved on the stack are from
      synchronous exceptions. These are expected to be fatal this early; the
      exception is BRK for WARN_ON(), but as this uses do_debug_exception(),
      which keeps IRQs masked, it shouldn't call schedule().
      Signed-off-by: Julien Thierry <julien.thierry@arm.com>
      [james: Replaced a really cool hack, with an even simpler static key in C.
       expanded commit message with Julien's cover-letter ascii art]
      Signed-off-by: James Morse <james.morse@arm.com>
      Signed-off-by: Will Deacon <will@kernel.org>
      19c95f26
  14. 25 September 2019, 1 commit
  15. 18 September 2019, 1 commit
  16. 07 September 2019, 1 commit
    • kernel.h: Add non_block_start/end() · 312364f3
      Committed by Daniel Vetter
      In some special cases we must not block, but there's not a spinlock,
      preempt-off, irqs-off or similar critical section already that arms the
      might_sleep() debug checks. Add a non_block_start/end() pair to annotate
      these.
      
      This will be used in the oom paths of mmu-notifiers, where blocking is not
      allowed to make sure there's forward progress. Quoting Michal:
      
      "The notifier is called from quite a restricted context - oom_reaper -
      which shouldn't depend on any locks or sleepable conditionals. The code
      should be swift as well but we mostly do care about it to make a forward
      progress. Checking for sleepable context is the best thing we could come
      up with that would describe these demands at least partially."
      
      Peter also asked whether we want to catch spinlocks on top, but Michal
      said those are less of a problem because spinlocks can't have an indirect
      dependency upon the page allocator and hence close the loop with the oom
      reaper.
      
      Suggested by Michal Hocko.
      
      Link: https://lore.kernel.org/r/20190826201425.17547-4-daniel.vetter@ffwll.ch
      Acked-by: Christian König <christian.koenig@amd.com> (v1)
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      312364f3
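      A small standalone model of the mechanism, assuming only what the text above
      says (a per-task counter that makes the sleep check complain while it is
      non-zero); the real check is wired into the kernel's might_sleep(), not a
      userspace function:
      
          #include <assert.h>
          #include <stdio.h>
          
          /* Stand-in for the per-task state the annotation toggles. */
          static _Thread_local int non_block_count;
          
          static void non_block_start(void) { non_block_count++; }
          static void non_block_end(void)   { assert(non_block_count > 0); non_block_count--; }
          
          /* Model of might_sleep(): complain if called inside a non-blocking section. */
          static void might_sleep(void)
          {
                  if (non_block_count)
                          fprintf(stderr, "BUG: sleeping call inside a non-blocking section\n");
          }
          
          static void oom_notifier_like_path(void)
          {
                  non_block_start();
                  might_sleep();          /* would be flagged: blocking is forbidden here */
                  non_block_end();
          }
          
          int main(void)
          {
                  might_sleep();          /* fine: no non-blocking section is active */
                  oom_notifier_like_path();
                  return 0;
          }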
  17. 28 August 2019, 3 commits
  18. 01 August 2019, 1 commit
  19. 25 July 2019, 2 commits
  20. 25 June 2019, 4 commits
    • sched/uclamp: Extend sched_setattr() to support utilization clamping · a509a7cd
      Committed by Patrick Bellasi
      The SCHED_DEADLINE scheduling class provides an advanced and formal
      model to define task requirements that can translate into proper
      decisions for both task placements and frequency selections. Other
      classes have a simpler model based on the POSIX concept of
      priorities.
      
      Such a simple priority-based model however does not allow exploiting the
      most advanced features of the Linux scheduler like, for example, driving
      frequency selection via the schedutil cpufreq governor. However, also
      for non-SCHED_DEADLINE tasks, it's still interesting to define task
      properties to support scheduler decisions.
      
      Utilization clamping exposes to user-space a new set of per-task
      attributes the scheduler can use as hints about the expected/required
      utilization for a task. This allows implementing a "proactive" per-task
      frequency control policy, a more advanced policy than the current one
      based just on "passive" measured task utilization. For example, it's
      possible to boost interactive tasks (e.g. to get better performance) or
      cap background tasks (e.g. to be more energy/thermal efficient).
      
      Introduce a new API to set utilization clamping values for a specified
      task by extending sched_setattr(), a syscall which already allows
      defining task-specific properties for different scheduling classes. A new
      pair of attributes allows specifying a minimum and maximum utilization
      the scheduler can consider for a task.
      
      Do that by validating the required clamp values first and then applying
      the required changes using _the_ same pattern already in use for
      __setscheduler(). This ensures that the task is re-enqueued with the new
      clamp values.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-7-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a509a7cd
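      A hedged userspace sketch of the extended call; the sched_util_min/sched_util_max
      fields and the SCHED_FLAG_UTIL_CLAMP_* values below follow this series' uapi (they
      would normally come from <linux/sched/types.h>), and the raw syscall is used since
      glibc provides no sched_setattr() wrapper:
      
          #include <errno.h>
          #include <stdint.h>
          #include <stdio.h>
          #include <string.h>
          #include <sys/syscall.h>
          #include <unistd.h>
          
          struct sched_attr {
                  uint32_t size;
                  uint32_t sched_policy;
                  uint64_t sched_flags;
                  int32_t  sched_nice;
                  uint32_t sched_priority;
                  uint64_t sched_runtime;
                  uint64_t sched_deadline;
                  uint64_t sched_period;
                  uint32_t sched_util_min;        /* new: requested minimum utilization */
                  uint32_t sched_util_max;        /* new: requested maximum utilization */
          };
          
          #define SCHED_FLAG_UTIL_CLAMP_MIN 0x20  /* values assumed from this series' uapi */
          #define SCHED_FLAG_UTIL_CLAMP_MAX 0x40
          
          int main(void)
          {
                  struct sched_attr attr;
          
                  memset(&attr, 0, sizeof(attr));
                  attr.size = sizeof(attr);
                  attr.sched_policy = 0;          /* SCHED_OTHER */
                  attr.sched_flags = SCHED_FLAG_UTIL_CLAMP_MIN | SCHED_FLAG_UTIL_CLAMP_MAX;
                  attr.sched_util_min = 512;      /* boost: at least half of SCHED_CAPACITY_SCALE */
                  attr.sched_util_max = 1024;     /* no cap */
          
                  if (syscall(SYS_sched_setattr, 0, &attr, 0))
                          fprintf(stderr, "sched_setattr: %s\n", strerror(errno));
                  else
                          printf("utilization clamps set for this task\n");
                  return 0;
          }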
    • sched/uclamp: Add system default clamps · e8f14172
      Committed by Patrick Bellasi
      Tasks without a user-defined clamp value are considered not clamped
      and by default their utilization can have any value in the
      [0..SCHED_CAPACITY_SCALE] range.
      
      Tasks with a user-defined clamp value are allowed to request any value
      in that range, and the required clamp is unconditionally enforced.
      However, a "System Management Software" could be interested in limiting
      the range of clamp values allowed for all tasks.
      
      Add a privileged interface to define a system default configuration via:
      
        /proc/sys/kernel/sched_uclamp_util_{min,max}
      
      which works as an unconditional clamp range restriction for all tasks.
      
      With the default configuration, the full SCHED_CAPACITY_SCALE range of
      values is allowed for each clamp index. Otherwise, the task-specific
      clamp is capped by the corresponding system default value.
      
      Do that by tracking, for each task, the "effective" clamp value and
      bucket the task has been refcounted in at enqueue time. This
      allows lazily aggregating "requested" and "system default" values at
      enqueue time and simplifies refcounting updates at dequeue time.
      
      The cached bucket ids are used to avoid (relatively) more expensive
      integer divisions every time a task is enqueued.
      
      An active flag is used to report when the "effective" value is valid and
      thus the task is actually refcounted in the corresponding rq's bucket.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-5-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e8f14172
    • sched/uclamp: Add CPU's clamp buckets refcounting · 69842cba
      Committed by Patrick Bellasi
      Utilization clamping allows clamping the CPU's utilization within a
      [util_min, util_max] range, depending on the set of RUNNABLE tasks on
      that CPU. Each task references two "clamp buckets" defining its minimum
      and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
      bucket is active if there is at least one RUNNABLE task enqueued on
      that CPU and refcounting that bucket.
      
      When a task is {en,de}queued {on,from} a rq, the set of active clamp
      buckets on that CPU can change. If the set of active clamp buckets
      changes for a CPU a new "aggregated" clamp value is computed for that
      CPU. This is because each clamp bucket enforces a different utilization
      clamp value.
      
      Clamp values are always MAX aggregated for both util_min and util_max.
      This ensures that no task can affect the performance of other
      co-scheduled tasks which are more boosted (i.e. with higher util_min
      clamp) or less capped (i.e. with higher util_max clamp).
      
      A task has:
         task_struct::uclamp[clamp_id]::bucket_id
      to track the "bucket index" of the CPU's clamp bucket it refcounts while
      enqueued, for each clamp index (clamp_id).
      
      A runqueue has:
         rq::uclamp[clamp_id]::bucket[bucket_id].tasks
      to track how many RUNNABLE tasks on that CPU refcount each
      clamp bucket (bucket_id) of a clamp index (clamp_id).
      It also has a:
         rq::uclamp[clamp_id]::bucket[bucket_id].value
      to track the clamp value of each clamp bucket (bucket_id) of a clamp
      index (clamp_id).
      
      The rq::uclamp::bucket[clamp_id][] array is scanned every time it's
      needed to find a new MAX aggregated clamp value for a clamp_id. This
      operation is required only when the last task of a clamp bucket tracking
      the current MAX aggregated clamp value is dequeued. In this case,
      the CPU is either entering IDLE or going to schedule a less boosted or
      more clamped task.
      The expected number of different clamp values configured at build time
      is small enough to fit the full unordered array into a single cache
      line, for configurations of up to 7 buckets.
      
      Add to struct rq the basic data structures required to refcount the
      number of RUNNABLE tasks for each clamp bucket. Add also the max
      aggregation required to update the rq's clamp value at each
      enqueue/dequeue event.
      
      Use a simple linear mapping of clamp values into clamp buckets.
      Pre-compute and cache bucket_id to avoid integer divisions at
      enqueue/dequeue time.
      Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alessio Balsini <balsini@android.com>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
      Cc: Steve Muckle <smuckle@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Todd Kjos <tkjos@google.com>
      Cc: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      69842cba
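      A toy userspace model of the max-aggregated bucket refcounting described above
      (the bucket count, the linear mapping and the "last value wins" simplification
      are illustration choices, not the kernel/sched implementation):
      
          #include <stdio.h>
          
          #define NR_BUCKETS      5
          #define CAPACITY_SCALE  1024
          #define BUCKET_DELTA    (CAPACITY_SCALE / NR_BUCKETS)
          
          struct bucket { unsigned int tasks; unsigned int value; };
          static struct bucket rq_bucket[NR_BUCKETS];     /* one clamp index only */
          
          static unsigned int bucket_id(unsigned int clamp_value)
          {
                  unsigned int id = clamp_value / BUCKET_DELTA;
                  return id < NR_BUCKETS ? id : NR_BUCKETS - 1;
          }
          
          /* MAX-aggregate across all active buckets (those with at least one task). */
          static unsigned int rq_clamp(void)
          {
                  unsigned int max = 0;
                  for (int i = 0; i < NR_BUCKETS; i++)
                          if (rq_bucket[i].tasks && rq_bucket[i].value > max)
                                  max = rq_bucket[i].value;
                  return max;
          }
          
          static void enqueue(unsigned int clamp_value)
          {
                  struct bucket *b = &rq_bucket[bucket_id(clamp_value)];
                  b->tasks++;
                  b->value = clamp_value;         /* simplification: last value wins */
          }
          
          static void dequeue(unsigned int clamp_value)
          {
                  struct bucket *b = &rq_bucket[bucket_id(clamp_value)];
                  if (b->tasks)
                          b->tasks--;             /* rq_clamp() recomputes the MAX lazily */
          }
          
          int main(void)
          {
                  enqueue(300); enqueue(800);
                  printf("aggregated clamp: %u\n", rq_clamp());    /* 800 */
                  dequeue(800);
                  printf("aggregated clamp: %u\n", rq_clamp());    /* 300 */
                  return 0;
          }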
    • sched/debug: Add a new sched_trace_*() helper functions · 3c93a0c0
      Committed by Qais Yousef
      The new functions allow modules to access internal data structures of
      unexported struct cfs_rq and struct rq to extract important information
      from the tracepoints to be introduced in later patches.
      
      While at it fix alphabetical order of struct declarations in sched.h
      Signed-off-by: Qais Yousef <qais.yousef@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Pavankumar Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Perret <quentin.perret@arm.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Uwe Kleine-Konig <u.kleine-koenig@pengutronix.de>
      Link: https://lkml.kernel.org/r/20190604111459.2862-3-qais.yousef@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3c93a0c0
  21. 19 June 2019, 1 commit
    • keys: Cache result of request_key*() temporarily in task_struct · 7743c48e
      Committed by David Howells
      If a filesystem uses keys to hold authentication tokens, then it needs a
      token for each VFS operation that might perform an authentication check -
      either by passing it to the server, or by using it to perform a check based
      on authentication data cached locally.
      
      For open files this isn't a problem, since the key should be cached in the
      file struct since it represents the subject performing operations on that
      file descriptor.
      
      During pathwalk, however, there isn't anywhere to cache the key, except
      perhaps in the nameidata struct - but that isn't exposed to the
      filesystems.  Further, a pathwalk can incur a lot of operations, calling
      one or more of the following, for instance:
      
      	->lookup()
      	->permission()
      	->d_revalidate()
      	->d_automount()
      	->get_acl()
      	->getxattr()
      
      on each dentry/inode it encounters - and each one may need to call
      request_key().  And then, at the end of pathwalk, it will call the actual
      operation:
      
      	->mkdir()
      	->mknod()
      	->getattr()
      	->open()
      	...
      
      which may need to go and get the token again.
      
      However, it is very likely that all of the operations on a single
      dentry/inode - and quite possibly a sequence of them - will all want to use
      the same authentication token, which suggests that caching it would be a
      good idea.
      
      To this end:
      
       (1) Make it so that a positive result of request_key() and co. that didn't
           require upcalling to userspace is cached temporarily in task_struct.
      
       (2) The cache is 1 deep, so a new result displaces the old one.
      
       (3) The key is released by exit and by notify-resume.
      
       (4) The cache is cleared in a newly forked process.
      Signed-off-by: David Howells <dhowells@redhat.com>
      7743c48e
  22. 15 June 2019, 2 commits
    • processor: get rid of cpu_relax_yield · 4ecf0a43
      Committed by Heiko Carstens
      stop_machine is the only user left of cpu_relax_yield. Given that it
      now has special semantics which are tied to stop_machine introduce a
      weak stop_machine_yield function which architectures can override, and
      get rid of the generic cpu_relax_yield implementation.
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      4ecf0a43
    • s390: improve wait logic of stop_machine · 38f2c691
      Committed by Martin Schwidefsky
      The stop_machine loop to advance the state machine and to wait for all
      affected CPUs to check in calls cpu_relax_yield in a tight loop until
      the last missing CPUs have acknowledged the state transition.
      
      On a virtual system where not all logical CPUs are backed by real CPUs
      all the time it can take a while for all CPUs to check-in. With the
      current definition of cpu_relax_yield a diagnose 0x44 is done which
      tells the hypervisor to schedule *some* other CPU. That can be any
      CPU and not necessarily one of the CPUs that need to run in order to
      advance the state machine. This can lead to a pretty bad diagnose 0x44
      storm until the last missing CPU finally checked-in.
      
      Replace the undirected cpu_relax_yield based on diagnose 0x44 with a
      directed yield. Each CPU in the wait loop will pick up the next CPU
      in the cpumask of stop_machine. The diagnose 0x9c is used to tell the
      hypervisor to run this next CPU instead of the current one. If there
      is only a limited number of real CPUs backing the virtual CPUs we
      end up with the real CPUs passed around in a round-robin fashion.
      
      [heiko.carstens@de.ibm.com]:
          Use cpumask_next_wrap as suggested by Peter Zijlstra.
      Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      38f2c691
  23. 03 June 2019, 1 commit
  24. 26 May 2019, 1 commit
    • rcu: Check for wakeup-safe conditions in rcu_read_unlock_special() · 23634ebc
      Committed by Paul E. McKenney
      When RCU core processing is offloaded from RCU_SOFTIRQ to the rcuc
      kthreads, a full and unconditional wakeup is required to initiate RCU
      core processing.  In contrast, when RCU core processing is carried
      out by RCU_SOFTIRQ, a raise_softirq() suffices.  Of course, there are
      situations where raise_softirq() does a full wakeup, but these do not
      occur with normal usage of rcu_read_unlock().
      
      The reason that full wakeups can be problematic is that the scheduler
      sometimes invokes rcu_read_unlock() with its pi or rq locks held,
      which can of course result in deadlock in CONFIG_PREEMPT=y kernels when
      rcu_read_unlock() invokes the scheduler.  Scheduler invocations can happen
      in the following situations: (1) The just-ended reader has been subjected
      to RCU priority boosting, in which case rcu_read_unlock() must deboost,
      (2) Interrupts were disabled across the call to rcu_read_unlock(), so
      the quiescent state must be deferred, requiring a wakeup of the rcuc
      kthread corresponding to the current CPU.
      
      Now, the scheduler may hold one of its locks across rcu_read_unlock()
      only if preemption has been disabled across the entire RCU read-side
      critical section, which in the days prior to RCU flavor consolidation
      meant that rcu_read_unlock() never needed to do wakeups.  However, this
      is no longer the case for any but the first rcu_read_unlock() following a
      condition (e.g., preempted RCU reader) requiring special rcu_read_unlock()
      attention.  For example, an RCU read-side critical section might be
      preempted, but preemption might be disabled across the rcu_read_unlock().
      The rcu_read_unlock() must defer the quiescent state, and therefore
      leaves the task queued on its leaf rcu_node structure.  If a scheduler
      interrupt occurs, the scheduler might well invoke rcu_read_unlock() with
      one of its locks held.  However, the preempted task is still queued, so
      rcu_read_unlock() will attempt to defer the quiescent state once more.
      When RCU core processing is carried out by RCU_SOFTIRQ, this works just
      fine: The raise_softirq() function simply sets a bit in a per-CPU mask
      and the RCU core processing will be undertaken upon return from interrupt.
      
      Not so when RCU core processing is carried out by the rcuc kthread: In this
      case, the required wakeup can result in deadlock.
      
      The initial solution to this problem was to use set_tsk_need_resched() and
      set_preempt_need_resched() to force a future context switch, which allows
      rcu_preempt_note_context_switch() to report the deferred quiescent state
      to RCU's core processing.  Unfortunately for expedited grace periods,
      there can be a significant delay between the call for a context switch
      and the actual context switch.
      
      This commit therefore introduces a ->deferred_qs flag to the task_struct
      structure's rcu_special structure.  This flag is initially false, and
      is set to true by the first call to rcu_read_unlock() requiring special
      attention, then finally reset back to false when the quiescent state is
      finally reported.  Then rcu_read_unlock() attempts full wakeups only when
      ->deferred_qs is false, that is, on the first rcu_read_unlock() requiring
      special attention.  Note that a chain of RCU readers linked by some other
      sort of reader may find that a later rcu_read_unlock() is once again able
      to do a full wakeup, courtesy of an intervening preemption:
      
      	rcu_read_lock();
      	/* preempted */
      	local_irq_disable();
      	rcu_read_unlock(); /* Can do full wakeup, sets ->deferred_qs. */
      	rcu_read_lock();
      	local_irq_enable();
      	preempt_disable()
      	rcu_read_unlock(); /* Cannot do full wakeup, ->deferred_qs set. */
      	rcu_read_lock();
      	preempt_enable();
      	/* preempted, ->deferred_qs reset. */
      	local_irq_disable();
      	rcu_read_unlock(); /* Can again do full wakeup, sets ->deferred_qs. */
      
      Such linked RCU readers do not yet seem to appear in the Linux kernel, and
      it is probably best if they don't.  However, RCU needs to handle them, and
      some variations on this theme could make even raise_softirq() unsafe due to
      the possibility of its doing a full wakeup.  This commit therefore also
      avoids invoking raise_softirq() when the ->deferred_qs flag is set.
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      23634ebc
  25. 15 May 2019, 1 commit
  26. 20 April 2019, 1 commit
    • cgroup: cgroup v2 freezer · 76f969e8
      Committed by Roman Gushchin
      Cgroup v1 implements the freezer controller, which provides an ability
      to stop the workload in a cgroup and temporarily free up some
      resources (cpu, io, network bandwidth and, potentially, memory)
      for some other tasks. Cgroup v2 lacks this functionality.
      
      This patch implements freezer for cgroup v2.
      
      Cgroup v2 freezer tries to put tasks into a state similar to jobctl
      stop. This means that tasks can be killed, ptraced (using
      PTRACE_SEIZE*), and interrupted. It is possible to attach to
      a frozen task, get some information (e.g. read registers) and detach.
      It's also possible to migrate a frozen tasks to another cgroup.
      
      This distinguishes the cgroup v2 freezer from the cgroup v1 freezer, which
      mostly tried to imitate the system-wide freezer. While uninterruptible
      sleep is fine when all tasks are going to be frozen (the hibernation case),
      it's not an acceptable state when only some subset of the system is frozen.
      
      The cgroup v2 freezer does not support freezing kthreads.
      If a non-root cgroup contains a kthread, the cgroup can still be frozen,
      but the kthread will remain running, the cgroup will be shown
      as non-frozen, and the notification will not be delivered.
      
      * PTRACE_ATTACH does not work because non-fatal signal delivery
      is blocked in the frozen state.
      
      There are some interface differences between the cgroup v1 and cgroup v2
      freezers too, which are required to conform to the cgroup v2 interface
      design principles:
      1) There is no separate controller, which has to be turned on:
      the functionality is always available and is represented by
      cgroup.freeze and cgroup.events cgroup control files.
      2) The desired state is defined by the cgroup.freeze control file.
      Any hierarchical configuration is allowed.
      3) The interface is asynchronous. The actual state is available
      using cgroup.events control file ("frozen" field). There are no
      dedicated transitional states.
      4) It's allowed to make any changes with the cgroup hierarchy
      (create new cgroups, remove old cgroups, move tasks between cgroups)
      no matter if some cgroups are frozen.
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      No-objection-from-me-by: Oleg Nesterov <oleg@redhat.com>
      Cc: kernel-team@fb.com
      76f969e8
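      A hedged sketch of driving the new interface from userspace, assuming a cgroup v2
      mount at /sys/fs/cgroup and an already-created child cgroup named "demo"; a real
      consumer would poll or inotify-watch cgroup.events rather than read it once:
      
          #include <stdio.h>
          #include <string.h>
          
          static int write_str(const char *path, const char *val)
          {
                  FILE *f = fopen(path, "w");
                  if (!f) { perror(path); return -1; }
                  fputs(val, f);
                  return fclose(f);
          }
          
          int main(void)
          {
                  const char *cg = "/sys/fs/cgroup/demo";
                  char path[256], line[128];
          
                  /* Request the frozen state; the interface is asynchronous. */
                  snprintf(path, sizeof(path), "%s/cgroup.freeze", cg);
                  if (write_str(path, "1"))
                          return 1;
          
                  /* Read the "frozen" field of cgroup.events to see the actual state. */
                  snprintf(path, sizeof(path), "%s/cgroup.events", cg);
                  FILE *ev = fopen(path, "r");
                  if (!ev) { perror(path); return 1; }
                  while (fgets(line, sizeof(line), ev))
                          if (strncmp(line, "frozen ", 7) == 0)
                                  printf("frozen: %s", line + 7);
                  fclose(ev);
                  return 0;
          }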
  27. 19 April 2019, 1 commit
    • rseq: Remove superfluous rseq_len from task_struct · 83b0b15b
      Committed by Mathieu Desnoyers
      The rseq system call, when invoked with flags of "0" or
      "RSEQ_FLAG_UNREGISTER" values, expects the rseq_len parameter to
      be equal to sizeof(struct rseq), which is fixed-size and fixed-layout,
      specified in uapi linux/rseq.h.
      
      Expecting a fixed size for rseq_len is a design choice that ensures
      multiple libraries and application defining __rseq_abi in the same
      process agree on its exact size.
      
      Considering that this size is and will always be the same value, there
      is no point in saving this value within task_struct rseq_len. Remove
      this field from task_struct.
      
      No change in functionality intended.
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Maurer <bmaurer@fb.com>
      Cc: Boqun Feng <boqun.feng@gmail.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Lameter <cl@linux.com>
      Cc: Dave Watson <davejwatson@fb.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/20190305194755.2602-3-mathieu.desnoyers@efficios.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      83b0b15b
  28. 06 March 2019, 1 commit
    • mm/cma: add PF flag to force non cma alloc · d7fefcc8
      Committed by Aneesh Kumar K.V
      Patch series "mm/kvm/vfio/ppc64: Migrate compound pages out of CMA
      region", v8.
      
      ppc64 uses the CMA area for the allocation of guest page table (hash
      page table).  We won't be able to start guest if we fail to allocate
      hash page table.  We have observed hash table allocation failure because
      we failed to migrate pages out of CMA region because they were pinned.
      This happens when we are using VFIO.  VFIO on ppc64 pins the entire guest
      RAM.  If the guest RAM pages get allocated out of CMA region, we won't
      be able to migrate those pages.  The pages are also pinned for the
      lifetime of the guest.
      
      Currently we support migration of non-compound pages.  With THP and with
      the addition of hugetlb migration we can end up allocating compound
      pages from CMA region.  This patch series add support for migrating
      compound pages.
      
      This patch (of 4):
      
      Add PF_MEMALLOC_NOCMA, which makes sure any allocation in that context is
      marked non-movable and hence cannot be satisfied from the CMA region.
      
      This is useful with get_user_pages_longterm where we want to take a page
      pin by migrating pages from CMA region.  Marking the section
      PF_MEMALLOC_NOCMA ensures that we avoid unnecessary page migration
      later.
      
      Link: http://lkml.kernel.org/r/20190114095438.32470-2-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7fefcc8
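      A minimal sketch of the save/set/restore pattern such a PF flag implies, modeled
      in plain C; the flag bit value is made up and the helpers only mimic the
      memalloc_nocma_save()/memalloc_nocma_restore() names used by this series:
      
          #include <stdio.h>
          
          #define PF_MEMALLOC_NOCMA 0x10000000u   /* hypothetical bit value for the model */
          
          static unsigned int current_flags;      /* stands in for current->flags */
          
          /* Set the flag and return its previous state. */
          static unsigned int memalloc_nocma_save(void)
          {
                  unsigned int old = current_flags & PF_MEMALLOC_NOCMA;
          
                  current_flags |= PF_MEMALLOC_NOCMA;
                  return old;
          }
          
          /* Put the flag back exactly as it was. */
          static void memalloc_nocma_restore(unsigned int old)
          {
                  current_flags = (current_flags & ~PF_MEMALLOC_NOCMA) | old;
          }
          
          int main(void)
          {
                  unsigned int saved = memalloc_nocma_save();
          
                  /* ...allocations here would be treated as non-movable / non-CMA... */
                  printf("inside section, flag set: %u\n", !!(current_flags & PF_MEMALLOC_NOCMA));
                  memalloc_nocma_restore(saved);
                  printf("after restore, flag set:  %u\n", !!(current_flags & PF_MEMALLOC_NOCMA));
                  return 0;
          }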