1. 24 June 2017, 2 commits
  2. 23 June 2017, 3 commits
  3. 22 June 2017, 3 commits
  4. 20 June 2017, 12 commits
    • sched/core: Drop the unused try_get_task_struct() helper function · f11cc076
      Committed by Davidlohr Bueso
      This function was introduced by:
      
        150593bf ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
      
      ... to allow easier usage of task_rcu_dereference(); however, no users
      were ever added. Drop the helper.
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dave@stgolabs.net
      Link: http://lkml.kernel.org/r/20170615023730.22827-1-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f11cc076
    • sched/fair: WARN() and refuse to set buddy when !se->on_rq · c5ae366e
      Committed by Daniel Axtens
      If we set a next or last buddy for an se that is not on_rq, we will
      end up taking a NULL pointer dereference in wakeup_preempt_entity
      via pick_next_task_fair().
      
      Detect when we would be about to do that, throw a warning and
      then refuse to actually set it.
      
      This has been suggested at least twice:
      
        https://marc.info/?l=linux-kernel&m=146651668921468&w=2
        https://lkml.org/lkml/2016/6/16/663
      
      I recently had to debug a problem with these (we hadn't backported
      Konstantin's patches in this area) and this would have saved a lot
      of time/pain.
      
      Just do it.
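
      A sketch of what the guarded buddy setter looks like with this change
      (illustrative, based on the description above; SCHED_WARN_ON() is the
      scheduler's debug-warning helper):

      	static void set_next_buddy(struct sched_entity *se)
      	{
      		if (entity_is_task(se) && unlikely(task_of(se)->policy == SCHED_IDLE))
      			return;

      		for_each_sched_entity(se) {
      			/* Warn and bail instead of installing a !on_rq buddy: */
      			if (SCHED_WARN_ON(!se->on_rq))
      				return;
      			cfs_rq_of(se)->next = se;
      		}
      	}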
      Signed-off-by: Daniel Axtens <dja@axtens.net>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170510201139.16236-1-dja@axtens.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c5ae366e
    • sched/debug: Fix SCHED_WARN_ON() to return a value on !CONFIG_SCHED_DEBUG as well · 6d3aed3d
      Committed by Ingo Molnar
      This definition of SCHED_WARN_ON():
      
       #define SCHED_WARN_ON(x)        ((void)(x))
      
      is not fully compatible with the 'real' WARN_ON_ONCE() primitive, as it
      has no return value, so it cannot be used in conditionals.
      
      Fix it.
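
      One way to fix it (a sketch, not necessarily the exact patch): keep
      evaluating 'x' but yield a boolean result, e.g. via a GCC statement
      expression:

      	#ifdef CONFIG_SCHED_DEBUG
      	# define SCHED_WARN_ON(x)	WARN_ONCE(x, #x)
      	#else
      	/* Still evaluates x (for side effects), but now usable in an 'if': */
      	# define SCHED_WARN_ON(x)	({ (void)(x), false; })
      	#endif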
      
      Cc: Daniel Axtens <dja@axtens.net>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6d3aed3d
    • sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming · 2055da97
      Committed by Ingo Molnar
      So I've noticed a number of instances where it was not obvious from the
      code whether ->task_list was for a wait-queue head or a wait-queue entry.
      
      Furthermore, there are a number of wait-queue users where the lists are
      not for 'tasks' but for other entities (poll tables, etc.), in which case
      the 'task_list' name is actively confusing.
      
      To clear this all up, name the wait-queue head and entry list structure
      fields unambiguously:
      
      	struct wait_queue_head::task_list	=> ::head
      	struct wait_queue_entry::task_list	=> ::entry
      
      For example, this code:
      
      	rqw->wait.task_list.next != &wait->task_list
      
      ... was pretty unclear (to me) as to what it's doing, while now it's written this way:
      
      	rqw->wait.head.next != &wait->entry
      
      ... which makes it pretty clear that we are iterating a list until we see the head.
      
      Other examples are:
      
      	list_for_each_entry_safe(pos, next, &x->task_list, task_list) {
      	list_for_each_entry(wq, &fence->wait.task_list, task_list) {
      
      ... where it's unclear (to me) what we are iterating, and during review it's
      hard to tell whether it's trying to walk a wait-queue entry (which would be
      a bug), while now it's written as:
      
      	list_for_each_entry_safe(pos, next, &x->head, entry) {
      	list_for_each_entry(wq, &fence->wait.head, entry) {
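
      For reference, the resulting structure shapes are roughly:

      	struct wait_queue_entry {
      		unsigned int		flags;
      		void			*private;
      		wait_queue_func_t	func;
      		struct list_head	entry;
      	};

      	struct wait_queue_head {
      		spinlock_t		lock;
      		struct list_head	head;
      	};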
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2055da97
    • sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c · 5822a454
      Committed by Ingo Molnar
      The key hashed waitqueue data structures and their initialization
      were done in the main scheduler file for no good reason; move them
      to sched/wait_bit.c instead.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5822a454
    • sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h> · 5dd43ce2
      Committed by Ingo Molnar
      The wait_bit*() types and APIs are mixed into wait.h, but they
      are a pretty orthogonal extension of wait-queues.
      
      Furthermore, only about 50 kernel files use these APIs, while
      over 1000 use the regular wait-queue functionality.
      
      So clean up the main wait.h by moving the wait-bit functionality
      out of it, into a separate .h and .c file:
      
        include/linux/wait_bit.h  for types and APIs
        kernel/sched/wait_bit.c   for the implementation
      
      Update all header dependencies.
      
      This reduces the size of wait.h rather significantly, by about 30%.
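
      After the split, a typical wait-bit user only needs the new header.
      For illustration ('obj' and MY_BIT are made-up names):

      	#include <linux/wait_bit.h>

      	/* Sleep until bit MY_BIT of obj->flags is cleared: */
      	if (wait_on_bit(&obj->flags, MY_BIT, TASK_INTERRUPTIBLE))
      		return -EINTR;	/* interrupted by a signal */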
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5dd43ce2
    • sched/wait: Standardize wait_bit_queue naming · 76c85ddc
      Committed by Ingo Molnar
      So wait-bit-queue variables are often named:
      
      	struct wait_bit_queue *q
      
      ... which is a bit ambiguous and super confusing, because
      they clearly suggest wait-queue head semantics and behavior
      (they rhyme with the old wait_queue_t *q naming), while they
      are extended wait-queue _entries_, not heads!
      
      They are misnomers in two ways:
      
       - the 'wait_bit_queue' leaves open the question of whether
         it's an entry or a head
      
       - the 'q' parameter and local variable naming falsely implies
         that it's a 'queue' - while it's an entry.
      
      This resulted in sometimes confusing cases such as:
      
      	finish_wait(wq, &q->wait);
      
      where the 'q' is not a wait-queue head, but a wait-bit-queue entry.
      
      So improve this all by standardizing wait-bit-queue nomenclature
      similar to wait-queue head naming:
      
      	struct wait_bit_queue   => struct wait_bit_queue_entry
      	q			=> wbq_entry
      
      Which makes it all much clearer:
      
      	struct wait_bit_queue_entry *wbq_entry
      
      ... and turns the former confusing piece of code into:
      
      	finish_wait(wq_head, &wbq_entry->wq_entry);
      
      which IMHO makes it immediately clear what we are doing,
      without having to analyze the context of the code: we are
      adding a wait-queue entry to a regular wait-queue head, and
      that entry is embedded in a wait-bit-queue entry.
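
      The renamed entry type then looks roughly like:

      	struct wait_bit_queue_entry {
      		struct wait_bit_key	key;
      		struct wait_queue_entry	wq_entry;
      	};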
      
      I'm not a big fan of acronyms, but repeating wait_bit_queue_entry
      in field and local variable names is too long, so hopefully it's
      clear enough that 'wq_' prefixes stand for wait-queues, while
      'wbq_' prefixes stand for wait-bit-queues.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      76c85ddc
    • sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name · 21417136
      Committed by Ingo Molnar
      Rename 'struct wait_bit_queue::wait' to ::wq_entry, to more clearly
      name it as a wait-queue entry.
      
      Propagate it to a couple of usage sites where the wait-bit-queue internals
      are exposed.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      21417136
    • sched/wait: Standardize internal naming of wait-queue heads · 9d9d676f
      Committed by Ingo Molnar
      The wait-queue head parameters and variables are named in a
      couple of ways; we currently have the following variants:
      
      	wait_queue_head_t *q
      	wait_queue_head_t *wq
      	wait_queue_head_t *head
      
      In particular the 'wq' naming is ambiguous as to whether it refers to
      a wait-queue head or an entry - entries were often named 'wait'.
      
      ( Not to mention the confusion of any readers coming over from
        workqueue-land. )
      
      Standardize all this around a single, unambiguous parameter and
      variable name:
      
      	struct wait_queue_head *wq_head
      
      which is easy to grep for and also rhymes nicely with the wait-queue
      entry naming:
      
      	struct wait_queue_entry *wq_entry
      
      Also rename:
      
      	struct __wait_queue_head => struct wait_queue_head
      
      ... and use this struct type to migrate from typedefs usage to 'struct'
      usage, which is more in line with existing kernel practices.
      
      Don't touch any external users and preserve the main wait_queue_head_t
      typedef.
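
      In code, the migration looks roughly like this (the list field itself
      is renamed separately, later in this series):

      	struct wait_queue_head {
      		spinlock_t		lock;
      		struct list_head	task_list;
      	};
      	typedef struct wait_queue_head wait_queue_head_t;	/* kept for external users */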
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9d9d676f
    • sched/wait: Standardize internal naming of wait-queue entries · 50816c48
      Committed by Ingo Molnar
      So the various wait-queue entry variables in include/linux/wait.h
      and kernel/sched/wait.c are named in a colorfully inconsistent
      way:
      
      	wait_queue_entry_t *wait
      	wait_queue_entry_t *__wait	(even in plain C code!)
      	wait_queue_entry_t *q		(!)
      	wait_queue_entry_t *new		(making anyone who knows C++ cringe)
      	wait_queue_entry_t *old
      
      I think part of the reason for the inconsistency is the constant
      apparent confusion about what a wait queue 'head' versus 'entry' is.
      
      ( Some of the documentation talks about a 'wait descriptor', which is
        the wait-queue entry itself - further adding to the confusion. )
      
      The most common name is 'wait', but that in itself is somewhat
      ambiguous as well, as it does not really make it clear whether
      it's a wait-queue entry or head.
      
      To improve all this, name the wait-queue entry structure parameters
      and variables consistently, and push this naming through all
      the wait.h and wait.c code:
      
      	struct wait_queue_entry *wq_entry
      
      The 'wq_' prefix makes it easy to grep for, and we also use the
      opportunity to move away from the typedef to a plain 'struct' naming:
      in the kernel we typically reserve typedefs for cases where a
      C structure is really small and somewhat opaque - such as pte_t.
      
      wait-queue entries are neither small nor opaque, so use the more
      standard 'struct xxx_entry' list management code nomenclature instead.
      
      ( We don't touch external users, and we preserve the typedef as well
        for actual wait-queue users, to reduce unnecessary churn. )
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      50816c48
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Committed by Ingo Molnar
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
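
      In code form (sketch):

      	struct wait_queue_entry;				/* was: struct __wait_queue */
      	typedef struct wait_queue_entry wait_queue_entry_t;	/* was: wait_queue_t */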
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ac6424b9
    • livepatch: Fix stacking of patches with respect to RCU · 842c0884
      Committed by Petr Mladek
      rcu_read_(un)lock(), list_*_rcu(), and synchronize_rcu() are used for
      safe access to and manipulation of the list of patches that modify the
      same function. In particular, this applies to the variable func_stack,
      which is accessible from the ftrace handler via struct ftrace_ops and
      klp_ops.

      Of course, it also synchronizes some state of the patch on top of the
      stack, e.g. func->transition in klp_ftrace_handler().

      At the same time, this mechanism also guards the manipulation of
      task->patch_state, which is modified according to the state of the
      transition and the state of the process.
      
      Now, all this works well as long as RCU works well. Sadly, livepatching
      might hit corner cases where this is not true. For example, RCU is not
      watching when rcu_read_lock() is taken in idle threads, because they
      might sleep and prevent the grace period from being reached for too long.

      There are ways to make RCU watch even in idle threads, see
      rcu_irq_enter(). But there is a small window inside the RCU
      infrastructure where even this does not work.
      
      This small problematic window can be detected either before calling
      rcu_irq_enter(), via rcu_irq_enter_disabled(), or later, via
      rcu_is_watching(). Sadly, there is no safe way to handle it. Once we
      detect that RCU was not watching, we might see an inconsistent state of
      the function stack and the related variables in klp_ftrace_handler().
      Then we could make a wrong decision, use an incompatible implementation
      of the function, and break the consistency of the system. We could warn,
      but we could not avoid the damage.
      
      Fortunately, ftrace has similar problems, and they seem to be solved
      well there. It uses a heavyweight implementation of some RCU operations.
      In particular, it replaces:
      
        + rcu_read_lock() with preempt_disable_notrace()
        + rcu_read_unlock() with preempt_enable_notrace()
        + synchronize_rcu() with schedule_on_each_cpu(sync_work)
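
      As a sketch (not the exact livepatch code), the handler-side pattern
      described above looks like this:

      	static void klp_ftrace_handler(unsigned long ip, unsigned long parent_ip,
      				       struct ftrace_ops *fops,
      				       struct pt_regs *regs)
      	{
      		struct klp_ops *ops = container_of(fops, struct klp_ops, fops);
      		struct klp_func *func;

      		/* Heavy-weight replacement for rcu_read_lock(): */
      		preempt_disable_notrace();

      		func = list_first_or_null_rcu(&ops->func_stack, struct klp_func,
      					      stack_node);
      		if (func)
      			klp_arch_set_pc(regs, (unsigned long)func->new_func);

      		/* Heavy-weight replacement for rcu_read_unlock(): */
      		preempt_enable_notrace();
      	}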
      
      My understanding is that this is an RCU implementation from the stone
      age. It meets the core RCU requirements, but it is rather inefficient.
      In particular, it does not allow batching or speeding up the
      synchronize calls.

      On the other hand, it is very simple. It allows safely tracing and/or
      livepatching even the RCU core infrastructure. And the inefficiency is
      not a big issue, because using ftrace or livepatches on production
      systems is a rare operation. Safety is much more important than a
      negligible extra load.
      
      Note that the alternative implementation follows the RCU principles.
      Therefore, we could, and actually must, use the list_*_rcu() variants
      when manipulating the func_stack. These functions allow accessing the
      pointers in the right order and with the right barriers. But they do
      not use any other information that would be set only by rcu_read_lock().
      
      Also note that there are actually two problems solved in ftrace:

      First, it cares about the consistency of RCU read sections. This is
      solved in the way described above and used in this patch.
      
      Second, ftrace needs to make sure that nobody is inside the dynamic
      trampoline when it is being freed. For this, it also calls
      synchronize_rcu_tasks() in preemptible kernels in ftrace_shutdown().
      
      Livepatch has a similar problem, but it is solved by ftrace for free.
      klp_ftrace_handler() is a good citizen and never sleeps. In addition,
      it is registered with FTRACE_OPS_FL_DYNAMIC, which causes
      unregister_ftrace_function() to call:

      	* schedule_on_each_cpu(ftrace_sync) - always
      	* synchronize_rcu_tasks() - in preemptible kernels

      The effect is that nobody is inside either the dynamic trampoline or
      the ftrace handler after unregister_ftrace_function() returns.
      
      [jkosina@suse.cz: reformat changelog, fix comment]
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      Acked-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Acked-by: Miroslav Benes <mbenes@suse.cz>
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      842c0884
  5. 13 June 2017, 2 commits
  6. 12 June 2017, 1 commit
  7. 11 June 2017, 2 commits
  8. 08 June 2017, 15 commits
    • srcu: Allow use of Classic SRCU from both process and interrupt context · 1123a604
      Committed by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq context
      as long as process-context users disable interrupts.
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
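
      Roughly, the resulting Classic SRCU read-side entry (a sketch; exact
      field names may differ slightly from the actual patch):

      	int __srcu_read_lock(struct srcu_struct *sp)
      	{
      		int idx;

      		idx = READ_ONCE(sp->completed) & 0x1;
      		this_cpu_inc(sp->per_cpu_ref->lock_count[idx]);	/* was __this_cpu_inc() */
      		smp_mb(); /* B */	/* Avoid leaking the critical section. */
      		return idx;
      	}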
      
      Cc: stable@vger.kernel.org
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      1123a604
    • srcu: Allow use of Tiny/Tree SRCU from both process and interrupt context · cdf7abc4
      Committed by Paolo Bonzini
      Linu Cherian reported a WARN in cleanup_srcu_struct() when shutting
      down a guest running iperf on a VFIO assigned device.  This happens
      because irqfd_wakeup() calls srcu_read_lock(&kvm->irq_srcu) in interrupt
      context, while a worker thread does the same inside kvm_set_irq().  If the
      interrupt happens while the worker thread is executing __srcu_read_lock(),
      updates to the Classic SRCU ->lock_count[] field or the Tree SRCU
      ->srcu_lock_count[] field can be lost.
      
      The docs say you are not supposed to call srcu_read_lock() and
      srcu_read_unlock() from irq context, but KVM interrupt injection happens
      from (host) interrupt context and it would be nice if SRCU supported the
      use case.  KVM is using SRCU here not really for the "sleepable" part,
      but rather due to its IPI-free fast detection of grace periods.  It is
      therefore not desirable to switch back to RCU, which would effectively
      revert commit 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING",
      2014-01-16).
      
      However, the docs are overly conservative.  You can have an SRCU instance
      that only has users in irq context, and you can mix process and irq context
      as long as process-context users disable interrupts.
      __srcu_read_unlock() actually uses this_cpu_dec() on both Tree SRCU and
      Classic SRCU.  For those two implementations, only srcu_read_lock()
      is unsafe.
      
      When Classic SRCU's __srcu_read_unlock() was changed to use this_cpu_dec(),
      in commit 5a41344a ("srcu: Simplify __srcu_read_unlock() via
      this_cpu_dec()", 2012-11-29), __srcu_read_lock() did two increments.
      Therefore it kept __this_cpu_inc(), with preempt_disable/enable in
      the caller.  Tree SRCU however only does one increment, so on most
      architectures it is more efficient for __srcu_read_lock() to use
      this_cpu_inc(), and any performance differences appear to be down in
      the noise.
      
      Unlike Classic and Tree SRCU, Tiny SRCU does increments and decrements on
      a single variable.  Therefore, as Peter Zijlstra pointed out, Tiny SRCU's
      implementation already supports mixed-context use of srcu_read_lock()
      and srcu_read_unlock(), at least as long as uses of srcu_read_lock()
      and srcu_read_unlock() in each handler are nested and paired properly.
      In other words, it is still illegal to (say) invoke srcu_read_lock()
      in an interrupt handler and to invoke the matching srcu_read_unlock()
      in a softirq handler.  Therefore, the only change required for Tiny SRCU
      is to its comments.
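
      For illustration, a legal mixed-context pattern under these rules
      ('my_srcu' and do_protected_work() are hypothetical names): the
      process-context user disables interrupts, so an irq-context reader of
      the same srcu_struct always nests properly:

      	unsigned long flags;
      	int idx;

      	local_irq_save(flags);
      	idx = srcu_read_lock(&my_srcu);
      	do_protected_work();		/* hypothetical */
      	srcu_read_unlock(&my_srcu, idx);
      	local_irq_restore(flags);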
      
      Fixes: 719d93cd ("kvm/irqchip: Speed up KVM_SET_GSI_ROUTING")
      Reported-by: Linu Cherian <linuc.decode@gmail.com>
      Suggested-by: Linu Cherian <linuc.decode@gmail.com>
      Cc: kvm@vger.kernel.org
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Tested-by: Paolo Bonzini <pbonzini@redhat.com>
      cdf7abc4
    • Revert "printk: fix double printing with earlycon" · dac8bbba
      Committed by Petr Mladek
      This reverts commit cf39bf58.
      
      The commit caused a regression for users that define both console=ttyS1
      and console=ttyS0 on the command line, see
      https://lkml.kernel.org/r/20170509082915.GA13236@bistromath.localdomain
      
      The kernel log messages always appeared only on one serial port. It is
      even documented in Documentation/admin-guide/serial-console.rst:
      
      "Note that you can only define one console per device type (serial,
      video)."
      
      The above-mentioned commit changed the order in which the command line
      parameters are searched. As a result, the kernel log messages go to
      the last mentioned ttyS* instead of the first one.
      
      We long thought that using two console=ttyS* entries on the command line
      did not make sense. But then we realized that console= parameters
      are also handled by systemd, see
      http://0pointer.de/blog/projects/serial-console.html
      
      "By default systemd will instantiate one serial-getty@.service on
      the main kernel console, if it is not a virtual terminal."
      
      where
      
      "[4] If multiple kernel consoles are used simultaneously, the main
      console is the one listed first in /sys/class/tty/console/active,
      which is the last one listed on the kernel command line."
      
      This puts the original report into another light. The system is running
      in qemu. The first serial port is used to store the messages into a file.
      The second one is used to log in to the system via a socket. It depends
      on systemd and the historic kernel behavior.
      
      In other words, systemd makes it sensible to define both
      console=ttyS1 and console=ttyS0 on the command line. The kernel fix
      caused a regression related to userspace (systemd) and needs to be
      reverted.
      
      In addition, it turned out that the fix helped only partially.
      The messages were still duplicated when the boot console was
      removed early by late_initcall(printk_late_init). Then the entire
      log was replayed when the same console was registered as a normal one.
      
      Link: 20170606160339.GC7604@pathway.suse.cz
      Cc: Aleksey Makarov <aleksey.makarov@linaro.org>
      Cc: Sabrina Dubroca <sd@queasysnail.net>
      Cc: Sudeep Holla <sudeep.holla@arm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Jiri Slaby <jslaby@suse.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: "Nair, Jayachandran" <Jayachandran.Nair@cavium.com>
      Cc: linux-serial@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Reported-by: Sabrina Dubroca <sd@queasysnail.net>
      Acked-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Petr Mladek <pmladek@suse.com>
      dac8bbba
    • sched/idle: Add deferrable vmstat_updater back · ebfa4c02
      Committed by Aubrey Li
      Deferrable vmstat_updater was missing in commit:
      
        c1de45ca ("sched/idle: Add support for tasks that inject idle")
      
      Add it back.
      Signed-off-by: Aubrey Li <aubrey.li@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Aubrey Li <aubrey.li@intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ebfa4c02
    • sched/core: Omit building stop_sched_class when !SMP · f5832c19
      Committed by Nicolas Pitre
      The stop class is invoked through stop_machine() only.
      This makes it dead code on UP builds.
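
      A sketch of the kind of change involved: on UP builds the stop class
      simply stops being the highest scheduling class:

      	#ifdef CONFIG_SMP
      	#define sched_class_highest (&stop_sched_class)
      	#else
      	#define sched_class_highest (&dl_sched_class)
      	#endif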
      Signed-off-by: Nicolas Pitre <nico@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20170529210302.26868-3-nicolas.pitre@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f5832c19
    • sched/deadline: Use the revised wakeup rule for suspending constrained dl tasks · 3effcb42
      Committed by Daniel Bristot de Oliveira
      We have been facing some problems with self-suspending constrained
      deadline tasks. The main reason is that the original CBS was not
      designed for this sort of task.
      
      One problem reported by Xunlei Pang takes place when a task
      suspends and is then awakened before the deadline, but so close
      to the deadline that its remaining runtime can cause the task
      to have an absolute density higher than allowed. In such a situation,
      the original CBS assumes that the task is facing an early activation,
      and so it replenishes the task and sets another deadline, one deadline
      in the future. This rule works fine for implicit deadline tasks.
      Moreover, it allows the system to adapt the period of a task whose
      external event source suffered from clock drift.
      
      However, this opens the window for bandwidth leakage for constrained
      deadline tasks. For instance, a task with the following parameters:
      
        runtime   = 5 ms
        deadline  = 7 ms
        [density] = 5 / 7 = 0.71
        period    = 1000 ms
      
      If the task runs for 1 ms and then suspends for another 1 ms,
      it will be awakened with the following parameters:

        remaining runtime = 4 ms
        laxity = 5 ms

      presenting an absolute density of 4 / 5 = 0.80.
      
      In this case, the original CBS would assume the task had an early
      wakeup. CBS would then reset the runtime, and the absolute deadline
      would be postponed by one relative deadline, allowing the task to run.

      The problem is that, if the task runs this pattern forever, it will keep
      receiving bandwidth, being able to run 1 ms every 2 ms. Following this
      behavior, the task would be able to run 500 ms in 1 sec, thus running
      more than the 5 ms / 1 sec that admission control allowed it to run.
      
      Trying to address the self-suspending case, Luca Abeni, Giuseppe
      Lipari, and Juri Lelli [1] revisited the CBS in order to deal with
      self-suspending tasks. In the new approach, rather than
      replenishing/postponing the absolute deadline, the revised wakeup rule
      adjusts the remaining runtime, reducing it to fit into the allowed
      density.
      
      A revised version of the idea is:
      
      At a given time t, the maximum absolute density of a task cannot be
      higher than its relative density, that is:
      
        runtime / (deadline - t) <= dl_runtime / dl_deadline
      
      Knowing the laxity of a task (deadline - t), it is possible to move
      it to the other side of the equality, thus making it possible to define
      the maximum remaining runtime a task can use within the absolute deadline,
      without over-running the allowed density:
      
        runtime = (dl_runtime / dl_deadline) * (deadline - t)
      
      For instance, in our previous example, the task could still run:
      
        runtime = ( 5 / 7 ) * 5
        runtime = 3.57 ms
      
      without causing damage to other deadline tasks. It is noteworthy
      that the laxity cannot be negative, because that would cause a negative
      runtime. Thus, this patch depends on the patch:
      
        df8eac8c ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
      
      Which throttles a constrained deadline task activated after the
      deadline.
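
      A sketch of the revised rule in code (assuming dl_density stores
      dl_runtime / dl_deadline in BW_SHIFT fixed point; names follow the
      changelog, not necessarily the final patch):

      	static void update_dl_revised_wakeup(struct sched_dl_entity *dl_se,
      					     struct rq *rq)
      	{
      		u64 laxity = dl_se->deadline - rq_clock(rq);

      		/* runtime = (dl_runtime / dl_deadline) * (deadline - t) */
      		dl_se->runtime = (dl_se->dl_density * laxity) >> BW_SHIFT;
      	}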
      
      Finally, it is also possible to use the revised wakeup rule for
      all other tasks, but that would require some more discussion
      about the pros and cons.
      Reported-by: Xunlei Pang <xpang@redhat.com>
      Signed-off-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      [peterz: replaced dl_is_constrained with dl_is_implicit]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Romulo Silva de Oliveira <romulo.deoliveira@ufsc.br>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/5c800ab3a74a168a84ee5f3f84d12a02e11383be.1495803804.git.bristot@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3effcb42
    • sched/deadline: Zero out positive runtime after throttling constrained tasks · ae83b56a
      Committed by Xunlei Pang
      When a constrained task is throttled by dl_check_constrained_dl(),
      it may carry remaining positive runtime; as a result, when
      dl_task_timer() fires and calls replenish_dl_entity(), the task will
      not be replenished correctly due to the positive dl_se->runtime.

      This patch sets the runtime to 0 if it is positive after throttling.
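
      In other words, at the end of dl_check_constrained_dl() (sketch):

      	/* Discard leftover runtime carried across the throttling: */
      	if (dl_se->runtime > 0)
      		dl_se->runtime = 0;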
      Signed-off-by: Xunlei Pang <xlpang@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Luca Abeni <luca.abeni@santannapisa.it>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: df8eac8c ("sched/deadline: Throttle a constrained deadline task activated after the deadline")
      Link: http://lkml.kernel.org/r/1494421417-27550-1-git-send-email-xlpang@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ae83b56a
    • sched/deadline: Reclaim bandwidth not used by dl tasks · daec5798
      Committed by Luca Abeni
      This commit introduces a per-runqueue "extra utilization" that can be
      reclaimed by deadline tasks. In this way, the maximum fraction of CPU
      time that can be reclaimed by deadline tasks is fixed (and configurable)
      and does not depend on the total deadline utilization.
      The GRUB accounting rule is modified to add this "extra utilization"
      to the inactive utilization of the runqueue, and to avoid reclaiming
      more than a maximum fraction of the CPU time.
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-10-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      daec5798
    • sched/deadline: Base GRUB reclaiming on the inactive utilization · 9f0d1a50
      Committed by Luca Abeni
      Instead of decreasing the runtime as "dq = -Uact dt" (possibly
      divided by the maximum utilization available for deadline tasks),
      decrease it as "dq = -max{u, (1 - Uinact)} dt", where u is the task
      utilization and Uinact is the "inactive utilization".
      In this way, the maximum fraction of CPU time that can be reclaimed
      is given by the total utilization of deadline tasks.
      This approach solves a fairness issue with "traditional" global GRUB
      reclaiming: using the traditional GRUB algorithm, if tasks are
      allocated to the various cores in a non-uniform way, the
      reclaiming mechanism allows some tasks to reclaim more time than
      others. This issue is visible when starting 11 time-consuming tasks with
      runtime 10 ms and period 30 ms (total utilization 3.666) on a 4-core
      system: some tasks will receive much more than their reserved runtime
      (thanks to the reclaiming mechanism), while other tasks will receive
      less than their reserved runtime.
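
      A sketch of the resulting reclaiming rule (BW_UNIT is 1.0 in BW_SHIFT
      fixed point, u is the task utilization; illustrative only):

      	static u64 grub_reclaim(u64 delta, struct rq *rq, u64 u)
      	{
      		u64 u_inact = rq->dl.this_bw - rq->dl.running_bw;	/* Utot - Uact */
      		u64 u_act;

      		/* u_act = max{u, BW_UNIT - u_inact} */
      		if (u_inact > BW_UNIT - u)
      			u_act = u;
      		else
      			u_act = BW_UNIT - u_inact;

      		return (delta * u_act) >> BW_SHIFT;
      	}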
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-9-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9f0d1a50
    • sched/deadline: Track the "total rq utilization" too · 8fd27231
      Committed by Luca Abeni
      The total rq utilization is defined as the sum of the utilizations of
      tasks that are "assigned" to a runqueue, independently of their state
      (TASK_RUNNING or blocked).
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Claudio Scordino <claudio@evidence.eu.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-8-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8fd27231
    • sched/deadline: Make GRUB a task's flag · 2d4283e9
      Committed by Luca Abeni
      This patch introduces the SCHED_FLAG_RECLAIM flag to specify
      that a DL task is allowed to reclaim unused CPU time (using
      the GRUB algorithm).
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-7-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      2d4283e9
    • sched/deadline: Do not reclaim the whole CPU bandwidth · 4da3abce
      Committed by Luca Abeni
      The original GRUB algorithm tends to reclaim 100% of the CPU time,
      and this allows a CPU hog to starve non-deadline tasks.
      To address this issue, allow the scheduler to reclaim only a
      specified fraction of CPU time, stored in the new "bw_ratio"
      field of the dl runqueue structure.
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-6-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4da3abce
    • sched/deadline: Implement GRUB accounting · c52f14d3
      Committed by Luca Abeni
      According to the GRUB (Greedy Reclamation of Unused Bandwidth)
      reclaiming algorithm, the runtime is not decreased as "dq = -dt",
      but as "dq = -Uact dt" (where Uact is the per-runqueue active
      utilization).
      Hence, this commit modifies the runtime accounting rule in
      update_curr_dl() to implement the GRUB rule.
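
      The accounting rule itself is tiny (a sketch; rq->dl.running_bw holds
      Uact in BW_SHIFT fixed point):

      	static u64 grub_reclaim(u64 delta, struct rq *rq)
      	{
      		/* dq = -Uact dt */
      		return (delta * rq->dl.running_bw) >> BW_SHIFT;
      	}

      update_curr_dl() then charges grub_reclaim(delta_exec, rq) instead of
      the raw delta_exec.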
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-5-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c52f14d3
    • sched/deadline: Fix the update of the total -deadline utilization · 387e3130
      Committed by Luca Abeni
      Now that the inactive timer can be armed to fire at the 0-lag time,
      it is possible to use inactive_task_timer() to update the total
      -deadline utilization (dl_b->total_bw) at the correct time, fixing
      dl_overflow() and __setparam_dl().
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Claudio Scordino <claudio@evidence.eu.com>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-4-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      387e3130
    • sched/deadline: Improve the tracking of active utilization · 209a0cbd
      Committed by Luca Abeni
      This patch implements a more theoretically sound algorithm for
      tracking active utilization: instead of decreasing it when a
      task blocks, use a timer (the "inactive timer", named after the
      "Inactive" task state of the GRUB algorithm) to decrease the
      active utilization at the so-called "0-lag time".
      Tested-by: Claudio Scordino <claudio@evidence.eu.com>
      Tested-by: Daniel Bristot de Oliveira <bristot@redhat.com>
      Signed-off-by: Luca Abeni <luca.abeni@santannapisa.it>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Juri Lelli <juri.lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mathieu Poirier <mathieu.poirier@linaro.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tommaso Cucinotta <tommaso.cucinotta@sssup.it>
      Link: http://lkml.kernel.org/r/1495138417-6203-3-git-send-email-luca.abeni@santannapisa.it
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      209a0cbd