1. 27 6月, 2016 1 次提交
    • P
      sched/fair: Fix PELT integrity for new tasks · 7dc603c9
      Peter Zijlstra 提交于
      Vincent and Yuyang found another few scenarios in which entity
      tracking goes wobbly.
      
      The scenarios are basically due to the fact that new tasks are not
      immediately attached and thereby differ from the normal situation -- a
      task is always attached to a cfs_rq load average (such that it
      includes its blocked contribution) and are explicitly
      detached/attached on migration to another cfs_rq.
      
      Scenario 1: switch to fair class
      
        p->sched_class = fair_class;
        if (queued)
          enqueue_task(p);
            ...
              enqueue_entity()
      	  enqueue_entity_load_avg()
      	    migrated = !sa->last_update_time (true)
      	    if (migrated)
      	      attach_entity_load_avg()
        check_class_changed()
          switched_from() (!fair)
          switched_to()   (fair)
            switched_to_fair()
              attach_entity_load_avg()
      
      If @p is a new task that hasn't been fair before, it will have
      !last_update_time and, per the above, end up in
      attach_entity_load_avg() _twice_.
      
      Scenario 2: change between cgroups
      
        sched_move_group(p)
          if (queued)
            dequeue_task()
          task_move_group_fair()
            detach_task_cfs_rq()
              detach_entity_load_avg()
            set_task_rq()
            attach_task_cfs_rq()
              attach_entity_load_avg()
          if (queued)
            enqueue_task();
              ...
                enqueue_entity()
      	    enqueue_entity_load_avg()
      	      migrated = !sa->last_update_time (true)
      	      if (migrated)
      	        attach_entity_load_avg()
      
      Similar as with scenario 1, if @p is a new task, it will have
      !load_update_time and we'll end up in attach_entity_load_avg()
      _twice_.
      
      Furthermore, notice how we do a detach_entity_load_avg() on something
      that wasn't attached to begin with.
      
      As stated above; the problem is that the new task isn't yet attached
      to the load tracking and thereby violates the invariant assumption.
      
      This patch remedies this by ensuring a new task is indeed properly
      attached to the load tracking on creation, through
      post_init_entity_util_avg().
      
      Of course, this isn't entirely as straightforward as one might think,
      since the task is hashed before we call wake_up_new_task() and thus
      can be poked at. We avoid this by adding TASK_NEW and teaching
      cpu_cgroup_can_attach() to refuse such tasks.
      Reported-by: NYuyang Du <yuyang.du@intel.com>
      Reported-by: NVincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      7dc603c9
  2. 25 6月, 2016 1 次提交
    • L
      Clarify naming of thread info/stack allocators · b235beea
      Linus Torvalds 提交于
      We've had the thread info allocated together with the thread stack for
      most architectures for a long time (since the thread_info was split off
      from the task struct), but that is about to change.
      
      But the patches that move the thread info to be off-stack (and a part of
      the task struct instead) made it clear how confused the allocator and
      freeing functions are.
      
      Because the common case was that we share an allocation with the thread
      stack and the thread_info, the two pointers were identical.  That
      identity then meant that we would have things like
      
      	ti = alloc_thread_info_node(tsk, node);
      	...
      	tsk->stack = ti;
      
      which certainly _worked_ (since stack and thread_info have the same
      value), but is rather confusing: why are we assigning a thread_info to
      the stack? And if we move the thread_info away, the "confusing" code
      just gets to be entirely bogus.
      
      So remove all this confusion, and make it clear that we are doing the
      stack allocation by renaming and clarifying the function names to be
      about the stack.  The fact that the thread_info then shares the
      allocation is an implementation detail, and not really about the
      allocation itself.
      
      This is a pure renaming and type fix: we pass in the same pointer, it's
      just that we clarify what the pointer means.
      
      The ia64 code that actually only has one single allocation (for all of
      task_struct, thread_info and kernel thread stack) now looks a bit odd,
      but since "tsk->stack" is actually not even used there, that oddity
      doesn't matter.  It would be a separate thing to clean that up, I
      intentionally left the ia64 changes as a pure brute-force renaming and
      type change.
      Acked-by: NAndy Lutomirski <luto@amacapital.net>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b235beea
  3. 03 6月, 2016 1 次提交
    • O
      sched/api: Introduce task_rcu_dereference() and try_get_task_struct() · 150593bf
      Oleg Nesterov 提交于
      Generally task_struct is only protected by RCU if it was found on a
      RCU protected list (say, for_each_process() or find_task_by_vpid()).
      
      As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
      drops the (potentially) last reference without RCU gp, this means
      that we need to fix the code which uses foreign_rq->curr under
      rcu_read_lock().
      
      Add a new helper which can be used to dereference rq->curr or any
      other pointer to task_struct assuming that it should be cleared or
      updated before the final put_task_struct(). It returns non-NULL
      only if this task can't go away before rcu_read_unlock().
      
      ( Also add try_get_task_struct() to make it easier to use this API
        correctly. )
      Suggested-by: NKirill Tkhai <ktkhai@parallels.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      [ Updated comments; added try_get_task_struct()]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Kirill Tkhai <tkhai@yandex.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      150593bf
  4. 27 5月, 2016 1 次提交
  5. 25 5月, 2016 1 次提交
    • P
      sched/core: Fix remote wakeups · b7e7ade3
      Peter Zijlstra 提交于
      Commit:
      
        b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      
      ... introduced a bug: Mike Galbraith found that it introduced a
      performance regression, while Paul E. McKenney reported lost
      wakeups and bisected it to this commit.
      
      The reason is that I mis-read ttwu_queue() such that I assumed any
      wakeup that got a remote queue must have had the task migrated.
      
      Since this is not so; we need to transfer this information between
      queueing the wakeup and actually doing the wakeup. Use a new
      task_struct::sched_flag for this, we already write to
      sched_contributes_to_load in the wakeup path so this is a hot and
      modified cacheline.
      Reported-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reported-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      Tested-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Hunter <ahh@google.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Pavan Kondeti <pkondeti@codeaurora.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: byungchul.park@lge.com
      Fixes: b5179ac7 ("sched/fair: Prepare to fix fairness problems on migration")
      Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      b7e7ade3
  6. 24 5月, 2016 1 次提交
    • T
      signal: make oom_flags a bool · c96fc2d8
      Tetsuo Handa 提交于
      Currently the size of "struct signal_struct"->oom_flags member is
      sizeof(unsigned) bytes, but only one flag OOM_FLAG_ORIGIN which is
      updated by current thread is defined.  We can convert OOM_FLAG_ORIGIN
      into a bool, and reuse the saved bytes for updating from the OOM killer
      and/or the OOM reaper thread.
      
      By the way, do we care about a race window between run_store() and
      swapoff() because it would be theoretically possible that two threads
      sharing the "struct signal_struct" concurrently call respective
      functions? If we care, we can make oom_flags an atomic_t.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c96fc2d8
  7. 21 5月, 2016 6 次提交
    • J
      exit_thread: accept a task parameter to be exited · e6464694
      Jiri Slaby 提交于
      We need to call exit_thread from copy_process in a fail path.  So make it
      accept task_struct as a parameter.
      
      [v2]
      * s390: exit_thread_runtime_instr doesn't make sense to be called for
        non-current tasks.
      * arm: fix the comment in vfp_thread_copy
      * change 'me' to 'tsk' for task_struct
      * now we can change only archs that actually have exit_thread
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e6464694
    • J
      exit_thread: remove empty bodies · 5f56a5df
      Jiri Slaby 提交于
      Define HAVE_EXIT_THREAD for archs which want to do something in
      exit_thread. For others, let's define exit_thread as an empty inline.
      
      This is a cleanup before we change the prototype of exit_thread to
      accept a task parameter.
      
      [akpm@linux-foundation.org: fix mips]
      Signed-off-by: NJiri Slaby <jslaby@suse.cz>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Chris Metcalf <cmetcalf@mellanox.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Koichi Yasutake <yasutake.koichi@jp.panasonic.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5f56a5df
    • O
      userfaultfd: don't pin the user memory in userfaultfd_file_create() · d2005e3f
      Oleg Nesterov 提交于
      userfaultfd_file_create() increments mm->mm_users; this means that the
      memory won't be unmapped/freed if mm owner exits/execs, and UFFDIO_COPY
      after that can populate the orphaned mm more.
      
      Change userfaultfd_file_create() and userfaultfd_ctx_put() to use
      mm->mm_count to pin mm_struct.  This means that
      atomic_inc_not_zero(mm->mm_users) is needed when we are going to
      actually play with this memory.  Except handle_userfault() path doesn't
      need this, the caller must already have a reference.
      
      The patch adds the new trivial helper, mmget_not_zero(), it can have
      more users.
      
      Link: http://lkml.kernel.org/r/20160516172254.GA8595@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d2005e3f
    • T
      mm,oom: speed up select_bad_process() loop · f44666b0
      Tetsuo Handa 提交于
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct it was quite inefficient because
      each thread group was scanned num_threads times which can be a lot
      especially with processes with many threads.  Even though the OOM is
      extremely cold path it is always good to be as effective as possible
      when we are inside rcu_read_lock() - aka unpreemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and check it at
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f44666b0
    • M
      mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Michal Hocko 提交于
      Tetsuo has properly noted that mmput slow path might get blocked waiting
      for another party (e.g.  exit_aio waits for an IO).  If that happens the
      oom_reaper would be put out of the way and will not be able to process
      next oom victim.  We should strive for making this context as reliable
      and independent on other subsystems as much as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
      
      The issue is only theoretical but not impossible.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec8d7c14
    • M
      mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Michal Hocko 提交于
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for oom reaped task
      but also set OOM_SCORE_ADJ_MIN for the target task to hide it from the
      oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share signal struct).  If the mm had only small amount
      of memory which could be reaped then another task sharing the mm could
      be selected and that wouldn't help to move out from the oom situation.
      
      Introduce MMF_OOM_REAPED mm flag which is checked in oom_badness (same
      as OOM_SCORE_ADJ_MIN) and task is skipped if the flag is set.  Set the
      flag after __oom_reap_task is done with a task.  This will force the
      select_bad_process() to ignore all already oom reaped tasks as well as
      no such task is sacrificed for its parent.
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb8a4b7f
  8. 12 5月, 2016 1 次提交
  9. 06 5月, 2016 3 次提交
  10. 05 5月, 2016 2 次提交
  11. 04 5月, 2016 1 次提交
    • A
      signals/sigaltstack: If SS_AUTODISARM, bypass on_sig_stack() · c876eeab
      Andy Lutomirski 提交于
      If a signal stack is set up with SS_AUTODISARM, then the kernel
      inherently avoids incorrectly resetting the signal stack if signals
      recurse: the signal stack will be reset on the first signal
      delivery.  This means that we don't need check the stack pointer
      when delivering signals if SS_AUTODISARM is set.
      
      This will make segmented x86 programs more robust: currently there's
      a hole that could be triggered if ESP/RSP appears to point to the
      signal stack but actually doesn't due to a nonzero SS base.
      Signed-off-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stas Sergeev <stsp@list.ru>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Link: http://lkml.kernel.org/r/c46bee4654ca9e68c498462fd11746e2bd0d98c8.1462296606.git.luto@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      c876eeab
  12. 03 5月, 2016 1 次提交
    • S
      signals/sigaltstack: Implement SS_AUTODISARM flag · 2a742138
      Stas Sergeev 提交于
      This patch implements the SS_AUTODISARM flag that can be OR-ed with
      SS_ONSTACK when forming ss_flags.
      
      When this flag is set, sigaltstack will be disabled when entering
      the signal handler; more precisely, after saving sas to uc_stack.
      When leaving the signal handler, the sigaltstack is restored by
      uc_stack.
      
      When this flag is used, it is safe to switch from sighandler with
      swapcontext(). Without this flag, the subsequent signal will corrupt
      the state of the switched-away sighandler.
      
      To detect the support of this functionality, one can do:
      
        err = sigaltstack(SS_DISABLE | SS_AUTODISARM);
        if (err && errno == EINVAL)
      	unsupported();
      Signed-off-by: NStas Sergeev <stsp@list.ru>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Amanieu d'Antras <amanieu@gmail.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brian Gerst <brgerst@gmail.com>
      Cc: Denys Vlasenko <dvlasenk@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Heinrich Schuchardt <xypron.glpk@gmx.de>
      Cc: Jason Low <jason.low2@hp.com>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Palmer Dabbelt <palmer@dabbelt.com>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: linux-api@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Link: http://lkml.kernel.org/r/1460665206-13646-4-git-send-email-stsp@list.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2a742138
  13. 23 4月, 2016 2 次提交
    • F
      sched/fair: Correctly handle nohz ticks CPU load accounting · 1f41906a
      Frederic Weisbecker 提交于
      Ticks can happen while the CPU is in dynticks-idle or dynticks-singletask
      mode. In fact "nohz" or "dynticks" only mean that we exit the periodic
      mode and we try to minimize the ticks as much as possible. The nohz
      subsystem uses a confusing terminology with the internal state
      "ts->tick_stopped" which is also available through its public interface
      with tick_nohz_tick_stopped(). This is a misnomer as the tick is instead
      reduced with the best effort rather than stopped. In the best case the
      tick can indeed be actually stopped but there is no guarantee about that.
      If a timer needs to fire one second later, a tick will fire while the
      CPU is in nohz mode and this is a very common scenario.
      
      Now this confusion happens to be a problem with CPU load updates:
      cpu_load_update_active() doesn't handle nohz ticks correctly because it
      assumes that ticks are completely stopped in nohz mode and that
      cpu_load_update_active() can't be called in dynticks mode. When that
      happens, the whole previous tickless load is ignored and the function
      just records the load for the current tick, ignoring potentially long
      idle periods behind.
      
      In order to solve this, we could account the current load for the
      previous nohz time but there is a risk that we account the load of a
      task that got freshly enqueued for the whole nohz period.
      
      So instead, lets record the dynticks load on nohz frame entry so we know
      what to record in case of nohz ticks, then use this record to account
      the tickless load on nohz ticks and nohz frame end.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460555812-25375-3-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1f41906a
    • F
      sched/fair: Gather CPU load functions under a more conventional namespace · cee1afce
      Frederic Weisbecker 提交于
      The CPU load update related functions have a weak naming convention
      currently, starting with update_cpu_load_*() which isn't ideal as
      "update" is a very generic concept.
      
      Since two of these functions are public already (and a third is to come)
      that's enough to introduce a more conventional naming scheme. So let's
      do the following rename instead:
      
      	update_cpu_load_*() -> cpu_load_update_*()
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Byungchul Park <byungchul.park@lge.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul E . McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460555812-25375-2-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cee1afce
  14. 13 4月, 2016 1 次提交
    • D
      sched/clock: Make local_clock()/cpu_clock() inline · 2c923e94
      Daniel Lezcano 提交于
      The local_clock/cpu_clock functions were changed to prevent a double
      identical test with sched_clock_cpu() when HAVE_UNSTABLE_SCHED_CLOCK
      is set. That resulted in one line functions.
      
      As these functions are in all the cases one line functions and in the
      hot path, it is useful to specify them as static inline in order to
      give a strong hint to the compiler.
      
      After verification, it appears the compiler does not inline them
      without this hint. Change those functions to static inline.
      
      sched_clock_cpu() is called via the inlined local_clock()/cpu_clock()
      functions from sched.h. So any module code including sched.h will
      reference sched_clock_cpu(). Thus it must be exported with the
      EXPORT_SYMBOL_GPL macro.
      Signed-off-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1460385514-14700-2-git-send-email-daniel.lezcano@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      2c923e94
  15. 05 4月, 2016 1 次提交
    • T
      android,lowmemorykiller: Don't abuse TIF_MEMDIE. · 77ed2c57
      Tetsuo Handa 提交于
      Currently, lowmemorykiller (LMK) is using TIF_MEMDIE for two purposes.
      One is to remember processes killed by LMK, and the other is to
      accelerate termination of processes killed by LMK.
      
      But since LMK is invoked as a memory shrinker function, there still
      should be some memory available. It is very likely that memory
      allocations by processes killed by LMK will succeed without using
      ALLOC_NO_WATERMARKS via TIF_MEMDIE. Even if their allocations cannot
      escape from memory allocation loop unless they use ALLOC_NO_WATERMARKS,
      lowmem_deathpending_timeout can guarantee forward progress by choosing
      next victim process.
      
      On the other hand, mark_oom_victim() assumes that it must be called with
      oom_lock held and it must not be called after oom_killer_disable() was
      called. But LMK is calling it without holding oom_lock and checking
      oom_killer_disabled. It is possible that LMK calls mark_oom_victim()
      due to allocation requests by kernel threads after current thread
      returned from oom_killer_disabled(). This will break synchronization
      for PM/suspend.
      
      This patch introduces per a task_struct flag for remembering processes
      killed by LMK, and replaces TIF_MEMDIE with that flag. By applying this
      patch, assumption by mark_oom_victim() becomes true.
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Arve Hjonnevag <arve@android.com>
      Cc: Riley Andrews <riandrews@android.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      77ed2c57
  16. 02 4月, 2016 1 次提交
  17. 01 4月, 2016 1 次提交
  18. 29 3月, 2016 1 次提交
  19. 26 3月, 2016 5 次提交
  20. 23 3月, 2016 2 次提交
    • H
      parisc,metag: Implement CONFIG_DEBUG_STACK_USAGE option · 6c31da34
      Helge Deller 提交于
      On parisc and metag the stack grows upwards, so for those we need to
      scan the stack downwards in order to calculate how much stack a process
      has used.
      
      Tested on a 64bit parisc kernel.
      Signed-off-by: NHelge Deller <deller@gmx.de>
      6c31da34
    • D
      kernel: add kcov code coverage · 5c9a8750
      Dmitry Vyukov 提交于
      kcov provides code coverage collection for coverage-guided fuzzing
      (randomized testing).  Coverage-guided fuzzing is a testing technique
      that uses coverage feedback to determine new interesting inputs to a
      system.  A notable user-space example is AFL
      (http://lcamtuf.coredump.cx/afl/).  However, this technique is not
      widely used for kernel testing due to missing compiler and kernel
      support.
      
      kcov does not aim to collect as much coverage as possible.  It aims to
      collect more or less stable coverage that is function of syscall inputs.
      To achieve this goal it does not collect coverage in soft/hard
      interrupts and instrumentation of some inherently non-deterministic or
      non-interesting parts of kernel is disbled (e.g.  scheduler, locking).
      
      Currently there is a single coverage collection mode (tracing), but the
      API anticipates additional collection modes.  Initially I also
      implemented a second mode which exposes coverage in a fixed-size hash
      table of counters (what Quentin used in his original patch).  I've
      dropped the second mode for simplicity.
      
      This patch adds the necessary support on kernel side.  The complimentary
      compiler support was added in gcc revision 231296.
      
      We've used this support to build syzkaller system call fuzzer, which has
      found 90 kernel bugs in just 2 months:
      
        https://github.com/google/syzkaller/wiki/Found-Bugs
      
      We've also found 30+ bugs in our internal systems with syzkaller.
      Another (yet unexplored) direction where kcov coverage would greatly
      help is more traditional "blob mutation".  For example, mounting a
      random blob as a filesystem, or receiving a random blob over wire.
      
      Why not gcov.  Typical fuzzing loop looks as follows: (1) reset
      coverage, (2) execute a bit of code, (3) collect coverage, repeat.  A
      typical coverage can be just a dozen of basic blocks (e.g.  an invalid
      input).  In such context gcov becomes prohibitively expensive as
      reset/collect coverage steps depend on total number of basic
      blocks/edges in program (in case of kernel it is about 2M).  Cost of
      kcov depends only on number of executed basic blocks/edges.  On top of
      that, kernel requires per-thread coverage because there are always
      background threads and unrelated processes that also produce coverage.
      With inlined gcov instrumentation per-thread coverage is not possible.
      
      kcov exposes kernel PCs and control flow to user-space which is
      insecure.  But debugfs should not be mapped as user accessible.
      
      Based on a patch by Quentin Casasnovas.
      
      [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
      [akpm@linux-foundation.org: unbreak allmodconfig]
      [akpm@linux-foundation.org: follow x86 Makefile layout standards]
      Signed-off-by: NDmitry Vyukov <dvyukov@google.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: syzkaller <syzkaller@googlegroups.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Tavis Ormandy <taviso@google.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@google.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: David Drysdale <drysdale@google.com>
      Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5c9a8750
  21. 18 3月, 2016 1 次提交
    • J
      timer: convert timer_slack_ns from unsigned long to u64 · da8b44d5
      John Stultz 提交于
      This patchset introduces a /proc/<pid>/timerslack_ns interface which
      would allow controlling processes to be able to set the timerslack value
      on other processes in order to save power by avoiding wakeups (Something
      Android currently does via out-of-tree patches).
      
      The first patch tries to fix the internal timer_slack_ns usage which was
      defined as a long, which limits the slack range to ~4 seconds on 32bit
      systems.  It converts it to a u64, which provides the same basically
      unlimited slack (500 years) on both 32bit and 64bit machines.
      
      The second patch introduces the /proc/<pid>/timerslack_ns interface
      which allows the full 64bit slack range for a task to be read or set on
      both 32bit and 64bit machines.
      
      With these two patches, on a 32bit machine, after setting the slack on
      bash to 10 seconds:
      
      $ time sleep 1
      
      real    0m10.747s
      user    0m0.001s
      sys     0m0.005s
      
      The first patch is a little ugly, since I had to chase the slack delta
      arguments through a number of functions converting them to u64s.  Let me
      know if it makes sense to break that up more or not.
      
      Other than that things are fairly straightforward.
      
      This patch (of 2):
      
      The timer_slack_ns value in the task struct is currently a unsigned
      long.  This means that on 32bit applications, the maximum slack is just
      over 4 seconds.  However, on 64bit machines, its much much larger (~500
      years).
      
      This disparity could make application development a little (as well as
      the default_slack) to a u64.  This means both 32bit and 64bit systems
      have the same effective internal slack range.
      
      Now the existing ABI via PR_GET_TIMERSLACK and PR_SET_TIMERSLACK specify
      the interface as a unsigned long, so we preserve that limitation on
      32bit systems, where SET_TIMERSLACK can only set the slack to a unsigned
      long value, and GET_TIMERSLACK will return ULONG_MAX if the slack is
      actually larger then what can be stored by an unsigned long.
      
      This patch also modifies hrtimer functions which specified the slack
      delta as a unsigned long.
      Signed-off-by: NJohn Stultz <john.stultz@linaro.org>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Oren Laadan <orenl@cellrox.com>
      Cc: Ruchi Kandoi <kandoiruchi@google.com>
      Cc: Rom Lemarchand <romlem@android.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Android Kernel Team <kernel-team@android.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da8b44d5
  22. 11 3月, 2016 1 次提交
  23. 08 3月, 2016 1 次提交
  24. 02 3月, 2016 2 次提交
    • F
      sched: Migrate sched to use new tick dependency mask model · 76d92ac3
      Frederic Weisbecker 提交于
      Instead of providing asynchronous checks for the nohz subsystem to verify
      sched tick dependency, migrate sched to the new mask.
      
      Everytime a task is enqueued or dequeued, we evaluate the state of the
      tick dependency on top of the policy of the tasks in the runqueue, by
      order of priority:
      
      SCHED_DEADLINE: Need the tick in order to periodically check for runtime
      SCHED_FIFO    : Don't need the tick (no round-robin)
      SCHED_RR      : Need the tick if more than 1 task of the same priority
                      for round robin (simplified with checking if more than
                      one SCHED_RR task no matter what priority).
      SCHED_NORMAL  : Need the tick if more than 1 task for round-robin.
      
      We could optimize that further with one flag per sched policy on the tick
      dependency mask and perform only the checks relevant to the policy
      concerned by an enqueue/dequeue operation.
      
      Since the checks aren't based on the current task anymore, we could get
      rid of the task switch hook but it's still needed for posix cpu
      timers.
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      76d92ac3
    • F
      nohz: New tick dependency mask · d027d45d
      Frederic Weisbecker 提交于
      The tick dependency is evaluated on every IRQ and context switch. This
      consists is a batch of checks which determine whether it is safe to
      stop the tick or not. These checks are often split in many details:
      posix cpu timers, scheduler, sched clock, perf events.... each of which
      are made of smaller details: posix cpu timer involves checking process
      wide timers then thread wide timers. Perf involves checking freq events
      then more per cpu details.
      
      Checking these informations asynchronously every time we update the full
      dynticks state bring avoidable overhead and a messy layout.
      
      Let's introduce instead tick dependency masks: one for system wide
      dependency (unstable sched clock, freq based perf events), one for CPU
      wide dependency (sched, throttling perf events), and task/signal level
      dependencies (posix cpu timers). The subsystems are responsible
      for setting and clearing their dependency through a set of APIs that will
      take care of concurrent dependency mask modifications and kick targets
      to restart the relevant CPU tick whenever needed.
      
      This new dependency engine stays beside the old one until all subsystems
      having a tick dependency are converted to it.
      Suggested-by: NThomas Gleixner <tglx@linutronix.de>
      Suggested-by: NPeter Zijlstra <peterz@infradead.org>
      Reviewed-by: NChris Metcalf <cmetcalf@ezchip.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      d027d45d
  25. 29 2月, 2016 1 次提交