1. 04 Jul, 2015 - 2 commits
  2. 26 Jun, 2015 - 1 commit
    • clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Committed by Josh Triplett
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
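
      As a rough illustration of the resulting arch interface (a sketch based on
      the description above, not copied from any particular tree):

          #ifdef CONFIG_HAVE_COPY_THREAD_TLS
          /* Opt-in architectures receive tls as an ordinary C argument. */
          extern int copy_thread_tls(unsigned long clone_flags, unsigned long sp,
                                     unsigned long arg, struct task_struct *p,
                                     unsigned long tls);
          #else
          /* Legacy architectures keep the old hook and dig tls out of pt_regs. */
          extern int copy_thread(unsigned long clone_flags, unsigned long sp,
                                 unsigned long arg, struct task_struct *p);
          #endif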
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3033f14a
  3. 19 Jun, 2015 - 1 commit
    • timer: Reduce timer migration overhead if disabled · bc7a34b8
      Committed by Thomas Gleixner
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered using a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ
      mode, migration is enabled unless the user disabled it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
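
      The idea, as a hedged sketch (the field and helper names below are
      illustrative, not the upstream identifiers):

          struct timer_base_sketch {
                  raw_spinlock_t  lock;
                  bool            migration_enabled;  /* cached nohz/sysctl state */
          };

          /* hypothetical: choose a busy/housekeeping CPU for the timer */
          static int sketch_pick_idle_target(void);

          /* Checked at timer insertion: the flag sits in the per-CPU base that
           * is hot anyway, so no extra call or cold cache line is touched. */
          static int sketch_target_cpu(struct timer_base_sketch *base, int this_cpu)
          {
                  if (!base->migration_enabled)
                          return this_cpu;
                  return sketch_pick_idle_target();
          }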
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bc7a34b8
  4. 05 Jun, 2015 - 1 commit
    • signals: don't abuse __flush_signals() in selinux_bprm_committed_creds() · 9e7c8f8c
      Committed by Oleg Nesterov
      selinux_bprm_committed_creds()->__flush_signals() is not right, we
      shouldn't clear TIF_SIGPENDING unconditionally. There can be other
      reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
      potentially more.
      
      Also change this code to check fatal_signal_pending() rather than
      SIGNAL_GROUP_EXIT; it looks a bit better.
      
      Now we can kill __flush_signals() before it finds another buggy user.
      
      Note: this code looks racy, we can flush a signal which was sent after
      the task SID has been updated.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
      9e7c8f8c
  5. 27 May, 2015 - 2 commits
    • sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · d59cfc09
      Committed by Tejun Heo
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make the writer side heavier and lowers the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it, and cgroup migration operations don't take
      enough time for the lower granularity to matter.
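
      In rough terms the reader side becomes a thin wrapper around the global
      percpu_rwsem (a sketch of the shape, not the verbatim upstream code):

          /* global, lives in cgroup code rather than in signal_struct */
          static struct percpu_rw_semaphore cgroup_threadgroup_rwsem;

          /* fork/exit/exec paths: cheap per-CPU reader */
          static inline void cgroup_threadgroup_change_begin(struct task_struct *tsk)
          {
                  percpu_down_read(&cgroup_threadgroup_rwsem);
          }

          static inline void cgroup_threadgroup_change_end(struct task_struct *tsk)
          {
                  percpu_up_read(&cgroup_threadgroup_rwsem);
          }

          /* cgroup migration (cold path): exclusive across all processes */
          percpu_down_write(&cgroup_threadgroup_rwsem);
          /* ... move tasks between cgroups ... */
          percpu_up_write(&cgroup_threadgroup_rwsem);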
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      d59cfc09
    • sched, cgroup: reorganize threadgroup locking · 7d7efec3
      Committed by Tejun Heo
      threadgroup_change_begin/end() are used to mark the beginning and end
      of threadgroup modifying operations to allow code paths which require
      a threadgroup to stay stable across blocking operations to synchronize
      against those sections using threadgroup_lock/unlock().
      
      It's currently implemented as a general mechanism in sched.h using a
      per-signal_struct rwsem; however, this never grew non-cgroup use cases
      and becomes a noop if !CONFIG_CGROUPS.  It turns out that cgroups is
      going to be better served with a different synchronization scheme, and it
      is a bit silly to keep cgroup-specific details in a general mechanism.
      
      What's general here is identifying the places where threadgroups are
      modified.  This patch restructures threadgroup locking so that
      threadgroup_change_begin/end() become a place where subsystems which
      need to synchronize against threadgroup changes can hook into.
      
      cgroup_threadgroup_change_begin/end() which operate on the
      per-signal_struct rwsem are created and threadgroup_lock/unlock() are
      moved to cgroup.c and made static.
      
      This is pure reorganization which doesn't cause any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      7d7efec3
  6. 19 May, 2015 - 4 commits
    • sched/wait: Introduce TASK_NOLOAD and TASK_IDLE · 80ed87c8
      Committed by Peter Zijlstra
      Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
      'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
      all idle kthreads contribute to the loadavg is somewhat silly.
      
      Now mostly this works OK, because kthreads have all their signals
      masked. However there are a few sites where this is causing problems and
      TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
      
      This patch adds TASK_NOLOAD which, when combined with
      TASK_UNINTERRUPTIBLE, avoids the loadavg accounting.
      
      As most of the imagined usage sites are loops where a thread wants to
      idle, waiting for work, a helper TASK_IDLE is introduced.
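
      In essence TASK_IDLE is just the combination of the two flags; usage in a
      kthread looks roughly like this (a sketch, not a specific upstream hunk):

          #define TASK_IDLE       (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

          /* idle without signals and without inflating the load average */
          set_current_state(TASK_IDLE);
          if (!kthread_should_stop())
                  schedule();
          __set_current_state(TASK_RUNNING);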
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      80ed87c8
    • sched/preempt, mm/fault: Count pagefault_disable() levels in pagefault_disabled · 8bcbde54
      Committed by David Hildenbrand
      Until now, pagefault_disable()/pagefault_enable() used the preempt
      count to track whether we are in an environment with pagefaults disabled
      (which can be queried via in_atomic()).
      
      This patch introduces a separate counter in task_struct to count the
      level of pagefault_disable() calls. We'll keep manipulating the preempt
      count to retain compatibility to existing pagefault handlers.
      
      It is now possible to verify whether we are in a pagefault_disable()
      environment by calling pagefault_disabled(). In contrast to in_atomic(),
      it will not be influenced by preempt_enable()/preempt_disable().
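
      Conceptually the new accounting looks like this (a hedged sketch; the real
      helpers differ in detail and, as noted above, still touch the preempt count):

          /* assumes a new 'int pagefault_disabled' counter in task_struct */
          static inline void pagefault_disable(void)
          {
                  current->pagefault_disabled++;
                  barrier();              /* order the store before any fault */
                  preempt_disable();      /* kept for existing fault handlers */
          }

          static inline void pagefault_enable(void)
          {
                  preempt_enable();
                  barrier();
                  current->pagefault_disabled--;
          }

          static inline bool pagefault_disabled(void)
          {
                  return current->pagefault_disabled != 0;
          }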
      
      This patch is based on a patch from Ingo Molnar.
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-2-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8bcbde54
    • sched/preempt: Merge preempt_mask.h into preempt.h · 92cf2118
      Committed by Frederic Weisbecker
      preempt_mask.h defines all the preempt_count semantics and related
      symbols: preempt, softirq, hardirq, nmi, preempt active, need resched,
      etc...
      
      preempt.h defines the accessors and mutators of preempt_count.
      
      But there is a messy dependency game around those two header files:
      
      	* preempt_mask.h includes preempt.h in order to access preempt_count()
      
      	* preempt_mask.h defines all preempt_count semantics and symbols
      	  except PREEMPT_NEED_RESCHED, which is needed by asm/preempt.h.
      	  Thus we need to define it from preempt.h, right before including
      	  asm/preempt.h, instead of defining it in preempt_mask.h with the
      	  other preempt_count symbols. Therefore the preempt_count semantics
      	  end up being spread out.
      
      	* We plan to introduce preempt_active_[enter,exit]() to consolidate
      	  preempt_schedule*() code. But we'll need to access both preempt_count
      	  mutators (preempt_count_add()) and preempt_count symbols
      	  (PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt
      	  operations is in preempt.h but then we'll need symbols in
      	  preempt_mask.h, which already includes preempt.h. So we end up with
      	  a circular resource dependency.
      
      Let's merge preempt_mask.h into preempt.h to solve these dependency issues.
      This way we gather semantic symbols and operation definition of
      preempt_count in a single file.
      
      This is a dumb copy-paste merge. Further merge rearrangements are
      performed in a subsequent patch to ease review.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      92cf2118
    • locking/arch: Rename set_mb() to smp_store_mb() · b92b8b35
      Committed by Peter Zijlstra
      Since set_mb() is really about an smp_mb() -- not an IO/DMA barrier
      like mb() -- rename it to match the recent smp_load_acquire() and
      smp_store_release().
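
      On the generic side the definition amounts to something like this (exact
      per-arch definitions vary; this is a sketch):

          /* store, then full SMP memory barrier -- the old set_mb() semantics */
          #ifndef smp_store_mb
          #define smp_store_mb(var, value) \
                  do { WRITE_ONCE(var, value); smp_mb(); } while (0)
          #endif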
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b92b8b35
  7. 15 May, 2015 - 2 commits
  8. 11 May, 2015 - 1 commit
    • VFS: replace {, total_}link_count in task_struct with pointer to nameidata · 756daf26
      Committed by NeilBrown
      task_struct currently contains two ad-hoc members for use by the VFS:
      link_count and total_link_count.  These are only interesting to fs/namei.c,
      so exposing them explicitly is poor layering.  Incidentally, link_count
      isn't used anymore, so it can just die.
      
      This patch replaces those with a single pointer to 'struct nameidata'.
      This structure represents the current filename lookup of which
      there can only be one per process, and is a natural place to
      store total_link_count.
      
      This will allow the current "nameidata" argument to all
      follow_link operations to be removed as current->nameidata
      can be used instead in the _very_ few instances that care about
      it at all.
      
      As there are occasional circumstances where pathname lookup can
      recurse, such as through kern_path_locked, we always save any old
      current->nameidata (if there is one) when setting a new value, and
      make sure any active link_counts are preserved.
      
      follow_mount and follow_automount now get a 'struct nameidata *'
      rather than 'int flags' so that they can directly access
      total_link_count, rather than going through 'current'.
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      756daf26
  9. 10 May, 2015 - 1 commit
  10. 08 May, 2015 - 9 commits
    • perf: Fix software migrate events · ff303e66
      Committed by Peter Zijlstra
      Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
      was broken:
      
       > The problem is that the task isn't actually scheduled while it's being
       > migrated (obviously), and if it's not scheduled, the counters aren't
       > scheduled either, so there's no observing of the fact.
       >
       > A further problem with migrations is that many migrations happen from
       > softirq context, which is nested inside the 'random' task context of
       > whomever happens to run at that time, similarly for the wakeup
       > migrations triggered from (soft)irq context. All those end up being
       > accounted in the task that's currently running, e.g. your 'ls'.
      
      The below cures this by marking a task as migrated and accounting it
      on the subsequent sched_in().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ff303e66
    • sched: Implement lockless wake-queues · 76751049
      Committed by Peter Zijlstra
      This is useful for locking primitives that can effect multiple
      wakeups per operation and want to avoid contention on the lock's
      internal locks by delaying the wakeups until we've released those
      internal locks.
      
      Alternatively it can be used to avoid issuing multiple wakeups, and
      thus save a few cycles, in packet processing. Queue all target tasks
      and wakeup once you've processed all packets. That way you avoid
      waking the target task multiple times if there were multiple packets
      for the same task.
      
      Properties of a wake_q are:
      - Lockless, as the queue head must reside on the stack.
      - Being a queue, it maintains the wakeup order passed by the callers. This
        can be important because otherwise, in scenarios with highly contended
        locks, any reliance on lock fairness could be affected.
      - A queued task cannot be added again until it is woken up.
      
      This patch adds the needed infrastructure into the scheduler code
      and uses the new wake_list to delay the futex wakeups until
      after we've released the hash bucket locks.
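
      A typical usage pattern under a lock looks like this (a sketch; 'hb->lock'
      stands in for the futex hash-bucket lock, and the helper names follow the
      wake_q API described above):

          WAKE_Q(wake_q);                 /* on-stack queue head */

          spin_lock(&hb->lock);
          /* decide whom to wake, but do not wake yet */
          wake_q_add(&wake_q, task);      /* takes a task reference, keeps order */
          spin_unlock(&hb->lock);

          wake_up_q(&wake_q);             /* issue the wakeups outside the lock */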
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [tweaks, adjustments, comments, etc.]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: George Spelvin <linux@horizon.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      76751049
    • sched, timer: Use the atomic task_cputime in thread_group_cputimer · 71107445
      Committed by Jason Low
      Recent optimizations were made to thread_group_cputimer to improve its
      scalability by keeping track of cputime stats without a lock. However,
      the values were open coded to the structure, causing them to be at
      a different abstraction level from the regular task_cputime structure.
      Furthermore, any subsequent similar optimizations would not be able to
      share the new code, since they are specific to thread_group_cputimer.
      
      This patch adds the new task_cputime_atomic data structure (introduced in
      the previous patch in the series) to thread_group_cputimer for keeping
      track of the cputime atomically, which also helps generalize the code.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71107445
    • sched, timer: Provide an atomic 'struct task_cputime' data structure · 971e8a98
      Committed by Jason Low
      This patch adds an atomic variant of the 'struct task_cputime' data structure,
      which can be used to store and update task_cputime statistics without
      needing to do locking.
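
      The atomic variant mirrors the regular structure field for field, roughly:

          struct task_cputime_atomic {
                  atomic64_t utime;               /* user CPU time */
                  atomic64_t stime;               /* system CPU time */
                  atomic64_t sum_exec_runtime;    /* total scheduled runtime */
          };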
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Link: http://lkml.kernel.org/r/1430251224-5764-5-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      971e8a98
    • sched, timer: Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability · 1018016c
      Committed by Jason Low
      While running a database workload, we found a scalability issue with itimers.
      
      Much of the problem was caused by the thread_group_cputimer spinlock.
      Each time we account for group system/user time, we need to obtain a
      thread_group_cputimer's spinlock to update the timers. On larger systems
      (such as a 16 socket machine), this caused more than 30% of total time
      spent trying to obtain this kernel lock to update these group timer stats.
      
      This patch converts the timers to 64-bit atomic variables and uses
      atomic add to update them without a lock. With this patch, the percent
      of total time spent updating thread group cputimer timers was reduced
      from 30% down to less than 1%.
      
      Note: On 32-bit systems using the generic 64-bit atomics, this causes
      sample_group_cputimer() to take locks 3 times instead of just 1 time.
      However, we tested this patch on a 32-bit ARM system using the
      generic atomics and did not find the overhead to be much of an issue.
      An explanation for why this isn't an issue is that 32-bit systems usually
      have small numbers of CPUs, and cacheline contention from extra spinlocks
      called periodically is not really apparent on smaller systems.
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1018016c
    • sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() · 316c1608
      Committed by Jason Low
      ACCESS_ONCE() doesn't work reliably on non-scalar types. This patch removes
      the rest of the existing usages of ACCESS_ONCE() in the scheduler and uses
      the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.
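
      The conversions are mechanical; an illustrative (not verbatim) hunk:

          /* before */
          u64 stamp = ACCESS_ONCE(rq->idle_stamp);
          ACCESS_ONCE(rq->idle_stamp) = 0;

          /* after */
          u64 stamp = READ_ONCE(rq->idle_stamp);
          WRITE_ONCE(rq->idle_stamp, 0);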
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      316c1608
    • signals, ptrace, sched: Fix a misaligned load inside ptrace_attach() · e7cc4173
      Committed by Palmer Dabbelt
      The misaligned load exception arises when running ptrace_attach() on
      the RISC-V (which hasn't been upstreamed yet).  The problem is that
      wait_on_bit() takes a void* but then proceeds to call test_bit(),
      which takes a long*.  This allows an int-aligned pointer to be passed
      to test_bit(), which promptly fails.  This will manifest on any other
      asm-generic port where unaligned loads trap, where sizeof(long) >
      sizeof(int), and where task_struct.jobctl ends up not being
      long-aligned.
      
      This patch changes task_struct.jobctl to be a long, which ensures it
      has the correct alignment.
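
      The task_struct change itself is tiny; roughly:

          /* before */
          unsigned int            jobctl;         /* JOBCTL_*, siglock protected */

          /* after */
          unsigned long           jobctl;         /* JOBCTL_*, siglock protected */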
      Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bobby.prani@gmail.com
      Cc: oleg@redhat.com
      Cc: paulmck@linux.vnet.ibm.com
      Cc: richard@nod.at
      Cc: vdavydov@parallels.com
      Link: http://lkml.kernel.org/r/1430453997-32459-2-git-send-email-palmer@dabbelt.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e7cc4173
    • signals, sched: Change all uses of JOBCTL_* from 'int' to 'long' · b76808e6
      Committed by Palmer Dabbelt
      c56fb6564dcd ("Fix a misaligned load inside ptrace_attach()") makes
      jobctl an "unsigned long".  It makes sense to have the masks applied
      to it match that type.  This is currently just a cosmetic change, but
      it will prevent the mask from being unexpectedly truncated if we ever
      end up with masks with more bits.
      
      One instance of "signr" is an int, but I left this alone because the
      mask ensures that it will never overflow.
      Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bobby.prani@gmail.com
      Cc: oleg@redhat.com
      Cc: paulmck@linux.vnet.ibm.com
      Cc: richard@nod.at
      Cc: vdavydov@parallels.com
      Link: http://lkml.kernel.org/r/1430453997-32459-4-git-send-email-palmer@dabbelt.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b76808e6
    • sched: Move the loadavg code to a more obvious location · 3289bdb4
      Committed by Peter Zijlstra
      I could not find the loadavg code... it turns out it was hidden in a file
      called proc.c. It further got mingled up with the crufty per-rq load
      indexes (which we really want to get rid of).
      
      Move the per rq load indexes into the fair.c load-balance code (that's
      the only thing that uses them) and rename proc.c to loadavg.c so we
      can find it again.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Did minor cleanups to the code. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3289bdb4
  11. 07 May, 2015 - 1 commit
  12. 27 Apr, 2015 - 1 commit
  13. 13 Apr, 2015 - 1 commit
  14. 27 Mar, 2015 - 1 commit
    • sched: Add sched_avg::utilization_avg_contrib · 36ee28e4
      Committed by Vincent Guittot
      Add new statistics which reflect the average time a task is running on the CPU
      and the sum of these running times of the tasks on a runqueue. The latter is
      named utilization_load_avg.
      
      This patch is based on the usage metric that was proposed in the 1st
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but that has been removed afterwards. This version differs from
      the original one in the sense that it's not linked to task_group.
      
      The rq's utilization_load_avg will be used to check if a rq is overloaded or
      not instead of trying to compute how many tasks a group of CPUs can handle.
      
      Rename runnable_avg_period into avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.
      
      Add some descriptions of the variables to explain their differences.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      36ee28e4
  15. 26 Mar, 2015 - 1 commit
    • mm: numa: slow PTE scan rate if migration failures occur · 074c2381
      Committed by Mel Gorman
      Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226
      
        Across the board the 4.0-rc1 numbers are much slower, and the degradation
        is far worse when using the large memory footprint configs. Perf points
        straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:
      
         -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
            - default_send_IPI_mask_sequence_phys
               - 99.99% physflat_send_IPI_mask
                  - 99.37% native_send_call_func_ipi
                       smp_call_function_many
                     - native_flush_tlb_others
                        - 99.85% flush_tlb_page
                             ptep_clear_flush
                             try_to_unmap_one
                             rmap_walk
                             try_to_unmap
                             migrate_pages
                             migrate_misplaced_page
                           - handle_mm_fault
                              - 99.73% __do_page_fault
                                   trace_do_page_fault
                                   do_async_page_fault
                                 + async_page_fault
                    0.63% native_send_call_func_single_ipi
                       generic_exec_single
                       smp_call_function_single
      
      This is showing excessive migration activity even though excessive
      migrations are meant to get throttled.  Normally, the scan rate is tuned
      on a per-task basis depending on the locality of faults.  However, if
      migrations fail for any reason then the PTE scanner may scan faster if
      the faults continue to be remote.  This means there is higher system CPU
      overhead and fault trapping at exactly the time we know that migrations
      cannot happen.  This patch tracks when migration failures occur and
      slows the PTE scanner.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Tested-by: Dave Chinner <david@fromorbit.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      074c2381
  16. 24 Mar, 2015 - 1 commit
  17. 20 Mar, 2015 - 1 commit
    • sched, isolcpu: make cpu_isolated_map visible outside scheduler · 3fa0818b
      Committed by Rik van Riel
      Needed by the next patch. Also makes cpu_isolated_map present
      when compiled without SMP and/or with CONFIG_NR_CPUS=1, like
      the other cpu masks.
      
      At some point we may want to clean things up so cpumasks do
      not exist in UP kernels. Maybe something for the CONFIG_TINY
      crowd.
      
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: cgroups@vger.kernel.org
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3fa0818b
  18. 18 Feb, 2015 - 1 commit
    • sched: Prevent recursion in io_schedule() · 9cff8ade
      Committed by NeilBrown
      io_schedule() calls blk_flush_plug() which, depending on the
      contents of current->plug, can initiate arbitrary blk-io requests.
      
      Note that this contrasts with blk_schedule_flush_plug() which requires
      all non-trivial work to be handed off to a separate thread.
      
      This makes it possible for io_schedule() to recurse, and initiating
      block requests could possibly call mempool_alloc() which, in times of
      memory pressure, uses io_schedule().
      
      Apart from any stack usage issues, io_schedule() will not behave
      correctly when called recursively as delayacct_blkio_start() does
      not allow for repeated calls.
      
      So:
       - use ->in_iowait to detect recursion.  Set it earlier, and restore
         it to the old value.
       - move the call to "raw_rq" after the call to blk_flush_plug().
         As this is some sort of per-cpu thing, we want some chance that
         we are on the right CPU.
       - When io_schedule() is called recursively, use blk_schedule_flush_plug(),
         which cannot further recurse.
       - As this makes io_schedule() a lot more complex, and as io_schedule()
         must match io_schedule_timeout(), put all the changes into
         io_schedule_timeout() and make io_schedule() a simple wrapper for it.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [ Moved the now rudimentary io_schedule() into sched.h. ]
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Tony Battersby <tonyb@cybernetics.com>
      Link: http://lkml.kernel.org/r/20150213162600.059fffb2@notabene.brown
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9cff8ade
  19. 14 Feb, 2015 - 1 commit
    • kasan: add kernel address sanitizer infrastructure · 0b24becc
      Committed by Andrey Ryabinin
      Kernel Address sanitizer (KASan) is a dynamic memory error detector.  It
      provides a fast and comprehensive solution for finding use-after-free and
      out-of-bounds bugs.
      
      KASAN uses compile-time instrumentation for checking every memory access,
      therefore GCC > v4.9.2 is required.  v4.9.2 almost works, but has issues with
      putting symbol aliases into the wrong section, which breaks kasan
      instrumentation of globals.
      
      This patch only adds infrastructure for kernel address sanitizer.  It's
      not available for use yet.  The idea and some code was borrowed from [1].
      
      Basic idea:
      
      The main idea of KASAN is to use shadow memory to record whether each byte
      of memory is safe to access or not, and use compiler's instrumentation to
      check the shadow memory on each memory access.
      
      Address sanitizer uses 1/8 of the memory addressable in kernel for shadow
      memory and uses direct mapping with a scale and offset to translate a
      memory address to its corresponding shadow address.
      
      Here is the function to translate an address to its corresponding shadow address:
      
           unsigned long kasan_mem_to_shadow(unsigned long addr)
           {
                      return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
           }
      
      where KASAN_SHADOW_SCALE_SHIFT = 3.
      
      So for every 8 bytes there is one corresponding byte of shadow memory.
      The following encoding is used for each shadow byte: 0 means that all 8 bytes
      of the corresponding memory region are valid for access; k (1 <= k <= 7)
      means that the first k bytes are valid for access, and the other (8 - k) bytes
      are not; any negative value indicates that the entire 8 bytes are
      inaccessible.  Different negative values are used to distinguish between
      different kinds of inaccessible memory (redzones, freed memory) (see
      mm/kasan/kasan.h).
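
      A small user-space model of that check, only to make the encoding concrete
      (this is not the kernel implementation):

          #include <stdbool.h>
          #include <stdint.h>

          #define SHADOW_SCALE_SHIFT 3
          static int8_t shadow[1 << 20];      /* stand-in shadow region */

          /* Is an access of 'size' bytes (1..8) at 'addr' valid per the encoding? */
          static bool shadow_access_ok(unsigned long addr, unsigned long size)
          {
                  int8_t s = shadow[addr >> SHADOW_SCALE_SHIFT];

                  if (s == 0)
                          return true;        /* whole 8-byte granule accessible */
                  if (s < 0)
                          return false;       /* redzone or freed memory */
                  /* first s bytes valid: the access must end within them */
                  return (addr & 7) + size <= (unsigned long)s;
          }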
      
      To be able to detect accesses to bad memory we need a special compiler.
      Such a compiler inserts specific function calls (__asan_load*(addr),
      __asan_store*(addr)) before each memory access of size 1, 2, 4, 8 or 16.
      
      These functions check whether the memory region is valid to access by
      checking the corresponding shadow memory.  If the access is not valid,
      an error is printed.
      
      Historical background of the address sanitizer from Dmitry Vyukov:
      
      	"We've developed the set of tools, AddressSanitizer (Asan),
      	ThreadSanitizer and MemorySanitizer, for user space. We actively use
      	them for testing inside of Google (continuous testing, fuzzing,
      	running prod services). To date the tools have found more than 10'000
      	scary bugs in Chromium, Google internal codebase and various
      	open-source projects (Firefox, OpenSSL, gcc, clang, ffmpeg, MySQL and
      	lots of others): [2] [3] [4].
      	The tools are part of both gcc and clang compilers.
      
      	We have not yet done massive testing under the Kernel AddressSanitizer
      	(it's kind of chicken and egg problem, you need it to be upstream to
      	start applying it extensively). To date it has found about 50 bugs.
      	Bugs that we've found in upstream kernel are listed in [5].
      	We've also found ~20 bugs in our internal version of the kernel. Also
      	people from Samsung and Oracle have found some.
      
      	[...]
      
      	As others noted, the main feature of AddressSanitizer is its
      	performance due to inline compiler instrumentation and simple linear
      	shadow memory. User-space Asan has ~2x slowdown on computational
      	programs and ~2x memory consumption increase. Taking into account that
      	kernel usually consumes only small fraction of CPU and memory when
      	running real user-space programs, I would expect that kernel Asan will
      	have ~10-30% slowdown and similar memory consumption increase (when we
      	finish all tuning).
      
      	I agree that Asan can well replace kmemcheck. We have plans to start
      	working on Kernel MemorySanitizer that finds uses of uninitialized
      	memory. Asan+Msan will provide feature-parity with kmemcheck. As
      	others noted, Asan will unlikely replace debug slab and pagealloc that
      	can be enabled at runtime. Asan uses compiler instrumentation, so even
      	if it is disabled, it still incurs visible overheads.
      
      	Asan technology is easily portable to other architectures. Compiler
      	instrumentation is fully portable. Runtime has some arch-dependent
      	parts like shadow mapping and atomic operation interception. They are
      	relatively easy to port."
      
      Comparison with other debugging features:
      ========================================
      
      KMEMCHECK:
      
        - KASan can do almost everything that kmemcheck can.  KASan uses
          compile-time instrumentation, which makes it significantly faster than
          kmemcheck.  The only advantage of kmemcheck over KASan is detection of
          uninitialized memory reads.
      
          Some brief performance testing showed that kasan could be
          500-600 times faster than kmemcheck:
      
      $ netperf -l 30
      		MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to localhost (127.0.0.1) port 0 AF_INET
      		Recv   Send    Send
      		Socket Socket  Message  Elapsed
      		Size   Size    Size     Time     Throughput
      		bytes  bytes   bytes    secs.    10^6bits/sec
      
      no debug:	87380  16384  16384    30.00    41624.72
      
      kasan inline:	87380  16384  16384    30.00    12870.54
      
      kasan outline:	87380  16384  16384    30.00    10586.39
      
      kmemcheck: 	87380  16384  16384    30.03      20.23
      
        - Also, kmemcheck couldn't work with several CPUs: it always sets the
          number of CPUs to 1.  KASan doesn't have such a limitation.
      
      DEBUG_PAGEALLOC:
      	- KASan is slower than DEBUG_PAGEALLOC, but KASan works at sub-page
      	  granularity, so it is able to find more bugs.
      
      SLUB_DEBUG (poisoning, redzones):
      	- SLUB_DEBUG has lower overhead than KASan.
      
      	- SLUB_DEBUG in most cases is not able to detect bad reads,
      	  while KASan is able to detect both reads and writes.
      
      	- In some cases (e.g. redzone overwritten) SLUB_DEBUG detects
      	  bugs only on allocation/freeing of an object. KASan catches
      	  bugs right before they happen, so we always know the exact
      	  place of the first bad read/write.
      
      [1] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel
      [2] https://code.google.com/p/address-sanitizer/wiki/FoundBugs
      [3] https://code.google.com/p/thread-sanitizer/wiki/FoundBugs
      [4] https://code.google.com/p/memory-sanitizer/wiki/FoundBugs
      [5] https://code.google.com/p/address-sanitizer/wiki/AddressSanitizerForKernel#Trophies
      
      Based on work by Andrey Konovalov.
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Acked-by: Michal Marek <mmarek@suse.cz>
      Signed-off-by: Andrey Konovalov <adech.fo@gmail.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Konstantin Serebryany <kcc@google.com>
      Cc: Dmitry Chernenkov <dmitryc@google.com>
      Cc: Yuri Gribov <tetra2005@gmail.com>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b24becc
  20. 13 Feb, 2015 - 2 commits
    • kernel/sched/clock.c: add another clock for use with the soft lockup watchdog · 545a2bf7
      Committed by Cyril Bur
      When the hypervisor pauses a virtualised kernel, the kernel will observe a
      jump in the timebase; this can cause spurious messages from the softlockup
      detector.
      
      Whilst these messages are harmless, they are accompanied by a stack
      trace which causes undue concern; more problematically, the stack trace
      in the guest has nothing to do with the observed problem and can only be
      misleading.
      
      Furthermore, on POWER8 this is completely avoidable with the introduction
      of the Virtual Time Base (VTB) register.
      
      This patch (of 2):
      
      This permits the use of arch-specific clocks through which virtualised kernels
      can use their notion of 'running' time, not the elapsed wall time, which
      will include host execution time.
      Signed-off-by: Cyril Bur <cyrilbur@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Andrew Jones <drjones@redhat.com>
      Acked-by: Don Zickus <dzickus@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: chai wen <chaiw.fnst@cn.fujitsu.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Ben Zhang <benzh@chromium.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      545a2bf7
    • all arches, signal: move restart_block to struct task_struct · f56141e3
      Committed by Andy Lutomirski
      If an attacker can cause a controlled kernel stack overflow, overwriting
      the restart block is a very juicy exploit target.  This is because the
      restart_block is held in the same memory allocation as the kernel stack.
      
      Moving the restart block to struct task_struct prevents this exploit by
      making the restart_block harder to locate.
      
      Note that there are other fields in thread_info that are also easy
      targets, at least on some architectures.
      
      It's also a decent simplification, since the restart code is more or less
      identical on all architectures.
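
      For arch code the conversion is mostly mechanical; accesses of this shape
      simply move from thread_info to task_struct (sketch):

          /* before */
          current_thread_info()->restart_block.fn = do_no_restart_syscall;

          /* after */
          current->restart_block.fn = do_no_restart_syscall;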
      
      [james.hogan@imgtec.com: metag: align thread_info::supervisor_stack]
      Signed-off-by: Andy Lutomirski <luto@amacapital.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: David Miller <davem@davemloft.net>
      Acked-by: Richard Weinberger <richard@nod.at>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Haavard Skinnemoen <hskinnemoen@gmail.com>
      Cc: Hans-Christian Egtvedt <egtvedt@samfundet.no>
      Cc: Steven Miao <realmz6@gmail.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Aurelien Jacquiot <a-jacquiot@ti.com>
      Cc: Mikael Starvik <starvik@axis.com>
      Cc: Jesper Nilsson <jesper.nilsson@axis.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Tested-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Chen Liqin <liqin.linux@gmail.com>
      Cc: Lennox Wu <lennox.wu@gmail.com>
      Cc: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Max Filippov <jcmvbkbc@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: James Hogan <james.hogan@imgtec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f56141e3
  21. 14 Dec, 2014 - 2 commits
    • syscalls: implement execveat() system call · 51f39a1f
      Committed by David Drysdale
      This patchset adds execveat(2) for x86, and is derived from Meredydd
      Luff's patch from Sept 2012 (https://lkml.org/lkml/2012/9/11/528).
      
      The primary aim of adding an execveat syscall is to allow an
      implementation of fexecve(3) that does not rely on the /proc filesystem,
      at least for executables (rather than scripts).  The current glibc version
      of fexecve(3) is implemented via /proc, which causes problems in sandboxed
      or otherwise restricted environments.
      
      Given the desire for a /proc-free fexecve() implementation, HPA suggested
      (https://lkml.org/lkml/2006/7/11/556) that an execveat(2) syscall would be
      an appropriate generalization.
      
      Also, having a new syscall means that it can take a flags argument without
      back-compatibility concerns.  The current implementation just defines the
      AT_EMPTY_PATH and AT_SYMLINK_NOFOLLOW flags, but other flags could be
      added in future -- for example, flags for new namespaces (as suggested at
      https://lkml.org/lkml/2006/7/11/474).
      
      Related history:
       - https://lkml.org/lkml/2006/12/27/123 is an example of someone
         realizing that fexecve() is likely to fail in a chroot environment.
       - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=514043 covered
         documenting the /proc requirement of fexecve(3) in its manpage, to
         "prevent other people from wasting their time".
       - https://bugzilla.redhat.com/show_bug.cgi?id=241609 described a
         problem where a process that did setuid() could not fexecve()
         because it no longer had access to /proc/self/fd; this has since
         been fixed.
      
      This patch (of 4):
      
      Add a new execveat(2) system call.  execveat() is to execve() as openat()
      is to open(): it takes a file descriptor that refers to a directory, and
      resolves the filename relative to that.
      
      In addition, if the filename is empty and AT_EMPTY_PATH is specified,
      execveat() executes the file to which the file descriptor refers.  This
      replicates the functionality of fexecve(), which is a system call in other
      UNIXen, but in Linux glibc it depends on opening "/proc/self/fd/<fd>" (and
      so relies on /proc being mounted).
      
      The filename fed to the executed program as argv[0] (or the name of the
      script fed to a script interpreter) will be of the form "/dev/fd/<fd>"
      (for an empty filename) or "/dev/fd/<fd>/<filename>", effectively
      reflecting how the executable was found.  This does however mean that
      execution of a script in a /proc-less environment won't work; also, script
      execution via an O_CLOEXEC file descriptor fails (as the file will not be
      accessible after exec).
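
      A userspace sketch of the fexecve()-style use (error handling omitted;
      assumes __NR_execveat and AT_EMPTY_PATH are available in the toolchain
      headers):

          #define _GNU_SOURCE
          #include <fcntl.h>
          #include <sys/syscall.h>
          #include <unistd.h>

          int main(void)
          {
                  char *argv[] = { "true", NULL };
                  char *envp[] = { NULL };
                  int fd = open("/bin/true", O_RDONLY | O_CLOEXEC);

                  /* Empty filename + AT_EMPTY_PATH: execute the file 'fd' refers
                   * to.  O_CLOEXEC is fine for binaries but breaks script
                   * execution, as described above. */
                  syscall(__NR_execveat, fd, "", argv, envp, AT_EMPTY_PATH);
                  return 1;   /* reached only if the exec failed */
          }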
      
      Based on patches by Meredydd Luff.
      Signed-off-by: David Drysdale <drysdale@google.com>
      Cc: Meredydd Luff <meredydd@senatehouse.org>
      Cc: Shuah Khan <shuah.kh@samsung.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Rich Felker <dalias@aerifal.cx>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      51f39a1f
    • memcg: turn memcg_kmem_skip_account into a bit field · 6f185c29
      Committed by Vladimir Davydov
      It isn't supposed to stack, so turn it into a bit-field to save 4 bytes on
      the task_struct.
      
      Also, remove the memcg_stop/resume_kmem_account helpers - it is clearer to
      set/clear the flag inline.  Regarding the lengthy comment on the
      helpers, which is removed by this patch too: we already have a compact yet
      accurate explanation in memcg_schedule_cache_create, no need for yet
      another one.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6f185c29
  22. 04 Nov, 2014 - 1 commit
  23. 30 Oct, 2014 - 1 commit
  24. 28 Oct, 2014 - 1 commit