1. 09 10月, 2012 7 次提交
    • M
      mm: replace vma prio_tree with an interval tree · 6b2dbba8
      Michel Lespinasse 提交于
      Implement an interval tree as a replacement for the VMA prio_tree.  The
      algorithms are similar to lib/interval_tree.c; however that code can't be
      directly reused as the interval endpoints are not explicitly stored in the
      VMA.  So instead, the common algorithm is moved into a template and the
      details (node type, how to get interval endpoints from the node, etc) are
      filled in using the C preprocessor.
      
      Once the interval tree functions are available, using them as a
      replacement to the VMA prio tree is a relatively simple, mechanical job.
      Signed-off-by: NMichel Lespinasse <walken@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Hillf Danton <dhillf@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b2dbba8
    • D
      oom: remove deprecated oom_adj · 01dc52eb
      Davidlohr Bueso 提交于
      The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.
      Signed-off-by: NDavidlohr Bueso <dave@gnu.org>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01dc52eb
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
    • K
      mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Konstantin Khlebnikov 提交于
      Currently the kernel sets mm->exe_file during sys_execve() and then tracks
      number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
      as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
      resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
      
      VMA with VM_EXECUTABLE flag appears after mapping file with flag
      MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
      splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
      mm->exe_file and mmaps executable vmas with this file, they hold
      mm->exe_file while task is running.
      
      comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/unmmap and vma
      split/merge just to fix some hypothetical pinning fs from umounting by mm,
      which already unmapped all its executable files, but still alive.
      
      Seems like currently nobody depends on this behaviour.  We can try to
      remove this logic and keep mm->exe_file until final mmput().
      
      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
      task can change its mm->exe_file and unpin mountpoint explicitly.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9714acf
    • K
      mm: use mm->exe_file instead of first VM_EXECUTABLE vma->vm_file · 2dd8ad81
      Konstantin Khlebnikov 提交于
      Some security modules and oprofile still uses VM_EXECUTABLE for retrieving
      a task's executable file.  After this patch they will use mm->exe_file
      directly.  mm->exe_file is protected with mm->mmap_sem, so locking stays
      the same.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>			[arch/tile]
      Acked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>	[tomoyo]
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Acked-by: NJames Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2dd8ad81
    • S
      CPU hotplug, debug: detect imbalance between get_online_cpus() and put_online_cpus() · 075663d1
      Srivatsa S. Bhat 提交于
      The synchronization between CPU hotplug readers and writers is achieved
      by means of refcounting, safeguarded by the cpu_hotplug.lock.
      
      get_online_cpus() increments the refcount, whereas put_online_cpus()
      decrements it.  If we ever hit an imbalance between the two, we end up
      compromising the guarantees of the hotplug synchronization i.e, for
      example, an extra call to put_online_cpus() can end up allowing a
      hotplug reader to execute concurrently with a hotplug writer.
      
      So, add a WARN_ON() in put_online_cpus() to detect such cases where the
      refcount can go negative, and also attempt to fix it up, so that we can
      continue to run.
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      075663d1
    • C
      Kconfig: clean up the "#if defined(arch)" list for exception-trace sysctl entry · 7ac57a89
      Catalin Marinas 提交于
      Introduce SYSCTL_EXCEPTION_TRACE config option and selec it in the
      architectures requiring support for the "exception-trace" debug_table
      entry in kernel/sysctl.c.
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ac57a89
  2. 06 10月, 2012 8 次提交
  3. 28 9月, 2012 1 次提交
  4. 27 9月, 2012 7 次提交
  5. 26 9月, 2012 11 次提交
    • P
      rcu: Apply micro-optimization and int/bool fixes to RCU's idle handling · cb349ca9
      Paul E. McKenney 提交于
      Checking "user" before "is_idle_task()" allows better optimizations
      in cases where inlining is possible.  Also, "bool" should be passed
      "true" or "false" rather than "1" or "0".  This commit therefore makes
      these changes, as noted in Josh's review.
      Reported-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      cb349ca9
    • F
      rcu: Userspace RCU extended QS selftest · 1fd2b442
      Frederic Weisbecker 提交于
      Provide a config option that enables the userspace
      RCU extended quiescent state on every CPUs by default.
      
      This is for testing purpose.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      1fd2b442
    • F
      rcu: Exit RCU extended QS on user preemption · 20ab65e3
      Frederic Weisbecker 提交于
      When exceptions or irq are about to resume userspace, if
      the task needs to be rescheduled, the arch low level code
      calls schedule() directly.
      
      If we call it, it is because we have the TIF_RESCHED flag:
      
      - It can be set after random local calls to set_need_resched()
      (RCU, drm, ...)
      
      - A wake up happened and the CPU needs preemption. This can
        happen in several ways:
      
          * Remotely: the remote waking CPU has set TIF_RESCHED and send the
            wakee an IPI to schedule the new task.
          * Remotely enqueued: the remote waking CPU sends an IPI to the target
            and the wake up is made by the target.
          * Locally: waking CPU == wakee CPU and the wakeup is done locally.
            set_need_resched() is called without IPI.
      
      In the case of local and remotely enqueued wake ups, the tick can
      be restarted when we enqueue the new task and RCU can exit the
      extended quiescent state at the same time. Then by the time we reach
      irq exit path and we call schedule, we are not in RCU user mode.
      
      But if we call schedule() only because something called set_need_resched(),
      RCU may still be in user mode when we reach schedule.
      
      Also if a wake up is done remotely, the CPU might see the TIF_RESCHED
      flag and call schedule while the IPI has not yet happen to restart the
      tick and exit RCU user mode.
      
      We need to manually protect against these corner cases.
      
      Create a new API schedule_user() that calls schedule() inside
      rcu_user_exit()-rcu_user_enter() in order to protect it. Archs
      will need to rely on it now to implement user preemption safely.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      20ab65e3
    • F
      rcu: Exit RCU extended QS on kernel preemption after irq/exception · 90a340ed
      Frederic Weisbecker 提交于
      When an exception or an irq exits, and we are going to resume into
      interrupted kernel code, the low level architecture code calls
      preempt_schedule_irq() if there is a need to reschedule.
      
      If the interrupt/exception occured between a call to rcu_user_enter()
      (from syscall exit, exception exit, do_notify_resume exit, ...) and
      a real resume to userspace (iret,...), preempt_schedule_irq() can be
      called whereas RCU thinks we are in userspace. But preempt_schedule_irq()
      is going to run kernel code and may be some RCU read side critical
      section. We must exit the userspace extended quiescent state before
      we call it.
      
      To solve this, just call rcu_user_exit() in the beginning of
      preempt_schedule_irq().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      90a340ed
    • F
      rcu: Switch task's syscall hooks on context switch · 04e7e951
      Frederic Weisbecker 提交于
      Clear the syscalls hook of a task when it's scheduled out so that if
      the task migrates, it doesn't run the syscall slow path on a CPU
      that might not need it.
      
      Also set the syscalls hook on the next task if needed.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      04e7e951
    • F
      rcu: Ignore userspace extended quiescent state by default · 1e1a689f
      Frederic Weisbecker 提交于
      By default we don't want to enter into RCU extended quiescent
      state while in userspace because doing this produces some overhead
      (eg: use of syscall slowpath). Set it off by default and ready to
      run when some feature like adaptive tickless need it.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      1e1a689f
    • F
      rcu: Allow rcu_user_enter()/exit() to nest · c5d900bf
      Frederic Weisbecker 提交于
      Allow calls to rcu_user_enter() even if we are already
      in userspace (as seen by RCU) and allow calls to rcu_user_exit()
      even if we are already in the kernel.
      
      This makes the APIs more flexible to be called from architectures.
      Exception entries for example won't need to know if they come from
      userspace before calling rcu_user_exit().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      c5d900bf
    • F
      rcu: Settle config for userspace extended quiescent state · 2b1d5024
      Frederic Weisbecker 提交于
      Create a new config option under the RCU menu that put
      CPUs under RCU extended quiescent state (as in dynticks
      idle mode) when they run in userspace. This require
      some contribution from architectures to hook into kernel
      and userspace boundaries.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Josh Triplett <josh@joshtriplett.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      2b1d5024
    • P
      rcu: Make RCU_FAST_NO_HZ handle adaptive ticks · 9a0c6fef
      Paul E. McKenney 提交于
      The current implementation of RCU_FAST_NO_HZ tries reasonably hard to rid
      the current CPU of RCU callbacks.  This is appropriate when the CPU is
      entering idle, where it doesn't have much useful to do anyway, but is most
      definitely not what you want when transitioning to user-mode execution.
      This commit therefore detects the adaptive-tick case, and refrains from
      burning CPU time getting rid of RCU callbacks in that case.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      9a0c6fef
    • F
      rcu: New rcu_user_enter_after_irq() and rcu_user_exit_after_irq() APIs · 19dd1591
      Frederic Weisbecker 提交于
      In some cases, it is necessary to enter or exit userspace-RCU-idle mode
      from an interrupt handler, for example, if some other CPU sends this
      CPU a resched IPI.  In this case, the current CPU would enter the IPI
      handler in userspace-RCU-idle mode, but would need to exit the IPI handler
      after having exited that mode.
      
      To allow this to work, this commit adds two new APIs to TREE_RCU:
      
      - rcu_user_enter_after_irq(). This must be called from an interrupt between
      rcu_irq_enter() and rcu_irq_exit().  After the irq calls rcu_irq_exit(),
      the irq handler will return into an RCU extended quiescent state.
      In theory, this interrupt is never a nested interrupt, but in practice
      it might interrupt softirq, which looks to RCU like a nested interrupt.
      
      - rcu_user_exit_after_irq(). This must be called from a non-nesting
      interrupt, interrupting an RCU extended quiescent state, also
      between rcu_irq_enter() and rcu_irq_exit(). After the irq calls
      rcu_irq_exit(), the irq handler will return in an RCU non-quiescent
      state.
      
      [ Combined with "Allow calls to rcu_exit_user_irq from nesting irqs." ]
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      19dd1591
    • F
      rcu: New rcu_user_enter() and rcu_user_exit() APIs · adf5091e
      Frederic Weisbecker 提交于
      RCU currently insists that only idle tasks can enter RCU idle mode, which
      prohibits an adaptive tickless kernel (AKA nohz cpusets), which in turn
      would mean that usermode execution would always take scheduling-clock
      interrupts, even when there is only one task runnable on the CPU in
      question.
      
      This commit therefore adds rcu_user_enter() and rcu_user_exit(), which
      allow non-idle tasks to enter RCU idle mode.  These are quite similar
      to rcu_idle_enter() and rcu_idle_exit(), respectively, except that they
      omit the idle-task checks.
      
      [ Updated to use "user" flag rather than separate check functions. ]
      
      [ paulmck: Updated to drop exports of new functions based on Josh's patch
        getting rid of the need for them. ]
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Alessio Igor Bogani <abogani@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@ti.com>
      Cc: Max Krasnyansky <maxk@qualcomm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephen Hemminger <shemminger@vyatta.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Sven-Thorsten Dietrich <thebigcorporation@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      adf5091e
  6. 25 9月, 2012 5 次提交
    • F
      vtime: Consolidate system/idle context detection · a7e1a9e3
      Frederic Weisbecker 提交于
      Move the code that finds out to which context we account the
      cputime into generic layer.
      
      Archs that consider the whole time spent in the idle task as idle
      time (ia64, powerpc) can rely on the generic vtime_account()
      and implement vtime_account_system() and vtime_account_idle(),
      letting the generic code to decide when to call which API.
      
      Archs that have their own meaning of idle time, such as s390
      that only considers the time spent in CPU low power mode as idle
      time, can just override vtime_account().
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      a7e1a9e3
    • F
      cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Frederic Weisbecker 提交于
      Use a naming based on vtime as a prefix for virtual based
      cputime accounting APIs:
      
      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()
      
      It makes it easier to allow for further declension such
      as vtime_account_system(), vtime_account_idle(), ... if we
      want to find out the context we account to from generic code.
      
      This also make it better to know on which subsystem these APIs
      refer to.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      bf9fae9f
    • E
      net: use a per task frag allocator · 5640f768
      Eric Dumazet 提交于
      We currently use a per socket order-0 page cache for tcp_sendmsg()
      operations.
      
      This page is used to build fragments for skbs.
      
      Its done to increase probability of coalescing small write() into
      single segments in skbs still in write queue (not yet sent)
      
      But it wastes a lot of memory for applications handling many mostly
      idle sockets, since each socket holds one page in sk->sk_sndmsg_page
      
      Its also quite inefficient to build TSO 64KB packets, because we need
      about 16 pages per skb on arches where PAGE_SIZE = 4096, so we hit
      page allocator more than wanted.
      
      This patch adds a per task frag allocator and uses bigger pages,
      if available. An automatic fallback is done in case of memory pressure.
      
      (up to 32768 bytes per frag, thats order-3 pages on x86)
      
      This increases TCP stream performance by 20% on loopback device,
      but also benefits on other network devices, since 8x less frags are
      mapped on transmit and unmapped on tx completion. Alexander Duyck
      mentioned a probable performance win on systems with IOMMU enabled.
      
      Its possible some SG enabled hardware cant cope with bigger fragments,
      but their ndo_start_xmit() should already handle this, splitting a
      fragment in sub fragments, since some arches have PAGE_SIZE=65536
      
      Successfully tested on various ethernet devices.
      (ixgbe, igb, bnx2x, tg3, mellanox mlx4)
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Ben Hutchings <bhutchings@solarflare.com>
      Cc: Vijay Subramanian <subramanian.vijay@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@intel.com>
      Tested-by: NVijay Subramanian <subramanian.vijay@gmail.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5640f768
    • E
      trace: Move trace event enable from fs_initcall to core_initcall · 8781915a
      Ezequiel Garcia 提交于
      This patch splits trace event initialization in two stages:
       * ftrace enable
       * sysfs event entry creation
      
      This allows to capture trace events from an earlier point
      by using 'trace_event' kernel parameter and is important
      to trace boot-up allocations.
      
      Note that, in order to enable events at core_initcall,
      it's necessary to move init_ftrace_syscalls() from
      core_initcall to early_initcall.
      
      Link: http://lkml.kernel.org/r/1347461277-25302-1-git-send-email-elezegarcia@gmail.comSigned-off-by: NEzequiel Garcia <elezegarcia@gmail.com>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      8781915a
    • M
      tracing: Add an option for disabling markers · 5224c3a3
      Mandeep Singh Baines 提交于
      In our application, we have trace markers spread through user-space.
      We have markers in GL, X, etc. These are super handy for Chrome's
      about:tracing feature (Chrome + system + kernel trace view), but
      can be very distracting when you're trying to debug a kernel issue.
      
      I normally, use "grep -v tracing_mark_write" but it would be nice
      if I could just temporarily disable markers all together.
      
      Link: http://lkml.kernel.org/r/1347066739-26285-1-git-send-email-msb@chromium.org
      
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NMandeep Singh Baines <msb@chromium.org>
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      5224c3a3
  7. 23 9月, 2012 1 次提交
    • P
      sched: Fix load avg vs cpu-hotplug · 5d180232
      Peter Zijlstra 提交于
      Rabik and Paul reported two different issues related to the same few
      lines of code.
      
      Rabik's issue is that the nr_uninterruptible migration code is wrong in
      that he sees artifacts due to this (Rabik please do expand in more
      detail).
      
      Paul's issue is that this code as it stands relies on us using
      stop_machine() for unplug, we all would like to remove this assumption
      so that eventually we can remove this stop_machine() usage altogether.
      
      The only reason we'd have to migrate nr_uninterruptible is so that we
      could use for_each_online_cpu() loops in favour of
      for_each_possible_cpu() loops, however since nr_uninterruptible() is the
      only such loop and its using possible lets not bother at all.
      
      The problem Rabik sees is (probably) caused by the fact that by
      migrating nr_uninterruptible we screw rq->calc_load_active for both rqs
      involved.
      
      So don't bother with fancy migration schemes (meaning we now have to
      keep using for_each_possible_cpu()) and instead fold any nr_active delta
      after we migrate all tasks away to make sure we don't have any skewed
      nr_active accounting.
      
      [ paulmck: Move call to calc_load_migration to CPU_DEAD to avoid
      miscounting noted by Rakib. ]
      Reported-by: NRakib Mullick <rakib.mullick@gmail.com>
      Reported-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      5d180232