1. 07 January, 2015 (14 commits)
    • rcu: Remove redundant callback-list initialization · ab954c16
      Committed by Paul E. McKenney
      The RCU callback lists are initialized in both rcu_boot_init_percpu_data()
      and rcu_init_percpu_data().  The former is intended for initializing
      immutable data, so this commit removes the initialization from
      rcu_boot_init_percpu_data() and leaves it in rcu_init_percpu_data().
      This change prepares for permitting callbacks to be queued very early
      in boot.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't scan root rcu_node structure for stalled tasks · 6cd534ef
      Committed by Paul E. McKenney
      Now that blocked tasks are no longer migrated to the root rcu_node
      structure, there is no need to scan the root rcu_node structure for
      blocked tasks stalling the current grace period.  This commit therefore
      removes this scan.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Revert "Allow post-unlock reference for rt_mutex" to avoid priority-inversion · abaf3f9d
      Committed by Lai Jiangshan
      The patch dfeb9765 ("Allow post-unlock reference for rt_mutex")
      ensured that RCU priority boosting remained safe even when the rt_mutex
      had a post-unlock reference.
      
      But an rt_mutex allowing a post-unlock reference is itself a bug, and it
      was fixed by commit 27e35715 ("rtmutex: Plug slow unlock race"),
      which made the earlier patch (dfeb9765) unnecessary.
      
      Even worse, the priority inversion introduced by the earlier patch
      still exists:
      
      rcu_read_unlock_special() {
      	rt_mutex_unlock(&rnp->boost_mtx);
      	/* Priority inversion:
      	 * The current task has just been deboosted and may immediately be
      	 * preempted as a low-priority task, so it could wait a long time
      	 * before being rescheduled.  Meanwhile the rcu-boost kthread sleeps
      	 * waiting on this low-priority task, so this priority inversion
      	 * keeps the rcu-boost kthread from working as expected.
      	 */
      	complete(&rnp->boost_completion);
      }
      
      Just revert the patch to avoid it.
      
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Note quiescent state when CPU goes offline · 3ba4d0e0
      Committed by Paul E. McKenney
      The rcu_cleanup_dead_cpu() function (called after a CPU has gone
      completely offline) has not reported a quiescent state because there
      was probably at least one synchronize_rcu() between the time the CPU
      went offline and the CPU_DEAD notifier, and this would have detected
      the CPU's offline state via quiescent-state forcing.  However, the plan
      is for CPUs to take themselves offline, at which point it makes sense
      for them to report their own quiescent state.  This commit makes this
      change in preparation for the new CPU-hotplug setup.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't bother affinitying rcub kthreads away from offline CPUs · 5d0b0249
      Committed by Paul E. McKenney
      When rcu_boost_kthread_setaffinity() sees that all CPUs for a given
      rcu_node structure are now offline, it affinities the corresponding
      RCU-boost ("rcub") kthread away from those CPUs.  This is pointless
      because the kthread cannot run on those offline CPUs in any case.
      This commit therefore removes this unneeded code.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't initiate RCU priority boosting on root rcu_node · 1be0085b
      Committed by Paul E. McKenney
      Because there are no longer any preempted tasks on the root rcu_node, and
      because there is no longer ever an rcub kthread for the root rcu_node,
      this commit drops the code in force_qs_rnp() that attempts to awaken
      the non-existent root rcub kthread.  This is strictly a performance
      enhancement, removing a root rcu_node ->lock acquisition and release
      along with some tests in rcu_initiate_boost(), ending with the test that
      notes that there is no rcub kthread.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't spawn rcub kthreads on root rcu_node structure · 3e9f5c70
      Committed by Paul E. McKenney
      Now that offlining CPUs no longer moves leaf rcu_node structures'
      ->blkd_tasks lists to the root, there is no way for the root rcu_node
      structure's ->blkd_tasks list to be nonempty, unless the root node is also
      the sole leaf node.  This commit therefore refrains from creating an rcub
      kthread for the root rcu_node structure unless it is also the sole leaf.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make use of rcu_preempt_has_tasks() · 96e92021
      Committed by Paul E. McKenney
      Given that there is now an rcu_preempt_has_tasks() function that checks
      whether the ->blkd_tasks list is non-empty, this commit makes use of it.
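      A minimal sketch of what such a helper looks like, assuming only what the
      text above states (the in-kernel version may differ in detail):
      
      static int rcu_preempt_has_tasks(struct rcu_node *rnp)
      {
      	/* True if any tasks are queued on this rcu_node's ->blkd_tasks list. */
      	return !list_empty(&rnp->blkd_tasks);
      }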
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Shorten irq-disable region in rcu_cleanup_dead_cpu() · a8f4cbad
      Committed by Paul E. McKenney
      Now that we are not migrating callbacks, there is no need to hold the
      ->orphan_lock across the ->qsmaskinit bit-clearing process.
      This commit therefore releases ->orphan_lock immediately after adopting
      the orphaned RCU callbacks.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Don't migrate blocked tasks even if all corresponding CPUs offline · d19fb8d1
      Committed by Paul E. McKenney
      When the last CPU associated with a given leaf rcu_node structure
      goes offline, something must be done about the tasks queued on that
      rcu_node structure.  Each of these tasks has been preempted on one of
      the leaf rcu_node structure's CPUs while in an RCU read-side critical
      section that it has not yet exited.  Handling these tasks is the job of
      rcu_preempt_offline_tasks(), which migrates them from the leaf rcu_node
      structure to the root rcu_node structure.
      
      Unfortunately, this migration has to be done one task at a time because
      each task's allegiance must be shifted from the original leaf rcu_node to
      the root, so that future attempts to deal with these tasks will acquire
      the root rcu_node structure's ->lock rather than that of the leaf.
      Worse yet, this migration must be done with interrupts disabled, which
      is not so good for realtime response, especially given that there is
      no bound on the number of tasks on a given rcu_node structure's list.
      (OK, OK, there is a bound, it is just that it is unreasonably large,
      especially on 64-bit systems.)  This was not considered a problem back
      when rcu_preempt_offline_tasks() was first written because realtime
      systems were assumed not to do CPU-hotplug operations while real-time
      applications were running.  This assumption has proved of dubious validity
      given that people are starting to run multiple realtime applications
      on a single SMP system and that it is common practice to offline then
      online a CPU before starting its real-time application in order to clear
      extraneous processing off of that CPU.  So we now need CPU hotplug
      operations to avoid undue latencies.
      
      This commit therefore avoids migrating these tasks, instead letting
      them be dequeued one by one from the original leaf rcu_node structure
      by rcu_read_unlock_special().  This means that the clearing of bits
      from the upper-level rcu_node structures must be deferred until the
      last such task has been dequeued, because otherwise subsequent grace
      periods won't wait on them.  This commit has the beneficial side effect
      of simplifying the CPU-hotplug code for TREE_PREEMPT_RCU, especially in
      CONFIG_RCU_BOOST builds.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Make rcu_read_unlock_special() propagate ->qsmaskinit bit clearing · b6a932d1
      Committed by Paul E. McKenney
      This commit causes rcu_read_unlock_special() to propagate ->qsmaskinit
      bit clearing up the rcu_node tree once a given rcu_node structure's
      ->blkd_tasks list becomes empty.  This is the final commit in preparation
      for the rework of RCU priority boosting:  It enables preempted tasks to
      remain queued on their rcu_node structure even after all of that rcu_node
      structure's CPUs have gone offline.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Abstract rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu() · 8af3a5e7
      Committed by Paul E. McKenney
      This commit abstracts rcu_cleanup_dead_rnp() from rcu_cleanup_dead_cpu()
      in preparation for the rework of RCU priority boosting.  This new function
      will be invoked from rcu_read_unlock_special() in the reworked scheme,
      which is why rcu_cleanup_dead_rnp() assumes that the leaf rcu_node
      structure's ->qsmaskinit field has already been updated.
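      A hedged sketch of that propagation, assuming only the behaviour described
      above (locking, ordering, and tracing details of the real function are
      omitted):
      
      /* Caller has already cleared the leaf's ->qsmaskinit bits; irqs disabled. */
      static void rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
      {
      	unsigned long mask;
      	struct rcu_node *rnp = rnp_leaf;
      
      	if (rnp->qsmaskinit)
      		return;				/* Leaf still in use; nothing to do. */
      	for (;;) {
      		mask = rnp->grpmask;		/* This node's bit in its parent. */
      		rnp = rnp->parent;
      		if (!rnp)
      			break;			/* Cleared all the way to the root. */
      		raw_spin_lock(&rnp->lock);	/* irqs already disabled. */
      		rnp->qsmaskinit &= ~mask;
      		if (rnp->qsmaskinit) {
      			raw_spin_unlock(&rnp->lock);
      			return;			/* Parent still has online children. */
      		}
      		raw_spin_unlock(&rnp->lock);
      	}
      }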
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Rename "empty" to "empty_norm" in preparation for boost rework · 74e871ac
      Committed by Paul E. McKenney
      This commit undertakes a simple variable renaming to make way for
      some rework of RCU priority boosting.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
    • rcu: Protect rcu_boost() lockless accesses with ACCESS_ONCE() · b08ea27d
      Committed by Paul E. McKenney
      This commit prevents random compiler optimizations by applying
      ACCESS_ONCE() to lockless accesses.
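      For illustration, the pattern is the one below; this is a hedged sketch, and
      the field names are examples rather than the exact accesses the commit touches:
      
      	/* Lockless reads: ACCESS_ONCE() forces single, untorn loads that the
      	 * compiler may not refetch, fuse, or hoist out of the surrounding logic. */
      	if (ACCESS_ONCE(rnp->exp_tasks) == NULL &&
      	    ACCESS_ONCE(rnp->boost_tasks) == NULL)
      		return 0;  /* Nothing left to boost. */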
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  2. 31 December, 2014 (1 commit)
    • rcu: Make rcu_nmi_enter() handle nesting · 734d1680
      Committed by Paul E. McKenney
      The x86 architecture has multiple types of NMI-like interrupts: real
      NMIs, machine checks, and, for some values of NMI-like, debugging
      and breakpoint interrupts.  These interrupts can nest inside each
      other.  Andy Lutomirski is adding RCU support to these interrupts,
      so rcu_nmi_enter() and rcu_nmi_exit() must now correctly handle nesting.
      
      This commit therefore introduces nesting, using a clever NMI-coordination
      algorithm suggested by Andy.  The trick is to atomically increment
      ->dynticks (if needed) before manipulating ->dynticks_nmi_nesting on entry
      (and, accordingly, after on exit).  In addition, ->dynticks_nmi_nesting
      is incremented by one if ->dynticks was incremented and by two otherwise.
      This means that when rcu_nmi_exit() sees ->dynticks_nmi_nesting equal
      to one, it knows that ->dynticks must be atomically incremented.
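      In C, a simplified sketch of that entry/exit pair looks as follows.  This is
      an illustration of the algorithm only: the per-CPU state pointer rdtp and the
      memory barriers of the real code are elided.
      
      void rcu_nmi_enter(void)
      {
      	int incby = 2;				/* ->dynticks already odd: just nest. */
      
      	if (!(atomic_read(&rdtp->dynticks) & 0x1)) {
      		atomic_inc(&rdtp->dynticks);	/* Mark this CPU non-idle for RCU. */
      		incby = 1;			/* Remember that we did the increment. */
      	}
      	rdtp->dynticks_nmi_nesting += incby;
      }
      
      void rcu_nmi_exit(void)
      {
      	if (rdtp->dynticks_nmi_nesting != 1) {
      		rdtp->dynticks_nmi_nesting -= 2;	/* Still nested: just unnest. */
      		return;
      	}
      	rdtp->dynticks_nmi_nesting = 0;		/* Outermost exit: ->dynticks must */
      	atomic_inc(&rdtp->dynticks);		/* be atomically incremented again. */
      }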
      
      This NMI-coordination algorithm has been validated by the following
      Promela model:
      
      ------------------------------------------------------------------------
      
      /*
       * Promela model for Andy Lutomirski's suggested change to rcu_nmi_enter()
       * that allows nesting.
       *
       * This program is free software; you can redistribute it and/or modify
       * it under the terms of the GNU General Public License as published by
       * the Free Software Foundation; either version 2 of the License, or
       * (at your option) any later version.
       *
       * This program is distributed in the hope that it will be useful,
       * but WITHOUT ANY WARRANTY; without even the implied warranty of
       * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
       * GNU General Public License for more details.
       *
       * You should have received a copy of the GNU General Public License
       * along with this program; if not, you can access it online at
       * http://www.gnu.org/licenses/gpl-2.0.html.
       *
       * Copyright IBM Corporation, 2014
       *
       * Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
       */
      
      #define BUSY_INCBY 2	/* nesting increment when ->dynticks is already non-idle */
      
      byte dynticks_nmi_nesting = 0;
      byte dynticks = 0;
      
      /*
       * Promela version of rcu_nmi_enter().
       */
      inline rcu_nmi_enter()
      {
      	byte incby;
      	byte tmp;
      
      	incby = BUSY_INCBY;
      	assert(dynticks_nmi_nesting >= 0);
      	if
      	:: (dynticks & 1) == 0 ->
      		atomic {
      			dynticks = dynticks + 1;
      		}
      		assert((dynticks & 1) == 1);
      		incby = 1;
      	:: else ->
      		skip;
      	fi;
      	tmp = dynticks_nmi_nesting;
      	tmp = tmp + incby;
      	dynticks_nmi_nesting = tmp;
      	assert(dynticks_nmi_nesting >= 1);
      }
      
      /*
       * Promela version of rcu_nmi_exit().
       */
      inline rcu_nmi_exit()
      {
      	byte tmp;
      
      	assert(dynticks_nmi_nesting > 0);
      	assert((dynticks & 1) != 0);
      	if
      	:: dynticks_nmi_nesting != 1 ->
      		tmp = dynticks_nmi_nesting;
      		tmp = tmp - BUSY_INCBY;
      		dynticks_nmi_nesting = tmp;
      	:: else ->
      		dynticks_nmi_nesting = 0;
      		atomic {
      			dynticks = dynticks + 1;
      		}
      		assert((dynticks & 1) == 0);
      	fi;
      }
      
      /*
       * Base-level NMI runs non-atomically.  Crudely emulates process-level
       * dynticks-idle entry/exit.
       */
      proctype base_NMI()
      {
      	byte busy;
      
      	busy = 0;
      	do
      	::	/* Emulate base-level dynticks and not. */
      		if
      		:: 1 ->	atomic {
      				dynticks = dynticks + 1;
      			}
      			busy = 1;
      		:: 1 ->	skip;
      		fi;
      
      		/* Verify that we only sometimes have base-level dynticks. */
      		if
      		:: busy == 0 -> skip;
      		:: busy == 1 -> skip;
      		fi;
      
      		/* Model RCU's NMI entry and exit actions. */
      		rcu_nmi_enter();
      		assert((dynticks & 1) == 1);
      		rcu_nmi_exit();
      
      		/* Emulate re-entering base-level dynticks and not. */
      		if
      		:: !busy -> skip;
      		:: busy ->
      			atomic {
      				dynticks = dynticks + 1;
      			}
      			busy = 0;
      		fi;
      
      		/* We had better now be in dyntick-idle mode. */
      		assert((dynticks & 1) == 0);
      	od;
      }
      
      /*
       * Nested NMI runs atomically to emulate interrupting base_level().
       */
      proctype nested_NMI()
      {
      	do
      	::	/*
      		 * Use an atomic section to model a nested NMI.  This is
      		 * guaranteed to interleave into base_NMI() between a pair
      		 * of base_NMI() statements, just as a nested NMI would.
      		 */
      		atomic {
      			/* Verify that we only sometimes are in dynticks. */
      			if
      			:: (dynticks & 1) == 0 -> skip;
      			:: (dynticks & 1) == 1 -> skip;
      			fi;
      
      			/* Model RCU's NMI entry and exit actions. */
      			rcu_nmi_enter();
      			assert((dynticks & 1) == 1);
      			rcu_nmi_exit();
      		}
      	od;
      }
      
      init {
      	run base_NMI();
      	run nested_NMI();
      }
      
      ------------------------------------------------------------------------
      
      The following script can be used to run this model if placed in
      rcu_nmi.spin:
      
      ------------------------------------------------------------------------
      
      if ! spin -a rcu_nmi.spin
      then
      	echo Spin errors!!!
      	exit 1
      fi
      if ! cc -DSAFETY -o pan pan.c
      then
      	echo Compilation errors!!!
      	exit 1
      fi
      ./pan -m100000
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: Lai Jiangshan <laijs@cn.fujitsu.com>
  3. 20 December, 2014 (1 commit)
  4. 19 December, 2014 (1 commit)
    • tick/powerclamp: Remove tick_nohz_idle abuse · a5fd9733
      Committed by Thomas Gleixner
      commit 4dbd2771 "tick: export nohz tick idle symbols for module
      use" was merged via the thermal tree without an explicit ack from the
      relevant maintainers.
      
      The exports are abused by the intel powerclamp driver which implements
      a fake idle state from a sched FIFO task. This causes all kinds of
      wreckage in the NOHZ core code which rightfully assumes that
      tick_nohz_idle_enter/exit() are only called from the idle task itself.
      
      Recent changes in the NOHZ core lead to a failure of the powerclamp
      driver and now people try to hack completely broken and backwards
      workarounds into the NOHZ core code. This is completely unacceptable
      and just papers over the real problem. There are way more subtle
      issues lurking around the corner.
      
      The real solution is to fix the powerclamp driver by rewriting it with
      a sane concept, but that's beyond the scope of this.
      
      So the only solution for now is to remove the calls into the core NOHZ
      code from the powerclamp trainwreck along with the exports. 
      
      Fixes: d6d71ee4 "PM: Introduce Intel PowerClamp Driver"
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Frederic Weisbecker <frederic@kernel.org>
      Cc: Pan Jacob jun <jacob.jun.pan@intel.com>
      Cc: LKP <lkp@01.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Zhang Rui <rui.zhang@intel.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1412181110110.17382@nanos
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  5. 18 December, 2014 (1 commit)
    • param: do not set store func without write perm · b0a65b0c
      Committed by Kees Cook
      When a module_param is defined without DAC write permissions, it can
      still be changed at runtime and updated. Drivers using a 0444 permission
      may be surprised that these values can still be changed.
      
      For drivers that want to allow updates, any S_IW* flag will set the
      "store" function as before. Drivers without S_IW* flags will have the
      "store" function unset, unforcing a read-only value. Drivers that wish
      neither "store" nor "get" can continue to use "0" for perms to stay out
      of sysfs entirely.
      
      Old behavior:
        # cd /sys/module/snd/parameters
        # ls -l
        total 0
        -r--r--r-- 1 root root 4096 Dec 11 13:55 cards_limit
        -r--r--r-- 1 root root 4096 Dec 11 13:55 major
        -r--r--r-- 1 root root 4096 Dec 11 13:55 slots
        # cat major
        116
        # echo -1 > major
        -bash: major: Permission denied
        # chmod u+w major
        # echo -1 > major
        # cat major
        -1
      
      New behavior:
        ...
        # chmod u+w major
        # echo -1 > major
        -bash: echo: write error: Input/output error
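      By way of illustration, a hypothetical driver parameter declared under the
      new rule would look like this (names and values are examples only):
      
      #include <linux/module.h>
      #include <linux/moduleparam.h>
      
      static int major = 116;
      module_param(major, int, 0444);	/* No S_IW* bit: no store function, so even
      					 * "chmod u+w" cannot make it writable. */
      
      static int slots = 1;
      module_param(slots, int, 0644);	/* S_IWUSR set: store function installed,
      					 * the value can be updated via sysfs. */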
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
  6. 15 December, 2014 (2 commits)
  7. 14 December, 2014 (8 commits)
  8. 13 December, 2014 (2 commits)
    • genirq: Prevent proc race against freeing of irq descriptors · c291ee62
      Committed by Thomas Gleixner
      Since the rework of the sparse interrupt code to actually free the
      unused interrupt descriptors there exists a race between the /proc
      interfaces to the irq subsystem and the code which frees the interrupt
      descriptor.
      
      CPU0				CPU1
      				show_interrupts()
      				  desc = irq_to_desc(X);
      free_desc(desc)
        remove_from_radix_tree();
        kfree(desc);
      				  raw_spinlock_irq(&desc->lock);
      
      /proc/interrupts is the only interface which can actively corrupt
      kernel memory via the lock access. /proc/stat can only read from freed
      memory.  Extremely hard to trigger, but possible.
      
      The interfaces in /proc/irq/N/ are not affected by this because the
      removal of the proc file is serialized in procfs against concurrent
      readers/writers. The removal happens before the descriptor is freed.
      
      For architectures which have CONFIG_SPARSE_IRQ=n this is a non issue
      as the descriptor is never freed. It's merely cleared out with the irq
      descriptor lock held. So any concurrent proc access will either see
      the old correct value or the cleared out ones.
      
      Protect the lookup and access to the irq descriptor in
      show_interrupts() with the sparse_irq_lock.
      
      Provide kstat_irqs_usr() which is protecting the lookup and access
      with sparse_irq_lock and switch /proc/stat to use it.
      
      Document the existing kstat_irqs interfaces so it's clear that the
      caller needs to take care about protection. The users of these
      interfaces are either not affected due to SPARSE_IRQ=n or already
      protected against removal.
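      A hedged sketch of the protected accessor described above (the kernel's
      actual helper for taking sparse_irq_lock may be named differently):
      
      unsigned int kstat_irqs_usr(unsigned int irq)
      {
      	unsigned int sum;
      
      	mutex_lock(&sparse_irq_lock);	/* Serialize against free_desc(). */
      	sum = kstat_irqs(irq);		/* Descriptor cannot be freed under us here. */
      	mutex_unlock(&sparse_irq_lock);
      	return sum;
      }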
      
      Fixes: 1f5a5b87 "genirq: Implement a sane sparse_irq allocator"
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
    • tracing / PM: Replace CONFIG_PM_RUNTIME with CONFIG_PM · 798bc6d8
      Committed by Rafael J. Wysocki
      After commit b2b49ccb (PM: Kconfig: Set PM_RUNTIME if PM_SLEEP is
      selected) PM_RUNTIME is always set if PM is set, so files that are
      built conditionally if CONFIG_PM_RUNTIME is set may now be built
      if CONFIG_PM is set.
      
      Replace CONFIG_PM_RUNTIME with CONFIG_PM in kernel/trace/Makefile
      for this reason.
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
  9. 12 December, 2014 (3 commits)
    • userns: Correct the comment in map_write · 36476bea
      Committed by Eric W. Biederman
      It is important that all maps are less than PAGE_SIZE
      or else setting the last byte of the buffer to '0'
      could write off the end of the allocated storage.
      
      Correct the misleading comment.
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      36476bea
    • userns: Allow setting gid_maps without privilege when setgroups is disabled · 66d2f338
      Committed by Eric W. Biederman
      Now that setgroups can be disabled and not reenabled, setting gid_map
      without privilege can now be enabled when setgroups is disabled.
      
      This restores most of the functionality that was lost when unprivileged
      setting of gid_map was removed.  Applications that use this functionality
      will need to check to see if they use setgroups or init_groups, and if they
      don't they can be fixed by simply disabling setgroups before writing to
      gid_map.
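      
      A minimal user-space sketch of that sequence, assuming the /proc interface
      described in these commits (error handling trimmed):
      
      #define _GNU_SOURCE
      #include <fcntl.h>
      #include <sched.h>
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      
      static void write_file(const char *path, const char *buf)
      {
      	int fd = open(path, O_WRONLY);
      
      	if (fd < 0 || write(fd, buf, strlen(buf)) < 0)
      		perror(path);
      	if (fd >= 0)
      		close(fd);
      }
      
      int main(void)
      {
      	char map[64];
      	gid_t gid = getgid();
      
      	if (unshare(CLONE_NEWUSER)) {			/* new unprivileged user namespace */
      		perror("unshare");
      		return 1;
      	}
      	write_file("/proc/self/setgroups", "deny");	/* must happen before gid_map */
      	snprintf(map, sizeof(map), "0 %u 1", (unsigned int)gid);
      	write_file("/proc/self/gid_map", map);		/* now permitted without privilege */
      	return 0;
      }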
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Andy Lutomirski <luto@amacapital.net>
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
    • userns: Add a knob to disable setgroups on a per user namespace basis · 9cc46516
      Committed by Eric W. Biederman
      - Expose the knob to user space through a proc file /proc/<pid>/setgroups
      
        A value of "deny" means the setgroups system call is disabled in the
        current process's user namespace and cannot be enabled in the
        future in this user namespace.
      
        A value of "allow" means the segtoups system call is enabled.
      
      - Descendant user namespaces inherit the value of setgroups from
        their parents.
      
      - A proc file is used (instead of a sysctl) as sysctls currently do
        not allow checking the permissions at open time.
      
      - Writing to the proc file is restricted to before the gid_map
        for the user namespace is set.
      
        This ensures that disabling setgroups at a user namespace
        level will never remove the ability to call setgroups
        from a process that already has that ability.
      
        A process may opt in to the setgroups disable for itself by
        creating, entering and configuring a user namespace or by calling
        setns on an existing user namespace with setgroups disabled.
        Processes without privileges already cannot call setgroups, so this
        is a no-op.  Processes with privilege become processes without
        privilege when entering a user namespace, and as with any other path
        to dropping privilege they would not have the ability to call
        setgroups.  So this remains within the bounds of what is possible
        without a knob to disable setgroups permanently in a user namespace.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      9cc46516
  10. 11 December, 2014 (7 commits)
    • printk: Do not disable preemption for accessing printk_func · 1fb8915b
      Committed by Steven Rostedt (Red Hat)
      As printk_func will either be the default function, or a per_cpu function
      for the current CPU, there's no reason to disable preemption to access
      it from printk. That's because if the printk_func is not the default
      then the caller had better have disabled preemption, as they were the one to
      change it.
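      
      A hedged sketch of the resulting access pattern (simplified from the shape
      of printk() after this change):
      
      asmlinkage int printk(const char *fmt, ...)
      {
      	printk_func_t vprintk_func;
      	va_list args;
      	int r;
      
      	va_start(args, fmt);
      	/* No preempt_disable()/preempt_enable() pair around this read: either
      	 * this is the default function, or whoever installed the per-CPU
      	 * override is already running with preemption disabled. */
      	vprintk_func = this_cpu_read(printk_func);
      	r = vprintk_func(fmt, args);
      	va_end(args);
      
      	return r;
      }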
      
      Link: http://lkml.kernel.org/r/CA+55aFz5-_LKW4JHEBoWinN9_ouNcGRWAF2FUA35u46FRN-Kxw@mail.gmail.com
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • perf: Fix events installation during moving group · 9fc81d87
      Committed by Jiri Olsa
      We allow the PMU driver to change the CPU on which an event
      should be installed.  This happened in patch:
      
        e2d37cd2 ("perf: Allow the PMU driver to choose the CPU on which to install events")
      
      This patch also forces all the group members to follow
      the currently opened event's cpu if the group happened
      to be moved.
      
      This and the change of event->cpu in perf_install_in_context()
      function introduced in:
      
        0cda4c02 ("perf: Introduce perf_pmu_migrate_context()")
      
      forces group members to change their event->cpu,
      if the currently-opened-event's PMU changed the cpu
      and there is a group move.
      
      The above behaviour causes a problem for breakpoint events,
      which use event->cpu to touch CPU-specific data for
      breakpoint accounting.  By changing event->cpu, some
      breakpoint slots were wrongly accounted for a given
      CPU.
      
      Vince's perf fuzzer hit this issue and caused the following
      WARN on my setup:
      
         WARNING: CPU: 0 PID: 20214 at arch/x86/kernel/hw_breakpoint.c:119 arch_install_hw_breakpoint+0x142/0x150()
         Can't find any breakpoint slot
         [...]
      
      This patch changes the group moving code to keep the event's
      original cpu.
      Reported-by: Vince Weaver <vince@deater.net>
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Vince Weaver <vince@deater.net>
      Cc: Yan, Zheng <zheng.z.yan@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1418243031-20367-3-git-send-email-jolsa@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • exit: pidns: fix/update the comments in zap_pid_ns_processes() · a53b8315
      Committed by Oleg Nesterov
      The comments in zap_pid_ns_processes() are not clear; we need to explain
      how this code actually works.
      
      1. "Ignore SIGCHLD" looks like optimization but it is not, we also
         need this for correctness.
      
      2. The comment above sys_wait4() could tell more.
      
         EXIT_ZOMBIE child is only possible if it has exited before we
         ignored SIGCHLD. Or if it is traced from the parent namespace,
         but in this case it will be reaped by debugger after detach,
         sys_wait4() acts as a synchronization point.
      
      3. The comment about TASK_DEAD (EXIT_DEAD in fact) children is
         outdated. Contrary to what it says we do not need to make sure
         they all go away after 0a01f2cc "pidns: Make the pidns proc
         mount/umount logic obvious".
      
         At the same time, we do need to wait for nr_hashed==init_pids,
         but the reasons are quite different and not obvious: setns().
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: pidns: alloc_pid() leaks pid_namespace if child_reaper is exiting · 24c037eb
      Committed by Oleg Nesterov
      alloc_pid() does get_pid_ns() beforehand but forgets to put_pid_ns() if it
      fails because disable_pid_allocation() was called by the exiting
      child_reaper.
      
      We could simply move get_pid_ns() down to successful return, but this fix
      tries to be as trivial as possible.
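      
      In sketch form, the fixed error path balances the earlier reference (greatly
      simplified; the real function has more steps between these lines):
      
      	get_pid_ns(ns);					/* reference taken up front */
      
      	spin_lock_irq(&pidmap_lock);
      	if (!(ns->nr_hashed & PIDNS_HASH_ADDING))	/* child_reaper is exiting */
      		goto out_unlock;
      	/* ... normal success path ... */
      
      out_unlock:
      	spin_unlock_irq(&pidmap_lock);
      	put_pid_ns(ns);					/* the put this fix adds */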
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Serge Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: exit_notify: re-use "dead" list to autoreap current · 6c66e7db
      Committed by Oleg Nesterov
      After the previous change we can add just the exiting EXIT_DEAD task to
      the "dead" list and remove another release_task(tsk).
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: reparent: call forget_original_parent() under tasklist_lock · 482a3767
      Committed by Oleg Nesterov
      Shift "release dead children" loop from forget_original_parent() to its
      caller, exit_notify().  It is safe to reap them even if our parent reaps
      us right after we drop tasklist_lock; those children no longer have any
      connection to the exiting task.
      
      And this allows us to avoid write_lock_irq(tasklist_lock) right after it
      was released by forget_original_parent(), we can simply call it with
      tasklist_lock held.
      
      While at it, move the comment about forget_original_parent() up to
      this function.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • exit: reparent: avoid find_new_reaper() if no children · ad9e206a
      Committed by Oleg Nesterov
      Now that the pid_ns logic has been isolated, we can change forget_original_parent()
      to return right after find_child_reaper() when father->children is empty;
      there is nothing to reparent in this case.
      
      In particular, this avoids find_alive_thread().  This helps when the
      whole process exits while it has a lot of PF_EXITING threads at the start
      of the thread list, which can easily lead to O(nr_threads ** 2) iterations.
      
      Trivial test case (tested under KVM, 2 CPUs):
      
          static void *tfunc(void *arg)
          {
              pause();
              return NULL;
          }
      
          static int child(unsigned int nt)
          {
              pthread_t pt;
      
              while (nt--)
                  assert(pthread_create(&pt, NULL, tfunc, NULL) == 0);
      
              pthread_kill(pt, SIGTRAP);
              pause();
              return 0;
          }
      
          int main(int argc, const char *argv[])
          {
              int stat;
              unsigned int nf = atoi(argv[1]);
              unsigned int nt = atoi(argv[2]);
      
              while (nf--) {
                  if (!fork())
                      return child(nt);
      
                  wait(&stat);
                  assert(stat == SIGTRAP);
              }
      
              return 0;
          }
      
      $ time ./test 16 16536 shows:
      
                    real        user         sys
          -    5m37.628s    0m4.437s    8m5.560s
          +    0m50.032s    0m7.130s    1m4.927s
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Aaron Tomlin <atomlin@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Sterling Alexander <stalexan@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>