1. 09 Jul, 2014 (9 commits)
    • cgroup: clean up sane_behavior handling · 7b9a6ba5
      Committed by Tejun Heo
      After the previous patch to remove sane_behavior support from
      non-default hierarchies, CGRP_ROOT_SANE_BEHAVIOR is used only to
      indicate the default hierarchy while parsing mount options.  This
      patch makes the following cleanups around it.
      
      * Don't show it among the mount options.  Eventually the default hierarchy
        will be assigned a different filesystem type.
      
      * As sane_behavior is no longer effective on non-default hierarchies
        and the default hierarchy doesn't accept any mount options,
        parse_cgroupfs_options() can consider the sane_behavior mount option as
        indicating the default hierarchy and fail if any other options are
        specified with it.  While at it, remove one of the double blank
        lines in the function.
      
      * cgroup_mount() can now simply test CGRP_ROOT_SANE_BEHAVIOR to tell
        whether to mount the default hierarchy or not.
      
      * As CGRP_ROOT_SANE_BEHAVIOR's only role now is indicating whether
        to select the default hierarchy or not during mount, it doesn't need
        to be set in the default hierarchy itself.  cgroup_init_early() is
        updated accordingly.
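      
      The resulting check in parse_cgroupfs_options() might look like the
      sketch below (the exact option-struct fields are an assumption):
      
           if (opts->flags & CGRP_ROOT_SANE_BEHAVIOR) {
                   /* sane_behavior now only selects the default
                    * hierarchy; no other option may accompany it */
                   if ((opts->flags & ~CGRP_ROOT_SANE_BEHAVIOR) ||
                       opts->subsys_mask || opts->release_agent ||
                       opts->name) {
                           pr_err("sane_behavior: no other mount options allowed\n");
                           return -EINVAL;
                   }
           }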
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove sane_behavior support on non-default hierarchies · aa6ec29b
      Committed by Tejun Heo
      sane_behavior has been used as a development vehicle for the default
      unified hierarchy.  Now that the default hierarchy is in place, the
      flag has become redundant and confusing, as its usage is allowed on
      all hierarchies.  From now on there will be either the default
      hierarchy or legacy ones.  Let's make that clear by removing
      sane_behavior support on non-default hierarchies.
      
      This patch replaces cgroup_sane_behavior() with cgroup_on_dfl().  The
      comment on top of CGRP_ROOT_SANE_BEHAVIOR is moved to on top of
      cgroup_on_dfl() with sane_behavior specific part dropped.
      
      On the default and legacy hierarchies w/o sane_behavior, this
      shouldn't cause any behavior differences.
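      
      The replacement test boils down to a root comparison; a minimal
      sketch, assuming the default hierarchy's root is named
      cgrp_dfl_root:
      
           static inline bool cgroup_on_dfl(const struct cgroup *cgrp)
           {
                   return cgrp->root == &cgrp_dfl_root;
           }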
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
    • cgroup: make interface file "cgroup.sane_behavior" legacy-only · c1d5d42e
      Committed by Tejun Heo
      "cgroup.sane_behavior" is added to help distinguishing whether
      sane_behavior is in effect or not.  We now have the default hierarchy
      where the flag is always in effect and are planning to remove
      supporting sane behavior on the legacy hierarchies making this file on
      the default hierarchy rather pointless.  Let's make it legacy only and
      thus always zero.
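      
      The file's show callback can then print a constant zero; a sketch
      (the callback name is an assumption):
      
           static int cgroup_sane_behavior_show(struct seq_file *seq, void *v)
           {
                   /* sane_behavior is never in effect on legacy */
                   seq_puts(seq, "0\n");
                   return 0;
           }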
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: remove CGRP_ROOT_OPTION_MASK · 7450e90b
      Committed by Tejun Heo
      cgroup_root->flags only contains CGRP_ROOT_* flags and there's no
      reason to mask the flags.  Remove CGRP_ROOT_OPTION_MASK.
      
      This doesn't cause any behavior differences.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: implement cgroup_subsys->depends_on · af0ba678
      Committed by Tejun Heo
      Currently, the blkio subsystem attributes all writeback IOs to the
      root.  One of the issues is that the block layer has no way to tell
      who originated a writeback IO.  Those IOs are usually issued
      asynchronously from a task which didn't have anything to do with
      actually generating the dirty pages.  The memory subsystem, when
      enabled, already keeps track of the ownership of each dirty page and
      it's desirable for blkio to piggyback instead of adding its own
      per-page tag.
      
      blkio piggybacking on memory is an implementation detail which
      preferably should be handled automatically without requiring explicit
      userland action.  To achieve that, this patch implements
      cgroup_subsys->depends_on which contains the mask of subsystems which
      should be enabled together when the subsystem is enabled.
      
      The previous patches already implemented the support for enabled but
      invisible subsystems and cgroup_subsys->depends_on can be easily
      implemented by updating cgroup_refresh_child_subsys_mask() so that it
      calculates cgroup->child_subsys_mask considering
      cgroup_subsys->depends_on of the explicitly enabled subsystems.
      
      Documentation/cgroups/unified-hierarchy.txt is updated to explain that
      subsystems may not become immediately available after being unused
      from userland and that dependency could be a factor in it.  As
      subsystems may already keep residual references, this doesn't
      significantly change how subsystem rebinding can be used.
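      
      A sketch of the dependency-aware mask computation, iterating until
      the mask reaches a fixed point since a dependency may itself depend
      on further subsystems (helper and field names are assumptions):
      
           static void cgroup_refresh_child_subsys_mask(struct cgroup *cgrp)
           {
                   unsigned int cur_mask = cgrp->subtree_control;
                   struct cgroup_subsys *ss;
                   int ssid;
      
                   while (true) {
                           unsigned int new_mask = cur_mask;
      
                           for_each_subsys(ss, ssid)
                                   if (cur_mask & (1 << ssid))
                                           new_mask |= ss->depends_on;
      
                           /* mask stopped growing, closure reached */
                           if (new_mask == cur_mask)
                                   break;
                           cur_mask = new_mask;
                   }
      
                   cgrp->child_subsys_mask = cur_mask;
           }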
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    • cgroup: implement cgroup_subsys->css_reset() · b4536f0c
      Committed by Tejun Heo
      cgroup is implementing support for subsystem dependency which would
      require a way to enable a subsystem even when it's not directly
      configured through "cgroup.subtree_control".
      
      The previous patches added support for explicitly and implicitly
      enabled subsystems and showing/hiding their interface files.  An
      explicitly enabled subsystem may become implicitly enabled if it's
      turned off through "cgroup.subtree_control" but there are subsystems
      depending on it.  In such cases, the subsystem, as it's turned off
      when seen from userland, shouldn't enforce any resource control.
      Also, the subsystem may be explicitly turned on later again and its
      interface files should be as close to the initial state as possible.
      
      This patch adds cgroup_subsys->css_reset() which is invoked when a css
      is hidden.  The callback should disable resource control and reset the
      state to the vanilla state.
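      
      For an individual controller the callback could look like this
      hypothetical sketch (the "foo" controller and its fields are made up
      for illustration):
      
           static void foo_css_reset(struct cgroup_subsys_state *css)
           {
                   struct foo_cgroup *foo = css_to_foo(css);
      
                   /* stop enforcing limits and return to defaults so a
                    * later re-enable starts from a vanilla state */
                   foo->limit = FOO_LIMIT_DEFAULT;
                   foo->throttle_enabled = false;
           }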
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    • cgroup: make interface files visible iff enabled on cgroup->subtree_control · f63070d3
      Committed by Tejun Heo
      cgroup is implementing support for subsystem dependency which would
      require a way to enable a subsystem even when it's not directly
      configured through "cgroup.subtree_control".
      
      The preceding patch distinguished cgroup->subtree_control and
      ->child_subsys_mask where the former is the subsystems explicitly
      configured by the userland and the latter is all enabled subsystems,
      which is currently equal to the former but will also include
      subsystems implicitly enabled through dependency.
      
      Subsystems which are enabled due to dependency shouldn't be visible to
      userland.  This patch updates cgroup_subtree_control_write() and
      create_css() such that interface files are not created for implicitly
      enabled subsystems.
      
      * @visible parameter is added to create_css().  Interface files are
        created only when true.
      
      * If an already implicitly enabled subsystem is turned on through
        "cgroup.subtree_control", the existing css should be used.  css
        draining is skipped.
      
      * cgroup_subtree_control_write() computes the new target
        cgroup->child_subsys_mask and create/kill or show/hide csses
        accordingly.
      
      As the two subsystem masks are still kept identical, this patch
      doesn't introduce any behavior changes.
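      
      create_css() then gates interface-file creation on the new
      parameter; a sketch (signature and helper names are assumptions):
      
           static int create_css(struct cgroup *cgrp, struct cgroup_subsys *ss,
                                 bool visible)
           {
                   ...
                   /* implicitly enabled csses get no interface files */
                   if (visible) {
                           err = cgroup_populate_dir(cgrp, 1 << ss->id);
                           if (err)
                                   goto err_destroy;
                   }
                   ...
           }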
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    • cgroup: introduce cgroup->subtree_control · 667c2491
      Committed by Tejun Heo
      cgroup is implementing support for subsystem dependency which would
      require a way to enable a subsystem even when it's not directly
      configured through "cgroup.subtree_control".
      
      Previously, cgroup->child_subsys_mask directly reflected
      "cgroup.subtree_control" and the enabled subsystems in the child
      cgroups.  This patch adds cgroup->subtree_control which
      "cgroup.subtree_control" operates on.  cgroup->child_subsys_mask is
      now calculated from cgroup->subtree_control by
      cgroup_refresh_child_subsys_mask(), which sets it identical to
      cgroup->subtree_control for now.
      
      This will allow using cgroup->child_subsys_mask for all the enabled
      subsystems including the implicit ones and ->subtree_control for
      tracking the explicitly requested ones.  This patch keeps the two
      masks identical and doesn't introduce any behavior changes.
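      
      At this stage the refresh is an identity mapping; a sketch (the
      dependency-aware version appears with the depends_on patch earlier
      in this log):
      
           static void cgroup_refresh_child_subsys_mask(struct cgroup *cgrp)
           {
                   /* identical for now; will later fold in
                    * cgroup_subsys->depends_on */
                   cgrp->child_subsys_mask = cgrp->subtree_control;
           }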
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    • cgroup: reorganize cgroup_subtree_control_write() · c29adf24
      Committed by Tejun Heo
      Make the following two reorganizations to
      cgroup_subtree_control_write().  These are to prepare for future
      changes and shouldn't cause any functional difference.
      
      * Move the availability check above the css offlining wait.
      
      * Move the cgrp->child_subsys_mask update above new css creation.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
  2. 17 Jun, 2014 (2 commits)
  3. 16 Jun, 2014 (1 commit)
    • rtmutex: Plug slow unlock race · 27e35715
      Committed by Thomas Gleixner
      When the rtmutex fast path is enabled the slow unlock function can
      create the following situation:
      
      spin_lock(foo->m->wait_lock);
      foo->m->owner = NULL;
      	    			rt_mutex_lock(foo->m); <-- fast path
      				free = atomic_dec_and_test(foo->refcnt);
      				rt_mutex_unlock(foo->m); <-- fast path
      				if (free)
      				   kfree(foo);
      
      spin_unlock(foo->m->wait_lock); <--- Use after free.
      
      Plug the race by changing the slow unlock to the following scheme:
      
           while (!rt_mutex_has_waiters(m)) {
           	    /* Clear the waiters bit in m->owner */
      	    clear_rt_mutex_waiters(m);
            	    owner = rt_mutex_owner(m);
            	    spin_unlock(m->wait_lock);
            	    if (cmpxchg(m->owner, owner, 0) == owner)
            	       return;
            	    spin_lock(m->wait_lock);
           }
      
      So in case of a new waiter arriving while the owner takes the slow
      unlock path, we have two situations:
      
       unlock(wait_lock);
      					lock(wait_lock);
       cmpxchg(p, owner, 0) == owner
       	    	   			mark_rt_mutex_waiters(lock);
      	 				acquire(lock);
      
      Or:
      
       unlock(wait_lock);
      					lock(wait_lock);
      	 				mark_rt_mutex_waiters(lock);
       cmpxchg(p, owner, 0) != owner
      					enqueue_waiter();
      					unlock(wait_lock);
       lock(wait_lock);
       wakeup_next waiter();
       unlock(wait_lock);
      					lock(wait_lock);
      					acquire(lock);
      
      If the fast path is disabled, then the simple
      
         m->owner = NULL;
         unlock(m->wait_lock);
      
      is sufficient as all access to m->owner is serialized via
      m->wait_lock.
      
      Also document and clarify the wakeup_next_waiter function as suggested
      by Oleg Nesterov.
      Reported-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  4. 14 Jun, 2014 (1 commit)
  5. 11 Jun, 2014 (4 commits)
  6. 10 Jun, 2014 (3 commits)
  7. 09 Jun, 2014 (3 commits)
  8. 07 Jun, 2014 (17 commits)
    • rtmutex: Detect changes in the pi lock chain · 82084984
      Committed by Thomas Gleixner
      When we walk the lock chain, we drop all locks after each step. So the
      lock chain can change under us before we reacquire the locks. That's
      harmless in principle as we just follow the wrong lock path. But it
      can lead to a false positive in the deadlock detection logic:
      
      T0 holds L0
      T0 blocks on L1 held by T1
      T1 blocks on L2 held by T2
      T2 blocks on L3 held by T3
      T3 blocks on L4 held by T4
      
      Now we walk the chain
      
      lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> 
           lock T2 ->  adjust T2 ->  drop locks
      
      T2 times out and blocks on L0
      
      Now we continue:
      
      lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.
      
      Brad tried to work around that in the deadlock detection logic itself,
      but the more I looked at it the less I liked it, because it's crystal
      ball magic after the fact.
      
      We can actually detect a chain change very simply:
      
      lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
      
           next_lock = T2->pi_blocked_on->lock;
      
      drop locks
      
      T2 times out and blocks on L0
      
      Now we continue:
      
      lock T2 -> 
      
           if (next_lock != T2->pi_blocked_on->lock)
           	   return;
      
      So if we detect that T2 is now blocked on a different lock we stop the
      chain walk. That's also correct in the following scenario:
      
      lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->
      
           next_lock = T2->pi_blocked_on->lock;
      
      drop locks
      
      T3 times out and drops L3
      T2 acquires L3 and blocks on L4 now
      
      Now we continue:
      
      lock T2 -> 
      
           if (next_lock != T2->pi_blocked_on->lock)
           	   return;
      
      We don't have to follow up the chain at that point, because T2
      propagated our priority up to T4 already.
      
      [ Folded a cleanup patch from peterz ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reported-by: Brad Mouring <bmouring@ni.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
      Cc: stable@vger.kernel.org
    • rtmutex: Handle deadlock detection smarter · 3d5c9340
      Committed by Thomas Gleixner
      Even in the case when deadlock detection is not requested by the
      caller, we can detect deadlocks.  Right now the code stops the lock
      chain walk and keeps the waiter enqueued, even on itself.  It is silly
      not to yell when such a scenario is detected, and to keep the waiter
      enqueued on top of that.
      
      Return -EDEADLK unconditionally and handle it at the call sites.
      
      The futex calls return -EDEADLK.  The non-futex ones dequeue the
      waiter, throw a warning and put the task into a schedule loop.
      
      Tagged for stable as it makes the code more robust.
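      
      For the non-futex callers the handling could look like this sketch
      (the helper name is an assumption; the waiter is dequeued under
      wait_lock at the call site):
      
           static void rt_mutex_handle_deadlock(int res, int detect_deadlock,
                                                struct rt_mutex_waiter *w)
           {
                   if (res != -EDEADLK || detect_deadlock)
                           return;
      
                   /* yell, then park the deadlocked task forever */
                   WARN(1, "rtmutex deadlock detected\n");
                   while (1) {
                           set_current_state(TASK_INTERRUPTIBLE);
                           schedule();
                   }
           }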
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Brad Mouring <bmouring@ni.com>
      Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • tracing: Fix memory leak on instance deletion · a9fcaaac
      Committed by Steven Rostedt (Red Hat)
      When an instance is created, it also gets a snapshot ring buffer
      allocated (with a minimum number of pages).  But when the instance is
      deleted, the snapshot buffer is not.  A helper function was added to
      match the allocation of these ring buffers with a way to free them,
      but it wasn't used by the instance deletion path.  Using that helper
      function solves this memory leak.
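      
      The fix amounts to calling the allocation-matching helper for both
      buffers on deletion; a sketch (helper names are assumptions):
      
           static void free_trace_buffers(struct trace_array *tr)
           {
                   free_trace_buffer(&tr->trace_buffer);
      
           #ifdef CONFIG_TRACER_MAX_TRACE
                   /* the snapshot buffer that was previously leaked */
                   free_trace_buffer(&tr->max_buffer);
           #endif
           }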
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • sysctl: convert use of typedef ctl_table to struct ctl_table · 6f8fd1d7
      Committed by Joe Perches
      This typedef is unnecessary and should just be removed.
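      
      The conversion is mechanical; for example (table name chosen for
      illustration):
      
           -static ctl_table kern_table[] = {
           +static struct ctl_table kern_table[] = {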
      Signed-off-by: Joe Perches <joe@perches.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/seccomp.c: kernel-doc warning fix · 119ce5c8
      Committed by Fabian Frederick
      Also fix a small typo.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc, kernel: clear whitespace · 46c0a8ca
      Committed by Paul McQuade
      Remove trailing whitespace.
      Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • ipc, kernel: use Linux headers · 7153e402
      Committed by Paul McQuade
      Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
      Use #include <linux/types.h> instead of <asm/types.h>
      Signed-off-by: Paul McQuade <paulmcquad@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/profile.c: use static const char instead of static char · f3da64d1
      Committed by Fabian Frederick
      schedstr, sleepstr and kvmstr are only used in strcmp() and strlen().
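      
      The change is a one-word qualifier per string, e.g.:
      
           -static char schedstr[] = "schedule";
           +static const char schedstr[] = "schedule";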
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/profile.c: convert printk to pr_foo() · aba871f1
      Committed by Fabian Frederick
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/user_namespace.c: kernel-doc/checkpatch fixes · 68a9a435
      Committed by Fabian Frederick
      - Fix uid -> gid in kernel-doc.
      - Split some function declarations.
      - Fix an if/then/else warning.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: allow for strict write position handling · f4aacea2
      Committed by Kees Cook
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl controls the string contents instead of the
      first:
      
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
      
        $ cat /proc/sys/kernel/modprobe
        /bin/true
      
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      
      The old behavior is different enough from regular POSIX files that
      audits of software that interacts with sysctls can end up in
      unexpected or dangerous situations.  For example, "as long as the
      input starts with a trusted path" turns out to be an insufficient
      filter, as the input must also be entirely contained in a single
      write syscall -- not a common consideration, especially for
      high-level tools.
      
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
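      
      A sketch of the resulting policy check on the write path (the warn
      helper and the numeric-table test are assumptions):
      
           if (write && *ppos != 0) {
                   switch (sysctl_writes_strict) {
                   case -1:        /* legacy behavior, warning disabled */
                           break;
                   case 1:         /* respect the file position */
                           if (table_is_numeric)
                                   return -EINVAL; /* no partial numbers */
                           break;                  /* strings: write at *ppos */
                   default:        /* 0: warn, keep legacy behavior */
                           warn_sysctl_write(table);
                           break;
                   }
           }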
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: refactor sysctl string writing logic · 2ca9bb45
      Committed by Kees Cook
      Consolidate buffer length checking with new-line/end-of-line checking.
      Additionally, instead of reading user memory twice, just do the
      assignment during the loop.
      
      This change doesn't affect the potential races here.  It was already
      possible to read a sysctl that was in the middle of a write.  In both
      cases, the string will always be NULL terminated.  The pre-existing race
      remains a problem to be solved.
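      
      The consolidated copy loop might look like this sketch (variable
      names are assumptions):
      
           char __user *p = buffer;
           char c, *data = table->data;
           size_t len = 0;
      
           /* copy until the value fills up or a NUL/newline ends the
            * input; the assignment happens right in the loop */
           while (len < *lenp && len < (size_t)table->maxlen - 1) {
                   if (get_user(c, p++))
                           return -EFAULT;
                   if (c == 0 || c == '\n')
                           break;
                   data[len++] = c;
           }
           data[len] = '\0';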
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sysctl: clean up char buffer arguments · f8808300
      Committed by Kees Cook
      When writing to a sysctl string, each write, regardless of VFS position,
      began writing the string from the start.  This meant the contents of the
      last write to the sysctl controlled the string contents instead of the
      first.
      
      This misbehavior was featured in an exploit against Chrome OS.  While
      it's not in itself a vulnerability, it's a weirdness that isn't on the
      mind of most auditors: "This filter looks correct, the first line
      written would not be meaningful to sysctl" doesn't apply here, since the
      size of the write and the contents of the final write are what matter
      when writing to sysctls.
      
      This adds the sysctl kernel.sysctl_writes_strict to control the write
      behavior.  The default (0) reports when VFS position is non-0 on a
      write, but retains legacy behavior, -1 disables the warning, and 1
      enables the position-respecting behavior.
      
      The long-term plan here is to wait for userspace to be fixed in response
      to the new warning and to then switch the default kernel behavior to the
      new position-respecting behavior.
      
      This patch (of 4):
      
      The char buffer arguments are needlessly cast in weird places.  Clean it
      up so things are easier to read.
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/kexec.c: convert printk to pr_foo() · e1bebcf4
      Committed by Fabian Frederick
      Also convert some pr_warning calls to pr_warn and fix checkpatch
      warnings.
      Signed-off-by: Fabian Frederick <fabf@skynet.be>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifiers · f06e5153
      Committed by Masami Hiramatsu
      Add a "crash_kexec_post_notifiers" boot option to run kdump after
      running panic_notifiers and dumping kmsg.  This can help in rare
      situations where kdump fails because of an unstable crashed kernel
      or a hardware failure (memory corruption on critical data/code), or
      where the 2nd kernel is already broken by the 1st kernel (it's broken
      behavior, but who can guarantee that the "crashed" kernel works
      correctly?).
      
      Usage: add "crash_kexec_post_notifiers" to the kernel boot options.
      
      Note that this actually increases the risk of kdump failure.  This
      option should be set only if you worry about the rare case of kdump
      failure rather than about maximizing the chance of success.
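      
      The effect on the ordering in panic(), roughly:
      
           if (!crash_kexec_post_notifiers)
                   crash_kexec(NULL);      /* default: kdump first */
      
           atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
           kmsg_dump(KMSG_DUMP_PANIC);
      
           if (crash_kexec_post_notifiers)
                   crash_kexec(NULL);      /* opt-in: kdump last */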
      Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: Motohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>
      Acked-by: Vivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
      Cc: Satoru MORIYA <satoru.moriya.br@hitachi.com>
      Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • smp: print more useful debug info upon receiving IPI on an offline CPU · a219ccf4
      Committed by Srivatsa S. Bhat
      There is a longstanding problem related to CPU hotplug which causes IPIs
      to be delivered to offline CPUs, and the smp-call-function IPI handler
      code prints out a warning whenever this is detected.  Every once in a
      while this (usually harmless) warning gets reported on LKML, but so far
      it has not been completely fixed.  Usually the solution involves finding
      out the IPI sender and fixing it by adding appropriate synchronization
      with CPU hotplug.
      
      However, while going through one such internal bug report, I found that
      there is a significant bug in the receiver side itself (more
      specifically, in stop-machine) that can lead to this problem even when
      the sender code is perfectly fine.  This patchset fixes that
      synchronization problem in the CPU hotplug stop-machine code.
      
      Patch 1 adds some additional debug code to the smp-call-function
      framework, to help debug such issues easily.
      
      Patch 2 modifies the stop-machine code to ensure that any IPIs that were
      sent while the target CPU was online, would be noticed and handled by
      that CPU without fail before it goes offline.  Thus, this avoids
      scenarios where IPIs are received on offline CPUs (as long as the sender
      uses proper hotplug synchronization).
      
      In fact, I debugged the problem by using Patch 1, and found that the
      payload of the IPI was always the block layer's trigger_softirq()
      function.  But I was not able to find anything wrong with the block
      layer code.  That's when I started looking at the stop-machine code and
      realized that there is a race-window which makes the IPI _receiver_ the
      culprit, not the sender.  Patch 2 fixes that race and hence this should
      put an end to most of the hard-to-debug IPI-to-offline-CPU issues.
      
      This patch (of 2):
      
      Today the smp-call-function code just prints a warning if we get an IPI
      on an offline CPU.  This info is sufficient to let us know that
      something went wrong, but often it is very hard to debug exactly who
      sent the IPI and why, from this info alone.
      
      In most cases, we get the warning about the IPI to an offline CPU,
      immediately after the CPU going offline comes out of the stop-machine
      phase and reenables interrupts.  Since all online CPUs participate in
      stop-machine, the information regarding the sender of the IPI is already
      lost by the time we exit the stop-machine loop.  So even if we dump the
      stack on each CPU at this point, we won't find anything useful since all
      of them will show the stack-trace of the stopper thread.  So we need a
      better way to figure out who sent the IPI and why.
      
      To achieve this, when we detect an IPI targeted to an offline CPU, loop
      through the call-single-data linked list and print out the payload
      (i.e., the name of the function which was supposed to be executed by the
      target CPU).  This would give us an insight as to who might have sent
      the IPI and help us debug this further.
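      
      A sketch of the dump in the IPI handler (list and field names are
      assumptions):
      
           if (unlikely(!cpu_online(smp_processor_id()))) {
                   pr_warn("IPI received on offline CPU %d\n",
                           smp_processor_id());
      
                   /* name the queued payloads to identify the sender */
                   llist_for_each_entry(csd, entry, llist)
                           pr_warn("  pending csd function: %pf\n",
                                   csd->func);
           }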
      
      [akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • signals: change wait_for_helper() to use kernel_sigaction() · 76e0a6f4
      Committed by Oleg Nesterov
      Now that we have kernel_sigaction() we can change wait_for_helper()
      to use it, which cleans up the code a bit.
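      
      The cleanup, roughly:
      
           /* before: open-coded reset of the SIGCHLD handler */
           spin_lock_irq(&current->sighand->siglock);
           current->sighand->action[SIGCHLD - 1].sa.sa_handler = SIG_DFL;
           spin_unlock_irq(&current->sighand->siglock);
      
           /* after */
           kernel_sigaction(SIGCHLD, SIG_DFL);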
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>