1. 07 May 2012, 3 commits
  2. 14 April 2012, 2 commits
    • uprobes/core: Decrement uprobe count before the pages are unmapped · cbc91f71
      Srikar Dronamraju committed
      Uprobes has a callback (uprobe_munmap()) in the unmap path to
      maintain the uprobes count.
      
      In the exit path this callback gets called in unlink_file_vma().
      However by the time unlink_file_vma() is called, the pages would
      have been unmapped (in unmap_vmas()) and the task->rss_stat counts
      accounted (in zap_pte_range()).
      
      If the exiting process has probepoints, uprobe_munmap() checks if
      the breakpoint instruction was around before decrementing the probe
      count.
      
      This results in a file backed page being reread by uprobe_munmap()
      and hence it does not find the breakpoint.
      
      This patch fixes this problem by moving the callback to
      unmap_single_vma(). Since unmap_single_vma() may not unmap the
      complete vma, add start and end parameters to uprobe_munmap().
      
      This bug became apparent courtesy of commit c3f0327f
      ("mm: add rss counters consistency check").
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Linux-mm <linux-mm@kvack.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20120411103527.23245.9835.sendpatchset@srdronam.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      cbc91f71
    • uprobes/core: Make background page replacement logic account for rss_stat counters · 7396fa81
      Srikar Dronamraju committed
      Background page replacement logic adds a new anonymous page
      instead of a file backed (while inserting a breakpoint) /
      anonymous page (while removing a breakpoint).
      
      Hence the uprobes logic should take care to update the
      task->rss_stat counters accordingly.
      
      This bug became apparent courtesy of commit c3f0327f
      ("mm: add rss counters consistency check").
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Linux-mm <linux-mm@kvack.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20120411103516.23245.2700.sendpatchset@srdronam.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7396fa81
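The accounting described in this commit can be pictured with a small userspace model. This is only a sketch: `struct mm_model`, `replace_with_anon_page()` and `rss_counters_balanced()` are hypothetical names invented here, not the kernel's data structures or helpers.

```c
#include <assert.h>

/* Hypothetical stand-ins for the per-mm RSS counters (not kernel code). */
enum rss_kind { RSS_FILEPAGES, RSS_ANONPAGES, RSS_NR_KINDS };

struct mm_model {
    long rss[RSS_NR_KINDS];
};

/*
 * Model of swapping one mapped page for a fresh anonymous copy.
 * old_is_anon tells what kind of page was mapped before the swap.
 */
static void replace_with_anon_page(struct mm_model *mm, int old_is_anon)
{
    if (!old_is_anon) {
        /* A file-backed page leaves the mapping, an anonymous one enters. */
        mm->rss[RSS_FILEPAGES]--;
        mm->rss[RSS_ANONPAGES]++;
    }
    /* anon -> anon: one page out, one page in, so nothing changes. */
}

/* A consistency check of this kind passes only if every counter is zero. */
static int rss_counters_balanced(const struct mm_model *mm)
{
    for (int i = 0; i < RSS_NR_KINDS; i++)
        if (mm->rss[i] != 0)
            return 0;
    return 1;
}
```

If the swap skipped the counter update, the later unmap would decrement the anonymous counter while the file counter still held the original page, and the check at teardown would fail.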
  3. 06 April 2012, 1 commit
    • simple_open: automatically convert to simple_open() · 234e3405
      Stephen Boyd committed
      Many users of debugfs copy the implementation of default_open() when
      they want to support a custom read/write function op.  This leads to a
      proliferation of the default_open() implementation across the entire
      tree.
      
      Now that the common implementation has been consolidated into libfs we
      can replace all the users of this function with simple_open().
      
      This replacement was done with the following semantic patch:
      
      <smpl>
      @ open @
      identifier open_f != simple_open;
      identifier i, f;
      @@
      -int open_f(struct inode *i, struct file *f)
      -{
      (
      -if (i->i_private)
      -f->private_data = i->i_private;
      |
      -f->private_data = i->i_private;
      )
      -return 0;
      -}
      
      @ has_open depends on open @
      identifier fops;
      identifier open.open_f;
      @@
      struct file_operations fops = {
      ...
      -.open = open_f,
      +.open = simple_open,
      ...
      };
      </smpl>
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Julia Lawall <Julia.Lawall@lip6.fr>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      234e3405
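As a rough illustration of the pattern the semantic patch consolidates, the sketch below models the open callback in plain userspace C. The `*_model` types and `example_fops` are hypothetical stand-ins invented for this example; only the body of the function follows the lines removed by the semantic patch above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for struct inode and struct file. */
struct inode_model { void *i_private; };
struct file_model  { void *private_data; };

/* Same shape as the open_f() bodies matched by the semantic patch. */
static int simple_open_model(struct inode_model *inode, struct file_model *file)
{
    if (inode->i_private)
        file->private_data = inode->i_private;
    return 0;
}

/* Hypothetical operations table: one shared helper instead of a local copy. */
struct file_ops_model {
    int (*open)(struct inode_model *, struct file_model *);
};

static const struct file_ops_model example_fops = {
    .open = simple_open_model,
};
```

Each driver that previously carried its own copy of this function only needs the `.open` initializer, which is the substitution the second rule of the semantic patch performs.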
  4. 05 April 2012, 1 commit
  5. 02 April 2012, 1 commit
  6. 31 March 2012, 4 commits
    • uprobes/core: Optimize probe hits with the help of a counter · 682968e0
      Srikar Dronamraju committed
      Maintain a per-mm counter: number of uprobes that are inserted
      on this process address space.
      
      This counter can be used at probe hit time to determine if we
      need a lookup in the uprobes rbtree. Every time a probe gets
      inserted successfully, the probe count is incremented and
      every time a probe gets removed, the probe count is decremented.
      
      The new uprobe_munmap hook ensures the count is correct on an
      unmap or remap of a region. We expect that once a
      uprobe_munmap() is called, the vma goes away.  So
      uprobe_unregister() finding a probe to unregister would either
      mean the unmap event hasn't occurred yet or a mmap event on the same
      executable file occurred after an unmap event.
      
      Additionally, uprobe_mmap hook now also gets called:
      
       a. on every executable vma that is COWed at fork.
       b. a vma of interest is newly mapped; breakpoint insertion also
          happens at the required address.
      
      On process creation, make sure the probes count in the child is
      set correctly.
      
      Special cases that are taken care of include:
      
       a. mremap
       b. VM_DONTCOPY vmas on fork()
       c. insertion/removal races in the parent during fork().
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Linux-mm <linux-mm@kvack.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20120330182646.10018.85805.sendpatchset@srdronam.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      682968e0
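The idea of using the per-mm counter as a fast path can be sketched in userspace C as follows. All names here (`struct mm_model`, `handle_breakpoint()`, `slow_lookup()`) are hypothetical, and the single hard-coded probe address stands in for the real rbtree search.

```c
#include <assert.h>

/* Hypothetical model of the per-address-space state (not kernel code). */
struct mm_model {
    int uprobe_count;   /* probes currently inserted in this address space */
    int lookups;        /* how often the expensive lookup was performed */
};

static void probe_inserted(struct mm_model *mm) { mm->uprobe_count++; }

static void probe_removed(struct mm_model *mm)
{
    if (mm->uprobe_count > 0)
        mm->uprobe_count--;
}

/* Stand-in for the rbtree search; this model knows one probe at 0x1000. */
static int slow_lookup(struct mm_model *mm, unsigned long addr)
{
    mm->lookups++;
    return addr == 0x1000;
}

/* Returns 1 if the breakpoint at addr belongs to a probe, 0 otherwise. */
static int handle_breakpoint(struct mm_model *mm, unsigned long addr)
{
    if (mm->uprobe_count == 0)
        return 0;               /* fast path: no probes, skip the lookup */
    return slow_lookup(mm, addr);
}
```

The fast path is only safe if the count never reads zero while a probe is still inserted, which is why the commit takes care of the unmap, remap and fork cases listed above.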
    • uprobes/core: Allocate XOL slots for uprobes use · d4b3b638
      Srikar Dronamraju committed
      Uprobes executes the original instruction at a probed location
      out of line. For this, we allocate a page (per mm) upon the
      first uprobe hit, in the process user address space, divide it
      into slots that are used to store the actual instructions to be
      singlestepped. These slots are known as xol (execution out of
      line) slots.
      
      Care is taken to ensure that the allocation is in an unmapped
      area as close to the top of the user address space as possible,
      with appropriate permission settings to keep SELinux-like
      frameworks happy.
      
      Upon a uprobe hit, a free slot is acquired, and is released
      after the singlestep completes.
      
      Lots of improvements courtesy of suggestions/inputs from Peter and
      Oleg.
      
      [ Folded a fix for build issue on powerpc fixed and reported by
        Stephen Rothwell. ]
      Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Jim Keniston <jkenisto@linux.vnet.ibm.com>
      Cc: Linux-mm <linux-mm@kvack.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@infradead.org>
      Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Cc: Anton Arapov <anton@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20120330182631.10018.48175.sendpatchset@srdronam.in.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      d4b3b638
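A slot allocator of the kind described here can be sketched with a bitmap over one page. This is a userspace illustration only: the 4096-byte page, the 128-byte slot size and every identifier below are assumptions made for the example, not values taken from the commit.

```c
#include <assert.h>
#include <stdint.h>

#define XOL_PAGE_SIZE 4096u                              /* assumed page size */
#define XOL_SLOT_SIZE 128u                               /* assumed slot size */
#define XOL_NR_SLOTS  (XOL_PAGE_SIZE / XOL_SLOT_SIZE)    /* 32 slots */

/* Hypothetical model of one per-mm XOL area. */
struct xol_area_model {
    unsigned long base;   /* user address of the XOL page */
    uint32_t used;        /* bit i set means slot i is taken */
};

/* Acquire the first free slot; returns its index, or -1 if all are busy. */
static int xol_take_slot(struct xol_area_model *area)
{
    for (unsigned int i = 0; i < XOL_NR_SLOTS; i++) {
        if (!(area->used & (UINT32_C(1) << i))) {
            area->used |= UINT32_C(1) << i;
            return (int)i;
        }
    }
    return -1;
}

/* Address at which the instruction copy for a slot would be stored. */
static unsigned long xol_slot_addr(const struct xol_area_model *area, int slot)
{
    return area->base + (unsigned long)slot * XOL_SLOT_SIZE;
}

/* Release a slot once the single-step has completed. */
static void xol_free_slot(struct xol_area_model *area, int slot)
{
    if (slot >= 0 && (unsigned int)slot < XOL_NR_SLOTS)
        area->used &= ~(UINT32_C(1) << slot);
}
```

In this model a caller that gets -1 would have to wait for another thread to release a slot; how contention is actually handled is not stated in the commit message.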
    • sched: Fix incorrect usage of for_each_cpu_mask() in select_fallback_rq() · e3831edd
      Srivatsa S. Bhat committed
      The function for_each_cpu_mask() expects a *pointer* to struct
      cpumask as its second argument, whereas select_fallback_rq()
      passes the value itself.
      
      Moreover, for_each_cpu_mask() has been marked as obsolete
      in include/linux/cpumask.h. So move to the more appropriate
      for_each_cpu() variant.
      Reported-by: Sasha Levin <levinsasha928@gmail.com>
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Liu Chuansheng <chuansheng.liu@intel.com>
      Cc: vapier@gentoo.org
      Cc: rusty@rustcorp.com.au
      Link: http://lkml.kernel.org/r/4F75BED4.9050005@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e3831edd
    • genirq: Adjust irq thread affinity on IRQ_SET_MASK_OK_NOCOPY return value · f5cb92ac
      Jiang Liu committed
      irq_move_masked_irq() checks the return code of
      chip->irq_set_affinity() only for 0, but IRQ_SET_MASK_OK_NOCOPY is
      also a valid return code, which is there to avoid a redundant copy of
      the cpumask. But in the case of IRQ_SET_MASK_OK_NOCOPY we not only avoid
      the redundant copy, we also fail to adjust the thread affinity of a
      possibly threaded interrupt handler.
      
      Handle IRQ_SET_MASK_OK (==0) and IRQ_SET_MASK_OK_NOCOPY (==1) return
      values correctly by checking the valid return values separately.
      Signed-off-by: Jiang Liu <jiang.liu@huawei.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/1333120296-13563-2-git-send-email-jiang.liu@huawei.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f5cb92ac
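The two-valued success handling described above can be sketched as a switch over the return code. The enum values 0 and 1 come from the commit message; the `struct irq_model` fields and `apply_affinity_result()` are hypothetical names used only for this userspace illustration.

```c
#include <assert.h>

/* Return codes as given in the commit message. */
enum { SET_MASK_OK = 0, SET_MASK_OK_NOCOPY = 1 };

/* Hypothetical model of the per-interrupt state. */
struct irq_model {
    unsigned long affinity;        /* mask cached by the core code */
    int thread_affinity_pending;   /* handler thread must re-pin itself */
};

/* Process the result of the chip's set-affinity callback. */
static int apply_affinity_result(struct irq_model *irq, int ret,
                                 unsigned long new_mask)
{
    switch (ret) {
    case SET_MASK_OK:
        irq->affinity = new_mask;  /* core code copies the mask */
        /* fall through */
    case SET_MASK_OK_NOCOPY:
        /* Both success codes must update the thread affinity. */
        irq->thread_affinity_pending = 1;
        return 0;
    default:
        return ret;                /* genuine failure: change nothing */
    }
}
```

Testing only for 0, as the old code did, would skip the thread affinity update in the `SET_MASK_OK_NOCOPY` case even though the hardware affinity had changed.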
  7. 30 March 2012, 4 commits
    • cgroup: cgroup_attach_task() could return -errno after success · 8f121918
      Tejun Heo committed
      61d1d219 "cgroup: remove extra calls to find_existing_css_set" made
      cgroup_task_migrate() return void.  An unfortunate side effect was
      that cgroup_attach_task() was depending on that function's return
      value to clear its @retval on the success path.  On cgroup mounts
      without any subsystem with ->can_attach() callback,
      cgroup_attach_task() ended up returning @retval without initializing
      it on success.
      
      For some reason, gcc failed to warn about it and it didn't cause
      cgroup_attach_task() to return non-zero value in many cases, probably
      due to difference in register allocation.  When the problem
      materializes, systemd fails to populate /systemd cgroup mount and
      fails to boot.
      
      Fix it by initializing @retval to zero on declaration.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Jiri Kosina <jkosina@suse.cz>
      LKML-Reference: <alpine.LNX.2.00.1203282354440.25526@pobox.suse.cz>
      Reviewed-by: Mandeep Singh Baines <msb@chromium.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
      8f121918
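The failure mode and the fix can be illustrated with a small userspace model. `attach_task_model()` and the callback table are hypothetical; the point is only that `retval` needs a defined value when no callback ever assigns it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a subsystem's ->can_attach() callback. */
typedef int (*can_attach_fn)(void);

static int allow(void) { return 0; }
static int deny(void)  { return -1; }

/* Model of the attach path: run every callback, then "migrate". */
static int attach_task_model(const can_attach_fn *checks, size_t n)
{
    int retval = 0;   /* the fix: defined even if the loop assigns nothing */

    for (size_t i = 0; i < n; i++) {
        if (!checks[i])
            continue;             /* subsystem without a callback */
        retval = checks[i]();
        if (retval)
            return retval;        /* a subsystem refused the attach */
    }

    /* The migration step returns void, so it cannot reset retval here. */
    return retval;
}
```

Without the initializer, a call with no callbacks would return whatever happened to be in the variable, which matches the intermittent behaviour the commit message describes.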
    • kgdb,debug_core: pass the breakpoint struct instead of address and memory · 98b54aa1
      Jason Wessel committed
      There is extra state information that needs to be exposed in the
      kgdb_bpt structure for tracking how a breakpoint was installed.  The
      debug_core only uses probe_kernel_write() to install
      breakpoints, but this is not enough for all the archs.  Some archs, such
      as x86, need to use text_poke() in order to install a breakpoint into a
      read only page.
      
      Passing the kgdb_bpt structure to kgdb_arch_set_breakpoint() and
      kgdb_arch_remove_breakpoint() allows other archs to set the type
      variable which indicates how the breakpoint was installed.
      
      Cc: stable@vger.kernel.org # >= 2.6.36
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      98b54aa1
    • kdb: Fix smatch warning on dbg_io_ops->is_console · 78724b8e
      Jason Wessel committed
      The Smatch tool warned that the change from commit b8adde8d
      (kdb: Avoid using dbg_io_ops until it is initialized) should
      add another null check later in the kdb_printf().
      
      It is worth noting that the second use of dbg_io_ops->is_console
      is protected by the KDB_PAGER state variable which would only
      get set when kdb is fully active and initialized.  If we
      ever encounter changes or defects in the KDB_PAGER state
      we do not want to crash the kernel in a kdb_printf/printk.
      
      CC: Tim Bird <tim.bird@am.sony.com>
      Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      78724b8e
    • irqdomain: Remove powerpc dependency from debugfs file · 092b2fb0
      Grant Likely committed
      The debugfs code is really generic for all platforms.  This patch removes the
      powerpc-specific directory reference and makes it available to all
      architectures.
      Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
      092b2fb0
  8. 29 March 2012, 24 commits
    • padata: Fix cpu hotplug · 96120905
      Steffen Klassert committed
      We don't remove the cpu that went offline from our cpumasks
      on cpu hotplug. This got lost somewhere along the line, so
      restore it. This fixes a hang of the padata instance on cpu
      hotplug.
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      96120905
    • padata: Use the online cpumask as the default · 13614e0f
      Steffen Klassert committed
      We use the active cpumask to determine the superset of cpus
      to use for parallelization. However, the active cpumask is
      for internal usage of the scheduler and therefore not the
      appropriate cpumask for these purposes. So use the online
      cpumask instead.
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      13614e0f
    • padata: Add a reference to the api documentation · 107f8bda
      Steffen Klassert committed
      Add a reference to the padata api documentation at Documentation/padata.txt.
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      107f8bda
    • futex: Mark get_robust_list as deprecated · ec0c4274
      Kees Cook committed
      Notify get_robust_list users that the syscall is going away.
      Suggested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: spender@grsecurity.net
      Link: http://lkml.kernel.org/r/20120323190855.GA27213@www.outflux.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      ec0c4274
    • futex: Do not leak robust list to unprivileged process · bdbb776f
      Kees Cook committed
      It was possible to extract the robust list head address from a setuid
      process if it had used set_robust_list(), allowing an ASLR info leak. This
      changes the permission checks to be the same as those used for similar
      info that comes out of /proc.
      
      Running a setuid program that uses robust futexes would have had:
        cred->euid != pcred->euid
        cred->euid == pcred->uid
      so the old permissions check would allow it. I'm not aware of any setuid
      programs that use robust futexes, so this is just a preventative measure.
      
      (This patch is based on changes from grsecurity.)
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Serge E. Hallyn <serge.hallyn@canonical.com>
      Cc: kernel-hardening@lists.openwall.com
      Cc: spender@grsecurity.net
      Link: http://lkml.kernel.org/r/20120319231253.GA20893@www.outflux.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      bdbb776f
    • genirq: Respect NUMA node affinity in setup_irq_irq affinity() · 241fc640
      Prarit Bhargava committed
      We respect node affinity of devices already in the irq descriptor
      allocation, but we ignore it for the initial interrupt affinity
      setup, so the interrupt might be routed to a different node.
      
      Restrict the default affinity mask to the node on which the irq
      descriptor is allocated.
      
      [ tglx: Massaged changelog ]
      Signed-off-by: Prarit Bhargava <prarit@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/1332788538-17425-1-git-send-email-prarit@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      241fc640
    • genirq: Get rid of unneeded force parameter in irq_finalize_oneshot() · f3f79e38
      Alexander Gordeev committed
      The only place irq_finalize_oneshot() is called with force parameter set
      is the threaded handler error exit path. But IRQTF_RUNTHREAD is dropped
      at this point and irq_wake_thread() is not going to set it again,
      since PF_EXITING is set for this thread already. So irq_finalize_oneshot()
      will drop the thread's bit in threads_oneshot anyway and hence the force
      parameter is superfluous.
      Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
      Link: http://lkml.kernel.org/r/20120321162234.GP24806@dhcp-26-207.brq.redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      f3f79e38
    • genirq: Minor readability improvement in irq_wake_thread() · 69592db2
      Alexander Gordeev committed
      exit_irq_thread() clears the IRQTF_RUNTHREAD flag and then drops the thread's
      bit in desc->threads_oneshot. The bit must not be set again in between, and it
      is not, since irq_wake_thread() sees the PF_EXITING flag first and returns.
      
      Because of this, the order of checking the PF_EXITING and IRQTF_RUNTHREAD flags
      in irq_wake_thread() is important. This change just makes it more visible in
      the source code.
      Signed-off-by: Alexander Gordeev <agordeev@redhat.com>
      Link: http://lkml.kernel.org/r/20120321162212.GO24806@dhcp-26-207.brq.redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      69592db2
    • sched: Fix __schedule_bug() output when called from an interrupt · 6135fc1e
      Stephen Boyd committed
      If schedule is called from an interrupt handler __schedule_bug()
      will call show_regs() with the registers saved during the
      interrupt handling done in do_IRQ(). This means we'll see the
      registers and the backtrace for the process that was interrupted
      and not the full backtrace explaining who called schedule().
      
      This is due to 838225b4 ("sched: use show_regs() to improve
      __schedule_bug() output", 2007-10-24) which improperly assumed
      that get_irq_regs() would return the registers for the current
      stack because it is being called from within an interrupt
      handler. Simply remove the show_regs() code so that we dump a
      backtrace for the interrupt handler that called schedule().
      
      [ I ran across this when I was presented with a scheduling while
        atomic log with a stacktrace pointing at spin_unlock_irqrestore().
        It made no sense and I had to guess what interrupt handler could
        be called and poke around for someone calling schedule() in an
        interrupt handler. A simple test of putting an msleep() in
        an interrupt handler works better with this patch because you
        can actually see the msleep() call in the backtrace. ]
      Also-reported-by: Chris Metcalf <cmetcalf@tilera.com>
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Cc: Satyam Sharma <satyam@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1332979847-27102-1-git-send-email-sboyd@codeaurora.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6135fc1e
    • documentation: remove references to cpu_*_map. · 5f054e31
      Rusty Russell committed
      This has been obsolescent for a while, fix documentation and
      misc comments.
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      5f054e31
    • pidns: add reboot_pid_ns() to handle the reboot syscall · cf3f8921
      Daniel Lezcano committed
      In the case of a child pid namespace, rebooting the system does not really
      make sense.  When the pid namespace is used in conjunction with the other
      namespaces in order to create a linux container, the reboot syscall leads
      to some problems.
      
      A container can reboot the host.  That can be fixed by dropping the
      sys_reboot capability, but then we are unable to correctly poweroff/
      halt/reboot a container, and the container stays stuck at shutdown time
      with the container's init process waiting indefinitely.
      
      After several attempts, no solution from userspace was found to reliably
      handle the shutdown from a container.
      
      This patch proposes to make the init process of the child pid namespace
      exit with a signal status set to: SIGINT if the child pid namespace
      called "halt/poweroff" and SIGHUP if the child pid namespace called
      "reboot".  When the reboot syscall is called and we are not in the initial
      pid namespace, we kill the pid namespace for "HALT", "POWEROFF",
      "RESTART", and "RESTART2".  Otherwise we return EINVAL.
      
      Returning EINVAL is also an easy way to check if this feature is supported
      by the kernel when invoking another 'reboot' option like CAD.
      
      This way the parent process of the child pid namespace knows if it
      rebooted or not and can take the right decision.
      
      Test case:
      ==========
      
      #define _GNU_SOURCE     /* needed for clone() and CLONE_NEWPID */
      #include <alloca.h>
      #include <stdio.h>
      #include <sched.h>
      #include <unistd.h>
      #include <signal.h>
      #include <sys/reboot.h>
      #include <sys/types.h>
      #include <sys/wait.h>
      
      #include <linux/reboot.h>
      
      /* Runs as init of the new pid namespace and issues the reboot command. */
      static int do_reboot(void *arg)
      {
              int *cmd = arg;
      
              if (reboot(*cmd))
                      printf("failed to reboot(%d): %m\n", *cmd);
              return 0;
      }
      
      int test_reboot(int cmd, int sig)
      {
              long stack_size = 4096;
              void *stack = alloca(stack_size) + stack_size;
              int status;
              pid_t ret;
      
              ret = clone(do_reboot, stack, CLONE_NEWPID | SIGCHLD, &cmd);
              if (ret < 0) {
                      printf("failed to clone: %m\n");
                      return -1;
              }
      
              if (wait(&status) < 0) {
                      printf("unexpected wait error: %m\n");
                      return -1;
              }
      
              if (!WIFSIGNALED(status)) {
                      printf("child process exited but was not signaled\n");
                      return -1;
              }
      
              if (WTERMSIG(status) != sig) {
                      printf("signal termination is not the one expected\n");
                      return -1;
              }
      
              return 0;
      }
      
      int main(int argc, char *argv[])
      {
              int status;
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_RESTART2, SIGHUP);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_RESTART2) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_HALT, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_HALT) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_POWER_OFF, SIGINT);
              if (status < 0)
                      return 1;
              printf("reboot(LINUX_REBOOT_CMD_POWER_OFF) succeed\n");
      
              status = test_reboot(LINUX_REBOOT_CMD_CAD_ON, -1);
              if (status >= 0) {
                      printf("reboot(LINUX_REBOOT_CMD_CAD_ON) should have failed\n");
                      return 1;
              }
              printf("reboot(LINUX_REBOOT_CMD_CAD_ON) has failed as expected\n");
      
              return 0;
      }
      
      [akpm@linux-foundation.org: tweak and add comments]
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr>
      Acked-by: Serge Hallyn <serge.hallyn@canonical.com>
      Tested-by: Serge Hallyn <serge.hallyn@canonical.com>
      Reviewed-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cf3f8921
    • sysctl: use bitmap library functions · 5a04cca6
      Akinobu Mita committed
      Use bitmap_set() instead of using set_bit() for each bit.  This conversion
      is valid because the bitmap is private in the function call and atomic
      bitops were unnecessary.
      
      This also includes a minor change:
      - Use bitmap_copy() for shorter typing
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5a04cca6
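The difference between setting bits one at a time and setting a whole range can be sketched in userspace C. The helpers below are hypothetical re-implementations written for this example, not the kernel's bitmap library; they only illustrate why a non-atomic range operation is enough when the bitmap is private to the caller.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Per-bit variant: one read-modify-write for every bit in the range. */
static void set_bits_one_by_one(unsigned long *map, size_t start, size_t len)
{
    for (size_t bit = start; bit < start + len; bit++)
        map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Range variant: builds a mask per word and ORs it in once. */
static void set_bit_range(unsigned long *map, size_t start, size_t len)
{
    size_t end = start + len;

    while (start < end) {
        size_t off = start % BITS_PER_WORD;
        size_t chunk = BITS_PER_WORD - off;   /* bits left in this word */
        unsigned long mask;

        if (chunk > end - start)
            chunk = end - start;
        /* A full word needs special care: shifting by the width is undefined. */
        mask = (chunk == BITS_PER_WORD) ? ~0UL : ((1UL << chunk) - 1) << off;
        map[start / BITS_PER_WORD] |= mask;
        start += chunk;
    }
}
```

Both helpers produce the same bitmap; the range variant simply touches each word once.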
    • kexec: add further check to crashkernel · eaa3be6a
      Zhenzhong Duan committed
      When using crashkernel=2M-256M, the kernel doesn't give any warning.  This
      is misleading sometimes.
      Signed-off-by: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Acked-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      eaa3be6a
    • kexec: crash: don't save swapper_pg_dir for !CONFIG_MMU configurations · d034cfab
      Will Deacon committed
      nommu platforms don't have very interesting swapper_pg_dir pointers and
      usually just #define them to NULL, meaning that we can't include them in
      the vmcoreinfo on the kexec crash path.
      
      This patch only saves the swapper_pg_dir if we have an MMU.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Reviewed-by: Simon Horman <horms@verge.net.au>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d034cfab
    • smp: add func to IPI cpus based on parameter func · b3a7e98e
      Gilad Ben-Yossef committed
      Add the on_each_cpu_cond() function that wraps on_each_cpu_mask() and
      calculates the cpumask of cpus to IPI by calling a function supplied as a
      parameter in order to determine whether to IPI each specific cpu.
      
      The function works around allocation failure of the cpumask variable in
      CONFIG_CPUMASK_OFFSTACK=y by iterating over cpus, sending an IPI one at a
      time via smp_call_function_single().
      
      The function is useful since it allows separating the specific code that
      decides in each case whether to IPI a specific cpu for a specific request
      from the common boilerplate code of creating the mask, handling
      failures etc.
      
      [akpm@linux-foundation.org: s/gfpflags/gfp_flags/]
      [akpm@linux-foundation.org: avoid double-evaluation of `info' (per Michal), parenthesise evaluation of `cond_func']
      [akpm@linux-foundation.org: s/CPU/CPUs, use all 80 cols in comment]
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Reviewed-by: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3a7e98e
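The split between a per-CPU predicate and the common mask-building boilerplate can be sketched in userspace C. Everything below is a hypothetical model: CPUs are plain indices, the "IPI" is a direct function call, and the names `run_on_cpus_where()`, `cpu_has_pending()` and `drain_cpu()` are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS_MODEL 8   /* assumed CPU count for the model */

typedef bool (*cond_fn)(int cpu, void *info);
typedef void (*cpu_fn)(int cpu, void *info);

/*
 * Common boilerplate: ask the predicate about every CPU, collect the
 * answers in a mask, then run func only on the selected CPUs.
 * Returns how many CPUs were "interrupted".
 */
static int run_on_cpus_where(cond_fn cond, cpu_fn func, void *info)
{
    unsigned int mask = 0;
    int called = 0;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        if (cond(cpu, info))
            mask |= 1u << cpu;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++) {
        if (mask & (1u << cpu)) {
            func(cpu, info);      /* stands in for sending an IPI */
            called++;
        }
    }
    return called;
}

/* Example caller-specific policy: only CPUs with pending work matter. */
static bool cpu_has_pending(int cpu, void *info)
{
    const bool *pending = info;
    return pending[cpu];
}

static void drain_cpu(int cpu, void *info)
{
    bool *pending = info;
    pending[cpu] = false;
}
```

A caller supplies only the predicate and the work function; CPUs for which the predicate returns false are never disturbed, which is the saving the commit is after.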
    • smp: introduce a generic on_each_cpu_mask() function · 3fc498f1
      Gilad Ben-Yossef committed
      We have lots of infrastructure in place to partition multi-core systems
      such that we have a group of CPUs that are dedicated to a specific task:
      cgroups, scheduler and interrupt affinity, and the isolcpus= boot parameter.
      Still, kernel code will at times interrupt all CPUs in the system via IPIs
      for various needs.  These IPIs are useful and cannot be avoided
      altogether, but in certain cases it is possible to interrupt only specific
      CPUs that have useful work to do and not the entire system.
      
      This patch set, inspired by discussions with Peter Zijlstra and Frederic
      Weisbecker when testing the nohz task patch set, is a first stab at trying
      to explore doing this by locating the places where such global IPI calls
      are being made and turning the global IPI into an IPI for a specific group
      of CPUs.  The purpose of the patch set is to get feedback if this is the
      right way to go for dealing with this issue and indeed, if the issue is
      even worth dealing with at all.  Based on the feedback from this patch set
      I plan to offer further patches that address similar issues in other code
      paths.
      
      This patch creates an on_each_cpu_mask() and on_each_cpu_cond()
      infrastructure API (the former derived from existing arch specific
      versions in Tile and Arm) and uses them to turn several global IPI
      invocations into per-CPU-group invocations.
      
      Core kernel:
      
      on_each_cpu_mask() calls a function on processors specified by cpumask,
      which may or may not include the local processor.
      
      You must not call this function with disabled interrupts or from a
      hardware interrupt handler or from a bottom half handler.
      
      arch/arm:
      
      Note that the generic version is a little different than the Arm one:
      
      1. It has the mask as first parameter
      2. It calls the function on the calling CPU with interrupts disabled,
         but this should be OK since the function is called on the other CPUs
         with interrupts disabled anyway.
      
      arch/tile:
      
      The API is the same as the tile private one, but the generic version
      also calls the function with interrupts disabled in the UP case.
      
      This is OK since the function is called on the other CPUs
      with interrupts disabled.
      Signed-off-by: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Chris Metcalf <cmetcalf@tilera.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Sasha Levin <levinsasha928@gmail.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Avi Kivity <avi@redhat.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.org>
      Cc: Kosaki Motohiro <kosaki.motohiro@gmail.com>
      Cc: Milton Miller <miltonm@bga.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3fc498f1
    • M
      PM / QoS: add pm_qos_update_request_timeout() API · c4772d19
      MyungJoo Ham committed
      The new API, pm_qos_update_request_timeout(), adds a timeout to
      pm_qos_update_request().
      
      For example, pm_qos_update_request_timeout(req, 100, 1000) means that
      the QoS request on req with value 100 stays active for 1000 microseconds.
      After 1000 microseconds, the QoS request through req is reset. If
      pm_qos_update_request(req, x) is called during those 1000 us, the
      new request with value x overrides it, because it is another request on
      the same req handle. A new request on the same req handle always
      overrides the previous request, whether it is a conventional request or
      a timeout request.
      Signed-off-by: MyungJoo Ham <myungjoo.ham@samsung.com>
      Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com>
      Acked-by: Mark Gross <markgross@thegnar.org>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      c4772d19
    • R
      PM / Sleep: Mitigate race between the freezer and request_firmware() · 247bc037
      Rafael J. Wysocki committed
      There is a race condition between the freezer and request_firmware()
      such that if request_firmware() is run on one CPU and
      freeze_processes() is run on another CPU and usermodehelper_disable()
      called by it manages to grab umhelper_sem for writing before
      usermodehelper_read_trylock() called from request_firmware()
      acquires it for reading, the request_firmware() will fail and
      trigger a WARN_ON() complaining that it was called at a wrong time.
      However, in fact, it wasn't called at a wrong time and
      freeze_processes() simply happened to be executed simultaneously.
      
      To avoid this race, at least in some cases, modify
      usermodehelper_read_trylock() so that it doesn't fail if the
      freezing of tasks has just started and hasn't been completed yet.
      Instead, during the freezing of tasks, it will try to freeze the
      task that has called it so that it can wait until user space is
      thawed without triggering the scary warning.
      
      For this purpose, change usermodehelper_disabled so that it can
      take three different values, UMH_ENABLED (0), UMH_FREEZING and
      UMH_DISABLED.  The first one means that usermode helpers are
      enabled, the last one means "hard disable" (i.e. the system is not
      ready for usermode helpers to be used) and the second one
      is reserved for the freezer.  Namely, when freeze_processes() is
      started, it sets usermodehelper_disabled to UMH_FREEZING which
      tells usermodehelper_read_trylock() that it shouldn't fail just
      yet and should call try_to_freeze() if woken up and cannot
      return immediately.  This way all freezable tasks that happen
      to call request_firmware() right before freeze_processes() is
      started and lose the race for umhelper_sem with it will be
      frozen and will sleep until thaw_processes() unsets
      usermodehelper_disabled.  [For the non-freezable callers of
      request_firmware() the race for umhelper_sem against
      freeze_processes() is unfortunately unavoidable.]
      Reported-by: Stephen Boyd <sboyd@codeaurora.org>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      247bc037
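The three-state scheme from commit 247bc037 above can be sketched as a small decision function in userspace C. This is a simplification and not the kernel code: there is no semaphore, no wait queue, and no real freezing. The state names UMH_ENABLED, UMH_FREEZING and UMH_DISABLED come from the commit message (only UMH_ENABLED is stated to be 0; the other values here are whatever the enum assigns). `SIM_UMH_RETRY` and the `sim_` functions are hypothetical.

```c
#include <assert.h>
#include <errno.h>

/* State names from the commit message; only UMH_ENABLED == 0 is given there. */
enum sim_umh_state { UMH_ENABLED = 0, UMH_FREEZING, UMH_DISABLED };

/* Hypothetical result: caller should sleep, try to freeze, then re-check. */
#define SIM_UMH_RETRY 1

/* One pass of the check a read-trylock would make for a given state. */
static int sim_umh_read_trylock(enum sim_umh_state state)
{
    switch (state) {
    case UMH_ENABLED:
        return 0;              /* helpers usable: lock would be held */
    case UMH_FREEZING:
        return SIM_UMH_RETRY;  /* do not fail yet, wait for the thaw */
    case UMH_DISABLED:
        return -EAGAIN;        /* hard disable: fail the request */
    }
    assert(!"unknown usermodehelper state");
    return -EINVAL;
}

/*
 * Replay a sequence of observed states, retrying while the freezer is
 * still running. Returns 0, -EAGAIN, or SIM_UMH_RETRY if the timeline
 * ends before the state settles.
 */
static int sim_umh_lock_over_time(const enum sim_umh_state *timeline, int n)
{
    int i, ret = SIM_UMH_RETRY;

    for (i = 0; i < n && ret == SIM_UMH_RETRY; i++)
        ret = sim_umh_read_trylock(timeline[i]);
    return ret;
}
```

A timeline of FREEZING, FREEZING, ENABLED ends in success once user space is thawed, while FREEZING followed by DISABLED ends in -EAGAIN. That mirrors the intent of the patch: a caller that merely raced with the freezer waits instead of triggering the warning.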
    • R
      PM / Sleep: Move disabling of usermode helpers to the freezer · 1e73203c
      Rafael J. Wysocki committed
      The core suspend/hibernation code calls usermodehelper_disable() to
      avoid race conditions between the freezer and the starting of
      usermode helpers and each code path has to do that on its own.
      However, it is always called right before freeze_processes()
      and usermodehelper_enable() is always called right after
      thaw_processes().  For this reason, to avoid code duplication and
      to make the connection between usermodehelper_disable() and the
      freezer more visible, make freeze_processes() call it and remove the
      direct usermodehelper_disable() and usermodehelper_enable() calls
      from all suspend/hibernation code paths.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      1e73203c
    • R
      PM / Hibernate: Disable usermode helpers right before freezing tasks · 7b5179ac
      Rafael J. Wysocki committed
      There is no reason to call usermodehelper_disable() before creating
      memory bitmaps in hibernate() and software_resume(), so call it right
      before freeze_processes(), in accordance with the other suspend and
      hibernation code.  Consequently, call usermodehelper_enable() right
      after the thawing of tasks rather than after freeing the memory
      bitmaps.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      7b5179ac
    • R
      firmware_class: Do not warn that system is not ready from async loads · 9b78c1da
      Rafael J. Wysocki committed
      If firmware is requested asynchronously, by calling
      request_firmware_nowait(), there is no reason to fail the request
      (and warn the user) when the system is (presumably temporarily)
      unready to handle it (because user space is not available yet or
      frozen).  For this reason, introduce an alternative routine for
      read-locking umhelper_sem, usermodehelper_read_lock_wait(), that
      will wait for usermodehelper_disabled to be unset (possibly with
      a timeout) and make request_firmware_work_func() use it instead of
      usermodehelper_read_trylock().
      
      Accordingly, modify request_firmware() so that it uses
      usermodehelper_read_trylock() to acquire umhelper_sem and remove
      the code related to that lock from _request_firmware().
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      9b78c1da
    • R
      firmware_class: Rework usermodehelper check · fe2e39d8
      Rafael J. Wysocki committed
      Instead of two functions, read_lock_usermodehelper() and
      usermodehelper_is_disabled(), used in combination, introduce
      usermodehelper_read_trylock() that will only return with umhelper_sem
      held if usermodehelper_disabled is unset (and will return -EAGAIN
      otherwise) and make _request_firmware() use it.
      
      Rename read_unlock_usermodehelper() to
      usermodehelper_read_unlock() to follow the naming convention of the
      new function.
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: stable@vger.kernel.org
      fe2e39d8
    • D
      Remove all #inclusions of asm/system.h · 9ffc93f2
      David Howells committed
      Remove all #inclusions of asm/system.h preparatory to splitting and killing
      it.  Performed with the following command:
      
      perl -p -i -e 's!^#\s*include\s*<asm/system[.]h>.*\n!!' `grep -Irl '^#\s*include\s*<asm/system[.]h>' *`
      Signed-off-by: David Howells <dhowells@redhat.com>
      9ffc93f2
    • D
      Add #includes needed to permit the removal of asm/system.h · 96f951ed
      David Howells committed
      asm/system.h is a cause of circular dependency problems because it contains
      commonly used primitive stuff like barrier definitions and uncommonly used
      stuff like switch_to() that might require MMU definitions.
      
      asm/system.h has been disintegrated by this point on all arches into the
      following common segments:
      
       (1) asm/barrier.h
      
           Moved memory barrier definitions here.
      
       (2) asm/cmpxchg.h
      
           Moved xchg() and cmpxchg() here.  #included in asm/atomic.h.
      
       (3) asm/bug.h
      
           Moved die() and similar here.
      
       (4) asm/exec.h
      
           Moved arch_align_stack() here.
      
       (5) asm/elf.h
      
           Moved AT_VECTOR_SIZE_ARCH here.
      
       (6) asm/switch_to.h
      
           Moved switch_to() here.
      Signed-off-by: David Howells <dhowells@redhat.com>
      96f951ed
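The header split listed in commit 96f951ed above can be summarised as a lookup table. The sketch below is a userspace C illustration only, not a kernel tool. The mapping follows the six segments named in the commit message; `mb` is used here as one example of a memory barrier definition, and the `sim_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One entry per primitive that used to come from asm/system.h. */
struct sim_header_map {
    const char *symbol;  /* primitive a source file uses */
    const char *header;  /* header that provides it after the split */
};

/* Mapping taken from the six segments listed in the commit message. */
static const struct sim_header_map sim_system_h_split[] = {
    { "mb",                  "asm/barrier.h"   },  /* example barrier */
    { "xchg",                "asm/cmpxchg.h"   },
    { "cmpxchg",             "asm/cmpxchg.h"   },
    { "die",                 "asm/bug.h"       },
    { "arch_align_stack",    "asm/exec.h"      },
    { "AT_VECTOR_SIZE_ARCH", "asm/elf.h"       },
    { "switch_to",           "asm/switch_to.h" },
};

/* Return the replacement header for symbol, or NULL if it is not listed. */
static const char *sim_header_for(const char *symbol)
{
    size_t i;
    size_t n = sizeof(sim_system_h_split) / sizeof(sim_system_h_split[0]);

    assert(symbol != NULL);
    for (i = 0; i < n; i++) {
        if (strcmp(sim_system_h_split[i].symbol, symbol) == 0)
            return sim_system_h_split[i].header;
    }
    return NULL;
}
```

A file that only needs switch_to() would, under this mapping, include asm/switch_to.h instead of the removed asm/system.h, which avoids pulling in the unrelated definitions that caused the circular dependencies.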