1. 24 Aug, 2009 1 commit
  2. 19 Aug, 2009 2 commits
    • mm: revert "oom: move oom_adj value" · 0753ba01
      KOSAKI Motohiro committed
      Commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to
      the mm_struct.  It was a very good first step toward sanitizing the OOM
      logic.
      
      However, Paul Menage reported that the commit causes a regression in his
      job scheduler: the current OOM logic can kill an OOM_DISABLE process.
      
      Why? His program contains code similar to the following.
      
      	...
      	set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
      	...
      	if (vfork() == 0) {
      		set_oom_adj(0); /* Invoked child can be killed */
      		execve("foo-bar-cmd");
      	}
      	....
      
      The vfork() parent and child share the same mm_struct, so the
      set_oom_adj(0) above changes oom_adj not only for the vfork() child but
      also for the vfork() parent.  The vfork() parent (the job scheduler)
      therefore lost its OOM immunity and was killed.
      
      The fork-setting-exec idiom is used very frequently in userland programs.
      We must not break this assumption.
      
      Therefore, this patch reverts commit 2ff05b2b and the related commits.
      
      Reverted commit list
      ---------------------
      - commit 2ff05b2b (oom: move oom_adj value from task_struct to mm_struct)
      - commit 4d8b9135 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
      - commit 81236810 (oom: only oom kill exiting tasks with attached memory)
      - commit 933b787b (mm: copy over oom_adj value at fork time)
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Menage <menage@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Security/SELinux: remove duplicated #include · b08dc3eb
      Huang Weiyi committed
      Remove duplicated #include('s) in
        kernel/sysctl.c
      Signed-off-by: Huang Weiyi <weiyi.huang@gmail.com>
      Acked-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
  3. 18 Aug, 2009 2 commits
    • genirq: Wake up irq thread after action has been installed · 69ab8494
      Thomas Gleixner committed
      The wake_up_process() of the new irq thread in __setup_irq() is too
      early, as the irqaction is not yet fully initialized; in particular,
      action->irq is not yet set. The interrupt thread might dereference the
      wrong irq descriptor.
      
      Move the wakeup after the action is installed and action->irq has been
      set.
      Reported-by: Michael Buesch <mb@bu3sch.de>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Michael Buesch <mb@bu3sch.de>
    • perf_counter: Fix the PARISC build · f738eb1b
      Ingo Molnar committed
      PARISC does not build:
      
      /home/mingo/tip/kernel/perf_counter.c: In function 'perf_counter_index':
      /home/mingo/tip/kernel/perf_counter.c:2016: error: 'PERF_COUNTER_INDEX_OFFSET' undeclared (first use in this function)
      /home/mingo/tip/kernel/perf_counter.c:2016: error: (Each undeclared identifier is reported only once
      /home/mingo/tip/kernel/perf_counter.c:2016: error: for each function it appears in.)
      
      This is because PERF_COUNTER_INDEX_OFFSET is not defined.
      
      Now, we could define it in the architecture, but let's also provide
      a core default of 0 (which happens to be what all but one
      architecture uses at the moment).
      
      Architectures that need a different index offset should set this
      value in their asm/perf_counter.h files.
      
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Cc: Helge Deller <deller@gmx.de>
      Cc: linux-parisc@vger.kernel.org
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 17 Aug, 2009 2 commits
    • perf_counter: Check task on counter read IPI · e1ac3614
      Paul Mackerras committed
      In general, code in perf_counter.c that is called through an
      IPI checks, for per-task counters, that the counter's task is
      still the current task.  This is to handle the race condition
      where the cpu switches from the task we want to another task in
      the interval between sending the IPI and the IPI arriving and
      being handled on the target CPU.
      
      For some reason, __perf_counter_read is missing this check, yet
      there is no reason why the race condition can't occur.  This
      adds a check that the current task is the one we want.  If it
      isn't, we just return.  In that case the counter->count value
      should be up to date, since it will have been updated when the
      counter was scheduled out, which must have happened since the
      IPI was sent.
      
      I don't have an example of an actual failure due to this race,
      but it seems obvious that it could occur and we need to guard
      against it.
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <19076.63614.277861.368125@drongo.ozlabs.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Security/SELinux: seperate lsm specific mmap_min_addr · 788084ab
      Eric Paris committed
      Currently SELinux enforcement of controls on the ability to map low memory
      is determined by the mmap_min_addr tunable.  This patch causes SELinux to
      ignore the tunable and instead use a separate Kconfig option specific to how
      much space the LSM should protect.
      
      The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
      permissions will always protect the amount of low memory designated by
      CONFIG_LSM_MMAP_MIN_ADDR.
      
      This allows users who need to disable the mmap_min_addr controls (usual reason
      being they run WINE as a non-root user) to do so and still have SELinux
      controls preventing confined domains (like a web server) from being able to
      map some area of low memory.
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: James Morris <jmorris@namei.org>
  5. 14 Aug, 2009 2 commits
    • security: introducing security_request_module · 9188499c
      Eric Paris committed
      Calling request_module() will trigger a userspace upcall which will load a
      new module into the kernel.  This can be a dangerous event if the process
      able to trigger request_module() is able to control either the modprobe
      binary or the module binary.  This patch adds a new security hook to
      request_module() which can be used by an LSM to control a process's ability
      to call request_module().
      Signed-off-by: Eric Paris <eparis@redhat.com>
      Acked-by: Serge Hallyn <serue@us.ibm.com>
      Signed-off-by: James Morris <jmorris@namei.org>
    • genirq: prevent wakeup of freed irq thread · 2d860ad7
      Linus Torvalds committed
      free_irq() can remove an irqaction while the corresponding interrupt
      is in progress, but free_irq() sets action->thread to NULL
      unconditionally, which might lead to a NULL pointer dereference in
      handle_IRQ_event() when the hard interrupt context tries to wake up
      the handler thread.
      
      Prevent this by moving the thread stop after synchronize_irq(). No
      need to set action->thread to NULL either as action is going to be
      freed anyway.
      
      This fixes a boot crash reported against preempt-rt which uses the
      mainline irq threads code to implement full irq threading.
      
      [ tglx: removed local irqthread variable ]
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  6. 13 Aug, 2009 6 commits
    • perf_counter: Report the cloning task as parent on perf_counter_fork() · 94d5d1b2
      Peter Zijlstra committed
      A bug in (9f498cc5: perf_counter: Full task tracing) makes
      profiling multi-threaded apps go belly up.
      
      [ output as: (PID:TID):(PPID:PTID) ]
      
       # ./perf report -D | grep FORK
      0x4b0 [0x18]: PERF_EVENT_FORK: (3237:3237):(3236:3236)
      0xa10 [0x18]: PERF_EVENT_FORK: (3237:3238):(3236:3236)
      0xa70 [0x18]: PERF_EVENT_FORK: (3237:3239):(3236:3236)
      0xad0 [0x18]: PERF_EVENT_FORK: (3237:3240):(3236:3236)
      0xb18 [0x18]: PERF_EVENT_FORK: (3237:3241):(3236:3236)
      
      Shows us that the test (27d028de perf report: Update for the new
      FORK/EXIT events) in builtin-report.c:
      
              /*
               * A thread clone will have the same PID for both
               * parent and child.
               */
              if (thread == parent)
                      return 0;
      
      Will clearly fail.
      
      The problem is that perf_counter_fork() reports the actual
      parent, instead of the cloning thread.
      
      Fixing that (with the below patch), yields:
      
       # ./perf report -D | grep FORK
      0x4c8 [0x18]: PERF_EVENT_FORK: (1590:1590):(1589:1589)
      0xbd8 [0x18]: PERF_EVENT_FORK: (1590:1591):(1590:1590)
      0xc80 [0x18]: PERF_EVENT_FORK: (1590:1592):(1590:1590)
      0x3338 [0x18]: PERF_EVENT_FORK: (1590:1593):(1590:1590)
      0x66b0 [0x18]: PERF_EVENT_FORK: (1590:1594):(1590:1590)
      
      Which both makes more sense and doesn't confuse perf report
      anymore.
      Reported-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: paulus@samba.org
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      LKML-Reference: <1250172882.5241.62.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Fix an ipi-deadlock · 970892a9
      Peter Zijlstra committed
      perf_pending_counter() is called from IRQ context and will call
      perf_counter_disable(), however perf_counter_disable() uses
      smp_call_function_single() which doesn't fancy being used with
      IRQs disabled due to IPI deadlocks.
      
      Fix this by making it use the local __perf_counter_disable()
      call and teaching the counter_sched_out() code about pending
      disables as well.
      
      This should cover the case where a counter migrates before the
      pending queue gets processed.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey J Ashford <cjashfor@us.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: stephane eranian <eranian@googlemail.com>
      LKML-Reference: <20090813103655.244097721@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf: Rework/fix the whole read vs group stuff · 3dab77fb
      Peter Zijlstra committed
      Replace PERF_SAMPLE_GROUP with PERF_SAMPLE_READ and introduce
      PERF_FORMAT_GROUP to deal with group reads in a more generic
      way.
      
      This allows you to get group reads out of read() as well.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey J Ashford <cjashfor@us.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: stephane eranian <eranian@googlemail.com>
      LKML-Reference: <20090813103655.117411814@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Fix swcounter context invariance · bcfc2602
      Peter Zijlstra committed
      perf_swcounter_is_counting() uses a lock, which means we cannot
      use swcounters from NMI or when holding that particular lock.
      This is unintended.
      
      This patch removes the lock. That opens up a race window, but one no
      worse than what the swcounters already experience due to RCU
      traversal of the context in perf_swcounter_ctx_event().
      
      This also fixes the hard lockups while opening a lockdep
      tracepoint counter.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: stephane eranian <eranian@googlemail.com>
      Cc: Corey J Ashford <cjashfor@us.ibm.com>
      LKML-Reference: <1250149915.10001.66.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Provide hw_perf_counter_setup_online() APIs · 28402971
      Ingo Molnar committed
      Provide weak aliases for hw_perf_counter_setup_online(). This is
      used by the BTS patches (for v2.6.32), but it interacts with
      fixes so propagate this upstream. (it has no effect as of yet)
      
      Also export perf_counter_output() to architecture code.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • Remove double removal of blktrace directory · 39cbb602
      Alan D. Brunelle committed
      commit fd51d251
      Author: Stefan Raspl <raspl@linux.vnet.ibm.com>
      Date:   Tue May 19 09:59:08 2009 +0200
      
          blktrace: remove debugfs entries on bad path
      
      added an explicit invocation of debugfs_remove for bt->dir, but in
      blk_remove_buf_file_callback we are also getting the directory removed. On
      occasion I am seeing memory corruption that I have bisected down to
      this commit. [The testing involves a (long) series of I/O benchmarks
      with blktrace invoked around the actual runs.] I believe that this
      committed patch is correct, but the problem actually lies in the code
      in blk_remove_buf_file_callback.
      
      With this patch I am able to consistently get complete runs whereas
      previously I could not get a single run to complete.
      
      The first part of the patch simply moves the debugfs_remove below the
      relay_close: the relay_close call will remove files under bt->dir, and
      so we should not remove the directory until all the files we created
      have been removed. (Note: This is not sufficient to fix the problem -
      the file system code has ref counts on the directory, so our invocation
      does not cause the directory to actually be removed. Nonetheless, we
      should not rely upon that feature.)
      Signed-off-by: Alan D. Brunelle <alan.brunelle@hp.com>
      Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
  7. 11 Aug, 2009 1 commit
    • futex: Fix handling of bad requeue syscall pairing · 392741e0
      Darren Hart committed
      If futex_requeue(requeue_pi=1) finds a futex_q that was created by a call
      other than futex_wait_requeue_pi(), the q.rt_waiter may be NULL.  If so,
      this will result in an oops from the following call graph:
      
      futex_requeue()
        rt_mutex_start_proxy_lock()
          task_blocks_on_rt_mutex()
            waiter->task dereference
              OOPS
      
      We currently WARN_ON() if this is detected, which is clearly inadequate.
      If we detect a mispairing in futex_requeue(), bail out, sending -EINVAL to
      user-space.
      
      V2: Fix parenthesis warnings.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      LKML-Reference: <4A7CA8C0.7010809@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 10 Aug, 2009 5 commits
    • futex: Fix compat_futex to be same as futex for REQUEUE_PI · 4dc88029
      Dinakar Guniguntala committed
      The REQUEUE_PI checks need to be added to the compat_sys_futex API
      as well, to ensure that 32-bit requeues work correctly on a 64-bit
      system. The patch is against the latest tip.
      Signed-off-by: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: Darren Hart <dvhltc@us.ibm.com>
      LKML-Reference: <20090810130142.GA23619@in.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • locking, sched: Give waitqueue spinlocks their own lockdep classes · 2fc39111
      Peter Zijlstra committed
      Give waitqueue spinlocks their own lockdep classes when they
      are initialised from init_waitqueue_head().  This means that
      struct wait_queue::func functions can operate on other waitqueues.
      
      This is used by CacheFiles to catch the page from a backing fs
      being unlocked and to wake up another thread to take a copy of
      it.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Takashi Iwai <tiwai@suse.de>
      Cc: linux-cachefs@redhat.com
      Cc: torvalds@osdl.org
      Cc: akpm@linux-foundation.org
      LKML-Reference: <20090810113305.17284.81508.stgit@warthog.procyon.org.uk>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Require CAP_SYS_ADMIN for raw tracepoint data · a4e95fc2
      Peter Zijlstra committed
      Raw tracepoint data contains various kernel internals and
      data from other users, so restrict this to CAP_SYS_ADMIN.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249896452.17467.75.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Correct PERF_SAMPLE_RAW output · a044560c
      Peter Zijlstra committed
      PERF_SAMPLE_* output switches should unconditionally output the
      correct format, as they are the only way to unambiguously parse
      the PERF_EVENT_SAMPLE data.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249896447.17467.74.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • futex: Update futex_q lock_ptr on requeue proxy lock · beda2c7e
      Darren Hart committed
      futex_requeue() can acquire the lock on behalf of a waiter
      early on or during the requeue loop if it is uncontended or in
      the event of a lock steal or owner died. On wakeup, the waiter
      (in futex_wait_requeue_pi()) cleans up the pi_state owner using
      the lock_ptr to protect against concurrent access to the
      pi_state. The pi_state is hung off futex_q's on the requeue
      target futex hash bucket so the lock_ptr needs to be updated
      accordingly.
      
      The problem manifested by triggering the WARN_ON in
      lookup_pi_state() about the pid != pi_state->owner->pid.  With
      this patch, the pi_state is properly guarded against concurrent
      access via the requeue target hb lock.
      
      The astute reviewer may notice that there is a window of time
      between when futex_requeue() unlocks the hb locks and when
      futex_wait_requeue_pi() will acquire hb2->lock.  During this
      time the pi_state and uval are not in sync with the underlying
      rtmutex owner (but the uval does indicate there are waiters, so
      no atomic changes will occur in userspace).  However, this is
      not a problem. Should a contending thread enter
      lookup_pi_state() and acquire hb2->lock before the ownership is
      fixed up, it will find the pi_state hung off a waiter's
      (possibly the pending owner's) futex_q and block on the
      rtmutex.  Once futex_wait_requeue_pi() fixes up the owner, it
      will also move the pi_state from the old owner's
      task->pi_state_list to its own.
      
      v3: Fix plist lock name for application to mainline (rather
          than -rt) Compile tested against tip/v2.6.31-rc5.
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      LKML-Reference: <4A7F4EFF.6090903@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  9. 09 Aug, 2009 7 commits
    • perf_counter: Fix a race on perf_counter_ctx · 3a80b4a3
      Peter Zijlstra committed
      While extending perfcounters with BTS hw-tracing, Markus
      Metzger managed to trigger this warning:
      
         [  995.557128] WARNING: at kernel/perf_counter.c:1191 __perf_counter_task_sched_out+0x48/0x6b()
      
      This triggers because commit 9f498cc5 (perf_counter: Full task
      tracing) moved the clearing of tsk->perf_counter_ctxp out from
      under ctx->lock, which introduced a race (against
      perf_lock_task_context).
      
      Move it back and deal with the exit notification by explicitly
      passing along the former task context.
      Reported-by: Markus T Metzger <markus.t.metzger@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249667341.17467.5.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Fix tracepoint sampling to be part of generic sampling · 3a43ce68
      Frederic Weisbecker committed
      Based on Peter's comments, make tracepoint sampling generic
      just like all the other sampling bits are. This is a rename
      with no code changes:
      
      - PERF_SAMPLE_TP_RECORD to PERF_SAMPLE_RAW
      - struct perf_tracepoint_record to perf_raw_record
      
      We want the mechanism that transports raw tracepoint sample
      events into the perf ring buffer to be generalized and
      usable by any type of counter.
      
      Reported-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249698400-5441-4-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Work around gcc warning by initializing tracepoint record unconditionally · 10b8e306
      Frederic Weisbecker committed
      Although the tracepoint record is always present when the
      PERF_SAMPLE_TP_RECORD flag is set, gcc raises a warning,
      thinking it might not be initialized:
      
        kernel/perf_counter.c: In function ‘perf_counter_output’:
        kernel/perf_counter.c:2650: warning: ‘tp’ may be used uninitialized in this function
      
      So initialize it to NULL and always check that it is not NULL
      before dereferencing it.
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249698400-5441-2-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Fix software counters for fast moving event sources · 7b4b6658
      Peter Zijlstra committed
      Reimplement the software counters to deal with fast moving
      event sources (such as tracepoints). This means being able
      to generate multiple overflows from a single 'event' as well
      as support throttling.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Fix/complete ftrace event records sampling · f413cdb8
      Frederic Weisbecker committed
      This patch implements the kernel side support for ftrace event
      record sampling.
      
      A new counter sampling attribute is added:
      
         PERF_SAMPLE_TP_RECORD
      
      which requests ftrace events record sampling. In this case
      if a PERF_TYPE_TRACEPOINT counter is active and a tracepoint
      fires, we emit the tracepoint binary record to the
      perfcounter event buffer, as a sample.
      
      Result, after setting PERF_SAMPLE_TP_RECORD attribute from perf
      record:
      
       perf record -f -F 1 -a -e workqueue:workqueue_execution
       perf report -D
      
       0x21e18 [0x48]: event: 9
       .
       . ... raw event: size 72 bytes
       .  0000:  09 00 00 00 01 00 48 00 d0 c7 00 81 ff ff ff ff  ......H........
       .  0010:  0a 00 00 00 0a 00 00 00 21 00 00 00 00 00 00 00  ........!......
       .  0020:  2b 00 01 02 0a 00 00 00 0a 00 00 00 65 76 65 6e  +...........eve
       .  0030:  74 73 2f 31 00 00 00 00 00 00 00 00 0a 00 00 00  ts/1...........
       .  0040:  e0 b1 31 81 ff ff ff ff                          .......
      .
      0x21e18 [0x48]: PERF_EVENT_SAMPLE (IP, 1): 10: 0xffffffff8100c7d0 period: 33
      
      The raw ftrace binary record starts at offset 0020.
      
      Translation:
      
       struct trace_entry {
      	type		= 0x2b = 43;
      	flags		= 1;
      	preempt_count	= 2;
      	pid		= 0xa = 10;
      	tgid		= 0xa = 10;
       }
      
       thread_comm = "events/1"
       thread_pid  = 0xa = 10;
       func	    = 0xffffffff8131b1e0 = flush_to_ldisc()
      
      What will come next?
      
       - Userspace support ('perf trace'), 'flight data recorder' mode
         for perf trace, etc.
      
       - The unconditional copy from the profiling callback has a
         cost even when no such sampling is wanted, and needs to be
         fixed in the future. For that we need instant access to the
         perf counter attribute, which is a matter of adding a flag
         to struct ftrace_event.
      
       - Take care of event recursion! Don't ever try to record
         a lock event, for example: it seems some locking is used in
         the profiling fast path, which leads to tracing recursion.
         That will be fixed using a raw spinlock or recursion
         protection.
      
       - [...]
      
       - Profit! :-)
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter, ftrace: Fix perf_counter integration · 3a659305
      Peter Zijlstra committed
      Adds an optional second part to the assign argument of TP_EVENT().
      
        TP_perf_assign(
      	__perf_count(foo);
      	__perf_addr(bar);
        )
      
      When specified, this makes the swcounter increment by @foo instead
      of the usual 1, and reports @bar for PERF_SAMPLE_ADDR (data address
      associated with the event) when this triggers a counter overflow.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • posix_cpu_timers_exit_group(): Do not use thread_group_cputimer() · 17d42c1c
      Stanislaw Gruszka committed
      When the process exits, we neither have to start a new cputimer nor
      use a running one (it does not account when tsk->exit_state != 0)
      to get the process CPU times.  As there is only one thread, we can
      just use the CPU time fields from the task and signal structs.
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 08 Aug, 2009 6 commits
    • tracing/filters: Always free pred on filter_add_subsystem_pred() failure · 26528e77
      Tom Zanussi committed
      If filter_add_subsystem_pred() fails due to ENOSPC or ENOMEM,
      the pred doesn't get freed, while as a side effect it does for
      other errors. Make it so the caller always frees the pred for
      any error.
      Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <1249746593.6453.32.camel@tropicana>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • tracing/filters: Don't use pred on alloc failure · 96b2de31
      Tom Zanussi committed
      Dan Carpenter sent me a fix to prevent pred from being used if
      it couldn't be allocated.  I noticed the same problem also
      existed for the create_pred() case and added a fix for that.
      Reported-by: Dan Carpenter <error27@gmail.com>
      Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <1249746549.6453.29.camel@tropicana>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • x86/irq: Fix move_irq_desc() for nodes without ram · ad7d6c7a
      Yinghai Lu committed
      Don't move it if target node is -1.
      Signed-off-by: Yinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4A785B5D.4070702@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • execve: must clear current->clear_child_tid · 9c8a8228
      Eric Dumazet committed
      While looking at Jens Rosenboom's bug report
      (http://lkml.org/lkml/2009/7/27/35) about a strange sys_futex call made
      from a dying "ps" program, we found the following problem.
      
      clone() syscall has special support for TID of created threads.  This
      support includes two features.
      
      One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
      TID value.
      
      One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
      thread dies.
      
      The integer location is a user provided pointer, provided at clone()
      time.
      
      The kernel keeps this pointer value in current->clear_child_tid.
      
      At execve() time, we should make sure the kernel doesn't keep this
      user-provided pointer, as the full user memory is replaced by a new one.
      
      As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
      CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
      memory in forked processes.
      
      Following sequence could happen:
      
      1) bash (or any program) starts a new process, by a fork() call that
         glibc maps to a clone( ...  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
         ...) syscall
      
      2) When new process starts, its current->clear_child_tid is set to a
         location that has a meaning only in bash (or initial program) context
         (&THREAD_SELF->tid)
      
      3) This new process does the execve() syscall to start a new program.
         current->clear_child_tid is left unchanged (a non NULL value)
      
      4) If this new program creates some threads, and initial thread exits,
         kernel will attempt to clear the integer pointed by
         current->clear_child_tid from mm_release() :
      
              if (tsk->clear_child_tid
                  && !(tsk->flags & PF_SIGNALED)
                  && atomic_read(&mm->mm_users) > 1) {
                      u32 __user * tidptr = tsk->clear_child_tid;
                      tsk->clear_child_tid = NULL;
      
                      /*
                       * We don't check the error code - if userspace has
                       * not set up a proper pointer then tough luck.
                       */
      << here >>      put_user(0, tidptr);
                      sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
              }
      
      5) OR : if new program is not multi-threaded, but spied by /proc/pid
         users (ps command for example), mm_users > 1, and the exiting program
         could corrupt 4 bytes in a persistent memory area (shm or memory mapped
         file)
      
      If current->clear_child_tid points to a writeable portion of memory of the
      new program, kernel happily and silently corrupts 4 bytes of memory, with
      unexpected effects.
      
      Fix is straightforward and should not break any sane program.
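
The sequence above can be modelled in plain user-space C. This is a sketch under stated assumptions, not the kernel change itself: `struct task_model`, `model_execve()`, `model_exit()` and `run_scenario()` are hypothetical names, the `mm_users` and `PF_SIGNALED` conditions are assumed to hold, and the fix is assumed to amount to dropping the stale pointer at execve() time, as the commit title suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the one task field that matters here. */
struct task_model {
    uint32_t *clear_child_tid; /* user pointer recorded at clone() time */
};

/*
 * Model of execve(): the old address space is replaced. With the fix,
 * the stale pointer is dropped as well.
 */
static void model_execve(struct task_model *t, int with_fix)
{
    if (with_fix)
        t->clear_child_tid = NULL;
}

/*
 * Model of the exit path quoted above: if a pointer is still recorded,
 * zero the word it points to.
 */
static void model_exit(struct task_model *t)
{
    if (t->clear_child_tid) {
        uint32_t *tidptr = t->clear_child_tid;

        t->clear_child_tid = NULL;
        *tidptr = 0; /* stands in for put_user(0, tidptr) */
    }
}

/*
 * Run clone -> execve -> exit and return the value left in a word that
 * belongs to the new program and happens to sit at the stale address.
 */
static uint32_t run_scenario(int with_fix)
{
    uint32_t new_program_data = 0xdeadbeef;
    struct task_model t = { .clear_child_tid = &new_program_data };

    model_execve(&t, with_fix);
    model_exit(&t);
    return new_program_data;
}
```

In this model `run_scenario(0)` returns 0, because the exit path silently overwrote the new program's data, while `run_scenario(1)` returns the original value.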
      Reported-by: Jens Rosenboom <jens@mcbone.net>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sonny Rao <sonnyrao@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c8a8228
    • X
      generic-ipi: fix hotplug_cfd() · 69dd647f
      Xiao Guangrong committed
      Use CONFIG_HOTPLUG_CPU, not CONFIG_CPU_HOTPLUG
      
      When hot-unplugging a cpu, it will leak memory allocated at cpu hotplug,
      but only if CPUMASK_OFFSTACK=y, which defaults to n.
      
      The bug was introduced by 8969a5ed
      ("generic-ipi: remove kmalloc()").
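
Why a misspelled config symbol goes unnoticed can be shown with a small sketch. The two function names are invented for illustration, and the `#define` stands in for a build where CPU hotplug is enabled; the preprocessor treats an undefined symbol as simply absent and drops the guarded code without any diagnostic.

```c
#include <assert.h>

/* Assumed build configuration: CPU hotplug support is enabled. */
#define CONFIG_HOTPLUG_CPU 1

/*
 * Misspelled symbol: it is never defined, so the preprocessor drops the
 * guarded branch without any warning.
 */
static int cleanup_built_with_typo(void)
{
#ifdef CONFIG_CPU_HOTPLUG
    return 1; /* cleanup code would be compiled in */
#else
    return 0; /* cleanup code is silently missing */
#endif
}

/* Correct symbol: the guarded branch is compiled in. */
static int cleanup_built_with_fix(void)
{
#ifdef CONFIG_HOTPLUG_CPU
    return 1;
#else
    return 0;
#endif
}
```

Here `cleanup_built_with_typo()` returns 0 and `cleanup_built_with_fix()` returns 1, which mirrors how the free path could be missing even though hotplug was configured.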
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      69dd647f
    • E
      ring-buffer: Fix memleak in ring_buffer_free() · bd3f0221
      Eric Dumazet committed
      I noticed that oprofile leaked memory in the current linux-2.6 tree,
      and tracked it down to this ring-buffer leak.
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <4A7C06B9.2090302@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      bd3f0221
  11. 07 Aug, 2009 2 commits
    • L
      lockdep: Fix file mode of lock_stat · 9795447f
      Li Zefan committed
      /proc/lock_stat is writable.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <4A7BE7B6.10904@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      9795447f
    • P
      perf_counter: Fix double list iteration in per task precise stats · 1054598c
      Peter Zijlstra committed
      Brice Goglin reported this crash with per task precise stats:
      
      > I finally managed to test the threaded perfcounter statistics (thanks a
      > lot for implementing it). I am running 2.6.31-rc5 (with the AMD
      > magny-cours patches but I don't think they matter here). I am trying to
      > measure local/remote memory accesses per thread during the well-known
      > stream benchmark. It's compiled with OpenMP using 16 threads on a
      > quad-socket quad-core barcelona machine.
      >
      > Command line is:
      >  /mnt/scratch/bgoglin/cpunode/linux-2.6.31/tools/perf/perf record -f -s
      > -e r1000001e0 -e r1000002e0 -e r1000004e0 -e r1000008e0 ./stream
      >
      > It seems to work fine with a single -e <counter> on the command line
      > while it crashes when there are at least 2 of them.
      > It seems to work fine without -s as well.
      
      A silly copy-paste resulted in a messed up iteration which would
      cause the OOPS.
      Reported-by: Brice Goglin <Brice.Goglin@inria.fr>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
      LKML-Reference: <1249574786.32113.550.camel@twins>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1054598c
  12. 06 Aug, 2009 4 commits
    • R
      ring-buffer: Fix advance of reader in rb_buffer_peek() · 469535a5
      Robert Richter committed
      When calling rb_buffer_peek() from ring_buffer_consume() and a
      padding event is returned, the function rb_advance_reader() is
      called twice. This may lead to missing samples or, under high
      workloads, to the warning below. This patch fixes this: if a padding
      event is returned by rb_buffer_peek(), it will now be consumed by the
      calling function.
      
      Also, I simplified some code in ring_buffer_consume().
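
The double advance can be illustrated with a toy reader over an array. This is a simplified model and not the ring-buffer code: `EV_PADDING`, `struct reader_model`, `peek_buggy()`, `peek_fixed()` and `count_samples()` are all hypothetical names, and the real buffer has pages and timestamps that are ignored here.

```c
#include <assert.h>
#include <stddef.h>

#define EV_PADDING (-1) /* hypothetical marker for a padding event */

struct reader_model {
    const int *events;
    size_t len;
    size_t pos; /* index of the next event to read */
};

/* Old behaviour: peek advances past padding itself and still returns it. */
static int peek_buggy(struct reader_model *r)
{
    int ev = r->events[r->pos];

    if (ev == EV_PADDING)
        r->pos++; /* first advance */
    return ev;
}

/* New behaviour: peek never advances; the caller consumes the padding. */
static int peek_fixed(const struct reader_model *r)
{
    return r->events[r->pos];
}

/* Consume every event and count the non-padding samples delivered. */
static size_t count_samples(const int *events, size_t len, int use_fix)
{
    struct reader_model r = { events, len, 0 };
    size_t samples = 0;

    while (r.pos < r.len) {
        int ev = use_fix ? peek_fixed(&r) : peek_buggy(&r);

        r.pos++; /* the consumer's own advance */
        if (ev != EV_PADDING)
            samples++;
    }
    return samples;
}
```

For the input `{ 10, EV_PADDING, 20, 30 }` the buggy variant delivers two samples, because the event after the padding is skipped, and the fixed variant delivers all three.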
      
      ------------[ cut here ]------------
      WARNING: at /dev/shm/.source/linux/kernel/trace/ring_buffer.c:2289 rb_advance_reader+0x2e/0xc5()
      Hardware name: Anaheim
      Modules linked in:
      Pid: 29, comm: events/2 Tainted: G        W  2.6.31-rc3-oprofile-x86_64-standard-00059-g5050dc2 #1
      Call Trace:
      [<ffffffff8106776f>] ? rb_advance_reader+0x2e/0xc5
      [<ffffffff81039ffe>] warn_slowpath_common+0x77/0x8f
      [<ffffffff8103a025>] warn_slowpath_null+0xf/0x11
      [<ffffffff8106776f>] rb_advance_reader+0x2e/0xc5
      [<ffffffff81068bda>] ring_buffer_consume+0xa0/0xd2
      [<ffffffff81326933>] op_cpu_buffer_read_entry+0x21/0x9e
      [<ffffffff810be3af>] ? __find_get_block+0x4b/0x165
      [<ffffffff8132749b>] sync_buffer+0xa5/0x401
      [<ffffffff810be3af>] ? __find_get_block+0x4b/0x165
      [<ffffffff81326c1b>] ? wq_sync_buffer+0x0/0x78
      [<ffffffff81326c76>] wq_sync_buffer+0x5b/0x78
      [<ffffffff8104aa30>] worker_thread+0x113/0x1ac
      [<ffffffff8104dd95>] ? autoremove_wake_function+0x0/0x38
      [<ffffffff8104a91d>] ? worker_thread+0x0/0x1ac
      [<ffffffff8104dc9a>] kthread+0x88/0x92
      [<ffffffff8100bdba>] child_rip+0xa/0x20
      [<ffffffff8104dc12>] ? kthread+0x0/0x92
      [<ffffffff8100bdb0>] ? child_rip+0x0/0x20
      ---[ end trace f561c0a58fcc89bd ]---
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Robert Richter <robert.richter@amd.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      469535a5
    • P
      ftrace: Fix perf-tracepoint OOPS · af6af30c
      Peter Zijlstra committed
      Not all tracepoints are created equal. Specifically, the ftrace
      tracepoints are created with TRACE_EVENT_FORMAT(), which does
      not generate the needed bits to tie them into perf counters.
      
      For those events, don't create the 'id' file and fail
      ->profile_enable when their ID is specified through other
      means.
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      LKML-Reference: <1249497664.5890.4.camel@laptop>
      [ v2: fix build error in the !CONFIG_EVENT_PROFILE case ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      af6af30c
    • D
      rtmutex: Avoid deadlock in rt_mutex_start_proxy_lock() · 1bbf2083
      Darren Hart committed
      In the event of a lock steal or owner died,
      rt_mutex_start_proxy_lock() will give the rt_mutex to the
      waiting task, but it fails to release the wait_lock. This leads
      to subsequent deadlocks when other tasks try to acquire the
      rt_mutex.
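
The missed unlock on the early-return path can be sketched with a flag in place of a real lock. This is a hypothetical model rather than the rtmutex code: `struct rt_mutex_model` and this `start_proxy_lock()` are names made up for the example, and the `release_on_early_return` parameter only exists to show the old and new behaviour side by side.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model: the lock is a flag that records whether it is held. */
struct rt_mutex_model {
    bool wait_lock_held;
    bool has_owner;
};

/*
 * Returns 1 if the mutex was handed straight to the waiting task and 0 if
 * the waiter had to be queued. Every return path must drop wait_lock.
 */
static int start_proxy_lock(struct rt_mutex_model *m,
                            bool release_on_early_return)
{
    m->wait_lock_held = true; /* take wait_lock */

    if (!m->has_owner) {
        /* Lock steal or owner died: the waiter gets the mutex right away. */
        m->has_owner = true;
        if (release_on_early_return)
            m->wait_lock_held = false; /* the step the old code skipped */
        return 1;
    }

    /* Normal path: enqueue the waiter, then drop wait_lock. */
    m->wait_lock_held = false;
    return 0;
}
```

With `release_on_early_return` set to false the function returns with `wait_lock_held` still true, so the next task to take the lock would block forever; with it set to true the flag is cleared on every path.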
      
      I also removed a few extra blank lines that really spaced this
      routine out. I must have been high on the \n when I wrote this
      originally...
      Signed-off-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      LKML-Reference: <4A79D7F1.4000405@us.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      1bbf2083
    • S
      ring-buffer: do not disable ring buffer on oops_in_progress · 464e85eb
      Steven Rostedt committed
      The commit:
      
        commit e0fdace1
        Author: David Miller <davem@davemloft.net>
        Date:   Fri Aug 1 01:11:22 2008 -0700
      
          debug_locks: set oops_in_progress if we will log messages.
      
          Otherwise lock debugging messages on runqueue locks can deadlock the
          system due to the wakeups performed by printk().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      
      Will permanently set oops_in_progress on any lockdep failure.
      When this triggers it will cause any read from the ring buffer to
      permanently disable the ring buffer (not to mention no locking of
      printk).
      
      This patch removes the check. It keeps the print in NMI which makes
      sense. This is probably OK, since the ring buffer should not cause
      something to set oops_in_progress anyway.
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      464e85eb