1. 08 Aug, 2009 2 commits
    • execve: must clear current->clear_child_tid · 9c8a8228
      Committed by Eric Dumazet
      While looking at Jens Rosenboom's bug report
      (http://lkml.org/lkml/2009/7/27/35) about a strange sys_futex call done from
      a dying "ps" program, we found the following problem.
      
      clone() syscall has special support for TID of created threads.  This
      support includes two features.
      
      One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
      TID value.
      
      The other (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
      thread dies.
      
      The integer location is a user provided pointer, provided at clone()
      time.
      
      The kernel keeps this pointer value in current->clear_child_tid.

      At execve() time, we should make sure the kernel doesn't keep this user
      provided pointer, as full user memory is replaced by a new one.
      
      As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
      CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
      memory in forked processes.
      
      Following sequence could happen:
      
      1) bash (or any program) starts a new process, by a fork() call that
         glibc maps to a clone( ...  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
         ...) syscall
      
      2) When new process starts, its current->clear_child_tid is set to a
         location that has a meaning only in bash (or initial program) context
         (&THREAD_SELF->tid)
      
      3) This new process does the execve() syscall to start a new program.
         current->clear_child_tid is left unchanged (a non NULL value)
      
      4) If this new program creates some threads, and initial thread exits,
         kernel will attempt to clear the integer pointed by
         current->clear_child_tid from mm_release() :
      
              if (tsk->clear_child_tid
                  && !(tsk->flags & PF_SIGNALED)
                  && atomic_read(&mm->mm_users) > 1) {
                      u32 __user * tidptr = tsk->clear_child_tid;
                      tsk->clear_child_tid = NULL;
      
                      /*
                       * We don't check the error code - if userspace has
                       * not set up a proper pointer then tough luck.
                       */
      << here >>      put_user(0, tidptr);
                      sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
              }
      
      5) OR : if new program is not multi-threaded, but spied by /proc/pid
         users (ps command for example), mm_users > 1, and the exiting program
         could corrupt 4 bytes in a persistent memory area (shm or memory mapped
         file)
      
      If current->clear_child_tid points to a writeable portion of memory of the
      new program, kernel happily and silently corrupts 4 bytes of memory, with
      unexpected effects.
      
      Fix is straightforward and should not break any sane program.
      Reported-by: Jens Rosenboom <jens@mcbone.net>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sonny Rao <sonnyrao@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
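The stale-pointer sequence described in the execve commit can be illustrated with a small user-space model. This is only a sketch: `task_model`, `model_clone`, `model_execve`, `model_exit` and `run_scenario` are hypothetical names, not kernel code, and the model leaves out the `PF_SIGNALED` and `mm_users` conditions from the `mm_release()` excerpt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the one task_struct field that matters here. */
struct task_model {
    uint32_t *clear_child_tid;  /* user pointer recorded at clone() time */
};

/* clone() with CLONE_CHILD_CLEARTID: remember where to write 0 on exit. */
static void model_clone(struct task_model *t, uint32_t *tidptr)
{
    assert(t != NULL);
    t->clear_child_tid = tidptr;
}

/* execve(): the old address space is gone. With 'fixed' set, the stale
 * pointer is dropped; without it, the pointer survives the exec. */
static void model_execve(struct task_model *t, int fixed)
{
    if (fixed)
        t->clear_child_tid = NULL;
}

/* Exit path, loosely following the mm_release() excerpt: clear the word. */
static void model_exit(struct task_model *t)
{
    if (t->clear_child_tid) {
        uint32_t *tidptr = t->clear_child_tid;
        t->clear_child_tid = NULL;
        *tidptr = 0;
    }
}

/* Returns what is left in a word that plays the role of data owned by
 * the new program at the address the stale pointer refers to. */
static uint32_t run_scenario(int fixed)
{
    uint32_t word = 0xdeadbeef;
    struct task_model t = { NULL };

    model_clone(&t, &word);
    model_execve(&t, fixed);
    model_exit(&t);
    return word;
}
```

In this model, `run_scenario(0)` returns a zeroed word (the 4-byte corruption), while `run_scenario(1)` leaves the word intact.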
    • generic-ipi: fix hotplug_cfd() · 69dd647f
      Committed by Xiao Guangrong
      Use CONFIG_HOTPLUG_CPU, not CONFIG_CPU_HOTPLUG

      When hot-unplugging a CPU, memory allocated at CPU hotplug is leaked,
      but only if CPUMASK_OFFSTACK=y, which defaults to n.
      
      The bug was introduced by 8969a5ed
      ("generic-ipi: remove kmalloc()").
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
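Why a misspelled option name goes unnoticed can be shown with plain preprocessor guards. The sketch below assumes a build where only the real name, CONFIG_HOTPLUG_CPU, is defined; the two function names are hypothetical and stand in for the cleanup path guarded in hotplug_cfd().

```c
#include <assert.h>

/* Assumption: the build system defines the real option name only. */
#define CONFIG_HOTPLUG_CPU 1

/* Guarded by the misspelled name: the preprocessor never sees it defined,
 * so the "cleanup" branch is silently compiled out with no diagnostic. */
static int cleanup_built_misspelled(void)
{
#ifdef CONFIG_CPU_HOTPLUG
    return 1;
#else
    return 0;
#endif
}

/* Guarded by the correct name: the cleanup branch is compiled in. */
static int cleanup_built_correct(void)
{
#ifdef CONFIG_HOTPLUG_CPU
    return 1;
#else
    return 0;
#endif
}
```

With this setup `cleanup_built_misspelled()` returns 0 and `cleanup_built_correct()` returns 1, which is the kind of silent omission that leads to a leak on hot-unplug.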
  2. 04 Aug, 2009 1 commit
  3. 02 Aug, 2009 6 commits
    • sched: Fix race in cpupri introduced by cpumask_var changes · 07903af1
      Committed by Gregory Haskins
      Background:
      
      Several race conditions in the scheduler have cropped up
      recently, which Steven and I have tracked down using ftrace.
      The most recent one turns out to be a race in how the scheduler
      determines a suitable migration target for RT tasks, introduced
      recently with commit:
      
          commit 68e74568
          Date:   Tue Nov 25 02:35:13 2008 +1030
      
              sched: convert struct cpupri_vec cpumask_var_t.
      
      The original design of cpupri allowed lockless readers to
      quickly determine a best-estimate target.  Races between the
      pri_active bitmap and the vec->mask were handled in the
      original code because we would detect and return "0" when this
      occurred.  The design was predicated on the *effective*
      atomicity (*) of caching the result of cpus_and() between the
      cpus_allowed and the vec->mask.
      
      Commit 68e74568 changed the behavior such that vec->mask is
      accessed multiple times.  This introduces a subtle race, the
      result of which means we can have a result that returns "1",
      but with an empty bitmap.
      
      *) yes, we know cpus_and() is not a locked operator across the
         entire composite array, but it is implicitly atomic on a
         per-word basis which is all the design required to work.
      
      Implementation:
      
      Rather than forgoing the lockless design, or reverting to a
      stack-based cpumask_t, we simply check for when the race has
      been encountered and continue processing in the event that the
      race is hit.  This renders the removal race as if the priority
      bit had been atomically cleared as well, and allows the
      algorithm to execute correctly.
      Signed-off-by: Gregory Haskins <ghaskins@novell.com>
      CC: Rusty Russell <rusty@rustcorp.com.au>
      CC: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090730145728.25226.92769.stgit@dev.haskins.net>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
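The "detect the race and continue" approach from the cpupri commit can be sketched in user space with 64-bit words standing in for cpumasks. This is an illustrative model, not the kernel's cpupri_find(): `find_lowest` and its parameters are hypothetical, and the `volatile` qualifier only models that each access re-reads memory a concurrent writer could have changed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 and stores a non-empty candidate mask in *lowest,
 * or returns 0 if no priority level has a usable CPU. */
static int find_lowest(const volatile uint64_t *vec_mask, size_t nvec,
                       uint64_t allowed, uint64_t *lowest)
{
    assert(lowest != NULL);
    for (size_t i = 0; i < nvec; i++) {
        /* First read: cheap test for any overlap at this level. */
        if ((vec_mask[i] & allowed) == 0)
            continue;

        /* Second read: a concurrent writer may have cleared bits
         * since the test above, so the result must be re-checked. */
        uint64_t result = vec_mask[i] & allowed;
        if (result == 0)
            continue;   /* race hit: treat this level as empty */

        *lowest = result;
        return 1;
    }
    return 0;
}
```

The second emptiness check is what prevents the "returns 1 with an empty bitmap" outcome the commit describes; on a hit, the loop simply moves on to the next priority level.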
    • sched: Fix latencytop and sleep profiling vs group scheduling · e414314c
      Committed by Peter Zijlstra
      The latencytop and sleep accounting code assumes that any
      scheduler entity represents a task; this is not so.
      
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Full task tracing · 9f498cc5
      Committed by Peter Zijlstra
      In order to be able to distinguish between no samples due to
      inactivity and no samples due to task ended, Arjan asked for
      PERF_EVENT_EXIT events. This is useful to the boot delay
      instrumentation (bootchart) app.
      
      This patch changes the PERF_EVENT_FORK to be emitted on every
      clone, and adds PERF_EVENT_EXIT to be emitted on task exit,
      after the task's counters have been closed.
      
      This task tracing is controlled through: attr.comm || attr.mmap
      and through the new attr.task field.
      Suggested-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Anton Blanchard <anton@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      [ cleaned up perf_counter.h a bit ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Collapse inherit on read() · e53c0994
      Committed by Peter Zijlstra
      Currently the counter value returned by read() is the value of
      the parent counter, to which child counters are only fed back
      on child exit.
      
      Thus read() can return rather erratic (and meaningless) numbers
      depending on the state of the child processes.
      
      Change this by always iterating the full child hierarchy on
      read() and sum all counters.
      Suggested-by: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
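Summing over the child hierarchy on read(), as the "Collapse inherit" commit describes, might look roughly like the following user-space model. The `counter_model` structure and `read_tree` function are assumptions made for illustration; the kernel's own data structures and locking are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical counter node: its own count plus inherited children. */
struct counter_model {
    uint64_t count;
    struct counter_model **children;
    size_t nr_children;
};

/* Return this counter's value plus the values of all descendants,
 * instead of waiting for children to feed back on exit. */
static uint64_t read_tree(const struct counter_model *c)
{
    assert(c != NULL);
    uint64_t total = c->count;
    for (size_t i = 0; i < c->nr_children; i++)
        total += read_tree(c->children[i]);
    return total;
}
```

A parent with count 10 and two live children with counts 5 and 7 reads as 22 in this model, regardless of whether the children have exited yet.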
    • do_sigaltstack: small cleanups · 0dd8486b
      Committed by Linus Torvalds
      The previous commit ("do_sigaltstack: avoid copying 'stack_t' as a
      structure to user space") fixed a real bug.  This one just cleans up the
      copy from user space so that gcc can generate better code for it (and so
      that it looks the same as the later copy back to user space).
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • do_sigaltstack: avoid copying 'stack_t' as a structure to user space · 0083fc2c
      Committed by Linus Torvalds
      Ulrich Drepper correctly points out that there is generally padding in
      the structure on 64-bit hosts, and that copying the structure from
      kernel to user space can leak information from the kernel stack in those
      padding bytes.
      
      Avoid the whole issue by just copying the three members one by one
      instead, which also means that the function also can avoid the need for
      a stack frame.  This also happens to match how we copy the new structure
      from user space, so it all even makes sense.
      
      [ The obvious solution of adding a memset() generates horrid code, gcc
        does really stupid things. ]
      Reported-by: Ulrich Drepper <drepper@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
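The member-by-member copy from the do_sigaltstack commit can be modeled in user space with a byte buffer standing in for user memory. This is a sketch under stated assumptions: `struct kstack` is a hypothetical layout that mirrors the three members of stack_t, `memcpy()` stands in for the kernel's user-copy primitives, and `copy_fields` and `count_marker` are illustrative helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout mirroring stack_t's three members; on typical
 * 64-bit ABIs there are padding bytes after ss_flags. */
struct kstack {
    void  *ss_sp;
    int    ss_flags;
    size_t ss_size;
};

/* Copy each member to its offset in the destination buffer, so the
 * destination's padding bytes are never written from the source. */
static void copy_fields(unsigned char *dst, const struct kstack *k)
{
    assert(dst != NULL && k != NULL);
    memcpy(dst + offsetof(struct kstack, ss_sp),
           &k->ss_sp, sizeof k->ss_sp);
    memcpy(dst + offsetof(struct kstack, ss_flags),
           &k->ss_flags, sizeof k->ss_flags);
    memcpy(dst + offsetof(struct kstack, ss_size),
           &k->ss_size, sizeof k->ss_size);
}

/* Count bytes equal to 'marker'; used to detect leaked stale bytes. */
static size_t count_marker(const unsigned char *buf, size_t n,
                           unsigned char marker)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++)
        if (buf[i] == marker)
            hits++;
    return hits;
}
```

If the source structure is first filled with a marker byte (playing the role of stale kernel stack contents) and then has its members set, a buffer filled by `copy_fields` contains no marker bytes, whereas a whole-structure copy would typically carry the padding along.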
  4. 31 Jul, 2009 1 commit
  5. 30 Jul, 2009 5 commits
  6. 29 Jul, 2009 2 commits
    • tracing: Fix missing function_graph events when we splice_read from trace_pipe · 74e7ff8c
      Committed by Lai Jiangshan
      About half of the events are missing when we splice_read
      from trace_pipe. They are unexpectedly consumed because we ignore
      the TRACE_TYPE_NO_CONSUME return value used by the function graph
      tracer when it needs to consume the events by itself to walk on
      the ring buffer.
      
      The same problem appears with ftrace_dump()
      
      Example of an output before this patch:
      
      1)               |      ktime_get_real() {
      1)   2.846 us    |          read_hpet();
      1)   4.558 us    |        }
      1)   6.195 us    |      }
      
      After this patch:
      
      0)               |      ktime_get_real() {
      0)               |        getnstimeofday() {
      0)   1.960 us    |          read_hpet();
      0)   3.597 us    |        }
      0)   5.196 us    |      }
      
      The fix also applies on 2.6.30
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: stable@kernel.org
      LKML-Reference: <4A6EEC52.90704@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
    • tracing: Fix invalid function_graph entry · 38ceb592
      Committed by Lai Jiangshan
      When print_graph_entry() computes a function call entry event, it needs
      to also check the next entry to guess if it matches the return event of
      the current function entry.
      In order to look at this next event, it needs to consume the current
      entry before going ahead in the ring buffer.
      
      However, if the current event that gets consumed is the last one in the
      ring buffer head page, the ring_buffer may reuse the page for writers.
      The consumed entry will then become invalid because of possible
      racy overwriting.
      
      We must then handle this entry by making a copy of it.
      
      The fix also applies on 2.6.30
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: stable@kernel.org
      LKML-Reference: <4A6EEAEC.3050508@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  7. 28 Jul, 2009 2 commits
  8. 25 Jul, 2009 1 commit
  9. 23 Jul, 2009 11 commits
  10. 22 Jul, 2009 1 commit
    • softirq: introduce tasklet_hrtimer infrastructure · 9ba5f005
      Committed by Peter Zijlstra
      commit ca109491 (hrtimer: removing all ur callback modes) moved all
      hrtimer callbacks into hard interrupt context when high resolution
      timers are active. That breaks code which relied on the assumption
      that the callback happens in softirq context.
      
      Provide a generic infrastructure which combines tasklets and hrtimers
      together to provide an in-softirq hrtimer experience.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: torvalds@linux-foundation.org
      Cc: kaber@trash.net
      Cc: David Miller <davem@davemloft.net>
      LKML-Reference: <1248265724.27058.1366.camel@twins>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  11. 21 Jul, 2009 1 commit
    • genirq: Delegate irq affinity setting to the irq thread · 591d2fb0
      Committed by Thomas Gleixner
      irq_set_thread_affinity() calls set_cpus_allowed_ptr() which might
      sleep, but irq_set_thread_affinity() is called with desc->lock held
      and can be called from hard interrupt context as well. The code has
      another bug as it does not hold a ref on the task struct as required
      by set_cpus_allowed_ptr().
      
      Just set the IRQTF_AFFINITY bit in action->thread_flags. The next time
      the thread runs it migrates itself. Solves all of the above problems
      nicely.
      
      Add kerneldoc to irq_set_thread_affinity() while at it.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <new-submission>
  12. 19 Jul, 2009 2 commits
  13. 18 Jul, 2009 5 commits
    • sched: fix nr_uninterruptible accounting of frozen tasks really · 6301cb95
      Committed by Thomas Gleixner
      commit e3c8ca83 (sched: do not count frozen tasks toward load) broke
      the nr_uninterruptible accounting on freeze/thaw. On freeze the task
      is excluded from accounting with a check for (task->flags &
      PF_FROZEN), but that flag is cleared before the task is thawed. So
      while we prevent the task in state TASK_UNINTERRUPTIBLE from being
      accounted to nr_uninterruptible on freeze, we still decrement
      nr_uninterruptible on thaw.
      
      Use a separate flag which is handled by the freezing task itself. Set
      it before calling the scheduler with TASK_UNINTERRUPTIBLE state and
      clear it after we return from frozen state.
      
      Cc: <stable@kernel.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • sched: fix load average accounting vs. cpu hotplug · a468d389
      Committed by Thomas Gleixner
      The new load average code clears rq->calc_load_active on
      CPU_ONLINE. That's wrong as the newly onlined CPU might have got a
      scheduler tick already and accounted the delta to the stale value of
      the time we offlined the CPU.
      
      Clear the value when we cleanup the dead CPU instead. 
      
      Also move the update of the calc_load_update time for the newly online
      CPU to CPU_UP_PREPARE to avoid that the CPU plays catch up with the
      stale update time value.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • profile: Suppress warning about large allocations when profile=1 is specified · e5d490b2
      Committed by Mel Gorman
      When profile= is used, a large buffer is allocated early at
      boot. This can be larger than what the page allocator can
      provide so it prints a warning. However, the caller is able to
      handle the situation so this patch suppresses the warning.
      Signed-off-by: Mel Gorman <mel@csn.ul.ie>
      Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Linux Memory Management List <linux-mm@kvack.org>
      Cc: Heinz Diehl <htd@fancy-poultry.org>
      Cc: David Miller <davem@davemloft.net>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      LKML-Reference: <1247656992-19846-3-git-send-email-mel@csn.ul.ie>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Log vfork as a fork event · ed900c05
      Committed by Anton Blanchard
      Right now we don't output vfork events. Even though we should
      always see an exec after a vfork, we may get perfcounter
      samples between the vfork and exec. These samples can lead to
      some confusion when parsing perfcounter data.
      
      To keep things consistent we should always log a fork event. It
      will result in a little more log data, but is less confusing to
      trace parsing tools.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090716104817.589309391@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_counter: Make sure we dont leak kernel memory to userspace · 413ee3b4
      Committed by Anton Blanchard
      There are a few places we are leaking tiny amounts of kernel
      memory to userspace. This happens when writing out strings
      because we always align the end to 64 bits.
      
      To avoid this we should always use an appropriately sized
      temporary buffer and ensure it is zeroed.
      
      Since d_path assembles the string from the end of the buffer
      backwards, we need to add 64 bits after the buffer to allow for
      alignment.
      
      We also need to copy arch_vma_name to the temporary buffer,
      because if we use it directly we may end up copying to
      userspace a number of bytes after the end of the string
      constant.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090716104817.273972048@samba.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
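The "zeroed, appropriately sized temporary buffer" idea from the last commit can be sketched as follows. This is a user-space illustration only: `align8` and `pack_string` are hypothetical helpers, and the 8-byte rounding models the 64-bit alignment of string records mentioned in the commit message.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Round n up to a multiple of 8 bytes (64 bits). */
static size_t align8(size_t n)
{
    return (n + 7) & ~(size_t)7;
}

/* Copy s, including its terminating NUL, into out and zero every byte
 * up to the aligned length, so the alignment padding never exposes
 * whatever the buffer held before. Returns the aligned length, or 0
 * if out is too small to hold it. */
static size_t pack_string(char *out, size_t cap, const char *s)
{
    assert(out != NULL && s != NULL);
    size_t len = strlen(s) + 1;
    size_t padded = align8(len);

    if (padded > cap)
        return 0;
    memset(out, 0, padded);   /* zero first: padding bytes stay zero */
    memcpy(out, s, len);
    return padded;
}
```

Packing "abc" into a buffer previously filled with junk yields an 8-byte record whose last four bytes are zero rather than leftover data.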