1. 18 Jul, 2009 3 commits
  2. 11 Jul, 2009 2 commits
  3. 01 Jul, 2009 1 commit
  4. 19 Jun, 2009 2 commits
  5. 17 Jun, 2009 2 commits
    • oom: move oom_adj value from task_struct to mm_struct · 2ff05b2b
      David Rientjes authored
      The per-task oom_adj value is a characteristic of its mm more than the
      task itself since it's not possible to oom kill any thread that shares the
      mm.  If a task were to be killed while attached to an mm that could not be
      freed because another thread were set to OOM_DISABLE, it would have
      needlessly been terminated since there is no potential for future memory
      freeing.
      
      This patch moves oomkilladj (now more appropriately named oom_adj) from
      struct task_struct to struct mm_struct.  This requires task_lock() on a
      task to check its oom_adj value to protect against exec, but it's already
      necessary to take the lock when dereferencing the mm to find the total VM
      size for the badness heuristic.
      
      This fixes a livelock if the oom killer chooses a task and another thread
      sharing the same memory has an oom_adj value of OOM_DISABLE.  This occurs
      because oom_kill_task() repeatedly returns 1 and refuses to kill the
      chosen task while select_bad_process() will repeatedly choose the same
      task during the next retry.
      
      Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
      oom_kill_task() to check for threads sharing the same memory will be
      removed in the next patch in this series where it will no longer be
      necessary.
      
      Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
      these threads are immune from oom killing already.  They simply report an
      oom_adj value of OOM_DISABLE.
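
      The effect of keeping the value on the mm can be modelled in plain C.
      This is a minimal userspace sketch, not kernel code: struct mm_sketch,
      struct task_sketch and task_oom_killable() are hypothetical names, and
      only the OOM_DISABLE value (-17 in kernels of this era) is taken from
      the kernel.

```c
#include <stddef.h>

#define OOM_DISABLE (-17)   /* kernel value: never select this mm */

struct mm_sketch   { int oom_adj; };          /* one value per address space */
struct task_sketch { struct mm_sketch *mm; }; /* NULL mm models a kthread */

/* A task may be selected only if it has an mm and that mm is not exempt. */
static int task_oom_killable(const struct task_sketch *t)
{
    return t->mm != NULL && t->mm->oom_adj != OOM_DISABLE;
}
```

      Because every thread that shares the mm reads the same field, setting
      OOM_DISABLE once exempts all of them together, so the selection loop
      can no longer pick a task that it will then refuse to kill.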
      
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • cpuset,mm: update tasks' mems_allowed in time · 58568d2a
      Miao Xie authored
      Fix allocating page cache/slab object on the unallowed node when memory
      spread is set by updating tasks' mems_allowed after its cpuset's mems is
      changed.
      
      In order to update tasks' mems_allowed in time, we must modify the
      memory policy code, because the memory policy was originally applied
      only in the process's own context.  After applying this patch, one task
      directly manipulates another's mems_allowed, and we use alloc_lock in
      the task_struct to protect the task's mems_allowed and memory policy.

      In the fast path, however, we do not take a lock to protect them,
      because adding a lock may lead to a performance regression.  Without a
      lock, the task might see no nodes when its cpuset's mems_allowed is
      changed to some non-overlapping set.  To avoid this, we first set all
      newly allowed nodes, then clear the newly disallowed ones.
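
      The set-then-clear ordering can be illustrated with a plain bitmask.
      This is a single-threaded sketch under stated assumptions: the
      nodemask typedef and update_mems_allowed() are hypothetical stand-ins
      for the kernel's nodemask_t handling, and the memory barriers a real
      lockless reader would need are omitted.

```c
#include <stdint.h>

/* Hypothetical stand-in for nodemask_t: one bit per NUMA node. */
typedef uint64_t nodemask;

/*
 * Move *mask to newmask in two steps so that the value is never empty
 * in between:
 *   1. OR in the newly allowed nodes (old and new nodes are both set),
 *   2. AND away the nodes that are no longer allowed.
 * The intermediate value is copied to *observed_mid for illustration.
 */
static void update_mems_allowed(volatile nodemask *mask, nodemask newmask,
                                nodemask *observed_mid)
{
    *mask |= newmask;               /* step 1: set all newly allowed nodes */
    if (observed_mid)
        *observed_mid = *mask;
    *mask &= newmask;               /* step 2: clear newly disallowed nodes */
}
```

      Going from nodes {0,1} to the non-overlapping set {2,3}, the value in
      between is {0,1,2,3} rather than the empty set that a clear-then-set
      order would expose.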
      
      [lee.schermerhorn@hp.com:
        The rework of mpol_new() to extract the adjusting of the node mask to
        apply cpuset and mpol flags "context" breaks set_mempolicy() and mbind()
        with MPOL_PREFERRED and a NULL nodemask--i.e., explicit local
        allocation.  Fix this by adding the check for MPOL_PREFERRED and empty
        node mask to mpol_new_mempolicy().
      
        Remove the now unneeded 'nodes = NULL' from mpol_new().
      
        Note that mpol_new_mempolicy() is always called with a non-NULL
        'nodes' parameter now that it has been removed from mpol_new().
        Therefore, we don't need to test nodes for NULL before testing it for
        'empty'.  However, just to be extra paranoid, add a VM_BUG_ON() to
        verify this assumption.]
      [lee.schermerhorn@hp.com:
      
        I don't think the function name 'mpol_new_mempolicy' is descriptive
        enough to differentiate it from mpol_new().
      
        This function applies cpuset set context, usually constraining nodes
        to those allowed by the cpuset.  However, when the 'RELATIVE_NODES' flag
        is set, it also translates the nodes.  So I settled on
        'mpol_set_nodemask()', because the comment block for mpol_new() mentions
        that we need to call this function to "set nodes".
      
        Some additional minor line length, whitespace and typo cleanup.]
      Signed-off-by: Miao Xie <miaox@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Paul Menage <menage@google.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 16 Jun, 2009 1 commit
    • sched: delayed cleanup of user_struct · 3959214f
      Kay Sievers authored
      During bootup performance tracing we see repeated occurrences of
      /sys/kernel/uid/* events for the same uid, leading to a,
      in this case, rather pointless userspace processing for the
      same uid over and over.
      
      This is usually caused by tools which change their uid to "nobody",
      to run without privileges to read data supplied by untrusted users.
      
      This change delays the execution of the (already existing) scheduled
      work, to cleanup the uid after one second, so the allocated and announced
      uid can possibly be re-used by another process.
      
      This is the current behavior, where almost every invocation of a
      binary, which changes the uid, creates two events:
        $ read START < /sys/kernel/uevent_seqnum; \
        for i in `seq 100`; do su --shell=/bin/true bin; done; \
        read END < /sys/kernel/uevent_seqnum; \
        echo $(($END - $START))
        178
      
      With the delayed cleanup, we get only two events, and userspace finishes
      a bit faster too:
        $ read START < /sys/kernel/uevent_seqnum; \
        for i in `seq 100`; do su --shell=/bin/true bin; done; \
        read END < /sys/kernel/uevent_seqnum; \
        echo $(($END - $START))
        1
      Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Signed-off-by: Kay Sievers <kay.sievers@vrfy.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
  7. 15 Jun, 2009 1 commit
    • sched: Introduce SCHED_RESET_ON_FORK scheduling policy flag · ca94c442
      Lennart Poettering authored
      This patch introduces a new flag SCHED_RESET_ON_FORK which can be passed
      to the kernel via sched_setscheduler(), ORed in the policy parameter. If
      set this will make sure that when the process forks a) the scheduling
      priority is reset to DEFAULT_PRIO if it was higher and b) the scheduling
      policy is reset to SCHED_NORMAL if it was either SCHED_FIFO or SCHED_RR.
      
      Why have this?
      
      Currently, if a process is real-time scheduled this will 'leak' to all
      its child processes. For security reasons it is often (always?) a good
      idea to make sure that if a process acquires RT scheduling this is
      confined to this process and only this process. More specifically this
      makes the per-process resource limit RLIMIT_RTTIME useful for security
      purposes, because it makes it impossible to use a fork bomb to
      circumvent the per-process RLIMIT_RTTIME accounting.
      
      This feature is also useful for tools like 'renice' which can then
      change the nice level of a process without having this spill to all its
      child processes.
      
      Why expose this via sched_setscheduler() and not other syscalls such as
      prctl() or sched_setparam()?
      
      prctl() does not take a pid parameter. Due to that it would be
      impossible to modify this flag for other processes than the current one.
      
      The struct passed to sched_setparam() can unfortunately not be extended
      without breaking compatibility, since sched_setparam() lacks a size
      parameter.
      
      How to use this from userspace? In your RT program simply replace this:
      
        sched_setscheduler(pid, SCHED_FIFO, &param);
      
      by this:
      
        sched_setscheduler(pid, SCHED_FIFO|SCHED_RESET_ON_FORK, &param);
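
      A slightly fuller version of the call, as a sketch: the helper names
      fifo_reset_on_fork_policy() and set_fifo_reset_on_fork() are
      hypothetical, and the fallback #define (0x40000000, the kernel header
      value) is only there for C libraries that do not expose the flag.
      Requesting SCHED_FIFO normally needs CAP_SYS_NICE or a suitable
      RLIMIT_RTPRIO, so callers should expect -1 and check errno.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <sched.h>
#include <string.h>
#include <sys/types.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000  /* value from the kernel headers */
#endif

/* The flag is ORed into the policy argument, not passed separately. */
static int fifo_reset_on_fork_policy(void)
{
    return SCHED_FIFO | SCHED_RESET_ON_FORK;
}

/*
 * Ask for SCHED_FIFO at the given priority for pid (0 means the caller),
 * with children created by fork() falling back to the default policy.
 * Returns 0 on success and -1 on failure, like sched_setscheduler().
 */
static int set_fifo_reset_on_fork(pid_t pid, int priority)
{
    struct sched_param param;

    memset(&param, 0, sizeof(param));
    param.sched_priority = priority;
    return sched_setscheduler(pid, fifo_reset_on_fork_policy(), &param);
}
```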
      Signed-off-by: Lennart Poettering <lennart@poettering.net>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20090615152714.GA29092@tango.0pointer.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 24 May, 2009 1 commit
  9. 22 May, 2009 1 commit
    • perf_counter: Dynamically allocate tasks' perf_counter_context struct · a63eaf34
      Paul Mackerras authored
      This replaces the struct perf_counter_context in the task_struct with
      a pointer to a dynamically allocated perf_counter_context struct.  The
      main reason for doing this is to allow us to transfer a
      perf_counter_context from one task to another when we do lazy PMU
      switching in a later patch.
      
      This has a few side-benefits: the task_struct becomes a little smaller,
      we save some memory because only tasks that have perf_counters attached
      get a perf_counter_context allocated for them, and we can remove the
      inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
      end up recompiling nearly everything whenever perf_counter.h changes.
      
      The perf_counter_context structures are reference-counted and freed
      when the last reference is dropped.  A context can have references
      from its task and the counters on its task.  Counters can outlive the
      task so it is possible that a context will be freed well after its
      task has exited.
      
      Contexts are allocated on fork if the parent had a context, or
      otherwise the first time that a per-task counter is created on a task.
      In the latter case, we set the context pointer in the task struct
      locklessly using an atomic compare-and-exchange operation in case we
      raced with some other task in creating a context for the subject task.
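
      The lockless install can be sketched with C11 atomics. This is a
      userspace illustration under stated assumptions, not the kernel code:
      struct ctx, struct task and find_get_ctx() are hypothetical and
      heavily reduced, and atomic_compare_exchange_strong() stands in for
      the kernel's compare-and-exchange primitive.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct ctx  { int refcount; };
struct task { _Atomic(struct ctx *) ctxp; };

/*
 * Return the task's context, allocating one on first use.  If two callers
 * race, exactly one compare-and-exchange succeeds; the loser frees its own
 * allocation and adopts the winner's context.
 */
static struct ctx *find_get_ctx(struct task *t)
{
    struct ctx *cur = atomic_load(&t->ctxp);
    if (cur)
        return cur;

    struct ctx *fresh = calloc(1, sizeof(*fresh));
    if (!fresh)
        return NULL;
    fresh->refcount = 1;

    struct ctx *expected = NULL;
    if (atomic_compare_exchange_strong(&t->ctxp, &expected, fresh))
        return fresh;       /* we installed our context */

    free(fresh);            /* somebody else won the race */
    return expected;        /* now holds the winner's pointer */
}
```

      No lock is taken on either path; the only cost of losing the race is
      one wasted allocation.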
      
      This also removes the task pointer from the perf_counter struct.  The
      task pointer was not used anywhere and would make it harder to move a
      context from one task to another.  Anything that needed to know which
      task a counter was attached to was already using counter->ctx->task.
      
      The __perf_counter_init_context function moves up in perf_counter.c
      so that it can be called from find_get_context, and now initializes
      the refcount, but is otherwise unchanged.
      
      We were potentially calling list_del_counter twice: once from
      __perf_counter_exit_task when the task exits and once from
      __perf_counter_remove_from_context when the counter's fd gets closed.
      This adds a check in list_del_counter so it doesn't do anything if
      the counter has already been removed from the lists.
      
      Since perf_counter_task_sched_in doesn't do anything if the task doesn't
      have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
      __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
      the case where the current task adds the first counter to itself and
      thus creates a context for itself.
      
      This also adds similar code to __perf_counter_enable to handle a
      similar situation which can arise when the counters have been disabled
      using prctl; that also leaves cpuctx->task_ctx = NULL.
      
      [ Impact: refactor counter context management to prepare for new feature ]
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <18966.10075.781053.231153@cargo.ozlabs.ibm.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  10. 19 May, 2009 2 commits
  11. 15 May, 2009 3 commits
    • sched, timers: cleanup avenrun users · 2d02494f
      Thomas Gleixner authored
      avenrun is a rough estimate so we don't have to worry about
      consistency of the three avenrun values. Remove the xtime lock
      dependency and provide a function to scale the values. Cleanup the
      users.
      
      [ Impact: cleanup ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
    • sched, timers: move calc_load() to scheduler · dce48a84
      Thomas Gleixner authored
      Dimitri Sivanich noticed that xtime_lock is held write locked across
      calc_load() which iterates over all online CPUs. That can cause long
      latencies for xtime_lock readers on large SMP systems.

      The load average calculation is a rough estimate anyway so there is
      no real need to protect the readers vs. the update. It's not a problem
      when the avenrun array is updated while a reader copies the values.
      
      Instead of iterating over all online CPUs let the scheduler_tick code
      update the number of active tasks shortly before the avenrun update
      happens. The avenrun update itself is handled by the CPU which calls
      do_timer().
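
      For reference, the averaging step itself is a fixed-point exponential
      decay. The sketch below is modelled on the kernel's calc_load() and
      its FSHIFT/EXP_* constants; the function name calc_load_sketch() is
      hypothetical, and the per-CPU bookkeeping this patch introduces is
      left out.

```c
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1 << FSHIFT)      /* 1.0 in fixed point */
#define EXP_1    1884               /* FIXED_1 / exp(5s / 1min) */
#define EXP_5    2014               /* FIXED_1 / exp(5s / 5min) */
#define EXP_15   2037               /* FIXED_1 / exp(5s / 15min) */

/* Split a fixed-point load value into integer part and two decimals. */
#define LOAD_INT(x)  ((x) >> FSHIFT)
#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

/*
 * One update step: decay the old average by exp and blend in the current
 * number of active tasks.  'active' must already be scaled by FIXED_1.
 */
static unsigned long calc_load_sketch(unsigned long load, unsigned long exp,
                                      unsigned long active)
{
    load *= exp;
    load += active * (FIXED_1 - exp);
    return load >> FSHIFT;
}
```

      Starting from zero with one active task, a single step of the
      one-minute average yields 164, which LOAD_INT/LOAD_FRAC render as
      0.08. A reader that copies the three averages while they are being
      updated sees at worst a mix of two adjacent samples, which is why the
      commit treats the missing lock as harmless.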
      
      [ Impact: reduce xtime_lock write locked section ]
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
    • perf_counter: per user mlock gift · 789f90fc
      Peter Zijlstra authored
      Instead of a per-process mlock gift for perf-counters, use a
      per-user gift so that there is less of a DoS potential.
      
      [ Impact: allow less worst-case unprivileged memory consumption ]
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <20090515132018.496182835@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 13 May, 2009 2 commits
    • timers: Logic to move non pinned timers · eea08f32
      Arun R Bharadwaj authored
      * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:
      
      This patch migrates all non pinned timers and hrtimers to the current
      idle load balancer, from all the idle CPUs. Timers firing on busy CPUs
      are not migrated.
      
      When migrating hrtimers, care must be taken to check whether migrating
      an hrtimer would introduce latency. We therefore compare the expiry of
      the hrtimer with the next timer interrupt on the target cpu and migrate
      the hrtimer only if it expires *after* the next interrupt on the target
      cpu. To support this, a clockevents_get_next_event() helper function is
      added to return the next_event on the target cpu's clock_event_device.
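
      The migration rule reduces to a single comparison. A minimal sketch,
      assuming nanosecond timestamps; may_migrate_hrtimer() and its
      parameters are hypothetical names, with target_next_event_ns standing
      for the value the new helper would return.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Decide whether an hrtimer may move to the target CPU.  Pinned timers
 * never move.  An unpinned timer moves only if it expires strictly after
 * the target's next programmed event; otherwise the target would not wake
 * up in time and the timer would fire late.
 */
static bool may_migrate_hrtimer(int64_t timer_expiry_ns,
                                int64_t target_next_event_ns,
                                bool pinned)
{
    if (pinned)
        return false;
    return timer_expiry_ns > target_next_event_ns;
}
```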
      
      [ tglx: cleanups and simplifications ]
      Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • timers: /proc/sys sysctl hook to enable timer migration · cd1bb94b
      Arun R Bharadwaj authored
      * Arun R Bharadwaj <arun@linux.vnet.ibm.com> [2009-04-16 12:11:36]:
      
      This patch creates the /proc/sys sysctl interface at
      /proc/sys/kernel/timer_migration
      
      Timer migration is enabled by default.
      
      To disable timer migration, when CONFIG_SCHED_DEBUG = y,
      
      echo 0 > /proc/sys/kernel/timer_migration
      Signed-off-by: Arun R Bharadwaj <arun@linux.vnet.ibm.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 11 May, 2009 1 commit
  14. 30 Apr, 2009 1 commit
  15. 18 Apr, 2009 1 commit
    • tracing: add same level recursion detection · 261842b7
      Steven Rostedt authored
      The tracing infrastructure allows for recursion. That is, an interrupt
      may interrupt the act of tracing an event, and that interrupt may very well
      perform its own trace. This is a recursive trace, and is fine to do.
      
      The problem arises when there is a bug, and the utility doing the trace
      calls something that recurses back into the tracer. This recursion is not
      caused by an external event like an interrupt, but by code that is not
      expected to recurse. The result could be a lockup.
      
      This patch adds a bitmask to the task structure that keeps track
      of the trace recursion. To find the interrupt depth, the following
      algorithm is used:
      
        level = hardirq_count() + softirq_count() + in_nmi;
      
      Here, level will be the depth of interrupts and softirqs, and even handles
      the nmi. Then the corresponding bit is set in the recursion bitmask.
      If the bit was already set, we know we had a recursion at the same level
      and we warn about it and fail the writing to the buffer.
      
      After the data has been committed to the buffer, we clear the bit.
      No atomics are needed. The only races are with interrupts and they reset
      the bitmask before returning anyway.
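
      The bitmask bookkeeping can be sketched in userspace C. The names
      struct task_sketch, trace_recursive_lock() and
      trace_recursive_unlock() are chosen for illustration, and the caller
      supplies the level directly instead of deriving it from the interrupt
      counters.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the field added to the task structure. */
struct task_sketch { unsigned long trace_recursion; };

/*
 * Claim the bit for this interrupt level.  Returns false if the bit was
 * already set, which means the tracer recursed into itself at the same
 * level and the write should be refused.
 */
static bool trace_recursive_lock(struct task_sketch *t, unsigned int level)
{
    if (level >= sizeof(t->trace_recursion) * 8)
        return false;                   /* level does not fit in the mask */

    unsigned long bit = 1UL << level;
    if (t->trace_recursion & bit)
        return false;                   /* same-level recursion detected */
    t->trace_recursion |= bit;
    return true;
}

/* Release the bit once the event has been committed to the buffer. */
static void trace_recursive_unlock(struct task_sketch *t, unsigned int level)
{
    if (level < sizeof(t->trace_recursion) * 8)
        t->trace_recursion &= ~(1UL << level);
}
```

      An interrupt arriving mid-trace uses a higher level and therefore a
      different bit, so legitimate nested traces still succeed while a
      second attempt at the same level is rejected.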
      
      [ Impact: detect same irq level trace recursion ]
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  16. 15 Apr, 2009 1 commit
  17. 09 Apr, 2009 1 commit
    • sched: do not count frozen tasks toward load · e3c8ca83
      Nathan Lynch authored
      Freezing tasks via the cgroup freezer causes the load average to climb
      because the freezer's current implementation puts frozen tasks in
      uninterruptible sleep (D state).
      
      Some applications which perform job-scheduling functions consult the
      load average when making decisions.  If a cgroup is frozen, the load
      average does not provide a useful measure of the system's utilization
      to such applications.  This is especially inconvenient if the job
      scheduler employs the cgroup freezer as a mechanism for preempting low
      priority jobs.  Contrast this with using SIGSTOP for the same purpose:
      the stopped tasks do not count toward system load.
      
      Change task_contributes_to_load() to return false if the task is
      frozen.  This results in /proc/loadavg behavior that better meets
      users' expectations.
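
      The changed predicate amounts to one extra flag test. A minimal
      sketch: the _SK constants and struct task_sketch are hypothetical,
      with illustrative bit values rather than the kernel's definitions.

```c
#include <stdbool.h>

#define TASK_UNINTERRUPTIBLE_SK 0x2      /* illustrative state bit (D state) */
#define PF_FROZEN_SK            0x10000  /* illustrative "frozen" flag bit */

struct task_sketch {
    unsigned long state;
    unsigned long flags;
};

/* A task counts toward the load average only if it sleeps
 * uninterruptibly and has not been put there by the freezer. */
static bool contributes_to_load(const struct task_sketch *t)
{
    return (t->state & TASK_UNINTERRUPTIBLE_SK) != 0 &&
           (t->flags & PF_FROZEN_SK) == 0;
}
```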
      Signed-off-by: Nathan Lynch <ntl@pobox.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Nigel Cunningham <nigel@tuxonice.net>
      Tested-by: Nigel Cunningham <nigel@tuxonice.net>
      Cc: <stable@kernel.org>
      Cc: containers@lists.linux-foundation.org
      Cc: linux-pm@lists.linux-foundation.org
      Cc: Matt Helsley <matthltc@us.ibm.com>
      LKML-Reference: <20090408194512.47a99b95@manatee.lan>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  18. 07 Apr, 2009 3 commits
  19. 06 Apr, 2009 1 commit
  20. 03 Apr, 2009 5 commits
  21. 01 Apr, 2009 2 commits
  22. 24 Mar, 2009 2 commits
    • function-graph: ignore times across schedule · 8aef2d28
      Steven Rostedt authored
      Impact: more accurate timings
      
      The current method of function graph tracing does not take into
      account the time spent when a task is not running. This shows functions
      that call schedule have increased costs:
      
       3) + 18.664 us   |      }
       ------------------------------------------
       3)    <idle>-0    =>  kblockd-123
       ------------------------------------------
      
       3)               |      finish_task_switch() {
       3)   1.441 us    |        _spin_unlock_irq();
       3)   3.966 us    |      }
       3) ! 2959.433 us |    }
       3) ! 2961.465 us |  }
      
      This patch uses the tracepoint in the scheduling context switch to
      account for time that has elapsed while a task is scheduled out.
      Now we see:
      
       ------------------------------------------
       3)    <idle>-0    =>  edac-po-1067
       ------------------------------------------
      
       3)               |      finish_task_switch() {
       3)   0.685 us    |        _spin_unlock_irq();
       3)   2.331 us    |      }
       3) + 41.439 us   |    }
       3) + 42.663 us   |  }
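
      One way to get this effect is to push the recorded entry timestamps
      forward by the time spent scheduled out. The sketch below is a
      userspace model under stated assumptions: struct task_sketch,
      MAX_DEPTH and the three hook functions are hypothetical, and
      timestamps are plain integers.

```c
#include <stdint.h>

#define MAX_DEPTH 8

struct task_sketch {
    uint64_t calltime[MAX_DEPTH]; /* entry times of functions still running */
    int      depth;               /* number of valid entries in calltime */
    uint64_t sched_out_at;        /* when the task was last scheduled out */
};

/* Context-switch hook, outgoing side: remember when we stopped running. */
static void on_sched_out(struct task_sketch *t, uint64_t now)
{
    t->sched_out_at = now;
}

/* Incoming side: shift every pending entry time by the time spent off-CPU,
 * so the sleep does not show up in any function's duration. */
static void on_sched_in(struct task_sketch *t, uint64_t now)
{
    uint64_t slept = now - t->sched_out_at;

    for (int i = 0; i < t->depth; i++)
        t->calltime[i] += slept;
}

/* Function-return hook: duration of the innermost pending function.
 * The caller must ensure at least one function is pending. */
static uint64_t on_function_return(struct task_sketch *t, uint64_t now)
{
    return now - t->calltime[--t->depth];
}
```

      A function entered at time 100 that is scheduled out from 110 to 3000
      and returns at 3040 is reported as taking 50 units instead of 2940.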
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
    • genirq: add threaded interrupt handler support · 3aa551c9
      Thomas Gleixner authored
      Add support for threaded interrupt handlers:
      
      A device driver can request that its main interrupt handler run in a
      thread. To achieve this, the device driver requests the interrupt with
      request_threaded_irq() and provides a thread function in addition to
      the handler. The handler function is called in hard interrupt
      context and needs to check whether the interrupt originated from the
      device. If the interrupt originated from the device then the handler
      can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is
      returned when no further action is required. IRQ_WAKE_THREAD causes
      the genirq code to invoke the threaded (main) handler. When
      IRQ_WAKE_THREAD is returned, the handler must have disabled the
      interrupt on the device level. This is mandatory for shared interrupt
      handlers, but we need to do it as well for obscure x86 hardware where
      disabling an interrupt on the IO_APIC level redirects the interrupt to
      the legacy PIC interrupt lines.
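
      The division of labour between the two functions can be modelled in
      userspace. This is a sketch, not driver code: struct dev_sketch,
      quick_check(), thread_fn() and dispatch() are hypothetical, the enum
      values are local to the example, and dispatch() calls the thread
      function directly where the kernel would wake a kernel thread.

```c
#include <stdbool.h>

/* Local model of the handler return codes; values are illustrative. */
enum irq_result { IRQ_NONE, IRQ_HANDLED, IRQ_WAKE_THREAD };

struct dev_sketch {
    bool irq_pending;   /* device has raised its interrupt */
    bool irq_enabled;   /* device-level interrupt enable bit */
    bool needs_thread;  /* work is too heavy for hard interrupt context */
    int  thread_runs;   /* how often the thread function has run */
};

/* Primary handler: models the part that runs in hard interrupt context. */
static enum irq_result quick_check(struct dev_sketch *d)
{
    if (!d->irq_pending)
        return IRQ_NONE;            /* shared line, not our device */
    if (!d->needs_thread) {
        d->irq_pending = false;     /* nothing further to do */
        return IRQ_HANDLED;
    }
    d->irq_enabled = false;         /* quiesce the device before deferring */
    return IRQ_WAKE_THREAD;
}

/* Thread function: does the main work, then re-enables the device. */
static void thread_fn(struct dev_sketch *d)
{
    d->thread_runs++;
    d->irq_pending = false;
    d->irq_enabled = true;
}

/* Stand-in for the genirq core. */
static enum irq_result dispatch(struct dev_sketch *d)
{
    enum irq_result r = quick_check(d);

    if (r == IRQ_WAKE_THREAD)
        thread_fn(d);
    return r;
}
```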
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Ingo Molnar <mingo@elte.hu>
  23. 12 Mar, 2009 1 commit