1. 08 2月, 2013 1 次提交
  2. 05 2月, 2013 2 次提交
  3. 04 2月, 2013 1 次提交
  4. 31 1月, 2013 1 次提交
  5. 28 1月, 2013 6 次提交
    • F
      cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker 提交于
      While remotely reading the cputime of a task running in a
      full dynticks CPU, the values stored in utime/stime fields
      of struct task_struct may be stale. Its values may be those
      of the last kernel <-> user transition time snapshot and
      we need to add the tickless time spent since this snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      6a61671b
    • F
      kvm: Prepare to add generic guest entry/exit callbacks · c11f11fc
      Frederic Weisbecker 提交于
      Do some ground preparatory work before adding guest_enter()
      and guest_exit() context tracking callbacks. Those will
      be later used to read the guest cputime safely when we
      run in full dynticks mode.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Gleb Natapov <gleb@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      c11f11fc
    • F
      cputime: Use accessors to read task cputime stats · 6fac4829
      Frederic Weisbecker 提交于
      This is in preparation for the full dynticks feature. While
      remotely reading the cputime of a task running in a full
      dynticks CPU, we'll need to do some extra-computation. This
      way we can account the time it spent tickless in userspace
      since its last cputime snapshot.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      6fac4829
    • F
      cputime: Allow dynamic switch between tick/virtual based cputime accounting · 3f4724ea
      Frederic Weisbecker 提交于
      Allow to dynamically switch between tick and virtual based
      cputime accounting. This way we can provide a kind of "on-demand"
      virtual based cputime accounting. In this mode, the kernel relies
      on the context tracking subsystem to dynamically probe on kernel
      boundaries.
      
      This is in preparation for being able to stop the timer tick in
      more places than just the idle state. Doing so will depend on
      CONFIG_VIRT_CPU_ACCOUNTING_GEN which makes it possible to account
      the cputime without the tick by hooking on kernel/user boundaries.
      
      Depending whether the tick is stopped or not, we can switch between
      tick and vtime based accounting anytime in order to minimize the
      overhead associated to user hooks.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      3f4724ea
    • F
      cputime: Generic on-demand virtual cputime accounting · abf917cd
      Frederic Weisbecker 提交于
      If we want to stop the tick further idle, we need to be
      able to account the cputime without using the tick.
      
      Virtual based cputime accounting solves that problem by
      hooking into kernel/user boundaries.
      
      However implementing CONFIG_VIRT_CPU_ACCOUNTING require
      low level hooks and involves more overhead. But we already
      have a generic context tracking subsystem that is required
      for RCU needs by archs which plan to shut down the tick
      outside idle.
      
      This patch implements a generic virtual based cputime
      accounting that relies on these generic kernel/user hooks.
      
      There are some upsides of doing this:
      
      - This requires no arch code to implement CONFIG_VIRT_CPU_ACCOUNTING
      if context tracking is already built (already necessary for RCU in full
      tickless mode).
      
      - We can rely on the generic context tracking subsystem to dynamically
      (de)activate the hooks, so that we can switch anytime between virtual
      and tick based accounting. This way we don't have the overhead
      of the virtual accounting when the tick is running periodically.
      
      And one downside:
      
      - There is probably more overhead than a native virtual based cputime
      accounting. But this relies on hooks that are already set anyway.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      abf917cd
    • F
      cputime: Move default nsecs_to_cputime() to jiffies based cputime file · ae8dda5c
      Frederic Weisbecker 提交于
      If the architecture doesn't provide an implementation of
      nsecs_to_cputime(), the cputime accounting core uses a
      default one that converts the nanoseconds to jiffies. However
      this only makes sense if we use the jiffies based cputime.
      
      For now it doesn't matter much because this API is only
      called on code that uses jiffies based cputime accounting.
      
      But the code may evolve and this API may be used more
      broadly in the future. Keeping this default implementation
      around is very error prone as it may introduce a bug and
      hide it on architectures that don't override this API.
      
      Fix this by moving this definition to the jiffies based
      cputime headers as it is the only place where it belongs to.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      ae8dda5c
  6. 27 1月, 2013 1 次提交
    • F
      cputime: Avoid multiplication overflow on utime scaling · 62188451
      Frederic Weisbecker 提交于
      We scale stime, utime values based on rtime (sum_exec_runtime
      converted to jiffies). During scaling we multiple rtime * utime,
      which seems to be fine, since both values are converted to u64,
      but it's not.
      
      Let assume HZ is 1000 - 1ms tick. Process consist of 64 threads,
      run for 1 day, threads utilize 100% cpu on user space. Machine
      has 64 cpus.
      
      Process rtime = utime will be 64 * 24 * 60 * 60 * 1000 jiffies,
      which is 0x149970000. Multiplication rtime * utime result is
      0x1a855771100000000, which can not be covered in 64 bits.
      
      Result of overflow is stall of utime values visible in user
      space (prev_utime in kernel), even if application still consume
      lot of CPU time.
      
      A solution to solve this is to perform the multiplication on
      stime instead of utime. It's easy to grow the utime value fast
      with a CPU bound thread in userspace for example. Now we assume
      that doing so with stime is much harder. In most cases a task
      shouldn't ever spend much time in kernel space as it tends to
      sleep waiting for jobs completion when they take long to
      achieve. IO is the typical example of that.
      
      Hence scaling the cputime by performing the multiplication on
      stime instead of utime should considerably reduce the chances of
      an overflow on most workloads.
      
      This is largely inspired by a patch from Stanislaw Gruszka:
      http://lkml.kernel.org/r/20130107113144.GA7544@redhat.comInspired-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Reported-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Acked-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1359217182-25184-1-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      62188451
  7. 25 1月, 2013 3 次提交
    • Y
      sched/rt: Avoid updating RT entry timeout twice within one tick period · 57d2aa00
      Ying Xue 提交于
      The issue below was found in 2.6.34-rt rather than mainline rt
      kernel, but the issue still exists upstream as well.
      
      So please let me describe how it was noticed on 2.6.34-rt:
      
      On this version, each softirq has its own thread, it means there
      is at least one RT FIFO task per cpu. The priority of these
      tasks is set to 49 by default. If user launches an RT FIFO task
      with priority lower than 49 of softirq RT tasks, it's possible
      there are two RT FIFO tasks enqueued one cpu runqueue at one
      moment. By current strategy of balancing RT tasks, when it comes
      to RT tasks, we really need to put them off to a CPU that they
      can run on as soon as possible. Even if it means a bit of cache
      line flushing, we want RT tasks to be run with the least latency.
      
      When the user RT FIFO task which just launched before is
      running, the sched timer tick of the current cpu happens. In this
      tick period, the timeout value of the user RT task will be
      updated once. Subsequently, we try to wake up one softirq RT
      task on its local cpu. As the priority of current user RT task
      is lower than the softirq RT task, the current task will be
      preempted by the higher priority softirq RT task. Before
      preemption, we check to see if current can readily move to a
      different cpu. If so, we will reschedule to allow the RT push logic
      to try to move current somewhere else. Whenever the woken
      softirq RT task runs, it first tries to migrate the user FIFO RT
      task over to a cpu that is running a task of lesser priority. If
      migration is done, it will send a reschedule request to the found
      cpu by IPI interrupt. Once the target cpu responds the IPI
      interrupt, it will pick the migrated user RT task to preempt its
      current task. When the user RT task is running on the new cpu,
      the sched timer tick of the cpu fires. So it will tick the user
      RT task again. This also means the RT task timeout value will be
      updated again. As the migration may be done in one tick period,
      it means the user RT task timeout value will be updated twice
      within one tick.
      
      If we set a limit on the amount of cpu time for the user RT task
      by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
      upon reaching the soft limit.
      
      But exactly when the SIGXCPU signal should be sent depends on the
      RT task timeout value. In fact the timeout mechanism of sending
      the SIGXCPU signal assumes the RT task timeout is increased once
      every tick.
      
      However, currently the timeout value may be added twice per
      tick. So it results in the SIGXCPU signal being sent earlier
      than expected.
      
      To solve this issue, we prevent the timeout value from increasing
      twice within one tick time by remembering the jiffies value of
      last updating the timeout. As long as the RT task's jiffies is
      different with the global jiffies value, we allow its timeout to
      be updated.
      Signed-off-by: NYing Xue <ying.xue@windriver.com>
      Signed-off-by: NFan Du <fan.du@windriver.com>
      Reviewed-by: NYong Zhang <yong.zhang0@gmail.com>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      57d2aa00
    • V
      sched/fair: Set se->vruntime directly in place_entity() · 16c8f1c7
      Viresh Kumar 提交于
      We are first storing the new vruntime in a variable and then
      storing it in se->vruntime. Simply update se->vruntime directly.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Cc: linaro-dev@lists.linaro.org
      Cc: patches@linaro.org
      Cc: peterz@infradead.org
      Link: http://lkml.kernel.org/r/ae59db1945518d6f6250920d46eb1f1a9cc0024e.1352361704.git.viresh.kumar@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      16c8f1c7
    • K
      sched/rt: Add reschedule check to switched_from_rt() · 1158ddb5
      Kirill Tkhai 提交于
      Reschedule rq->curr if the first RT task has just been
      pulled to the rq.
      Signed-off-by: NKirill V Tkhai <tkhai@yandex.ru>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tkhai Kirill <tkhai@yandex.ru>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/118761353614535@web28f.yandex.ruSigned-off-by: NIngo Molnar <mingo@kernel.org>
      1158ddb5
  8. 24 1月, 2013 1 次提交
    • Z
      sched: Fix the broken sched_rr_get_interval() · a59f4e07
      Zhu Yanhai 提交于
      The caller of sched_sliced() should pass se.cfs_rq and se as the
      arguments, however in sched_rr_get_interval() we gave it
      rq.cfs_rq and se, which made the following computation obviously
      wrong.
      
      The change was introduced by commit:
      
        77034937 sched: fix crash in sys_sched_rr_get_interval()
      
      ... 5 years ago, while it had been the correct 'cfs_rq_of' before
      the commit. The change seems to be irrelevant to the commit
      msg, which was to return a 0 timeslice for tasks that are on an
      idle runqueue. So I believe that was just a plain typo.
      Signed-off-by: NZhu Yanhai <gaoyang.zyh@taobao.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul Turner <pjt@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/1357621012-15039-1-git-send-email-gaoyang.zyh@taobao.com
      [ Since this is an ABI and an old bug, we'll test this via a
        slow upstream route, to hopefully discover any app breakage. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      a59f4e07
  9. 23 1月, 2013 1 次提交
  10. 20 12月, 2012 1 次提交
  11. 18 12月, 2012 1 次提交
  12. 14 12月, 2012 1 次提交
  13. 11 12月, 2012 11 次提交
    • M
      mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
      Mel Gorman 提交于
      Due to the fact that migrations are driven by the CPU a task is running
      on there is no point tracking NUMA faults until one task runs on a new
      node. This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      5bca2303
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG · 3105b86a
      Mel Gorman 提交于
      The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
      depends on scheduling debug being enabled but it's perfectly legimate to
      disable automatic NUMA balancing even without this option. This should
      take care of it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      3105b86a
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman 提交于
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existance
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      1a687c2e
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman 提交于
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      b8593bfd
    • M
      sched: numa: Slowly increase the scanning period as NUMA faults are handled · fb003b80
      Mel Gorman 提交于
      Currently the rate of scanning for an address space is controlled
      by the individual tasks. The next scan is simply determined by
      2*p->numa_scan_period.
      
      The 2*p->numa_scan_period is arbitrary and never changes. At this point
      there is still no proper policy that decides if a task or process is
      properly placed. It just scans and assumes the next NUMA fault will
      place it properly. As it is assumed that pages will get properly placed
      over time, increase the scan window each time a fault is incurred. This
      is a big assumption as noted in the comments.
      
      It should be noted that changing to p->numa_scan_period will increase
      system CPU usage because now the scanning rate has effectively doubled.
      If that is a problem then the min_rate should be made 200ms instead of
      restoring the 2* logic.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      fb003b80
    • M
      mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
      Mel Gorman 提交于
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      e14808b4
    • P
      mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra 提交于
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, so does their NUMA placement have to change and so
      does it start to matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      4b96a29b
    • M
      sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
      Mel Gorman 提交于
      By accounting against the present PTEs, scanning speed reflects the
      actual present (mapped) memory.
      Suggested-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      9f40604c
    • P
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
      Peter Zijlstra 提交于
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
      address of the mm then it then it starts over at the first
      address.
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
      This, beyond being adaptive, also rate-limits rarely
      executing systems and does not over-sample on overloaded
      systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
      Based-on-idea-by: NAndrea Arcangeli <aarcange@redhat.com>
      Bug-Found-By: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      6e5fb223
    • P
      mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra 提交于
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler that when faulted later will be faulted onto the
      node the CPU is running on.  In itself this does nothing useful but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      cbee9f88
    • I
      Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled" · c1ad41f1
      Ingo Molnar 提交于
      This reverts commit 5258f386,
      because the underlying autogroups bug got fixed upstream in
      a better way, via:
      
        fd8ef117 Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
      
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      c1ad41f1
  14. 04 12月, 2012 1 次提交
    • M
      Revert "sched, autogroup: Stop going ahead if autogroup is disabled" · fd8ef117
      Mike Galbraith 提交于
      This reverts commit 800d4d30.
      
      Between commits 8323f26c ("sched: Fix race in task_group()") and
      800d4d30 ("sched, autogroup: Stop going ahead if autogroup is
      disabled"), autogroup is a wreck.
      
      With both applied, all you have to do to crash a box is disable
      autogroup during boot up, then reboot..  boom, NULL pointer dereference
      due to commit 800d4d30 not allowing autogroup to move things, and
      commit 8323f26c making that the only way to switch runqueues:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90
        Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502
        RIP: effective_load.isra.43+0x50/0x90
        Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0)
        Call Trace:
          select_task_rq_fair+0x255/0x780
          try_to_wake_up+0x156/0x2c0
          wake_up_state+0xb/0x10
          signal_wake_up+0x28/0x40
          complete_signal+0x1d6/0x250
          __send_signal+0x170/0x310
          send_signal+0x40/0x80
          do_send_sig_info+0x47/0x90
          group_send_sig_info+0x4a/0x70
          kill_pid_info+0x3a/0x60
          sys_kill+0x97/0x1a0
          ? vfs_read+0x120/0x160
          ? sys_read+0x45/0x90
          system_call_fastpath+0x16/0x1b
        Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 <48> 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c
        RIP  [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90
         RSP <ffff880221ddfbd8>
        CR2: 0000000000000000
      Signed-off-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@vger.kernel.org # 2.6.39+
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fd8ef117
  15. 01 12月, 2012 1 次提交
    • F
      context_tracking: New context tracking susbsystem · 91d1aa43
      Frederic Weisbecker 提交于
      Create a new subsystem that probes on kernel boundaries
      to keep track of the transitions between level contexts
      with two basic initial contexts: user or kernel.
      
      This is an abstraction of some RCU code that use such tracking
      to implement its userspace extended quiescent state.
      
      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement an "on
      demand" generic virtual cputime accounting. A necessary step to
      shutdown the tick while still accounting the cputime.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
  16. 29 11月, 2012 4 次提交
    • F
      cputime: Comment cputime's adjusting code · fa092057
      Frederic Weisbecker 提交于
      The reason for the scaling and monotonicity correction performed
      by cputime_adjust() may not be immediately clear to the reviewer.
      
      Add some comments to explain what happens there.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      fa092057
    • F
      cputime: Consolidate cputime adjustment code · d37f761d
      Frederic Weisbecker 提交于
      task_cputime_adjusted() and thread_group_cputime_adjusted()
      essentially share the same code. They just don't use the same
      source:
      
      * The first function uses the cputime in the task struct and the
      previous adjusted snapshot that ensures monotonicity.
      
      * The second adds the cputime of all tasks in the group and the
      previous adjusted snapshot of the whole group from the signal
      structure.
      
      Just consolidate the common code that does the adjustment. These
      functions just need to fetch the values from the appropriate
      source.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      d37f761d
    • F
      cputime: Rename thread_group_times to thread_group_cputime_adjusted · e80d0a1a
      Frederic Weisbecker 提交于
      We have thread_group_cputime() and thread_group_times(). The naming
      doesn't provide enough information about the difference between
      these two APIs.
      
      To lower the confusion, rename thread_group_times() to
      thread_group_cputime_adjusted(). This name better suggests that
      it's a version of thread_group_cputime() that does some stabilization
      on the raw cputime values. ie here: scale on top of CFS runtime
      stats and bound lower value for monotonicity.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      e80d0a1a
    • F
      cputime: Move thread_group_cputime() to sched code · a634f933
      Frederic Weisbecker 提交于
      thread_group_cputime() is a general cputime API that is not only
      used by posix cpu timer. Let's move this helper to sched code.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      a634f933
  17. 28 11月, 2012 1 次提交
  18. 20 11月, 2012 2 次提交