1. 09 Oct 2013 (6 commits)
    • sched/numa: Set the scan rate proportional to the memory usage of the task being scanned · 598f0ec0
      By Mel Gorman
      The NUMA PTE scan rate is controlled by a combination of the
      numa_balancing_scan_period_min, numa_balancing_scan_period_max and
      numa_balancing_scan_size tunables. This scan rate is independent of
      the size of the task, and matters are further complicated by the fact
      that numa_balancing_scan_size controls how many pages are marked
      pte_numa, not how much virtual memory is scanned.
      
      In combination, it is almost impossible to meaningfully tune the min
      and max scan periods, and reasoning about performance is complex when
      the time to complete a full scan is partially a function of the task's
      memory size. This patch alters the semantics of the min and max
      tunables so that they tune the length of time it takes to complete a
      scan of a task's occupied virtual address space. Conceptually this is
      a lot easier to understand. There is a "sanity" check to ensure the
      scan rate is never extremely fast, based on the amount of virtual
      memory that should be scanned in a second. The default of 2.5G seems
      arbitrary, but it was chosen so that the maximum scan rate after the
      patch roughly matches the maximum scan rate before the patch was
      applied. (A sketch of the resulting arithmetic follows this entry.)
      
      On a similar note, numa_scan_period is in milliseconds, not jiffies.
      Properly placed pages slow the scanning rate, but adding 10 jiffies to
      numa_scan_period means that the rate at which scanning slows depends
      on HZ, which is confusing. Get rid of the jiffies_to_msecs conversion
      and treat it as ms.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      598f0ec0
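      A minimal userspace sketch of the arithmetic described above, using
      assumed default tunable values; the constants, helper names and
      clamping order are illustrative rather than the kernel's exact code:

        #include <stdio.h>

        #define SCAN_SIZE_MB         256  /* numa_balancing_scan_size_mb (assumed default) */
        #define SCAN_PERIOD_MIN_MS  1000  /* numa_balancing_scan_period_min_ms (assumed)   */
        #define MAX_SCAN_MB_PER_SEC 2560  /* the ~2.5G/s sanity cap from the changelog     */

        /* Number of scan windows needed to cover the task's resident memory. */
        static unsigned long nr_scan_windows(unsigned long rss_mb)
        {
            unsigned long windows = (rss_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;
            return windows ? windows : 1;
        }

        /* Delay between windows so that covering the task's occupied memory
         * takes roughly the configured minimum period, clamped so the
         * effective scan rate never exceeds the sanity cap. */
        static unsigned long scan_delay_ms(unsigned long rss_mb)
        {
            unsigned long delay = SCAN_PERIOD_MIN_MS / nr_scan_windows(rss_mb);
            unsigned long floor = SCAN_SIZE_MB * 1000UL / MAX_SCAN_MB_PER_SEC;
            return delay > floor ? delay : floor;
        }

        int main(void)
        {
            unsigned long rss;

            /* Larger tasks get shorter per-window delays (a faster scan)
             * until the rate cap kicks in. */
            for (rss = 256; rss <= 8192; rss *= 2)
                printf("rss=%5luMB -> %lums between %dMB windows\n",
                       rss, scan_delay_ms(rss), SCAN_SIZE_MB);
            return 0;
        }

      With these assumed defaults a 256MB task takes a full second per
      window, while an 8GB task bottoms out at the 100ms floor the cap
      implies.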
    • sched/numa: Initialise numa_next_scan properly · 7e8d16b6
      By Mel Gorman
      The scan delay logic and resets are currently initialised so that
      scanning starts immediately instead of after the intended delay.
      Initialise them properly at fork time, and catch the case where a new
      mm has been allocated. (A sketch of the fork-time initialisation
      follows this entry.)
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7e8d16b6
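      A hedged userspace sketch of the fork-time initialisation described
      above; mm_init_numa(), the HZ value and the sysctl default are
      stand-ins for illustration, not the actual kernel code:

        #include <stdio.h>

        typedef unsigned long jiffies_t;

        static jiffies_t jiffies;                       /* current tick count (stub)  */
        static const unsigned long HZ = 250;            /* assumed tick rate          */
        static unsigned long sysctl_numa_balancing_scan_delay = 1000; /* ms (assumed) */

        struct mm {
            jiffies_t numa_next_scan;
        };

        static jiffies_t msecs_to_jiffies(unsigned long ms)
        {
            return ms * HZ / 1000;
        }

        /* Fork-time init: defer the first scan by scan_delay instead of
         * leaving numa_next_scan at 0, which would mean "scan immediately"
         * and needlessly punish short-lived tasks. */
        static void mm_init_numa(struct mm *mm)
        {
            mm->numa_next_scan = jiffies +
                msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
        }

        int main(void)
        {
            struct mm mm;

            jiffies = 10000;
            mm_init_numa(&mm);
            printf("first scan allowed at jiffy %lu (now %lu)\n",
                   mm.numa_next_scan, jiffies);
            return 0;
        }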
    • Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" · b726b7df
      By Mel Gorman
      PTE scanning and NUMA hinting fault handling are expensive, so commit
      5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
      on a new node") deferred the PTE scan until a task had been scheduled on
      another node. The problem is that in the purely shared memory case this
      may never happen, and no NUMA hinting fault information will be
      captured. We are not ruling out the possibility that something better
      can be done here, but for now this patch reverts the change and depends
      entirely on the scan_delay to avoid punishing short-lived processes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b726b7df
    • sched/numa: Continue PTE scanning even if migrate rate limited · 9e645ab6
      By Peter Zijlstra
      Avoiding marking PTEs pte_numa because a particular NUMA node is
      migrate-rate-limited seems like a bad idea. Even if this node can no
      longer migrate, other nodes might, and we want up-to-date information
      for making balancing decisions. We already rate-limit the actual
      migrations; this should leave enough bandwidth for the non-migrating
      scanning. It is important to keep the information up to date if we are
      going to base placement on it. (A sketch of this separation follows
      this entry.)
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9e645ab6
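      A small sketch of the separation the patch argues for, with made-up
      function names: throttling is applied where pages actually move,
      while the scanner keeps marking PTEs so the fault statistics stay
      current:

        #include <stdbool.h>
        #include <stdio.h>

        static bool node_migrate_rate_limited = true;  /* assumed per-node state */

        /* Scanner: always mark PTEs pte_numa. Before the patch, a rate-limit
         * check here aborted the scan and starved the placement statistics. */
        static void task_numa_scan(void)
        {
            printf("marking PTEs pte_numa\n");
        }

        /* Hinting fault: the fault is always recorded for placement
         * decisions; only the expensive page migration itself is throttled. */
        static void numa_hinting_fault(void)
        {
            printf("recording fault in ->numa_faults[]\n");
            if (node_migrate_rate_limited) {
                printf("migration skipped (rate limited)\n");
                return;
            }
            printf("migrating page to the faulting node\n");
        }

        int main(void)
        {
            task_numa_scan();
            numa_hinting_fault();
            return 0;
        }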
    • sched/numa: Mitigate chance that same task always updates PTEs · 19a78d11
      By Peter Zijlstra
      With a trace_printk("working\n"); right after the cmpxchg in
      task_numa_work() we can see that, in a 4-thread process, it is always
      the same task winning the race and doing the protection change.

      This is a problem, since the task doing the protection change pays a
      penalty for taking faults -- it is busy while marking the PTEs. If it
      is always the same task, the ->numa_faults[] statistics get severely
      skewed.

      Avoid this by delaying the task that did the protection change, so
      that it is unlikely to win the privilege again (see the sketch after
      this entry).
      
      Before:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
            thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
            thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
            thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
            thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
            thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
            thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
            thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
            thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
            thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
            thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
            thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
            thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
            thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
            thread 0/0-3232  [022] ....   214.209342: task_numa_work: working
      
      After:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
            thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
            thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
            thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
            thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
            thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
            thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
            thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
            thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
            thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
            thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
            thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
            thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
            thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
            thread 0/3-3256  [024] ....   138.267207: task_numa_work: working
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19a78d11
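      A hedged pthread sketch of the mechanism: threads race on a
      compare-and-swap for the right to do the expensive update, and the
      winner pushes its own next-eligible stamp out so that a sibling is
      likely to win the following round. The round/barrier structure is
      purely illustrative:

        #include <pthread.h>
        #include <stdatomic.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define ROUNDS   8

        static pthread_barrier_t bar;
        static atomic_ulong next_scan;      /* analogue of mm->numa_next_scan */

        static void *worker(void *arg)
        {
            int id = (int)(long)arg;
            unsigned long my_stamp = 0;     /* analogue of p->node_stamp */
            unsigned long now;

            for (now = 1; now <= ROUNDS; now++) {
                pthread_barrier_wait(&bar);         /* start the round together */
                if (my_stamp < now) {               /* our penalty has expired */
                    unsigned long expected = now - 1;
                    /* Exactly one eligible thread wins the cmpxchg... */
                    if (atomic_compare_exchange_strong(&next_scan,
                                                       &expected, now)) {
                        printf("round %lu: thread %d updates PTEs\n", now, id);
                        /* ...and penalises itself so that it is unlikely
                         * to win again immediately. */
                        my_stamp = now + 2;
                    }
                }
                pthread_barrier_wait(&bar);         /* end the round together */
            }
            return NULL;
        }

        int main(void)
        {
            pthread_t t[NTHREADS];
            long i;

            pthread_barrier_init(&bar, NULL, NTHREADS);
            for (i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            for (i = 0; i < NTHREADS; i++)
                pthread_join(t[i], NULL);
            pthread_barrier_destroy(&bar);
            return 0;
        }

      Build with -pthread. Without the self-penalty assignment, the same
      thread tends to win every round, which is the skew visible in the
      "Before" trace above.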
    • sched/numa: Fix comments · c69307d5
      By Peter Zijlstra
      Fix an 80-column violation and a PTE vs. PMD reference.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c69307d5
  2. 06 Oct 2013 (1 commit)
  3. 25 Sep 2013 (7 commits)
  4. 20 Sep 2013 (7 commits)
  5. 16 Sep 2013 (1 commit)
  6. 13 Sep 2013 (7 commits)
  7. 10 Sep 2013 (1 commit)
  8. 04 Sep 2013 (1 commit)
    • sched/cputime: Do not scale when utime == 0 · 5a8e01f8
      By Stanislaw Gruszka
      scale_stime() silently assumes that stime < rtime; otherwise, when
      stime == rtime and both values are big enough that operations on them
      no longer fit in 32 bits, the resulting scaled stime can be bigger
      than rtime. As a consequence, utime = rtime - stime results in a
      negative value.
      
      User-space-visible symptoms of the bug are overflowed TIME values in
      ps/top, for example:
      
       $ ps aux | grep rcu
       root         8  0.0  0.0      0     0 ?        S    12:42   0:00 [rcuc/0]
       root         9  0.0  0.0      0     0 ?        S    12:42   0:00 [rcub/0]
       root        10 62422329  0.0  0     0 ?        R    12:42 21114581:37 [rcu_preempt]
       root        11  0.1  0.0      0     0 ?        S    12:42   0:02 [rcuop/0]
       root        12 62422329  0.0  0     0 ?        S    12:42 21114581:35 [rcuop/1]
      
      or overflowed utime values read directly from /proc/$PID/stat (a
      sketch of the guard follows this entry).
      
      Reference:
      
        https://lkml.org/lkml/2013/8/20/259
      Reported-and-tested-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: stable@vger.kernel.org
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Link: http://lkml.kernel.org/r/20130904131602.GC2564@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5a8e01f8
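      A hedged sketch of the guard described above. The 128-bit scaling
      below is exact, so it only demonstrates the short-circuit logic; in
      the kernel it is the lossy 32-bit approximation inside scale_stime()
      that can push the scaled stime past rtime:

        #include <stdint.h>
        #include <stdio.h>

        /* Same shape as the kernel's scale_stime(): stime * rtime / total. */
        static uint64_t scale_stime(uint64_t stime, uint64_t rtime, uint64_t total)
        {
            return (uint64_t)(((__uint128_t)stime * rtime) / total);
        }

        static void cputime_adjust(uint64_t stime, uint64_t utime, uint64_t rtime)
        {
            uint64_t s, u;

            if (utime == 0) {
                s = rtime;          /* all system time: nothing to scale */
                u = 0;
            } else if (stime == 0) {
                u = rtime;          /* all user time: nothing to scale */
                s = 0;
            } else {
                s = scale_stime(stime, rtime, stime + utime);
                u = rtime - s;      /* safe only while s <= rtime holds */
            }
            printf("stime=%llu utime=%llu\n",
                   (unsigned long long)s, (unsigned long long)u);
        }

        int main(void)
        {
            /* utime == 0 with a large stime == rtime: without the guard, a
             * lossy scale_stime() can return s > rtime, making u wrap to a
             * huge unsigned value (the overflowed TIME columns above). */
            cputime_adjust(1ULL << 40, 0, 1ULL << 40);
            cputime_adjust(1000, 3000, 8000);
            return 0;
        }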
  9. 02 Sep 2013 (9 commits)