1. 09 October 2013, 6 commits
    • sched/numa: Set the scan rate proportional to the memory usage of the task being scanned · 598f0ec0
      Committed by Mel Gorman
      The NUMA PTE scan rate is controlled with a combination of the
      numa_balancing_scan_period_min, numa_balancing_scan_period_max and
      numa_balancing_scan_size. This scan rate is independent of the size
      of the task, and as an aside it is further complicated by the fact that
      numa_balancing_scan_size controls how many pages are marked pte_numa,
      not how much virtual memory is scanned.
      
      In combination, it is almost impossible to meaningfully tune the min and
      max scan periods, and reasoning about performance is complex when the time
      to complete a full scan is partially a function of the task's memory
      size. This patch alters the semantics of the min and max tunables so that
      they tune the length of time it takes to complete a scan of a task's
      occupied virtual address space. Conceptually this is a lot easier to
      understand. There is a "sanity" check, based on the amount of virtual
      memory that should be scanned in a second, to ensure the scan rate is
      never extremely fast. The default of 2.5G seems arbitrary but was chosen
      so that the maximum scan rate after the patch roughly matches the maximum
      scan rate before the patch was applied.

      On a similar note, numa_scan_period is in milliseconds and not
      jiffies. Properly placed pages slow the scanning rate, but adding 10 jiffies
      to numa_scan_period means that the rate at which scanning slows depends on
      HZ, which is confusing. Get rid of the jiffies_to_msec conversion and
      treat it as ms.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-18-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      598f0ec0
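      To make the new semantics concrete, here is a rough standalone C model of
      the arithmetic described above. It is only a sketch: the constants, the
      helper name and the exact formula are assumptions for illustration, not
      the kernel's implementation.

      /*
       * Illustrative model (assumed names and formula, not kernel code): the
       * min/max tunables now describe how long a full scan of the task's
       * occupied virtual memory should take, and the delay between scan
       * windows is derived from that, with a floor so the effective rate
       * never exceeds roughly 2.5GB of virtual memory per second.
       */
      #include <stdio.h>

      #define MSEC_PER_SEC        1000UL
      #define SCAN_SIZE_MB        256UL   /* memory marked per scan window    */
      #define MAX_SCAN_MB_PER_SEC 2560UL  /* "sanity" cap: ~2.5GB per second  */

      /* Delay (ms) between windows so the whole task is covered in scan_period_ms. */
      static unsigned long window_delay_ms(unsigned long task_mb,
                                           unsigned long scan_period_ms)
      {
              unsigned long windows = (task_mb + SCAN_SIZE_MB - 1) / SCAN_SIZE_MB;
              unsigned long delay = scan_period_ms / (windows ? windows : 1);
              unsigned long floor = SCAN_SIZE_MB * MSEC_PER_SEC / MAX_SCAN_MB_PER_SEC;

              return delay > floor ? delay : floor;   /* never scan faster than the cap */
      }

      int main(void)
      {
              /* A 4GB task asked to be fully scanned once per second. */
              printf("window delay: %lums\n", window_delay_ms(4096, 1000));
              return 0;
      }

      With these numbers a 4GB task would be scanned in 256MB windows roughly
      every 100ms, i.e. exactly at the ~2.5GB/sec cap.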
    • sched/numa: Initialise numa_next_scan properly · 7e8d16b6
      Committed by Mel Gorman
      Scan delay logic and resets are currently initialised to start scanning
      immediately instead of delaying properly. Initialise them properly at
      fork time and catch when a new mm has been allocated.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-17-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7e8d16b6
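      A minimal userspace sketch of the idea, with illustrative stand-ins for
      the kernel's mm fields rather than the real mm_struct:

      #include <stdio.h>

      struct mm_model {
              unsigned long numa_next_scan;   /* absolute time (ms) of the first scan */
              unsigned long numa_scan_seq;
      };

      /* Called when a new mm is allocated at fork/exec time. */
      static void mm_init_numa(struct mm_model *mm, unsigned long now_ms,
                               unsigned long scan_delay_ms)
      {
              /* Schedule the first scan scan_delay in the future, not at time zero. */
              mm->numa_next_scan = now_ms + scan_delay_ms;
              mm->numa_scan_seq = 0;
      }

      int main(void)
      {
              struct mm_model mm;

              mm_init_numa(&mm, 5000, 1000);
              printf("first scan at t=%lums\n", mm.numa_next_scan);
              return 0;
      }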
    • Revert "mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node" · b726b7df
      Committed by Mel Gorman
      PTE scanning and NUMA hinting fault handling is expensive so commit
      5bca2303 ("mm: sched: numa: Delay PTE scanning until a task is scheduled
      on a new node") deferred the PTE scan until a task had been scheduled on
      another node. The problem is that in the purely shared memory case
      this may never happen and no NUMA hinting fault information will be
      captured. We are not ruling out the possibility that something better
      can be done here but for now, this patch needs to be reverted and depend
      entirely on the scan_delay to avoid punishing short-lived processes.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1381141781-10992-16-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b726b7df
    • sched/numa: Continue PTE scanning even if migrate rate limited · 9e645ab6
      Committed by Peter Zijlstra
      Avoiding marking PTEs pte_numa because a particular NUMA node is migrate rate
      limited seems like a bad idea. Even if this node can't take any more
      migrations, other nodes might, and we want up-to-date information to make
      balancing decisions. We already rate limit the actual migrations, so this
      should leave enough bandwidth for the non-migrating scanning. It's important
      that we keep up-to-date information if we're going to base placement on it.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-15-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      9e645ab6
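      Conceptually, the change keeps the rate-limit check out of the scanning
      path and applies it only when a migration is actually attempted. Below is
      a hedged standalone sketch of that separation; all names are illustrative
      stand-ins, not kernel APIs.

      #include <stdbool.h>
      #include <stdio.h>

      /* Stand-in for the real per-node migration rate limiter. */
      static bool node_migrate_ratelimited(int nid)
      {
              return nid == 1;
      }

      static void scan_and_maybe_migrate(int page_nid, int target_nid)
      {
              /* Marking/scanning happens regardless of rate limits, so the
               * NUMA hinting fault statistics stay up to date for placement. */
              printf("marked page on node %d\n", page_nid);

              if (node_migrate_ratelimited(target_nid)) {
                      printf("migration to node %d skipped (rate limited)\n", target_nid);
                      return;
              }
              printf("migrated page to node %d\n", target_nid);
      }

      int main(void)
      {
              scan_and_maybe_migrate(0, 1);   /* scanned, migration rate limited */
              scan_and_maybe_migrate(0, 2);   /* scanned and migrated            */
              return 0;
      }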
    • sched/numa: Mitigate chance that same task always updates PTEs · 19a78d11
      Committed by Peter Zijlstra
      With a trace_printk("working\n"); right after the cmpxchg in
      task_numa_work() we can see that, for a 4-thread process, it's always the
      same task winning the race and doing the protection change.

      This is a problem since the task doing the protection change has a
      penalty for taking faults -- it is busy when marking the PTEs. If it's
      always the same task, the ->numa_faults[] statistics get severely skewed.
      
      Avoid this by delaying the task doing the protection change such that
      it is unlikely to win the privilege again.
      
      Before:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3232  [022] ....   212.787402: task_numa_work: working
            thread 0/0-3232  [022] ....   212.888473: task_numa_work: working
            thread 0/0-3232  [022] ....   212.989538: task_numa_work: working
            thread 0/0-3232  [022] ....   213.090602: task_numa_work: working
            thread 0/0-3232  [022] ....   213.191667: task_numa_work: working
            thread 0/0-3232  [022] ....   213.292734: task_numa_work: working
            thread 0/0-3232  [022] ....   213.393804: task_numa_work: working
            thread 0/0-3232  [022] ....   213.494869: task_numa_work: working
            thread 0/0-3232  [022] ....   213.596937: task_numa_work: working
            thread 0/0-3232  [022] ....   213.699000: task_numa_work: working
            thread 0/0-3232  [022] ....   213.801067: task_numa_work: working
            thread 0/0-3232  [022] ....   213.903155: task_numa_work: working
            thread 0/0-3232  [022] ....   214.005201: task_numa_work: working
            thread 0/0-3232  [022] ....   214.107266: task_numa_work: working
            thread 0/0-3232  [022] ....   214.209342: task_numa_work: working
      
      After:
      
      root@interlagos:~# grep "thread 0/.*working" /debug/tracing/trace | tail -15
            thread 0/0-3253  [005] ....   136.865051: task_numa_work: working
            thread 0/2-3255  [026] ....   136.965134: task_numa_work: working
            thread 0/3-3256  [024] ....   137.065217: task_numa_work: working
            thread 0/3-3256  [024] ....   137.165302: task_numa_work: working
            thread 0/3-3256  [024] ....   137.265382: task_numa_work: working
            thread 0/0-3253  [004] ....   137.366465: task_numa_work: working
            thread 0/2-3255  [026] ....   137.466549: task_numa_work: working
            thread 0/0-3253  [004] ....   137.566629: task_numa_work: working
            thread 0/0-3253  [004] ....   137.666711: task_numa_work: working
            thread 0/1-3254  [028] ....   137.766799: task_numa_work: working
            thread 0/0-3253  [004] ....   137.866876: task_numa_work: working
            thread 0/2-3255  [026] ....   137.966960: task_numa_work: working
            thread 0/1-3254  [028] ....   138.067041: task_numa_work: working
            thread 0/2-3255  [026] ....   138.167123: task_numa_work: working
            thread 0/3-3256  [024] ....   138.267207: task_numa_work: working
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-14-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      19a78d11
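      The race and the fix can be modelled in userspace C as follows. The field
      names mirror the description above, but this is an illustrative sketch,
      not the kernel's task_numa_work(): several threads of one mm race with a
      compare-and-swap to claim the next scan window, and the winner also pushes
      its own next eligibility further out so a sibling is likely to win next time.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stdio.h>

      struct mm_model   { _Atomic unsigned long numa_next_scan; };
      struct task_model { unsigned long node_stamp; };

      static bool try_claim_scan(struct mm_model *mm, struct task_model *p,
                                 unsigned long now, unsigned long period,
                                 unsigned long extra_delay)
      {
              unsigned long expected = atomic_load(&mm->numa_next_scan);

              if (now < expected || now < p->node_stamp)
                      return false;
              if (!atomic_compare_exchange_strong(&mm->numa_next_scan, &expected,
                                                  now + period))
                      return false;           /* another thread won the race */

              p->node_stamp = now + extra_delay;  /* handicap the winner */
              return true;
      }

      int main(void)
      {
              struct mm_model mm = { 0 };
              struct task_model a = { 0 }, b = { 0 };

              printf("t=100 A wins: %d\n", try_claim_scan(&mm, &a, 100, 100, 150));
              printf("t=200 A wins: %d\n", try_claim_scan(&mm, &a, 200, 100, 150));
              printf("t=200 B wins: %d\n", try_claim_scan(&mm, &b, 200, 100, 150));
              return 0;
      }

      In the model, task A wins at t=100, is then still handicapped at t=200,
      and task B gets to do the protection change instead.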
    • sched/numa: Fix comments · c69307d5
      Committed by Peter Zijlstra
      Fix an 80-column violation and a PTE vs PMD reference.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Link: http://lkml.kernel.org/r/1381141781-10992-4-git-send-email-mgorman@suse.de
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c69307d5
  2. 20 September 2013, 6 commits
  3. 13 September 2013, 6 commits
  4. 10 September 2013, 1 commit
  5. 02 September 2013, 8 commits
  6. 01 August 2013, 1 commit
  7. 31 July 2013, 1 commit
  8. 23 July 2013, 3 commits
    • sched: Micro-optimize the smart wake-affine logic · 7d9ffa89
      Committed by Peter Zijlstra
      Smart wake-affine currently uses the node size as the factor, but the overhead
      of the mask operation is high.

      Thus, this patch introduces the 'sd_llc_size' percpu variable, which records
      the size of the highest cache-sharing domain, and makes that the new factor,
      in order to reduce the overhead and make the heuristic more reasonable.
      Tested-by: Davidlohr Bueso <davidlohr.bueso@hp.com>
      Tested-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Link: http://lkml.kernel.org/r/51D5008E.6030102@linux.vnet.ibm.com
      [ Tidied up the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      7d9ffa89
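      In outline, the optimization caches the size of each CPU's last-level-cache
      domain once, at domain-build time, so the wakeup hot path reads an integer
      instead of weighing a cpumask. A standalone sketch of that idea follows;
      the plain array stands in for the real per-CPU variable and the function
      names follow the commit's description loosely rather than the actual code.

      #include <stdio.h>

      #define NR_CPUS 8

      static int sd_llc_size[NR_CPUS];  /* stand-in for the percpu variable */

      static void update_top_cache_domain(int cpu, int llc_weight)
      {
              sd_llc_size[cpu] = llc_weight;  /* done once, at domain build time */
      }

      static int wake_affine_factor(int cpu)
      {
              return sd_llc_size[cpu];        /* cheap read in the wakeup path */
      }

      int main(void)
      {
              for (int cpu = 0; cpu < NR_CPUS; cpu++)
                      update_top_cache_domain(cpu, 4);  /* e.g. 4 CPUs share an LLC */
              printf("factor for cpu0: %d\n", wake_affine_factor(0));
              return 0;
      }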
    • sched: Implement smarter wake-affine logic · 62470419
      Committed by Michael Wang
      The wake-affine scheduler feature is currently always trying to pull
      the wakee close to the waker. In theory this should be beneficial if
      the waker's CPU caches hot data for the wakee, and it's also beneficial
      in the extreme ping-pong high context switch rate case.
      
      Testing shows it can benefit hackbench up to 15%.
      
      However, the feature is somewhat blind, and some workloads
      such as pgbench suffer from it. It's also algorithmically time-consuming.
      
      Testing shows it can damage pgbench up to 50% - far more than the
      benefit it brings in the best case.
      
      So wake-affine should be smarter and it should realize when to
      stop its thankless effort at trying to find a suitable CPU to wake on.
      
      This patch introduces 'wakee_flips', which will be increased each
      time the task flips (switches) its wakee target.
      
      So a high 'wakee_flips' value means the task has more than one
      wakee, and the bigger the number, the higher the wakeup frequency.
      
      Now, when making the decision on whether or not to pull, pay attention to
      the wakee: pulling a task with a high 'wakee_flips' may benefit the wakee,
      but it also implies that the waker will face competition later. That
      competition could be very cruel or over very fast, depending on the story
      behind 'wakee_flips'; either way, the waker suffers.

      Furthermore, if the waker also has a high 'wakee_flips', that implies that
      multiple tasks rely on it; the waker's higher latency would then damage
      all of them, so pulling the wakee seems to be a bad deal.
      
      Thus, when 'waker->wakee_flips / wakee->wakee_flips' becomes
      higher and higher, the cost of pulling seems to be worse and worse.
      
      The patch therefore helps the wake-affine feature to stop its pulling
      work when:
      
      	wakee->wakee_flips > factor &&
      	waker->wakee_flips > (factor * wakee->wakee_flips)
      
      The 'factor' here is the number of CPUs in the current CPU's NUMA node,
      so a bigger node will lead to more pulling since the trial becomes more
      severe.
      
      After applying the patch, pgbench shows up to 40% improvements and no regressions.
      
      Tested with 12 cpu x86 server and tip 3.10.0-rc7.
      
      The percentages in the final column highlight the areas with the biggest wins,
      all other areas improved as well:
      
      	pgbench		    base	smart
      
      	| db_size | clients |  tps  |	|  tps  |
      	+---------+---------+-------+   +-------+
      	| 22 MB   |       1 | 10598 |   | 10796 |
      	| 22 MB   |       2 | 21257 |   | 21336 |
      	| 22 MB   |       4 | 41386 |   | 41622 |
      	| 22 MB   |       8 | 51253 |   | 57932 |
      	| 22 MB   |      12 | 48570 |   | 54000 |
      	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
      	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
      	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
      	| 7484 MB |       1 |  8951 |   |  9193 |
      	| 7484 MB |       2 | 19233 |   | 19240 |
      	| 7484 MB |       4 | 37239 |   | 37302 |
      	| 7484 MB |       8 | 46087 |   | 50018 |
      	| 7484 MB |      12 | 42054 |   | 48763 |
      	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
      	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
      	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
      	| 15 GB   |       1 |  8845 |   |  9104 |
      	| 15 GB   |       2 | 19094 |   | 19162 |
      	| 15 GB   |       4 | 36979 |   | 36983 |
      	| 15 GB   |       8 | 46087 |   | 49977 |
      	| 15 GB   |      12 | 41901 |   | 48591 |
      	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
      	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
      	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62470419
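      The flip tracking and the stop-pulling condition quoted above can be
      modelled in standalone C as follows. The struct and function names are
      illustrative, and any decay of the counter over time is omitted; only the
      condition itself is taken directly from the commit message.

      #include <stdbool.h>
      #include <stdio.h>

      struct task_model {
              int last_wakee_id;
              unsigned int wakee_flips;
      };

      static void record_wakee(struct task_model *waker, int wakee_id)
      {
              if (waker->last_wakee_id != wakee_id) {
                      waker->last_wakee_id = wakee_id;
                      waker->wakee_flips++;   /* counts wakee-target switches */
              }
      }

      /* Returns true when wake-affine should NOT try to pull the wakee. */
      static bool wake_too_wide(const struct task_model *waker,
                                const struct task_model *wakee,
                                unsigned int factor)
      {
              return wakee->wakee_flips > factor &&
                     waker->wakee_flips > factor * wakee->wakee_flips;
      }

      int main(void)
      {
              struct task_model waker = { -1, 0 }, wakee = { -1, 0 };

              for (int i = 0; i < 100; i++)
                      record_wakee(&waker, i % 16);   /* waker fans out: ~100 flips */
              for (int i = 0; i < 10; i++)
                      record_wakee(&wakee, i % 2);    /* wakee ping-pongs: ~10 flips */

              printf("skip pulling: %d\n", wake_too_wide(&waker, &wakee, 4));
              return 0;
      }

      With a factor of 4 (e.g. a small LLC/node), the heavily fanning-out waker
      and the moderately flipping wakee trip the condition and pulling is skipped.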
    • sched: Move h_load calculation to task_h_load() · 68520796
      Committed by Vladimir Davydov
      The bad thing about update_h_load(), which computes the hierarchical load
      factor for task groups, is that it is called for each task group in the
      system before every load balancer run, and since rebalancing can be
      triggered very often, this function can eat a lot of cpu time if
      there are many cpu cgroups in the system.
      
      Although the situation was improved significantly by commit a35b6466
      ('sched, cgroup: Reduce rq->lock hold times for large cgroup
      hierarchies'), the problem still can arise under some kinds of loads,
      e.g. when cpus are switching from idle to busy and back very frequently.
      
      For instance, when I start 1000 processes that wake up every
      millisecond on my 8 cpus host, 'top' and 'perf top' show:
      
      Cpu(s): 17.8%us, 24.3%sy,  0.0%ni, 57.9%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 243K cycles
        7.57%  [kernel]               [k] __schedule
        7.08%  [kernel]               [k] timerqueue_add
        6.13%  libc-2.12.so           [.] usleep
      
      Then if I create 10000 *idle* cpu cgroups (no processes in them), cpu
      usage increases significantly although the 'wakers' are still executing
      in the root cpu cgroup:
      
      Cpu(s): 19.1%us, 48.7%sy,  0.0%ni, 31.6%id,  0.0%wa,  0.0%hi,  0.7%si
      Events: 230K cycles
       24.56%  [kernel]            [k] tg_load_down
        5.76%  [kernel]            [k] __schedule
      
      This happens because this particular kind of load triggers 'new idle'
      rebalance very frequently, which requires calling update_h_load(),
      which, in turn, calls tg_load_down() for every *idle* cpu cgroup even
      though it is absolutely useless, because idle cpu cgroups have no tasks
      to pull.
      
      This patch tries to improve the situation by making h_load calculation
      proceed only when h_load is really necessary. To achieve this, it
      substitutes update_h_load() with update_cfs_rq_h_load(), which computes
      h_load only for a given cfs_rq and all its ancestors, and makes the
      load balancer call this function whenever it considers if a task should
      be pulled, i.e. it moves h_load calculations directly to task_h_load().
      For h_load of the same cfs_rq not to be updated multiple times (in case
      several tasks in the same cgroup are considered during the same balance
      run), the patch keeps the time of the last h_load update for each cfs_rq
      and breaks off the calculation when it finds h_load to be up to date.
      
      The benefit is that h_load is computed only for those cfs_rq's that
      really need it; in particular, all idle task groups are skipped.
      Although this, in fact, moves h_load calculation under rq lock, it
      should not affect latency much, because the amount of work done under rq
      lock while trying to pull tasks is limited by sched_nr_migrate.
      
      After the patch applied with the setup described above (1000 wakers in
      the root cgroup and 10000 idle cgroups), I get:
      
      Cpu(s): 16.9%us, 24.8%sy,  0.0%ni, 58.4%id,  0.0%wa,  0.0%hi,  0.0%si
      Events: 242K cycles
        7.57%  [kernel]                  [k] __schedule
        6.70%  [kernel]                  [k] timerqueue_add
        5.93%  libc-2.12.so              [.] usleep
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1373896159-1278-1-git-send-email-vdavydov@parallels.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      68520796
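      The lazy, cached walk described above can be sketched in standalone C.
      The structures and the share computation are simplified illustrations,
      not the kernel's cfs_rq; only the caching-by-timestamp idea is taken from
      the commit message.

      #include <stdio.h>

      struct cfs_rq_model {
              struct cfs_rq_model *parent;
              unsigned long load;            /* this group's own load             */
              unsigned long parent_share;    /* fraction (%) of the parent's load */
              unsigned long h_load;
              unsigned long last_h_load_update;
      };

      static void update_cfs_rq_h_load(struct cfs_rq_model *cfs_rq, unsigned long now)
      {
              if (cfs_rq->last_h_load_update == now)
                      return;                /* already up to date this balance pass */

              if (!cfs_rq->parent) {
                      cfs_rq->h_load = cfs_rq->load;
              } else {
                      update_cfs_rq_h_load(cfs_rq->parent, now);
                      cfs_rq->h_load = cfs_rq->parent->h_load * cfs_rq->parent_share / 100;
              }
              cfs_rq->last_h_load_update = now;
      }

      static unsigned long task_h_load(struct cfs_rq_model *cfs_rq,
                                       unsigned long task_load, unsigned long now)
      {
              /* Only this cfs_rq's chain is updated, and only when a pull is considered. */
              update_cfs_rq_h_load(cfs_rq, now);
              return cfs_rq->h_load * task_load / (cfs_rq->load ? cfs_rq->load : 1);
      }

      int main(void)
      {
              struct cfs_rq_model root  = { NULL, 1000,  0, 0, 0 };
              struct cfs_rq_model group = { &root, 200, 50, 0, 0 };

              printf("task h_load: %lu\n", task_h_load(&group, 100, 1));
              return 0;
      }

      Idle groups are simply never visited, since no task in them is ever
      considered for pulling, which is the point of moving the calculation into
      task_h_load().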
  9. 22 July 2013, 1 commit
  10. 18 July 2013, 1 commit
  11. 15 July 2013, 1 commit
    • kernel: delete __cpuinit usage from all core kernel files · 0db0628d
      Committed by Paul Gortmaker
      The __cpuinit type of throwaway sections might have made sense
      some time ago when RAM was more constrained, but now the savings
      do not offset the cost and complications.  For example, the fix in
      commit 5e427ec2 ("x86: Fix bit corruption at CPU resume time")
      is a good example of the nasty type of bugs that can be created
      with improper use of the various __init prefixes.
      
      After a discussion on LKML[1] it was decided that cpuinit should go
      the way of devinit and be phased out.  Once all the users are gone,
      we can then finally remove the macros themselves from linux/init.h.
      
      This removes all the uses of the __cpuinit macros from C files in
      the core kernel directories (kernel, init, lib, mm, and include)
      that don't really have a specific maintainer.
      
      [1] https://lkml.org/lkml/2013/5/20/589
      Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
      0db0628d
  12. 27 June 2013, 5 commits