1. 31 March 2016, 4 commits
  2. 21 March 2016, 2 commits
  3. 09 March 2016, 1 commit
  4. 29 February 2016, 3 commits
  5. 09 February 2016, 2 commits
    • sched/numa: Spread memory according to CPU and memory use · 4142c3eb
      Authored by Rik van Riel
      The pseudo-interleaving in NUMA placement has a fundamental problem:
      using hard usage thresholds to spread memory equally between nodes
      can prevent workloads from converging, or keep memory "trapped" on
      nodes where the workload is barely running any more.
      
      In order for workloads to properly converge, the memory migration
      should not be stopped when nodes reach parity, but instead be
      distributed according to how heavily memory is used from each node.
      This way memory migration and task migration reinforce each other,
      instead of one putting the brakes on the other.
      
      Remove the hard thresholds from the pseudo-interleaving code, and
      instead use a more gradual policy on memory placement. This also
      seems to improve convergence of workloads that do not run flat out,
      but sleep in between bursts of activity.
      
      We still want to slow down NUMA scanning and migration once a workload
      has settled on a few actively used nodes, so keep the 3/4 hysteresis
      in place. Keep track of whether a workload is actively running on
      multiple nodes, so task_numa_migrate does a full scan of the system
      for better task placement.
      
      In the case of running 3 SPECjbb2005 instances on a 4 node system,
      this code seems to result in fairer distribution of memory between
      nodes, with more memory bandwidth for each instance.
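
      A minimal sketch of the resulting placement test, modelled on
      should_numa_migrate_memory() in kernel/sched/fair.c; the helper names
      and the exact shape of the comparison are assumptions based on the
      description above:

        /*
         * Sketch: decide whether a page may migrate from src_nid to dst_nid.
         * Rather than a hard usage threshold, distribute memory according to
         * how heavily it is used from each node, keeping 3/4 hysteresis to
         * avoid bouncing pages back and forth.  Assumed helpers:
         *   group_faults_cpu(ng, nid) - CPU-use fault statistic for nid
         *   group_faults(p, nid)      - memory-use fault statistic for nid
         */
        static bool numa_migrate_memory_sketch(struct task_struct *p,
                                               struct numa_group *ng,
                                               int src_nid, int dst_nid)
        {
            /*
             * Pull memory towards the node doing proportionally more CPU
             * work per unit of memory:
             *
             *   faults_cpu(dst)   3   faults_cpu(src)
             *   --------------- * - > ---------------
             *   faults_mem(dst)   4   faults_mem(src)
             *
             * cross-multiplied to avoid divisions:
             */
            return group_faults_cpu(ng, dst_nid) * group_faults(p, src_nid) * 3 >
                   group_faults_cpu(ng, src_nid) * group_faults(p, dst_nid) * 4;
        }
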
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/20160125170739.2fc9a641@annuminas.surriel.com
      [ Minor readability tweaks. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Authored by Mel Gorman
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line parameter and a sysctl tunable to
      enable or disable schedstats on demand (when it is built in). It is
      disabled by default, on the assumption that anyone who knows they need
      it can also learn to enable it when necessary.
      
      The benefits depend on how scheduler-intensive the workload is. If it
      is, the patch reduces the number of cycles spent calculating the stats,
      with a small additional benefit from a reduced scheduler cache
      footprint.
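
      A minimal sketch of how such a runtime switch can keep the accounting
      out of the hot path entirely, using the kernel's static-key (jump
      label) facility; the key and macro names follow the patch description,
      but their exact shapes here are assumptions:

        #include <linux/jump_label.h>

        /* Runtime switch; the branch is patched out when disabled. */
        DEFINE_STATIC_KEY_FALSE(sched_schedstats);

        #define schedstat_enabled()  static_branch_unlikely(&sched_schedstats)

        /* Only touch (and pull into cache) the stats field when enabled. */
        #define schedstat_inc(rq, field)            \
            do {                                    \
                if (schedstat_enabled())            \
                    (rq)->field++;                  \
            } while (0)

      The schedstats= boot parameter and the sysctl then only need to flip
      the key, e.g. with static_branch_enable()/static_branch_disable().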
      
      These measurements were taken from a 48-core 2-socket machine with
      Xeon(R) E5-2670 v3 CPUs, although they were also tested on a
      single-socket 8-core machine with an Intel i7-3770 processor.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here; UDP_STREAM showed nothing interesting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat. The
      gains on the 8-core machine were slightly different:
      
      tbench4 on 8-core i7-3770 single-socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In contrast, this shows a relatively steady 2-3% gain at higher thread
      counts. Given the nature of the patch and the type of workload, it is
      no surprise that the result depends on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses; while the variance data is not included,
      the differences are close to the noise. The UMA machine did not show
      anything particularly different.
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats
      and tracepoints may be surprising to experts doing performance analysis
      until they discover the schedstats= parameter or the schedstats sysctl.
      Schedstats will be automatically activated for latencytop and sleep
      profiling to alleviate the problem. For tracepoints there is a simple
      warning instead, as it is not safe to activate schedstats in a context
      where it is known that the tracepoint may be wanted but is unavailable.
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Matt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  6. 22 January 2016, 1 commit
    • sched/numa: Fix use-after-free bug in the task_numa_compare · 1dff76b9
      Authored by Gavin Guo
      The following message can be observed on an Ubuntu v3.13.0-65 kernel
      with KASan backported:
      
        ==================================================================
        BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
        Read of size 8 by task qemu-system-x86/3998900
        =============================================================================
        BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
        -----------------------------------------------------------------------------
      
        INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
      	__slab_alloc+0x4f8/0x560
      	__kmalloc+0x1eb/0x280
      	task_numa_fault+0xc1b/0xed0
      	do_numa_page+0x192/0x200
      	handle_mm_fault+0x808/0x1160
      	__do_page_fault+0x218/0x750
      	do_page_fault+0x1a/0x70
      	page_fault+0x28/0x30
      	SyS_poll+0x66/0x1a0
      	system_call_fastpath+0x1a/0x1f
        INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
      	__slab_free+0x2ab/0x3f0
      	kfree+0x161/0x170
      	task_numa_free+0x1d2/0x200
      	finish_task_switch+0x1d2/0x210
      	__schedule+0x5d4/0xc60
      	schedule_preempt_disabled+0x40/0xc0
      	cpu_startup_entry+0x2da/0x340
      	start_secondary+0x28f/0x360
        Call Trace:
         [<ffffffff81a6ce35>] dump_stack+0x45/0x56
         [<ffffffff81244aed>] print_trailer+0xfd/0x170
         [<ffffffff8124ac36>] object_err+0x36/0x40
         [<ffffffff8124cbf9>] kasan_report_error+0x1e9/0x3a0
         [<ffffffff8124d260>] kasan_report+0x40/0x50
         [<ffffffff810dda7c>] ? task_numa_find_cpu+0x64c/0x890
         [<ffffffff8124bee9>] __asan_load8+0x69/0xa0
         [<ffffffff814f5c38>] ? find_next_bit+0xd8/0x120
         [<ffffffff810dda7c>] task_numa_find_cpu+0x64c/0x890
         [<ffffffff810de16c>] task_numa_migrate+0x4ac/0x7b0
         [<ffffffff810de523>] numa_migrate_preferred+0xb3/0xc0
         [<ffffffff810e0b88>] task_numa_fault+0xb88/0xed0
         [<ffffffff8120ef02>] do_numa_page+0x192/0x200
         [<ffffffff81211038>] handle_mm_fault+0x808/0x1160
         [<ffffffff810d7dbd>] ? sched_clock_cpu+0x10d/0x160
         [<ffffffff81068c52>] ? native_load_tls+0x82/0xa0
         [<ffffffff81a7bd68>] __do_page_fault+0x218/0x750
         [<ffffffff810c2186>] ? hrtimer_try_to_cancel+0x76/0x160
         [<ffffffff81a6f5e7>] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
         [<ffffffff81a7c2ba>] do_page_fault+0x1a/0x70
         [<ffffffff81a772e8>] page_fault+0x28/0x30
         [<ffffffff8128cbd4>] ? do_sys_poll+0x1c4/0x6d0
         [<ffffffff810e64f6>] ? enqueue_task_fair+0x4b6/0xaa0
         [<ffffffff810233c9>] ? sched_clock+0x9/0x10
         [<ffffffff810cf70a>] ? resched_task+0x7a/0xc0
         [<ffffffff810d0663>] ? check_preempt_curr+0xb3/0x130
         [<ffffffff8128b5c0>] ? poll_select_copy_remaining+0x170/0x170
         [<ffffffff810d3bc0>] ? wake_up_state+0x10/0x20
         [<ffffffff8112a28f>] ? drop_futex_key_refs.isra.14+0x1f/0x90
         [<ffffffff8112d40e>] ? futex_requeue+0x3de/0xba0
         [<ffffffff8112e49e>] ? do_futex+0xbe/0x8f0
         [<ffffffff81022c89>] ? read_tsc+0x9/0x20
         [<ffffffff8111bd9d>] ? ktime_get_ts+0x12d/0x170
         [<ffffffff8108f699>] ? timespec_add_safe+0x59/0xe0
         [<ffffffff8128d1f6>] SyS_poll+0x66/0x1a0
         [<ffffffff81a830dd>] system_call_fastpath+0x1a/0x1f
      
      As commit 1effd9f1 ("sched/numa: Fix unsafe get_task_struct() in
      task_numa_assign()") points out, rcu_read_lock() cannot protect the
      task_struct from being freed in finish_task_switch(). The bug happens
      during the calculation of imp, which accesses p->numa_faults after it
      has been freed along the following path:
      
      do_exit()
              current->flags |= PF_EXITING;
          release_task()
              ~~delayed_put_task_struct()~~
          schedule()
          ...
          ...
      rq->curr = next;
          context_switch()
              finish_task_switch()
                  put_task_struct()
                      __put_task_struct()
      		    task_numa_free()
      
      The fix is to call get_task_struct() early, before dst_rq->lock is
      released, to protect the calculation, and to call put_task_struct() at
      the corresponding point if dst_rq->curr ultimately cannot be assigned.
      
      Additional credit to Liang Chen, who helped fix the error logic and
      added the put_task_struct() call in the place it was missing.
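
      A condensed sketch of the resulting logic in task_numa_compare(), with
      the imp calculation and the swap details elided; treat the exact
      locking shape as an assumption based on the description above:

        struct rq *dst_rq = cpu_rq(env->dst_cpu);
        struct task_struct *cur;

        rcu_read_lock();
        raw_spin_lock_irq(&dst_rq->lock);
        cur = dst_rq->curr;
        /*
         * Pin the task while dst_rq->lock is still held, so that
         * finish_task_switch() -> put_task_struct() cannot free it
         * (and p->numa_faults with it) while imp is being computed.
         * No need to consider an exiting or idle task at all.
         */
        if ((cur->flags & PF_EXITING) || is_idle_task(cur))
            cur = NULL;
        else
            get_task_struct(cur);
        raw_spin_unlock_irq(&dst_rq->lock);

        /* ... compute imp from cur->numa_faults; maybe task_numa_assign(cur) ... */

        rcu_read_unlock();
        /* Drop the reference if cur did not end up as the swap target. */
        if (cur)
            put_task_struct(cur);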
      Signed-off-by: Gavin Guo <gavin.guo@canonical.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jay.vosburgh@canonical.com
      Cc: liang.chen@canonical.com
      Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 06 January 2016, 2 commits
  8. 04 December 2015, 3 commits
  9. 23 November 2015, 8 commits
  10. 09 November 2015, 1 commit
  11. 20 October 2015, 2 commits
  12. 06 October 2015, 2 commits
  13. 18 September 2015, 3 commits
  14. 13 September 2015, 6 commits