1. 07 Jun 2015 (5 commits)
    • sched/numa: Only consider less busy nodes as numa balancing destinations · 6f9aad0b
      Authored by Rik van Riel
      Changeset a43455a1 ("sched/numa: Ensure task_numa_migrate() checks
      the preferred node") fixes an issue where workloads would never
      converge on a fully loaded (or overloaded) system.
      
      However, it introduces a regression on less than fully loaded systems,
      where workloads converge on a few NUMA nodes, instead of properly
      staying spread out across the whole system. This leads to a reduction
      in available memory bandwidth, and usable CPU cache, with predictable
      performance problems.
      
      The root cause appears to be an interaction between the load balancer
      and NUMA balancing, where the short term load represented by the load
      balancer differs from the long term load the NUMA balancing code would
      like to base its decisions on.
      
      Simply reverting a43455a1 would re-introduce the non-convergence
      of workloads on fully loaded systems, so that is not a good option. As
      an aside, the check done before a43455a1 only applied to a task's
      preferred node, not to other candidate nodes in the system, so the
      converge-on-too-few-nodes problem still happens, just to a lesser
      degree.
      
      Instead, try to compensate for the impedance mismatch between the load
      balancer and NUMA balancing by only ever considering a less loaded
      node as a destination for NUMA balancing, regardless of whether the
      task is trying to move to its preferred node or to another node.
      
      This patch also addresses the issue, introduced by 095bebf6
      ("sched/numa: Do not move past the balance point if unbalanced"), that
      a system with a single runnable thread would never migrate that thread
      closer to its memory.
      
      In a test where the main thread creates a large memory area and spawns
      a worker thread to iterate over the memory (the worker is placed on
      another node by select_task_rq_fair), after which the main thread goes
      to sleep and waits for the worker to loop over all the memory, the
      worker thread is now migrated to where the memory is, instead of all
      the memory being migrated over as before.
      
      Jirka has run a number of performance tests on several systems:
      single-instance SpecJBB 2005 performance is 7-15% higher on a 4-node
      system, with higher gains on systems with more cores per socket.
      Multi-instance SpecJBB 2005 (one instance per node), linpack, and
      stream see little or no change with the revert of 095bebf6 and this
      patch.
      Reported-by: Artem Bityutskiy <dedekind1@gmail.com>
      Reported-by: Jirka Hladky <jhladky@redhat.com>
      Tested-by: Jirka Hladky <jhladky@redhat.com>
      Tested-by: Artem Bityutskiy <dedekind1@gmail.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20150528095249.3083ade0@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      6f9aad0b
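      A minimal stand-alone sketch of the destination filter described
      above, using made-up load figures; numa_dst_allowed() and struct
      node_stats are illustrative names, not the kernel's (the real check
      lives in task_numa_migrate()/task_numa_compare() in
      kernel/sched/fair.c):

        /* Sketch of "only consider less busy nodes as NUMA balancing
         * destinations". All names and units here are illustrative.
         */
        #include <stdbool.h>
        #include <stdio.h>

        struct node_stats {
            int  nid;
            long load;    /* aggregated runnable load of the node */
        };

        /* Accept a candidate destination only if it is less loaded than
         * the source node, whether or not it is the preferred node.
         */
        static bool numa_dst_allowed(const struct node_stats *src,
                                     const struct node_stats *dst)
        {
            return dst->load < src->load;
        }

        int main(void)
        {
            struct node_stats src = { .nid = 0, .load = 180 };
            struct node_stats cand[] = {
                { .nid = 1, .load = 120 },   /* less busy: considered */
                { .nid = 2, .load = 240 },   /* busier:    skipped    */
            };

            for (unsigned int i = 0; i < sizeof(cand) / sizeof(cand[0]); i++)
                printf("node %d -> %s\n", cand[i].nid,
                       numa_dst_allowed(&src, &cand[i]) ? "considered" : "skipped");
            return 0;
        }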
    • Revert 095bebf6 ("sched/numa: Do not move past the balance point if unbalanced") · e4991b24
      Authored by Rik van Riel
      Commit 095bebf6 ("sched/numa: Do not move past the balance point
      if unbalanced") broke convergence of workloads with just one runnable
      thread, by making it impossible for the one runnable thread on the
      system to move from one NUMA node to another.
      
      Instead, the thread would remain where it was, and pull all the memory
      across to its location, which is much slower than just migrating the
      thread to where the memory is.
      
      The next patch has a better fix for the issue that 095bebf6 tried
      to address.
      Reported-by: Jirka Hladky <jhladky@redhat.com>
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: dedekind1@gmail.com
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/1432753468-7785-2-git-send-email-riel@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e4991b24
    • sched/fair: Prevent throttling in early pick_next_task_fair() · 54d27365
      Authored by Ben Segall
      The optimized task selection logic optimistically selects a new task
      to run without first doing a full put_prev_task(). This is so that we
      can avoid a put/set on the common ancestors of the old and new task.
      
      Similarly, we should only call check_cfs_rq_runtime() to throttle
      eligible groups if they are part of the common ancestry; otherwise it
      is possible to end up with no eligible task in the simple task
      selection path.
      
      Imagine:
      		/root
      	/prev		/next
      	/A		/B
      
      If our optimistic selection ends up throttling /next, we goto simple
      and our put_prev_task() ends up throttling /prev, after which we're
      going to bug out in set_next_entity() because there aren't any tasks
      left.
      
      Avoid this scenario by only throttling common ancestors.
      Reported-by: Mohammed Naser <mnaser@vexxhost.com>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Ben Segall <bsegall@google.com>
      [ munged Changelog ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: pjt@google.com
      Fixes: 678d5718 ("sched/fair: Optimize cgroup pick_next_task_fair()")
      Link: http://lkml.kernel.org/r/xm26wq1oswoq.fsf@sword-of-the-dawn.mtv.corp.google.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      54d27365
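      A stand-alone sketch of the "only throttle common ancestors" rule,
      modeled with a plain parent-pointer tree; struct group, is_ancestor()
      and may_throttle() are illustrative stand-ins, not the kernel's
      check_cfs_rq_runtime()/pick_next_task_fair() code:

        #include <stdbool.h>
        #include <stdio.h>

        struct group {
            const char   *name;
            struct group *parent;    /* NULL for the root group */
        };

        static bool is_ancestor(const struct group *anc, const struct group *g)
        {
            for (; g; g = g->parent)
                if (g == anc)
                    return true;
            return false;
        }

        /* Throttling 'g' during the optimistic pick is only safe if it sits
         * above both the previous and the next task's group, so that the
         * later put_prev_task()/set_next_entity() path cannot be left
         * without an eligible task.
         */
        static bool may_throttle(const struct group *g,
                                 const struct group *prev,
                                 const struct group *next)
        {
            return is_ancestor(g, prev) && is_ancestor(g, next);
        }

        int main(void)
        {
            struct group root = { "root", NULL };
            struct group A    = { "prev/A", &root };
            struct group B    = { "next/B", &root };

            printf("throttle root? %d\n", may_throttle(&root, &A, &B)); /* 1 */
            printf("throttle B?    %d\n", may_throttle(&B, &A, &B));    /* 0 */
            return 0;
        }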
    • preempt: Use preempt_schedule_context() as the official tracing preemption point · 4eaca0a8
      Authored by Frederic Weisbecker
      preempt_schedule_context() is a tracing safe preemption point but it's
      only used when CONFIG_CONTEXT_TRACKING=y. Other configs have tracing
      recursion issues since commit:
      
        b30f0e3f ("sched/preempt: Optimize preemption operations on __schedule() callers")
      
      introduced function-based preempt_count_*() ops.

      Let's make it available on all configs and give it a more appropriate
      name for its new role.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433432349-1021-3-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      4eaca0a8
    • sched: Make preempt_schedule_context() function-tracing safe · be690035
      Authored by Frederic Weisbecker
      Since function tracing disables preemption, it needs a safe preemption
      point to use when preemption is re-enabled, without worrying about
      tracing recursion. That is: to avoid tracing recursion, that preemption
      point can't itself be traced (hence the notrace qualifier) and it must
      not call any traceable function before it disables preemption itself,
      which disarms the recursion.
      
      preempt_schedule() was fine until commit:
      
        b30f0e3f ("sched/preempt: Optimize preemption operations on __schedule() callers")
      
      because PREEMPT_ACTIVE (which has the property of disabling preemption
      and thus disarming tracing preemption recursion) was set before calling
      any further function.
      
      But that commit introduced the use of preempt_count_add/sub() functions
      to set PREEMPT_ACTIVE and because these functions are called before
      preemption gets a chance to be disabled, we have a tracing recursion.
      
      preempt_schedule_context() is one of the possible preemption functions
      used by tracing. Its special purpose is to avoid tracing recursion
      against context tracking. Let's enhance this function to become more
      generally tracing-safe by disabling preemption with raw accessors, so
      that no function is called before preemption gets disabled, which
      disarms the tracing recursion.

      This function is going to become the specific tracing-safe preemption
      point in a later commit.
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1433432349-1021-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      be690035
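      A stand-alone sketch of why the tracing-safe preemption point must
      bump the preempt count with raw, never-traced accessors before calling
      anything else; raw_preempt_disable() and traced_call() are illustrative
      stand-ins, while the kernel uses the notrace qualifier plus the raw
      __preempt_count accessors:

        #include <stdio.h>

        static int preempt_count;    /* model of the per-CPU preempt counter */

        /* These two stand in for raw accessors that are never traced. */
        static inline void raw_preempt_disable(void) { preempt_count++; }
        static inline void raw_preempt_enable(void)  { preempt_count--; }

        /* A traced function: the tracer would re-enter the preemption point
         * if preemption were not already marked disabled.
         */
        static void traced_call(const char *what)
        {
            if (preempt_count == 0)
                printf("tracer: would recurse into the preemption point (%s)\n", what);
            else
                printf("tracer: safe, preemption already disabled (%s)\n", what);
        }

        static void tracing_safe_preempt_point(void)
        {
            raw_preempt_disable();    /* first action: nothing traceable before this */
            traced_call("schedule");  /* anything traced from here on is harmless */
            raw_preempt_enable();
        }

        int main(void)
        {
            tracing_safe_preempt_point();
            return 0;
        }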
  2. 19 May 2015 (3 commits)
    • sched/numa: Reduce conflict between fbq_classify_rq() and migration · c1ceac62
      Authored by Rik van Riel
      It is possible for fbq_classify_rq() to indicate that a CPU has tasks that
      should be moved to another NUMA node, but for migrate_improves_locality
      and migrate_degrades_locality to not identify those tasks.
      
      This patch always gives preference to preferred node evaluations, and
      only checks the number of faults when evaluating moves between two
      non-preferred nodes on a larger NUMA system.
      
      On a two-node system the number of faults is never evaluated: a task
      is either about to be pulled off its preferred node, or migrated onto
      it.
      Signed-off-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: mgorman@suse.de
      Link: http://lkml.kernel.org/r/20150514225936.35b91717@annuminas.surriel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1ceac62
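      A stand-alone sketch of the evaluation order described above; the
      struct, the faults[] array and the move_improves_locality() name are
      illustrative stand-ins for the kernel's locality checks, not the real
      migrate_improves_locality() implementation:

        #include <stdbool.h>
        #include <stdio.h>

        struct task_model {
            int  preferred_nid;
            long faults[4];    /* NUMA faults recorded per node (made up) */
        };

        static bool move_improves_locality(const struct task_model *p,
                                           int src_nid, int dst_nid)
        {
            if (dst_nid == p->preferred_nid)   /* moving onto the preferred node */
                return true;
            if (src_nid == p->preferred_nid)   /* being pulled off it */
                return false;
            /* Fault counts only decide moves between two non-preferred
             * nodes; on a two-node system this branch is never reached.
             */
            return p->faults[dst_nid] > p->faults[src_nid];
        }

        int main(void)
        {
            struct task_model p = { .preferred_nid = 0,
                                    .faults = { 90, 10, 40, 5 } };
            printf("1 -> 0: %d\n", move_improves_locality(&p, 1, 0)); /* 1 */
            printf("0 -> 1: %d\n", move_improves_locality(&p, 0, 1)); /* 0 */
            printf("1 -> 2: %d\n", move_improves_locality(&p, 1, 2)); /* 1 */
            return 0;
        }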
    • sched/preempt: Optimize preemption operations on __schedule() callers · b30f0e3f
      Authored by Frederic Weisbecker
      __schedule() disables preemption and some of its callers
      (the preempt_schedule*() family) also set PREEMPT_ACTIVE.
      
      So we have two preempt_count() modifications that could be performed
      at once.
      
      Let's remove the preemption disablement from __schedule() and pull
      this responsibility up to its callers, so that the preempt_count()
      operations can be combined in a single place.
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-5-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b30f0e3f
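      A stand-alone sketch of folding the two counter modifications into one
      operation at the caller; the constants and function names are
      illustrative only (PREEMPT_ACTIVE's real value and the real
      preempt_schedule()/__schedule() code differ):

        #include <stdio.h>

        #define PREEMPT_DISABLE_OFFSET 1
        #define PREEMPT_ACTIVE         0x100000   /* illustrative bit only */

        static int preempt_count;

        static void model_schedule(void)
        {
            /* __schedule() itself no longer touches the count in this model */
            printf("scheduling with preempt_count=0x%x\n", preempt_count);
        }

        static void model_preempt_schedule(void)
        {
            /* one combined add/sub instead of two separate increments */
            preempt_count += PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET;
            model_schedule();
            preempt_count -= PREEMPT_ACTIVE + PREEMPT_DISABLE_OFFSET;
        }

        int main(void)
        {
            model_preempt_schedule();
            return 0;
        }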
    • sched: always use blk_schedule_flush_plug in io_schedule_out · 10d784ea
      Authored by Shaohua Li
      The block plug callback could sleep, so we introduced a
      'from_schedule' parameter that the corresponding drivers can use to
      distinguish a schedule-time plug flush from a plug finish.
      Unfortunately, io_schedule_out still uses blk_flush_plug(). This
      causes the output below (note: I added a might_sleep() in raid1_unplug
      to make it trigger faster, but the underlying issue does not depend on
      that might_sleep()). In raid1/10, this can cause a deadlock.

      This patch makes io_schedule_out always use blk_schedule_flush_plug.
      This should only impact drivers (as far as I know, raid 1/10) that are
      sensitive to the 'from_schedule' parameter.
      
      [  370.817949] ------------[ cut here ]------------
      [  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
      [  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [<ffffffff81092fcf>] prepare_to_wait+0x2f/0x90
      [  370.817971] Modules linked in: raid1
      [  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
      [  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
      [  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
      [  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
      [  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
      [  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
      [  370.817993] Call Trace:
      [  370.817999]  [<ffffffff819dd7af>] dump_stack+0x4f/0x7b
      [  370.818002]  [<ffffffff81051afc>] warn_slowpath_common+0x8c/0xd0
      [  370.818004]  [<ffffffff81051b86>] warn_slowpath_fmt+0x46/0x50
      [  370.818006]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818008]  [<ffffffff81092fcf>] ? prepare_to_wait+0x2f/0x90
      [  370.818010]  [<ffffffff810776ef>] __might_sleep+0x7f/0x90
      [  370.818014]  [<ffffffffa0000c03>] raid1_unplug+0xd3/0x170 [raid1]
      [  370.818024]  [<ffffffff81421d9a>] blk_flush_plug_list+0x8a/0x1e0
      [  370.818028]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818031]  [<ffffffff819e21b0>] io_schedule_timeout+0x130/0x140
      [  370.818033]  [<ffffffff819e3586>] bit_wait_io+0x36/0x50
      [  370.818034]  [<ffffffff819e31b5>] __wait_on_bit+0x65/0x90
      [  370.818041]  [<ffffffff8125b67c>] ? ext4_read_block_bitmap_nowait+0xbc/0x630
      [  370.818043]  [<ffffffff819e3550>] ? bit_wait+0x50/0x50
      [  370.818045]  [<ffffffff819e3302>] out_of_line_wait_on_bit+0x72/0x80
      [  370.818047]  [<ffffffff810935e0>] ? autoremove_wake_function+0x40/0x40
      [  370.818050]  [<ffffffff811de744>] __wait_on_buffer+0x44/0x50
      [  370.818053]  [<ffffffff8125ae80>] ext4_wait_block_bitmap+0xe0/0xf0
      [  370.818058]  [<ffffffff812975d6>] ext4_mb_init_cache+0x206/0x790
      [  370.818062]  [<ffffffff8114bc6c>] ? lru_cache_add+0x1c/0x50
      [  370.818064]  [<ffffffff81297c7e>] ext4_mb_init_group+0x11e/0x200
      [  370.818066]  [<ffffffff81298231>] ext4_mb_load_buddy+0x341/0x360
      [  370.818068]  [<ffffffff8129a1a3>] ext4_mb_find_by_goal+0x93/0x2f0
      [  370.818070]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818072]  [<ffffffff8129ab67>] ext4_mb_regular_allocator+0x67/0x460
      [  370.818074]  [<ffffffff81295b54>] ? ext4_mb_normalize_request+0x1e4/0x5b0
      [  370.818076]  [<ffffffff8129ca4b>] ext4_mb_new_blocks+0x4cb/0x620
      [  370.818079]  [<ffffffff81290956>] ext4_ext_map_blocks+0x4c6/0x14d0
      [  370.818081]  [<ffffffff812a4d4e>] ? ext4_es_lookup_extent+0x4e/0x290
      [  370.818085]  [<ffffffff8126399d>] ext4_map_blocks+0x14d/0x4f0
      [  370.818088]  [<ffffffff81266fbd>] ext4_writepages+0x76d/0xe50
      [  370.818094]  [<ffffffff81149691>] do_writepages+0x21/0x50
      [  370.818097]  [<ffffffff811d5c00>] __writeback_single_inode+0x60/0x490
      [  370.818099]  [<ffffffff811d630a>] writeback_sb_inodes+0x2da/0x590
      [  370.818103]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818105]  [<ffffffff811abf4b>] ? trylock_super+0x1b/0x50
      [  370.818107]  [<ffffffff811d665f>] __writeback_inodes_wb+0x9f/0xd0
      [  370.818109]  [<ffffffff811d69db>] wb_writeback+0x34b/0x3c0
      [  370.818111]  [<ffffffff811d70df>] bdi_writeback_workfn+0x23f/0x550
      [  370.818116]  [<ffffffff8106bbd8>] process_one_work+0x1c8/0x570
      [  370.818117]  [<ffffffff8106bb5b>] ? process_one_work+0x14b/0x570
      [  370.818119]  [<ffffffff8106c09b>] worker_thread+0x11b/0x470
      [  370.818121]  [<ffffffff8106bf80>] ? process_one_work+0x570/0x570
      [  370.818124]  [<ffffffff81071868>] kthread+0xf8/0x110
      [  370.818126]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818129]  [<ffffffff819e9322>] ret_from_fork+0x42/0x70
      [  370.818131]  [<ffffffff81071770>] ? kthread_create_on_node+0x210/0x210
      [  370.818132] ---[ end trace 7b4deb71e68b6605 ]---
      
      V2: don't change ->in_iowait
      
      Cc: NeilBrown <neilb@suse.de>
      Signed-off-by: Shaohua Li <shli@fb.com>
      Reviewed-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      10d784ea
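      A stand-alone sketch of the 'from_schedule' distinction the fix relies
      on; flush_plug_list() and driver_unplug() below are stand-ins, not the
      block-layer API:

        #include <stdbool.h>
        #include <stdio.h>

        /* A driver callback must not sleep when invoked from the scheduler
         * path, so sleep-capable work is deferred instead of run inline.
         */
        static void driver_unplug(bool from_schedule)
        {
            if (from_schedule)
                printf("driver: defer unplug work, sleeping is forbidden here\n");
            else
                printf("driver: run unplug work inline, sleeping is allowed\n");
        }

        static void flush_plug_list(bool from_schedule)
        {
            driver_unplug(from_schedule);
        }

        /* Model of the fix: the io_schedule() path always flags the flush
         * as coming from the scheduler.
         */
        static void model_io_schedule_flush(void)
        {
            flush_plug_list(true);
        }

        int main(void)
        {
            flush_plug_list(false);      /* ordinary plug finish */
            model_io_schedule_flush();   /* schedule-time flush after the fix */
            return 0;
        }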
  3. 18 May 2015 (1 commit)
    • sched,perf: Fix periodic timers · 4cfafd30
      Authored by Peter Zijlstra
      In the two commits listed in the Fixes tags below we have periodic
      timers that can stop themselves when they're no longer required, but
      that need to be (re)started when their idle condition changes.

      A further complication is that we want the timer handler to always do
      the forward, so that it always deals correctly with overruns, and we
      do not want a race where the handler has already decided to stop but
      the (external) restart still sees the timer as active, leaving us with
      a 'lost' timer.
      
      The problem with the current code is that the re-start can come before
      the callback does the forward, at which point the forward from the
      callback will WARN about forwarding an enqueued timer.
      
      Now, conceptually it's easy to detect whether you're before or after
      the forward by comparing the expiration time against the current time.
      Of course, that's expensive (and racy) because we don't have the
      current time.
      
      Alternatively one could cache this state inside the timer, but then
      everybody pays the overhead of maintaining this extra state, and that
      is undesired.
      
      The only other option that I could see is the external timer_active
      variable, which I tried to kill before. I would love a nicer interface
      for this seemingly simple 'problem' but alas.
      
      Fixes: 272325c4 ("perf: Fix mux_interval hrtimer wreckage")
      Fixes: 77a4d1a1 ("sched: Cleanup bandwidth timers")
      Cc: pjt@google.com
      Cc: tglx@linutronix.de
      Cc: klamm@yandex-team.ru
      Cc: mingo@kernel.org
      Cc: bsegall@google.com
      Cc: hpa@zytor.com
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20150514102311.GX21418@twins.programming.kicks-ass.net
      4cfafd30
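      A stand-alone sketch of the external active flag this patch relies on;
      the hrtimer itself is not modeled and the names are illustrative. The
      point is that the start path and the handler's stop decision are
      serialized by a lock plus the flag, so a restart cannot race with the
      handler's forward:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct periodic_timer {
            pthread_mutex_t lock;
            bool            active;   /* the external state variable */
        };

        static void timer_start(struct periodic_timer *t)
        {
            pthread_mutex_lock(&t->lock);
            if (!t->active) {              /* only arm if the handler stopped it */
                t->active = true;
                printf("armed timer\n");
            }
            pthread_mutex_unlock(&t->lock);
        }

        /* Called from the (modeled) timer callback once per period. */
        static void timer_tick(struct periodic_timer *t, bool still_needed)
        {
            pthread_mutex_lock(&t->lock);
            if (still_needed) {
                printf("handler forwards and re-arms\n");
            } else {
                t->active = false;         /* stop decision made under the lock */
                printf("handler stopped timer\n");
            }
            pthread_mutex_unlock(&t->lock);
        }

        int main(void)
        {
            struct periodic_timer t = { .active = false };

            pthread_mutex_init(&t.lock, NULL);
            timer_start(&t);
            timer_tick(&t, true);
            timer_tick(&t, false);
            timer_start(&t);               /* safe restart after the handler stopped it */
            return 0;
        }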
  4. 17 May 2015 (1 commit)
    • sched: Fix function declaration return type mismatch · 58ac93e4
      Authored by Nicholas Mc Guire
      Static code checking was unhappy with:
      
        ./kernel/sched/fair.c:162 WARNING: return of wrong type
                      int != unsigned int
      
      get_update_sysctl_factor() is declared to return int but is
      currently  returning an unsigned int. The first few preprocessed
      lines are:
      
       static int get_update_sysctl_factor(void)
       {
       unsigned int cpus = ({ int __min1 = (cpumask_weight(cpu_online_mask));
       int __min2 = (8); __min1 < __min2 ? __min1: __min2; });
       unsigned int factor;
      
      The type used by min_t() should be 'unsigned int', and the return type
      of get_update_sysctl_factor() should also be 'unsigned int', as its
      call site update_sysctl() expects 'unsigned int' and the values
      involved:
      
        'factor'
        'sysctl_sched_min_granularity'
        'sched_nr_latency'
        'sysctl_sched_wakeup_granularity'
      
      ... are also all 'unsigned int'; in addition, cpumask_weight() also
      returns 'unsigned int'.
      
      So the natural type to use around here is 'unsigned int'.
      
      ( Patch was compile tested with x86_64_defconfig +
        CONFIG_SCHED_DEBUG=y and the changed sections in
        kernel/sched/fair.i were reviewed. )
      Signed-off-by: Nicholas Mc Guire <hofrat@osadl.org>
      [ Improved the changelog a bit. ]
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431716742-11077-1-git-send-email-hofrat@osadl.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      58ac93e4
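      A stand-alone sketch of the corrected shape of the declaration;
      online_cpus() stands in for cpumask_weight(cpu_online_mask), and the
      scaling-mode handling of the real kernel/sched/fair.c function is
      omitted:

        #include <stdio.h>

        static unsigned int online_cpus(void)
        {
            return 16;    /* stand-in value */
        }

        /* Return type is 'unsigned int', matching the values it is derived
         * from and the expectations of its call site update_sysctl().
         */
        static unsigned int get_update_sysctl_factor(void)
        {
            unsigned int cpus = online_cpus();

            if (cpus > 8)
                cpus = 8;    /* models min_t(unsigned int, cpus, 8) */
            return cpus;
        }

        int main(void)
        {
            printf("factor: %u\n", get_update_sysctl_factor());
            return 0;
        }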
  5. 08 May 2015 (10 commits)
  6. 29 Apr 2015 (1 commit)
  7. 27 Apr 2015 (1 commit)
  8. 23 Apr 2015 (1 commit)
  9. 22 Apr 2015 (3 commits)
  10. 16 Apr 2015 (1 commit)
  11. 08 Apr 2015 (1 commit)
  12. 03 Apr 2015 (2 commits)
  13. 02 Apr 2015 (4 commits)
  14. 27 Mar 2015 (6 commits)