1. 01 Feb, 2017 (7 commits)
  2. 20 Jan, 2017 (1 commit)
    • sched/clock: Fix hotplug crash · acb04058
      Authored by Peter Zijlstra
      Mike reported that he could trigger the WARN_ON_ONCE() in
      set_sched_clock_stable() using hotplug.
      
      This exposed a fundamental problem with the interface: we should never
      mark the TSC stable if we ever find it to be unstable. Therefore
      set_sched_clock_stable() is a broken interface.
      
      The reason it existed is that not having it is a pain: it means all
      relevant architecture code needs to call clear_sched_clock_stable()
      where appropriate.
      
      Of the three architectures that select HAVE_UNSTABLE_SCHED_CLOCK, ia64
      and parisc are trivial in that they never called
      set_sched_clock_stable(), so add an unconditional call to
      clear_sched_clock_stable() to them.
      
      For x86 the story is a lot more involved, and what this patch tries to
      do is ensure we preserve the status quo. So even if Cyrix or Transmeta
      have a usable TSC, they never called set_sched_clock_stable(), so they
      now get an explicit mark-unstable.
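
      As an illustration only (not the literal diff), the resulting contract
      for such architectures can be sketched as below; the function name is
      hypothetical and the header declaring clear_sched_clock_stable()
      differs between kernel versions:

        #include <linux/init.h>
        #include <linux/sched.h>  /* clear_sched_clock_stable(), pre-4.11 location */

        /* Hypothetical arch time init: never call set_sched_clock_stable();
         * only ever state, once and unconditionally, that the clock cannot
         * be trusted. */
        void __init arch_time_init_sketch(void)
        {
                /* ... existing clocksource/timer setup ... */
                clear_sched_clock_stable();  /* one-way: once unstable, always unstable */
        }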
      Reported-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: 9881b024 ("sched/clock: Delay switching sched_clock to stable")
      Link: http://lkml.kernel.org/r/20170119133633.GB6536@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 14 Jan, 2017 (2 commits)
    • sched/core: Separate out io_schedule_prepare() and io_schedule_finish() · 10ab5643
      Authored by Tejun Heo
      Now that IO schedule accounting is done inside __schedule(),
      io_schedule() can be split into three steps - prep, schedule, and
      finish - where the schedule part doesn't need any special annotation.
      This allows marking a sleep as iowait by simply wrapping an existing
      blocking function with io_schedule_prepare() and io_schedule_finish().
      
      Because task_struct->in_iowait is a single bit, the caller of
      io_schedule_prepare() needs to record and then pass its state to
      io_schedule_finish() to be safe regarding nesting.  While this isn't
      the prettiest, these functions are mostly going to be used by core
      functions and we don't want to use more space for ->in_iowait.
      
      While at it, as it's simple to do now, reimplement io_schedule()
      without unnecessarily going through io_schedule_timeout().
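
      A minimal kernel-style usage sketch (the mutex and the wrapper function
      are hypothetical; io_schedule_prepare()/io_schedule_finish() are the
      pair added by this patch):

        #include <linux/mutex.h>
        #include <linux/sched.h>

        /* Hypothetical wrapper: make an existing blocking call count as iowait. */
        static void wait_for_thing(struct mutex *lock)
        {
                int token;

                token = io_schedule_prepare();  /* marks current as in_iowait, returns old state */
                mutex_lock(lock);               /* the ordinary blocking call */
                io_schedule_finish(token);      /* restores the saved state; safe when nested */
        }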
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: adilger.kernel@dilger.ca
      Cc: jack@suse.com
      Cc: kernel-team@fb.com
      Cc: mingbo@fb.com
      Cc: tytso@mit.edu
      Link: http://lkml.kernel.org/r/1477673892-28940-3-git-send-email-tj@kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/clock: Delay switching sched_clock to stable · 9881b024
      Authored by Peter Zijlstra
      Currently we switch to the stable sched_clock if we guess the TSC is
      usable, and then switch back to the unstable path if it turns out the
      TSC isn't stable during SMP bringup after all.
      
      Delay switching to the stable path until after SMP bringup is
      complete. This way we'll avoid switching during the time we detect the
      worst of the TSC offences.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  4. 11 Jan, 2017 (1 commit)
  5. 13 Dec, 2016 (1 commit)
  6. 29 Nov, 2016 (1 commit)
    • sched/idle: Add support for tasks that inject idle · c1de45ca
      Authored by Peter Zijlstra
      Idle injection drivers such as the Intel powerclamp and ACPI PAD
      drivers use realtime tasks to take control of a CPU and then inject
      idle. There are two issues with this approach:
      
       1. Low efficiency: the injected idle task is treated as busy, so sched
          ticks do not stop during the injected idle period; the resulting
          unwanted wakeups can cost ~20% of the power savings.

       2. Idle accounting: injected idle time is presented to the user as busy.
      
      This patch addresses the issues by introducing a new PF_IDLE flag which
      allows any given task to be treated as an idle task while the flag is
      set.  Therefore, idle injection tasks can run through the normal flow
      of NOHZ idle enter/exit to get the correct accounting as well as tick
      stop when possible.

      The implication is that the idle task is then no longer limited to
      PID == 0.
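
      A hypothetical sketch of what an idle-injection kthread can then look
      like, using a play_idle()-style helper introduced alongside PF_IDLE
      (the helper name, its signature and the durations below are
      assumptions recalled for this era, not quoted from this commit):

        #include <linux/cpu.h>
        #include <linux/kthread.h>
        #include <linux/sched.h>

        /* Hypothetical injection thread: request real idle for a slice
         * (accounted as idle, tick allowed to stop), then yield the CPU. */
        static int idle_inject_fn_sketch(void *unused)
        {
                while (!kthread_should_stop()) {
                        play_idle(50);                          /* ~50 ms of injected idle */
                        schedule_timeout_interruptible(HZ / 5); /* let normal work run */
                }
                return 0;
        }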
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Jacob Pan <jacob.jun.pan@linux.intel.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
  7. 24 Nov, 2016 (1 commit)
  8. 23 Nov, 2016 (1 commit)
    • ptrace: Capture the ptracer's creds not PT_PTRACE_CAP · 64b875f7
      Authored by Eric W. Biederman
      When the flag PT_PTRACE_CAP was added the PTRACE_TRACEME path was
      overlooked.  This can result in incorrect behavior when an application
      like strace traces an exec of a setuid executable.
      
      Further, PT_PTRACE_CAP does not have enough information for making good
      security decisions as it does not report which user namespace the
      capability is in.  This has already allowed one mistake through
      insufficient granularity.
      
      I found this issue when I was testing another corner case of exec and
      discovered that I could not get strace to set PT_PTRACE_CAP even when
      running strace as root with a full set of caps.
      
      This change fixes the above issue with strace, allowing a setuid
      executable to be straced as root without disabling setuid.  More
      fundamentally, this change allows what is allowable at all times, by
      using the correct information in its decision.
      
      Cc: stable@vger.kernel.org
      Fixes: 4214e42f96d4 ("v2.4.9.11 -> v2.4.9.12")
      Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
  9. 22 Nov, 2016 (2 commits)
  10. 21 Nov, 2016 (1 commit)
  11. 17 Nov, 2016 (1 commit)
  12. 15 Nov, 2016 (2 commits)
  13. 31 Oct, 2016 (1 commit)
    • x86/intel_rdt: Add tasks files · e02737d5
      Authored by Fenghua Yu
      The root directory and all subdirectories are automatically populated
      with a read/write (mode 0644) file named "tasks". When read, it shows
      all the task IDs assigned to the resource group. Tasks can be added
      (one at a time) to a group by writing the task ID to the file, e.g. as
      in the sketch below.
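
      A userspace sketch of adding a task (the /sys/fs/resctrl mount point
      and the group name "grp0" are assumptions of this example, not taken
      from the patch):

        #include <stdio.h>
        #include <unistd.h>

        /* Write our own PID into a (hypothetical) resource group's tasks file. */
        int main(void)
        {
                FILE *f = fopen("/sys/fs/resctrl/grp0/tasks", "w");

                if (!f)
                        return 1;
                fprintf(f, "%d\n", (int)getpid());      /* one task ID at a time */
                return fclose(f) ? 1 : 0;
        }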
      
      Membership in a resource group is indicated by a new field in the
      task_struct "int closid" which holds the CLOSID for each task. The default
      resource group uses CLOSID=0 which means that all existing tasks when the
      resctrl file system is mounted belong to the default group.
      
      If a group is removed, tasks which are members of that group are moved to
      the default group.
      Signed-off-by: Fenghua Yu <fenghua.yu@intel.com>
      Cc: "Ravi V Shankar" <ravi.v.shankar@intel.com>
      Cc: "Tony Luck" <tony.luck@intel.com>
      Cc: "Shaohua Li" <shli@fb.com>
      Cc: "Sai Prakhya" <sai.praneeth.prakhya@intel.com>
      Cc: "Peter Zijlstra" <peterz@infradead.org>
      Cc: "Stephane Eranian" <eranian@google.com>
      Cc: "Dave Hansen" <dave.hansen@intel.com>
      Cc: "David Carrillo-Cisneros" <davidcc@google.com>
      Cc: "Nilay Vaish" <nilayvaish@gmail.com>
      Cc: "Vikas Shivappa" <vikas.shivappa@linux.intel.com>
      Cc: "Ingo Molnar" <mingo@elte.hu>
      Cc: "Borislav Petkov" <bp@suse.de>
      Cc: "H. Peter Anvin" <h.peter.anvin@intel.com>
      Link: http://lkml.kernel.org/r/1477692289-37412-8-git-send-email-fenghua.yu@intel.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 25 Oct, 2016 (1 commit)
  15. 08 Oct, 2016 (6 commits)
    • thp: reduce usage of huge zero page's atomic counter · 6fcb52a5
      Authored by Aaron Lu
      The global zero page is used to satisfy an anonymous read fault.  If
      THP(Transparent HugePage) is enabled then the global huge zero page is
      used.  The global huge zero page uses an atomic counter for reference
      counting and is allocated/freed dynamically according to its counter
      value.
      
      CPU time spent on that counter will greatly increase if there are a lot
      of processes doing anonymous read faults.  This patch proposes a way to
      reduce the access to the global counter so that the CPU load can be
      reduced accordingly.
      
      To do this, a new flag of the mm_struct is introduced:
      MMF_USED_HUGE_ZERO_PAGE.  With this flag, the process only needs to
      touch the global counter in two cases:
      
       1. The first time it uses the global huge zero page;
       2. The time when mm_users of its mm_struct reaches zero.
      
      Note that right now, the huge zero page is eligible to be freed as soon
      as its last use goes away.  With this patch, the page will not be
      eligible to be freed until the exit of the last process from which it
      was ever used.
      
      And with the use of mm_users, a kthread is not eligible to use the huge
      zero page either.  Since no kthread is using the huge zero page today,
      there is no difference after applying this patch.  But if that is not
      desired, I can change it to when mm_count reaches zero.
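
      A compact userspace illustration of the counting pattern described
      above (not kernel code; the names merely echo the kernel ones):

        #include <stdatomic.h>
        #include <stdbool.h>

        static _Atomic long huge_zero_refcount;         /* stand-in for the global counter */

        struct mm_sketch {
                atomic_bool used_huge_zero_page;        /* stand-in for MMF_USED_HUGE_ZERO_PAGE */
        };

        static void mm_get_huge_zero_page_sketch(struct mm_sketch *mm)
        {
                /* Only the first use per mm touches the shared counter. */
                if (!atomic_exchange(&mm->used_huge_zero_page, true))
                        atomic_fetch_add(&huge_zero_refcount, 1);
        }

        static void mm_put_huge_zero_page_sketch(struct mm_sketch *mm)
        {
                /* Drop the reference once, when the mm's last user goes away. */
                if (atomic_exchange(&mm->used_huge_zero_page, false))
                        atomic_fetch_sub(&huge_zero_refcount, 1);
        }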
      
      Test case used on a Haswell EP system:
      
        usemem -n 72 --readonly -j 0x200000 100G
      
      This spawns 72 processes; each mmaps 100G of anonymous space and then
      does read-only, sequential access to that space with a step of 2MB.
      
        CPU cycles from perf report for base commit:
            54.03%  usemem   [kernel.kallsyms]   [k] get_huge_zero_page
        CPU cycles from perf report for this commit:
             0.11%  usemem   [kernel.kallsyms]   [k] mm_get_huge_zero_page
      
      Performance (throughput) of the workload for base commit: 1784430792
      Performance (throughput) of the workload for this commit: 4726928591
      164% increase.

      Runtime of the workload for base commit: 707592 us
      Runtime of the workload for this commit: 303970 us
      ~57% drop.
      
      Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Ebru Akagunduz <ebru.akagunduz@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Authored by Michal Hocko
      There are only a few use_mm() users in the kernel right now.  Most of
      them write to the target memory, but the vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread, because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device,
      which is not tied to the mm owner's life cycle in general.  The device
      fd can be inherited or passed over to another process, which means
      that we really have to be careful about unexpected memory corruption,
      because unlike for normal oom victims the result will be visible
      outside of the oom victim context.
      
      Make sure that no kthread context (user of use_mm) can ever see
      corrupted data because of the oom reaper, and hook into the page fault
      path by checking the MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set
      the flag before it starts unmapping the address space, while the flag
      is checked after the page fault has been handled.  If the flag is set,
      SIGBUS is triggered so any g-u-p user will get an error code.
      
      Regular tasks do not need this protection because all tasks which share
      the mm are killed when the mm is reaped, and so the corruption will not
      outlive them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
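
      Roughly, the post-fault check can be pictured as the simplified sketch
      below (flag and field names as described above; the real patch places
      this at the end of the fault path):

        #include <linux/mm.h>
        #include <linux/sched.h>

        /* Simplified sketch: a kthread (use_mm) user of an mm that the OOM
         * reaper has begun unmapping gets SIGBUS rather than possibly-zeroed
         * data. */
        static int fixup_oom_unstable_sketch(struct mm_struct *mm, int ret)
        {
                if (unlikely((current->flags & PF_KTHREAD) &&
                             !(ret & VM_FAULT_ERROR) &&
                             test_bit(MMF_UNSTABLE, &mm->flags)))
                        return VM_FAULT_SIGBUS;
                return ret;
        }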
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Authored by Michal Hocko
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm, so we do not need the
      signal_struct counter anymore; let's get rid of it.
      
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with an MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel, oom: fix potential pgd_lock deadlock from __mmdrop · 7283094e
      Authored by Michal Hocko
      Lockdep complains that __mmdrop is not safe from the softirq context:
      
        =================================
        [ INFO: inconsistent lock state ]
        4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G        W
        ---------------------------------
        inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
        swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
         (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
        {SOFTIRQ-ON-W} state was registered at:
           __lock_acquire+0xa06/0x196e
           lock_acquire+0x139/0x1e1
           _raw_spin_lock+0x32/0x41
           __change_page_attr_set_clr+0x2a5/0xacd
           change_page_attr_set_clr+0x16f/0x32c
           set_memory_nx+0x37/0x3a
           free_init_pages+0x9e/0xc7
           alternative_instructions+0xa2/0xb3
           check_bugs+0xe/0x2d
           start_kernel+0x3ce/0x3ea
           x86_64_start_reservations+0x2a/0x2c
           x86_64_start_kernel+0x17a/0x18d
        irq event stamp: 105916
        hardirqs last  enabled at (105916): free_hot_cold_page+0x37e/0x390
        hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
        softirqs last  enabled at (105878): _local_bh_enable+0x42/0x44
        softirqs last disabled at (105879): irq_exit+0x6f/0xd1
      
        other info that might help us debug this:
         Possible unsafe locking scenario:
      
               CPU0
               ----
          lock(pgd_lock);
          <Interrupt>
            lock(pgd_lock);
      
         *** DEADLOCK ***
      
        1 lock held by swapper/1/0:
         #0:  (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800
      
        stack backtrace:
        CPU: 1 PID: 0 Comm: swapper/1 Tainted: G        W       4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
        Call Trace:
         <IRQ>
          print_usage_bug.part.25+0x259/0x268
          mark_lock+0x381/0x567
          __lock_acquire+0x993/0x196e
          lock_acquire+0x139/0x1e1
          _raw_spin_lock+0x32/0x41
          pgd_free+0x19/0x6b
          __mmdrop+0x25/0xb9
          __put_task_struct+0x103/0x11e
          delayed_put_task_struct+0x157/0x15e
          rcu_process_callbacks+0x660/0x800
          __do_softirq+0x1ec/0x4d5
          irq_exit+0x6f/0xd1
          smp_apic_timer_interrupt+0x42/0x4d
          apic_timer_interrupt+0x8e/0xa0
         <EOI>
          arch_cpu_idle+0xf/0x11
          default_idle_call+0x32/0x34
          cpu_startup_entry+0x20c/0x399
          start_secondary+0xfe/0x101
      
      Moreover, commit a79e53d8 ("x86/mm: Fix pgd_lock deadlock") was explicit
      that pgd_lock must not be taken from irq context.  This means that
      __mmdrop called from free_signal_struct has to be postponed to a user
      context.  We already have a similar mechanism for mmput_async, so we
      can use it here as well.  This is safe because mm_count is pinned by
      mm_users.
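
      In outline, the deferral can look like the sketch below, reusing the
      mm's async work item the way mmput_async does (simplified; names may
      differ slightly from the actual patch):

        #include <linux/sched.h>
        #include <linux/workqueue.h>

        /* Sketch: drop the last mm_count reference from process context
         * instead of from the RCU/softirq callback freeing the signal_struct. */
        static void mmdrop_async_fn_sketch(struct work_struct *work)
        {
                struct mm_struct *mm = container_of(work, struct mm_struct, async_put_work);

                __mmdrop(mm);           /* safe here: no longer in softirq context */
        }

        static void mmdrop_async_sketch(struct mm_struct *mm)
        {
                if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
                        INIT_WORK(&mm->async_put_work, mmdrop_async_fn_sketch);
                        schedule_work(&mm->async_put_work);
                }
        }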
      
      This fixes a bug introduced by "oom: keep mm of the killed task available".
      
      Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • oom: keep mm of the killed task available · 26db62f1
      Authored by Michal Hocko
      oom_reap_task has to call exit_oom_victim in order to make sure that
      the oom victim will not block the oom killer forever.  This is,
      however, opening new problems (e.g. oom_killer_disable exclusion - see
      commit 74070542 ("oom, suspend: fix oom_reaper vs. oom_killer_disable
      race")).  exit_oom_victim should ideally only be called from the
      victim's context.
      
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
      
        do_exit
          exit_mm
            tsk->mm = NULL;
            mmput
              __mmput
            exit_oom_victim
      
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      
      This patch takes a different approach.  We remember tsk->mm in the
      signal_struct and bind it to the signal_struct's life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread no
      longer have to rely on find_lock_task_mm and will have a reliable
      reference to the mm struct.  As a result all the oom-specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
      
      Increasing the signal_struct for something as unlikely as the oom
      killer is far from ideal, but this approach will make the code much
      more reasonable, and long term we might even want to move task->mm
      into the signal_struct anyway.  In the next step we might want to make
      the oom killer exclusion and access to memory reserves completely
      independent, which would also be nice.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,oom_reaper: do not attempt to reap a task twice · 8496afab
      Authored by Tetsuo Handa
      "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
      OOM reaper one more chance to retry, using the MMF_OOM_NOT_REAPABLE
      flag.  But the usefulness of the flag is rather limited and was never
      actually shown in practice.  If the flag is set, it means that the
      holder of mm->mmap_sem cannot call up_write(), presumably because it
      is blocked at an unkillable wait, waiting for another thread's memory
      allocation.  But since one of the threads sharing that mm will queue
      that mm immediately via the task_will_free_mem() shortcut (otherwise,
      oom_badness() would select the same mm again because its oom_score_adj
      value is unchanged), retrying an MMF_OOM_NOT_REAPABLE mm is unlikely
      to help.
      
      Let's always set MMF_OOM_REAPED.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  16. 30 Sep, 2016 (4 commits)
    • sched/core, ia64: Rename set_curr_task() · a458ae2e
      Authored by Peter Zijlstra
      Rename the ia64 only set_curr_task() function to free up the name.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Rewrite and improve select_idle_siblings() · 10e2f1ac
      Authored by Peter Zijlstra
      select_idle_siblings() is a known pain point for a number of
      workloads; it either does too much or not enough, and sometimes just
      gets it plain wrong.
      
      This rewrite attempts to address a number of issues (but sadly not
      all).
      
      The current code does an unconditional sched_domain iteration; with
      the intent of finding an idle core (on SMT hardware). The problems
      which this patch tries to address are:
      
       - it's pointless to look for idle cores if the machine is really busy;
         at that point you're just wasting cycles.

       - its behaviour is inconsistent between SMT and !SMT hardware, in
         that !SMT hardware ends up doing a scan for any idle CPU in the LLC
         domain, while SMT hardware does a scan for idle cores and, if that
         fails, falls back to a scan for idle threads on the 'target' core.
      
      The new code replaces the sched_domain scan with 3 explicit scans:
      
       1) search for an idle core in the LLC
       2) search for an idle CPU in the LLC
       3) search for an idle thread in the 'target' core
      
      where 1 and 3 are conditional on SMT support and 1 and 2 have runtime
      heuristics to skip the step.
      
      Step 1) is conditional on sd_llc_shared->has_idle_cores; when a cpu
      goes idle and sd_llc_shared->has_idle_cores is false, we scan all SMT
      siblings of the CPU going idle. Similarly, we clear
      sd_llc_shared->has_idle_cores when we fail to find an idle core.
      
      Step 2) tracks the average cost of the scan and compares this to the
      average idle time guestimate for the CPU doing the wakeup. There is a
      significant fudge factor involved to deal with the variability of the
      averages. Esp. hackbench was sensitive to this.
      
      Step 3) is unconditional; we assume (also per step 1) that scanning
      all SMT siblings in a core is 'cheap'.
      
      With this; SMT systems gain step 2, which cures a few benchmarks --
      notably one from Facebook.
      
      One 'feature' of the sched_domain iteration, which we preserve in the
      new code, is that it would start scanning from the 'target' CPU,
      instead of scanning the cpumask in cpu id order. This reduces the
      chance that multiple CPUs in the LLC, all scanning for an idle CPU,
      gang up and find the same one. The downside is that tasks can end up
      hopping across the LLC for no apparent reason.
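
      A simplified sketch of the resulting control flow (the has_idle_cores
      and scan-cost heuristics are elided; helper names follow the ones this
      rewrite introduces):

        /* Sketch only: the real code also applies the heuristics described above. */
        static int select_idle_sibling_sketch(struct task_struct *p, int target)
        {
                struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, target));
                int i;

                if (!sd)
                        return target;

                i = select_idle_core(p, sd, target);    /* 1) idle core in the LLC (SMT only) */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                i = select_idle_cpu(p, sd, target);     /* 2) idle CPU in the LLC, cost-bounded */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                i = select_idle_smt(p, sd, target);     /* 3) idle thread in the 'target' core */
                if ((unsigned int)i < nr_cpumask_bits)
                        return i;

                return target;
        }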
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Replace sd_busy/nr_busy_cpus with sched_domain_shared · 0e369d75
      Authored by Peter Zijlstra
      Move the nr_busy_cpus thing from its hacky sd->parent->groups->sgc
      location into the much more natural sched_domain_shared location.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/core: Introduce 'struct sched_domain_shared' · 24fc7edb
      Authored by Peter Zijlstra
      Since struct sched_domain is strictly per CPU, introduce a structure
      that is shared between all 'identical' sched_domains.
      
      Limit to SD_SHARE_PKG_RESOURCES domains for now, as we'll only use it
      for shared cache state; if another use comes up later we can easily
      relax this.
      
      While the sched_groups are normally shared between CPUs, they are not
      natural to use when we need some shared state on a domain level --
      since that would require the domain to have a parent, which is not a
      given.
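
      The shared structure itself stays small; roughly (fields as accumulated
      across this patch and the two related patches listed above, so treat it
      as a sketch of this series rather than any later kernel's layout):

        struct sched_domain_shared {
                atomic_t        ref;            /* reference count for the attached domains */
                atomic_t        nr_busy_cpus;   /* moved here from the sched_group_capacity spot */
                int             has_idle_cores; /* hint used by the select_idle_siblings() rewrite */
        };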
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  17. 22 Sep, 2016 (2 commits)
  18. 16 Sep, 2016 (2 commits)
  19. 15 Sep, 2016 (1 commit)
  20. 14 Sep, 2016 (1 commit)
    • cpufreq / sched: SCHED_CPUFREQ_IOWAIT flag to indicate iowait condition · 8c34ab19
      Authored by Rafael J. Wysocki
      Testing indicates that it is possible to improve performance
      significantly without increasing energy consumption too much by
      teaching cpufreq governors to bump up the CPU performance level if
      the in_iowait flag is set for the task in enqueue_task_fair().
      
      For this purpose, define a new cpufreq_update_util() flag
      SCHED_CPUFREQ_IOWAIT and modify enqueue_task_fair() to pass that
      flag to cpufreq_update_util() in the in_iowait case.  That generally
      requires cpufreq_update_util() to be called directly from there,
      because update_load_avg() may not be invoked in that case.
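
      In essence the fair-class enqueue path gains a hint along these lines
      (a hedged excerpt-style sketch; the exact wrapper used to reach
      cpufreq_update_util() differs between kernel versions):

        /* In enqueue_task_fair(): a wakeup that follows I/O wait carries the
         * SCHED_CPUFREQ_IOWAIT flag so the governor may boost the frequency. */
        if (p->in_iowait)
                cpufreq_update_util(rq, SCHED_CPUFREQ_IOWAIT);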
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Looks-good-to: Steve Muckle <smuckle@linaro.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
  21. 24 Aug, 2016 (1 commit)