1. 23 Sep, 2015 1 commit
  2. 13 Sep, 2015 2 commits
    • D
      sched/fair: Make utilization tracking CPU scale-invariant · e3279a2e
      Dietmar Eggemann committed
      Besides the existing frequency scale-invariance correction factor, apply
      CPU scale-invariance correction factor to utilization tracking to
      compensate for any differences in compute capacity. This could be due to
      micro-architectural differences (i.e. instructions per second) between
      cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
      maximum frequency supported by individual cpus in SMP systems. In the
      existing implementation utilization isn't comparable between cpus as it
      is relative to the capacity of each individual CPU.
      
      Each segment of the sched_avg.util_sum geometric series is now scaled
      by the CPU performance factor too so the sched_avg.util_avg of each
      sched entity will be invariant from the particular CPU of the HMP/SMP
      system on which the sched entity is scheduled.
      
      With this patch, the utilization of a CPU stays relative to the max CPU
      performance of the fastest CPU in the system.
      
      In contrast to utilization (sched_avg.util_sum), load
      (sched_avg.load_sum) should not be scaled by compute capacity. The
      utilization metric is based on running time which only makes sense when
      cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
      more tasks are added), whereas load is runnable time which isn't limited
      by the capacity of the CPU and therefore is a better metric for
      overloaded scenarios. If we run two nice-0 busy loops on two cpus with
      different compute capacity their load should be similar since their
      compute demands are the same. We have to assume that the compute demand
      of any task running on a fully utilized CPU (no spare cycles = 100%
      utilization) is high and the same regardless of the compute capacity of
      its current CPU, hence we shouldn't scale load by CPU capacity.
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
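As an illustration of the scaling described in e3279a2e, here is a minimal user-space sketch, not the kernel code. It assumes both correction factors are fixed-point values where 1024 means 100%; the names `scale_util_delta`, `scale_load_delta` and `CAPACITY_SHIFT` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CAPACITY_SHIFT 10                     /* assumed fixed point: 1024 == 100% */
#define CAPACITY_SCALE (1u << CAPACITY_SHIFT)

/*
 * Utilization: scale a running-time segment by the frequency factor and
 * by the CPU capacity factor (both relative to the fastest CPU).
 */
uint64_t scale_util_delta(uint64_t delta, uint32_t scale_freq, uint32_t scale_cpu)
{
    assert(scale_freq <= CAPACITY_SCALE && scale_cpu <= CAPACITY_SCALE);
    delta = (delta * scale_freq) >> CAPACITY_SHIFT;
    return (delta * scale_cpu) >> CAPACITY_SHIFT;
}

/* Load: scaled by frequency only, never by CPU capacity. */
uint64_t scale_load_delta(uint64_t delta, uint32_t scale_freq)
{
    assert(scale_freq <= CAPACITY_SCALE);
    return (delta * scale_freq) >> CAPACITY_SHIFT;
}
```

Under these assumptions, a segment of 1024 at full frequency on a CPU with half the capacity of the fastest one contributes 512 to utilization, while its load contribution stays at 1024.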
    • D
      sched/fair: Make load tracking frequency scale-invariant · e0f5f3af
      Dietmar Eggemann committed
      Apply frequency scaling correction factor to per-entity load tracking to
      make it frequency invariant. Currently, load appears bigger when the CPU
      is running slower which affects load-balancing decisions.
      
      Each segment of the sched_avg.load_sum geometric series is now scaled by
      the current frequency so that the sched_avg.load_avg of each sched entity
      will be invariant from frequency scaling.
      
      Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
      well.
      Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: daniel.lezcano@linaro.org
      Cc: mturquette@baylibre.com
      Cc: pang.xunlei@zte.com.cn
      Cc: rjw@rjwysocki.net
      Cc: sgurrappadi@nvidia.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
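To see why load "appears bigger when the CPU is running slower" in e0f5f3af, consider this small user-space sketch. It is an illustration under assumptions: the frequency factor is taken to be the current frequency relative to the maximum in 1024-based fixed point, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FREQ_SCALE_SHIFT 10 /* assumed fixed point: 1024 == maximum frequency */

/* Time in microseconds a task stays runnable to retire `cycles` cycles. */
uint64_t runnable_us(uint64_t cycles, uint32_t freq_mhz)
{
    assert(freq_mhz > 0);
    return cycles / freq_mhz; /* 1 MHz == 1 cycle per microsecond */
}

/* Correction factor: current frequency relative to the maximum. */
uint32_t freq_scale(uint32_t cur_mhz, uint32_t max_mhz)
{
    assert(max_mhz > 0 && cur_mhz <= max_mhz);
    return (uint32_t)(((uint64_t)cur_mhz << FREQ_SCALE_SHIFT) / max_mhz);
}

/* Contribution of one segment after applying the correction factor. */
uint64_t invariant_contrib(uint64_t cycles, uint32_t cur_mhz, uint32_t max_mhz)
{
    return (runnable_us(cycles, cur_mhz) * freq_scale(cur_mhz, max_mhz))
           >> FREQ_SCALE_SHIFT;
}
```

The same 500000 cycles of work take 250 us at 2000 MHz and 500 us at 1000 MHz, so the unscaled time doubles; with the correction factor applied, both cases contribute 250.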
  3. 05 Sep, 2015 2 commits
    • M
      mm: defer flush of writable TLB entries · d950c947
      Mel Gorman committed
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • M
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman committed
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accessed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machines get
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
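The batching idea can be sketched in user space as follows. This is a simplified model rather than the kernel implementation: the structure and function names are hypothetical, CPUs are represented as bits in a 64-bit mask, and an "IPI" is simply counted instead of sent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical batch: one bit per CPU that may cache a stale TLB entry. */
struct unmap_batch {
    uint64_t cpumask;
    bool flush_required;
};

/* Number of CPUs set in a mask, i.e. IPIs needed to reach all of them. */
unsigned int count_cpus(uint64_t mask)
{
    unsigned int n = 0;

    for (; mask; mask &= mask - 1) /* clear the lowest set bit each round */
        n++;
    return n;
}

/* Record the CPUs that may hold entries for a page being unmapped. */
void batch_add_page(struct unmap_batch *b, uint64_t page_cpus)
{
    assert(b);
    b->cpumask |= page_cpus;
    b->flush_required = true;
}

/* Flush once per CPU when unmapping completes; returns the IPIs counted. */
unsigned int batch_flush(struct unmap_batch *b)
{
    unsigned int ipis;

    assert(b);
    if (!b->flush_required)
        return 0;
    ipis = count_cpus(b->cpumask);
    b->cpumask = 0;
    b->flush_required = false;
    return ipis;
}
```

In this model, unmapping three pages cached on CPU sets {0,1}, {1} and {1,2} would cost five IPIs if flushed per page, but only three when batched, one per CPU.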
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupted 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads that have no memory pressure or
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 12 Aug, 2015 1 commit
  5. 03 Aug, 2015 4 commits
    • Y
      sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Yuyang Du committed
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of an entity at a time, which results in the
         cfs_rq's load average being stale or partially updated: at any time, only
         one entity is up to date, all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
         on the fact that if we regard the update as a function, then:
      
         w * update(e) = update(w * e) and
      
         update(e1) + update(e2) = update(e1 + e2), then
      
         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. cfs_rq's load average is different between top rq->cfs_rq and other
         task_group's per CPU cfs_rqs in whether or not blocked_load_average
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per CPU cfs_rqs's
         load_avgs. Then group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
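The share rule in item 3 can be written out as a short user-space sketch. The function names are hypothetical, integer division truncates, and the handling of an all-zero group is an assumption made for this example, not something the commit message specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg */
uint64_t task_group_avg(const uint64_t *cfs_rq_avg, size_t n)
{
    uint64_t sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += cfs_rq_avg[i];
    return sum;
}

/* cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share */
uint64_t group_entity_share(const uint64_t *cfs_rq_avg, size_t n, size_t x,
                            uint64_t tg_shares)
{
    uint64_t tg_avg;

    assert(x < n);
    tg_avg = task_group_avg(cfs_rq_avg, n);
    if (tg_avg == 0)
        return tg_shares; /* assumption: an idle group keeps its full share */
    /* multiply first so the integer division loses less precision */
    return tg_shares * cfs_rq_avg[x] / tg_avg;
}
```

For per-CPU averages of 100, 300 and 600 and a group share of 1024, the three group entities would receive 102, 307 and 614 respectively.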
      
      To sum up, this rewrite in principle is equivalent to the current one, but
      fixes the issues described above. Turns out, it significantly reduces the
      code complexity and hence increases clarity and efficiency. In addition,
      the new averages are more smooth/continuous (no spurious spikes and valleys)
      and updated more consistently and quickly to reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
      Signed-off-by: Yuyang Du <yuyang.du@intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • K
      sched/preempt: Fix cond_resched_lock() and cond_resched_softirq() · fe32d3cd
      Konstantin Khlebnikov committed
      These functions check should_resched() before unlocking spinlock/bh-enable:
      preempt_count always non-zero => should_resched() always returns false.
      cond_resched_lock() worked iff spin_needbreak is set.
      
      This patch adds argument "preempt_offset" to should_resched().
      
      preempt_count offset constants for that:
      
        PREEMPT_DISABLE_OFFSET  - offset after preempt_disable()
        PREEMPT_LOCK_OFFSET     - offset after spin_lock()
        SOFTIRQ_DISABLE_OFFSET  - offset after local_bh_disable()
        SOFTIRQ_LOCK_OFFSET     - offset after spin_lock_bh()
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: bdb43806 ("sched: Extract the basic add/sub preempt_count modifiers")
      Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzz
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
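The bug fixed by fe32d3cd and the effect of the new "preempt_offset" argument can be modelled in user space. This is a sketch only: the `cpu_model` structure and both functions are hypothetical stand-ins, and the numeric offset values are assumptions chosen for the model rather than values quoted from the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for this model only. */
#define PREEMPT_OFFSET          1
#define SOFTIRQ_OFFSET          (1 << 8)
#define PREEMPT_DISABLE_OFFSET  PREEMPT_OFFSET
#define PREEMPT_LOCK_OFFSET     PREEMPT_DISABLE_OFFSET
#define SOFTIRQ_DISABLE_OFFSET  (2 * SOFTIRQ_OFFSET)
#define SOFTIRQ_LOCK_OFFSET     (SOFTIRQ_DISABLE_OFFSET + PREEMPT_LOCK_OFFSET)

struct cpu_model {
    int preempt_count;
    bool need_resched;
};

/* Old check: can never be true while a spinlock is held (count != 0). */
bool should_resched_old(const struct cpu_model *c)
{
    assert(c);
    return c->preempt_count == 0 && c->need_resched;
}

/* New check: true when the count is exactly what the caller itself added. */
bool should_resched(const struct cpu_model *c, int preempt_offset)
{
    assert(c);
    return c->preempt_count == preempt_offset && c->need_resched;
}
```

With one spinlock held and a reschedule pending, the old check in this model returns false while `should_resched(c, PREEMPT_LOCK_OFFSET)` returns true; with a second, nested lock held the new check correctly returns false as well.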
    • M
      sched/fair: Beef up wake_wide() · 63b0e9ed
      Mike Galbraith committed
      Josef Bacik reported that Facebook sees better performance with their
      1:N load (1 dispatch/node, N workers/node) when carrying an old patch
      to try very hard to wake to an idle CPU.  While looking at wake_wide(),
      I noticed that it doesn't pay attention to the wakeup of a many partner
      waker, returning 1 only when waking one of its many partners.
      
      Correct that, letting explicit domain flags override the heuristic.
      
      While at it, adjust task_struct bits, we don't need a 64-bit counter.
      Tested-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Mike Galbraith <umgwanakikbuti@gmail.com>
      [ Tidy things up. ]
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team <Kernel-team@fb.com>
      Cc: morten.rasmussen@arm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • P
      sched/cputime: Guarantee stime + utime == rtime · 9d7fb042
      Peter Zijlstra committed
      While the current code guarantees monotonicity for stime and utime
      independently of one another, it does not guarantee that the sum of
      both is equal to the total time we started out with.
      
      This confuses things (and peoples) who look at this sum, like top, and
      will report >100% usage followed by a matching period of 0%.
      
      Rework the code to provide both individual monotonicity and a coherent
      sum.
      Suggested-by: Fredrik Markstrom <fredrik.markstrom@gmail.com>
      Reported-by: Fredrik Markstrom <fredrik.markstrom@gmail.com>
      Tested-by: Fredrik Markstrom <fredrik.markstrom@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jason.low2@hp.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
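One way to get both individual monotonicity and a coherent sum, as 9d7fb042 requires, is sketched below in user-space C. This is a simplified illustration and not the kernel's code: the function name is hypothetical, multiplication overflow is ignored, and attributing all time to stime when no ticks were sampled is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Last values reported to user space; each one only ever grows. */
struct prev_cputime {
    uint64_t utime;
    uint64_t stime;
};

/*
 * Split rtime into utime/stime in the ratio of the sampled values,
 * keeping each monotonic and their sum equal to rtime.
 */
void cputime_adjust_sketch(uint64_t utime, uint64_t stime, uint64_t rtime,
                           struct prev_cputime *prev)
{
    uint64_t total = utime + stime;

    assert(prev);
    /* Never report less total time than was already reported. */
    if (prev->utime + prev->stime >= rtime)
        return;

    /* Scale stime to rtime; with no samples, charge everything to stime. */
    stime = total ? stime * rtime / total : rtime;

    /* stime must not go backwards; utime gets the remainder. */
    if (stime < prev->stime)
        stime = prev->stime;
    utime = rtime - stime;

    /* utime must not go backwards either; rebalance stime if it would. */
    if (utime < prev->utime) {
        utime = prev->utime;
        stime = rtime - utime;
    }

    prev->utime = utime;
    prev->stime = stime;
}
```

Because the early return guarantees rtime exceeds the previously reported sum, whichever value is clamped to its previous reading leaves the other strictly above its own previous reading, so both constraints hold together.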
  6. 18 Jul, 2015 2 commits
  7. 04 Jul, 2015 2 commits
  8. 26 Jun, 2015 1 commit
    • J
      clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Josh Triplett committed
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
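The difference between the two approaches can be modelled in user space. In this sketch the register file, thread structure and both functions are hypothetical stand-ins; only the `CLONE_SETTLS` flag value is taken from the Linux UAPI headers.

```c
#include <assert.h>

#define CLONE_SETTLS 0x00080000UL /* flag value from the Linux UAPI headers */

/* Hypothetical model of the syscall arguments saved at kernel entry. */
struct regs_model {
    unsigned long arg[6];
};

struct thread_model {
    unsigned long tls;
};

/* Old scheme: dig tls out of the saved registers at a fixed per-arch slot. */
void copy_thread_from_regs(unsigned long flags, const struct regs_model *regs,
                           int arch_tls_slot, struct thread_model *t)
{
    assert(arch_tls_slot >= 0 && arch_tls_slot < 6);
    if (flags & CLONE_SETTLS)
        t->tls = regs->arg[arch_tls_slot];
}

/* New scheme: the syscall entry point passes tls down as a C argument. */
void copy_thread_with_tls(unsigned long flags, unsigned long tls,
                          struct thread_model *t)
{
    if (flags & CLONE_SETTLS)
        t->tls = tls;
}
```

If a hypothetical new clone-like syscall placed tls in slot 1 while the architecture's clone uses slot 4, the register-based variant would read the wrong slot, whereas the argument-passing variant is indifferent to where the entry point found the value.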
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Andy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 19 Jun, 2015 1 commit
    • T
      timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner committed
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ,
      migration is enabled unless the user disabled it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
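The enable/disable rules above amount to a small state machine, sketched here in user-space C. The structure and function names are hypothetical; the sketch only models the decision logic, with the cached flag standing in for the information stored in the per-CPU timer bases.

```c
#include <assert.h>
#include <stdbool.h>

struct timer_model {
    bool sysctl_migration;  /* user-visible knob, starts out enabled */
    bool nohz_active;       /* set once the kernel switches to NOHZ */
    bool migration_enabled; /* cached flag consulted on timer insertion */
};

/* Boot state: migration disabled, sysctl still reads as enabled. */
void timer_model_init(struct timer_model *m)
{
    assert(m);
    m->sysctl_migration = true;
    m->nohz_active = false;
    m->migration_enabled = false;
}

/* Recompute the cached flag whenever one of its inputs changes. */
void update_migration(struct timer_model *m)
{
    m->migration_enabled = m->sysctl_migration && m->nohz_active;
}

void set_sysctl(struct timer_model *m, bool on)
{
    m->sysctl_migration = on;
    update_migration(m);
}

/* With nohz=off on the command line the switch never happens. */
void switch_to_nohz(struct timer_model *m, bool nohz_off_cmdline)
{
    if (nohz_off_cmdline)
        return;
    m->nohz_active = true;
    update_migration(m);
}
```

The hot path then reads only `migration_enabled`, which in this model replaces the extra function call and sysctl lookup on every insertion.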
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  10. 05 Jun, 2015 1 commit
    • O
      signals: don't abuse __flush_signals() in selinux_bprm_committed_creds() · 9e7c8f8c
      Oleg Nesterov committed
      selinux_bprm_committed_creds()->__flush_signals() is not right, we
      shouldn't clear TIF_SIGPENDING unconditionally. There can be other
      reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
      potentially more.
      
      Also change this code to check fatal_signal_pending() rather than
      SIGNAL_GROUP_EXIT, it looks a bit better.
      
      Now we can kill __flush_signals() before it finds another buggy user.
      
      Note: this code looks racy, we can flush a signal which was sent after
      the task SID has been updated.
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Paul Moore <pmoore@redhat.com>
  11. 27 May, 2015 2 commits
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · d59cfc09
      Tejun Heo committed
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
    • T
      sched, cgroup: reorganize threadgroup locking · 7d7efec3
      Tejun Heo committed
      threadgroup_change_begin/end() are used to mark the beginning and end
      of threadgroup modifying operations to allow code paths which require
      a threadgroup to stay stable across blocking operations to synchronize
      against those sections using threadgroup_lock/unlock().
      
      It's currently implemented as a general mechanism in sched.h using
      per-signal_struct rwsem; however, this never grew non-cgroup use cases
      and becomes noop if !CONFIG_CGROUPS.  It turns out that cgroups is
      gonna be better served with a different synchronization scheme and it is
      a bit silly to keep cgroups specific details as a general mechanism.
      
      What's general here is identifying the places where threadgroups are
      modified.  This patch restructures threadgroup locking so that
      threadgroup_change_begin/end() become a place where subsystems which
      need to synchronize against threadgroup changes can hook into.
      
      cgroup_threadgroup_change_begin/end() which operate on the
      per-signal_struct rwsem are created and threadgroup_lock/unlock() are
      moved to cgroup.c and made static.
      
      This is pure reorganization which doesn't cause any functional
      changes.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
  12. 19 May, 2015 4 commits
    • P
      sched/wait: Introduce TASK_NOLOAD and TASK_IDLE · 80ed87c8
      Peter Zijlstra committed
      Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
      'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
      all idle kthreads contribute to the loadavg is somewhat silly.
      
      Now mostly this works OK, because kthreads have all their signals
      masked. However there's a few sites where this is causing problems and
      TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
      
      This patch adds TASK_NOLOAD which, when combined with
      TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
      
      As most of the imagined usage sites are loops where a thread wants to
      idle, waiting for work, a helper TASK_IDLE is introduced.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
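How TASK_NOLOAD combines with TASK_UNINTERRUPTIBLE in 80ed87c8 can be shown with a few bit flags. This is a user-space sketch: the bit values are assumptions for illustration and `contributes_to_load` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed bit values for this sketch. */
#define TASK_INTERRUPTIBLE    1
#define TASK_UNINTERRUPTIBLE  2
#define TASK_NOLOAD           1024

/* Helper state for kthreads idling while waiting for work. */
#define TASK_IDLE             (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)

/* Uninterruptible sleepers count towards loadavg unless NOLOAD is set. */
bool contributes_to_load(long state)
{
    assert(state >= 0);
    return (state & TASK_UNINTERRUPTIBLE) && !(state & TASK_NOLOAD);
}
```

A thread sleeping in `TASK_IDLE` is therefore immune to signals like any uninterruptible sleeper, yet it is left out of the load average in the same way as an interruptible one.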
    • D
      sched/preempt, mm/fault: Count pagefault_disable() levels in pagefault_disabled · 8bcbde54
      David Hildenbrand committed
      Until now, pagefault_disable()/pagefault_enable() used the preempt
      count to track whether we are in an environment with pagefaults
      disabled (which can be queried via in_atomic()).
      
      This patch introduces a separate counter in task_struct to count the
      level of pagefault_disable() calls. We'll keep manipulating the preempt
      count to retain compatibility with existing pagefault handlers.
      
      It is now possible to check whether we are in a pagefault_disable() environment
      by calling pagefault_disabled(). In contrast to in_atomic() it will not
      be influenced by preempt_enable()/preempt_disable().
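The relationship between the two counters can be modelled in user space as follows. All names carry a `model_` prefix to mark them as hypothetical stand-ins for the kernel primitives, and the task structure is reduced to the two counters discussed here.

```c
#include <assert.h>
#include <stdbool.h>

struct task_model {
    int preempt_count;   /* shared with preempt_disable() and friends */
    int pagefault_depth; /* the new, separate pagefault_disable() level */
};

void model_preempt_disable(struct task_model *t)
{
    t->preempt_count++;
}

void model_preempt_enable(struct task_model *t)
{
    assert(t->preempt_count > 0);
    t->preempt_count--;
}

/* Bumps both counters: the preempt count is kept for compatibility. */
void model_pagefault_disable(struct task_model *t)
{
    t->pagefault_depth++;
    t->preempt_count++;
}

void model_pagefault_enable(struct task_model *t)
{
    assert(t->pagefault_depth > 0 && t->preempt_count > 0);
    t->preempt_count--;
    t->pagefault_depth--;
}

/* Ambiguous: true for disabled preemption and for disabled pagefaults. */
bool model_in_atomic(const struct task_model *t)
{
    return t->preempt_count != 0;
}

/* Unambiguous: only reflects pagefault_disable() nesting. */
bool model_pagefault_disabled(const struct task_model *t)
{
    return t->pagefault_depth != 0;
}
```

After a plain `model_preempt_disable()`, `model_in_atomic()` reports true while `model_pagefault_disabled()` stays false, which is the distinction the separate counter makes possible.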
      
      This patch is based on a patch from Ingo Molnar.
      Reviewed-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: David Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-2-git-send-email-dahi@linux.vnet.ibm.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      8bcbde54
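The separate counter described in the pagefault_disable() commit above might look roughly like the following user-space sketch. The reduced `struct task` and the explicit task argument are assumptions; the kernel operates on the current task implicitly.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the relevant task_struct fields. */
struct task {
    int preempt_count;       /* still adjusted, for existing fault handlers */
    int pagefault_disabled;  /* nesting depth of pagefault_disable() calls */
};

static void pagefault_disable(struct task *t)
{
    t->pagefault_disabled++;
    t->preempt_count++;
}

static void pagefault_enable(struct task *t)
{
    t->pagefault_disabled--;
    t->preempt_count--;
}

/* Unlike a check on preempt_count, this looks only at the dedicated counter. */
static bool pagefault_disabled(const struct task *t)
{
    return t->pagefault_disabled != 0;
}

/* preempt_disable()/preempt_enable() touch only the preempt count. */
static void preempt_disable(struct task *t) { t->preempt_count++; }
static void preempt_enable(struct task *t)  { t->preempt_count--; }
```

With this split, a preempt_disable() section on its own no longer looks like a pagefault-disabled section.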
    • F
      sched/preempt: Merge preempt_mask.h into preempt.h · 92cf2118
      Frederic Weisbecker committed
      preempt_mask.h defines all the preempt_count semantics and related
      symbols: preempt, softirq, hardirq, nmi, preempt active, need resched,
      etc...
      
      preempt.h defines the accessors and mutators of preempt_count.
      
      But there is a messy dependency game around those two header files:
      
      	* preempt_mask.h includes preempt.h in order to access preempt_count()
      
      	* preempt_mask.h defines all preempt_count semantics and symbols
      	  except PREEMPT_NEED_RESCHED that is needed by asm/preempt.h.
      	  Thus we need to define it from preempt.h, right before including
      	  asm/preempt.h, instead of defining it in preempt_mask.h with the
      	  other preempt_count symbols. Therefore the preempt_count semantics
      	  happen to be spread out.
      
      	* We plan to introduce preempt_active_[enter,exit]() to consolidate
      	  preempt_schedule*() code. But we'll need to access both preempt_count
      	  mutators (preempt_count_add()) and preempt_count symbols
      	  (PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt
      	  operations is in preempt.h but then we'll need symbols in
      	  preempt_mask.h which already includes preempt.h. So we end up with
      	  a circular dependency.
      
      Let's merge preempt_mask.h into preempt.h to solve these dependency issues.
      This way we gather semantic symbols and operation definitions of
      preempt_count in a single file.
      
      This is a dumb copy-paste merge. Further merge rearrangements are
      performed in a subsequent patch to ease review.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      92cf2118
    • P
      locking/arch: Rename set_mb() to smp_store_mb() · b92b8b35
      Peter Zijlstra committed
      Since set_mb() is really about an smp_mb() -- not an IO/DMA barrier
      like mb() -- rename it to match the recent smp_load_acquire() and
      smp_store_release().
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b92b8b35
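As a rough user-space analogue of the renamed primitive above, the sketch below pairs a store with a full memory barrier using C11 atomics. The function name `store_mb()` is hypothetical and this is not the kernel macro itself, only an illustration of the "store, then SMP barrier" semantics the commit message refers to.

```c
#include <stdatomic.h>

/* User-space analogue (an assumption, not the kernel macro): a store
 * followed by a full memory barrier, so later loads and stores by this
 * thread are ordered after the store. */
static inline void store_mb(atomic_int *var, int value)
{
    atomic_store_explicit(var, value, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}
```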
  13. 15 May, 2015 2 commits
  14. 11 May, 2015 1 commit
    • N
      VFS: replace {, total_}link_count in task_struct with pointer to nameidata · 756daf26
      NeilBrown committed
      task_struct currently contains two ad-hoc members for use by the VFS:
      link_count and total_link_count.  These are only interesting to fs/namei.c,
      so exposing them explicitly is poor layering.  Incidentally, link_count
      isn't used anymore, so it can just die.
      
      This patch replaces those with a single pointer to 'struct nameidata'.
      This structure represents the current filename lookup of which
      there can only be one per process, and is a natural place to
      store total_link_count.
      
      This will allow the current "nameidata" argument to all
      follow_link operations to be removed as current->nameidata
      can be used instead in the _very_ few instances that care about
      it at all.
      
      As there are occasional circumstances where pathname lookup can
      recurse, such as through kern_path_locked, we always save any old
      current->nameidata (if there is one) when setting a new value, and
      make sure any active link_counts are preserved.
      
      follow_mount and follow_automount now get a 'struct nameidata *'
      rather than 'int flags' so that they can directly access
      total_link_count, rather than going through 'current'.
      Suggested-by: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      756daf26
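The save-and-restore scheme for nested lookups described in the nameidata commit above can be sketched as follows. The structures are heavily reduced stand-ins, the `saved` field and the explicit task argument are assumptions, and the helper names are chosen for illustration only.

```c
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structures. */
struct nameidata {
    int total_link_count;
    struct nameidata *saved;   /* the lookup we interrupted, if any */
};

struct task {
    struct nameidata *nameidata;
};

/* Start a (possibly nested) lookup: remember the old pointer and
 * carry over the link count accumulated so far. */
static void set_nameidata(struct task *cur, struct nameidata *nd)
{
    struct nameidata *old = cur->nameidata;

    nd->total_link_count = old ? old->total_link_count : 0;
    nd->saved = old;
    cur->nameidata = nd;
}

/* Finish the lookup: reinstate the outer one and hand the count back. */
static void restore_nameidata(struct task *cur)
{
    struct nameidata *now = cur->nameidata;
    struct nameidata *old = now->saved;

    cur->nameidata = old;
    if (old)
        old->total_link_count = now->total_link_count;
}
```

Because the count is copied in both directions, links followed during a nested lookup still count against the limit of the outer lookup.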
  15. 10 May, 2015 1 commit
  16. 08 May, 2015 9 commits
    • P
      perf: Fix software migrate events · ff303e66
      Peter Zijlstra committed
      Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
      was borken:
      
       > The problem is that the task isn't actually scheduled while it's being
       > migrated (obviously), and if it's not scheduled, the counters aren't
       > scheduled either, so there's no observing of the fact.
       >
       > A further problem with migrations is that many migrations happen from
       > softirq context, which is nested inside the 'random' task context of
       > whoever happens to run at that time, similarly for the wakeup
       > migrations triggered from (soft)irq context. All those end up being
       > accounted in the task that's currently running, eg. your 'ls'.
      
      The below cures this by marking a task as migrated and accounting it
      on the subsequent sched_in().
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ff303e66
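The "mark now, account on the next sched_in()" idea from the perf commit above can be sketched in a few lines. The field and function names here are hypothetical and the event is reduced to a plain counter.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task bookkeeping. */
struct task {
    bool sched_migrated;       /* set when the task is moved to another CPU */
    unsigned long migrations;  /* stands in for the software migration event */
};

/* Called from the migration path, possibly in some other task's context. */
static void mark_migrated(struct task *t)
{
    t->sched_migrated = true;
}

/* Called when the task itself is scheduled in, i.e. when its counters
 * are active, so the event lands in the right task. */
static void sched_in(struct task *t)
{
    if (t->sched_migrated) {
        t->migrations++;
        t->sched_migrated = false;
    }
}
```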
    • P
      sched: Implement lockless wake-queues · 76751049
      Peter Zijlstra committed
      This is useful for locking primitives that can effect multiple
      wakeups per operation and want to avoid lock-internal lock contention
      by delaying the wakeups until we've released the lock-internal locks.
      
      Alternatively it can be used to avoid issuing multiple wakeups, and
      thus save a few cycles, in packet processing. Queue all target tasks
      and wakeup once you've processed all packets. That way you avoid
      waking the target task multiple times if there were multiple packets
      for the same task.
      
      Properties of a wake_q are:
      - Lockless, as queue head must reside on the stack.
      - Being a queue, maintains wakeup order passed by the callers. This can
        be important because otherwise, with highly contended locks,
        reordered wakeups could undermine any reliance on lock fairness.
      - A queued task cannot be added again until it is woken up.
      
      This patch adds the needed infrastructure into the scheduler code
      and uses the new wake_list to delay the futex wakeups until
      after we've released the hash bucket locks.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      [tweaks, adjustments, comments, etc.]
      Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Chris Mason <clm@fb.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: George Spelvin <linux@horizon.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Manfred Spraul <manfred@colorfullife.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430494072-30283-2-git-send-email-dave@stgolabs.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      76751049
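The wake_q properties listed in the commit above (caller-private head, FIFO order, no double queuing) can be sketched in user space with C11 atomics. The layout, the sentinel node, and the `wakeups` counter standing in for an actual wakeup are assumptions for illustration, not the kernel's implementation.

```c
#include <stdatomic.h>
#include <stddef.h>

struct wake_q_node {
    _Atomic(struct wake_q_node *) next;   /* NULL means "not queued" */
};

/* Hypothetical task; 'wakeups' stands in for a real wakeup. */
struct task {
    struct wake_q_node wake_q;
    int wakeups;
};

struct wake_q_head {
    _Atomic(struct wake_q_node *) first;
    _Atomic(struct wake_q_node *) *lastp;
};

static struct wake_q_node wake_q_tail;    /* end-of-list sentinel */
#define WAKE_Q_TAIL (&wake_q_tail)

static void wake_q_init(struct wake_q_head *head)
{
    atomic_init(&head->first, WAKE_Q_TAIL);
    head->lastp = &head->first;
}

static void wake_q_add(struct wake_q_head *head, struct task *t)
{
    struct wake_q_node *expected = NULL;

    /* Claim the node; fails if the task already sits on a wake queue. */
    if (!atomic_compare_exchange_strong(&t->wake_q.next, &expected, WAKE_Q_TAIL))
        return;

    /* The head is private to the caller, so linking needs no lock. */
    atomic_store(head->lastp, &t->wake_q);
    head->lastp = &t->wake_q.next;
}

/* Call after dropping the caller's own locks: wake tasks in queue order. */
static void wake_up_q(struct wake_q_head *head)
{
    struct wake_q_node *node = atomic_load(&head->first);

    while (node != WAKE_Q_TAIL) {
        struct task *t = (struct task *)((char *)node - offsetof(struct task, wake_q));

        node = atomic_load(&node->next);
        atomic_store(&t->wake_q.next, (struct wake_q_node *)NULL); /* may be queued again */
        t->wakeups++;
    }
}
```

A caller would build the queue while holding its lock, release the lock, and only then call `wake_up_q()`.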
    • J
      sched, timer: Use the atomic task_cputime in thread_group_cputimer · 71107445
      Jason Low committed
      Recent optimizations were made to thread_group_cputimer to improve its
      scalability by keeping track of cputime stats without a lock. However,
      the values were open coded to the structure, causing them to be at
      a different abstraction level from the regular task_cputime structure.
      Furthermore, any subsequent similar optimizations would not be able to
      share the new code, since they are specific to thread_group_cputimer.
      
      This patch adds the new task_cputime_atomic data structure (introduced in
      the previous patch in the series) to thread_group_cputimer for keeping
      track of the cputime atomically, which also helps generalize the code.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Link: http://lkml.kernel.org/r/1430251224-5764-6-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      71107445
    • J
      sched, timer: Provide an atomic 'struct task_cputime' data structure · 971e8a98
      Jason Low committed
      This patch adds an atomic variant of the 'struct task_cputime' data structure,
      which can be used to store and update task_cputime statistics without
      needing to do locking.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Link: http://lkml.kernel.org/r/1430251224-5764-5-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      971e8a98
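An atomic variant of a cputime structure, as introduced by the commit above, might be sketched in user space like this. The helper names and the use of C11 atomics are assumptions; the kernel uses its own 64-bit atomic type.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Plain snapshot form. */
struct task_cputime {
    uint64_t utime;
    uint64_t stime;
    uint64_t sum_exec_runtime;
};

/* Atomic variant: each field can be updated without taking a lock. */
struct task_cputime_atomic {
    _Atomic uint64_t utime;
    _Atomic uint64_t stime;
    _Atomic uint64_t sum_exec_runtime;
};

/* Hypothetical update helpers: a lock-free add per field. */
static void account_user_time(struct task_cputime_atomic *ct, uint64_t delta)
{
    atomic_fetch_add_explicit(&ct->utime, delta, memory_order_relaxed);
}

static void account_system_time(struct task_cputime_atomic *ct, uint64_t delta)
{
    atomic_fetch_add_explicit(&ct->stime, delta, memory_order_relaxed);
}

/* Each field is read atomically, but the three reads together are not
 * one consistent snapshot. */
static void sample_cputime(struct task_cputime *out, struct task_cputime_atomic *ct)
{
    out->utime = atomic_load(&ct->utime);
    out->stime = atomic_load(&ct->stime);
    out->sum_exec_runtime = atomic_load(&ct->sum_exec_runtime);
}
```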
    • J
      sched, timer: Replace spinlocks with atomics in thread_group_cputimer(), to improve scalability · 1018016c
      Jason Low committed
      While running a database workload, we found a scalability issue with itimers.
      
      Much of the problem was caused by the thread_group_cputimer spinlock.
      Each time we account for group system/user time, we need to obtain a
      thread_group_cputimer's spinlock to update the timers. On larger systems
      (such as a 16 socket machine), this caused more than 30% of total time
      to be spent trying to obtain this kernel lock to update these group
      timer stats.
      
      This patch converts the timers to 64-bit atomic variables and uses
      atomic add to update them without a lock. With this patch, the percent
      of total time spent updating thread group cputimer timers was reduced
      from 30% down to less than 1%.
      
      Note: On 32-bit systems using the generic 64-bit atomics, this causes
      sample_group_cputimer() to take locks 3 times instead of just 1 time.
      However, we tested this patch on a 32-bit ARM system using the
      generic atomics and did not find the overhead to be much of an issue.
      An explanation for why this isn't an issue is that 32-bit systems usually
      have small numbers of CPUs, and cacheline contention from extra spinlocks
      called periodically is not really apparent on smaller systems.
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Waiman Long <Waiman.Long@hp.com>
      Link: http://lkml.kernel.org/r/1430251224-5764-4-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      1018016c
    • J
      sched, timer: Convert usages of ACCESS_ONCE() in the scheduler to READ_ONCE()/WRITE_ONCE() · 316c1608
      Jason Low committed
      ACCESS_ONCE doesn't work reliably on non-scalar types. This patch removes
      the rest of the existing usages of ACCESS_ONCE() in the scheduler, and uses
      the new READ_ONCE() and WRITE_ONCE() APIs as appropriate.
      Signed-off-by: Jason Low <jason.low2@hp.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Waiman Long <Waiman.Long@hp.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Aswin Chandramouleeswaran <aswin@hp.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Galbraith <umgwanakikbuti@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Preeti U Murthy <preeti@linux.vnet.ibm.com>
      Cc: Scott J Norton <scott.norton@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Link: http://lkml.kernel.org/r/1430251224-5764-2-git-send-email-jason.low2@hp.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      316c1608
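To give a feel for the READ_ONCE()/WRITE_ONCE() conversion above, here is a simplified scalar-only sketch. These macro definitions are an assumption for illustration and are not the kernel's; they rely on the GNU `__typeof__` extension and do not address the non-scalar case the commit message mentions.

```c
/* Simplified, scalar-only sketch: force exactly one access through a
 * volatile-qualified lvalue so the compiler cannot merge or refetch it. */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))
```

Separating the read and write forms makes the direction of each access explicit at the call site.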
    • P
      signals, ptrace, sched: Fix a misaligned load inside ptrace_attach() · e7cc4173
      Palmer Dabbelt committed
      The misaligned load exception arises when running ptrace_attach() on
      RISC-V (which hasn't been upstreamed yet).  The problem is that
      wait_on_bit() takes a void* but then proceeds to call test_bit(),
      which takes a long*.  This allows an int-aligned pointer to be passed
      to test_bit(), which promptly fails.  This will manifest on any other
      asm-generic port where unaligned loads trap, where sizeof(long) >
      sizeof(int), and where task_struct.jobctl ends up not being
      long-aligned.
      
      This patch changes task_struct.jobctl to be a long, which ensures it
      has the correct alignment.
      Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bobby.prani@gmail.com
      Cc: oleg@redhat.com
      Cc: paulmck@linux.vnet.ibm.com
      Cc: richard@nod.at
      Cc: vdavydov@parallels.com
      Link: http://lkml.kernel.org/r/1430453997-32459-2-git-send-email-palmer@dabbelt.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      e7cc4173
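The alignment argument in the ptrace_attach() commit above can be made concrete with a small sketch. Both struct layouts and the helper `is_long_aligned()` are hypothetical; the point is only that declaring the member as a long makes the compiler guarantee long alignment, whereas an int member gets only int alignment.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical layouts; 'flags' merely forces int-granular placement. */
struct task_before { int flags; unsigned int  jobctl; };
struct task_after  { int flags; unsigned long jobctl; };

/* True if p may safely be dereferenced as an 'unsigned long *' on a
 * machine where misaligned loads trap. */
static int is_long_aligned(const void *p)
{
    return (uintptr_t)p % alignof(unsigned long) == 0;
}
```

Where sizeof(long) > sizeof(int), `task_before.jobctl` may sit at an address that is only int-aligned, while `task_after.jobctl` is always long-aligned.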
    • P
      signals, sched: Change all uses of JOBCTL_* from 'int' to 'long' · b76808e6
      Palmer Dabbelt committed
      c56fb6564dcd ("Fix a misaligned load inside ptrace_attach()") makes
      jobctl an "unsigned long".  It makes sense to have the masks applied
      to it match that type.  This is currently just a cosmetic change, but
      it will prevent the mask from being unexpectedly truncated if we ever
      end up with masks with more bits.
      
      One instance of "signr" is an int, but I left this alone because the
      mask ensures that it will never overflow.
      Signed-off-by: Palmer Dabbelt <palmer@dabbelt.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Chris Metcalf <cmetcalf@ezchip.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: bobby.prani@gmail.com
      Cc: oleg@redhat.com
      Cc: paulmck@linux.vnet.ibm.com
      Cc: richard@nod.at
      Cc: vdavydov@parallels.com
      Link: http://lkml.kernel.org/r/1430453997-32459-4-git-send-email-palmer@dabbelt.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      b76808e6
    • P
      sched: Move the loadavg code to a more obvious location · 3289bdb4
      Peter Zijlstra committed
      I could not find the loadavg code.. turns out it was hidden in a file
      called proc.c. It further got mingled up with the cruft per rq load
      indexes (which we really want to get rid of).
      
      Move the per rq load indexes into the fair.c load-balance code (that's
      the only thing that uses them) and rename proc.c to loadavg.c so we
      can find it again.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [ Did minor cleanups to the code. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      3289bdb4
  17. 07 May, 2015 1 commit
  18. 27 April, 2015 1 commit
  19. 13 April, 2015 1 commit
  20. 27 March, 2015 1 commit
    • V
      sched: Add sched_avg::utilization_avg_contrib · 36ee28e4
      Vincent Guittot committed
      Add new statistics which reflect the average time a task is running on the CPU
      and the sum of the running times of the tasks on a runqueue. The latter is
      named utilization_load_avg.
      
      This patch is based on the usage metric that was proposed in the 1st
      versions of the per-entity load tracking patchset by Paul Turner
      <pjt@google.com> but that has been removed afterwards. This version differs from
      the original one in the sense that it's not linked to task_group.
      
      The rq's utilization_load_avg will be used to check if a rq is overloaded or
      not instead of trying to compute how many tasks a group of CPUs can handle.
      
      Rename runnable_avg_period into avg_period as it is now used with both
      runnable_avg_sum and running_avg_sum.
      
      Add some descriptions of the variables to explain their differences.
      Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
      Cc: Paul Turner <pjt@google.com>
      Cc: Ben Segall <bsegall@google.com>
      Cc: Morten.Rasmussen@arm.com
      Cc: Paul Turner <pjt@google.com>
      Cc: dietmar.eggemann@arm.com
      Cc: efault@gmx.de
      Cc: kamalesh@linux.vnet.ibm.com
      Cc: linaro-kernel@lists.linaro.org
      Cc: nicolas.pitre@linaro.org
      Cc: preeti@linux.vnet.ibm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1425052454-25797-2-git-send-email-vincent.guittot@linaro.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      36ee28e4