1. 07 11月, 2015 1 次提交
  2. 06 11月, 2015 3 次提交
  3. 15 10月, 2015 2 次提交
    • J
      posix_cpu_timer: Reduce unnecessary sighand lock contention · c8d75aa4
      Jason Low 提交于
      It was found while running a database workload on large systems that
      significant time was spent trying to acquire the sighand lock.
      
      The issue was that whenever an itimer expired, many threads ended up
      simultaneously trying to send the signal. Most of the time, nothing
      happened after acquiring the sighand lock because another thread
      had just already sent the signal and updated the "next expire" time.
      The fastpath_timer_check() didn't help much since the "next expire"
      time was updated after the threads exit fastpath_timer_check().
      
      This patch addresses this by having the thread_group_cputimer structure
      maintain a boolean to signify when a thread in the group is already
      checking for process wide timers, and adds extra logic in the fastpath
      to check the boolean.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NGeorge Spelvin <linux@horizon.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: hideaki.kimura@hpe.com
      Cc: terry.rudd@hpe.com
      Cc: scott.norton@hpe.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1444849677-29330-5-git-send-email-jason.low2@hp.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c8d75aa4
    • J
      posix_cpu_timer: Convert cputimer->running to bool · d5c373eb
      Jason Low 提交于
      In the next patch in this series, a new field 'checking_timer' will
      be added to 'struct thread_group_cputimer'. Both this and the
      existing 'running' integer field are just used as boolean values. To
      save space in the structure, we can make both of these fields booleans.
      
      This is a preparatory patch to convert the existing running integer
      field to a boolean.
      Suggested-by: NGeorge Spelvin <linux@horizon.com>
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Reviewed: George Spelvin <linux@horizon.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: hideaki.kimura@hpe.com
      Cc: terry.rudd@hpe.com
      Cc: scott.norton@hpe.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1444849677-29330-4-git-send-email-jason.low2@hp.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d5c373eb
  4. 13 10月, 2015 1 次提交
    • A
      bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Alexei Starovoitov 提交于
      since eBPF programs and maps use kernel memory consider it 'locked' memory
      from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
      This limit is typically set to 64Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps,
      but kernel charges maximum map size upfront.
      For example the hash map of 1024 elements will be charged as 64Kbyte.
      It's inconvenient for current users and changes current behavior for root,
      but probably worth doing to be consistent root vs non-root.
      
      Similar accounting logic is done by mmap of perf_event.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aaac3ba9
  5. 06 10月, 2015 2 次提交
    • P
      sched/core: Create preempt_count invariant · 609ca066
      Peter Zijlstra 提交于
      Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers.
      
      Now that TASK_DEAD no longer results in preempt_count() == 3 during
      scheduling, we will always call context_switch() with preempt_count()
      == 2.
      
      However, we don't always end up with preempt_count() == 2 in
      finish_task_switch() because new tasks get created with
      preempt_count() == 1.
      
      Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right
      places. Note that we cannot use INIT_PREEMPT_COUNT as that serves
      another purpose (boot).
      
      After this, preempt_count() is invariant across the context switch,
      with exception of PREEMPT_ACTIVE.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      609ca066
    • P
      sched/core: Simplify INIT_PREEMPT_COUNT · 87dcbc06
      Peter Zijlstra 提交于
      As per the following commit:
      
        d86ee480 ("sched: optimize cond_resched()")
      
      we need PREEMPT_ACTIVE to avoid cond_resched() from working before
      the scheduler is set up.
      
      However, keeping preemption disabled should do the same thing
      already, making the PREEMPT_ACTIVE part entirely redundant.
      
      The only complication is !PREEMPT_COUNT kernels, where
      PREEMPT_DISABLED ends up being 0. Instead we use an unconditional
      PREEMPT_OFFSET to set preempt_count() even on !PREEMPT_COUNT
      kernels.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      87dcbc06
  6. 23 9月, 2015 1 次提交
  7. 21 9月, 2015 1 次提交
    • P
      rcu: Use single-stage IPI algorithm for RCU expedited grace period · 8203d6d0
      Paul E. McKenney 提交于
      The current preemptible-RCU expedited grace-period algorithm invokes
      synchronize_sched_expedited() to enqueue all tasks currently running
      in a preemptible-RCU read-side critical section, then waits for all the
      ->blkd_tasks lists to drain.  This works, but results in both an IPI and
      a double context switch even on CPUs that do not happen to be running
      in a preemptible RCU read-side critical section.
      
      This commit implements a new algorithm that causes less OS jitter.
      This new algorithm IPIs all online CPUs that are not idle (from an
      RCU perspective), but refrains from self-IPIs.  If a CPU receiving
      this IPI is not in a preemptible RCU read-side critical section (or
      is just now exiting one), it pushes quiescence up the rcu_node tree,
      otherwise, it sets a flag that will be handled by the upcoming outermost
      rcu_read_unlock(), which will then push quiescence up the tree.
      
      The expedited grace period must of course wait on any pre-existing blocked
      readers, and newly blocked readers must be queued carefully based on
      the state of both the normal and the expedited grace periods.  This
      new queueing approach also avoids the need to update boost state,
      courtesy of the fact that blocked tasks are no longer ever migrated to
      the root rcu_node structure.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      8203d6d0
  8. 17 9月, 2015 1 次提交
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · 1ed13287
      Tejun Heo 提交于
      Note: This commit was originally committed as d59cfc09 but got
            reverted by 0c986253 due to the performance regression from
            the percpu_rwsem write down/up operations added to cgroup task
            migration path.  percpu_rwsem changes which alleviate the
            performance issue are pending for v4.4-rc1 merge window.
            Re-apply.
      
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      1ed13287
  9. 16 9月, 2015 1 次提交
    • T
      Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" · 0c986253
      Tejun Heo 提交于
      This reverts commit d59cfc09.
      
      d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with
      a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify
      threadgroup locking") changed how cgroup synchronizes against task
      fork and exits so that it uses global percpu_rwsem instead of
      per-process rwsem; unfortunately, the write [un]lock paths of
      percpu_rwsem always involve synchronize_rcu_expedited() which turned
      out to be too expensive.
      
      Improvements for percpu_rwsem are scheduled to be merged in the coming
      v4.4-rc1 merge window which alleviates this issue.  For now, revert
      the two commits to restore per-process rwsem.  They will be re-applied
      for the v4.4-rc1 merge window.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org # v4.2+
      0c986253
  10. 13 9月, 2015 2 次提交
    • D
      sched/fair: Make utilization tracking CPU scale-invariant · e3279a2e
      Dietmar Eggemann 提交于
      Besides the existing frequency scale-invariance correction factor, apply
      CPU scale-invariance correction factor to utilization tracking to
      compensate for any differences in compute capacity. This could be due to
      micro-architectural differences (i.e. instructions per seconds) between
      cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
      maximum frequency supported by individual cpus in SMP systems. In the
      existing implementation utilization isn't comparable between cpus as it
      is relative to the capacity of each individual CPU.
      
      Each segment of the sched_avg.util_sum geometric series is now scaled
      by the CPU performance factor too so the sched_avg.util_avg of each
      sched entity will be invariant from the particular CPU of the HMP/SMP
      system on which the sched entity is scheduled.
      
      With this patch, the utilization of a CPU stays relative to the max CPU
      performance of the fastest CPU in the system.
      
      In contrast to utilization (sched_avg.util_sum), load
      (sched_avg.load_sum) should not be scaled by compute capacity. The
      utilization metric is based on running time which only makes sense when
      cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
      more tasks are added), where load is runnable time which isn't limited
      by the capacity of the CPU and therefore is a better metric for
      overloaded scenarios. If we run two nice-0 busy loops on two cpus with
      different compute capacity their load should be similar since their
      compute demands are the same. We have to assume that the compute demand
      of any task running on a fully utilized CPU (no spare cycles = 100%
      utilization) is high and the same no matter of the compute capacity of
      its current CPU, hence we shouldn't scale load by CPU capacity.
      Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e3279a2e
    • D
      sched/fair: Make load tracking frequency scale-invariant · e0f5f3af
      Dietmar Eggemann 提交于
      Apply frequency scaling correction factor to per-entity load tracking to
      make it frequency invariant. Currently, load appears bigger when the CPU
      is running slower which affects load-balancing decisions.
      
      Each segment of the sched_avg.load_sum geometric series is now scaled by
      the current frequency so that the sched_avg.load_avg of each sched entity
      will be invariant from frequency scaling.
      
      Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
      well.
      Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NVincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: daniel.lezcano@linaro.org
      Cc: mturquette@baylibre.com
      Cc: pang.xunlei@zte.com.cn
      Cc: rjw@rjwysocki.net
      Cc: sgurrappadi@nvidia.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e0f5f3af
  11. 05 9月, 2015 2 次提交
    • M
      mm: defer flush of writable TLB entries · d950c947
      Mel Gorman 提交于
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d950c947
    • M
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman 提交于
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
  12. 12 8月, 2015 1 次提交
  13. 03 8月, 2015 4 次提交
    • Y
      sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Yuyang Du 提交于
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of an entity at a time, which results in the
         cfs_rq's load average is stale or partially updated: at any time, only
         one entity is up to date, all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
         on the fact that if we regard the update as a function, then:
      
         w * update(e) = update(w * e) and
      
         update(e1) + update(e2) = update(e1 + e2), then
      
         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. cfs_rq's load average is different between top rq->cfs_rq and other
         task_group's per CPU cfs_rqs in whether or not blocked_load_average
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per CPU cfs_rqs's
         load_avgs. Then group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
      
      To sum up, this rewrite in principle is equivalent to the current one, but
      fixes the issues described above. Turns out, it significantly reduces the
      code complexity and hence increases clarity and efficiency. In addition,
      the new averages are more smooth/continuous (no spurious spikes and valleys)
      and updated more consistently and quickly to reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
      Signed-off-by: NYuyang Du <yuyang.du@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9d89c257
    • K
      sched/preempt: Fix cond_resched_lock() and cond_resched_softirq() · fe32d3cd
      Konstantin Khlebnikov 提交于
      These functions check should_resched() before unlocking spinlock/bh-enable:
      preempt_count always non-zero => should_resched() always returns false.
      cond_resched_lock() worked iff spin_needbreak is set.
      
      This patch adds argument "preempt_offset" to should_resched().
      
      preempt_count offset constants for that:
      
        PREEMPT_DISABLE_OFFSET  - offset after preempt_disable()
        PREEMPT_LOCK_OFFSET     - offset after spin_lock()
        SOFTIRQ_DISABLE_OFFSET  - offset after local_bh_distable()
        SOFTIRQ_LOCK_OFFSET     - offset after spin_lock_bh()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: bdb43806 ("sched: Extract the basic add/sub preempt_count modifiers")
      Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzzSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fe32d3cd
    • M
      sched/fair: Beef up wake_wide() · 63b0e9ed
      Mike Galbraith 提交于
      Josef Bacik reported that Facebook sees better performance with their
      1:N load (1 dispatch/node, N workers/node) when carrying an old patch
      to try very hard to wake to an idle CPU.  While looking at wake_wide(),
      I noticed that it doesn't pay attention to the wakeup of a many partner
      waker, returning 1 only when waking one of its many partners.
      
      Correct that, letting explicit domain flags override the heuristic.
      
      While at it, adjust task_struct bits, we don't need a 64-bit counter.
      Tested-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      [ Tidy things up. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team<Kernel-team@fb.com>
      Cc: morten.rasmussen@arm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      63b0e9ed
    • P
      sched/cputime: Guarantee stime + utime == rtime · 9d7fb042
      Peter Zijlstra 提交于
      While the current code guarantees monotonicity for stime and utime
      independently of one another, it does not guarantee that the sum of
      both is equal to the total time we started out with.
      
      This confuses things (and peoples) who look at this sum, like top, and
      will report >100% usage followed by a matching period of 0%.
      
      Rework the code to provide both individual monotonicity and a coherent
      sum.
      Suggested-by: NFredrik Markstrom <fredrik.markstrom@gmail.com>
      Reported-by: NFredrik Markstrom <fredrik.markstrom@gmail.com>
      Tested-by: NFredrik Markstrom <fredrik.markstrom@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jason.low2@hp.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9d7fb042
  14. 18 7月, 2015 2 次提交
  15. 04 7月, 2015 2 次提交
  16. 26 6月, 2015 1 次提交
    • J
      clone: support passing tls argument via C rather than pt_regs magic · 3033f14a
      Josh Triplett 提交于
      clone has some of the quirkiest syscall handling in the kernel, with a
      pile of special cases, historical curiosities, and architecture-specific
      calling conventions.  In particular, clone with CLONE_SETTLS accepts a
      parameter "tls" that the C entry point completely ignores and some
      assembly entry points overwrite; instead, the low-level arch-specific
      code pulls the tls parameter out of the arch-specific register captured
      as part of pt_regs on entry to the kernel.  That's a massive hack, and
      it makes the arch-specific code only work when called via the specific
      existing syscall entry points; because of this hack, any new clone-like
      system call would have to accept an identical tls argument in exactly
      the same arch-specific position, rather than providing a unified system
      call entry point across architectures.
      
      The first patch allows architectures to handle the tls argument via
      normal C parameter passing, if they opt in by selecting
      HAVE_COPY_THREAD_TLS.  The second patch makes 32-bit and 64-bit x86 opt
      into this.
      
      These two patches came out of the clone4 series, which isn't ready for
      this merge window, but these first two cleanup patches were entirely
      uncontroversial and have acks.  I'd like to go ahead and submit these
      two so that other architectures can begin building on top of this and
      opting into HAVE_COPY_THREAD_TLS.  However, I'm also happy to wait and
      send these through the next merge window (along with v3 of clone4) if
      anyone would prefer that.
      
      This patch (of 2):
      
      clone with CLONE_SETTLS accepts an argument to set the thread-local
      storage area for the new thread.  sys_clone declares an int argument
      tls_val in the appropriate point in the argument list (based on the
      various CLONE_BACKWARDS variants), but doesn't actually use or pass along
      that argument.  Instead, sys_clone calls do_fork, which calls
      copy_process, which calls the arch-specific copy_thread, and copy_thread
      pulls the corresponding syscall argument out of the pt_regs captured at
      kernel entry (knowing what argument of clone that architecture passes tls
      in).
      
      Apart from being awful and inscrutable, that also only works because only
      one code path into copy_thread can pass the CLONE_SETTLS flag, and that
      code path comes from sys_clone with its architecture-specific
      argument-passing order.  This prevents introducing a new version of the
      clone system call without propagating the same architecture-specific
      position of the tls argument.
      
      However, there's no reason to pull the argument out of pt_regs when
      sys_clone could just pass it down via C function call arguments.
      
      Introduce a new CONFIG_HAVE_COPY_THREAD_TLS for architectures to opt into,
      and a new copy_thread_tls that accepts the tls parameter as an additional
      unsigned long (syscall-argument-sized) argument.  Change sys_clone's tls
      argument to an unsigned long (which does not change the ABI), and pass
      that down to copy_thread_tls.
      
      Architectures that don't opt into copy_thread_tls will continue to ignore
      the C argument to sys_clone in favor of the pt_regs captured at kernel
      entry, and thus will be unable to introduce new versions of the clone
      syscall.
      
      Patch co-authored by Josh Triplett and Thiago Macieira.
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Acked-by: NAndy Lutomirski <luto@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thiago Macieira <thiago.macieira@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3033f14a
  17. 19 6月, 2015 1 次提交
    • T
      timer: Reduce timer migration overhead if disabled · bc7a34b8
      Thomas Gleixner 提交于
      Eric reported that the timer_migration sysctl is not really nice
      performance wise as it needs to check at every timer insertion whether
      the feature is enabled or not. Further the check does not live in the
      timer code, so we have an extra function call which checks an extra
      cache line to figure out that it is disabled.
      
      We can do better and store that information in the per cpu (hr)timer
      bases. I pondered to use a static key, but that's a nightmare to
      update from the nohz code and the timer base cache line is hot anyway
      when we select a timer base.
      
      The old logic enabled the timer migration unconditionally if
      CONFIG_NO_HZ was set even if nohz was disabled on the kernel command
      line.
      
      With this modification, we start off with migration disabled. The user
      visible sysctl is still set to enabled. If the kernel switches to NOHZ
      migration is enabled, if the user did not disable it via the sysctl
      prior to the switch. If nohz=off is on the kernel command line,
      migration stays disabled no matter what.
      
      Before:
        47.76%  hog       [.] main
        14.84%  [kernel]  [k] _raw_spin_lock_irqsave
         9.55%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.71%  [kernel]  [k] mod_timer
         6.24%  [kernel]  [k] lock_timer_base.isra.38
         3.76%  [kernel]  [k] detach_if_pending
         3.71%  [kernel]  [k] del_timer
         2.50%  [kernel]  [k] internal_add_timer
         1.51%  [kernel]  [k] get_nohz_timer_target
         1.28%  [kernel]  [k] __internal_add_timer
         0.78%  [kernel]  [k] timerfn
         0.48%  [kernel]  [k] wake_up_nohz_cpu
      
      After:
        48.10%  hog       [.] main
        15.25%  [kernel]  [k] _raw_spin_lock_irqsave
         9.76%  [kernel]  [k] _raw_spin_unlock_irqrestore
         6.50%  [kernel]  [k] mod_timer
         6.44%  [kernel]  [k] lock_timer_base.isra.38
         3.87%  [kernel]  [k] detach_if_pending
         3.80%  [kernel]  [k] del_timer
         2.67%  [kernel]  [k] internal_add_timer
         1.33%  [kernel]  [k] __internal_add_timer
         0.73%  [kernel]  [k] timerfn
         0.54%  [kernel]  [k] wake_up_nohz_cpu
      Reported-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Viresh Kumar <viresh.kumar@linaro.org>
      Cc: John Stultz <john.stultz@linaro.org>
      Cc: Joonwoo Park <joonwoop@codeaurora.org>
      Cc: Wenbo Wang <wenbo.wang@memblaze.com>
      Link: http://lkml.kernel.org/r/20150526224512.127050787@linutronix.deSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      bc7a34b8
  18. 05 6月, 2015 1 次提交
    • O
      signals: don't abuse __flush_signals() in selinux_bprm_committed_creds() · 9e7c8f8c
      Oleg Nesterov 提交于
      selinux_bprm_committed_creds()->__flush_signals() is not right, we
      shouldn't clear TIF_SIGPENDING unconditionally. There can be other
      reasons for signal_pending(): freezing(), JOBCTL_PENDING_MASK, and
      potentially more.
      
      Also change this code to check fatal_signal_pending() rather than
      SIGNAL_GROUP_EXIT, it looks a bit better.
      
      Now we can kill __flush_signals() before it finds another buggy user.
      
      Note: this code looks racy, we can flush a signal which was sent after
      the task SID has been updated.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      9e7c8f8c
  19. 27 5月, 2015 2 次提交
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · d59cfc09
      Tejun Heo 提交于
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      d59cfc09
    • T
      sched, cgroup: reorganize threadgroup locking · 7d7efec3
      Tejun Heo 提交于
      threadgroup_change_begin/end() are used to mark the beginning and end
      of threadgroup modifying operations to allow code paths which require
      a threadgroup to stay stable across blocking operations to synchronize
      against those sections using threadgroup_lock/unlock().
      
      It's currently implemented as a general mechanism in sched.h using
      per-signal_struct rwsem; however, this never grew non-cgroup use cases
      and becomes noop if !CONFIG_CGROUPS.  It turns out that cgroups is
      gonna be better served with a different sycnrhonization scheme and is
      a bit silly to keep cgroups specific details as a general mechanism.
      
      What's general here is identifying the places where threadgroups are
      modified.  This patch restructures threadgroup locking so that
      threadgroup_change_begin/end() become a place where subsystems which
      need to sycnhronize against threadgroup changes can hook into.
      
      cgroup_threadgroup_change_begin/end() which operate on the
      per-signal_struct rwsem are created and threadgroup_lock/unlock() are
      moved to cgroup.c and made static.
      
      This is pure reorganization which doesn't cause any functional
      changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      7d7efec3
  20. 19 5月, 2015 4 次提交
    • P
      sched/wait: Introduce TASK_NOLOAD and TASK_IDLE · 80ed87c8
      Peter Zijlstra 提交于
      Currently people use TASK_INTERRUPTIBLE to idle kthreads and wait for
      'work' because TASK_UNINTERRUPTIBLE contributes to the loadavg. Having
      all idle kthreads contribute to the loadavg is somewhat silly.
      
      Now mostly this works OK, because kthreads have all their signals
      masked. However there's a few sites where this is causing problems and
      TASK_UNINTERRUPTIBLE should be used, except for that loadavg issue.
      
      This patch adds TASK_NOLOAD which, when combined with
      TASK_UNINTERRUPTIBLE avoids the loadavg accounting.
      
      As most of imagined usage sites are loops where a thread wants to
      idle, waiting for work, a helper TASK_IDLE is introduced.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Julian Anastasov <ja@ssi.bg>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: NeilBrown <neilb@suse.de>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      80ed87c8
    • D
      sched/preempt, mm/fault: Count pagefault_disable() levels in pagefault_disabled · 8bcbde54
      David Hildenbrand 提交于
      Until now, pagefault_disable()/pagefault_enabled() used the preempt
      count to track whether in an environment with pagefaults disabled (can
      be queried via in_atomic()).
      
      This patch introduces a separate counter in task_struct to count the
      level of pagefault_disable() calls. We'll keep manipulating the preempt
      count to retain compatibility to existing pagefault handlers.
      
      It is now possible to verify whether in a pagefault_disable() envionment
      by calling pagefault_disabled(). In contrast to in_atomic() it will not
      be influenced by preempt_enable()/preempt_disable().
      
      This patch is based on a patch from Ingo Molnar.
      Reviewed-and-tested-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NDavid Hildenbrand <dahi@linux.vnet.ibm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: David.Laight@ACULAB.COM
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: airlied@linux.ie
      Cc: akpm@linux-foundation.org
      Cc: benh@kernel.crashing.org
      Cc: bigeasy@linutronix.de
      Cc: borntraeger@de.ibm.com
      Cc: daniel.vetter@intel.com
      Cc: heiko.carstens@de.ibm.com
      Cc: herbert@gondor.apana.org.au
      Cc: hocko@suse.cz
      Cc: hughd@google.com
      Cc: mst@redhat.com
      Cc: paulus@samba.org
      Cc: ralf@linux-mips.org
      Cc: schwidefsky@de.ibm.com
      Cc: yang.shi@windriver.com
      Link: http://lkml.kernel.org/r/1431359540-32227-2-git-send-email-dahi@linux.vnet.ibm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      8bcbde54
    • F
      sched/preempt: Merge preempt_mask.h into preempt.h · 92cf2118
      Frederic Weisbecker 提交于
      preempt_mask.h defines all the preempt_count semantics and related
      symbols: preempt, softirq, hardirq, nmi, preempt active, need resched,
      etc...
      
      preempt.h defines the accessors and mutators of preempt_count.
      
      But there is a messy dependency game around those two header files:
      
      	* preempt_mask.h includes preempt.h in order to access preempt_count()
      
      	* preempt_mask.h defines all preempt_count semantic and symbols
      	  except PREEMPT_NEED_RESCHED that is needed by asm/preempt.h
      	  Thus we need to define it from preempt.h, right before including
      	  asm/preempt.h, instead of defining it to preempt_mask.h with the
      	  other preempt_count symbols. Therefore the preempt_count semantics
      	  happen to be spread out.
      
      	* We plan to introduce preempt_active_[enter,exit]() to consolidate
      	  preempt_schedule*() code. But we'll need to access both preempt_count
      	  mutators (preempt_count_add()) and preempt_count symbols
      	  (PREEMPT_ACTIVE, PREEMPT_OFFSET). The usual place to define preempt
      	  operations is in preempt.h but then we'll need symbols in
      	  preempt_mask.h which already includes preempt.h. So we end up with
      	  a ressource circle dependency.
      
      Lets merge preempt_mask.h into preempt.h to solve these dependency issues.
      This way we gather semantic symbols and operation definition of
      preempt_count in a single file.
      
      This is a dumb copy-paste merge. Further merge re-arrangments are
      performed in a subsequent patch to ease review.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1431441711-29753-2-git-send-email-fweisbec@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      92cf2118
    • P
      locking/arch: Rename set_mb() to smp_store_mb() · b92b8b35
      Peter Zijlstra 提交于
      Since set_mb() is really about an smp_mb() -- not a IO/DMA barrier
      like mb() rename it to match the recent smp_load_acquire() and
      smp_store_release().
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      b92b8b35
  21. 15 5月, 2015 2 次提交
  22. 11 5月, 2015 1 次提交
    • N
      VFS: replace {, total_}link_count in task_struct with pointer to nameidata · 756daf26
      NeilBrown 提交于
      task_struct currently contains two ad-hoc members for use by the VFS:
      link_count and total_link_count.  These are only interesting to fs/namei.c,
      so exposing them explicitly is poor layering.  Incidentally, link_count
      isn't used anymore, so it can just die.
      
      This patches replaces those with a single pointer to 'struct nameidata'.
      This structure represents the current filename lookup of which
      there can only be one per process, and is a natural place to
      store total_link_count.
      
      This will allow the current "nameidata" argument to all
      follow_link operations to be removed as current->nameidata
      can be used instead in the _very_ few instances that care about
      it at all.
      
      As there are occasional circumstances where pathname lookup can
      recurse, such as through kern_path_locked, we always save and old
      current->nameidata (if there is one) when setting a new value, and
      make sure any active link_counts are preserved.
      
      follow_mount and follow_automount now get a 'struct nameidata *'
      rather than 'int flags' so that they can directly access
      total_link_count, rather than going through 'current'.
      Suggested-by: NAl Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: NNeilBrown <neilb@suse.de>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      756daf26
  23. 10 5月, 2015 1 次提交
  24. 08 5月, 2015 1 次提交
    • P
      perf: Fix software migrate events · ff303e66
      Peter Zijlstra 提交于
      Stephane asked about PERF_COUNT_SW_CPU_MIGRATIONS and I realized it
      was borken:
      
       > The problem is that the task isn't actually scheduled while its being
       > migrated (obviously), and if its not scheduled, the counters aren't
       > scheduled either, so there's no observing of the fact.
       >
       > A further problem with migrations is that many migrations happen from
       > softirq context, which is nested inside the 'random' task context of
       > whoemever happens to run at that time, similarly for the wakeup
       > migrations triggered from (soft)irq context. All those end up being
       > accounted in the task that's currently running, eg. your 'ls'.
      
      The below cures this by marking a task as migrated and accounting it
      on the subsequent sched_in().
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      ff303e66