1. 09 2月, 2016 1 次提交
    • M
      sched/debug: Make schedstats a runtime tunable that is disabled by default · cb251765
      Mel Gorman 提交于
      schedstats is very useful during debugging and performance tuning but it
      incurs overhead to calculate the stats. As such, even though it can be
      disabled at build time, it is often enabled as the information is useful.
      
      This patch adds a kernel command-line and sysctl tunable to enable or
      disable schedstats on demand (when it's built in). It is disabled
      by default as someone who knows they need it can also learn to enable
      it when necessary.
      
      The benefits are dependent on how scheduler-intensive the workload is.
      If it is then the patch reduces the number of cycles spent calculating
      the stats with a small benefit from reducing the cache footprint of the
      scheduler.
      
      These measurements were taken from a 48-core 2-socket
      machine with Xeon(R) E5-2670 v3 cpus although they were also tested on a
      single socket machine 8-core machine with Intel i7-3770 processors.
      
      netperf-tcp
                                 4.5.0-rc1             4.5.0-rc1
                                   vanilla          nostats-v3r1
      Hmean    64         560.45 (  0.00%)      575.98 (  2.77%)
      Hmean    128        766.66 (  0.00%)      795.79 (  3.80%)
      Hmean    256        950.51 (  0.00%)      981.50 (  3.26%)
      Hmean    1024      1433.25 (  0.00%)     1466.51 (  2.32%)
      Hmean    2048      2810.54 (  0.00%)     2879.75 (  2.46%)
      Hmean    3312      4618.18 (  0.00%)     4682.09 (  1.38%)
      Hmean    4096      5306.42 (  0.00%)     5346.39 (  0.75%)
      Hmean    8192     10581.44 (  0.00%)    10698.15 (  1.10%)
      Hmean    16384    18857.70 (  0.00%)    18937.61 (  0.42%)
      
      Small gains here, UDP_STREAM showed nothing intresting and neither did
      the TCP_RR tests. The gains on the 8-core machine were very similar.
      
      tbench4
                                       4.5.0-rc1             4.5.0-rc1
                                         vanilla          nostats-v3r1
      Hmean    mb/sec-1         500.85 (  0.00%)      522.43 (  4.31%)
      Hmean    mb/sec-2         984.66 (  0.00%)     1018.19 (  3.41%)
      Hmean    mb/sec-4        1827.91 (  0.00%)     1847.78 (  1.09%)
      Hmean    mb/sec-8        3561.36 (  0.00%)     3611.28 (  1.40%)
      Hmean    mb/sec-16       5824.52 (  0.00%)     5929.03 (  1.79%)
      Hmean    mb/sec-32      10943.10 (  0.00%)    10802.83 ( -1.28%)
      Hmean    mb/sec-64      15950.81 (  0.00%)    16211.31 (  1.63%)
      Hmean    mb/sec-128     15302.17 (  0.00%)    15445.11 (  0.93%)
      Hmean    mb/sec-256     14866.18 (  0.00%)    15088.73 (  1.50%)
      Hmean    mb/sec-512     15223.31 (  0.00%)    15373.69 (  0.99%)
      Hmean    mb/sec-1024    14574.25 (  0.00%)    14598.02 (  0.16%)
      Hmean    mb/sec-2048    13569.02 (  0.00%)    13733.86 (  1.21%)
      Hmean    mb/sec-3072    12865.98 (  0.00%)    13209.23 (  2.67%)
      
      Small gains of 2-4% at low thread counts and otherwise flat.  The
      gains on the 8-core machine were slightly different
      
      tbench4 on 8-core i7-3770 single socket machine
      Hmean    mb/sec-1        442.59 (  0.00%)      448.73 (  1.39%)
      Hmean    mb/sec-2        796.68 (  0.00%)      794.39 ( -0.29%)
      Hmean    mb/sec-4       1322.52 (  0.00%)     1343.66 (  1.60%)
      Hmean    mb/sec-8       2611.65 (  0.00%)     2694.86 (  3.19%)
      Hmean    mb/sec-16      2537.07 (  0.00%)     2609.34 (  2.85%)
      Hmean    mb/sec-32      2506.02 (  0.00%)     2578.18 (  2.88%)
      Hmean    mb/sec-64      2511.06 (  0.00%)     2569.16 (  2.31%)
      Hmean    mb/sec-128     2313.38 (  0.00%)     2395.50 (  3.55%)
      Hmean    mb/sec-256     2110.04 (  0.00%)     2177.45 (  3.19%)
      Hmean    mb/sec-512     2072.51 (  0.00%)     2053.97 ( -0.89%)
      
      In constract, this shows a relatively steady 2-3% gain at higher thread
      counts. Due to the nature of the patch and the type of workload, it's
      not a surprise that the result will depend on the CPU used.
      
      hackbench-pipes
                               4.5.0-rc1             4.5.0-rc1
                                 vanilla          nostats-v3r1
      Amean    1        0.0637 (  0.00%)      0.0660 ( -3.59%)
      Amean    4        0.1229 (  0.00%)      0.1181 (  3.84%)
      Amean    7        0.1921 (  0.00%)      0.1911 (  0.52%)
      Amean    12       0.3117 (  0.00%)      0.2923 (  6.23%)
      Amean    21       0.4050 (  0.00%)      0.3899 (  3.74%)
      Amean    30       0.4586 (  0.00%)      0.4433 (  3.33%)
      Amean    48       0.5910 (  0.00%)      0.5694 (  3.65%)
      Amean    79       0.8663 (  0.00%)      0.8626 (  0.43%)
      Amean    110      1.1543 (  0.00%)      1.1517 (  0.22%)
      Amean    141      1.4457 (  0.00%)      1.4290 (  1.16%)
      Amean    172      1.7090 (  0.00%)      1.6924 (  0.97%)
      Amean    192      1.9126 (  0.00%)      1.9089 (  0.19%)
      
      Some small gains and losses and while the variance data is not included,
      it's close to the noise. The UMA machine did not show anything particularly
      different
      
      pipetest
                                   4.5.0-rc1             4.5.0-rc1
                                     vanilla          nostats-v2r2
      Min         Time        4.13 (  0.00%)        3.99 (  3.39%)
      1st-qrtle   Time        4.38 (  0.00%)        4.27 (  2.51%)
      2nd-qrtle   Time        4.46 (  0.00%)        4.39 (  1.57%)
      3rd-qrtle   Time        4.56 (  0.00%)        4.51 (  1.10%)
      Max-90%     Time        4.67 (  0.00%)        4.60 (  1.50%)
      Max-93%     Time        4.71 (  0.00%)        4.65 (  1.27%)
      Max-95%     Time        4.74 (  0.00%)        4.71 (  0.63%)
      Max-99%     Time        4.88 (  0.00%)        4.79 (  1.84%)
      Max         Time        4.93 (  0.00%)        4.83 (  2.03%)
      Mean        Time        4.48 (  0.00%)        4.39 (  1.91%)
      Best99%Mean Time        4.47 (  0.00%)        4.39 (  1.91%)
      Best95%Mean Time        4.46 (  0.00%)        4.38 (  1.93%)
      Best90%Mean Time        4.45 (  0.00%)        4.36 (  1.98%)
      Best50%Mean Time        4.36 (  0.00%)        4.25 (  2.49%)
      Best10%Mean Time        4.23 (  0.00%)        4.10 (  3.13%)
      Best5%Mean  Time        4.19 (  0.00%)        4.06 (  3.20%)
      Best1%Mean  Time        4.13 (  0.00%)        4.00 (  3.39%)
      
      Small improvement and similar gains were seen on the UMA machine.
      
      The gain is small but it stands to reason that doing less work in the
      scheduler is a good thing. The downside is that the lack of schedstats and
      tracepoints may be surprising to experts doing performance analysis until
      they find the existence of the schedstats= parameter or schedstats sysctl.
      It will be automatically activated for latencytop and sleep profiling to
      alleviate the problem. For tracepoints, there is a simple warning as it's
      not safe to activate schedstats in the context when it's known the tracepoint
      may be wanted but is unavailable.
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Reviewed-by: NSrikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1454663316-22048-1-git-send-email-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      cb251765
  2. 21 1月, 2016 2 次提交
  3. 20 1月, 2016 1 次提交
    • W
      pipe: limit the per-user amount of pages allocated in pipes · 759c0114
      Willy Tarreau 提交于
      On no-so-small systems, it is possible for a single process to cause an
      OOM condition by filling large pipes with data that are never read. A
      typical process filling 4000 pipes with 1 MB of data will use 4 GB of
      memory. On small systems it may be tricky to set the pipe max size to
      prevent this from happening.
      
      This patch makes it possible to enforce a per-user soft limit above
      which new pipes will be limited to a single page, effectively limiting
      them to 4 kB each, as well as a hard limit above which no new pipes may
      be created for this user. This has the effect of protecting the system
      against memory abuse without hurting other users, and still allowing
      pipes to work correctly though with less data at once.
      
      The limit are controlled by two new sysctls : pipe-user-pages-soft, and
      pipe-user-pages-hard. Both may be disabled by setting them to zero. The
      default soft limit allows the default number of FDs per process (1024)
      to create pipes of the default size (64kB), thus reaching a limit of 64MB
      before starting to create only smaller pipes. With 256 processes limited
      to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
      1084 MB of memory allocated for a user. The hard limit is disabled by
      default to avoid breaking existing applications that make intensive use
      of pipes (eg: for splicing).
      
      Reported-by: socketpair@gmail.com
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Mitigates: CVE-2013-4312 (Linux 2.0+)
      Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NWilly Tarreau <w@1wt.eu>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      759c0114
  4. 11 1月, 2016 1 次提交
  5. 06 1月, 2016 3 次提交
    • J
      sched/core: Move sched_entity::avg into separate cache line · 5a107804
      Jiri Olsa 提交于
      The sched_entity::avg collides with read-mostly sched_entity data.
      
      The perf c2c tool showed many read HITM accesses across
      many CPUs for sched_entity's cfs_rq and my_q, while having
      at the same time tons of stores for avg.
      
      After placing sched_entity::avg into separate cache line,
      the perf bench sched pipe showed around 20 seconds speedup.
      
      NOTE I cut out all perf events except for cycles and
      instructions from following output.
      
      Before:
        $ perf stat -r 5 perf bench sched pipe -l 10000000
        # Running 'sched/pipe' benchmark:
        # Executed 10000000 pipe operations between two processes
      
             Total time: 270.348 [sec]
      
              27.034805 usecs/op
                  36989 ops/sec
         ...
      
           245,537,074,035      cycles                    #    1.433 GHz
           187,264,548,519      instructions              #    0.77  insns per cycle
      
             272.653840535 seconds time elapsed           ( +-  1.31% )
      
      After:
        $ perf stat -r 5 perf bench sched pipe -l 10000000
        # Running 'sched/pipe' benchmark:
        # Executed 10000000 pipe operations between two processes
      
             Total time: 251.076 [sec]
      
              25.107678 usecs/op
                  39828 ops/sec
        ...
      
           244,573,513,928      cycles                    #    1.572 GHz
           187,409,641,157      instructions              #    0.76  insns per cycle
      
             251.679315188 seconds time elapsed           ( +-  0.31% )
      Signed-off-by: NJiri Olsa <jolsa@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Joe Mario <jmario@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/1449606239-28602-1-git-send-email-jolsa@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5a107804
    • P
      sched/core: Fix unserialized r-m-w scribbling stuff · be958bdc
      Peter Zijlstra 提交于
      Some of the sched bitfieds (notably sched_reset_on_fork) can be set
      on other than current, this can cause the r-m-w to race with other
      updates.
      
      Since all the sched bits are serialized by scheduler locks, pull them
      in a separate word.
      Reported-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: akpm@linux-foundation.org
      Cc: hannes@cmpxchg.org
      Cc: mhocko@kernel.org
      Cc: vdavydov@parallels.com
      Link: http://lkml.kernel.org/r/20151125150207.GM11639@twins.programming.kicks-ass.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      be958bdc
    • S
      sched/core: Check tgid in is_global_init() · 570f5241
      Sergey Senozhatsky 提交于
      Our global init task can have sub-threads, so ->pid check is not reliable
      enough for is_global_init(), we need to check tgid instead. This has been
      spotted by Oleg and a fix was proposed by Richard a long time ago (see the
      link below).
      
      Oleg wrote:
      
        : Because is_global_init() is only true for the main thread of /sbin/init.
        :
        : Just look at oom_unkillable_task(). It tries to not kill init. But, say,
        : select_bad_process() can happily find a sub-thread of is_global_init()
        : and still kill it.
      
      I recently hit the problem in question; re-sending the patch (to the
      best of my knowledge it has never been submitted) with updated function
      comment. Credit goes to Oleg and Richard.
      Suggested-by: NRichard Guy Briggs <rgb@redhat.com>
      Reported-by: NOleg Nesterov <oleg@redhat.com>
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NSerge Hallyn <serge.hallyn@canonical.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Eric W . Biederman <ebiederm@xmission.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Serge E . Hallyn <serge.hallyn@ubuntu.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://www.redhat.com/archives/linux-audit/2013-December/msg00086.htmlSigned-off-by: NIngo Molnar <mingo@kernel.org>
      570f5241
  6. 09 12月, 2015 1 次提交
    • T
      watchdog: introduce touch_softlockup_watchdog_sched() · 03e0d461
      Tejun Heo 提交于
      touch_softlockup_watchdog() is used to tell watchdog that scheduler
      stall is expected.  One group of usage is from paths where the task
      may not be able to yield for a long time such as performing slow PIO
      to finicky device and coming out of suspend.  The other is to account
      for scheduler and timer going idle.
      
      For scheduler softlockup detection, there's no reason to distinguish
      the two cases; however, workqueue lockup detector is planned and it
      can use the same signals from the former group while the latter would
      spuriously prevent detection.  This patch introduces a new function
      touch_softlockup_watchdog_sched() and convert the latter group to call
      it instead.  For now, it just calls touch_softlockup_watchdog() and
      there's no functional difference.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Ulrich Obergfell <uobergfe@redhat.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      03e0d461
  7. 04 12月, 2015 2 次提交
  8. 23 11月, 2015 1 次提交
  9. 10 11月, 2015 1 次提交
    • R
      coredump: add DAX filtering for ELF coredumps · 5037835c
      Ross Zwisler 提交于
      Add two new flags to the existing coredump mechanism for ELF files to
      allow us to explicitly filter DAX mappings.  This is desirable because
      DAX mappings, like hugetlb mappings, have the potential to be very
      large.
      
      Update the coredump_filter documentation in
      Documentation/filesystems/proc.txt so that it addresses the new DAX
      coredump flags.  Also update the documented default value of
      coredump_filter to be consistent with the core(5) man page.  The
      documentation being updated talks about bit 4, Dump ELF headers, which
      is enabled if CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS is turned on in the
      kernel config.  This kernel config option defaults to "y" if both ELF
      binaries and coredump are enabled.
      Signed-off-by: NRoss Zwisler <ross.zwisler@linux.intel.com>
      Acked-by: NJeff Moyer <jmoyer@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      5037835c
  10. 07 11月, 2015 3 次提交
  11. 06 11月, 2015 3 次提交
  12. 15 10月, 2015 2 次提交
    • J
      posix_cpu_timer: Reduce unnecessary sighand lock contention · c8d75aa4
      Jason Low 提交于
      It was found while running a database workload on large systems that
      significant time was spent trying to acquire the sighand lock.
      
      The issue was that whenever an itimer expired, many threads ended up
      simultaneously trying to send the signal. Most of the time, nothing
      happened after acquiring the sighand lock because another thread
      had just already sent the signal and updated the "next expire" time.
      The fastpath_timer_check() didn't help much since the "next expire"
      time was updated after the threads exit fastpath_timer_check().
      
      This patch addresses this by having the thread_group_cputimer structure
      maintain a boolean to signify when a thread in the group is already
      checking for process wide timers, and adds extra logic in the fastpath
      to check the boolean.
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Reviewed-by: NGeorge Spelvin <linux@horizon.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: hideaki.kimura@hpe.com
      Cc: terry.rudd@hpe.com
      Cc: scott.norton@hpe.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1444849677-29330-5-git-send-email-jason.low2@hp.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      c8d75aa4
    • J
      posix_cpu_timer: Convert cputimer->running to bool · d5c373eb
      Jason Low 提交于
      In the next patch in this series, a new field 'checking_timer' will
      be added to 'struct thread_group_cputimer'. Both this and the
      existing 'running' integer field are just used as boolean values. To
      save space in the structure, we can make both of these fields booleans.
      
      This is a preparatory patch to convert the existing running integer
      field to a boolean.
      Suggested-by: NGeorge Spelvin <linux@horizon.com>
      Signed-off-by: NJason Low <jason.low2@hp.com>
      Reviewed: George Spelvin <linux@horizon.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: hideaki.kimura@hpe.com
      Cc: terry.rudd@hpe.com
      Cc: scott.norton@hpe.com
      Cc: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1444849677-29330-4-git-send-email-jason.low2@hp.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d5c373eb
  13. 13 10月, 2015 1 次提交
    • A
      bpf: charge user for creation of BPF maps and programs · aaac3ba9
      Alexei Starovoitov 提交于
      since eBPF programs and maps use kernel memory consider it 'locked' memory
      from user accounting point of view and charge it against RLIMIT_MEMLOCK limit.
      This limit is typically set to 64Kbytes by distros, so almost all
      bpf+tracing programs would need to increase it, since they use maps,
      but kernel charges maximum map size upfront.
      For example the hash map of 1024 elements will be charged as 64Kbyte.
      It's inconvenient for current users and changes current behavior for root,
      but probably worth doing to be consistent root vs non-root.
      
      Similar accounting logic is done by mmap of perf_event.
      Signed-off-by: NAlexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      aaac3ba9
  14. 06 10月, 2015 2 次提交
    • P
      sched/core: Create preempt_count invariant · 609ca066
      Peter Zijlstra 提交于
      Assuming units of PREEMPT_DISABLE_OFFSET for preempt_count() numbers.
      
      Now that TASK_DEAD no longer results in preempt_count() == 3 during
      scheduling, we will always call context_switch() with preempt_count()
      == 2.
      
      However, we don't always end up with preempt_count() == 2 in
      finish_task_switch() because new tasks get created with
      preempt_count() == 1.
      
      Create FORK_PREEMPT_COUNT and set it to 2 and use that in the right
      places. Note that we cannot use INIT_PREEMPT_COUNT as that serves
      another purpose (boot).
      
      After this, preempt_count() is invariant across the context switch,
      with exception of PREEMPT_ACTIVE.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      609ca066
    • P
      sched/core: Simplify INIT_PREEMPT_COUNT · 87dcbc06
      Peter Zijlstra 提交于
      As per the following commit:
      
        d86ee480 ("sched: optimize cond_resched()")
      
      we need PREEMPT_ACTIVE to avoid cond_resched() from working before
      the scheduler is set up.
      
      However, keeping preemption disabled should do the same thing
      already, making the PREEMPT_ACTIVE part entirely redundant.
      
      The only complication is !PREEMPT_COUNT kernels, where
      PREEMPT_DISABLED ends up being 0. Instead we use an unconditional
      PREEMPT_OFFSET to set preempt_count() even on !PREEMPT_COUNT
      kernels.
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      87dcbc06
  15. 23 9月, 2015 1 次提交
  16. 21 9月, 2015 1 次提交
    • P
      rcu: Use single-stage IPI algorithm for RCU expedited grace period · 8203d6d0
      Paul E. McKenney 提交于
      The current preemptible-RCU expedited grace-period algorithm invokes
      synchronize_sched_expedited() to enqueue all tasks currently running
      in a preemptible-RCU read-side critical section, then waits for all the
      ->blkd_tasks lists to drain.  This works, but results in both an IPI and
      a double context switch even on CPUs that do not happen to be running
      in a preemptible RCU read-side critical section.
      
      This commit implements a new algorithm that causes less OS jitter.
      This new algorithm IPIs all online CPUs that are not idle (from an
      RCU perspective), but refrains from self-IPIs.  If a CPU receiving
      this IPI is not in a preemptible RCU read-side critical section (or
      is just now exiting one), it pushes quiescence up the rcu_node tree,
      otherwise, it sets a flag that will be handled by the upcoming outermost
      rcu_read_unlock(), which will then push quiescence up the tree.
      
      The expedited grace period must of course wait on any pre-existing blocked
      readers, and newly blocked readers must be queued carefully based on
      the state of both the normal and the expedited grace periods.  This
      new queueing approach also avoids the need to update boost state,
      courtesy of the fact that blocked tasks are no longer ever migrated to
      the root rcu_node structure.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      8203d6d0
  17. 17 9月, 2015 1 次提交
    • T
      sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem · 1ed13287
      Tejun Heo 提交于
      Note: This commit was originally committed as d59cfc09 but got
            reverted by 0c986253 due to the performance regression from
            the percpu_rwsem write down/up operations added to cgroup task
            migration path.  percpu_rwsem changes which alleviate the
            performance issue are pending for v4.4-rc1 merge window.
            Re-apply.
      
      The cgroup side of threadgroup locking uses signal_struct->group_rwsem
      to synchronize against threadgroup changes.  This per-process rwsem
      adds small overhead to thread creation, exit and exec paths, forces
      cgroup code paths to do lock-verify-unlock-retry dance in a couple
      places and makes it impossible to atomically perform operations across
      multiple processes.
      
      This patch replaces signal_struct->group_rwsem with a global
      percpu_rwsem cgroup_threadgroup_rwsem which is cheaper on the reader
      side and contained in cgroups proper.  This patch converts one-to-one.
      
      This does make writer side heavier and lower the granularity; however,
      cgroup process migration is a fairly cold path, we do want to optimize
      thread operations over it and cgroup migration operations don't take
      enough time for the lower granularity to matter.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.com
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      1ed13287
  18. 16 9月, 2015 1 次提交
    • T
      Revert "sched, cgroup: replace signal_struct->group_rwsem with a global percpu_rwsem" · 0c986253
      Tejun Heo 提交于
      This reverts commit d59cfc09.
      
      d59cfc09 ("sched, cgroup: replace signal_struct->group_rwsem with
      a global percpu_rwsem") and b5ba75b5 ("cgroup: simplify
      threadgroup locking") changed how cgroup synchronizes against task
      fork and exits so that it uses global percpu_rwsem instead of
      per-process rwsem; unfortunately, the write [un]lock paths of
      percpu_rwsem always involve synchronize_rcu_expedited() which turned
      out to be too expensive.
      
      Improvements for percpu_rwsem are scheduled to be merged in the coming
      v4.4-rc1 merge window which alleviates this issue.  For now, revert
      the two commits to restore per-process rwsem.  They will be re-applied
      for the v4.4-rc1 merge window.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Link: http://lkml.kernel.org/g/55F8097A.7000206@de.ibm.comReported-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: stable@vger.kernel.org # v4.2+
      0c986253
  19. 13 9月, 2015 2 次提交
    • D
      sched/fair: Make utilization tracking CPU scale-invariant · e3279a2e
      Dietmar Eggemann 提交于
      Besides the existing frequency scale-invariance correction factor, apply
      CPU scale-invariance correction factor to utilization tracking to
      compensate for any differences in compute capacity. This could be due to
      micro-architectural differences (i.e. instructions per seconds) between
      cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
      maximum frequency supported by individual cpus in SMP systems. In the
      existing implementation utilization isn't comparable between cpus as it
      is relative to the capacity of each individual CPU.
      
      Each segment of the sched_avg.util_sum geometric series is now scaled
      by the CPU performance factor too so the sched_avg.util_avg of each
      sched entity will be invariant from the particular CPU of the HMP/SMP
      system on which the sched entity is scheduled.
      
      With this patch, the utilization of a CPU stays relative to the max CPU
      performance of the fastest CPU in the system.
      
      In contrast to utilization (sched_avg.util_sum), load
      (sched_avg.load_sum) should not be scaled by compute capacity. The
      utilization metric is based on running time which only makes sense when
      cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
      more tasks are added), where load is runnable time which isn't limited
      by the capacity of the CPU and therefore is a better metric for
      overloaded scenarios. If we run two nice-0 busy loops on two cpus with
      different compute capacity their load should be similar since their
      compute demands are the same. We have to assume that the compute demand
      of any task running on a fully utilized CPU (no spare cycles = 100%
      utilization) is high and the same no matter of the compute capacity of
      its current CPU, hence we shouldn't scale load by CPU capacity.
      Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e3279a2e
    • D
      sched/fair: Make load tracking frequency scale-invariant · e0f5f3af
      Dietmar Eggemann 提交于
      Apply frequency scaling correction factor to per-entity load tracking to
      make it frequency invariant. Currently, load appears bigger when the CPU
      is running slower which affects load-balancing decisions.
      
      Each segment of the sched_avg.load_sum geometric series is now scaled by
      the current frequency so that the sched_avg.load_avg of each sched entity
      will be invariant from frequency scaling.
      
      Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
      well.
      Signed-off-by: NDietmar Eggemann <dietmar.eggemann@arm.com>
      Signed-off-by: NMorten Rasmussen <morten.rasmussen@arm.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NVincent Guittot <vincent.guittot@linaro.org>
      Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
      Cc: Juri Lelli <Juri.Lelli@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: daniel.lezcano@linaro.org
      Cc: mturquette@baylibre.com
      Cc: pang.xunlei@zte.com.cn
      Cc: rjw@rjwysocki.net
      Cc: sgurrappadi@nvidia.com
      Cc: yuyang.du@intel.com
      Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      e0f5f3af
  20. 05 9月, 2015 2 次提交
    • M
      mm: defer flush of writable TLB entries · d950c947
      Mel Gorman 提交于
      If a PTE is unmapped and it's dirty then it was writable recently.  Due to
      deferred TLB flushing, it's best to assume a writable TLB cache entry
      exists.  With that assumption, the TLB must be flushed before any IO can
      start or the page is freed to avoid lost writes or data corruption.  This
      patch defers flushing of potentially writable TLBs as long as possible.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d950c947
    • M
      mm: send one IPI per CPU to TLB flush all entries after unmapping pages · 72b252ae
      Mel Gorman 提交于
      An IPI is sent to flush remote TLBs when a page is unmapped that was
      potentially accesssed by other CPUs.  There are many circumstances where
      this happens but the obvious one is kswapd reclaiming pages belonging to a
      running process as kswapd and the task are likely running on separate
      CPUs.
      
      On small machines, this is not a significant problem but as machine gets
      larger with more cores and more memory, the cost of these IPIs can be
      high.  This patch uses a simple structure that tracks CPUs that
      potentially have TLB entries for pages being unmapped.  When the unmapping
      is complete, the full TLB is flushed on the assumption that a refill cost
      is lower than flushing individual entries.
      
      Architectures wishing to do this must give the following guarantee.
      
              If a clean page is unmapped and not immediately flushed, the
              architecture must guarantee that a write to that linear address
              from a CPU with a cached TLB entry will trap a page fault.
      
      This is essentially what the kernel already depends on but the window is
      much larger with this patch applied and is worth highlighting.  The
      architecture should consider whether the cost of the full TLB flush is
      higher than sending an IPI to flush each individual entry.  An additional
      architecture helper called flush_tlb_local is required.  It's a trivial
      wrapper with some accounting in the x86 case.
      
      The impact of this patch depends on the workload as measuring any benefit
      requires both mapped pages co-located on the LRU and memory pressure.  The
      case with the biggest impact is multiple processes reading mapped pages
      taken from the vm-scalability test suite.  The test case uses NR_CPU
      readers of mapped files that consume 10*RAM.
      
      Linear mapped reader on a 4-node machine with 64G RAM and 48 CPUs
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed      159.62 (  0.00%)   120.68 ( 24.40%)
      Ops lru-file-mmap-read-time_range    30.59 (  0.00%)     2.80 ( 90.85%)
      Ops lru-file-mmap-read-time_stddv     6.70 (  0.00%)     0.64 ( 90.38%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User          581.00       611.43
      System       5804.93      4111.76
      Elapsed       161.03       122.12
      
      This is showing that the readers completed 24.40% faster with 29% less
      system CPU time.  From vmstats, it is known that the vanilla kernel was
      interrupted roughly 900K times per second during the steady phase of the
      test and the patched kernel was interrupts 180K times per second.
      
      The impact is lower on a single socket machine.
      
                                                 4.2.0-rc1          4.2.0-rc1
                                                   vanilla       flushfull-v7
      Ops lru-file-mmap-read-elapsed       25.33 (  0.00%)    20.38 ( 19.54%)
      Ops lru-file-mmap-read-time_range     0.91 (  0.00%)     1.44 (-58.24%)
      Ops lru-file-mmap-read-time_stddv     0.28 (  0.00%)     0.47 (-65.34%)
      
                 4.2.0-rc1    4.2.0-rc1
                   vanilla flushfull-v7
      User           58.09        57.64
      System        111.82        76.56
      Elapsed        27.29        22.55
      
      It's still a noticeable improvement with vmstat showing interrupts went
      from roughly 500K per second to 45K per second.
      
      The patch will have no impact on workloads with no memory pressure or have
      relatively few mapped pages.  It will have an unpredictable impact on the
      workload running on the CPU being flushed as it'll depend on how many TLB
      entries need to be refilled and how long that takes.  Worst case, the TLB
      will be completely cleared of active entries when the target PFNs were not
      resident at all.
      
      [sasha.levin@oracle.com: trace tlb flush after disabling preemption in try_to_unmap_flush]
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      72b252ae
  21. 12 8月, 2015 1 次提交
  22. 03 8月, 2015 4 次提交
    • Y
      sched/fair: Rewrite runnable load and utilization average tracking · 9d89c257
      Yuyang Du 提交于
      The idea of runnable load average (let runnable time contribute to weight)
      was proposed by Paul Turner and Ben Segall, and it is still followed by
      this rewrite. This rewrite aims to solve the following issues:
      
      1. cfs_rq's load average (namely runnable_load_avg and blocked_load_avg) is
         updated at the granularity of an entity at a time, which results in the
         cfs_rq's load average is stale or partially updated: at any time, only
         one entity is up to date, all other entities are effectively lagging
         behind. This is undesirable.
      
         To illustrate, if we have n runnable entities in the cfs_rq, as time
         elapses, they certainly become outdated:
      
           t0: cfs_rq { e1_old, e2_old, ..., en_old }
      
         and when we update:
      
           t1: update e1, then we have cfs_rq { e1_new, e2_old, ..., en_old }
      
           t2: update e2, then we have cfs_rq { e1_old, e2_new, ..., en_old }
      
           ...
      
         We solve this by combining all runnable entities' load averages together
         in cfs_rq's avg, and update the cfs_rq's avg as a whole. This is based
         on the fact that if we regard the update as a function, then:
      
         w * update(e) = update(w * e) and
      
         update(e1) + update(e2) = update(e1 + e2), then
      
         w1 * update(e1) + w2 * update(e2) = update(w1 * e1 + w2 * e2)
      
         therefore, by this rewrite, we have an entirely updated cfs_rq at the
         time we update it:
      
           t1: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           t2: update cfs_rq { e1_new, e2_new, ..., en_new }
      
           ...
      
      2. cfs_rq's load average is different between top rq->cfs_rq and other
         task_group's per CPU cfs_rqs in whether or not blocked_load_average
         contributes to the load.
      
         The basic idea behind runnable load average (the same for utilization)
         is that the blocked state is taken into account as opposed to only
         accounting for the currently runnable state. Therefore, the average
         should include both the runnable/running and blocked load averages.
         This rewrite does that.
      
         In addition, we also combine runnable/running and blocked averages
         of all entities into the cfs_rq's average, and update it together at
         once. This is based on the fact that:
      
           update(runnable) + update(blocked) = update(runnable + blocked)
      
         This significantly reduces the code as we don't need to separately
         maintain/update runnable/running load and blocked load.
      
      3. How task_group entities' share is calculated is complex and imprecise.
      
         We reduce the complexity in this rewrite to allow a very simple rule:
         the task_group's load_avg is aggregated from its per CPU cfs_rqs's
         load_avgs. Then group entity's weight is simply proportional to its
         own cfs_rq's load_avg / task_group's load_avg. To illustrate,
      
         if a task_group has { cfs_rq1, cfs_rq2, ..., cfs_rqn }, then,
      
         task_group_avg = cfs_rq1_avg + cfs_rq2_avg + ... + cfs_rqn_avg, then
      
         cfs_rqx's entity's share = cfs_rqx_avg / task_group_avg * task_group's share
      
      To sum up, this rewrite in principle is equivalent to the current one, but
      fixes the issues described above. Turns out, it significantly reduces the
      code complexity and hence increases clarity and efficiency. In addition,
      the new averages are more smooth/continuous (no spurious spikes and valleys)
      and updated more consistently and quickly to reflect the load dynamics.
      
      As a result, we have less load tracking overhead, better performance,
      and especially better power efficiency due to more balanced load.
      Signed-off-by: NYuyang Du <yuyang.du@intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: arjan@linux.intel.com
      Cc: bsegall@google.com
      Cc: dietmar.eggemann@arm.com
      Cc: fengguang.wu@intel.com
      Cc: len.brown@intel.com
      Cc: morten.rasmussen@arm.com
      Cc: pjt@google.com
      Cc: rafael.j.wysocki@intel.com
      Cc: umgwanakikbuti@gmail.com
      Cc: vincent.guittot@linaro.org
      Link: http://lkml.kernel.org/r/1436918682-4971-3-git-send-email-yuyang.du@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      9d89c257
    • K
      sched/preempt: Fix cond_resched_lock() and cond_resched_softirq() · fe32d3cd
      Konstantin Khlebnikov 提交于
      These functions check should_resched() before unlocking spinlock/bh-enable:
      preempt_count always non-zero => should_resched() always returns false.
      cond_resched_lock() worked iff spin_needbreak is set.
      
      This patch adds argument "preempt_offset" to should_resched().
      
      preempt_count offset constants for that:
      
        PREEMPT_DISABLE_OFFSET  - offset after preempt_disable()
        PREEMPT_LOCK_OFFSET     - offset after spin_lock()
        SOFTIRQ_DISABLE_OFFSET  - offset after local_bh_distable()
        SOFTIRQ_LOCK_OFFSET     - offset after spin_lock_bh()
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alexander Graf <agraf@suse.de>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: David Vrabel <david.vrabel@citrix.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Fixes: bdb43806 ("sched: Extract the basic add/sub preempt_count modifiers")
      Link: http://lkml.kernel.org/r/20150715095204.12246.98268.stgit@buzzSigned-off-by: NIngo Molnar <mingo@kernel.org>
      fe32d3cd
    • M
      sched/fair: Beef up wake_wide() · 63b0e9ed
      Mike Galbraith 提交于
      Josef Bacik reported that Facebook sees better performance with their
      1:N load (1 dispatch/node, N workers/node) when carrying an old patch
      to try very hard to wake to an idle CPU.  While looking at wake_wide(),
      I noticed that it doesn't pay attention to the wakeup of a many partner
      waker, returning 1 only when waking one of its many partners.
      
      Correct that, letting explicit domain flags override the heuristic.
      
      While at it, adjust task_struct bits, we don't need a 64-bit counter.
      Tested-by: NJosef Bacik <jbacik@fb.com>
      Signed-off-by: NMike Galbraith <umgwanakikbuti@gmail.com>
      [ Tidy things up. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: kernel-team<Kernel-team@fb.com>
      Cc: morten.rasmussen@arm.com
      Cc: riel@redhat.com
      Link: http://lkml.kernel.org/r/1436888390.7983.49.camel@gmail.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      63b0e9ed
    • P
      sched/cputime: Guarantee stime + utime == rtime · 9d7fb042
      Peter Zijlstra 提交于
      While the current code guarantees monotonicity for stime and utime
      independently of one another, it does not guarantee that the sum of
      both is equal to the total time we started out with.
      
      This confuses things (and peoples) who look at this sum, like top, and
      will report >100% usage followed by a matching period of 0%.
      
      Rework the code to provide both individual monotonicity and a coherent
      sum.
      Suggested-by: NFredrik Markstrom <fredrik.markstrom@gmail.com>
      Reported-by: NFredrik Markstrom <fredrik.markstrom@gmail.com>
      Tested-by: NFredrik Markstrom <fredrik.markstrom@gmail.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: jason.low2@hp.com
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      9d7fb042
  23. 18 7月, 2015 2 次提交
  24. 04 7月, 2015 1 次提交