1. 11 6月, 2019 7 次提交
    • S
      psi: introduce state_mask to represent stalled psi states · 8d1df33e
      Suren Baghdasaryan 提交于
      commit 33b2d6302abc4ccea1d9b3f095e2e27b02ca264e upstream.
      
      Patch series "psi: pressure stall monitors", v6.
      
      This is a respin of:
        https://lwn.net/ml/linux-kernel/20190308184311.144521-1-surenb%40google.com/
      
      Android is adopting psi to detect and remedy memory pressure that
      results in stuttering and decreased responsiveness on mobile devices.
      
      Psi gives us the stall information, but because we're dealing with
      latencies in the millisecond range, periodically reading the pressure
      files to detect stalls in a timely fashion is not feasible.  Psi also
      doesn't aggregate its averages at a high-enough frequency right now.
      
      This patch series extends the psi interface such that users can
      configure sensitive latency thresholds and use poll() and friends to be
      notified when these are breached.
      
      As high-frequency aggregation is costly, it implements an aggregation
      method that is optimized for fast, short-interval averaging, and makes
      the aggregation frequency adaptive, such that high-frequency updates
      only happen while monitored stall events are actively occurring.
      
      With these patches applied, Android can monitor for, and ward off,
      mounting memory shortages before they cause problems for the user.  For
      example, using memory stall monitors in userspace low memory killer
      daemon (lmkd) we can detect mounting pressure and kill less important
      processes before device becomes visibly sluggish.  In our memory stress
      testing psi memory monitors produce roughly 10x less false positives
      compared to vmpressure signals.  Having ability to specify multiple
      triggers for the same psi metric allows other parts of Android framework
      to monitor memory state of the device and act accordingly.
      
      The new interface is straight-forward.  The user opens one of the
      pressure files for writing and writes a trigger description into the
      file descriptor that defines the stall state - some or full, and the
      maximum stall time over a given window of time.  E.g.:
      
              /* Signal when stall time exceeds 100ms of a 1s window */
              char trigger[] = "full 100000 1000000"
              fd = open("/proc/pressure/memory")
              write(fd, trigger, sizeof(trigger))
              while (poll() >= 0) {
                      ...
              };
              close(fd);
      
      When the monitored stall state is entered, psi adapts its aggregation
      frequency according to what the configured time window requires in order
      to emit event signals in a timely fashion.  Once the stalling subsides,
      aggregation reverts back to normal.
      
      The trigger is associated with the open file descriptor.  To stop
      monitoring, the user only needs to close the file descriptor and the
      trigger is discarded.
      
      Patches 1-6 prepare the psi code for polling support.  Patch 7
      implements the adaptive polling logic, the pressure growth detection
      optimized for short intervals, and hooks up write() and poll() on the
      pressure files.
      
      The patches were developed in collaboration with Johannes Weiner.
      
      This patch (of 7):
      
      The psi monitoring patches will need to determine the same states as
      record_times().  To avoid calculating them twice, maintain a state mask
      that can be consulted cheaply.  Do this in a separate patch to keep the
      churn in the main feature patch at a minimum.
      
      This adds 4-byte state_mask member into psi_group_cpu struct which
      results in its first cacheline-aligned part becoming 52 bytes long.  Add
      explicit values to enumeration element counters that affect
      psi_group_cpu struct size.
      
      Link: http://lkml.kernel.org/r/20190124211518.244221-4-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8d1df33e
    • J
      psi: avoid divide-by-zero crash inside virtual machines · dca08058
      Johannes Weiner 提交于
      commit 4e37504d1c49eec6434d0cc97278d2b51c9e8763 upstream.
      
      We've been seeing hard-to-trigger psi crashes when running inside VM
      instances:
      
          divide error: 0000 [#1] SMP PTI
          Modules linked in: [...]
          CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: events psi_clock
          RIP: 0010:psi_update_stats+0x270/0x490
          RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
          RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
          RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
          R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
          FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           psi_clock+0x12/0x50
           process_one_work+0x1e0/0x390
           worker_thread+0x2b/0x3c0
           ? rescuer_thread+0x330/0x330
           kthread+0x113/0x130
           ? kthread_create_worker_on_cpu+0x40/0x40
           ? SyS_exit_group+0x10/0x10
           ret_from_fork+0x35/0x40
          Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48
      
      The Code-line points to `period` being 0 inside update_stats(), and we
      divide by that when calculating that period's pressure percentage.
      
      The elapsed period should never be 0.  The reason this can happen is due
      to an off-by-one in the idle time / missing period calculation combined
      with a coarse sched_clock() in the virtual machine.
      
      The target time for aggregation is advanced into the future on a fixed
      grid to prevent clock drift.  So when an aggregation runs after some idle
      period, we can not just set it to "now + psi_period", but have to
      calculate the downtime and advance the target time relative to itself.
      
      However, if the aggregator was disabled exactly one psi_period (ns), we
      drop one idle period in the calculation due to a > when we should do >=.
      In that case, next_update will be advanced from 'now - psi_period' to
      'now' when it should be moved to 'now + psi_period'.  The run finishes
      with last_update == next_update == sched_clock().
      
      With hardware clocks, this exact nanosecond match isn't likely in the
      first place; but if it does happen, the clock will still have moved on and
      the period non-zero by the time the worker runs.  A pointlessly short
      period, but besides the extra work, no harm no foul.  However, a slow
      sched_clock() like we have on VMs might not have advanced either by the
      time the worker runs again.  And when we calculate the elapsed period, the
      result, our pressure divisor, will be 0.  Ouch.
      
      Fix this by correctly handling the situation when the elapsed time between
      aggregation runs is precisely two periods, and advance the expiration
      timestamp correctly to period into the future.
      
      Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: Łukasz Siudut <lsiudut@fb.com
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      dca08058
    • J
      psi: fix aggregation idle shut-off · 39057d08
      Johannes Weiner 提交于
      commit 1b69ac6b40ebd85eed73e4dbccde2a36961ab990 upstream.
      
      psi has provisions to shut off the periodic aggregation worker when
      there is a period of no task activity - and thus no data that needs
      aggregating.  However, while developing psi monitoring, Suren noticed
      that the aggregation clock currently won't stay shut off for good.
      
      Debugging this revealed a flaw in the idle design: an aggregation run
      will see no task activity and decide to go to sleep; shortly thereafter,
      the kworker thread that executed the aggregation will go idle and cause
      a scheduling change, during which the psi callback will kick the
      !pending worker again.  This will ping-pong forever, and is equivalent
      to having no shut-off logic at all (but with more code!)
      
      Fix this by exempting aggregation workers from psi's clock waking logic
      when the state change is them going to sleep.  To do this, tag workers
      with the last work function they executed, and if in psi we see a worker
      going to sleep after aggregating psi data, we will not reschedule the
      aggregation work item.
      
      What if the worker is also executing other items before or after?
      
      Any psi state times that were incurred by work items preceding the
      aggregation work will have been collected from the per-cpu buckets
      during the aggregation itself.  If there are work items following the
      aggregation work, the worker's last_func tag will be overwritten and the
      aggregator will be kept alive to process this genuine new activity.
      
      If the aggregation work is the last thing the worker does, and we decide
      to go idle, the brief period of non-idle time incurred between the
      aggregation run and the kworker's dequeue will be stranded in the
      per-cpu buckets until the clock is woken by later activity.  But that
      should not be a problem.  The buckets can hold 4s worth of time, and
      future activity will wake the clock with a 2s delay, giving us 2s worth
      of data we can leave behind when disabling aggregation.  If it takes a
      worker more than two seconds to go idle after it finishes its last work
      item, we likely have bigger problems in the system, and won't notice one
      sample that was averaged with a bogus per-CPU weight.
      
      Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
      Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      39057d08
    • J
      psi: make disabling/enabling easier for vendor kernels · 59d17ef0
      Johannes Weiner 提交于
      commit e0c274472d5d27f277af722e017525e0b33784cd upstream.
      
      Mel Gorman reports a hackbench regression with psi that would prohibit
      shipping the suse kernel with it default-enabled, but he'd still like
      users to be able to opt in at little to no cost to others.
      
      With the current combination of CONFIG_PSI and the psi_disabled bool set
      from the commandline, this is a challenge.  Do the following things to
      make it easier:
      
      1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
         to enable CONFIG_PSI in their kernel but leave the feature disabled
         unless a user requests it at boot-time.
      
         To avoid double negatives, rename psi_disabled= to psi=.
      
      2. Make psi_disabled a static branch to eliminate any branch costs
         when the feature is disabled.
      
      In terms of numbers before and after this patch, Mel says:
      
      : The following is a comparision using CONFIG_PSI=n as a baseline against
      : your patch and a vanilla kernel
      :
      :                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
      :                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
      : Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
      : Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
      : Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
      : Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
      : Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
      : Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
      : Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
      : Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
      : Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
      :
      : It's showing that the vanilla kernel takes a hit (as the bisection
      : indicated it would) and that disabling PSI by default is reasonably
      : close in terms of performance for this particular workload on this
      : particular machine so;
      
      Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      59d17ef0
    • O
      kernel/sched/psi.c: simplify cgroup_move_task() · f1a75d14
      Olof Johansson 提交于
      commit 8fcb2312d1e3300e81aa871aad00d4c038cfc184 upstream.
      
      The existing code triggered an invalid warning about 'rq' possibly being
      used uninitialized.  Instead of doing the silly warning suppression by
      initializa it to NULL, refactor the code to bail out early instead.
      
      Warning was:
      
        kernel/sched/psi.c: In function `cgroup_move_task':
        kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
      Fixes: 2ce7135adc9ad ("psi: cgroup support")
      Signed-off-by: NOlof Johansson <olof@lixom.net>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      f1a75d14
    • J
      psi: cgroup support · cef7be65
      Johannes Weiner 提交于
      commit 2ce7135adc9ad081aa3c49744144376ac74fea60 upstream.
      
      On a system that executes multiple cgrouped jobs and independent
      workloads, we don't just care about the health of the overall system, but
      also that of individual jobs, so that we can ensure individual job health,
      fairness between jobs, or prioritize some jobs over others.
      
      This patch implements pressure stall tracking for cgroups.  In kernels
      with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
      and io.pressure files that track aggregate pressure stall times for only
      the tasks inside the cgroup.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      cef7be65
    • J
      psi: pressure stall information for CPU, memory, and IO · d2b34844
      Johannes Weiner 提交于
      commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.
      
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d2b34844