1. 27 12月, 2019 40 次提交
    • J
      psi: avoid divide-by-zero crash inside virtual machines · 2ab1c71c
      Johannes Weiner 提交于
      commit 4e37504d1c49eec6434d0cc97278d2b51c9e8763 upstream.
      
      We've been seeing hard-to-trigger psi crashes when running inside VM
      instances:
      
          divide error: 0000 [#1] SMP PTI
          Modules linked in: [...]
          CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: events psi_clock
          RIP: 0010:psi_update_stats+0x270/0x490
          RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
          RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
          RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
          RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
          R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
          R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
          FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           psi_clock+0x12/0x50
           process_one_work+0x1e0/0x390
           worker_thread+0x2b/0x3c0
           ? rescuer_thread+0x330/0x330
           kthread+0x113/0x130
           ? kthread_create_worker_on_cpu+0x40/0x40
           ? SyS_exit_group+0x10/0x10
           ret_from_fork+0x35/0x40
          Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48
      
      The Code-line points to `period` being 0 inside update_stats(), and we
      divide by that when calculating that period's pressure percentage.
      
      The elapsed period should never be 0.  The reason this can happen is due
      to an off-by-one in the idle time / missing period calculation combined
      with a coarse sched_clock() in the virtual machine.
      
      The target time for aggregation is advanced into the future on a fixed
      grid to prevent clock drift.  So when an aggregation runs after some idle
      period, we can not just set it to "now + psi_period", but have to
      calculate the downtime and advance the target time relative to itself.
      
      However, if the aggregator was disabled exactly one psi_period (ns), we
      drop one idle period in the calculation due to a > when we should do >=.
      In that case, next_update will be advanced from 'now - psi_period' to
      'now' when it should be moved to 'now + psi_period'.  The run finishes
      with last_update == next_update == sched_clock().
      
      With hardware clocks, this exact nanosecond match isn't likely in the
      first place; but if it does happen, the clock will still have moved on and
      the period non-zero by the time the worker runs.  A pointlessly short
      period, but besides the extra work, no harm no foul.  However, a slow
      sched_clock() like we have on VMs might not have advanced either by the
      time the worker runs again.  And when we calculate the elapsed period, the
      result, our pressure divisor, will be 0.  Ouch.
      
      Fix this by correctly handling the situation when the elapsed time between
      aggregation runs is precisely two periods, and advance the expiration
      timestamp correctly to period into the future.
      
      Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: Łukasz Siudut <lsiudut@fb.com
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      2ab1c71c
    • J
      psi: clarify the Kconfig text for the default-disable option · c3a04b34
      Johannes Weiner 提交于
      commit 7b2489d37e1e355228f7c55724f77580e1dec22a upstream.
      
      The current help text caused some confusion in online forums about
      whether or not to default-enable or default-disable psi in vendor
      kernels.  This is because it doesn't communicate the reason for why we
      made this setting configurable in the first place: that the overhead is
      non-zero in an artificial scheduler stress test.
      
      Since this isn't representative of real workloads, and the effect was
      not measurable in scheduler-heavy real world applications such as the
      webservers and memcache installations at Facebook, it's fair to point
      out that this is a pretty cautious option to select.
      
      Link: http://lkml.kernel.org/r/20190129233617.16767-1-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      c3a04b34
    • J
      psi: fix aggregation idle shut-off · 70f1297d
      Johannes Weiner 提交于
      commit 1b69ac6b40ebd85eed73e4dbccde2a36961ab990 upstream.
      
      psi has provisions to shut off the periodic aggregation worker when
      there is a period of no task activity - and thus no data that needs
      aggregating.  However, while developing psi monitoring, Suren noticed
      that the aggregation clock currently won't stay shut off for good.
      
      Debugging this revealed a flaw in the idle design: an aggregation run
      will see no task activity and decide to go to sleep; shortly thereafter,
      the kworker thread that executed the aggregation will go idle and cause
      a scheduling change, during which the psi callback will kick the
      !pending worker again.  This will ping-pong forever, and is equivalent
      to having no shut-off logic at all (but with more code!)
      
      Fix this by exempting aggregation workers from psi's clock waking logic
      when the state change is them going to sleep.  To do this, tag workers
      with the last work function they executed, and if in psi we see a worker
      going to sleep after aggregating psi data, we will not reschedule the
      aggregation work item.
      
      What if the worker is also executing other items before or after?
      
      Any psi state times that were incurred by work items preceding the
      aggregation work will have been collected from the per-cpu buckets
      during the aggregation itself.  If there are work items following the
      aggregation work, the worker's last_func tag will be overwritten and the
      aggregator will be kept alive to process this genuine new activity.
      
      If the aggregation work is the last thing the worker does, and we decide
      to go idle, the brief period of non-idle time incurred between the
      aggregation run and the kworker's dequeue will be stranded in the
      per-cpu buckets until the clock is woken by later activity.  But that
      should not be a problem.  The buckets can hold 4s worth of time, and
      future activity will wake the clock with a 2s delay, giving us 2s worth
      of data we can leave behind when disabling aggregation.  If it takes a
      worker more than two seconds to go idle after it finishes its last work
      item, we likely have bigger problems in the system, and won't notice one
      sample that was averaged with a bogus per-CPU weight.
      
      Link: http://lkml.kernel.org/r/20190116193501.1910-1-hannes@cmpxchg.org
      Fixes: eb414681d5a0 ("psi: pressure stall information for CPU, memory, and IO")
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reported-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      70f1297d
    • B
      psi: fix reference to kernel commandline enable · bcc32a75
      Baruch Siach 提交于
      commit 428a1cb4baeb9e5c7feda93af7372ba6d2491558 upstream.
      
      The kernel commandline parameter named in CONFIG_PSI_DEFAULT_DISABLED
      help text contradicts the documentation in kernel-parameters.txt, and
      the code.  Fix that.
      
      Link: http://lkml.kernel.org/r/20181203213416.GA12627@cmpxchg.org
      Fixes: e0c274472d ("psi: make disabling/enabling easier for vendor kernels")
      Signed-off-by: NBaruch Siach <baruch@tkos.co.il>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      bcc32a75
    • J
      psi: make disabling/enabling easier for vendor kernels · 8570f933
      Johannes Weiner 提交于
      commit e0c274472d5d27f277af722e017525e0b33784cd upstream.
      
      Mel Gorman reports a hackbench regression with psi that would prohibit
      shipping the suse kernel with it default-enabled, but he'd still like
      users to be able to opt in at little to no cost to others.
      
      With the current combination of CONFIG_PSI and the psi_disabled bool set
      from the commandline, this is a challenge.  Do the following things to
      make it easier:
      
      1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
         to enable CONFIG_PSI in their kernel but leave the feature disabled
         unless a user requests it at boot-time.
      
         To avoid double negatives, rename psi_disabled= to psi=.
      
      2. Make psi_disabled a static branch to eliminate any branch costs
         when the feature is disabled.
      
      In terms of numbers before and after this patch, Mel says:
      
      : The following is a comparision using CONFIG_PSI=n as a baseline against
      : your patch and a vanilla kernel
      :
      :                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
      :                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
      : Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
      : Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
      : Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
      : Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
      : Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
      : Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
      : Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
      : Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
      : Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
      :
      : It's showing that the vanilla kernel takes a hit (as the bisection
      : indicated it would) and that disabling PSI by default is reasonably
      : close in terms of performance for this particular workload on this
      : particular machine so;
      
      Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Tested-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8570f933
    • O
      kernel/sched/psi.c: simplify cgroup_move_task() · d09e62ba
      Olof Johansson 提交于
      commit 8fcb2312d1e3300e81aa871aad00d4c038cfc184 upstream.
      
      The existing code triggered an invalid warning about 'rq' possibly being
      used uninitialized.  Instead of doing the silly warning suppression by
      initializa it to NULL, refactor the code to bail out early instead.
      
      Warning was:
      
        kernel/sched/psi.c: In function `cgroup_move_task':
        kernel/sched/psi.c:639:13: warning: `rq' may be used uninitialized in this function [-Wmaybe-uninitialized]
      
      Link: http://lkml.kernel.org/r/20181103183339.8669-1-olof@lixom.net
      Fixes: 2ce7135adc9ad ("psi: cgroup support")
      Signed-off-by: NOlof Johansson <olof@lixom.net>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d09e62ba
    • J
      psi: cgroup support · d26da194
      Johannes Weiner 提交于
      commit 2ce7135adc9ad081aa3c49744144376ac74fea60 upstream.
      
      On a system that executes multiple cgrouped jobs and independent
      workloads, we don't just care about the health of the overall system, but
      also that of individual jobs, so that we can ensure individual job health,
      fairness between jobs, or prioritize some jobs over others.
      
      This patch implements pressure stall tracking for cgroups.  In kernels
      with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
      and io.pressure files that track aggregate pressure stall times for only
      the tasks inside the cgroup.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [Joseph: fix apply conflicts in cgroup_create()]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      
      Conflicts:
          kernel/cgroup/cgroup.c
      d26da194
    • J
      psi: pressure stall information for CPU, memory, and IO · 8b2bf079
      Johannes Weiner 提交于
      commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.
      
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [Joseph: fix apply conflicts in task_struct]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      8b2bf079
    • J
      sched: introduce this_rq_lock_irq() · ec035638
      Johannes Weiner 提交于
      commit 246b3b3342c9b0a2e24cda2178be87bc36e1c874 upstream.
      
      do_sched_yield() disables IRQs, looks up this_rq() and locks it.  The next
      patch is adding another site with the same pattern, so provide a
      convenience function for it.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-8-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      ec035638
    • J
      sched: sched.h: make rq locking and clock functions available in stats.h · 420e6dea
      Johannes Weiner 提交于
      commit 1f351d7f7590857ea281579c26e6045b4c548ef4 upstream.
      
      kernel/sched/sched.h includes "stats.h" half-way through the file.  The
      next patch introduces users of sched.h's rq locking functions and
      update_rq_clock() in kernel/sched/stats.h.  Move those definitions up in
      the file so they are available in stats.h.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-7-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      420e6dea
    • J
      sched: loadavg: make calc_load_n() public · 3bf93774
      Johannes Weiner 提交于
      commit 5c54f5b9edb1aa2eabbb1091c458f1b6776a1896 upstream.
      
      It's going to be used in a later patch. Keep the churn separate.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-6-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      3bf93774
    • J
      sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD · 4ca637b4
      Johannes Weiner 提交于
      commit 8508cf3ffad4defa202b303e5b6379efc4cd9054 upstream.
      
      There are several definitions of those functions/macros in places that
      mess with fixed-point load averages.  Provide an official version.
      
      [akpm@linux-foundation.org: fix missed conversion in block/blk-iolatency.c]
      Link: http://lkml.kernel.org/r/20180828172258.3185-5-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [Joseph: use stat.mean instead of stat->rqs.mean to solve conflict]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      
      Conflicts:
          block/blk-iolatency.c
      4ca637b4
    • J
      delayacct: track delays from thrashing cache pages · 2dde6f77
      Johannes Weiner 提交于
      commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33 upstream.
      
      Delay accounting already measures the time a task spends in direct reclaim
      and waiting for swapin, but in low memory situations tasks spend can spend
      a significant amount of their time waiting on thrashing page cache.  This
      isn't tracked right now.
      
      To know the full impact of memory contention on an individual task,
      measure the delay when waiting for a recently evicted active cache page to
      read back into memory.
      
      Also update tools/accounting/getdelays.c:
      
           [hannes@computer accounting]$ sudo ./getdelays -d -p 1
           print delayacct stats ON
           PID     1
      
           CPU             count     real total  virtual total    delay total  delay average
                           50318      745000000      847346785      400533713          0.008ms
           IO              count    delay total  delay average
                             435      122601218              0ms
           SWAP            count    delay total  delay average
                               0              0              0ms
           RECLAIM         count    delay total  delay average
                               0              0              0ms
           THRASHING       count    delay total  delay average
                              19       12621439              0ms
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      2dde6f77
    • J
      mm: workingset: tell cache transitions from workingset thrashing · e2d3e3cb
      Johannes Weiner 提交于
      commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.
      
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e2d3e3cb
    • J
      mm: workingset: don't drop refault information prematurely · b027d193
      Johannes Weiner 提交于
      commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.
      
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to rootcause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, suboptimal job
      placement in a grid; as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
      because the OOM killer is triggered by reclaim not being able to free
      pages, but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard to impossible to rootcause it post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even when 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard to
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
         During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
         the first half and nopsi=77.52% psi=78.25%, so PSI added between
         0.7 and 0.9 percentage points to the CPU load, a difference of
         about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
         pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
      : Backported to 4.9 and retested on ARMv8 8 code system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate an overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <jweiner@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b027d193
    • G
      alinux: net/tcp: Support tunable tcp timeout value in TIME-WAIT state · 0e7b6b7f
      George Zhang 提交于
      By default the tcp_tw_timeout value is 60 seconds. The minimum is
      1 second and the maximum is 600. This setting is useful on system under
      heavy tcp load.
      
      NOTE: set the tcp_tw_timeout below 60 seconds voilates the "quiet time"
      restriction, and make your system into the risk of causing some old data
      to be accepted as new or new data rejected as old duplicated by some
      receivers.
      
      Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0e7b6b7f
    • A
      PCI: Fix "try" semantics of bus and slot reset · e69730a1
      Alex Williamson 提交于
      commit ddefc033eecf23f1e8b81d0663c5db965adf5516 upstream
      
      The commit referenced below introduced device locking around save and
      restore of state for each device during a PCI bus "try" reset, making
      it decidely non-"try" and prone to deadlock in the event that a device
      is already locked.  Restore __pci_reset_bus() and __pci_reset_slot()
      to their advertised locking semantics by pushing the save and restore
      functions into the branch where the entire tree is already locked.
      Extend the helper function names with "_locked" and update the comment
      to reflect this calling requirement.
      
      Fixes: b014e96d ("PCI: Protect pci_error_handlers->reset_notify() usage with device_lock()")
      Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>
      Signed-off-by: NZhiyuan Hou <zhiyuan2048@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e69730a1
    • K
      alinux: net/hookers: fix link error with ipv6 disabled · 817f2cbf
      kbuild test robot 提交于
      lkp-build bot reported the following link error with ipv6 disabled:
      
      ld: net/hookers/hookers.o:(.data+0x40): undefined reference to `ipv6_specific'
      ld: net/hookers/hookers.o:(.data+0x78): undefined reference to `ipv6_mapped'
      ld: net/hookers/hookers.o:(.data+0xe8): undefined reference to `inet6_stream_ops'
      
      Fixed this issue by adding IS_ENABLED(CONFIG_IPV6) check.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      817f2cbf
    • K
      alinux: writeback: memcg_blkcg_tree_lock can be static · 2307342e
      kbuild test robot 提交于
      Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
      Signed-off-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      2307342e
    • C
      alinux: net/hookers: only enable on x86 platform · 2c0162e8
      Caspar Zhang 提交于
      read/write_cr0() are used in net/hookers.c, but they are only available
      on x86 platform. Adding a depend-on fields in Kconfig to disable this
      feature in other platforms.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      2c0162e8
    • J
      alinux: fs/writeback: wrap cgroup writeback v1 logic · 5e06ec32
      Joseph Qi 提交于
      Wrap cgroup writeback v1 logic to prevent build errors without
      CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      5e06ec32
    • J
      alinux: writeback: introduce cgwb_v1 boot param · 6f7afdf3
      Jiufei Xue 提交于
      So far writeback control is supported for cgroup v1 interface. However
      it also has some restrictions, so introduce a new kernel boot parameter
      to control the behavior which is disabled by default. Users can enable
      the writeback control for cgroup v1 with the command line "cgwb_v1".
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6f7afdf3
    • L
      alinux: fs/writeback: Attach inode's wb to root if needed · 67dc261c
      luanshi 提交于
      There might have tons of files queued in the writeback, awaiting for
      writing back. Unfortunately, the writeback's cgroup has been dead. In
      this case, we reassociate the inode with another writeback, but we
      possibly can't because the writeback associated with the dead cgroup is
      the only valid one. In this case, the new writeback is allocated,
      initialized and associated with the inode in the non-stopping fashion
      until all data resident in the inode's page cache are flushed to disk.
      It causes unnecessary high system load.
      
      This fixes the issue by enforce moving the inode to root cgroup when the
      previous binding cgroup becomes dead. With it, no more unnecessary
      writebacks are created, populated and the system load decreased by about
      6x in the test case we carried out:
          Without the patch: 30% system load
          With the patch:    5%  system load
      Signed-off-by: Nluanshi <zhangliguang@linux.alibaba.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      67dc261c
    • J
      alinux: fs/writeback: fix double free of blkcg_css · b8016b41
      Jiufei Xue 提交于
      We have gotten a WARNNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A
      10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78
      0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000
      ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400
      ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get after the css is killed by another
      thread described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
      When doublefree happened, it may free the memory still used by
      other threads and cause a kernel panic.
      
      Fix this by using css_tryget_online in find_blkcg_css while will return
      false if the css is killed.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b8016b41
    • J
    • J
      alinux: writeback: add memcg_blkcg_link tree · 7396367d
      Jiufei Xue 提交于
      Here we add a global radix tree to link memcg and blkcg that the user
      attach the tasks to when using cgroup v1, which is used for writeback
      cgroup.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      7396367d
    • G
      alinux: net: kernel hookers service for toa module · e209e3cb
      George Zhang 提交于
      LVS fullnat will replace network traffic's source ip with its local ip,
      and thus the backend servers cannot obtain the real client ip.
      
      To solve this, LVS has introduced the tcp option address (TOA) to store
      the essential ip address information in the last tcp ack packet of the
      3-way handshake, and the backend servers need to retrieve it from the
      packet header.
      
      In this patch, we have introduced the sk_toa_data member in the sock
      structure to hold the TOA information. There used to be an in-tree
      module for TOA managing, whereas it has now been maintained as an
      standalone module.
      
      In this case, the toa module should register its hook function(s) using
      the provided interfaces in the hookers module.
      
      TOA in sock structure:
      
      	__be32 sk_toa_data[16];
      
      The hookers module only provides the sk_toa_data placeholder, and the
      toa module can use this variable through the layout it needs.
      
      Hook interfaces:
      
      The hookers module replaces the kernel's syn_recv_sock and getname
      handler with a stub that chains the toa module's hook function(s) to the
      original handling function. The hookers module allows hook functions to
      be installed and uninstalled in any order.
      
      toa module:
      
      The external toa module will be provided in separate RPM package.
      
      [xuyu@linux.alibaba.com: amend commit log]
      Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e209e3cb
    • C
      virtio_blk: add discard and write zeroes support · 0eea9fdf
      Changpeng Liu 提交于
      commit 1f23816b8eb8fdc39990abe166c10a18c16f6b21 upstream.
      
      In commit 88c85538, "virtio-blk: add discard and write zeroes features
      to specification" (https://github.com/oasis-tcs/virtio-spec), the virtio
      block specification has been extended to add VIRTIO_BLK_T_DISCARD and
      VIRTIO_BLK_T_WRITE_ZEROES commands.  This patch enables support for
      discard and write zeroes in the virtio-blk driver when the device
      advertises the corresponding features, VIRTIO_BLK_F_DISCARD and
      VIRTIO_BLK_F_WRITE_ZEROES.
      Signed-off-by: NChangpeng Liu <changpeng.liu@intel.com>
      Signed-off-by: NDaniel Verkamp <dverkamp@chromium.org>
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Reviewed-by: NStefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0eea9fdf
    • J
      alinux: kconfig: Disable x86 clocksource watchdog · deb14433
      Jiufei Xue 提交于
      Unstable tsc will trigger clocksource watchdog and disable itself, as a
      result other clocksource will be elected as the current clocksource
      which will result in performace issue on our servers.
      
      RHEL7 also disabled this feature for some issues, see changelog:
      [x86] disable clocksource watchdog (Prarit Bhargava) [914709]
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      deb14433
    • J
      alinux: Revert "x86/tsc: Prepare warp test for TSC adjustment" · 0fda568f
      Jiufei Xue 提交于
      This reverts commit 76d3b851.
      
      The returned value for check_tsc_warp() is useless now, remove it.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      0fda568f
    • J
      alinux: Revert "x86/tsc: Try to adjust TSC if sync test fails" · a7d3a215
      Jiufei Xue 提交于
      This reverts commit cc4db268.
      
      When we do hot-add and enable vCPU, the time inside the VM jumps and
      then VM stucks.
      The dmesg shows like this:
      [   48.402948] CPU2 has been hot-added
      [   48.413774] smpboot: Booting Node 0 Processor 2 APIC 0x2
      [   48.415155] kvm-clock: cpu 2, msr 6b615081, secondary cpu clock
      [   48.453690] TSC ADJUST compensate: CPU2 observed 139318776350 warp.  Adjust: 139318776350
      [  102.060874] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
      [  102.060874] clocksource:                       'kvm-clock' wd_now: 1cb1cfc4bf8 wd_last: 1be9588f1fe mask: ffffffffffffffff
      [  102.060874] clocksource:                       'tsc' cs_now: 207d794f7e cs_last: 205a32697a mask: ffffffffffffffff
      [  102.060874] tsc: Marking TSC unstable due to clocksource watchdog
      [  102.070188] KVM setup async PF for cpu 2
      [  102.071461] kvm-stealtime: cpu 2, msr 13ba95000
      [  102.074530] Will online and init hotplugged CPU: 2
      
      This is because the TSC for the newly added VCPU is initialized to 0
      while others are ahead. Guest will do the TSC ADJUST compensate and
      cause the time jumps.
      
      Commit bd8fab39("KVM: x86: fix maintaining of kvm_clock stability
      on guest CPU hotplug") can fix this problem.  However, the host kernel
      version may be older, so do not ajust TSC if sync test fails, just mark
      it unstable.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a7d3a215
    • J
      alinux: block-throttle: enable hierarchical throttling even on traditional hierarchy · 0430bf1d
      Joseph Qi 提交于
      ECI may have an use case that configuring each device mapper disk
      throttling policy just under root blkio cgroup, but actually using them
      in different containers.
      Since hierarchical throttling is now only supported on cgroup v2 and ECI
      uses cgroup v1, so we have to enable hierarchical throttling on cgroup
      v1.
      This is ported from redhat 7u, and a year ago Jiufei already ported it
      to alikernel 4.9 as well. So I think this change should be acceptable.
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      0430bf1d
    • E
      alinux: drivers/virtio: add vring_force_dma_api boot param · 9c4a8e9d
      Eryu Guan 提交于
      Prior to xdragon platform 20181230 release (e.g. 0930 release),
      vring_use_dma_api() is required to return 'true' unconditionally.
      
      Introduce a new kernel boot parameter called "vring_force_dma_api" to
      control the behavior, boot xdragon host with "vring_force_dma_api"
      command line to make ENI hotplug work, so that normal ECS hosts keep the
      original behavior.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NEryu Guan <eguan@linux.alibaba.com>
      9c4a8e9d
    • A
      boot: give rdrand some credit · 957ba62a
      Arjan van de Ven 提交于
      Cherry-pick from clear-linux patches:
      https://github.com/clearlinux-pkgs/linux-kvm/0104-give-rdrand-some-credit.patch
      
      try to credit rdrand/rdseed with some entropy
      
      In VMs but even modern hardware, we're super starved for entropy, and while we can
      and do wear a tin foil hat, it's very hard to argue that
      rdrand and rdtsc add zero entropy.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      957ba62a
    • J
    • A
      NEMU: Compile in evged always · f4747532
      Arjan van de Ven 提交于
      Cherry-pick from kata-container patches:
      https://github.com/kata-containers/packaging/tree/master/kernel/patches/0002-Compile-in-evged-always.patch
      
      We need evged for NEMU (and in general for hw reduced)
      
      The config option cannot be set normally since it breaks all
      regular systems, and hardware reduced is really a runtime choice.
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NEryu Guan <eguan@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f4747532
    • E
      ext4: fix reserved cluster accounting at page invalidation time · c94af26e
      Eric Whitney 提交于
      commit f456767d3391e9f7d9d25a2e7241d75676dc19da upstream.
      
      Add new code to count canceled pending cluster reservations on bigalloc
      file systems and to reduce the cluster reservation count on all file
      systems using delayed allocation.  This replaces old code in
      ext4_da_page_release_reservations that was incorrect.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      c94af26e
    • E
      ext4: adjust reserved cluster count when removing extents · 60a5bf34
      Eric Whitney 提交于
      commit 9fe671496b6c286f9033aedfc1718d67721da0ae upstream.
      
      Modify ext4_ext_remove_space() and the code it calls to correct the
      reserved cluster count for pending reservations (delayed allocated
      clusters shared with allocated blocks) when a block range is removed
      from the extent tree.  Pending reservations may be found for the clusters
      at the ends of written or unwritten extents when a block range is removed.
      If a physical cluster at the end of an extent is freed, it's necessary
      to increment the reserved cluster count to maintain correct accounting
      if the corresponding logical cluster is shared with at least one
      delayed and unwritten extent as found in the extents status tree.
      
      Add a new function, ext4_rereserve_cluster(), to reapply a reservation
      on a delayed allocated cluster sharing blocks with a freed allocated
      cluster.  To avoid ENOSPC on reservation, a flag is applied to
      ext4_free_blocks() to briefly defer updating the freeclusters counter
      when an allocated cluster is freed.  This prevents another thread
      from allocating the freed block before the reservation can be reapplied.
      
      Redefine the partial cluster object as a struct to carry more state
      information and to clarify the code using it.
      
      Adjust the conditional code structure in ext4_ext_remove_space to
      reduce the indentation level in the main body of the code to improve
      readability.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      60a5bf34
    • E
      ext4: reduce reserved cluster count by number of allocated clusters · d49d8b35
      Eric Whitney 提交于
      commit b6bf9171ef5c37b66d446378ba63af5339a56a97 upstream.
      
      Ext4 does not always reduce the reserved cluster count by the number
      of clusters allocated when mapping a delayed extent.  It sometimes
      adds back one or more clusters after allocation if delalloc blocks
      adjacent to the range allocated by ext4_ext_map_blocks() share the
      clusters newly allocated for that range.  However, this overcounts
      the number of clusters needed to satisfy future mapping requests
      (holding one or more reservations for clusters that have already been
      allocated) and premature ENOSPC and quota failures, etc., result.
      
      Ext4 also does not reduce the reserved cluster count when allocating
      clusters for non-delayed allocated writes that have previously been
      reserved for delayed writes.  This also results in overcounts.
      
      To make it possible to handle reserved cluster accounting for
      fallocated regions in the same manner as used for other non-delayed
      writes, do the reserved cluster accounting for them at the time of
      allocation.  In the current code, this is only done later when a
      delayed extent sharing the fallocated region is finally mapped.
      
      Address comment correcting handling of unsigned long long constant
      from Jan Kara's review of RFC version of this patch.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      d49d8b35
    • E
      ext4: fix reserved cluster accounting at delayed write time · f683c7e6
      Eric Whitney 提交于
      commit 0b02f4c0d6d9e2c611dfbdd4317193e9dca740e6 upstream.
      
      The code in ext4_da_map_blocks sometimes reserves space for more
      delayed allocated clusters than it should, resulting in premature
      ENOSPC, exceeded quota, and inaccurate free space reporting.
      
      Fix this by checking for written and unwritten blocks shared in the
      same cluster with the newly delayed allocated block.  A cluster
      reservation should not be made for a cluster for which physical space
      has already been allocated.
      Signed-off-by: NEric Whitney <enwlinux@gmail.com>
      Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      f683c7e6