1. 11 June 2019 (2 commits)
    • mm: workingset: tell cache transitions from workingset thrashing · 20403665
      Johannes Weiner authored
      commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.
      
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bona fide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
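      
      As a rough illustration (the real layout in mm/workingset.c also
      packs node and memcg ids, so this is a sketch, not the actual
      code), the flag can be folded into the shadow entry on eviction
      and read back on refault:
      
        /* Illustrative only: keep the "page was active" bit next to
         * the eviction timestamp inside the shadow entry. */
        #define WORKINGSET_SHIFT 1
      
        static void *pack_shadow(unsigned long eviction, bool workingset)
        {
        	eviction = (eviction << WORKINGSET_SHIFT) | workingset;
        	return (void *)((eviction << 1) | 1);	/* tag as shadow entry */
        }
      
        static void unpack_shadow(void *shadow, unsigned long *eviction,
        			  bool *workingset)
        {
        	unsigned long entry = (unsigned long)shadow >> 1;
      
        	*workingset = entry & 1;
        	*eviction = entry >> WORKINGSET_SHIFT;
        }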
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      20403665
    • mm: workingset: don't drop refault information prematurely · 63a30543
      Johannes Weiner authored
      commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.
      
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to root-cause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, or suboptimal job
      placement in a grid, as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One use case is avoiding OOM hangs/livelocks.  These happen because
      the OOM killer is triggered by reclaim not being able to free pages,
      but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard or impossible to root-cause post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
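      
      For example, a userspace monitor can sample total= on its own
      schedule and derive a custom window; a hedged sketch (the file
      format is as shown above, everything else here is made up):
      
        /* Compute CPU "some" pressure over a custom 1-second window by
         * sampling the cumulative total= counter (in microseconds). */
        #include <stdio.h>
        #include <unistd.h>
      
        static unsigned long long read_cpu_stall_us(void)
        {
        	unsigned long long total = 0;
        	FILE *f = fopen("/proc/pressure/cpu", "r");
      
        	if (f) {
        		fscanf(f, "some avg10=%*f avg60=%*f avg300=%*f total=%llu",
        		       &total);
        		fclose(f);
        	}
        	return total;
        }
      
        int main(void)
        {
        	unsigned long long prev = read_cpu_stall_us();
      
        	for (;;) {
        		sleep(1);
        		unsigned long long cur = read_cpu_stall_us();
        		/* stall microseconds per second, as a percentage */
        		printf("cpu some: %.2f%%\n", (cur - prev) / 10000.0);
        		prev = cur;
        	}
        }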
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even if 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard or
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
         During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
         the first half and nopsi=77.52% psi=78.25%, so PSI added between
         0.7 and 0.9 percentage points to the CPU load, a difference of
         about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
         pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
      : Backported to 4.9 and retested on an ARMv8 8-core system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <jweiner@fb.com>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      63a30543
  2. 04 June 2019 (6 commits)
    • writeback: memcg_blkcg_tree_lock can be static · bff0a7d6
      kbuild test robot authored
      Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
      Signed-off-by: kbuild test robot <lkp@intel.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      bff0a7d6
    • fs/writeback: wrap cgroup writeback v1 logic · f2d895d6
      Joseph Qi authored
      Wrap cgroup writeback v1 logic to prevent build errors without
      CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
      Reported-by: kbuild test robot <lkp@intel.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      f2d895d6
    • writeback: introduce cgwb_v1 boot param · 37ab1da1
      Jiufei Xue authored
      Writeback control is now supported for the cgroup v1 interface.
      However, it still has some restrictions, so introduce a new kernel
      boot parameter to control the behavior, which is disabled by default.
      Users can enable writeback control for cgroup v1 with the command
      line parameter "cgwb_v1".
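      
      __setup() is the kernel's standard hook for this kind of boolean
      boot parameter; a minimal sketch (the flag variable name here is
      made up):
      
        /* Illustrative sketch of parsing a "cgwb_v1" boot parameter. */
        static bool cgwb_v1_enabled;
      
        static int __init enable_cgwb_v1(char *s)
        {
        	cgwb_v1_enabled = true;
        	return 1;
        }
        __setup("cgwb_v1", enable_cgwb_v1);
      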
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      37ab1da1
    • fs/writeback: fix double free of blkcg_css · bde2f8ae
      Jiufei Xue authored
      We have gotten a WARNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A
      10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78
      0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000
      ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400
      ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get after the css is killed by another
      thread described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
      When a double free happens, it may free memory that is still used
      by other threads and cause a kernel panic.
      
      Fix this by using css_tryget_online in find_blkcg_css, which will
      return false if the css is killed.
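      
      A sketch of the shape of the fix (hedged, not the literal patch):
      take a reference only when the css is still online, so a css that
      kill_css() has already marked dead is skipped instead of gaining a
      reference that later leads to a double release.
      
        /* Illustrative: pin a css only if it is still online. */
        static struct cgroup_subsys_state *
        pin_blkcg_css(struct cgroup_subsys_state *css)
        {
        	if (!css || !css_tryget_online(css))
        		return NULL;	/* already killed; don't touch it */
        	return css;
        }
      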
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      bde2f8ae
    • d5b6e5d0
    • writeback: add memcg_blkcg_link tree · d51e0ab9
      Jiufei Xue authored
      Here we add a global radix tree to link the memcg and blkcg that
      the user attaches tasks to when using cgroup v1, which is used for
      cgroup writeback.
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      d51e0ab9
  3. 22 May 2019 (4 commits)
  4. 17 May 2019 (3 commits)
  5. 10 May 2019 (1 commit)
  6. 08 May 2019 (2 commits)
  7. 04 May 2019 (1 commit)
  8. 02 May 2019 (1 commit)
    • mm: Fix warning in insert_pfn() · 423497a9
      Jan Kara authored
      commit f2c57d91b0d96aa13ccff4e3b178038f17b00658 upstream.
      
      In DAX mode a write pagefault can race with write(2) in the following
      way:
      
      CPU0                            CPU1
                                      write fault for mapped zero page (hole)
      dax_iomap_rw()
        iomap_apply()
          xfs_file_iomap_begin()
            - allocates blocks
          dax_iomap_actor()
            invalidate_inode_pages2_range()
              - invalidates radix tree entries in given range
                                      dax_iomap_pte_fault()
                                        grab_mapping_entry()
                                          - no entry found, creates empty
                                        ...
                                        xfs_file_iomap_begin()
                                          - finds already allocated block
                                        ...
                                        vmf_insert_mixed_mkwrite()
                                          - WARNs and does nothing because there
                                            is still zero page mapped in PTE
              unmap_mapping_pages()
      
      This race results in WARN_ON from insert_pfn() and is occasionally
      triggered by fstest generic/344. Note that the race is otherwise
      harmless as before write(2) on CPU0 is finished, we will invalidate page
      tables properly and thus user of mmap will see modified data from
      write(2) from that point on. So just restrict the warning only to the
      case when the PFN in PTE is not zero page.
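      
      A simplified sketch of the relaxed check in insert_pfn() (the exact
      upstream diff differs in detail):
      
        /* Tolerate a leftover zero-page mapping, which the race above
         * legitimately leaves behind; warn only on other mismatches. */
        if (pte_pfn(*pte) != pfn_t_to_pfn(pfn)) {
        	WARN_ON_ONCE(!is_zero_pfn(pte_pfn(*pte)));
        	goto out_unlock;
        }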
      
      Link: http://lkml.kernel.org/r/20180824154542.26872-1-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      423497a9
  9. 27 April 2019 (3 commits)
  10. 20 April 2019 (1 commit)
    • mm: hide incomplete nr_indirectly_reclaimable in /proc/zoneinfo · d49dea54
      Roman Gushchin authored
      [fixed differently upstream, this is a work-around to resolve it for 4.19.y]
      
      Yongqin reported that the /proc/zoneinfo format is broken in 4.14
      due to commit 7aaf7727 ("mm: don't show nr_indirectly_reclaimable
      in /proc/vmstat"):
      
      Node 0, zone      DMA
        per-node stats
            nr_inactive_anon 403
            nr_active_anon 89123
            nr_inactive_file 128887
            nr_active_file 47377
            nr_unevictable 2053
            nr_slab_reclaimable 7510
            nr_slab_unreclaimable 10775
            nr_isolated_anon 0
            nr_isolated_file 0
            <...>
            nr_vmscan_write 0
            nr_vmscan_immediate_reclaim 0
            nr_dirtied   6022
            nr_written   5985
                         74240
            ^^^^^^^^^^
        pages free     131656
      
      The problem is caused by the nr_indirectly_reclaimable counter,
      which is hidden from /proc/vmstat but not from /proc/zoneinfo.
      Let's fix this inconsistency and hide the counter from
      /proc/zoneinfo exactly as from /proc/vmstat.
      
      BTW, in 4.19+ the counter has been renamed and exported by
      commit b29940c1abd7 ("mm: rename and change semantics of
      nr_indirectly_reclaimable_bytes"), so there is no such problem
      anymore.
      
      Cc: <stable@vger.kernel.org> # 4.14.x-4.18.x
      Fixes: 7aaf7727 ("mm: don't show nr_indirectly_reclaimable in /proc/vmstat")
      Reported-by: Yongqin Liu <yongqin.liu@linaro.org>
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d49dea54
  11. 17 April 2019 (2 commits)
    • mm: writeback: use exact memcg dirty counts · 43f47331
      Greg Thelen authored
      commit 0b3d6e6f2dd0a7b697b1aa8c167265908940624b upstream.
      
      Since commit a983b5eb ("mm: memcontrol: fix excessive complexity in
      memory.stat reporting") memcg dirty and writeback counters are managed
      as:
      
       1) per-memcg per-cpu values in range of [-32..32]
      
       2) per-memcg atomic counter
      
      When a per-cpu counter cannot fit in [-32..32] it's flushed to the
      atomic.  Stat readers only check the atomic.  Thus readers such as
      balance_dirty_pages() may see a nontrivial error margin: 32 pages per
      cpu.
      
      Assuming 100 cpus:
         4k x86 page_size:  13 MiB error per memcg
        64k ppc page_size: 200 MiB error per memcg
      
      Considering that dirty+writeback are used together for some decisions the
      errors double.
      
      This inaccuracy can lead to undeserved oom kills.  One nasty case is
      when all per-cpu counters hold positive values offsetting an atomic
      negative value (i.e.  per_cpu[*]=32, atomic=n_cpu*-32).
      balance_dirty_pages() only consults the atomic and does not consider
      throttling the next n_cpu*32 dirty pages.  If the file_lru is in the
      13..200 MiB range then there's absolutely no dirty throttling, which
      burdens vmscan with only dirty+writeback pages thus resorting to oom
      kill.
      
      It could be argued that tiny containers are not supported, but it's
      more subtle.  It's the amount of space available for the file lru
      that matters.  If a container has memory.max-200MiB of non-reclaimable
      memory, then it will also suffer such oom kills on a 100 cpu machine.
      
      The following test reliably ooms without this patch.  This patch avoids
      oom kills.
      
        $ cat test
        mount -t cgroup2 none /dev/cgroup
        cd /dev/cgroup
        echo +io +memory > cgroup.subtree_control
        mkdir test
        cd test
        echo 10M > memory.max
        (echo $BASHPID > cgroup.procs && exec /memcg-writeback-stress /foo)
        (echo $BASHPID > cgroup.procs && exec dd if=/dev/zero of=/foo bs=2M count=100)
      
        $ cat memcg-writeback-stress.c
        /*
         * Dirty pages from all but one cpu.
         * Clean pages from the non dirtying cpu.
         * This is to stress per cpu counter imbalance.
         * On a 100 cpu machine:
         * - per memcg per cpu dirty count is 32 pages for each of 99 cpus
         * - per memcg atomic is -99*32 pages
         * - thus the complete dirty limit: sum of all counters 0
         * - balance_dirty_pages() only sees atomic count -99*32 pages, which
         *   it max()s to 0.
         * - So a workload can dirty -99*32 pages before balance_dirty_pages()
         *   cares.
         */
        #define _GNU_SOURCE
        #include <err.h>
        #include <fcntl.h>
        #include <sched.h>
        #include <stdlib.h>
        #include <stdio.h>
        #include <sys/stat.h>
        #include <sys/sysinfo.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        static char *buf;
        static int bufSize;
      
        static void set_affinity(int cpu)
        {
        	cpu_set_t affinity;
      
        	CPU_ZERO(&affinity);
        	CPU_SET(cpu, &affinity);
        	if (sched_setaffinity(0, sizeof(affinity), &affinity))
        		err(1, "sched_setaffinity");
        }
      
        static void dirty_on(int output_fd, int cpu)
        {
        	int i, wrote;
      
        	set_affinity(cpu);
        	for (i = 0; i < 32; i++) {
        		for (wrote = 0; wrote < bufSize; ) {
        			int ret = write(output_fd, buf+wrote, bufSize-wrote);
        			if (ret == -1)
        				err(1, "write");
        			wrote += ret;
        		}
        	}
        }
      
        int main(int argc, char **argv)
        {
        	int cpu, flush_cpu = 1, output_fd;
        	const char *output;
      
        	if (argc != 2)
        		errx(1, "usage: output_file");
      
        	output = argv[1];
        	bufSize = getpagesize();
        	buf = malloc(getpagesize());
        	if (buf == NULL)
        		errx(1, "malloc failed");
      
  	output_fd = open(output, O_CREAT|O_RDWR, 0644);
        	if (output_fd == -1)
        		err(1, "open(%s)", output);
      
        	for (cpu = 0; cpu < get_nprocs(); cpu++) {
        		if (cpu != flush_cpu)
        			dirty_on(output_fd, cpu);
        	}
      
        	set_affinity(flush_cpu);
        	if (fsync(output_fd))
        		err(1, "fsync(%s)", output);
        	if (close(output_fd))
        		err(1, "close(%s)", output);
        	free(buf);
        }
      
      Make balance_dirty_pages() and wb_over_bg_thresh() work harder to
      collect exact per memcg counters.  This avoids the aforementioned oom
      kills.
      
      This does not affect the overhead of memory.stat, which still reads the
      single atomic counter.
      
      Why not use percpu_counter? memcg already handles cpus going offline, so
      no need for that overhead from percpu_counter.  And the percpu_counter
      spinlocks are more heavyweight than is required.
      
      It probably also makes sense to use exact dirty and writeback counters
      in memcg oom reports.  But that is saved for later.
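      
      The exact read can be sketched as folding the outstanding per-cpu
      deltas into the atomic count (close in spirit to the upstream
      memcg_exact_page_state() helper; field names are illustrative):
      
        static unsigned long memcg_exact_page_state(struct mem_cgroup *memcg,
        					    int idx)
        {
        	long x = atomic_long_read(&memcg->stat[idx]);
        	int cpu;
      
        	/* fold in up to 32 not-yet-flushed pages per cpu */
        	for_each_online_cpu(cpu)
        		x += per_cpu_ptr(memcg->stat_cpu, cpu)->count[idx];
        	return x < 0 ? 0 : x;
        }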
      
      Link: http://lkml.kernel.org/r/20190329174609.164344-1-gthelen@google.com
      Signed-off-by: Greg Thelen <gthelen@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.16+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      43f47331
    • mm/huge_memory.c: fix modifying of page protection by insert_pfn_pmd() · 9a62d691
      Aneesh Kumar K.V authored
      commit c6f3c5ee40c10bb65725047a220570f718507001 upstream.
      
      With some architectures like ppc64, set_pmd_at() cannot cope with a
      situation where there is already some (different) valid entry present.
      
      Use pmdp_set_access_flags() instead to modify the pfn, as that
      helper is built to deal with modifying existing PMD entries.
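      
      A hedged sketch of the resulting insert_pfn_pmd() path (simplified):
      
        /* If a valid PMD is already present, update it in place with
         * pmdp_set_access_flags() rather than overwriting it with
         * set_pmd_at(), which ppc64 cannot do for a valid entry. */
        if (!pmd_none(*pmd)) {
        	if (write) {
        		if (pmd_pfn(*pmd) != pfn_t_to_pfn(pfn)) {
        			WARN_ON_ONCE(!is_huge_zero_pmd(*pmd));
        			goto out_unlock;
        		}
        		entry = pmd_mkyoung(*pmd);
        		entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
        		if (pmdp_set_access_flags(vma, addr, pmd, entry, 1))
        			update_mmu_cache_pmd(vma, addr, pmd);
        	}
        	goto out_unlock;
        }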
      
      This is similar to commit cae85cb8add3 ("mm/memory.c: fix modifying of
      page protection by insert_pfn()")
      
      We also make a similar update w.r.t. insert_pfn_pud even though ppc64
      doesn't support pud pfn entries now.
      
      Without this patch we also see the below message in the kernel log:
      "BUG: non-zero pgtables_bytes on freeing mm:"
      
      Link: http://lkml.kernel.org/r/20190402115125.18803-1-aneesh.kumar@linux.ibm.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Chandan Rajendra <chandan@linux.ibm.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9a62d691
  12. 06 April 2019 (10 commits)
    • page_poison: play nicely with KASAN · a6c56bf6
      Qian Cai authored
      [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]
      
      KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
      It triggers false positives in the allocation path:
      
        BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
        Read of size 8 at addr ffff88881f800000 by task swapper/0
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         __asan_report_load8_noabort+0x19/0x20
         memchr_inv+0x2ea/0x330
         kernel_poison_pages+0x103/0x3d5
         get_page_from_freelist+0x15e7/0x4d90
      
      because KASAN has not yet unpoisoned the shadow page for the
      allocation when it runs the memchr_inv() check, so it only finds a
      stale poison pattern.
      
      Also, false positives in free path,
      
        BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
        Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
        CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         check_memory_region+0x22d/0x250
         memset+0x28/0x40
         kernel_poison_pages+0x29e/0x3d5
         __free_pages_ok+0x75f/0x13e0
      
      because KASAN adds poisoned redzones around slab objects, while the
      page poisoning needs to poison the whole page.
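      
      The shape of the fix is to make KASAN stand back while the raw
      poison pattern is written or checked, using the existing
      kasan_disable_current()/kasan_enable_current() helpers; a sketch:
      
        static void poison_page(struct page *page)
        {
        	void *addr = kmap_atomic(page);
      
        	/* KASAN still thinks the page is in use, so silence it
        	 * around the raw whole-page memset. */
        	kasan_disable_current();
        	memset(addr, PAGE_POISON, PAGE_SIZE);
        	kasan_enable_current();
        	kunmap_atomic(addr);
        }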
      
      Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a6c56bf6
    • mm/slab.c: kmemleak no scan alien caches · f09c424c
      Qian Cai authored
      [ Upstream commit 92d1d07daad65c300c7d0b68bbef8867e9895d54 ]
      
      Kmemleak throws endless warnings during boot due to this code in
      __alloc_alien_cache():
      
          alc = kmalloc_node(memsize, gfp, node);
          init_arraycache(&alc->ac, entries, batch);
          kmemleak_no_scan(ac);
      
      Kmemleak does not track the array cache (alc->ac) but the alien cache
      (alc) instead, so let it track the latter by lifting kmemleak_no_scan()
      out of init_arraycache().
      
      There is another place that calls init_arraycache(), but
      alloc_kmem_cache_cpus() uses the percpu allocation, which will
      never be considered a leak.
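      
      A sketch of the resulting allocation path (simplified from the
      patch): register with kmemleak the object it actually tracks, the
      kmalloc'd alien cache itself, rather than the embedded array cache.
      
        static struct alien_cache *__alloc_alien_cache(int node, int entries,
        					       int batch, gfp_t gfp)
        {
        	size_t memsize = sizeof(void *) * entries +
        			 sizeof(struct alien_cache);
        	struct alien_cache *alc;
      
        	alc = kmalloc_node(memsize, gfp, node);
        	if (alc) {
        		/* lifted out of init_arraycache() */
        		kmemleak_no_scan(alc);
        		init_arraycache(&alc->ac, entries, batch);
        	}
        	return alc;
        }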
      
        kmemleak: Found object by alias at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         lookup_object+0x84/0xac
         find_and_get_object+0x84/0xe4
         kmemleak_no_scan+0x74/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
        kmemleak: Object 0xffff8007b9aa7e00 (size 256):
        kmemleak:   comm "swapper/0", pid 1, jiffies 4294697137
        kmemleak:   min_count = 1
        kmemleak:   count = 0
        kmemleak:   flags = 0x1
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             kmemleak_alloc+0x84/0xb8
             kmem_cache_alloc_node_trace+0x31c/0x3a0
             __kmalloc_node+0x58/0x78
             setup_kmem_cache_node+0x26c/0x35c
             __do_tune_cpucache+0x250/0x2d4
             do_tune_cpucache+0x4c/0xe4
             enable_cpucache+0xc8/0x110
             setup_cpu_cache+0x40/0x1b8
             __kmem_cache_create+0x240/0x358
             create_cache+0xc0/0x198
             kmem_cache_create_usercopy+0x158/0x20c
             kmem_cache_create+0x50/0x64
             fsnotify_init+0x58/0x6c
             do_one_initcall+0x194/0x388
             kernel_init_freeable+0x668/0x688
             kernel_init+0x18/0x124
        kmemleak: Not scanning unknown object at 0xffff8007b9aa7e38
        CPU: 190 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2+ #2
        Call trace:
         dump_backtrace+0x0/0x168
         show_stack+0x24/0x30
         dump_stack+0x88/0xb0
         kmemleak_no_scan+0x90/0xf4
         setup_kmem_cache_node+0x2b4/0x35c
         __do_tune_cpucache+0x250/0x2d4
         do_tune_cpucache+0x4c/0xe4
         enable_cpucache+0xc8/0x110
         setup_cpu_cache+0x40/0x1b8
         __kmem_cache_create+0x240/0x358
         create_cache+0xc0/0x198
         kmem_cache_create_usercopy+0x158/0x20c
         kmem_cache_create+0x50/0x64
         fsnotify_init+0x58/0x6c
         do_one_initcall+0x194/0x388
         kernel_init_freeable+0x668/0x688
         kernel_init+0x18/0x124
         ret_from_fork+0x10/0x18
      
      Link: http://lkml.kernel.org/r/20190129184518.39808-1-cai@lca.pw
      Fixes: 1fe00d50 ("slab: factor out initialization of array cache")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f09c424c
    • mm/vmalloc.c: fix kernel BUG at mm/vmalloc.c:512! · 8a0fc62e
      Uladzislau Rezki (Sony) authored
      [ Upstream commit afd07389d3f4933c7f7817a92fb5e053d59a3182 ]
      
      One of the vmalloc stress test case triggers the kernel BUG():
      
        <snip>
        [60.562151] ------------[ cut here ]------------
        [60.562154] kernel BUG at mm/vmalloc.c:512!
        [60.562206] invalid opcode: 0000 [#1] PREEMPT SMP PTI
        [60.562247] CPU: 0 PID: 430 Comm: vmalloc_test/0 Not tainted 4.20.0+ #161
        [60.562293] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
        [60.562351] RIP: 0010:alloc_vmap_area+0x36f/0x390
        <snip>
      
      It can happen due to a big align request resulting in an overflow
      of the calculated address, i.e. it becomes 0 after ALIGN()'s fixup.
      
      Fix it by checking if calculated address is within vstart/vend range.
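      
      A hedged sketch of the guard in alloc_vmap_area() (simplified, not
      the literal diff):
      
        /* ALIGN() on an address near ULONG_MAX can wrap to 0; reject any
         * candidate that left the requested [vstart, vend) range. */
        addr = ALIGN(addr, align);
        if (addr < vstart || addr + size < addr || addr + size > vend)
        	goto overflow;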
      
      Link: http://lkml.kernel.org/r/20190124115648.9433-2-urezki@gmail.com
      Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Joel Fernandes <joelaf@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Garnier <thgarnie@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      8a0fc62e
    • mm, mempolicy: fix uninit memory access · 67abbb9c
      Vlastimil Babka authored
      [ Upstream commit 2e25644e8da4ed3a27e7b8315aaae74660be72dc ]
      
      Syzbot with KMSAN reports (excerpt):
      
      ==================================================================
      BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:353 [inline]
      BUG: KMSAN: uninit-value in mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
      CPU: 1 PID: 17420 Comm: syz-executor4 Not tainted 4.20.0-rc7+ #15
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
        __dump_stack lib/dump_stack.c:77 [inline]
        dump_stack+0x173/0x1d0 lib/dump_stack.c:113
        kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
        __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:295
        mpol_rebind_policy mm/mempolicy.c:353 [inline]
        mpol_rebind_mm+0x249/0x370 mm/mempolicy.c:384
        update_tasks_nodemask+0x608/0xca0 kernel/cgroup/cpuset.c:1120
        update_nodemasks_hier kernel/cgroup/cpuset.c:1185 [inline]
        update_nodemask kernel/cgroup/cpuset.c:1253 [inline]
        cpuset_write_resmask+0x2a98/0x34b0 kernel/cgroup/cpuset.c:1728
      
      ...
      
      Uninit was created at:
        kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
        kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
        kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
        kmem_cache_alloc+0x572/0xb90 mm/slub.c:2777
        mpol_new mm/mempolicy.c:276 [inline]
        do_mbind mm/mempolicy.c:1180 [inline]
        kernel_mbind+0x8a7/0x31a0 mm/mempolicy.c:1347
        __do_sys_mbind mm/mempolicy.c:1354 [inline]
      
      As it's difficult to report where exactly the uninit value resides in
      the mempolicy object, we have to guess a bit.  mm/mempolicy.c:353
      contains this part of mpol_rebind_policy():
      
              if (!mpol_store_user_nodemask(pol) &&
                  nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
      
      "mpol_store_user_nodemask(pol)" is testing pol->flags, which I couldn't
      ever see being uninitialized after leaving mpol_new().  So I'll guess
      it's actually about accessing pol->w.cpuset_mems_allowed on line 354,
      but still part of statement starting on line 353.
      
      For w.cpuset_mems_allowed to be not initialized, and the nodes_equal()
      reachable for a mempolicy where mpol_set_nodemask() is called in
      do_mbind(), it seems the only possibility is a MPOL_PREFERRED policy
      with empty set of nodes, i.e.  MPOL_LOCAL equivalent, with MPOL_F_LOCAL
      flag.  Let's exclude such policies from the nodes_equal() check.  Note
      the uninit access should be benign anyway, as rebinding this kind of
      policy is always a no-op.  Therefore no actual need for stable
      inclusion.
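      
      A sketch of the resulting check in mpol_rebind_policy(): skip the
      w.cpuset_mems_allowed comparison for MPOL_F_LOCAL policies, where
      that field was never initialized.
      
        if (!mpol_store_user_nodemask(pol) && !(pol->flags & MPOL_F_LOCAL) &&
            nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
        	return;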
      
      Link: http://lkml.kernel.org/r/a71997c3-e8ae-a787-d5ce-3db05768b27c@suse.cz
      Link: http://lkml.kernel.org/r/73da3e9c-cc84-509e-17d9-0c434bb9967d@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: syzbot+b19c2dc2c990ea657a71@syzkaller.appspotmail.com
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      67abbb9c
    • memcg: killed threads should not invoke memcg OOM killer · 9d785b92
      Tetsuo Handa authored
      [ Upstream commit 7775face207922ea62a4e96b9cd45abfdc7b9840 ]
      
      If a memory cgroup contains a single process with many threads
      (including different process groups sharing the mm) then it is
      possible to trigger a race where the oom killer complains that there
      are no oom eligible tasks and complains to the log, which is both
      annoying and confusing because there is no actual problem.  The race
      looks as follows:
      
      P1				oom_reaper		P2
      try_charge						try_charge
        mem_cgroup_out_of_memory
          mutex_lock(oom_lock)
            out_of_memory
              oom_kill_process(P1,P2)
               wake_oom_reaper
          mutex_unlock(oom_lock)
          				oom_reap_task
      							  mutex_lock(oom_lock)
      							    select_bad_process # no victim
      
      The problem is more visible with many threads.
      
      Fix this by checking for fatal_signal_pending from
      mem_cgroup_out_of_memory when the oom_lock is already held.
      
      The oom bypass is safe because we do the same early in the try_charge
      path already.  The situation might have changed in the mean time.  It
      should be safe to check for fatal_signal_pending and tsk_is_oom_victim,
      but for better code readability abstract the current charge bypass
      condition into should_force_charge and reuse it from that path.
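      
      A sketch of the abstracted helper and its reuse under oom_lock
      (close to the upstream shape):
      
        /* Bypass charging, and skip the oom killer, for tasks that are
         * already dying or are themselves oom victims. */
        static bool should_force_charge(void)
        {
        	return tsk_is_oom_victim(current) ||
        	       fatal_signal_pending(current) ||
        	       (current->flags & PF_EXITING);
        }
      
        /* in mem_cgroup_out_of_memory(), after taking oom_lock: */
        ret = should_force_charge() || out_of_memory(&oc);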
      
      Link: http://lkml.kernel.org/r/01370f70-e1f6-ebe4-b95e-0df21a0bc15e@i-love.sakura.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9d785b92
    • mm,oom: don't kill global init via memory.oom.group · eed3ca0a
      Tetsuo Handa authored
      [ Upstream commit d342a0b38674867ea67fde47b0e1e60ffe9f17a2 ]
      
      Since attaching the global init process to some memory cgroup is
      technically possible, oom_kill_memcg_member() must check for it.
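      
      A sketch of the check (hedged; close to the upstream one-liner):
      
        static int oom_kill_memcg_member(struct task_struct *task, void *unused)
        {
        	/* never group-kill the global init process */
        	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
        	    !is_global_init(task)) {
        		get_task_struct(task);
        		__oom_kill_process(task);
        	}
        	return 0;
        }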
      
        Tasks in /test1 are going to be killed due to memory.oom.group set
        Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
        oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
      
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      int main(int argc, char *argv[])
      {
      	static char buffer[10485760];
      	static int pipe_fd[2] = { EOF, EOF };
      	unsigned int i;
      	int fd;
      	char buf[64] = { };
      	if (pipe(pipe_fd))
      		return 1;
      	if (chdir("/sys/fs/cgroup/"))
      		return 1;
      	fd = open("cgroup.subtree_control", O_WRONLY);
      	write(fd, "+memory", 7);
      	close(fd);
      	mkdir("test1", 0755);
      	fd = open("test1/memory.oom.group", O_WRONLY);
      	write(fd, "1", 1);
      	close(fd);
      	fd = open("test1/cgroup.procs", O_WRONLY);
      	write(fd, "1", 1);
      	snprintf(buf, sizeof(buf) - 1, "%d", getpid());
      	write(fd, buf, strlen(buf));
      	close(fd);
      	snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
      	fd = open("test1/memory.max", O_WRONLY);
      	write(fd, buf, strlen(buf));
      	close(fd);
      	for (i = 0; i < 10; i++)
      		if (fork() == 0) {
      			char c;
      			close(pipe_fd[1]);
      			read(pipe_fd[0], &c, 1);
      			memset(buffer, 0, sizeof(buffer));
      			sleep(3);
      			_exit(0);
      		}
      	close(pipe_fd[0]);
      	close(pipe_fd[1]);
      	sleep(3);
      	return 0;
      }
      
      [   37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
      [   37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
      [   37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [   37.062954][ T9185] Call Trace:
      [   37.063976][ T9185]  dump_stack+0x67/0x95
      [   37.065263][ T9185]  dump_header+0x51/0x570
      [   37.066619][ T9185]  ? trace_hardirqs_on+0x3f/0x110
      [   37.068171][ T9185]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
      [   37.069967][ T9185]  oom_kill_process+0x18d/0x210
      [   37.071515][ T9185]  out_of_memory+0x11b/0x380
      [   37.072936][ T9185]  mem_cgroup_out_of_memory+0xb6/0xd0
      [   37.074601][ T9185]  try_charge+0x790/0x820
      [   37.076021][ T9185]  mem_cgroup_try_charge+0x42/0x1d0
      [   37.077629][ T9185]  mem_cgroup_try_charge_delay+0x11/0x30
      [   37.079370][ T9185]  do_anonymous_page+0x105/0x5e0
      [   37.080939][ T9185]  __handle_mm_fault+0x9cb/0x1070
      [   37.082485][ T9185]  handle_mm_fault+0x1b2/0x3a0
      [   37.083819][ T9185]  ? handle_mm_fault+0x47/0x3a0
      [   37.085181][ T9185]  __do_page_fault+0x255/0x4c0
      [   37.086529][ T9185]  do_page_fault+0x28/0x260
      [   37.087788][ T9185]  ? page_fault+0x8/0x30
      [   37.088978][ T9185]  page_fault+0x1e/0x30
      [   37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
      [   37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
      [   37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
      [   37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
      [   37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
      [   37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
      [   37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
      [   37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
      [   37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
      [   37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
      [   37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
      [   37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
      [   37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
      [   37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
      [   37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
      [   37.132833][   T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
      [   37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
      [   37.144328][   T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
      [   37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
      [   37.157306][   T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
      [   37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
      [   37.160083][   T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.160187][   T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.206941][   T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
      [   37.212317][   T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
      [   37.227667][   T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
      [   37.351843][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
      [   37.354833][    T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
      [   37.357876][    T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [   37.361685][    T1] Call Trace:
      [   37.363239][    T1]  dump_stack+0x67/0x95
      [   37.365010][    T1]  panic+0xfc/0x2b0
      [   37.366853][    T1]  do_exit+0xd55/0xd60
      [   37.368595][    T1]  do_group_exit+0x47/0xc0
      [   37.370415][    T1]  get_signal+0x32a/0x920
      [   37.372449][    T1]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
      [   37.374596][    T1]  do_signal+0x32/0x6e0
      [   37.376430][    T1]  ? exit_to_usermode_loop+0x26/0x9b
      [   37.378418][    T1]  ? prepare_exit_to_usermode+0xa8/0xd0
      [   37.380571][    T1]  exit_to_usermode_loop+0x3e/0x9b
      [   37.382588][    T1]  prepare_exit_to_usermode+0xa8/0xd0
      [   37.384594][    T1]  ? page_fault+0x8/0x30
      [   37.386453][    T1]  retint_user+0x8/0x18
      [   37.388160][    T1] RIP: 0033:0x7f42c06974a8
      [   37.389922][    T1] Code: Bad RIP value.
      [   37.391788][    T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
      [   37.394075][    T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
      [   37.396963][    T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
      [   37.399550][    T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
      [   37.402334][    T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
      [   37.404890][    T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0
      
      Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
      Fixes: 3d8b38eb ("mm, oom: introduce memory.oom.group")
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      eed3ca0a
    • D
      mm, swap: bounds check swap_info array accesses to avoid NULL derefs · ed3345a6
      Daniel Jordan authored
      [ Upstream commit c10d38cc8d3e43f946b6c2bf4602c86791587f30 ]
      
      Dan Carpenter reports a potential NULL dereference in
      get_swap_page_of_type:
      
        Smatch complains that the NULL checks on "si" aren't consistent.  This
        seems like a real bug because we have not ensured that the type is
        valid and so "si" can be NULL.
      
      Add the missing check for NULL, taking care to use a read barrier to
      ensure CPU1 observes CPU0's updates in the correct order:
      
           CPU0                           CPU1
           alloc_swap_info()              if (type >= nr_swapfiles)
             swap_info[type] = p              /* handle invalid entry */
             smp_wmb()                    smp_rmb()
             ++nr_swapfiles               p = swap_info[type]
      
      Without smp_rmb, CPU1 might observe CPU0's write to nr_swapfiles before
      CPU0's write to swap_info[type] and read NULL from swap_info[type].
      
      Ying Huang noticed other places in swapfile.c don't order these reads
      properly.  Introduce swap_type_to_swap_info to encourage correct usage.
      
      Use READ_ONCE and WRITE_ONCE to follow the Linux Kernel Memory Model
      (see tools/memory-model/Documentation/explanation.txt).
      
      This ordering need not be enforced in places where swap_lock is held
      (e.g.  si_swapinfo) because swap_lock serializes updates to nr_swapfiles
      and the swap_info array.
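
      A minimal sketch of the helper this introduces, following the ordering
      diagram above (the body is paraphrased from this description, not quoted
      from the patch itself):

          static struct swap_info_struct *swap_type_to_swap_info(int type)
          {
                  if (type >= READ_ONCE(nr_swapfiles))
                          return NULL;

                  smp_rmb();      /* pairs with smp_wmb() in alloc_swap_info() */
                  return READ_ONCE(swap_info[type]);
          }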
      
      Link: http://lkml.kernel.org/r/20190131024410.29859-1-daniel.m.jordan@oracle.com
      Fixes: ec8acf20 ("swap: add per-partition lock for swapfile")
      Signed-off-by: NDaniel Jordan <daniel.m.jordan@oracle.com>
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Suggested-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Shaohua Li <shli@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      ed3345a6
    • Q
      mm/page_ext.c: fix an imbalance with kmemleak · 4c6d7dc7
      Qian Cai authored
      [ Upstream commit 0c81585499601acd1d0e1cbf424cabfaee60628c ]
      
      After offlining a memory block, a kmemleak scan will trigger a crash
      when it encounters a page_ext address that has already been freed during
      memory offlining.  The root cause is an imbalance: alloc_page_ext()
      registers the allocation with kmemleak_alloc(), but free_page_ext()
      never calls the matching kmemleak_free().
      
          BUG: unable to handle kernel paging request at ffff888453d00000
          PGD 128a01067 P4D 128a01067 PUD 128a04067 PMD 47e09e067 PTE 800ffffbac2ff060
          Oops: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
          CPU: 1 PID: 1594 Comm: bash Not tainted 5.0.0-rc8+ #15
          Hardware name: HP ProLiant DL180 Gen9/ProLiant DL180 Gen9, BIOS U20 10/25/2017
          RIP: 0010:scan_block+0xb5/0x290
          Code: 85 6e 01 00 00 48 b8 00 00 30 f5 81 88 ff ff 48 39 c3 0f 84 5b 01 00 00 48 89 d8 48 c1 e8 03 42 80 3c 20 00 0f 85 87 01 00 00 <4c> 8b 3b e8 f3 0c fa ff 4c 39 3d 0c 6b 4c 01 0f 87 08 01 00 00 4c
          RSP: 0018:ffff8881ec57f8e0 EFLAGS: 00010082
          RAX: 0000000000000000 RBX: ffff888453d00000 RCX: ffffffffa61e5a54
          RDX: 0000000000000000 RSI: 0000000000000008 RDI: ffff888453d00000
          RBP: ffff8881ec57f920 R08: fffffbfff4ed588d R09: fffffbfff4ed588c
          R10: fffffbfff4ed588c R11: ffffffffa76ac463 R12: dffffc0000000000
          R13: ffff888453d00ff9 R14: ffff8881f80cef48 R15: ffff8881f80cef48
          FS:  00007f6c0e3f8740(0000) GS:ffff8881f7680000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: ffff888453d00000 CR3: 00000001c4244003 CR4: 00000000001606a0
          Call Trace:
           scan_gray_list+0x269/0x430
           kmemleak_scan+0x5a8/0x10f0
           kmemleak_write+0x541/0x6ca
           full_proxy_write+0xf8/0x190
           __vfs_write+0xeb/0x980
           vfs_write+0x15a/0x4f0
           ksys_write+0xd2/0x1b0
           __x64_sys_write+0x73/0xb0
           do_syscall_64+0xeb/0xaaa
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
          RIP: 0033:0x7f6c0dad73b8
          Code: 89 02 48 c7 c0 ff ff ff ff eb b3 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 63 2d 00 8b 00 85 c0 75 17 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 41 54 49 89 d4 55
          RSP: 002b:00007ffd5b863cb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
          RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007f6c0dad73b8
          RDX: 0000000000000005 RSI: 000055a9216e1710 RDI: 0000000000000001
          RBP: 000055a9216e1710 R08: 000000000000000a R09: 00007ffd5b863840
          R10: 000000000000000a R11: 0000000000000246 R12: 00007f6c0dda9780
          R13: 0000000000000005 R14: 00007f6c0dda4740 R15: 0000000000000005
          Modules linked in: nls_iso8859_1 nls_cp437 vfat fat kvm_intel kvm irqbypass efivars ip_tables x_tables xfs sd_mod ahci libahci igb i2c_algo_bit libata i2c_core dm_mirror dm_region_hash dm_log dm_mod efivarfs
          CR2: ffff888453d00000
          ---[ end trace ccf646c7456717c5 ]---
          Kernel panic - not syncing: Fatal exception
          Shutting down cpus with NMI
          Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range:
          0xffffffff80000000-0xffffffffbfffffff)
          ---[ end Kernel panic - not syncing: Fatal exception ]---
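
      The shape of the fix implied above is a matching kmemleak_free() on the
      teardown path.  A sketch, assuming the helper names visible in
      mm/page_ext.c around this kernel version (the table_size computation and
      BUG_ON are illustrative context):

          static void free_page_ext(void *addr)
          {
                  if (is_vmalloc_addr(addr)) {
                          vfree(addr);
                  } else {
                          struct page *page = virt_to_page(addr);
                          size_t table_size;

                          table_size = get_entry_size() * PAGES_PER_SECTION;

                          BUG_ON(PageReserved(page));
                          /* balance the kmemleak_alloc() in alloc_page_ext() */
                          kmemleak_free(addr);
                          free_pages_exact(addr, table_size);
                  }
          }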
      
      Link: http://lkml.kernel.org/r/20190227173147.75650-1-cai@lca.pw
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4c6d7dc7
    • P
      mm/cma.c: cma_declare_contiguous: correct err handling · f555b008
      Peng Fan authored
      [ Upstream commit 0d3bd18a5efd66097ef58622b898d3139790aa9d ]
      
      If cma_init_reserved_mem() fails, the memblock region allocated by
      memblock_reserve() or memblock_alloc_range() needs to be freed.
      
      Quoting Catalin's comments:
        https://lkml.org/lkml/2019/2/26/482
      
      Kmemleak is supposed to work with the memblock_{alloc,free} pair and it
      ignores the memblock_reserve() as a memblock_alloc() implementation
      detail. It is, however, tolerant to memblock_free() being called on
      a sub-range or just a different range from a previous memblock_alloc().
      So the original patch looks fine to me. FWIW:
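
      The corrected error path, then, frees the reservation before falling
      through to the existing error label.  A sketch (the label names and the
      pr_err() message are illustrative, not quoted from the patch):

          ret = cma_init_reserved_mem(base, size, order_per_bit, name, res_cma);
          if (ret)
                  goto free_mem;

          return 0;

      free_mem:
          /* undo the memblock_reserve()/memblock_alloc_range() above */
          memblock_free(base, size);
      err:
          pr_err("Failed to reserve %ld MiB\n", (unsigned long)size / SZ_1M);
          return ret;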
      
      Link: http://lkml.kernel.org/r/20190227144631.16708-1-peng.fan@nxp.com
      Signed-off-by: NPeng Fan <peng.fan@nxp.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Andrey Konovalov <andreyknvl@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      f555b008
    • Q
      mm/sparse: fix a bad comparison · 7b287c47
      Qian Cai authored
      [ Upstream commit d778015ac95bc036af73342c878ab19250e01fe1 ]
      
      next_present_section_nr() returns an unsigned type, so the only
      "negative" value it can hand back is -1 converted to unsigned.  Check
      for that sentinel explicitly instead of comparing against zero; the
      compiler converts the -1 to the unsigned type as needed.
      
        mm/sparse.c: In function 'sparse_init_nid':
        mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
               ((section_nr >= 0) &&    \
                            ^~
        mm/sparse.c:478:2: note: in expansion of macro
        'for_each_present_section_nr'
          for_each_present_section_nr(pnum_begin, pnum) {
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
               ((section_nr >= 0) &&    \
                            ^~
        mm/sparse.c:497:2: note: in expansion of macro
        'for_each_present_section_nr'
          for_each_present_section_nr(pnum_begin, pnum) {
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
        mm/sparse.c: In function 'sparse_init':
        mm/sparse.c:200:20: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
               ((section_nr >= 0) &&    \
                            ^~
        mm/sparse.c:520:2: note: in expansion of macro
        'for_each_present_section_nr'
          for_each_present_section_nr(pnum_begin + 1, pnum_end) {
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~
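
      A sketch of the changed comparison, reconstructed from the macro body
      shown in the warnings (the rest of the loop header is an assumption):

          #define for_each_present_section_nr(start, section_nr)          \
                  for (section_nr = next_present_section_nr((start) - 1); \
                       ((section_nr != -1) &&                             \
                        (section_nr <= __highest_present_section_nr));    \
                       section_nr = next_present_section_nr(section_nr))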
      
      Link: http://lkml.kernel.org/r/20190228181839.86504-1-cai@lca.pw
      Fixes: c4e1be9e ("mm, sparsemem: break out of loops early")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      7b287c47
  13. 03 April, 2019 3 commits
  14. 24 March, 2019 1 commit
    • J
      mm/memory.c: do_fault: avoid usage of stale vm_area_struct · 09417dd3
      Jan Stancek authored
      commit fc8efd2ddfed3f343c11b693e87140ff358d7ff5 upstream.
      
      LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
      This is a stress test, where one thread mmaps/writes/munmaps a memory
      area while another thread tries to read from it:
      
        CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
        Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
        Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
        Call Trace:
        ([<0000000000000000>]           (null))
         [<00000000001adae4>] lock_acquire+0xec/0x258
         [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
         [<000000000012a780>] page_table_free+0x48/0x1a8
         [<00000000002f6e54>] do_fault+0xdc/0x670
         [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
         [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
         [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
         [<000000000080e5ee>] pgm_check_handler+0x19e/0x200
      
      page_table_free() is called with a NULL mm parameter, but because "0" is
      a valid address on s390 (see S390_lowcore), it keeps going until it
      eventually crashes in lockdep's lock_acquire().  This crash has been
      reproducible since at least 4.14.
      
      The problem is that the "vmf->vma" used in do_fault() can become stale.
      Because mmap_sem may be released, other threads can come in, call
      munmap(), and cause the "vma" to be returned to the kmem cache, where it
      gets zeroed, re-initialized, and re-used:
      
      handle_mm_fault                           |
        __handle_mm_fault                       |
          do_fault                              |
            vma = vmf->vma                      |
            do_read_fault                       |
              __do_fault                        |
                vma->vm_ops->fault(vmf);        |
                  mmap_sem is released          |
                                                |
                                                | do_munmap()
                                                |   remove_vma_list()
                                                |     remove_vma()
                                                |       vm_area_free()
                                                |         # vma is released
                                                | ...
                                                | # same vma is allocated
                                                | # from kmem cache
                                                | do_mmap()
                                                |   vm_area_alloc()
                                                |     memset(vma, 0, ...)
                                                |
            pte_free(vma->vm_mm, ...);          |
              page_table_free                   |
                spin_lock_bh(&mm->context.lock);|
                  <crash>                       |
      
      Cache the mm_struct to avoid using the potentially stale "vma", as
      sketched below.
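
      A sketch of that fix: snapshot the mm_struct pointer while "vma" is
      still known to be valid, and use the cached pointer for cleanup that can
      run after ->fault() has dropped mmap_sem (heavily simplified; the body
      of do_fault() is elided):

          static vm_fault_t do_fault(struct vm_fault *vmf)
          {
                  struct vm_area_struct *vma = vmf->vma;
                  struct mm_struct *vm_mm = vma->vm_mm; /* cached before vma can go stale */
                  vm_fault_t ret = 0;

                  /* ... dispatch to do_read_fault()/do_cow_fault()/do_shared_fault() ... */

                  /* The preallocated page table was not used: free it via the
                   * cached mm, not via vma->vm_mm, which may now point into a
                   * recycled vma. */
                  if (vmf->prealloc_pte) {
                          pte_free(vm_mm, vmf->prealloc_pte);
                          vmf->prealloc_pte = NULL;
                  }
                  return ret;
          }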
      
      [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
      
      Link: http://lkml.kernel.org/r/5b3fdf19e2a5be460a384b936f5b56e13733f1b8.1551595137.git.jstancek@redhat.com
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      09417dd3