1. 19 January 2020 (7 commits)
    • perf/smmuv3: Enable HiSilicon Erratum 162001800 quirk · 8c6138de
      Committed by Shameer Kolothum
      commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream
      
      HiSilicon erratum 162001800 describes a limitation of the
      SMMUv3 PMCG implementation on HiSilicon Hip08 platforms.
      
      On these platforms, the PMCG event counter registers
      (SMMU_PMCG_EVCNTRn) are read-only, and as a result it
      is not possible to set the initial counter period value
      when an event monitor is started.
      
      To work around this, the current value of the counter
      is read and used for delta calculations. OEM information
      from the ACPI table header is used to identify the affected
      hardware platforms.
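      
      The resulting counting scheme is easiest to see as code. A minimal sketch
      of the idea (helper names and hook points are illustrative, not the
      driver's verbatim code):
      
          #include <linux/perf_event.h>
          
          /* On an affected Hip08: never write SMMU_PMCG_EVCNTRn; snapshot the
           * read-only counter at start and accumulate deltas at update time. */
          static void quirk_event_start(struct perf_event *event, u64 hw_now)
          {
                  local64_set(&event->hw.prev_count, hw_now);
          }
          
          static void quirk_event_update(struct perf_event *event, u64 hw_now)
          {
                  u64 prev = local64_read(&event->hw.prev_count);
          
                  /* count only the progress made since the last snapshot */
                  local64_add(hw_now - prev, &event->count);
                  local64_set(&event->hw.prev_count, hw_now);
          }
      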
      Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: Hanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      [will: update silicon-errata.txt and add reason string to acpi match]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      8c6138de
    • perf/smmuv3: Add MSI irq support · fdfcde65
      Committed by Shameer Kolothum
      commit f202cdab3b48d8c2c1846c938ea69cb8aa897699 upstream
      
      This adds support for an MSI-based counter overflow interrupt.
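      
      A rough sketch of the plumbing (platform-MSI allocation plus a doorbell
      write; the register offsets follow the SMMUv3 PMCG layout, and the
      drvdata convention here is an assumption, not the driver's exact code):
      
          #include <linux/device.h>
          #include <linux/io.h>
          #include <linux/msi.h>
          
          #define SMMU_PMCG_IRQ_CFG0      0xE58   /* MSI doorbell address */
          #define SMMU_PMCG_IRQ_CFG1      0xE60   /* MSI payload */
          
          static void pmcg_write_msi_msg(struct msi_desc *desc, struct msi_msg *msg)
          {
                  struct device *dev = msi_desc_to_dev(desc);
                  void __iomem *base = dev_get_drvdata(dev); /* assumed: PMCG page 0 */
                  u64 doorbell = ((u64)msg->address_hi << 32) | msg->address_lo;
          
                  writeq_relaxed(doorbell, base + SMMU_PMCG_IRQ_CFG0);
                  writel_relaxed(msg->data, base + SMMU_PMCG_IRQ_CFG1);
          }
          
          static int pmcg_setup_msi(struct device *dev)
          {
                  /* a single vector is enough: it only signals counter overflow */
                  return platform_msi_domain_alloc_irqs(dev, 1, pmcg_write_msi_msg);
          }
      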
      Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      fdfcde65
    • perf/smmuv3: Add arm64 smmuv3 pmu driver · 6e1561d0
      Committed by Neil Leeder
      commit 7d839b4b9e00645e49345d6ce5dfa8edf53c1a21 upstream
      
      Add a new driver to support the SMMUv3 PMU and add it into the
      perf events framework.
      
      Each SMMU node may have multiple PMUs associated with it, each of
      which may support different events.
      
      SMMUv3 PMCG devices are named smmuv3_pmcg_<phys_addr_page>, where
      <phys_addr_page> is the physical address of the SMMU PMCG shifted
      right by 12 bits (the 4K page offset). For example, the PMCG at
      0xff88840000 is named smmuv3_pmcg_ff88840.
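      
      In other words, the name is the page-frame part of the PMCG's first
      register page; a sketch of that derivation (the constant and helper
      names are illustrative):
      
          #include <linux/device.h>
          #include <linux/ioport.h>
          #include <linux/slab.h>
          
          #define PMCG_PA_SHIFT   12      /* drop the 4K page offset bits */
          
          /* res->start == 0xff88840000 yields "smmuv3_pmcg_ff88840" */
          static char *pmcg_device_name(struct device *dev, struct resource *res)
          {
                  return devm_kasprintf(dev, GFP_KERNEL, "smmuv3_pmcg_%llx",
                                        (unsigned long long)(res->start >> PMCG_PA_SHIFT));
          }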
      
      Filtering by stream ID is done by specifying filtering parameters
      with the event. The options are:
         filter_enable    - 0 = no filtering, 1 = filtering enabled
         filter_span      - 0 = exact match, 1 = pattern match
         filter_stream_id - pattern to filter against
      
      Example: perf stat -e smmuv3_pmcg_ff88840/transaction,filter_enable=1,
                             filter_span=1,filter_stream_id=0x42/ -a netperf
      
      Applies filter pattern 0x42 to transaction events, which means events
      matching stream ids 0x42 & 0x43 are counted as only upper StreamID
      bits are required to match the given filter. Further filtering
      information is available in the SMMU documentation.
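      
      The span behaviour in the example can be modelled roughly like this (an
      illustration of the rule stated above, not the architectural definition
      from the SMMU documentation):
      
          #include <stdbool.h>
          #include <stdint.h>
          
          /* Bits at and below the lowest clear bit of the pattern are ignored,
           * so pattern 0x42 matches stream IDs 0x42 and 0x43. */
          static bool span_match(uint32_t stream_id, uint32_t pattern)
          {
                  uint32_t ignored = pattern ^ (pattern + 1);
          
                  return (stream_id & ~ignored) == (pattern & ~ignored);
          }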
      
      SMMU events are not attributable to a CPU, so task mode and sampling
      are not supported.
      Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
      Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      [will: fold in review feedback from Robin]
      [will: rewrite Kconfig text and allow building as a module]
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      6e1561d0
    • ACPI/IORT: Add support for PMCG · c54f184c
      Committed by Neil Leeder
      commit 24e516049360eda85cf3fe9903221d43886c2689 upstream.
      
      Add support for the SMMU Performance Monitor Counter Group
      information from ACPI. This is in preparation for its use
      in the SMMUv3 PMU driver.
      Signed-off-by: Neil Leeder <nleeder@codeaurora.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Acked-by: Lorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      c54f184c
    • iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages() · a3f3cb1f
      Committed by Ganapatrao Kulkarni
      commit c4b17afb0a4e8d042320efaf2acf55cb26795f78 upstream.
      
      Change __iommu_dma_alloc_pages() to allocate pages for DMA from the
      respective device's NUMA node. The ternary operator that would be used
      for alloc_pages_node() is tidied up along with this.
      
      The motivation for this change is to have a policy for page allocation
      consistent with direct DMA mapping, which attempts to allocate pages local
      to the device, as mentioned in [1].
      
      In addition, a marginal performance improvement has been observed for
      certain workloads: the patch yielded a 0.9% average throughput
      improvement when running tcrypt with the HiSilicon crypto engine.
      
      We also include a modification to use kvzalloc() in place of the
      kzalloc()/vzalloc() combination.
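      
      The gist of the change, as a simplified sketch (the higher-order
      allocation and splitting done by the real function are omitted):
      
          #include <linux/device.h>
          #include <linux/gfp.h>
          #include <linux/mm.h>
          #include <linux/overflow.h>
          #include <linux/slab.h>
          
          static struct page **dma_pages_alloc_on_node(struct device *dev,
                                                       unsigned int count, gfp_t gfp)
          {
                  int nid = dev_to_node(dev);     /* NUMA node the device sits on */
                  struct page **pages;
                  unsigned int i;
          
                  /* kvzalloc() replaces the old kzalloc()/vzalloc() fallback pair */
                  pages = kvzalloc(array_size(count, sizeof(*pages)), GFP_KERNEL);
                  if (!pages)
                          return NULL;
          
                  for (i = 0; i < count; i++) {
                          /* prefer pages local to the device, as direct DMA does */
                          pages[i] = alloc_pages_node(nid, gfp, 0);
                          if (!pages[i]) {
                                  while (i--)
                                          __free_pages(pages[i], 0);
                                  kvfree(pages);
                                  return NULL;
                          }
                  }
                  return pages;
          }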
      
      [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.html
      Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
      [JPG: Added kvzalloc(), drop pages ** being device local, remove ternary operator, update message]
      Signed-off-by: John Garry <john.garry@huawei.com>
      Reviewed-by: Robin Murphy <robin.murphy@arm.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      a3f3cb1f
    • mm/hotplug: make remove_memory() interface usable · 27d25b17
      Committed by Pavel Tatashin
      commit eca499ab3749a4537dee77ffead47a1a2c0dee19 upstream
      
      Presently the remove_memory() interface is inherently broken.  It tries
      to remove memory but panics if some memory is not offline.  The problem
      is that it is impossible to ensure that all memory blocks are offline as
      this function also takes lock_device_hotplug that is required to change
      memory state via sysfs.
      
      So, between calling this function and offlining all memory blocks there
      is always a window when lock_device_hotplug is released, and therefore,
      there is always a chance for a panic during this window.
      
      Make this interface return an error if memory removal fails.  This
      way it is safe to call this function without panicking the machine,
      and it also becomes symmetric to add_memory(), which already returns
      an error.
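      
      With the changed interface (after this patch remove_memory() returns an
      int rather than panicking), a caller can look roughly like the
      hypothetical sketch below:
      
          #include <linux/memory_hotplug.h>
          #include <linux/printk.h>
          
          /* Hypothetical caller: tolerate memory blocks that are still online. */
          static int try_remove_range(int nid, u64 start, u64 size)
          {
                  int rc = remove_memory(nid, start, size);
          
                  if (rc)         /* e.g. -EBUSY if some blocks were not offline */
                          pr_warn("removing [%#llx, %#llx) failed: %d\n",
                                  start, start + size, rc);
                  return rc;
          }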
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: yinhe <yinhe@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      27d25b17
    • mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · e060ade6
      Committed by David Hildenbrand
      commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream
      
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
              echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
              echo 1 > /sys/devices/system/memory/memory9/online
      will not take the mem_hotplug_lock; it only takes the device_lock() and
      the device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We e.g.  touch in
      online_pages() basically unprotected zone->present_pages then.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are no other users in the tree.
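      
      The shape of the change is the classic exported-wrapper pattern; a sketch
      (the inner helper name __do_remove_memory is illustrative, and at the
      time of this patch remove_memory() still returned void):
      
          #include <linux/device.h>
          #include <linux/export.h>
          #include <linux/types.h>
          
          /* illustrative stand-in for the unlocked removal path */
          static void __do_remove_memory(int nid, u64 start, u64 size) { }
          
          /* Exported entry point: take device_hotplug_lock here so callers no
           * longer need access to the unexported lock themselves. */
          void remove_memory(int nid, u64 start, u64 size)
          {
                  lock_device_hotplug();
                  __do_remove_memory(nid, start, size);
                  unlock_device_hotplug();
          }
          EXPORT_SYMBOL_GPL(remove_memory);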
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: yinhe <yinhe@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      e060ade6
  2. 17 January 2020 (1 commit)
    • alinux: mm: kidled: fix frame-larger-than build warning · 042cecca
      Committed by Xu Yu
      This fixes the following build warning:
      
      mm/memcontrol.c: In function 'mem_cgroup_idle_page_stats_show':
      mm/memcontrol.c:3866:1: warning: the frame size of 2160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      
      The root cause is that "mem_cgroup_idle_page_stats_show" has two
      "struct idle_page_stats" variables, each of which is 1056 bytes in
      size, on the stack, thus exceeding the 2048 max frame size.
      
      This fixes the build warning by dynamically allocating memory for these
      two variables with kmalloc().
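      
      The shape of the fix, sketched (the structure layout and local names are
      assumed, not copied from mm/memcontrol.c):
      
          #include <linux/slab.h>
          
          /* Placeholder with the same footprint as the real kidled structure
           * (~1056 bytes); the real definition lives in the kidled headers. */
          struct idle_page_stats { unsigned long buckets[132]; };
          
          /* The two large buffers move from the stack to the heap, bringing the
           * frame of mem_cgroup_idle_page_stats_show() back under 2048 bytes. */
          static int idle_page_stats_show_sketch(void)
          {
                  struct idle_page_stats *stats;
          
                  stats = kmalloc_array(2, sizeof(*stats), GFP_KERNEL);
                  if (!stats)
                          return -ENOMEM;
          
                  /* ... fill and print stats[0] and stats[1] as before ... */
          
                  kfree(stats);
                  return 0;
          }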
      
      Fixes: f55ac551 ("alinux: mm: Support kidled")
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      042cecca
  3. 15 January 2020 (24 commits)
  4. 14 January 2020 (8 commits)
    • tpm: tpm_tis_spi: Introduce a flow control callback · 9d732776
      Committed by Stephen Boyd
      commit 8ab5e82afa969b65b286d8949c12d2a64c83960c upstream.
      
      Cr50 firmware has a different flow control protocol than the one used by
      this TPM PTP SPI driver. Introduce a flow control callback so we can
      override the standard sequence with the custom one that Cr50 uses.
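      
      The shape of the hook, sketched (the field layout of the driver's phy
      structure is an assumption; only the callback itself is the point):
      
          #include <linux/spi/spi.h>
          
          /* Sketch: the transfer path calls phy->flow_control() instead of
           * open-coding the standard TPM PTP wait-state sequence, so a
           * Cr50-specific implementation can be plugged in. */
          struct tpm_tis_spi_phy_sketch {
                  struct spi_device *spi_device;
                  int (*flow_control)(struct tpm_tis_spi_phy_sketch *phy,
                                      struct spi_transfer *xfer);
                  /* ... other driver-private state ... */
          };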
      
      Cc: Andrey Pronin <apronin@chromium.org>
      Cc: Duncan Laurie <dlaurie@chromium.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Alexander Steffen <Alexander.Steffen@infineon.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Signed-off-by: Stephen Boyd <swboyd@chromium.org>
      Tested-by: Heiko Stuebner <heiko@sntech.de>
      Reviewed-by: Heiko Stuebner <heiko@sntech.de>
      Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
      Acked-by: Xunlei Pang <xlpang@linux.alibaba.com>
      9d732776
    • tcp: Add snd_wnd to TCP_INFO · 05f1bb78
      Committed by Thomas Higdon
      commit 8f7baad7f03543451af27f5380fc816b008aa1f2 upstream
      
      Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
      performance problems --
      > (1) Usually when we're diagnosing TCP performance problems, we do so
      > from the sender, since the sender makes most of the
      > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
      > From the sender-side the thing that would be most useful is to see
      > tp->snd_wnd, the receive window that the receiver has advertised to
      > the sender.
      
      This serves the purpose of adding an additional __u32 to avoid the
      would-be hole caused by the addition of the tcpi_rcv_ooopack field.
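      
      On a kernel and uapi headers that carry this patch, the new field can be
      read with the usual TCP_INFO getsockopt(); a small hedged example:
      
          #include <netinet/in.h>
          #include <netinet/tcp.h>
          #include <stdio.h>
          #include <sys/socket.h>
          
          /* Print the peer-advertised receive window for a connected TCP socket.
           * Requires headers that already define tcpi_snd_wnd. */
          static void print_snd_wnd(int fd)
          {
                  struct tcp_info ti;
                  socklen_t len = sizeof(ti);
          
                  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) == 0)
                          printf("snd_wnd: %u bytes\n", ti.tcpi_snd_wnd);
          }
      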
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Yuchung Cheng <ycheng@google.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
      05f1bb78
    • tcp: Add TCP_INFO counter for packets received out-of-order · ad8cde51
      Committed by Thomas Higdon
      commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream
      
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. Providing this counter in
      TCP_INFO makes it possible to understand to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      that indicate congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
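      
      For the monitoring case described above, the counter is most useful as a
      delta over an interval; a hedged sketch (again assuming headers that
      already carry the new tcpi_rcv_ooopack field):
      
          #include <netinet/in.h>
          #include <netinet/tcp.h>
          #include <sys/socket.h>
          
          /* Return the current out-of-order packet count for a connection;
           * callers can diff two samples to see how much reordering a given
           * client is experiencing. */
          static unsigned int sample_rcv_ooopack(int fd)
          {
                  struct tcp_info ti;
                  socklen_t len = sizeof(ti);
          
                  if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &ti, &len) != 0)
                          return 0;
                  return ti.tcpi_rcv_ooopack;
          }
      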
      Signed-off-by: Thomas Higdon <tph@fb.com>
      Acked-by: Neal Cardwell <ncardwell@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
      Acked-by: Dust Li <dust.li@linux.alibaba.com>
      ad8cde51
    • alinux: mm,memcg: export memory.{min,low} to cgroup v1 · 32ad3d43
      Committed by Xu Yu
      Export "memory.min" and "memory.low" from cgroup v2 to v1.
      
      There is a subtle difference between v1 and v2: no task is allowed in
      intermediate memcgs under the v2 hierarchy, which can lead to different
      behaviour between the two. Getting the intended protection requires all
      the intermediate nodes to have memory.min|low set, and we must keep this
      in mind when using this feature under v1.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      32ad3d43
    • alinux: mm,memcg: export memory.{events,events.local} to v1 · b3ea37a5
      Committed by Xu Yu
      Export "memory.events" and "memory.events.local" from cgroup v2 to v1.
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      b3ea37a5
    • mm: don't raise MEMCG_OOM event due to failed high-order allocation · 9ddf0a56
      Committed by Roman Gushchin
      commit 7a1adfddaf0d11a39fdcaf6e82a88e9c0586e08b upstream.
      
      It was reported that on some of our machines containers were restarted
      with OOM symptoms without an obvious reason.  Although there was almost
      no memory pressure and plenty of page cache, the MEMCG_OOM event was
      raised occasionally, causing the container management software to think
      that an OOM had happened.  However, no tasks had been killed.
      
      The following investigation showed that the problem is caused by a failing
      attempt to charge a high-order page.  In such case, the OOM killer is
      never invoked.  As shown below, it can happen under conditions, which are
      very far from a real OOM: e.g.  there is plenty of clean page cache and no
      memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.  restart the
      workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage is
      about memory.max, which is absolutely natural for most memory cgroups, we
      try to reclaim some pages.  Even if we were able to reclaim enough memory
      for the allocation, the following check can fail due to a race with
      another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for a high-order allocation this condition will intentionally fail.
      The reason behind this is that we'll likely fall back to regular pages
      anyway, so it's ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
      order check, so that the event won't be raised for high order allocations.
      This change doesn't affect regular pages allocation and charging.
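      
      The fix amounts to guarding the event with the same order check that
      gates the OOM killer; a simplified sketch of that move (not the full
      function from mm/memcontrol.c):
      
          #include <linux/memcontrol.h>
          #include <linux/mm.h>
          
          /* Raise MEMCG_OOM only when the charge order is small enough that the
           * OOM killer would actually be invoked. */
          static bool maybe_raise_memcg_oom(struct mem_cgroup *memcg, int order)
          {
                  if (order > PAGE_ALLOC_COSTLY_ORDER)
                          return false;   /* high-order: just return -ENOMEM */
          
                  memcg_memory_event(memcg, MEMCG_OOM);
                  return true;
          }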
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      9ddf0a56
    • mm, memcg: introduce memory.events.local · 5fcab459
      Committed by Shakeel Butt
      commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.
      
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      a centralized monitor is very inconvenient.  The monitor will keep
      receiving events in which it is not interested, and to find out whether
      a received event is interesting, it has to read the memory.events files
      of the next level and compare them with the top-level one.  So, let's
      introduce memory.events.local, which shows and notifies only the events
      at that memcg's own level.
      
      Now, do memory.stat and memory.pressure need their local versions?  IMHO
      no, due to the no-internal-process constraint of cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      5fcab459
    • mm, memcg: consider subtrees in memory.events · 0d9e08d3
      Committed by Chris Down
      commit 9852ae3fe5293264f01c49f2571ef7688f7823ce upstream.
      
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  This is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      [xuyu: remove the new memory_localevents mount option because it is
      rarely used]
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      0d9e08d3