1. 18 3月, 2020 40 次提交
    • N
      perf/smmuv3: Add arm64 smmuv3 pmu driver · 47a6fe19
      Neil Leeder 提交于
      commit 7d839b4b9e00645e49345d6ce5dfa8edf53c1a21 upstream
      
      Adds a new driver to support the SMMUv3 PMU and add it into the
      perf events framework.
      
      Each SMMU node may have multiple PMUs associated with it, each of
      which may support different events.
      
      SMMUv3 PMCG devices are named as smmuv3_pmcg_<phys_addr_page> where
      <phys_addr_page> is the physical page address of the SMMU PMCG
      wrapped to 4K boundary. For example, the PMCG at 0xff88840000 is
      named smmuv3_pmcg_ff88840
      
      Filtering by stream id is done by specifying filtering parameters
      with the event. options are:
         filter_enable    - 0 = no filtering, 1 = filtering enabled
         filter_span      - 0 = exact match, 1 = pattern match
         filter_stream_id - pattern to filter against
      
      Example: perf stat -e smmuv3_pmcg_ff88840/transaction,filter_enable=1,
                             filter_span=1,filter_stream_id=0x42/ -a netperf
      
      Applies filter pattern 0x42 to transaction events, which means events
      matching stream ids 0x42 & 0x43 are counted as only upper StreamID
      bits are required to match the given filter. Further filtering
      information is available in the SMMU documentation.
      
      SMMU events are not attributable to a CPU, so task mode and sampling
      are not supported.
      Signed-off-by: NNeil Leeder <nleeder@codeaurora.org>
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      [will: fold in review feedback from Robin]
      [will: rewrite Kconfig text and allow building as a module]
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      47a6fe19
    • N
      ACPI/IORT: Add support for PMCG · 9c6dfb51
      Neil Leeder 提交于
      commit 24e516049360eda85cf3fe9903221d43886c2689 upstream.
      
      Add support for the SMMU Performance Monitor Counter Group
      information from ACPI. This is in preparation for its use
      in the SMMUv3 PMU driver.
      Signed-off-by: NNeil Leeder <nleeder@codeaurora.org>
      Signed-off-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      9c6dfb51
    • G
      iommu/dma: Use NUMA aware memory allocations in __iommu_dma_alloc_pages() · 613ca475
      Ganapatrao Kulkarni 提交于
      commit c4b17afb0a4e8d042320efaf2acf55cb26795f78 upstream.
      
      Change function __iommu_dma_alloc_pages() to allocate pages for DMA from
      respective device NUMA node. The ternary operator which would be for
      alloc_pages_node() is tidied along with this.
      
      The motivation for this change is to have a policy for page allocation
      consistent with direct DMA mapping, which attempts to allocate pages local
      to the device, as mentioned in [1].
      
      In addition, for certain workloads it has been observed a marginal
      performance improvement. The patch caused an observation of 0.9% average
      throughput improvement for running tcrypt with HiSilicon crypto engine.
      
      We also include a modification to use kvzalloc() for kzalloc()/vzalloc()
      combination.
      
      [1] https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg1692998.htmlSigned-off-by: NGanapatrao Kulkarni <ganapatrao.kulkarni@cavium.com>
      [JPG: Added kvzalloc(), drop pages ** being device local, remove ternary operator, update message]
      Signed-off-by: NJohn Garry <john.garry@huawei.com>
      Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      613ca475
    • P
      mm/hotplug: make remove_memory() interface usable · f8502f80
      Pavel Tatashin 提交于
      commit eca499ab3749a4537dee77ffead47a1a2c0dee19 upstream
      
      Presently the remove_memory() interface is inherently broken.  It tries
      to remove memory but panics if some memory is not offline.  The problem
      is that it is impossible to ensure that all memory blocks are offline as
      this function also takes lock_device_hotplug that is required to change
      memory state via sysfs.
      
      So, between calling this function and offlining all memory blocks there
      is always a window when lock_device_hotplug is released, and therefore,
      there is always a chance for a panic during this window.
      
      Make this interface to return an error if memory removal fails.  This
      way it is safe to call this function without panicking machine, and also
      makes it symmetric to add_memory() which already returns an error.
      
      Link: http://lkml.kernel.org/r/20190517215438.6487-3-pasha.tatashin@soleen.comSigned-off-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      f8502f80
    • D
      mm/memory_hotplug: make remove_memory() take the device_hotplug_lock · d2097173
      David Hildenbrand 提交于
      commit d15e59260f62bd5e0f625cf5f5240f6ffac78ab6 upstream
      
      Patch series "mm: online/offline_pages called w.o. mem_hotplug_lock", v3.
      
      Reading through the code and studying how mem_hotplug_lock is to be used,
      I noticed that there are two places where we can end up calling
      device_online()/device_offline() - online_pages()/offline_pages() without
      the mem_hotplug_lock.  And there are other places where we call
      device_online()/device_offline() without the device_hotplug_lock.
      
      While e.g.
      	echo "online" > /sys/devices/system/memory/memory9/state
      is fine, e.g.
      	echo 1 > /sys/devices/system/memory/memory9/online
      Will not take the mem_hotplug_lock. However the device_lock() and
      device_hotplug_lock.
      
      E.g.  via memory_probe_store(), we can end up calling
      add_memory()->online_pages() without the device_hotplug_lock.  So we can
      have concurrent callers in online_pages().  We e.g.  touch in
      online_pages() basically unprotected zone->present_pages then.
      
      Looks like there is a longer history to that (see Patch #2 for details),
      and fixing it to work the way it was intended is not really possible.  We
      would e.g.  have to take the mem_hotplug_lock in device/base/core.c, which
      sounds wrong.
      
      Summary: We had a lock inversion on mem_hotplug_lock and device_lock().
      More details can be found in patch 3 and patch 6.
      
      I propose the general rules (documentation added in patch 6):
      
      1. add_memory/add_memory_resource() must only be called with
         device_hotplug_lock.
      2. remove_memory() must only be called with device_hotplug_lock. This is
         already documented and holds for all callers.
      3. device_online()/device_offline() must only be called with
         device_hotplug_lock. This is already documented and true for now in core
         code. Other callers (related to memory hotplug) have to be fixed up.
      4. mem_hotplug_lock is taken inside of add_memory/remove_memory/
         online_pages/offline_pages.
      
      To me, this looks way cleaner than what we have right now (and easier to
      verify).  And looking at the documentation of remove_memory, using
      lock_device_hotplug also for add_memory() feels natural.
      
      This patch (of 6):
      
      remove_memory() is exported right now but requires the
      device_hotplug_lock, which is not exported.  So let's provide a variant
      that takes the lock and only export that one.
      
      The lock is already held in
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      	arch/powerpc/platforms/powernv/memtrace.c
      
      Apart from that, there are not other users in the tree.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-2-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Nyinhe <yinhe@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d2097173
    • X
      alinux: mm: kidled: fix frame-larger-than build warning · 25f9e572
      Xu Yu 提交于
      This fix the following build warning:
      
      mm/memcontrol.c: In function 'mem_cgroup_idle_page_stats_show':
      mm/memcontrol.c:3866:1: warning: the frame size of 2160 bytes is larger than 2048 bytes [-Wframe-larger-than=]
      
      The root cause is that "mem_cgroup_idle_page_stats_show" has two
      "struct idle_page_stats" variables, each of which is 1056 bytes in
      size, on the stack, thus exceeding the 2048 max frame size.
      
      This fix the build warning by dynamically allocating memory to these two
      variables with kmalloc.
      
      Fixes: a29243e2 ("alinux: mm: Support kidled")
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      25f9e572
    • A
      mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections · f38de7b3
      Alexander Duyck 提交于
      commit 0e56acae4b4dd4a9fbe897854ab83a109e2a9e11 upstream.
      
      Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
      use it to support initializing and freeing pages in groups no larger than
      MAX_ORDER_NR_PAGES.  By doing this we can greatly improve the cache
      locality of the pages while we do several loops over them in the init and
      freeing process.
      
      We are able to tighten the loops further as a result of the "from"
      iterator as we can perform the initial checks for first_init_pfn in our
      first call to the iterator, and continue without the need for those checks
      via the "from" iterator.  I have added this functionality in the function
      called deferred_init_mem_pfn_range_in_zone that primes the iterator and
      causes us to exit if we encounter any failure.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 1.85s to 1.38s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f38de7b3
    • A
      mm: implement new zone specific memblock iterator · ad97e5e4
      Alexander Duyck 提交于
      commit 837566e7e08e3f89444166444836a8a49b9f9322 upstream.
      
      Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
      
      This iterator will take care of making sure a given memory range provided
      is in fact contained within a zone.  It takes are of all the bounds
      checking we were doing in deferred_grow_zone, and deferred_init_memmap.
      In addition it should help to speed up the search a bit by iterating until
      the end of a range is greater than the start of the zone pfn range, and
      will exit completely if the start is beyond the end of the zone.
      
      Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      ad97e5e4
    • A
      mm: drop meminit_pfn_in_nid as it is redundant · b065ceca
      Alexander Duyck 提交于
      commit 56ec43d8b02719402c9fcf984feb52ec2300f8a5 upstream.
      
      As best as I can tell the meminit_pfn_in_nid call is completely redundant.
      The deferred memory initialization is already making use of
      for_each_free_mem_range which in turn will call into __next_mem_range
      which will only return a memory range if it matches the node ID provided
      assuming it is not NUMA_NO_NODE.
      
      I am operating on the assumption that there are no zones or pgdata_t
      structures that have a NUMA node of NUMA_NO_NODE associated with them.  If
      that is the case then __next_mem_range will never return a memory range
      that doesn't match the zone's node ID and as such the check is redundant.
      
      So one piece I would like to verify on this is if this works for ia64.
      Technically it was using a different approach to get the node ID, but it
      seems to have the node ID also encoded into the memblock.  So I am
      assuming this is okay, but would like to get confirmation on that.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 2.80s to 1.85s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b065ceca
    • A
      mm: use mm_zero_struct_page from SPARC on all 64b architectures · e23b0cb5
      Alexander Duyck 提交于
      commit 5470dea49f5382257c242ac617d908267727f1a8 upstream.
      
      Patch series "Deferred page init improvements", v7.
      
      This patchset is essentially a refactor of the page initialization logic
      that is meant to provide for better code reuse while providing a
      significant improvement in deferred page initialization performance.
      
      In my testing on an x86_64 system with 384GB of RAM I have seen the
      following.  In the case of regular memory initialization the deferred init
      time was decreased from 3.75s to 1.38s on average.  This amounts to a 172%
      improvement for the deferred memory initialization performance.
      
      I have called out the improvement observed with each patch.
      
      This patch (of 4):
      
      Use the same approach that was already in use on Sparc on all the
      architectures that support a 64b long.
      
      This is mostly motivated by the fact that 7 to 10 store/move instructions
      are likely always going to be faster than having to call into a function
      that is not specialized for handling page init.
      
      An added advantage to doing it this way is that the compiler can get away
      with combining writes in the __init_single_page call.  As a result the
      memset call will be reduced to only about 4 write operations, or at least
      that is what I am seeing with GCC 6.2 as the flags, LRU pointers, and
      count/mapcount seem to be cancelling out at least 4 of the 8 assignments
      on my system.
      
      One change I had to make to the function was to reduce the minimum page
      size to 56 to support some powerpc64 configurations.
      
      This change should introduce no change on SPARC since it already had this
      code.  In the case of x86_64 I saw a reduction from 3.75s to 2.80s when
      initializing 384GB of RAM per node.  Pavel Tatashin tested on a system
      with Broadcom's Stingray CPU and 48GB of RAM and found that
      __init_single_page() takes 19.30ns / 64-byte struct page before this patch
      and with this patch it takes 17.33ns / 64-byte struct page.  Mike Rapoport
      ran a similar test on a OpenPower (S812LC 8348-21C) with Power8 processor
      and 128GB or RAM.  His results per 64-byte struct page were 4.68ns before,
      and 4.59ns after this patch.
      
      Link: http://lkml.kernel.org/r/20190405221213.12227.9392.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      e23b0cb5
    • C
      nvme-mpath: remove I/O polling support · 5501fe75
      Christoph Hellwig 提交于
      commit 9d6610b76fa374eae3deb93bcbace4a06c2e3b95 upstream.
      
      The ->poll_fn has been stale for a while, as a lot of places check for mq
      ops.  But there is no real point in it anyway, as we don't even use
      the multipath code for subsystems without multiple ports, which is usually
      what we do high performance I/O to.  If it really becomes an issue we
      should rework the nvme code to also skip the multipath code for any
      private namespace, even if that could mean some trouble when rescanning.
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NShile Zhang <shile.zhang@linux.alibaba-inc.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5501fe75
    • D
      amd-gpu: Don't undefine READ and WRITE · fb7ad259
      David Howells 提交于
      commit 1fcb748d187d0c7732a75a509e924ead6d070e04 upstream.
      
      Remove the undefinition of READ and WRITE because these constants may be
      used elsewhere in subsequently included header files, thus breaking them.
      
      These constants don't actually appear to be used in the driver, so the
      undefinition seems pointless.
      
      Fixes: 4562236b ("drm/amd/dc: Add dc display driver (v2)")
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      fb7ad259
    • M
      blk-mq: grab .q_usage_counter when queuing request from plug code path · 4a55f77f
      Ming Lei 提交于
      commit e87eb301bee183d82bb3d04bd71b6660889a2588 upstream.
      
      Just like aio/io_uring, we need to grab 2 refcount for queuing one
      request, one is for submission, another is for completion.
      
      If the request isn't queued from plug code path, the refcount grabbed
      in generic_make_request() serves for submission. In theroy, this
      refcount should have been released after the sumission(async run queue)
      is done. blk_freeze_queue() works with blk_sync_queue() together
      for avoiding race between cleanup queue and IO submission, given async
      run queue activities are canceled because hctx->run_work is scheduled with
      the refcount held, so it is fine to not hold the refcount when
      running the run queue work function for dispatch IO.
      
      However, if request is staggered into plug list, and finally queued
      from plug code path, the refcount in submission side is actually missed.
      And we may start to run queue after queue is removed because the queue's
      kobject refcount isn't guaranteed to be grabbed in flushing plug list
      context, then kernel oops is triggered, see the following race:
      
      blk_mq_flush_plug_list():
              blk_mq_sched_insert_requests()
                      insert requests to sw queue or scheduler queue
                      blk_mq_run_hw_queue
      
      Because of concurrent run queue, all requests inserted above may be
      completed before calling the above blk_mq_run_hw_queue. Then queue can
      be freed during the above blk_mq_run_hw_queue().
      
      Fixes the issue by grab .q_usage_counter before calling
      blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
      safe because the queue is absolutely alive before inserting request.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-scsi@vger.kernel.org,
      Cc: Martin K . Petersen <martin.petersen@oracle.com>,
      Cc: Christoph Hellwig <hch@lst.de>,
      Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Tested-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      [Joseph: use the passing 'q' directly]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4a55f77f
    • K
      block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y · 2a7f90c1
      Konstantin Khlebnikov 提交于
      commit 42b1bd33dcdef4ffd98f695e188bab82f9fa46d8 upstream.
      
      Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
      Code under these ifdefs never worked, something might be broken.
      
      Fixes: 0471559c ("block, bfq: add/remove entity weights correctly")
      Reviewed-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2a7f90c1
    • J
      block: remove bogus check for queue_lock assignment · 7a086776
      Jens Axboe 提交于
      commit 5e27891e88555fecd8262e110e1a29feca4b0166 upstream.
      
      We just allocated the queue and haven't even set it up yet,
      hence we know that checking if ->mq_ops is NULL is always
      going to be true.
      
      In fact we do need to assign a lock to ->queue_lock always,
      as we need it for the queue flags modifications.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7a086776
    • M
      block: don't use bio->bi_vcnt to figure out segment number · 7eb1d529
      Ming Lei 提交于
      commit 1a67356e9a4829da2935dd338630a550c59c8489 upstream.
      
      It is wrong to use bio->bi_vcnt to figure out how many segments
      there are in the bio even though CLONED flag isn't set on this bio,
      because this bio may be splitted or advanced.
      
      So always use bio_segments() in blk_recount_segments(), and it shouldn't
      cause any performance loss now because the physical segment number is figured
      out in blk_queue_split() and BIO_SEG_VALID is set meantime since
      bdced438 ("block: setup bi_phys_segments after splitting").
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Fixes: 76d8137a ("blk-merge: recaculate segment if it isn't less than max segments")
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      7eb1d529
    • Z
      scsi: core: Run queue when state is set to running after being blocked · 79d0c296
      zhengbin 提交于
      commit 70fc085c5015c54a7b8742a45fc9ab05d6da90da upstream.
      
      Use dd to test a SCSI device:
      
        1. echo "blocked" >/sys/block/sda/device/state
        2. dd if=/dev/sda of=/mnt/t.log bs=1M count=10
        3. echo "running" >/sys/block/sda/device/state
      
      dd should finish this work after step 3, but it hangs.
      
      After step2, the call chain is this:
      
      blk_mq_dispatch_rq_list-->scsi_queue_rq-->prep_to_mq
      
      prep_to_mq will return BLK_STS_RESOURCE, and scsi_queue_rq will
      transition it to BLK_STS_DEV_RESOURCE which means that driver can
      guarantee that IO dispatch will be triggered in future when the
      resource is available.  Need to follow the rule if we set the device
      state to running.
      
      [mkp: tweaked commit description and code comment as suggested by Bart]
      Signed-off-by: Nzhengbin <zhengbin13@huawei.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      79d0c296
    • Z
      block: fix NULL pointer dereference in register_disk · 9d70cdf3
      zhengbin 提交于
      commit 4d7c1d3fd7c7eda7dea351f071945e843a46c145 upstream.
      
      If __device_add_disk-->bdi_register_owner-->bdi_register-->
      bdi_register_va-->device_create_vargs fails, bdi->dev is still
      NULL, __device_add_disk-->register_disk will visit bdi->dev->kobj.
      This patch fixes that.
      Signed-off-by: Nzhengbin <zhengbin13@huawei.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9d70cdf3
    • D
      blk-mq: Add a NULL check in blk_mq_free_map_and_requests() · 5f0efd1f
      Dan Carpenter 提交于
      commit 4e6db0f21c99c25980c8d183f95cdb6ad64cebd2 upstream.
      
      I recently found some code which called blk_mq_free_map_and_requests()
      with a NULL set->tags pointer.  I fixed the caller, but it seems like a
      good idea to add a NULL check here as well.  Now we can call:
      
      	blk_mq_free_tag_set(set);
      	blk_mq_free_tag_set(set);
      
      twice in a row and it's harmless.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      5f0efd1f
    • X
      blk-mq: place trace_block_getrq() in correct place · d23f53bf
      Xiaoguang Wang 提交于
      commit d6f1dda27251909a27b8d8aacb498628a1047978 upstream.
      
      trace_block_getrq() is to indicate a request struct has been allocated
      for queue, so put it in right place.
      Reviewed-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      d23f53bf
    • G
      blk-mq: protect debugfs_create_files() from failures · 9ec13b19
      Greg Kroah-Hartman 提交于
      commit 36991ca68db9dd43bac7f3519f080ee3939263ef upstream.
      
      If debugfs were to return a non-NULL error for a debugfs call, using
      that pointer later in debugfs_create_files() would crash.
      
      Fix that by properly checking the pointer before referencing it.
      Reported-by: NMichal Hocko <mhocko@kernel.org>
      Reported-and-tested-by: syzbot+b382ba6a802a3d242790@syzkaller.appspotmail.com
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9ec13b19
    • M
      blk-mq: not embed .mq_kobj and ctx->kobj into queue instance · 9ff28240
      Ming Lei 提交于
      commit 1db4909e76f64a85f4aaa187f0f683f5c85a471d upstream.
      
      Even though .mq_kobj, ctx->kobj and q->kobj share same lifetime
      from block layer's view, actually they don't because userspace may
      grab one kobject anytime via sysfs.
      
      This patch fixes the issue by the following approach:
      
      1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing
      all ctxs
      
      2) free all allocated ctxs and the 'blk_mq_ctxs' instance in release
      handler of .mq_kobj
      
      3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that
      .mq_kobj is always released after all ctxs are freed.
      
      This patch fixes kernel panic issue during booting when DEBUG_KOBJECT_RELEASE
      is enabled.
      Reported-by: NGuenter Roeck <linux@roeck-us.net>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Tested-by: NGuenter Roeck <linux@roeck-us.net>
      Reviewed-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9ff28240
    • J
      blk-mq: fallback to previous nr_hw_queues when updating fails · 4f3484ac
      Jianchao Wang 提交于
      commit e01ad46d53b59720c6ae69963ee1756506954c85 upstream.
      
      When we try to increate the nr_hw_queues, we may fail due to
      shortage of memory or other reason, then blk_mq_realloc_hw_ctxs stops
      and some entries in q->queue_hw_ctx are left with NULL. However,
      because queue map has been updated with new nr_hw_queues, some cpus
      have been mapped to hw queue which just encounters allocation failure,
      thus blk_mq_map_queue could return NULL. This will cause panic in
      following blk_mq_map_swqueue.
      
      To fix it, when increase nr_hw_queues fails, fallback to previous
      nr_hw_queues and post warning. At the same time, driver's .map_queues
      usually use completion irq affinity to map hw and cpu, fallback
      nr_hw_queues will cause lack of some cpu's map to hw, so use default
      blk_mq_map_queues to do that.
      
      Reported-by: syzbot+83e8cbe702263932d9d4@syzkaller.appspotmail.com
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      4f3484ac
    • J
      blk-mq: realloc hctx when hw queue is mapped to another node · 0f44f194
      Jianchao Wang 提交于
      commit 34d11ffac1f56c3895dad32153abd6814452dc77 upstream.
      
      When the hw queues and mq_map are updated, a hctx could be mapped
      to a different numa node. At this moment, we need to realloc the
      hctx. If fail to do that, go on using previous hctx.
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      0f44f194
    • J
      blk-mq: adjust debugfs and sysfs register when updating nr_hw_queues · 1d5199ea
      Jianchao Wang 提交于
      commit 477e19dedc9d3e1f4443a1d4ae00572a988120ea upstream.
      
      blk-mq debugfs and sysfs entries need to be removed before updating
      queue map, otherwise, we get get wrong result there. This patch fixes
      it and remove the redundant debugfs and sysfs register/unregister
      operations during __blk_mq_update_nr_hw_queues.
      Signed-off-by: NJianchao Wang <jianchao.w.wang@oracle.com>
      Reviewed-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      1d5199ea
    • Q
      mm/memblock.c: skip kmemleak for kasan_init() · 9f093569
      Qian Cai 提交于
      commit fed84c78527009d4f799a3ed9a566502fa026d82 upstream.
      
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
      log buffer went from something like 280 to 260000 which caused kmemleak
      disabled and crash dump memory reservation failed.  The multitude of
      kmemleak_alloc() calls is from nested loops while KASAN is setting up full
      memory mappings, so let early kmemleak allocations skip those
      memblock_alloc_internal() calls came from kasan_init() given that those
      early KASAN memory mappings should not reference to other memory.  Hence,
      no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.usSigned-off-by: NQian Cai <cai@gmx.us>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      9f093569
    • X
      alinux: jbd2: track slow handle which is preventing transaction committing · 83cd9d23
      Xiaoguang Wang 提交于
      While transaction is going to commit, it first sets its state to be
      T_LOCKED and waits all outstanding handles to complete, and the
      committing transaction will always be in locked state so long as it
      has outstanding handles, also the whole fs will be locked and all later
      fs modification operations will be stucked in wait_transaction_locked().
      
      It's hard to tell why handles are that slow, so here we add a new staic
      tracepoint to track such slow handle, and show io wait time and sched
      wait time, output likes below:
        fsstress-20347 [024] ....  1570.305454: jbd2_slow_handle_stats: dev 254,17
      tid 15853 type 4 line_no 3101 interval 126 sync 0 requested_blocks 24
      dirtied_blocks 0 trans_wait 122 space_wait 0 sched_wait 0 io_wait 126
      
      "trans_wait 122" means that this current committing transaction has been
      locked for 122ms, due to this handle is not completed quickly.
      
      From "io_wait 126", we can see that io is the major reason.
      
      In this patch, we also add a per fs control file used to determine
      whether a handle can be considered to be slow.
          /proc/fs/jbd2/vdb1-8/stall_thresh
      default value is 100ms, users can set new threshold by echoing new value
      to this file.
      
      Later I also plan to add a proc file fs per fs to record these info.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      83cd9d23
    • X
      alinux: fs: record page or bio info while process is waitting on it · 2864376f
      Xiaoguang Wang 提交于
      If one process context is stucked in wait_on_buffer(), lock_buffer(),
      lock_page() and wait_on_page_writeback() and wait_on_bit_io(), it's
      hard to tell ture reason, for example, whether this page is under io,
      or this page is just locked too long by other process context.
      
      Normally io request has multiple bios, and every bio contains multiple
      pages which will hold data to be read from or written to device, so here
      we record page info or bio info in task_struct while process calls
      lock_page(), lock_buffer(), wait_on_page_writeback(), wait_on_buffer()
      and wait_on_bit_io(), we add a new proce interface:
      [lege@localhost linux]$ cat /proc/4516/wait_res
      1 ffffd0969f95d3c0 4295369599 4295381596
      
      Above info means that thread 4516 is waitting on a page, address is
      ffffd0969f95d3c0, and has waited for 11997ms.
      
      First field denotes the page address process is waitting on.
      Second field denotes the wait moment and the third denotes current moment.
      
      In practice, if we found a process waitting on one page for too long time,
      we can get page's address by reading /proc/$pid/wait_page, and search this
      page address in all block devices' /sys/kernel/debug/block/${devname}/rq_hang,
      if search operation hits one, we can get the request and know why this io
      request hangs that long.
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      2864376f
    • X
      alinux: blk: add iohang check function · 80d6ee24
      Xiaoguang Wang 提交于
      Background:
        We do not have a dependable block layer interface to determine whether
      block device has io requests which have not been completed for somewhat
      long time. Currently we have 'in_flight' interface, it counts the number
      of I/O requests that have been issued to the device driver but have
      not yet completed, and it does not include I/O requests that are in the
      queue but not yet issued to the device driver, which means it will not
      count io requests that have been stucked in block layer.
        Also say that there are steady io requests issued to device driver,
      'in_flight' maybe always non-zero, but you could not determine whether
      there is one io request which has not been completed for too long.
      
      Solution:
        To find io requests which have not been completed for too long, here
      add 3 new inferfaces:
        /sys/block/vdb/queue/hang_threshold
      If one io request's running time has been greater than this value, count
      this io as hang.
      
        /sys/block/vdb/hang
      Show read/write io requests' hang counter.
      
        /sys/kernel/debug/block/vdb/rq_hang
      Show all hang io requests's detailed info, like below:
        ffff97db96301200 {.op=WRITE, .cmd_flags=SYNC, .rq_flags=STARTED|
      ELVPRIV|IO_STAT|STATS, .state=in_flight, .tag=30, .internal_tag=169,
      .start_time_ns=140634088407, .io_start_time_ns=140634102958,
      .current_time=146497371953, .bio = ffff97db91e8e000,
      .bio_pages = { ffffd096a0602540 }, .bio = ffff97db91e8ec00,
      .bio_pages = { ffffd096a070eec0 }, .bio = ffff97db91e8f600,
      .bio_pages = { ffffd096a0424cc0 }, .bio = ffff97db91e8f300,
      .bio_pages = { ffffd096a0600a80 }}
      
      With above info, we can easily see this request's latency distribution,
      and see next patch for bio_pages's usage.
      
      Note, /sys/kernel/debug/block/vdb/rq_hang only exists in blk-mq device driver
      and needs CONFIG_BLK_DEBUG_FS enabled.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      80d6ee24
    • S
      tpm: tpm_tis_spi: Introduce a flow control callback · 4cf2f5de
      Stephen Boyd 提交于
      commit 8ab5e82afa969b65b286d8949c12d2a64c83960c upstream.
      
      Cr50 firmware has a different flow control protocol than the one used by
      this TPM PTP SPI driver. Introduce a flow control callback so we can
      override the standard sequence with the custom one that Cr50 uses.
      
      Cc: Andrey Pronin <apronin@chromium.org>
      Cc: Duncan Laurie <dlaurie@chromium.org>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guenter Roeck <groeck@chromium.org>
      Cc: Alexander Steffen <Alexander.Steffen@infineon.com>
      Cc: Heiko Stuebner <heiko@sntech.de>
      Signed-off-by: NStephen Boyd <swboyd@chromium.org>
      Tested-by: NHeiko Stuebner <heiko@sntech.de>
      Reviewed-by: NHeiko Stuebner <heiko@sntech.de>
      Reviewed-by: NJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: NJarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
      Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
      Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>
      4cf2f5de
    • T
      tcp: Add snd_wnd to TCP_INFO · ecee8235
      Thomas Higdon 提交于
      commit 8f7baad7f03543451af27f5380fc816b008aa1f2 upstream
      
      Neal Cardwell mentioned that snd_wnd would be useful for diagnosing TCP
      performance problems --
      > (1) Usually when we're diagnosing TCP performance problems, we do so
      > from the sender, since the sender makes most of the
      > performance-critical decisions (cwnd, pacing, TSO size, TSQ, etc).
      > From the sender-side the thing that would be most useful is to see
      > tp->snd_wnd, the receive window that the receiver has advertised to
      > the sender.
      
      This serves the purpose of adding an additional __u32 to avoid the
      would-be hole caused by the addition of the tcpi_rcvi_ooopack field.
      Signed-off-by: NThomas Higdon <tph@fb.com>
      Acked-by: NYuchung Cheng <ycheng@google.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Acked-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      ecee8235
    • T
      tcp: Add TCP_INFO counter for packets received out-of-order · 0107d737
      Thomas Higdon 提交于
      commit f9af2dbbfe01def62765a58af7fbc488351893c3 upstream
      
      For receive-heavy cases on the server-side, we want to track the
      connection quality for individual client IPs. This counter, similar to
      the existing system-wide TCPOFOQueue counter in /proc/net/netstat,
      tracks out-of-order packet reception. By providing this counter in
      TCP_INFO, it will allow understanding to what degree receive-heavy
      sockets are experiencing out-of-order delivery and packet drops
      indicating congestion.
      
      Please note that this is similar to the counter in NetBSD TCP_INFO, and
      has the same name.
      
      Also note that we avoid increasing the size of the tcp_sock struct by
      taking advantage of a hole.
      Signed-off-by: NThomas Higdon <tph@fb.com>
      Acked-by: NNeal Cardwell <ncardwell@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      Signed-off-by: NTony Lu <tonylu@linux.alibaba.com>
      Acked-by: NDust Li <dust.li@linux.alibaba.com>
      0107d737
    • X
      alinux: mm,memcg: export memory.{min,low} to cgroup v1 · 3e655c51
      Xu Yu 提交于
      Export "memory.min" and "memory.low" from cgroup v2 to v1.
      
      There is a subtle difference between v1 and v2, i.e. no task is allowed
      in intermediate memcgs under v2 hierarchy and this can make a different
      behaviour for them, it requires all the intermediate nodes having the
      memory.min|low set, we must keep this in mind when using this feature
      under v1.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      3e655c51
    • X
      alinux: mm,memcg: export memory.{events,events.local} to v1 · 94285517
      Xu Yu 提交于
      Export "memory.events" and "memory.events.local" from cgroup v2 to v1.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      94285517
    • R
      mm: don't raise MEMCG_OOM event due to failed high-order allocation · 5213c279
      Roman Gushchin 提交于
      commit 7a1adfddaf0d11a39fdcaf6e82a88e9c0586e08b upstream.
      
      It was reported that on some of our machines containers were restarted
      with OOM symptoms without an obvious reason.  Despite there were almost no
      memory pressure and plenty of page cache, MEMCG_OOM event was raised
      occasionally, causing the container management software to think, that OOM
      has happened.  However, no tasks have been killed.
      
      The following investigation showed that the problem is caused by a failing
      attempt to charge a high-order page.  In such case, the OOM killer is
      never invoked.  As shown below, it can happen under conditions, which are
      very far from a real OOM: e.g.  there is plenty of clean page cache and no
      memory pressure.
      
      There is no sense in raising an OOM event in this case, as it might
      confuse a user and lead to wrong and excessive actions (e.g.  restart the
      workload, as in my case).
      
      Let's look at the charging path in try_charge().  If the memory usage is
      about memory.max, which is absolutely natural for most memory cgroups, we
      try to reclaim some pages.  Even if we were able to reclaim enough memory
      for the allocation, the following check can fail due to a race with
      another concurrent allocation:
      
          if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
              goto retry;
      
      For regular pages the following condition will save us from triggering
      the OOM:
      
         if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
             goto retry;
      
      But for high-order allocation this condition will intentionally fail.  The
      reason behind is that we'll likely fall to regular pages anyway, so it's
      ok and even preferred to return ENOMEM.
      
      In this case the idea of raising MEMCG_OOM looks dubious.
      
      Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
      order check, so that the event won't be raised for high order allocations.
      This change doesn't affect regular pages allocation and charging.
      
      Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      5213c279
    • S
      mm, memcg: introduce memory.events.local · de7ca746
      Shakeel Butt 提交于
      commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.
      
      The memory controller in cgroup v2 exposes memory.events file for each
      memcg which shows the number of times events like low, high, max, oom
      and oom_kill have happened for the whole tree rooted at that memcg.
      Users can also poll or register notification to monitor the changes in
      that file.  Any event at any level of the tree rooted at memcg will
      notify all the listeners along the path till root_mem_cgroup.  There are
      existing users which depend on this behavior.
      
      However there are users which are only interested in the events
      happening at a specific level of the memcg tree and not in the events in
      the underlying tree rooted at that memcg.  One such use-case is a
      centralized resource monitor which can dynamically adjust the limits of
      the jobs running on a system.  The jobs can create their sub-hierarchy
      for their own sub-tasks.  The centralized monitor is only interested in
      the events at the top level memcgs of the jobs as it can then act and
      adjust the limits of the jobs.  Using the current memory.events for such
      centralized monitor is very inconvenient.  The monitor will keep
      receiving events which it is not interested and to find if the received
      event is interesting, it has to read memory.event files of the next
      level and compare it with the top level one.  So, let's introduce
      memory.events.local to the memcg which shows and notify for the events
      at the memcg level.
      
      Now, does memory.stat and memory.pressure need their local versions.  IMHO
      no due to the no internal process contraint of the cgroup v2.  The
      memory.stat file of the top level memcg of a job shows the stats and
      vmevents of the whole tree.  The local stats or vmevents of the top level
      memcg will only change if there is a process running in that memcg but v2
      does not allow that.  Similarly for memory.pressure there will not be any
      process in the internal nodes and thus no chance of local pressure.
      
      Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Chris Down <chris@chrisdown.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      de7ca746
    • C
      mm, memcg: consider subtrees in memory.events · 0a198bb9
      Chris Down 提交于
      commit 9852ae3fe5293264f01c49f2571ef7688f7823ce upstream.
      
      memory.stat and other files already consider subtrees in their output, and
      we should too in order to not present an inconsistent interface.
      
      The current situation is fairly confusing, because people interacting with
      cgroups expect hierarchical behaviour in the vein of memory.stat,
      cgroup.events, and other files.  For example, this causes confusion when
      debugging reclaim events under low, as currently these always read "0" at
      non-leaf memcg nodes, which frequently causes people to misdiagnose breach
      behaviour.  The same confusion applies to other counters in this file when
      debugging issues.
      
      Aggregation is done at write time instead of at read-time since these
      counters aren't hot (unlike memory.stat which is per-page, so it does it
      at read time), and it makes sense to bundle this with the file
      notifications.
      
      After this patch, events are propagated up the hierarchy:
      
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 0
          oom 0
          oom_kill 0
          [root@ktst ~]# systemd-run -p MemoryMax=1 true
          Running as unit: run-r251162a189fb4562b9dabfdc9b0422f5.service
          [root@ktst ~]# cat /sys/fs/cgroup/system.slice/memory.events
          low 0
          high 0
          max 7
          oom 1
          oom_kill 1
      
      As this is a change in behaviour, this can be reverted to the old
      behaviour by mounting with the `memory_localevents' flag set.  However, we
      use the new behaviour by default as there's a lack of evidence that there
      are any current users of memory.events that would find this change
      undesirable.
      
      akpm: this is a behaviour change, so Cc:stable.  THis is so that
      forthcoming distros which use cgroup v2 are more likely to pick up the
      revised behaviour.
      
      [xuyu: remove the new memory_localevents mount option because it is
      rarely used]
      
      Link: http://lkml.kernel.org/r/20190208224419.GA24772@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      0a198bb9
    • X
      alinux: mm,memcg: export memory.high to v1 · 7d36553b
      Xu Yu 提交于
      Export "memory.high" from cgroup v2 to v1 which can be used to archive
      some memory QoS.
      
      This is also a way of migrating to v2 gradually.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      7d36553b
    • J
      arm64: mm: add missing PTE_SPECIAL in pte_mkdevmap on arm64 · f72a099b
      Jia He 提交于
      commit 30e235389faadb9e3d918887b1f126155d7d761d upstream.
      
      Without this patch, the MAP_SYNC test case will cause a print_bad_pte
      warning on arm64 as follows:
      
      [   25.542693] BUG: Bad page map in process mapdax333 pte:2e8000448800f53 pmd:41ff5f003
      [   25.546360] page:ffff7e0010220000 refcount:1 mapcount:-1 mapping:ffff8003e29c7440 index:0x0
      [   25.550281] ext4_dax_aops
      [   25.550282] name:"__aaabbbcccddd__"
      [   25.551553] flags: 0x3ffff0000001002(referenced|reserved)
      [   25.555802] raw: 03ffff0000001002 ffff8003dfffa908 0000000000000000 ffff8003e29c7440
      [   25.559446] raw: 0000000000000000 0000000000000000 00000001fffffffe 0000000000000000
      [   25.563075] page dumped because: bad pte
      [   25.564938] addr:0000ffffbe05b000 vm_flags:208000fb anon_vma:0000000000000000 mapping:ffff8003e29c7440 index:0
      [   25.574272] file:__aaabbbcccddd__ fault:ext4_dax_fault mmmmap:ext4_file_mmap readpage:0x0
      [   25.578799] CPU: 1 PID: 1180 Comm: mapdax333 Not tainted 5.2.0+ #21
      [   25.581702] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
      [   25.585624] Call trace:
      [   25.587008]  dump_backtrace+0x0/0x178
      [   25.588799]  show_stack+0x24/0x30
      [   25.590328]  dump_stack+0xa8/0xcc
      [   25.591901]  print_bad_pte+0x18c/0x218
      [   25.593628]  unmap_page_range+0x778/0xc00
      [   25.595506]  unmap_single_vma+0x94/0xe8
      [   25.597304]  unmap_vmas+0x90/0x108
      [   25.598901]  unmap_region+0xc0/0x128
      [   25.600566]  __do_munmap+0x284/0x3f0
      [   25.602245]  __vm_munmap+0x78/0xe0
      [   25.603820]  __arm64_sys_munmap+0x34/0x48
      [   25.605709]  el0_svc_common.constprop.0+0x78/0x168
      [   25.607956]  el0_svc_handler+0x34/0x90
      [   25.609698]  el0_svc+0x8/0xc
      [...]
      
      The root cause is in _vm_normal_page, without the PTE_SPECIAL bit,
      the return value will be incorrectly set to pfn_to_page(pfn) instead
      of NULL. Besides, this patch also rewrite the pmd_mkdevmap to avoid
      setting PTE_SPECIAL for pmd
      
      The MAP_SYNC test case is as follows(Provided by Yibo Cai)
      $#include <stdio.h>
      $#include <string.h>
      $#include <unistd.h>
      $#include <sys/file.h>
      $#include <sys/mman.h>
      
      $#ifndef MAP_SYNC
      $#define MAP_SYNC 0x80000
      $#endif
      
      /* mount -o dax /dev/pmem0 /mnt */
      $#define F "/mnt/__aaabbbcccddd__"
      
      int main(void)
      {
          int fd;
          char buf[4096];
          void *addr;
      
          if ((fd = open(F, O_CREAT|O_TRUNC|O_RDWR, 0644)) < 0) {
              perror("open1");
              return 1;
          }
      
          if (write(fd, buf, 4096) != 4096) {
              perror("lseek");
              return 1;
          }
      
          addr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_SYNC, fd, 0);
          if (addr == MAP_FAILED) {
              perror("mmap");
              printf("did you mount with '-o dax'?\n");
              return 1;
          }
      
          memset(addr, 0x55, 4096);
      
          if (munmap(addr, 4096) == -1) {
              perror("munmap");
              return 1;
          }
      
          close(fd);
      
          return 0;
      }
      
      Fixes: 73b20c84d42d ("arm64: mm: implement pte_devmap support")
      Reported-by: NYibo Cai <Yibo.Cai@arm.com>
      Acked-by: NWill Deacon <will@kernel.org>
      Acked-by: NRobin Murphy <Robin.Murphy@arm.com>
      Signed-off-by: NJia He <justin.he@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      f72a099b
    • R
      arm64: mm: implement pte_devmap support · 1820ca63
      Robin Murphy 提交于
      commit 73b20c84d42de14673a987816dd4d132c7b1f801 upstream.
      
      In order for things like get_user_pages() to work on ZONE_DEVICE memory,
      we need a software PTE bit to identify device-backed PFNs.  Hook this up
      along with the relevant helpers to join in with ARCH_HAS_PTE_DEVMAP.
      
      [robin.murphy@arm.com: build fixes]
        Link: http://lkml.kernel.org/r/13026c4e64abc17133bbfa07d7731ec6691c0bcd.1559050949.git.robin.murphy@arm.com
      Link: http://lkml.kernel.org/r/817d92886fc3b33bcbf6e105ee83a74babb3a5aa.1558547956.git.robin.murphy@arm.comSigned-off-by: NRobin Murphy <robin.murphy@arm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oliver O'Halloran <oohall@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShannon Zhao <shannon.zhao@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      1820ca63