1. 02 Sep, 2020 31 commits
    • D
      mm/memory_hotplug: set node_start_pfn of hotadded pgdat to 0 · 64f827bf
      David Hildenbrand committed
      task #29077503
      commit c68ab18c6aee0397574afb418f6775f23379198e upstream
      Patch series "mm/memory_hotplug: handle memblocks only with
      CONFIG_ARCH_KEEP_MEMBLOCK", v1.
      
      A hotadded node/pgdat will span no pages at all, until memory is moved to
      the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range - e.g.,
      when onlining memory blocks.  We don't have to initialize the
      node_start_pfn to the memory we are adding.
      
      This patch (of 2):
      
      Especially, there is an inconsistency:
       - Hotplugging memory to a memory-less node with cpus: node_start_pfn == 0
       - Offlining and removing last memory from a node: node_start_pfn == 0
       - Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0
      
      As soon as memory is onlined, node_start_pfn is overwritten with the
      actual start.  E.g., when adding two DIMMs but only onlining one of both,
      only that DIMM (with online memory blocks) is spanned by the node.
      
      Currently, the validity of node_start_pfn really is linked to
      node_spanned_pages != 0.  With node_spanned_pages == 0 (e.g., before
      onlining memory), it has no meaning.
      
      So let's stop setting node_start_pfn, just to be overwritten via
      move_pfn_range_to_zone().  This avoids confusion when looking at the code,
      wondering which magic will be performed with the node_start_pfn in this
      function, when hotadding a pgdat.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit c68ab18c6aee0397574afb418f6775f23379198e)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      64f827bf
    • D
      mm/memory_hotplug: Introduce offline_and_remove_memory() · c062d118
      David Hildenbrand committed
      task #29077503
      commit 08b3acd7a68fc17902e1cb6b146389322840deab upstream
      virtio-mem wants to offline and remove a memory block once it unplugged
      all subblocks (e.g., using alloc_contig_range()). Let's provide
      an interface to do that from a driver. virtio-mem already supports to
      offline partially unplugged memory blocks. Offlining a fully unplugged
      memory block will not require to migrate any pages. All unplugged
      subblocks are PageOffline() and have a reference count of 0 - so
      offlining code will simply skip them.
      
      All we need is an interface to offline and remove the memory from kernel
      module context, where we don't have access to the memory block devices
      (esp. find_memory_block() and device_offline()) and the device hotplug
      lock.
      
      To keep things simple, allow to only work on a single memory block.
      Acked-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      (cherry picked from commit 08b3acd7a68fc17902e1cb6b146389322840deab)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      c062d118
    • D
      mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · 20500825
      David Hildenbrand committed
      task #29077503
      commit aa218795cb5fd583c94fc838dc76b7379dc4976a upstream
      virtio-mem wants to allow to offline memory blocks of which some parts
      were unplugged (allocated via alloc_contig_range()), especially, to later
      offline and remove completely unplugged memory blocks. The important part
      is that PageOffline() has to remain set until the section is offline, so
      these pages will never get accessed (e.g., when dumping). The pages should
      not be handed back to the buddy (which would require clearing PageOffline()
      and result in issues if offlining fails and the pages are suddenly in the
      buddy).
      
      Let's allow to do that by allowing to isolate any PageOffline() page
      when offlining. This way, we can reach the memory hotplug notifier
      MEM_GOING_OFFLINE, where the driver can signal that he is fine with
      offlining this page by dropping its reference count. PageOffline() pages
      with a reference count of 0 can then be skipped when offlining the
      pages (like if they were free, however they are not in the buddy).
      
      Anybody who uses PageOffline() pages and does not agree to offline them
      (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
      decrement the reference count and make offlining fail when trying to
      migrate such an unmovable page. So there should be no observable change.
      Same applies to balloon compaction users (movable PageOffline() pages), the
      pages will simply be migrated.
      
      Note 1: If offlining fails, a driver has to increment the reference
      	count again in MEM_CANCEL_OFFLINE.
      
      Note 2: A driver that makes use of this has to be aware that re-onlining
      	the memory block has to be handled by hooking into onlining code
      	(online_page_callback_t), resetting the page PageOffline() and
      	not giving them to the buddy.
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      (cherry picked from commit aa218795cb5fd583c94fc838dc76b7379dc4976a)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      
      Conflicts: keep non-related code old, and remove offlined_pages++
      	mm/memory_hotplug.c
      	mm/page_alloc.c
      	mm/page_isolation.c
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      20500825
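The reference-count handshake described in the commit above can be pictured with a small userspace model. This is only a sketch, not kernel code: `struct toy_page` and every `toy_*` function are hypothetical names, migration of movable pages is left out, and any page that still holds a reference is simply treated as unmovable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page: `offline` models PageOffline(). */
struct toy_page {
    bool offline;
    int refcount;
};

/* Models a driver's MEM_GOING_OFFLINE handler: drop the reference the
 * driver holds on each of its PageOffline() pages. */
static void toy_going_offline(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pages[i].offline)
            pages[i].refcount--;
}

/* Models MEM_CANCEL_OFFLINE: take the references back (Note 1). */
static void toy_cancel_offline(struct toy_page *pages, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (pages[i].offline)
            pages[i].refcount++;
}

/* Returns true if the whole range could be offlined in this model. */
static bool toy_offline_range(struct toy_page *pages, size_t n,
                              bool driver_agrees)
{
    assert(pages != NULL);

    if (driver_agrees)
        toy_going_offline(pages, n);

    for (size_t i = 0; i < n; i++) {
        if (pages[i].refcount != 0) {
            /* Still referenced: offlining fails, so undo the drop. */
            if (driver_agrees)
                toy_cancel_offline(pages, n);
            return false;
        }
    }
    /* PageOffline() pages with refcount 0 were skipped like free pages. */
    return true;
}
```

A driver that does not drop its references (`driver_agrees == false`) makes the range fail to offline, which mirrors the "no observable change" argument for existing balloon drivers.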
    • D
      virtio-mem: Paravirtualized memory hotunplug part 2 · 733f2794
      David Hildenbrand committed
      task #29077503
      commit 255f598507083905995ecab96392770ae03aac7f upstream
      and, therefore, managed by the buddy), and eventually replug it later.
      
      When requested to unplug memory, we use alloc_contig_range() to allocate
      subblocks in online memory blocks (so we are the owner) and send them to
      our hypervisor. When requested to plug memory, we can replug such memory
      using free_contig_range() after asking our hypervisor.
      
      We also want to mark all allocated pages PG_offline, so nobody will
      touch them. To differentiate pages that were never onlined when
      onlining the memory block from pages allocated via alloc_contig_range(), we
      use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
      online the pages for the first time or use free_contig_range().
      
      It is worth noting that there are no guarantees on how much memory can
      actually get unplugged again. All device memory might completely be
      fragmented with unmovable data, such that no subblock can get unplugged.
      
      We are not touching the ZONE_MOVABLE. If memory is onlined to the
      ZONE_MOVABLE, it can only get unplugged after that memory was offlined
      manually by user space. In normal operation, virtio-mem memory is
      suggested to be onlined to ZONE_NORMAL. In the future, we will try to
      make unplug more likely to succeed.
      
      Add a module parameter to control if online memory shall be touched.
      
      As we want to access alloc_contig_range()/free_contig_range() from
      kernel module context, export the symbols.
      
      Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
      are on the same node, in the same zone, and contain no holes.
      
      Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API
      Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      (cherry picked from commit 255f598507083905995ecab96392770ae03aac7f)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      
      Conflicts:
      	minimal fix in mm/page_alloc.c
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      733f2794
    • D
      mm/memory_hotplug: export generic_online_page() · 0c6a9eb5
      David Hildenbrand committed
      task #29077503
      commit 18db149120c106cf2b1a2595f82f3229f9d223b8 upstream
      
      Let's replace the __online_page...() functions by generic_online_page().
      Hyper-V only wants to delay the actual onlining of un-backed pages, so
      we can simply re-use the generic function.
      
      This patch (of 3):
      
      Let's expose generic_online_page() so online_page_callback users can
      simply fall back to the generic implementation when actually deciding to
      online the pages.
      
      Link: http://lkml.kernel.org/r/20190909114830.662-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 18db149120c106cf2b1a2595f82f3229f9d223b8)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      0c6a9eb5
    • A
      mm/page_alloc.c: memory hotplug: free pages as higher order · bd6aced3
      Arun KS committed
      task #29077503
      commit a9cd410a3d296846a8125aa43d97a573a354c472 upstream
      When freeing pages are done with higher order, time spent on coalescing
      pages by buddy allocator can be reduced.  With section size of 256MB,
      hot add latency of a single section shows improvement from 50-60 ms to
      less than 1 ms, hence improving the hot add latency by 60 times.  Modify
      external providers of online callback to align with the change.
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      (cherry picked from commit a9cd410a3d296846a8125aa43d97a573a354c472)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      
      Conflicts:
      	replace totalram_pages_add as old way.
      bd6aced3
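To get a feel for why freeing at a higher order helps in the commit above, it is enough to count how many free operations a 256MB section needs. The sketch below is plain arithmetic, not the kernel implementation; the 4 KiB page size and the maximum order of 10 are assumptions chosen for illustration, and both helper names are hypothetical.

```c
#include <assert.h>

/* Number of pages in a section of `section_bytes` bytes. */
static unsigned long toy_section_pages(unsigned long long section_bytes,
                                       unsigned long page_bytes)
{
    assert(page_bytes > 0);
    return (unsigned long)(section_bytes / page_bytes);
}

/* Number of free operations needed when each one releases 2^order pages.
 * Assumes nr_pages is a multiple of the chunk size. */
static unsigned long toy_nr_free_calls(unsigned long nr_pages,
                                       unsigned int order)
{
    unsigned long chunk = 1UL << order;

    assert(nr_pages % chunk == 0);
    return nr_pages / chunk;
}
```

Under these assumptions a 256MB section holds 65536 pages, so order-0 freeing hands 65536 individual pages to the buddy allocator for coalescing, while order-10 freeing needs only 64 operations.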
    • M
      mm, memory_hotplug: deobfuscate migration part of offlining · 5eee4728
      Michal Hocko committed
      task #29077503
      commit bb8965bd82fd4ed433a888f1383016ab3fa0d7de upstream
      Memory migration might fail during offlining and we keep retrying in that
      case.  This is currently obfuscated by goto retry loop.  The code is hard
      to follow and as a result it is even suboptimal because each retry round
      scans the full range from start_pfn even though we have successfully
      scanned/migrated [start_pfn, pfn] range already.  This is all only because
      check_pages_isolated failure has to rescan the full range again.
      
      De-obfuscate the migration retry loop by promoting it to a real for loop.
      In fact remove the goto altogether by making it a proper double loop
      (yeah, gotos are nasty in this specific case).  In the end we will get a
      slightly more optimal code which is better readable.
      
      [akpm@linux-foundation.org: reflow comments to 80 cols]
      Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      (cherry picked from commit bb8965bd82fd4ed433a888f1383016ab3fa0d7de)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      5eee4728
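The double-loop shape described in the commit above can be sketched in userspace as follows. This is a toy model under stated assumptions, not the kernel function: each array entry holds how many migration attempts a page still needs (0 meaning nothing to migrate), and all `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Index of the first page at or after `start` that still needs
 * migration, or `nr` if there is none. */
static size_t toy_scan_movable(const int *attempts_left, size_t nr,
                               size_t start)
{
    for (size_t pfn = start; pfn < nr; pfn++)
        if (attempts_left[pfn] > 0)
            return pfn;
    return nr;
}

/* The final check rescans the full range, as the commit text notes. */
static bool toy_all_isolated(const int *attempts_left, size_t nr)
{
    return toy_scan_movable(attempts_left, nr, 0) == nr;
}

/* Returns the number of migration attempts that were made. */
static unsigned long toy_offline(int *attempts_left, size_t nr)
{
    unsigned long attempts = 0;

    assert(attempts_left != NULL);
    do {
        /* Inner loop: resume scanning at the page that was just
         * handled instead of jumping back to the start of the range. */
        for (size_t pfn = 0; pfn < nr; ) {
            pfn = toy_scan_movable(attempts_left, nr, pfn);
            if (pfn == nr)
                break;
            attempts_left[pfn]--;   /* one migration attempt */
            attempts++;
        }
    } while (!toy_all_isolated(attempts_left, nr));

    return attempts;
}
```

The inner `for` loop replaces the `goto retry`, and only the outer `do ... while` pays for a full rescan.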
    • M
      mm, memory_hotplug: __offline_pages fix wrong locking · a6785cdc
      Michal Hocko committed
      task #29077503
      commit e3df4c6e4836ce93cd5cf92d9cbdeaf4439a0241 upstream
      offlining a page range.  This is indeed the case when
      test_pages_in_a_zone resp.  start_isolate_page_range fail.  This was an
      omission when forward porting the debugging patch from an older kernel.
      
      Fix the issue by dropping mem_hotplug_done from the failure condition
      and keeping the single unlock in the catch all failure path.
      
      Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org
      Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit e3df4c6e4836ce93cd5cf92d9cbdeaf4439a0241)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      a6785cdc
    • M
      mm, memory_hotplug: print reason for the offlining failure · 86f9a7e3
      Michal Hocko committed
      task #29077503
      commit 7960509329c24a2bf0bc4929636614a1b7bb4443 upstream
      The memory offlining failure reporting is inconsistent and insufficient.
      Some error paths simply do not report the failure to the log at all.  When
      we do report there are no details about the reason of the failure and
      there are several of them which makes memory offlining failures hard to
      debug.
      
      Make sure that the
      	memory offlining [mem %#010llx-%#010llx] failed
      message is printed for all failures and also provide a short textual
      reason for the failure e.g.
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      this tells us that the offlining has failed because of a signal pending
      aka user intervention.
      
      [akpm@linux-foundation.org: tweak messages a bit]
      Link: http://lkml.kernel.org/r/20181107101830.17405-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      (cherry picked from commit 7960509329c24a2bf0bc4929636614a1b7bb4443)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      86f9a7e3
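As a quick check of how the `%#010llx` conversions in the message above render, here is a userspace sketch that rebuilds the sample log line. The helper name is hypothetical, and the trailing "due to %s" part is inferred from the example line in the commit message rather than copied from the kernel source.

```c
#include <assert.h>
#include <stdio.h>

/* Formats the failure message into `buf`; returns what snprintf returns. */
static int toy_format_offline_failure(char *buf, size_t len,
                                      unsigned long long start,
                                      unsigned long long end,
                                      const char *reason)
{
    assert(buf != NULL && reason != NULL);
    /* %#010llx: "0x" prefix, zero padded to at least 10 characters. */
    return snprintf(buf, len,
                    "memory offlining [mem %#010llx-%#010llx] failed due to %s",
                    start, end, reason);
}
```

Called with `0x82600000000ULL`, `0x8267fffffffULL` and `"signal backoff"`, this yields the text of the sample line after the "kernel:" prefix. The addresses are wider than ten characters, so the minimum field width adds no padding.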
    • A
      mm/page_isolation.c: convert SKIP_HWPOISON to MEMORY_OFFLINE · 80aa4777
      Alex Shi committed
      task #29077503
      commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d upstream
      We have two types of users of page isolation:
      
       1. Memory offlining:  Offline memory so it can be unplugged. Memory
                             won't be touched.
      
       2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to
                             become the owner of the memory and make use of
                             it.
      
      For example, in case we want to offline memory, we can ignore (skip
      over) PageHWPoison() pages, as the memory won't get used.  We can allow
      to offline memory.  In contrast, we don't want to allow to allocate such
      memory.
      
      Let's generalize the approach so we can special case other types of
      pages we want to skip over in case we offline memory.  While at it, also
      pass the same flags to test_pages_isolated().
      Original-by: David Hildenbrand <david@redhat.com>
      Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Suggested-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      
      Conflicts:
      	reenable patch context on all files.
      80aa4777
    • M
      mm: only report isolation failures when offlining memory · 5316eb6e
      Michal Hocko committed
      task #29077503
      commit d381c54760dcfad23743da40516e7e003d73952a upstream
      Heiko has complained that his log is swamped by warnings from
      has_unmovable_pages
      
      [   20.536664] page dumped because: has_unmovable_pages
      [   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
      [   20.536794] flags: 0x3fffe0000010200(slab|head)
      [   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
      [   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
      [   20.536797] page dumped because: has_unmovable_pages
      [   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
      [   20.536815] flags: 0x7fffe0000000000()
      [   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
      [   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
      
      which are not triggered by the memory hotplug but rather CMA allocator.
      The original idea behind dumping the page state for all call paths was
      that these messages will be helpful debugging failures.  From the above it
      seems that this is not the case for the CMA path because we are lacking
      much more context.  E.g the second reported page might be a CMA allocated
      page.  It is still interesting to see a slab page in the CMA area but it
      is hard to tell whether this is a bug from the above output alone.
      
      Address this issue by dumping the page state only on request.  Both
      start_isolate_page_range and has_unmovable_pages already have an argument
      to ignore hwpoison pages so make this argument more generic and turn it
      into flags and allow callers to combine non-default modes into a mask.
      While we are at it, has_unmovable_pages call from
      is_pageblock_removable_nolock (sysfs removable file) is questionable to
      report the failure so drop it from there as well.
      
      Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit d381c54760dcfad23743da40516e7e003d73952a)
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      
      Conflicts:
      	mm/page_alloc.c
      5316eb6e
    • C
      mm, page_alloc: reset the zone->watermark_boost early · bd231c59
      Charan Teja Reddy committed
      to #28825456
      
      commit aa09259109583b98b9d9e7ed0d8eb1b880d1eb97 upstream.
      
      Updating the zone watermarks by any means, like min_free_kbytes,
      watermark_scale_factor etc., when ->watermark_boost is set will result in
      higher low and high watermarks than the user asked.
      
      Below are the steps to reproduce the problem on system setup of Android
      kernel running on Snapdragon hardware.
      
      1) Default settings of the system are as below:
      
         #cat /proc/sys/vm/min_free_kbytes = 5162
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      8340
      		high     8539
      
      2) Monitor the zone->watermark_boost(by adding a debug print in the
         kernel) and whenever it is greater than zero value, write the same
         value of min_free_kbytes obtained from step 1.
      
         #echo 5162 > /proc/sys/vm/min_free_kbytes
      
      3) Then read the zone watermarks in the system while the
         ->watermark_boost is zero.  This should show the same values of
         watermarks as step 1 but shows higher values than asked.
      
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      21148
      		high     21347
      
      These higher values are because of updating the zone watermarks using the
      macro min_wmark_pages(zone) which also adds the zone->watermark_boost.
      
      	#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
      					z->watermark_boost)
      
      So the steps that lead to the issue are:
      
      1) On the extfrag event, watermarks are boosted by storing the required
         value in ->watermark_boost.
      
      2) User tries to update the zone watermarks level in the system through
         min_free_kbytes or watermark_scale_factor.
      
      3) Later, when kswapd woke up, it resets the zone->watermark_boost to
         zero.
      
      In step 2), we use the min_wmark_pages() macro to store the watermarks
      in the zone structure, thus the values are always offset by the
      ->watermark_boost value. This can be avoided by resetting the
      ->watermark_boost to zero before it is used.
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      bd231c59
    • H
      mm: limit boost_watermark on small zones · ab70cdb0
      Henry Willard committed
      to #28825456
      
      commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.
      
      Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of pages managed in
      the zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
      
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Henry Willard <henry.willard@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      [xuyu: expand zone_managed_pages function]
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      ab70cdb0
    • M
      mm, vmscan: do not special-case slab reclaim when watermarks are boosted · fb4da0ed
      Mel Gorman committed
      to #28825456
      
      commit 28360f398778d7623a5ff8a8e90958c0d925e120 upstream.
      
      Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance with slab
      and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differ from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3   5.3.0-rc3
                             vanilla shrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This shows a slight skew for the max result, which represents a large
      outlier.  The 1st, 2nd and 3rd quartiles are similar, indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression but that
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration included time to delete all the test files when the test
      completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, dirty pages are
         generally kept at 10%, which matches vm.dirty_background_ratio and
         is the normal, historically expected behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance to
         what Dave reported but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanilla kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      5. Overall vmstats are closer to normal expectations
      
      	                                5.3.0-rc3      5.3.0-rc3
      	                                  vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab
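
      The derived figures in the table can be checked with a short worked
      calculation.  This sketch assumes that "Kswapd efficiency %" is pages
      reclaimed divided by pages scanned, which is inferred from the numbers
      above rather than stated in the changelog; the function names are
      hypothetical.

```c
#include <assert.h>

/* Assumed definition: efficiency = 100 * reclaimed / scanned. */
static double kswapd_efficiency_pct(double reclaimed, double scanned)
{
	assert(scanned > 0.0);
	return 100.0 * reclaimed / scanned;
}

/* Ratio between two counters, e.g. patched vs vanilla inode steals. */
static double counter_ratio(double patched, double vanilla)
{
	assert(vanilla > 0.0);
	return patched / vanilla;
}
```

      Feeding in the kswapd rows gives roughly 68.02 for the vanilla kernel
      (30869570 / 45382917) and 75.45 with the patch (25239655 / 33451026).
      The kswapd inode steals ratio, 5183206 / 570180, is a little over 9,
      which is the basis for the "10 times" remark.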
      
      A more complete set of tests were run that were part of the basis for
      introducing boosting and while there are some differences, they are well
      within tolerances.
      
      Bottom line, special-casing kswapd to avoid slab makes reclaim behaviour
      unpredictable and can lead to abnormal results for normal workloads.

      This patch restores the expected behaviour that slab and page cache are
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that workloads that favour the
      preservation of slab over pagecache can be tuned via
      vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
      the parameter when boosting is active.
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      fb4da0ed
    • A
      mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flag · a274182f
      Andrey Ryabinin committed
      to #28825456
      
      commit 8118b82eb756e271929697e8ada5f637dc443af1 upstream.
      
      Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
      removed setting of the ALLOC_NOFRAGMENT flag.  Bring it back.
      
      The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so
      that allocations are spread across local zones to avoid fragmentation
      due to mixing pageblocks as long as possible.
      
      Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com
      Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      a274182f
    • A
      mm/page_alloc.c: avoid potential NULL pointer dereference · 60eb3dfc
      Andrey Ryabinin committed
      to #28825456
      
      commit 8139ad043d632c0e9e12d760068a7a8e91659aa1 upstream.
      
      ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL.
      The 'zone' pointer is unconditionally dereferenced in
      alloc_flags_nofragment().  Bail out on a NULL zone to avoid a potential
      crash.  Currently we don't see any crashes only because
      alloc_flags_nofragment() has another bug which allows the compiler to
      optimize away all accesses to 'zone'.
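
      A minimal sketch of the guard, written as a userspace model rather than
      the kernel function itself.  The struct, the enum, the flag value and
      the function name are all hypothetical; only the idea of returning
      early on a NULL zone comes from the description above.

```c
#include <stddef.h>

#define ALLOC_NOFRAGMENT_MODEL 0x100u /* hypothetical bit value */

enum zone_type_model { ZONE_DMA32_MODEL, ZONE_NORMAL_MODEL };

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
	enum zone_type_model type;
	int has_populated_lower_zone; /* nonzero if a lower zone has memory */
};

static unsigned int alloc_flags_nofragment_model(const struct zone_model *zone)
{
	unsigned int alloc_flags = 0;

	/* The preferred zone may be NULL: bail out before touching it. */
	if (zone == NULL)
		return alloc_flags;

	/* Assumed policy: only avoid fragmenting the normal zone... */
	if (zone->type != ZONE_NORMAL_MODEL)
		return alloc_flags;

	/* ...and only when a lower zone exists to spread allocations to. */
	if (!zone->has_populated_lower_zone)
		return alloc_flags;

	return alloc_flags | ALLOC_NOFRAGMENT_MODEL;
}
```

      Passing NULL yields no flags instead of a crash, and every later access
      to the zone happens only after the check.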
      
      Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
      Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      60eb3dfc
    • M
      mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model · d5971569
      Mel Gorman committed
      to #28825456
      
      commit 24512228b7a3f412b5a51f189df302616b021c33 upstream.
      
      Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small
      amounts of memory when an external fragmentation event occurs") "broke"
      memory management on parisc.
      
      The machine is not NUMA but the DISCONTIG model creates three pgdats
      even though it's a UMA machine for the following ranges
      
              0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
              1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
              2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
      
      Mikulas reported:
      
      	With the patch 1c30844d2, the kernel will incorrectly reclaim the
      	first zone when it fills up, ignoring the fact that there are two
      	completely free zones. Basiscally, it limits cache size to 1GiB.
      
      	For example, if I run:
      	# dd if=/dev/sda of=/dev/null bs=1M count=2048
      
      	- with the proper kernel, there should be "Buffers - 2GiB"
      	when this command finishes. With the patch 1c30844d2, buffers
      	will consume just 1GiB or slightly more, because the kernel was
      	incorrectly reclaiming them.
      
      The page allocator and reclaim make assumptions that pgdats really
      represent NUMA nodes and zones represent ranges, and make decisions on
      that basis.  Watermark boosting for small pgdats leads to unexpected
      results even though this would have behaved reasonably on SPARSEMEM.

      DISCONTIG is essentially deprecated and even parisc plans to move to
      SPARSEMEM, so there is no need to be fancy.  This patch simply disables
      watermark boosting by default on DISCONTIGMEM.
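
      A compile-time default of this kind can be sketched as follows.  The
      macro MODEL_DISCONTIGMEM is a hypothetical stand-in for the kernel
      configuration symbol, and the non-zero default of 15000 is an
      assumption based on the documented sysctl default rather than on this
      changelog.

```c
/*
 * Define MODEL_DISCONTIGMEM at compile time to model a DISCONTIGMEM
 * build; leave it undefined to model every other memory model.
 */
#ifdef MODEL_DISCONTIGMEM
static const int default_watermark_boost_factor = 0;     /* boosting off */
#else
static const int default_watermark_boost_factor = 15000; /* assumed default */
#endif

/* A factor of zero means that watermarks are never boosted. */
static int boosting_enabled(int factor)
{
	return factor != 0;
}
```

      Because only the default changes, an administrator could still turn
      boosting back on through the sysctl on such a machine.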
      
      Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Mikulas Patocka <mpatocka@redhat.com>
      Tested-by: Mikulas Patocka <mpatocka@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      d5971569
    • M
      mm, page_alloc: fix a division by zero error when boosting watermarks v2 · 92bebf3c
      Mel Gorman committed
      to #28825456
      
      commit 94b3334cbebea34d56a7e6321c6fe9d89b309a49 upstream.
      
      Yury Norov reported that an arm64 KVM instance could not boot since
      v5.0-rc1 and that the problem could be addressed by reverting the patches

        1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
        73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")
      
      The problem is that a division by zero error is possible if boosting
      occurs very early in boot if the system has very little memory.  This
      patch avoids the division by zero error.
      
      Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reported-by: Yury Norov <yury.norov@gmail.com>
      Tested-by: Yury Norov <yury.norov@gmail.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      92bebf3c
    • M
      mm, page_alloc: do not wake kswapd with zone lock held · ba16c9c8
      Mel Gorman committed
      to #28825456
      
      commit 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 upstream.
      
      syzbot reported the following regression in the latest merge window and
      it was confirmed by Qian Cai that a similar bug was visible from a
      different context.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.20.0+ #297 Not tainted
        ------------------------------------------------------
        syz-executor0/8529 is trying to acquire lock:
        000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
        __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
      
        but task is already holding lock:
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
        include/linux/spinlock.h:329 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
        mm/page_alloc.c:2548 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
        mm/page_alloc.c:3021 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
        mm/page_alloc.c:3050 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
        mm/page_alloc.c:3072 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
        get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491
      
      It appears to be a false positive in that the only way the lock ordering
      should be inverted is if kswapd is waking itself and the wakeup
      allocates debugging objects which should already be allocated if it's
      kswapd doing the waking.  Nevertheless, the possibility exists and so
      it's best to avoid the problem.
      
      This patch flags a zone as needing a kswapd wakeup using the,
      surprisingly, unused zone flags field.  The flag is read without the
      lock held to do the wakeup.  It's possible that the flag setting context
      is not the same as the flag clearing context or for small races to
      occur.  However, each race possibility is harmless and there is no
      visible degradation in fragmentation treatment.

      While zone->flags could have continued to be unused, there is potential
      for moving some existing fields into the flags field instead.
      Particularly read-mostly ones like zone->initialized and
      zone->contiguous.
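
      The pattern described above, recording the need for a wakeup while the
      lock is held and acting on it after the lock is dropped, can be
      sketched with a single-threaded userspace model.  All names here are
      hypothetical; the model uses plain booleans where the kernel would use
      a spinlock and a bit in the zone flags field.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the kernel's struct zone. */
struct zone_model {
	bool lock_held;     /* models whether the zone lock is taken */
	bool boosted;       /* models the "needs kswapd" zone flag */
	int kswapd_wakeups; /* counts calls to the wakeup stub */
};

static void wakeup_kswapd_stub(struct zone_model *z)
{
	/* The whole point: the wakeup must never run under the zone lock. */
	assert(!z->lock_held);
	z->kswapd_wakeups++;
}

static void allocate_model(struct zone_model *z, bool fragmenting)
{
	z->lock_held = true;
	if (fragmenting)
		z->boosted = true; /* only record that a wakeup is wanted */
	z->lock_held = false;

	/* Lock dropped: test and clear the flag, then do the wakeup. */
	if (z->boosted) {
		z->boosted = false;
		wakeup_kswapd_stub(z);
	}
}
```

      An allocation that does not fragment memory never wakes kswapd, and one
      that does wakes it exactly once, outside the locked region.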
      
      Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Qian Cai <cai@lca.pw>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	include/linux/mmzone.h
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      ba16c9c8
    • M
      mm: reclaim small amounts of memory when an external fragmentation event occurs · 9bcadc70
      Mel Gorman committed
      to #28825456
      
      commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.
      
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation-causing
      event occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark.  If the calling context
      allows, kswapd is woken to reclaim an amount of memory relative to the
      size of the high watermark and the watermark_boost_factor until the boost
      is cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
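
      How the boost relates to the high watermark and the factor can be
      sketched as a userspace model.  Several details are assumptions, not
      taken from this changelog: the factor is treated as being in units of
      1/10000 of the high watermark, each event is assumed to add one
      pageblock to the boost, and the boost is capped at the resulting
      maximum.  The function names are hypothetical.

```c
#include <assert.h>

/* Compute x * numer / denom without overflowing on the multiplication. */
static unsigned long mult_frac_model(unsigned long x, unsigned long numer,
				     unsigned long denom)
{
	assert(denom > 0);
	return (x / denom) * numer + ((x % denom) * numer) / denom;
}

/* Return the new boost after one fragmentation event. */
static unsigned long boost_step(unsigned long cur_boost,
				unsigned long high_wmark,
				unsigned long boost_factor,
				unsigned long pageblock_nr_pages)
{
	unsigned long max_boost;

	/* A factor of zero disables boosting altogether. */
	if (boost_factor == 0)
		return cur_boost;

	/* Assumed scaling: the factor is a fraction of 10000. */
	max_boost = mult_frac_model(high_wmark, boost_factor, 10000);
	if (max_boost < pageblock_nr_pages)
		max_boost = pageblock_nr_pages;

	cur_boost += pageblock_nr_pages;
	return cur_boost < max_boost ? cur_boost : max_boost;
}
```

      Under these assumptions, a high watermark of 4096 pages and a factor of
      15000 give a maximum boost of 6144 pages, reached in steps of one
      pageblock per event.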
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation causing events are massively reduced by
      this patch whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, a big
      reduction in the huge page fault latency, and a higher allocation
      success rate.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      9bcadc70
    • M
      mm: use alloc_flags to record if kswapd can wake · fd98e14a
      Mel Gorman committed
      to #28825456
      
      commit 0a79cdad5eb213b3a629e624565b1b3bf9192b7c upstream.
      
      This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
      into alloc_flags.  Its only purpose is to avoid having to pass gfp_mask
      through a long callchain in a future patch.

      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NOFRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
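
      Copying one flag from the GFP mask into alloc_flags can be sketched as
      below.  This is an illustration only: the bit values and the names
      ending in _MODEL are hypothetical and do not match the kernel's real
      definitions.

```c
#define GFP_KSWAPD_RECLAIM_MODEL 0x1000u /* hypothetical gfp_mask bit */
#define ALLOC_KSWAPD_MODEL       0x0200u /* hypothetical alloc_flags bit */

/* Record in alloc_flags whether the caller allows kswapd to be woken. */
static unsigned int record_kswapd_flag(unsigned int gfp_mask,
				       unsigned int alloc_flags)
{
	if (gfp_mask & GFP_KSWAPD_RECLAIM_MODEL)
		alloc_flags |= ALLOC_KSWAPD_MODEL;
	return alloc_flags;
}

/* Deep in the callchain only alloc_flags is needed, not gfp_mask. */
static int may_wake_kswapd(unsigned int alloc_flags)
{
	return (alloc_flags & ALLOC_KSWAPD_MODEL) != 0;
}
```

      Callees that already receive alloc_flags can then decide whether a
      wakeup is allowed without an extra gfp_mask parameter.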
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      fd98e14a
    • M
      mm, page_alloc: spread allocations across zones before introducing fragmentation · 039531d2
      Mel Gorman committed
      to #28825456
      
      commit 6bb154504f8b496780ec53ec81aba957a12981fa upstream.
      
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload and a
      definition of external fragmentation events (including serious ones),
      followed by a set of patches that reduces the level of those events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup
      
      Overall the series reduces external fragmentation causing events by over 94%
      on 1 and 2 socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patches 2-4 boost watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock is still allowed in the fast
      path as it is uninteresting from a fragmentation point of view.
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup the test files.
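
      The thread-count rule in step 4 can be written down directly.  This is
      only a sketch of the arithmetic; the function name and the clamp to a
      minimum of one thread are assumptions of this example, not taken from
      mmtests.

```c
#include <assert.h>

/* (NR_CPUS_SOCKET - NR_FIO_THREADS) / 4, using integer division.
 * Clamping to at least one thread is an assumption of this sketch. */
static int sketch_thp_threads(int cpus_per_socket, int fio_threads)
{
	int n;

	assert(cpus_per_socket > 0 && fio_threads >= 0);
	n = (cpus_per_socket - fio_threads) / 4;
	return n < 1 ? 1 : n;
}
```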
      
      Note that due to the use of IO and page cache, this benchmark is not
      suitable for running on large machines where the time to fragment memory
      may be excessive.  Also note that while this is one mix that generates
      fragmentation, it's not the only one.  Workloads that are more
      slab-intensive, or that use SLUB with high-order pages, may yield
      different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metrics are allocation
      latency and huge page allocation success rate, but note that differences
      in latencies and in the success rate can also affect the number of
      external fragmentation events, which is why they are secondary.
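
      A minimal sketch of the primary metric, assuming the order-9 pageblock
      quoted above for 64-bit x86.  The helper names are hypothetical; the
      second helper only reproduces the percentage arithmetic used when
      comparing event counts between two kernels.

```c
#include <assert.h>
#include <stdbool.h>

/* Pageblock order quoted in the text for 64-bit x86. */
#define SKETCH_PAGEBLOCK_ORDER 9

/* A fallback below pageblock order counts as an external
 * fragmentation event for the purposes of the primary metric. */
static bool sketch_is_extfrag_event(int fallback_order)
{
	assert(fallback_order >= 0);
	return fallback_order < SKETCH_PAGEBLOCK_ORDER;
}

/* Percentage reduction in events between a baseline and a patched run. */
static double sketch_reduction_pct(long baseline, long patched)
{
	assert(baseline > 0);
	return 100.0 * (double)(baseline - patched) / (double)baseline;
}
```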
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious fragmentation
      events reduces the chances of external fragmentation being a problem in
      the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_RECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_RECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations.  This is
      consistent for the workload throughout the series, so the breakdown will
      not be presented again.
      
      Movable fallbacks are sometimes considered "ok" because they
      can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock causing those allocations to fall back
      later and pollute pageblocks with pages that cannot move.  If there is a
      movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock and while it might not be enough to
      actually cause an unmovable/reclaimable fallback in the future, we cannot
      know that in advance so the patch takes the only option available to it.
      Hence, it's important to control them.  This point is also consistent
      throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is struggling and there is quite a lot of parallel reclaim and
      compaction activity.  There is also a certain degree of luck on whether
      processes start on node 0 or not for this patch but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	mm/page_alloc.c
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      039531d2
    • J
      mm/filemap.c: don't bother dropping mmap_sem for zero size readahead · 0b74ae70
      Jan Kara committed
      to #28718400
      
      commit 5c72feee3e45b40a3c96c7145ec422899d0e8964 upstream.
      
      When handling a page fault, we drop mmap_sem to start async readahead so
      that we don't block on IO submission with mmap_sem held.  However there's
      no point to drop mmap_sem in case readahead is disabled.  Handle that case
      to avoid pointless dropping of mmap_sem and retrying the fault.  This was
      actually reported to block mlockall(MCL_CURRENT) indefinitely.
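
      The check can be pictured with a small user-space model.  This is a
      sketch under assumptions: the struct and function names are
      hypothetical, and a zero ra_pages value stands in for "readahead is
      disabled".

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the readahead state: only the window size. */
struct sketch_ra_state {
	unsigned long ra_pages;	/* 0 means readahead is disabled */
};

/* Simplified stand-in for the state of one page fault. */
struct sketch_fault {
	bool mmap_sem_held;
	bool must_retry;
};

static void sketch_async_readahead(const struct sketch_ra_state *ra,
				   struct sketch_fault *f)
{
	assert(ra && f);
	if (ra->ra_pages == 0)
		return;			/* nothing to submit: keep the lock */
	f->mmap_sem_held = false;	/* drop the lock before submitting IO */
	f->must_retry = true;		/* the fault has to be retried */
}
```

      With readahead disabled the model never drops the lock, so no retry is
      requested and the caller cannot loop on a fault that makes no progress.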
      
      Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
      Reported-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Robert Stupp <snazy@gmx.de>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      0b74ae70
    • Y
      mm: mmu_gather: remove __tlb_reset_range() for force flush · 1d42b185
      Yang Shi committed
      to #28718400
      
      commit 7a30df49f63ad92318ddf1f7498d1129a77dd4bd upstream.
      
      A few new fields were added to mmu_gather to make TLB flushing smarter for
      huge pages by recording which level of the page table has changed.
      
      __tlb_reset_range() resets all of this page table state to unchanged.  It
      is called by the TLB flush code when there are parallel mapping changes to
      the same range under a non-exclusive lock (i.e.  read mmap_sem).
      
      Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
      munmap"), the syscalls (e.g.  MADV_DONTNEED, MADV_FREE) which may update
      PTEs in parallel did not remove page tables.  But the aforementioned
      commit may do munmap() under read mmap_sem and free page tables.  This
      may result in a program hang on aarch64, as reported by Jan Stancek.  The
      problem can be reproduced with a slightly modified version of his test
      program, shown below.
      
      ---8<---
      
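      #include <pthread.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>
      
      /* DISTANT_MMAP_SIZE is not defined in this excerpt; it must be
       * defined before the program will build. */
      
      static int map_size = 4096;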
      static int num_iter = 500;
      static long threads_total;
      
      static void *distant_area;
      
      void *map_write_unmap(void *ptr)
      {
      	int *fd = ptr;
      	unsigned char *map_address;
      	int i, j = 0;
      
      	for (i = 0; i < num_iter; i++) {
      		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
      			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (map_address == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      
      		for (j = 0; j < map_size; j++)
      			map_address[j] = 'b';
      
      		if (munmap(map_address, map_size) == -1) {
      			perror("munmap");
      			exit(1);
      		}
      	}
      
      	return NULL;
      }
      
      void *dummy(void *ptr)
      {
      	return NULL;
      }
      
      int main(void)
      {
      	pthread_t thid[2];
      
      	/* hint for mmap in map_write_unmap() */
      	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
      			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
      	distant_area += DISTANT_MMAP_SIZE / 2;
      
      	while (1) {
      		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
      		pthread_create(&thid[1], NULL, dummy, NULL);
      
      		pthread_join(thid[0], NULL);
      		pthread_join(thid[1], NULL);
      	}
      }
      ---8<---
      
      The program may bring in parallel execution like below:
      
              t1                                        t2
      munmap(map_address)
        downgrade_write(&mm->mmap_sem);
        unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1
      
                                              pthread_exit()
                                              madvise(thread_stack, 8M, MADV_DONTNEED)
                                                zap_page_range()
                                                  tlb_gather_mmu()
                                                    inc_tlb_flush_pending(tlb->mm);
      
        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
      
      __tlb_reset_range() would reset freed_tables and cleared_* bits, but this
      may cause inconsistency for munmap(), which does free page tables.  As a
      result, some architectures, e.g.  aarch64, may not flush the TLB
      completely as expected, leaving stale TLB entries behind.
      
      Use fullmm flush since it yields much better performance on aarch64 and
      non-fullmm doesn't yield a significant difference on x86.
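
      The inconsistency can be shown with a toy model of the gather state.
      This is a sketch, not kernel code: the struct, its three fields and the
      function names are simplifications chosen for this example, and "covers
      tables" is a stand-in for whether the eventual flush would invalidate
      entries for freed page tables.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the gather state discussed above. */
struct sketch_gather {
	bool fullmm;
	bool freed_tables;
	bool cleared_pmds;
};

/* Old behaviour: wipe the tracking bits when a nested flush is seen. */
static void sketch_finish_with_reset(struct sketch_gather *tlb, bool nested)
{
	if (nested) {
		tlb->freed_tables = false;
		tlb->cleared_pmds = false;
	}
}

/* Behaviour described by the fix: keep the bits, force a full-mm flush. */
static void sketch_finish_with_fullmm(struct sketch_gather *tlb, bool nested)
{
	if (nested)
		tlb->fullmm = true;
}

/* A flush covers freed page tables only if it still knows about them
 * or invalidates the whole address space. */
static bool sketch_flush_covers_tables(const struct sketch_gather *tlb)
{
	return tlb->fullmm || tlb->freed_tables;
}
```

      In the model, the reset variant forgets that page tables were freed,
      while the fullmm variant keeps the record and flushes everything.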
      
      The originally proposed fix came from Jan Stancek, who did most of the
      debugging of this issue; I just wrapped everything up together.
      
      Jan's testing results:
      
      v5.2-rc2-24-gbec7550cca10
      --------------------------
               mean     stddev
      real    37.382   2.780
      user     1.420   0.078
      sys     54.658   1.855
      
      v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
      ----------------------------------------------------------------------------------------
               mean     stddev
      real    37.119   2.105
      user     1.548   0.087
      sys     55.698   1.357
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Jan Stancek <jstancek@redhat.com>
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Tested-by: Jan Stancek <jstancek@redhat.com>
      Suggested-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      [xuyu: backport from mm/mmu_gather.c to mm/memory.c]
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      1d42b185
    • J
      filemap: drop the mmap_sem for all blocking operations · 2841653a
      Josef Bacik committed
      to #28718400
      
      commit 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 upstream.
      
      Currently we only drop the mmap_sem if there is contention on the page
      lock.  The idea is that we issue readahead and then go to lock the page
      while it is under IO and we want to not hold the mmap_sem during the IO.
      
      The problem with this is the assumption that the readahead does anything.
      In the case that the box is under extreme memory or IO pressure we may end
      up not reading anything at all for readahead, which means we will end up
      reading in the page under the mmap_sem.
      
      Even if the readahead does something, it could get throttled because of io
      pressure on the system and the process is in a lower priority cgroup.
      
      Holding the mmap_sem while doing IO is problematic because it can cause
      system-wide priority inversions.  Consider some large company that does a
      lot of web traffic.  This large company has load balancing logic in its
      core web server, because some engineer thought this was a brilliant plan.
      This load balancing logic gets statistics from /proc about the system,
      which trips over processes' mmap_sem for various reasons.  Now the web
      server application is in a protected cgroup, but these other processes may
      not be, and if they are being throttled while their mmap_sem is held we'll
      stall, and cause this nice death spiral.
      
      Instead rework filemap fault path to drop the mmap sem at any point that
      we may do IO or block for an extended period of time.  This includes while
      issuing readahead, locking the page, or needing to call ->readpage because
      readahead did not occur.  Then once we have a fully uptodate page we can
      return with VM_FAULT_RETRY and come back again to find our nicely in-cache
      page that was gotten outside of the mmap_sem.
      
      This patch also adds a new helper for locking the page with the mmap_sem
      dropped.  This doesn't make sense currently as generally speaking if the
      page is already locked it'll have been read in (unless there was an error)
      before it was unlocked.  However a forthcoming patchset will change this
      with the ability to abort read-ahead bio's if necessary, making it more
      likely that we could contend for a page lock and still have a not uptodate
      page.  This allows us to deal with this case by grabbing the lock and
      issuing the IO without the mmap_sem held, and then returning
      VM_FAULT_RETRY to come back around.
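
      The retry pattern described above can be modelled in a few lines.  This
      is a sketch with hypothetical names: a boolean stands in for mmap_sem,
      and setting page_uptodate stands in for the blocking read.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_result { SKETCH_FAULT_DONE, SKETCH_FAULT_RETRY };

struct sketch_ctx {
	bool mmap_sem_held;
	bool page_uptodate;
};

/* One pass through the fault path. */
static enum sketch_result sketch_filemap_fault(struct sketch_ctx *c)
{
	assert(c->mmap_sem_held);
	if (c->page_uptodate)
		return SKETCH_FAULT_DONE;	/* page is already in cache */

	c->mmap_sem_held = false;		/* drop before blocking */
	c->page_uptodate = true;		/* stands in for the read IO */
	return SKETCH_FAULT_RETRY;		/* caller comes back around */
}

/* Caller loop: retake the lock and retry until the fault completes.
 * Returns the number of passes that were needed. */
static int sketch_handle_fault(struct sketch_ctx *c)
{
	enum sketch_result r;
	int attempts = 0;

	do {
		c->mmap_sem_held = true;
		r = sketch_filemap_fault(c);
		attempts++;
	} while (r == SKETCH_FAULT_RETRY);
	return attempts;
}
```

      A cold page costs two passes in this model, but the blocking step only
      ever runs with the lock released.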
      
      [josef@toxicpanda.com: v6]
        Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
      [kirill@shutemov.name: fix race in filemap_fault()]
        Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      2841653a
    • J
      filemap: kill page_cache_read usage in filemap_fault · d33d6167
      Josef Bacik committed
      to #28718400
      
      commit a75d4c33377277b6034dd1e2663bce444f952c14 upstream.
      
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return an unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	mm/filemap.c
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      d33d6167
    • J
      filemap: pass vm_fault to the mmap ra helpers · 4023e1eb
      Josef Bacik committed
      to #28718400
      
      commit 2a1180f1bd389e9d47693e5eb384b95f482d8d19 upstream.
      
      All of the arguments to these functions come from the vmf.
      
      Cut down on the amount of arguments passed by simply passing in the vmf
      to these two helpers.
      
      Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.com
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • Y
      mm: unmap VM_PFNMAP mappings with optimized path · 7124f1ac
      Yang Shi committed
      to #28718400
      
      commit cb4922496ae40a775a1b17025eaa1060e8991253 upstream.
      
      When unmapping VM_PFNMAP mappings, vm flags need to be updated.  Since the
      vmas have already been detached, it appears safe to update vm flags with
      read mmap_sem.
      
      Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • Y
      mm: unmap VM_HUGETLB mappings with optimized path · 435ce551
      Yang Shi committed
      to #28718400
      
      commit b4cefb36051244bcb5651026d862c332a6cac7df upstream.
      
      When unmapping VM_HUGETLB mappings, vm flags need to be updated.  Since
      the vmas have already been detached, it appears safe to update vm flags
      with read mmap_sem.
      
      Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • Y
      mm: mmap: zap pages with read mmap_sem in munmap · 7027c305
      Yang Shi committed
      to #28718400
      
      commit dd2283f2605e3b3e9c61bcae844b34f2afa4813f upstream.
      
      Patch series "mm: zap pages with read mmap_sem in munmap for large
      mapping", v11.
      
      Background:
      Recently, when we ran some vm scalability tests on machines with large memory,
      we ran into a couple of mmap_sem scalability issues when unmapping large memory
      ranges; see https://lkml.org/lkml/2017/12/14/733 and
      https://lkml.org/lkml/2018/2/20/576.
      
      History:
      akpm then suggested unmapping a large mapping section by section, dropping
      mmap_sem between sections, to mitigate the problem (see
      https://lkml.org/lkml/2018/3/6/784).
      
      The V1 patch series was submitted to the mailing list per Andrew's suggestion
      (see https://lkml.org/lkml/2018/3/20/786).  I then received a lot of great
      feedback and suggestions.
      
      The topic was then discussed at the LSFMM summit 2018.  At the summit, Michal
      Hocko suggested (as he also did in the v1 patch review) trying a "two phases"
      approach: zap pages with read mmap_sem, then do the vma cleanup with
      write mmap_sem (for details of the discussion, see
      https://lwn.net/Articles/753269/).
      
      Approach:
      Zapping pages is the most time-consuming part.  Following the suggestion from
      Michal Hocko [1], pages can be zapped while holding mmap_sem for read, as
      MADV_DONTNEED does, and write mmap_sem is then re-acquired to clean up vmas.
      
      But we can't call MADV_DONTNEED directly, since it has two major drawbacks:
        * A page fault that wins the race in the middle of munmap sees an unexpected
          state: it may return the zero page instead of the content or SIGSEGV.
        * It can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
          is a showstopper according to akpm.
      
      However, some steps, such as vma splitting, still need write mmap_sem, so
      the design is as follows:
              acquire write mmap_sem
              lookup vmas (find and split vmas)
              deal with special mappings
              detach vmas
              downgrade_write
      
              zap pages
              free page tables
              release mmap_sem
      
      VM events that take mmap_sem for read (e.g. page fault, gup) may come in
      during page zapping, but since the vmas have already been detached, they
      will not find a valid vma and will just return SIGSEGV or -EFAULT as
      expected.
      
      If the vma has VM_HUGETLB | VM_PFNMAP, it is considered a special
      mapping.  In this patch such vmas are handled by falling back to the
      regular do_munmap() with mmap_sem held exclusively, since they may update
      vm flags.
      
      But, with the "detach vmas first" approach, the vmas have already been
      detached when vm flags are updated, so it appears safe to update vm flags
      with read mmap_sem in this specific case.  VM_HUGETLB and VM_PFNMAP will
      therefore be handled by the optimized path in the following separate
      patches, for the sake of bisectability.
      
      Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES).
      However, according to the uprobes developer, a false-positive
      MMF_RECALC_UPROBES is fine.  So, uprobe unmap will not be handled by the
      regular path.
      
      With the "detach vmas first" approach we don't have to re-acquire mmap_sem
      to clean up vmas, which avoids a race window in which the address space
      might change, since downgrade_write() doesn't release the lock (which
      could lead to a regression) but simply downgrades it to a read lock.
      
      And, since the lock acquire/release cost is kept to a minimum and is
      almost the same as before, the optimization can be extended to mappings
      of any size without incurring a significant penalty for small mappings.
      
      For the time being, this is done only in the munmap syscall path.  Other
      vm_munmap() or do_munmap() call sites (e.g. mmap, mremap) remain
      unchanged because of implementation difficulties: they acquire write
      mmap_sem at the very beginning and hold it until the end, and do_munmap()
      may be called in the middle.  The optimized do_munmap, however, needs to
      be called without mmap_sem held so that it can do the optimization.  So
      doing a similar optimization for the mmap/mremap path would, I'm afraid,
      require redesigning them.  mremap might be called on a very large area
      depending on the use case; optimizing it will be considered in the
      future.
      
      This patch (of 3):
      
      When running some mmap/munmap scalability tests with large memory (i.e.
      > 300GB), the below hung task issue may happen occasionally.
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
        ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
        ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
        00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
       Call Trace:
        [<ffffffff817154d0>] ? __schedule+0x250/0x730
        [<ffffffff817159e6>] schedule+0x36/0x80
        [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
        [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
        [<ffffffff81717db0>] down_read+0x20/0x40
        [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
        [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
        [<ffffffff81241d87>] __vfs_read+0x37/0x150
        [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
        [<ffffffff81242266>] vfs_read+0x96/0x130
        [<ffffffff812437b5>] SyS_read+0x55/0xc0
        [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      This happens because munmap holds mmap_sem exclusively from the very
      beginning all the way to the end and doesn't release it in the middle.
      Unmapping a large mapping may take a long time (~18 seconds to unmap a
      320GB mapping with every single page mapped, on an idle machine).
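      The cost described above can be observed from userspace with a small
      timing helper.  The following is only a sketch under stated assumptions:
      time_munmap_seconds() is a hypothetical name, it is not the test program
      used for the numbers in this message, and it measures the whole munmap()
      call rather than the mmap_sem hold time.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Map `bytes` of anonymous memory, touch every page so that it is really
 * populated, then time the munmap() call alone.
 * Returns the elapsed time in seconds, or -1.0 if mmap/munmap fails.
 * (Hypothetical helper for illustration; not taken from the patch series.)
 */
static double time_munmap_seconds(size_t bytes)
{
    long page = sysconf(_SC_PAGESIZE);
    struct timespec t0, t1;
    unsigned char *p;
    size_t off;

    if (page <= 0 || bytes == 0)
        return -1.0;

    p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1.0;

    /* Write one byte per page to fault every page in. */
    for (off = 0; off < bytes; off += (size_t)page)
        p[off] = 1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    if (munmap(p, bytes) != 0)
        return -1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

      Calling it with increasing sizes, for example from a main() that prints
      the returned value, should show the unmap time growing with the number
      of populated pages; the absolute numbers depend on the machine.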
      
      Zapping pages is the most time-consuming part.  Following the suggestion
      from Michal Hocko [1], pages can be zapped while holding mmap_sem for
      read, as MADV_DONTNEED does, and write mmap_sem is then re-acquired to
      clean up vmas.
      
      However, some steps, such as vma splitting, still need write mmap_sem, so
      the design is as follows:
              acquire write mmap_sem
              lookup vmas (find and split vmas)
              deal with special mappings
              detach vmas
              downgrade_write
      
              zap pages
              free page tables
              release mmap_sem
      
      VM events that take mmap_sem for read (e.g. page fault, gup) may come in
      during page zapping, but since the vmas have already been detached, they
      will not find a valid vma and will just return SIGSEGV or -EFAULT as
      expected.
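      
      The lock ordering of this design can be modelled in userspace.  The
      sketch below is a toy, not kernel code: every toy_* name is hypothetical,
      the "address space" is a single mapping with a page counter, and the
      lock is a minimal reader/writer lock built on pthreads, added only
      because POSIX rwlocks have no downgrade operation.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Toy rwsem with a downgrade operation (hypothetical, userspace only). */
struct toy_rwsem {
    pthread_mutex_t mu;
    pthread_cond_t cv;
    int readers;            /* number of active readers */
    int writer;             /* 1 while a writer holds the lock */
};

/* Toy address space: one mapping covering `present` populated pages. */
struct toy_mm {
    struct toy_rwsem sem;
    int attached;           /* 1 while lookups can still find the mapping */
    size_t present;         /* populated pages not yet zapped */
};

static void toy_down_read(struct toy_rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    while (s->writer)
        pthread_cond_wait(&s->cv, &s->mu);
    s->readers++;
    pthread_mutex_unlock(&s->mu);
}

static void toy_up_read(struct toy_rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    assert(s->readers > 0);
    if (--s->readers == 0)
        pthread_cond_broadcast(&s->cv);
    pthread_mutex_unlock(&s->mu);
}

static void toy_down_write(struct toy_rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    while (s->writer || s->readers > 0)
        pthread_cond_wait(&s->cv, &s->mu);
    s->writer = 1;
    pthread_mutex_unlock(&s->mu);
}

/* Convert the held write lock into a read lock without releasing it. */
static void toy_downgrade_write(struct toy_rwsem *s)
{
    pthread_mutex_lock(&s->mu);
    assert(s->writer == 1);
    s->writer = 0;
    s->readers = 1;
    pthread_cond_broadcast(&s->cv);     /* waiting readers may now enter */
    pthread_mutex_unlock(&s->mu);
}

static void toy_mm_init(struct toy_mm *mm, size_t pages)
{
    pthread_mutex_init(&mm->sem.mu, NULL);
    pthread_cond_init(&mm->sem.cv, NULL);
    mm->sem.readers = 0;
    mm->sem.writer = 0;
    mm->attached = 1;
    mm->present = pages;
}

/* What a page fault or gup would do: look the mapping up under the read lock. */
static int toy_lookup(struct toy_mm *mm)
{
    int found;

    toy_down_read(&mm->sem);
    found = mm->attached;
    toy_up_read(&mm->sem);
    return found;
}

/* Returns the number of pages zapped (0 if nothing was mapped). */
static size_t toy_munmap(struct toy_mm *mm)
{
    size_t zapped;

    toy_down_write(&mm->sem);           /* acquire write lock */
    mm->attached = 0;                   /* detach: lookups now fail */
    toy_downgrade_write(&mm->sem);      /* downgrade, never unlocked */

    zapped = mm->present;               /* "zap pages" under the read lock; */
    mm->present = 0;                    /* safe: nobody can find the mapping */
    toy_up_read(&mm->sem);              /* release the lock */
    return zapped;
}
```

      In this model a concurrent toy_lookup() either runs before the detach
      and finds the mapping, or runs during or after the zap and finds
      nothing.  This mirrors the SIGSEGV or -EFAULT outcome described above.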
      
      If the vma has VM_HUGETLB | VM_PFNMAP, it is considered a special
      mapping.  In this patch such vmas are handled without downgrading
      mmap_sem, since they may update vm flags.
      
      But, with the "detach vmas first" approach, the vmas have already been
      detached when vm flags are updated, so it appears safe to update vm flags
      with read mmap_sem in this specific case.  VM_HUGETLB and VM_PFNMAP will
      therefore be handled by the optimized path in the following separate
      patches, for the sake of bisectability.
      
      Unmapping uprobe areas may need to update mm flags (MMF_RECALC_UPROBES).
      However, according to the uprobes developer, a false-positive
      MMF_RECALC_UPROBES is fine.
      
      With the "detach vmas first" approach we don't have to re-acquire mmap_sem
      to clean up vmas, which avoids a race window in which the address space
      might change, since downgrade_write() doesn't release the lock (which
      could lead to a regression) but simply downgrades it to a read lock.
      
      And, since the lock acquire/release cost is kept to a minimum and is
      almost the same as before, the optimization can be extended to mappings
      of any size without incurring a significant penalty for small mappings.
      
      For the time being, this is done only in the munmap syscall path.  Other
      vm_munmap() or do_munmap() call sites (e.g. mmap, mremap) remain
      unchanged because of implementation difficulties: they acquire write
      mmap_sem at the very beginning and hold it until the end, and do_munmap()
      may be called in the middle.  The optimized do_munmap, however, needs to
      be called without mmap_sem held so that it can do the optimization.  So
      doing a similar optimization for the mmap/mremap path would, I'm afraid,
      require redesigning them.  mremap might be called on a very large area
      depending on the use case; optimizing it will be considered in the
      future.
      
      With the patches, the exclusive mmap_sem hold time when munmapping an 80GB
      address space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped
      from seconds to the microsecond level.
      
      munmap_test-15002 [008]   594.380138: funcgraph_entry: |  __vm_munmap() {
      munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us |    unmap_region();
      munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us |  }
      
      Here the execution time of unmap_region() is used as an estimate of the
      time spent holding mmap_sem for read; the remaining time (2485692 us -
      2485684 us = 8 us in this trace) is spent holding the lock exclusively.
      
      [1] https://lwn.net/Articles/753269/
      
      Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: Matthew Wilcox <willy@infradead.org>
      Reviewed-by: Matthew Wilcox <willy@infradead.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
    • C
      alinux: introduce deferred_meminit boot parameter · 05f6ed40
      chenxiangzuo committed
      fix #27418285
      
      Introduce a boot parameter 'deferred_meminit' for the deferred page
      init feature. It is disabled by default; pass 'deferred_meminit' on
      the kernel command line to enable it.
      Signed-off-by: chenxiangzuo <cxz18821786681@linux.alibaba.com>
      Reviewed-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Shile Zhang <shile.zhang@linux.alibaba.com>
  2. 29 Jun, 2020 1 commit
  3. 23 Jun, 2020 8 commits