1. 29 12月, 2018 40 次提交
    • P
      mm/pageblock: throw compile error if pageblock_bits cannot hold MIGRATE_TYPES · 125b860b
      Pingfan Liu 提交于
      Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated by code.
      If someone adds extra migrate type, then he may forget to enlarge the
      NR_PAGEBLOCK_BITS.  Hence it requires some way to fix.
      
      NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, while these macro spread on
      two different .h file with reverse dependency, it is a little hard to
      refer to MIGRATE_TYPES in pageblock-flag.h.  This patch tries to remind
      such relation in compiling-time.
      
      Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.comSigned-off-by: NPingfan Liu <kernelfans@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      125b860b
    • K
      ksm: react on changing "sleep_millisecs" parameter faster · fcf9a0ef
      Kirill Tkhai 提交于
      ksm thread unconditionally sleeps in ksm_scan_thread() after each
      iteration:
      
      	schedule_timeout_interruptible(
      		msecs_to_jiffies(ksm_thread_sleep_millisecs))
      
      The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.
      
      In case of user writes a big value by a mistake, and the thread enters
      into schedule_timeout_interruptible(), it's not possible to cancel the
      sleep by writing a new smaler value; the thread is just sleeping till
      timeout expires.
      
      The patch fixes the problem by waking the thread each time after the value
      is updated.
      
      This also may be useful for debug purposes; and also for userspace
      daemons, which change sleep_millisecs value in dependence of system load.
      
      Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomainSigned-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: NCyrill Gorcunov <gorcunov@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fcf9a0ef
    • M
      mm, fault_around: do not take a reference to a locked page · e0975b2a
      Michal Hocko 提交于
      filemap_map_pages takes a speculative reference to each page in the range
      before it tries to lock that page.  While this is correct it also can
      influence page migration which will bail out when seeing an elevated
      reference count.  The faultaround code would bail on seeing a locked page
      so we can pro-actively check the PageLocked bit before
      page_cache_get_speculative and prevent from pointless reference count
      churn.
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Suggested-by: NJan Kara <jack@suse.cz>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Reviewed-by: NWilliam Kucharski <william.kucharski@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e0975b2a
    • M
      mm, memory_hotplug: deobfuscate migration part of offlining · bb8965bd
      Michal Hocko 提交于
      Memory migration might fail during offlining and we keep retrying in that
      case.  This is currently obfuscated by goto retry loop.  The code is hard
      to follow and as a result it is even suboptimal becase each retry round
      scans the full range from start_pfn even though we have successfully
      scanned/migrated [start_pfn, pfn] range already.  This is all only because
      check_pages_isolated failure has to rescan the full range again.
      
      De-obfuscate the migration retry loop by promoting it to a real for loop.
      In fact remove the goto altogether by making it a proper double loop
      (yeah, gotos are nasty in this specific case).  In the end we will get a
      slightly more optimal code which is better readable.
      
      [akpm@linux-foundation.org: reflow comments to 80 cols]
      Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bb8965bd
    • M
      mm, memory_hotplug: try to migrate full pfn range · a85009c3
      Michal Hocko 提交于
      Patch series "few memory offlining enhancements".
      
      I have been chasing memory offlining not making progress recently.  On the
      way I have noticed few weird decisions in the code.  The migration itself
      is restricted without a reasonable justification and the retry loop around
      the migration is quite messy.  This is addressed by patch 1 and patch 2.
      
      Patch 3 is targeting on the faultaround code which has been a hot
      candidate for the initial issue reported upstream [2] and that I am
      debugging internally.  It turned out to be not the main contributor in the
      end but I believe we should address it regardless.  See the patch
      description for more details.
      
      [1] http://lkml.kernel.org/r/20181120134323.13007-1-mhocko@kernel.org
      [2] http://lkml.kernel.org/r/20181114070909.GB2653@MiWiFi-R3L-srv
      
      This patch (of 3):
      
      do_migrate_range has been limiting the number of pages to migrate to 256
      for some reason which is not documented.  Even if the limit made some
      sense back then when it was introduced it doesn't really serve a good
      purpose these days.  If the range contains huge pages then we break out of
      the loop too early and go through LRU and pcp caches draining and
      scan_movable_pages is quite suboptimal.
      
      The only reason to limit the number of pages I can think of is to reduce
      the potential time to react on the fatal signal.  But even then the number
      of pages is a questionable metric because even a single page migration
      might block in a non-killable state (e.g.  __unmap_and_move).
      
      Remove the limit and offline the full requested range (this is one
      memblock worth of pages with the current code).  Should we ever get a
      report that offlining takes too long to react on fatal signal then we
      should rather fix the core migration to use killable waits and bailout
      on a signal.
      
      Link: http://lkml.kernel.org/r/20181211142741.2607-1-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181211142741.2607-2-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a85009c3
    • M
      mm, thp, proc: report THP eligibility for each vma · 7635d9cb
      Michal Hocko 提交于
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for hg resp.  nh flags and
      confronting the state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anononymous memory VMAs
      but things have changed and shmem based vmas are supported as well these
      days and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just use the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7635d9cb
    • J
      mm/mmu_notifier: use structure for invalidate_range_start/end calls v2 · ac46d4f3
      Jérôme Glisse 提交于
      To avoid having to change many call sites everytime we want to add a
      parameter use a structure to group all parameters for the mmu_notifier
      invalidate_range_start/end cakks.  No functional changes with this patch.
      
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Acked-by: NChristian König <christian.koenig@amd.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      From: Jérôme Glisse <jglisse@redhat.com>
      Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
      
      fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
      
      Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac46d4f3
    • J
      mm/mmu_notifier: use structure for invalidate_range_start/end callback · 5d6527a7
      Jérôme Glisse 提交于
      Patch series "mmu notifier contextual informations", v2.
      
      This patchset adds contextual information, why an invalidation is
      happening, to mmu notifier callback.  This is necessary for user of mmu
      notifier that wish to maintains their own data structure without having to
      add new fields to struct vm_area_struct (vma).
      
      For instance device can have they own page table that mirror the process
      address space.  When a vma is unmap (munmap() syscall) the device driver
      can free the device page table for the range.
      
      Today we do not have any information on why a mmu notifier call back is
      happening and thus device driver have to assume that it is always an
      munmap().  This is inefficient at it means that it needs to re-allocate
      device page table on next page fault and rebuild the whole device driver
      data structure for the range.
      
      Other use case beside munmap() also exist, for instance it is pointless
      for device driver to invalidate the device page table when the
      invalidation is for the soft dirtyness tracking.  Or device driver can
      optimize away mprotect() that change the page table permission access for
      the range.
      
      This patchset enables all this optimizations for device drivers.  I do not
      include any of those in this series but another patchset I am posting will
      leverage this.
      
      The patchset is pretty simple from a code point of view.  The first two
      patches consolidate all mmu notifier arguments into a struct so that it is
      easier to add/change arguments.  The last patch adds the contextual
      information (munmap, protection, soft dirty, clear, ...).
      
      This patch (of 3):
      
      To avoid having to change many callback definition everytime we want to
      add a parameter use a structure to group all parameters for the
      mmu_notifier invalidate_range_start/end callback.  No functional changes
      with this patch.
      
      [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
      Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Acked-by: NJan Kara <jack@suse.cz>
      Acked-by: Jason Gunthorpe <jgg@mellanox.com>	[infiniband]
      Cc: Matthew Wilcox <mawilcox@microsoft.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5d6527a7
    • M
      hwpoison, memory_hotplug: allow hwpoisoned pages to be offlined · b15c8726
      Michal Hocko 提交于
      We have received a bug report that an injected MCE about faulty memory
      prevents memory offline to succeed on 4.4 base kernel.  The underlying
      reason was that the HWPoison page has an elevated reference count and the
      migration keeps failing.  There are two problems with that.  First of all
      it is dubious to migrate the poisoned page because we know that accessing
      that memory is possible to fail.  Secondly it doesn't make any sense to
      migrate a potentially broken content and preserve the memory corruption
      over to a new location.
      
      Oscar has found out that 4.4 and the current upstream kernels behave
      slightly differently with his simply testcase
      
      ===
      
      int main(void)
      {
              int ret;
              int i;
              int fd;
              char *array = malloc(4096);
              char *array_locked = malloc(4096);
      
              fd = open("/tmp/data", O_RDONLY);
              read(fd, array, 4095);
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              ret = mlock((void *)PAGE_ALIGN((unsigned long)array_locked), sizeof(array_locked));
              if (ret)
                      perror("mlock");
      
              sleep (20);
      
              ret = madvise((void *)PAGE_ALIGN((unsigned long)array_locked), 4096, MADV_HWPOISON);
              if (ret)
                      perror("madvise");
      
              for (i = 0; i < 4096; i++)
                      array_locked[i] = 'd';
      
              return 0;
      }
      ===
      
      + offline this memory.
      
      In 4.4 kernels he saw the hwpoisoned page to be returned back to the LRU
      list
      kernel:  [<ffffffff81019ac9>] dump_trace+0x59/0x340
      kernel:  [<ffffffff81019e9a>] show_stack_log_lvl+0xea/0x170
      kernel:  [<ffffffff8101ac71>] show_stack+0x21/0x40
      kernel:  [<ffffffff8132bb90>] dump_stack+0x5c/0x7c
      kernel:  [<ffffffff810815a1>] warn_slowpath_common+0x81/0xb0
      kernel:  [<ffffffff811a275c>] __pagevec_lru_add_fn+0x14c/0x160
      kernel:  [<ffffffff811a2eed>] pagevec_lru_move_fn+0xad/0x100
      kernel:  [<ffffffff811a334c>] __lru_cache_add+0x6c/0xb0
      kernel:  [<ffffffff81195236>] add_to_page_cache_lru+0x46/0x70
      kernel:  [<ffffffffa02b4373>] extent_readpages+0xc3/0x1a0 [btrfs]
      kernel:  [<ffffffff811a16d7>] __do_page_cache_readahead+0x177/0x200
      kernel:  [<ffffffff811a18c8>] ondemand_readahead+0x168/0x2a0
      kernel:  [<ffffffff8119673f>] generic_file_read_iter+0x41f/0x660
      kernel:  [<ffffffff8120e50d>] __vfs_read+0xcd/0x140
      kernel:  [<ffffffff8120e9ea>] vfs_read+0x7a/0x120
      kernel:  [<ffffffff8121404b>] kernel_read+0x3b/0x50
      kernel:  [<ffffffff81215c80>] do_execveat_common.isra.29+0x490/0x6f0
      kernel:  [<ffffffff81215f08>] do_execve+0x28/0x30
      kernel:  [<ffffffff81095ddb>] call_usermodehelper_exec_async+0xfb/0x130
      kernel:  [<ffffffff8161c045>] ret_from_fork+0x55/0x80
      
      And that latter confuses the hotremove path because an LRU page is
      attempted to be migrated and that fails due to an elevated reference
      count.  It is quite possible that the reuse of the HWPoisoned page is some
      kind of fixed race condition but I am not really sure about that.
      
      With the upstream kernel the failure is slightly different.  The page
      doesn't seem to have LRU bit set but isolate_movable_page simply fails and
      do_migrate_range simply puts all the isolated pages back to LRU and
      therefore no progress is made and scan_movable_pages finds same set of
      pages over and over again.
      
      Fix both cases by explicitly checking HWPoisoned pages before we even try
      to get reference on the page, try to unmap it if it is still mapped.  As
      explained by Naoya:
      
      : Hwpoison code never unmapped those for no big reason because
      : Ksm pages never dominate memory, so we simply didn't have strong
      : motivation to save the pages.
      
      Also put WARN_ON(PageLRU) in case there is a race and we can hit LRU
      HWPoison pages which shouldn't happen but I couldn't convince myself about
      that.  Naoya has noted the following:
      
      : Theoretically no such gurantee, because try_to_unmap() doesn't have a
      : guarantee of success and then memory_failure() returns immediately
      : when hwpoison_user_mappings fails.
      : Or the following code (comes after hwpoison_user_mappings block) also impli=
      : es
      : that the target page can still have PageLRU flag.
      :
      :         /*
      :          * Torn down by someone else?
      :          */
      :         if (PageLRU(p) && !PageSwapCache(p) && p->mapping =3D=3D NULL) {
      :                 action_result(pfn, MF_MSG_TRUNCATED_LRU, MF_IGNORED);
      :                 res =3D -EBUSY;
      :                 goto out;
      :         }
      :
      : So I think it's OK to keep "if (WARN_ON(PageLRU(page)))" block in
      : current version of your patch.
      
      Link: http://lkml.kernel.org/r/20181206120135.14079-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.com>
      Debugged-by: NOscar Salvador <osalvador@suse.com>
      Tested-by: NOscar Salvador <osalvador@suse.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b15c8726
    • O
      mm, kmemleak: little optimization while scanning · 9f1eb38e
      Oscar Salvador 提交于
      kmemleak_scan() goes through all online nodes and tries to scan all used
      pages.
      
      We can do better and use pfn_to_online_page(), so in case we have
      CONFIG_MEMORY_HOTPLUG, offlined pages will be skiped automatically.  For
      boxes where CONFIG_MEMORY_HOTPLUG is not present, pfn_to_online_page()
      will fallback to pfn_valid().
      
      Another little optimization is to check if the page belongs to the node we
      are currently checking, so in case we have nodes interleaved we will not
      check the same pfn multiple times.
      
      I ran some tests:
      
      Add some memory to node1 and node2 making it interleaved:
      
      (qemu) object_add memory-backend-ram,id=ram0,size=1G
      (qemu) device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      (qemu) object_add memory-backend-ram,id=ram1,size=1G
      (qemu) device_add pc-dimm,id=dimm1,memdev=ram1,node=2
      (qemu) object_add memory-backend-ram,id=ram2,size=1G
      (qemu) device_add pc-dimm,id=dimm2,memdev=ram2,node=1
      
      Then, we offline that memory:
       # for i in {32..39} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;done
       # for i in {48..55} ; do echo "offline" > /sys/devices/system/node/node1/memory$i/state;don
       # for i in {40..47} ; do echo "offline" > /sys/devices/system/node/node2/memory$i/state;done
      
      And we run kmemleak_scan:
      
       # echo "scan" > /sys/kernel/debug/kmemleak
      
      before the patch:
      
      kmemleak: time spend: 41596 us
      
      after the patch:
      
      kmemleak: time spend: 34899 us
      
      [akpm@linux-foundation.org: remove stray newline, per Oscar]
      Link: http://lkml.kernel.org/r/20181206131918.25099-1-osalvador@suse.deSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9f1eb38e
    • K
    • O
      mm/page_alloc.c: drop uneeded __meminit and __meminitdata · bbe5d993
      Oscar Salvador 提交于
      Since commit 03e85f9d ("mm/page_alloc: Introduce
      free_area_init_core_hotplug"), some functions changed to only be called
      during system initialization.  Concretly, free_area_init_node() and the
      functions that hang from it.
      
      Also, some variables are no longer used after the system has gone
      through initialization.  So this could be considered as a late clean-up
      for that patch.
      
      This patch changes the functions from __meminit to __init, and the
      variables from __meminitdata to __initdata.
      
      In return, we get some KBs back:
      
      Before:
        Freeing unused kernel image memory: 2472K
      
      After:
        Freeing unused kernel image memory: 2480K
      
      Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.deSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NWei Yang <richard.weiyang@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbe5d993
    • B
      mm/page-writeback.c: don't break integrity writeback on ->writepage() error · 3fa750dc
      Brian Foster 提交于
      write_cache_pages() is used in both background and integrity writeback
      scenarios by various filesystems.  Background writeback is mostly
      concerned with cleaning a certain number of dirty pages based on various
      mm heuristics.  It may not write the full set of dirty pages or wait for
      I/O to complete.  Integrity writeback is responsible for persisting a set
      of dirty pages before the writeback job completes.  For example, an
      fsync() call must perform integrity writeback to ensure data is on disk
      before the call returns.
      
      write_cache_pages() unconditionally breaks out of its processing loop in
      the event of a ->writepage() error.  This is fine for background
      writeback, which had no strict requirements and will eventually come
      around again.  This can cause problems for integrity writeback on
      filesystems that might need to clean up state associated with failed page
      writeouts.  For example, XFS performs internal delayed allocation
      accounting before returning a ->writepage() error, where applicable.  If
      the current writeback happens to be associated with an unmount and
      write_cache_pages() completes the writeback prematurely due to error, the
      filesystem is unmounted in an inconsistent state if dirty+delalloc pages
      still exist.
      
      To handle this problem, update write_cache_pages() to always process the
      full set of pages for integrity writeback regardless of ->writepage()
      errors.  Save the first encountered error and return it to the caller once
      complete.  This facilitates XFS (or any other fs that expects integrity
      writeback to process the entire set of dirty pages) to clean up its
      internal state completely in the event of persistent mapping errors.
      Background writeback continues to exit on the first error encountered.
      
      [akpm@linux-foundation.org: fix typo in comment]
      Link: http://lkml.kernel.org/r/20181116134304.32440-1-bfoster@redhat.comSigned-off-by: NBrian Foster <bfoster@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3fa750dc
    • Y
      mm/hmm.c: remove set but not used variable 'devmem' · 0ecea993
      YueHaibing 提交于
      Fixes gcc '-Wunused-but-set-variable' warning:
      
      mm/hmm.c: In function 'hmm_devmem_ref_kill':
      mm/hmm.c:995:21: warning:
       variable 'devmem' set but not used [-Wunused-but-set-variable]
      
      It not used any more since 35d39f953d4e ("mm, hmm: replace
      hmm_devmem_pages_create() with devm_memremap_pages()")
      
      Link: http://lkml.kernel.org/r/1543629971-128377-1-git-send-email-yuehaibing@huawei.comSigned-off-by: NYueHaibing <yuehaibing@huawei.com>
      Reviewed-by: NReviewed-by: Jérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0ecea993
    • W
      mm, hotplug: move init_currently_empty_zone() under zone_span_lock protection · fa004ab7
      Wei Yang 提交于
      During online_pages phase, pgdat->nr_zones will be updated in case this
      zone is empty.
      
      Currently the online_pages phase is protected by the global locks
      (device_device_hotplug_lock and mem_hotplug_lock), which ensures there is
      no contention during the update of nr_zones.
      
      These global locks introduces scalability issues (especially the second
      one), which slow down code relying on get_online_mems().  This is also a
      preparation for not having to rely on get_online_mems() but instead some
      more fine grained locks.
      
      The patch moves init_currently_empty_zone under both zone_span_writelock
      and pgdat_resize_lock because both the pgdat state is changed (nr_zones)
      and the zone's start_pfn.  Also this patch changes the documentation of
      node_size_lock to include the protection of nr_zones.
      
      Link: http://lkml.kernel.org/r/20181203205016.14123-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fa004ab7
    • W
      mm, sparse: pass nid instead of pgdat to sparse_add_one_section() · 4e0d2e7e
      Wei Yang 提交于
      Since the information needed in sparse_add_one_section() is node id to
      allocate proper memory, it is not necessary to pass its pgdat.
      
      This patch changes the prototype of sparse_add_one_section() to pass node
      id directly.  This is intended to reduce misleading that
      sparse_add_one_section() would touch pgdat.
      
      Link: http://lkml.kernel.org/r/20181204085657.20472-2-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e0d2e7e
    • W
      mm, sparse: drop pgdat_resize_lock in sparse_add/remove_one_section() · 83af6588
      Wei Yang 提交于
      pgdat_resize_lock is used to protect pgdat's memory region information
      like: node_start_pfn, node_present_pages, etc.  While in function
      sparse_add/remove_one_section(), pgdat_resize_lock is used to protect
      initialization/release of one mem_section.  This looks not proper.
      
      These code paths are currently protected by mem_hotplug_lock currently but
      should there ever be any reason for locking at the sparse layer a
      dedicated lock should be introduced.
      
      Following is the current call trace of sparse_add/remove_one_section()
      
          mem_hotplug_begin()
          arch_add_memory()
             add_pages()
                 __add_pages()
                     __add_section()
                         sparse_add_one_section()
          mem_hotplug_done()
      
          mem_hotplug_begin()
          arch_remove_memory()
              __remove_pages()
                  __remove_section()
                      sparse_remove_one_section()
          mem_hotplug_done()
      
      The comment above the pgdat_resize_lock also mentions "Holding this will
      also guarantee that any pfn_valid() stays that way.", which is true with
      the current implementation and false after this patch.  But current
      implementation doesn't meet this comment.  There isn't any pfn walkers to
      take the lock so this looks like a relict from the past.  This patch also
      removes this comment.
      
      [richard.weiyang@gmail.com: v4]
        Link: http://lkml.kernel.org/r/20181204085657.20472-1-richard.weiyang@gmail.com
      [mhocko@suse.com: changelog suggestion]
      Link: http://lkml.kernel.org/r/20181128091243.19249-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83af6588
    • Q
      mm/memblock.c: skip kmemleak for kasan_init() · fed84c78
      Qian Cai 提交于
      Kmemleak does not play well with KASAN (tested on both HPE Apollo 70 and
      Huawei TaiShan 2280 aarch64 servers).
      
      After calling start_kernel()->setup_arch()->kasan_init(), kmemleak early
      log buffer went from something like 280 to 260000 which caused kmemleak
      disabled and crash dump memory reservation failed.  The multitude of
      kmemleak_alloc() calls is from nested loops while KASAN is setting up full
      memory mappings, so let early kmemleak allocations skip those
      memblock_alloc_internal() calls came from kasan_init() given that those
      early KASAN memory mappings should not reference to other memory.  Hence,
      no kmemleak false positives.
      
      kasan_init
        kasan_map_populate [1]
          kasan_pgd_populate [2]
            kasan_pud_populate [3]
              kasan_pmd_populate [4]
                kasan_pte_populate [5]
                  kasan_alloc_zeroed_page
                    memblock_alloc_try_nid
                      memblock_alloc_internal
                        kmemleak_alloc
      
      [1] for_each_memblock(memory, reg)
      [2] while (pgdp++, addr = next, addr != end)
      [3] while (pudp++, addr = next, addr != end && pud_none(READ_ONCE(*pudp)))
      [4] while (pmdp++, addr = next, addr != end && pmd_none(READ_ONCE(*pmdp)))
      [5] while (ptep++, addr = next, addr != end && pte_none(READ_ONCE(*ptep)))
      
      Link: http://lkml.kernel.org/r/1543442925-17794-1-git-send-email-cai@gmx.usSigned-off-by: NQian Cai <cai@gmx.us>
      Acked-by: NCatalin Marinas <catalin.marinas@arm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fed84c78
    • O
      mm, memory_hotplug: add nid parameter to arch_remove_memory · 2c2a5af6
      Oscar Salvador 提交于
      Patch series "Do not touch pages in hot-remove path", v2.
      
      This patchset aims for two things:
      
       1) A better definition about offline and hot-remove stage
       2) Solving bugs where we can access non-initialized pages
          during hot-remove operations [2] [3].
      
      This is achieved by moving all page/zone handling to the offline
      stage, so we do not need to access pages when hot-removing memory.
      
      [1] https://patchwork.kernel.org/cover/10691415/
      [2] https://patchwork.kernel.org/patch/10547445/
      [3] https://www.spinics.net/lists/linux-mm/msg161316.html
      
      This patch (of 5):
      
      This is a preparation for the following-up patches.  The idea of passing
      the nid is that it will allow us to get rid of the zone parameter
      afterwards.
      
      Link: http://lkml.kernel.org/r/20181127162005.15833-2-osalvador@suse.deSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c2a5af6
    • W
      mm: check nr_initialised with PAGES_PER_SECTION directly in defer_init() · 23b68cfa
      Wei Yang 提交于
      When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of
      each node's highest zone is initialized before defer stage.
      
      static_init_pgcnt is used to store the number of pages like this:
      
          pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                                    pgdat->node_spanned_pages);
      
      because we don't want to overflow zone's range.
      
      But this is not necessary, since defer_init() is called like this:
      
        memmap_init_zone()
          for pfn in [start_pfn, end_pfn)
            defer_init(pfn, end_pfn)
      
      In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
      stop before calling defer_init().
      
      BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
      since nr_initialised is zone based instead of node based.  Even
      node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
      would have pages less than PAGES_PER_SECTION.
      
      Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.comSigned-off-by: NWei Yang <richard.weiyang@gmail.com>
      Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      23b68cfa
    • H
      mm: put_and_wait_on_page_locked() while page is migrated · 9a1ea439
      Hugh Dickins 提交于
      Waiting on a page migration entry has used wait_on_page_locked() all along
      since 2006: but you cannot safely wait_on_page_locked() without holding a
      reference to the page, and that extra reference is enough to make
      migrate_page_move_mapping() fail with -EAGAIN, when a racing task faults
      on the entry before migrate_page_move_mapping() gets there.
      
      And that failure is retried nine times, amplifying the pain when trying to
      migrate a popular page.  With a single persistent faulter, migration
      sometimes succeeds; with two or three concurrent faulters, success becomes
      much less likely (and the more the page was mapped, the worse the overhead
      of unmapping and remapping it on each try).
      
      This is especially a problem for memory offlining, where the outer level
      retries forever (or until terminated from userspace), because a heavy
      refault workload can trigger an endless loop of migration failures.
      wait_on_page_locked() is the wrong tool for the job.
      
      David Herrmann (but was he the first?) noticed this issue in 2014:
      https://marc.info/?l=linux-mm&m=140110465608116&w=2
      
      Tim Chen started a thread in August 2017 which appears relevant:
      https://marc.info/?l=linux-mm&m=150275941014915&w=2 where Kan Liang went
      on to implicate __migration_entry_wait():
      https://marc.info/?l=linux-mm&m=150300268411980&w=2 and the thread ended
      up with the v4.14 commits: 2554db91 ("sched/wait: Break up long wake
      list walk") 11a19c7b ("sched/wait: Introduce wakeup boomark in
      wake_up_page_bit")
      
      Baoquan He reported "Memory hotplug softlock issue" 14 November 2018:
      https://marc.info/?l=linux-mm&m=154217936431300&w=2
      
      We have all assumed that it is essential to hold a page reference while
      waiting on a page lock: partly to guarantee that there is still a struct
      page when MEMORY_HOTREMOVE is configured, but also to protect against
      reuse of the struct page going to someone who then holds the page locked
      indefinitely, when the waiter can reasonably expect timely unlocking.
      
      But in fact, so long as wait_on_page_bit_common() does the put_page(), and
      is careful not to rely on struct page contents thereafter, there is no
      need to hold a reference to the page while waiting on it.  That does mean
      that this case cannot go back through the loop: but that's fine for the
      page migration case, and even if used more widely, is limited by the "Stop
      walking if it's locked" optimization in wake_page_function().
      
      Add interface put_and_wait_on_page_locked() to do this, using "behavior"
      enum in place of "lock" arg to wait_on_page_bit_common() to implement it.
      No interruptible or killable variant needed yet, but they might follow: I
      have a vague notion that reporting -EINTR should take precedence over
      return from wait_on_page_bit_common() without knowing the page state, so
      arrange it accordingly - but that may be nothing but pedantic.
      
      __migration_entry_wait() still has to take a brief reference to the page,
      prior to calling put_and_wait_on_page_locked(): but now that it is dropped
      before waiting, the chance of impeding page migration is very much
      reduced.  Should we perhaps disable preemption across this?
      
      shrink_page_list()'s __ClearPageLocked(): that was a surprise!  This
      survived a lot of testing before that showed up.  PageWaiters may have
      been set by wait_on_page_bit_common(), and the reference dropped, just
      before shrink_page_list() succeeds in freezing its last page reference: in
      such a case, unlock_page() must be used.  Follow the suggestion from
      Michal Hocko, just revert a978d6f5 ("mm: unlockless reclaim") now:
      that optimization predates PageWaiters, and won't buy much these days; but
      we can reinstate it for the !PageWaiters case if anyone notices.
      
      It does raise the question: should vmscan.c's is_page_cache_freeable() and
      __remove_mapping() now treat a PageWaiters page as if an extra reference
      were held?  Perhaps, but I don't think it matters much, since
      shrink_page_list() already had to win its trylock_page(), so waiters are
      not very common there: I noticed no difference when trying the bigger
      change, and it's surely not needed while put_and_wait_on_page_locked() is
      only used for page migration.
      
      [willy@infradead.org: add put_and_wait_on_page_locked() kerneldoc]
      Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261121330.1116@eggly.anvilsSigned-off-by: NHugh Dickins <hughd@google.com>
      Reported-by: NBaoquan He <bhe@redhat.com>
      Tested-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Kan Liang <kan.liang@intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Nick Piggin <npiggin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9a1ea439
    • Y
      mm, oom: add oom victim's memcg to the oom context information · f0c867d9
      yuzhoujian 提交于
      The current oom report doesn't display victim's memcg context during the
      global OOM situation.  While this information is not strictly needed, it
      can be really helpful for containerized environments to locate which
      container has lost a process.  Now that we have a single line for the oom
      context, we can trivially add both the oom memcg (this can be either
      global_oom or a specific memcg which hits its hard limits) and task_memcg
      which is the victim's memcg.
      
      Below is the single line output in the oom report after this patch.
      
      - global oom context information:
      
      oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,global_oom,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
      
      - memcg oom context information:
      
      oom-kill:constraint=<constraint>,nodemask=<nodemask>,cpuset=<cpuset>,mems_allowed=<mems_allowed>,oom_memcg=<memcg>,task_memcg=<memcg>,task=<comm>,pid=<pid>,uid=<uid>
      
      [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
        Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
      Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.comSigned-off-by: Nyuzhoujian <yuzhoujian@didichuxing.com>
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f0c867d9
    • Y
      mm, oom: reorganize the oom report in dump_header · ef8444ea
      yuzhoujian 提交于
      OOM report contains several sections.  The first one is the allocation
      context that has triggered the OOM.  Then we have cpuset context followed
      by the stack trace of the OOM path.  The tird one is the OOM memory
      information.  Followed by the current memory state of all system tasks.
      At last, we will show oom eligible tasks and the information about the
      chosen oom victim.
      
      One thing that makes parsing more awkward than necessary is that we do not
      have a single and easily parsable line about the oom context.  This patch
      is reorganizing the oom report to
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (contrains and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context at a single line which
      makes parsing much easier.
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.comSigned-off-by: Nyuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ef8444ea
    • A
    • A
    • A
    • M
      mm: reclaim small amounts of memory when an external fragmentation event occurs · 1c30844d
      Mel Gorman 提交于
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation causing
      events occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark and kswapd is woken if the
      calling context allows to reclaim an amount of memory relative to the size
      of the high watermark and the watermark_boost_factor until the boost is
      cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation causing events are massively reduced by
      this path whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c30844d
    • M
      mm: use alloc_flags to record if kswapd can wake · 0a79cdad
      Mel Gorman 提交于
      This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
      into alloc_flags.  This is a preparation patch only that avoids having to
      pass gfp_mask through a long callchain in a future patch.
      
      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a79cdad
    • M
      mm: move zone watermark accesses behind an accessor · a9214443
      Mel Gorman 提交于
      This is a preparation patch only, no functional change.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9214443
    • M
      mm, page_alloc: spread allocations across zones before introducing fragmentation · 6bb15450
      Mel Gorman 提交于
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload, a definition
      of external fragmentation events (including serious) ones and a series
      that reduces the level of those fragmentation events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameterr create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup
      
      Overall the series reduces external fragmentation causing events by over 94%
      on 1 and 2 socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patch 2-4 boosts watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock are still allowed in the fast
      path as they are uninteresting from a fragmentation point of view.
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup the test files.
      
      Note that due to the use of IO and page cache that this benchmark is not
      suitable for running on large machines where the time to fragment memory
      may be excessive.  Also note that while this is one mix that generates
      fragmentation that it's not the only mix that generates fragmentation.
      Differences in workload that are more slab-intensive or whether SLUB is
      used with high-order pages may yield different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metric is allocation
      latency and huge page allocation success rates but note that differences
      in latencies and what the success rate also can affect the number of
      external fragmentation event which is why it's a secondary metric.
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious fragmentation
      events reduces the changes of external fragmentation being a problem in
      the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_UNRECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_UNRECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations and this is
      consistent for the workload throughout the series so will not be presented
      again as the primary source of fallbacks are movable allocations.
      
      Movable fallbacks are sometimes considered "ok" to fallback because they
      can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock causing those allocations to fallback
      later and polluting pageblocks with pages that cannot move.  If there is a
      movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock and while it might not be enough to
      actually cause a unmovable/reclaimable fallback in the future, we cannot
      know that in advance so the patch takes the only option available to it.
      Hence, it's important to control them.  This point is also consistent
      throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is struggling and there is quite a lot of parallel reclaim and
      compaction activity.  There is also a certain degree of luck on whether
      processes start on node 0 or not for this patch but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6bb15450
    • D
      mm/memory_hotplug: drop "online" parameter from add_memory_resource() · f29d8e9c
      David Hildenbrand 提交于
      Userspace should always be in charge of how to online memory and if memory
      should be onlined automatically in the kernel.  Let's drop the parameter
      to overwrite this - XEN passes memhp_auto_online, just like add_memory(),
      so we can directly use that instead internally.
      
      Link: http://lkml.kernel.org/r/20181123123740.27652-1-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NJuergen Gross <jgross@suse.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Stefano Stabellini <sstabellini@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f29d8e9c
    • M
      memblock: replace usage of __memblock_free_early() with memblock_free() · 4d72868c
      Mike Rapoport 提交于
      __memblock_free_early() is only used by the convenience wrappers, so
      essentially we wrap a call to memblock_free() twice.  Replace calls of
      __memblock_free_early() with calls to memblock_free() and drop the former.
      
      Link: http://lkml.kernel.org/r/20181125102940.GE28634@rapoport-lnxSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Wentao Wang <witallwang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4d72868c
    • W
    • A
      mm/page_alloc.c: use a single function to free page · 742aa7fb
      Aaron Lu 提交于
      There are multiple places of freeing a page, they all do the same things
      so a common function can be used to reduce code duplicate.
      
      It also avoids bug fixed in one function but left in another.
      
      Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pawel Staszewski <pstaszewski@itcare.pl>
      Cc: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      742aa7fb
    • A
      mm/page_alloc.c: free order-0 pages through PCP in page_frag_free() · 65895b67
      Aaron Lu 提交于
      page_frag_free() calls __free_pages_ok() to free the page back to Buddy.
      This is OK for high order page, but for order-0 pages, it misses the
      optimization opportunity of using Per-Cpu-Pages and can cause zone lock
      contention when called frequently.
      
      Pawel Staszewski recently shared his result of 'how Linux kernel handles
      normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
      lock contention comes from page allocator:
      
        mlx5e_poll_tx_cq
        |
         --16.34%--napi_consume_skb
                   |
                   |--12.65%--__free_pages_ok
                   |          |
                   |           --11.86%--free_one_page
                   |                     |
                   |                     |--10.10%--queued_spin_lock_slowpath
                   |                     |
                   |                      --0.65%--_raw_spin_lock
                   |
                   |--1.55%--page_frag_free
                   |
                    --1.44%--skb_release_data
      
      Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
      not effective in this workload and pages have to go through the page
      allocator.  The lock contention happens during mlx5 DMA TX completion
      cycle.  And the page allocator cannot keep up at these speeds.[2]
      
      I thought that __free_pages_ok() are mostly freeing high order pages and
      thought this is an lock contention for high order pages but Jesper
      explained in detail that __free_pages_ok() here are actually freeing
      order-0 pages because mlx5 is using order-0 pages to satisfy its page pool
      allocation request.[3]
      
      The free path as pointed out by Jesper is:
      skb_free_head()
        -> skb_free_frag()
          -> page_frag_free()
      And the pages being freed on this path are order-0 pages.
      
      Fix this by doing similar things as in __page_frag_cache_drain() - send
      the being freed page to PCP if it's an order-0 page, or directly to Buddy
      if it is a high order page.
      
      With this change, Paweł hasn't noticed lock contention yet in his
      workload and Jesper has noticed a 7% performance improvement using a micro
      benchmark and lock contention is gone.  Ilias' test on a 'low' speed 1Gbit
      interface on an cortex-a53 shows ~11% performance boost testing with
      64byte packets and __free_pages_ok() disappeared from perf top.
      
      [1]: https://www.spinics.net/lists/netdev/msg531362.html
      [2]: https://www.spinics.net/lists/netdev/msg531421.html
      [3]: https://www.spinics.net/lists/netdev/msg531556.html
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Reported-by: NPawel Staszewski <pstaszewski@itcare.pl>
      Analysed-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Acked-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
      Tested-by: NIlias Apalodimas <ilias.apalodimas@linaro.org>
      Acked-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Acked-by: NTariq Toukan <tariqt@mellanox.com>
      Acked-by: NPankaj gupta <pagupta@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      65895b67
    • D
      mm, hmm: mark hmm_devmem_{add, add_resource} EXPORT_SYMBOL_GPL · 02917e9f
      Dan Williams 提交于
      At Maintainer Summit, Greg brought up a topic I proposed around
      EXPORT_SYMBOL_GPL usage.  The motivation was considerations for when
      EXPORT_SYMBOL_GPL is warranted and the criteria for taking the exceptional
      step of reclassifying an existing export.  Specifically, I wanted to make
      the case that although the line is fuzzy and hard to specify in abstract
      terms, it is nonetheless clear that devm_memremap_pages() and HMM
      (Heterogeneous Memory Management) have crossed it.  The
      devm_memremap_pages() facility should have been EXPORT_SYMBOL_GPL from the
      beginning, and HMM as a derivative of that functionality should have
      naturally picked up that designation as well.
      
      Contrary to typical rules, the HMM infrastructure was merged upstream with
      zero in-tree consumers.  There was a promise at the time that those users
      would be merged "soon", but it has been over a year with no drivers
      arriving.  While the Nouveau driver is about to belatedly make good on
      that promise it is clear that HMM was targeted first and foremost at an
      out-of-tree consumer.
      
      HMM is derived from devm_memremap_pages(), a facility Christoph and I
      spearheaded to support persistent memory.  It combines a device lifetime
      model with a dynamically created 'struct page' / memmap array for any
      physical address range.  It enables coordination and control of the many
      code paths in the kernel built to interact with memory via 'struct page'
      objects.  With HMM the integration goes even deeper by allowing device
      drivers to hook and manipulate page fault and page free events.
      
      One interpretation of when EXPORT_SYMBOL is suitable is when it is
      exporting stable and generic leaf functionality.  The
      devm_memremap_pages() facility continues to see expanding use cases,
      peer-to-peer DMA being the most recent, with no clear end date when it
      will stop attracting reworks and semantic changes.  It is not suitable to
      export devm_memremap_pages() as a stable 3rd party driver API due to the
      fact that it is still changing and manipulates core behavior.  Moreover,
      it is not in the best interest of the long term development of the core
      memory management subsystem to permit any external driver to effectively
      define its own system-wide memory management policies with no
      encouragement to engage with upstream.
      
      I am also concerned that HMM was designed in a way to minimize further
      engagement with the core-MM.  That, with these hooks in place,
      device-drivers are free to implement their own policies without much
      consideration for whether and how the core-MM could grow to meet that
      need.  Going forward not only should HMM be EXPORT_SYMBOL_GPL, but the
      core-MM should be allowed the opportunity and stimulus to change and
      address these new use cases as first class functionality.
      
      Original changelog:
      
      hmm_devmem_add(), and hmm_devmem_add_resource() duplicated
      devm_memremap_pages() and are now simple now wrappers around the core
      facility to inject a dev_pagemap instance into the global pgmap_radix and
      hook page-idle events.  The devm_memremap_pages() interface is base
      infrastructure for HMM.  HMM has more and deeper ties into the kernel
      memory management implementation than base ZONE_DEVICE which is itself a
      EXPORT_SYMBOL_GPL facility.
      
      Originally, the HMM page structure creation routines copied the
      devm_memremap_pages() code and reused ZONE_DEVICE.  A cleanup to unify the
      implementations was discussed during the initial review:
      http://lkml.iu.edu/hypermail/linux/kernel/1701.2/00812.html Recent work to
      extend devm_memremap_pages() for the peer-to-peer-DMA facility enabled
      this cleanup to move forward.
      
      In addition to the integration with devm_memremap_pages() HMM depends on
      other GPL-only symbols:
      
          mmu_notifier_unregister_no_release
          percpu_ref
          region_intersects
          __class_create
      
      It goes further to consume / indirectly expose functionality that is not
      exported to any other driver:
      
          alloc_pages_vma
          walk_page_range
      
      HMM is derived from devm_memremap_pages(), and extends deep core-kernel
      fundamentals. Similar to devm_memremap_pages(), mark its entry points
      EXPORT_SYMBOL_GPL().
      
      [logang@deltatee.com: PCI/P2PDMA: match interface changes to devm_memremap_pages()]
        Link: http://lkml.kernel.org/r/20181130225911.2900-1-logang@deltatee.com
      Link: http://lkml.kernel.org/r/154275560565.76910.15919297436557795278.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>,
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      02917e9f
    • D
      mm, hmm: replace hmm_devmem_pages_create() with devm_memremap_pages() · bbecd94e
      Dan Williams 提交于
      Commit e8d51348 ("memremap: change devm_memremap_pages interface to
      use struct dev_pagemap") refactored devm_memremap_pages() to allow a
      dev_pagemap instance to be supplied.  Passing in a dev_pagemap interface
      simplifies the design of pgmap type drivers in that they can rely on
      container_of() to lookup any private data associated with the given
      dev_pagemap instance.
      
      In addition to the cleanups this also gives hmm users multi-order-radix
      improvements that arrived with commit ab1b597e "mm,
      devm_memremap_pages: use multi-order radix for ZONE_DEVICE lookups"
      
      As part of the conversion to the devm_memremap_pages() method of
      handling the percpu_ref relative to when pages are put, the percpu_ref
      completion needs to move to hmm_devmem_ref_exit().  See 71389703
      ("mm, zone_device: Replace {get, put}_zone_device_page...") for details.
      
      Link: http://lkml.kernel.org/r/154275560053.76910.10870962637383152392.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbecd94e
    • D
      mm, hmm: use devm semantics for hmm_devmem_{add, remove} · 58ef15b7
      Dan Williams 提交于
      devm semantics arrange for resources to be torn down when
      device-driver-probe fails or when device-driver-release completes.
      Similar to devm_memremap_pages() there is no need to support an explicit
      remove operation when the users properly adhere to devm semantics.
      
      Note that devm_kzalloc() automatically handles allocating node-local
      memory.
      
      Link: http://lkml.kernel.org/r/154275559545.76910.9186690723515469051.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Reviewed-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58ef15b7
    • H
      mm/page_alloc.c: change the order of MIGRATE_RECLAIMABLE/MIGRATE_MOVABLE in fallbacks · 7ead3342
      Huang Shijie 提交于
      In the enum migratetype definition, MIGRATE_MOVABLE is before
      MIGRATE_RECLAIMABLE.  Change the order of them to match the enumeration's
      order.
      
      Link: http://lkml.kernel.org/r/20181121085821.3442-1-sjhuang@iluvatar.aiSigned-off-by: NHuang Shijie <sjhuang@iluvatar.ai>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ead3342
    • A
      mm/swap: use nr_node_ids for avail_lists in swap_info_struct · 66f71da9
      Aaron Lu 提交于
      Since a2468cc9 ("swap: choose swap device according to numa node"),
      avail_lists field of swap_info_struct is changed to an array with
      MAX_NUMNODES elements.  This made swap_info_struct size increased to 40KiB
      and needs an order-4 page to hold it.
      
      This is not optimal in that:
      1 Most systems have way less than MAX_NUMNODES(1024) nodes so it
        is a waste of memory;
      2 It could cause swapon failure if the swap device is swapped on
        after system has been running for a while, due to no order-4
        page is available as pointed out by Vasily Averin.
      
      Solve the above two issues by using nr_node_ids(which is the actual
      possible node number the running system has) for avail_lists instead of
      MAX_NUMNODES.
      
      nr_node_ids is unknown at compile time so can't be directly used when
      declaring this array.  What I did here is to declare avail_lists as zero
      element array and allocate space for it when allocating space for
      swap_info_struct.  The reason why keep using array but not pointer is
      plist_for_each_entry needs the field to be part of the struct, so pointer
      will not work.
      
      This patch is on top of Vasily Averin's fix commit.  I think the use of
      kvzalloc for swap_info_struct is still needed in case nr_node_ids is
      really big on some systems.
      
      Link: http://lkml.kernel.org/r/20181115083847.GA11129@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vasily Averin <vvs@virtuozzo.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      66f71da9