1. 04 October 2017, 1 commit
    • mm, memory_hotplug: add scheduling point to __add_pages · f64ac5e6
      Committed by Michal Hocko
      Patch series "mm, memory_hotplug: fix few soft lockups in memory
      hotadd".
      
      Johannes has noticed a few soft lockups when adding a large nvdimm device.
      All of them were caused by a long loop without any explicit cond_resched,
      which is a problem for !PREEMPT kernels.
      
      The fix is quite straightforward.  Just make sure that cond_resched gets
      called from time to time.
      
      This patch (of 3):
      
      __add_pages gets a pfn range to add and there is no upper bound for a
      single call.  This is usually a memory block aligned size for regular
      memory hotplug - smaller sizes are usual for memory ballooning drivers,
      or the whole NUMA node for physical memory online.  There is no
      explicit scheduling point in that code path though.
      
      This can lead to long latencies while __add_pages is executed and we
      have even seen a soft lockup report during nvdimm initialization with a
      !PREEMPT kernel:
      
        NMI watchdog: BUG: soft lockup - CPU#11 stuck for 23s! [kworker/u641:3:832]
        [...]
        Workqueue: events_unbound async_run_entry_fn
        task: ffff881809270f40 ti: ffff881809274000 task.ti: ffff881809274000
        RIP: _raw_spin_unlock_irqrestore+0x11/0x20
        RSP: 0018:ffff881809277b10  EFLAGS: 00000286
        [...]
        Call Trace:
          sparse_add_one_section+0x13d/0x18e
          __add_pages+0x10a/0x1d0
          arch_add_memory+0x4a/0xc0
          devm_memremap_pages+0x29d/0x430
          pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
          nvdimm_bus_probe+0x64/0x110 [libnvdimm]
          driver_probe_device+0x1f7/0x420
          bus_for_each_drv+0x52/0x80
          __device_attach+0xb0/0x130
          bus_probe_device+0x87/0xa0
          device_add+0x3fc/0x5f0
          nd_async_device_register+0xe/0x40 [libnvdimm]
          async_run_entry_fn+0x43/0x150
          process_one_work+0x14e/0x410
          worker_thread+0x116/0x490
          kthread+0xc7/0xe0
          ret_from_fork+0x3f/0x70
        DWARF2 unwinder stuck at ret_from_fork+0x3f/0x70
      
      Fix this by adding cond_resched once per each memory section in the
      given pfn range.  Each section is a constant amount of work which itself
      is not too expensive, but many of them will just add up.
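
      A minimal sketch of the resulting loop in __add_pages (close to, but not
      necessarily identical to, the mainline diff; __add_section() and
      want_memblock come from earlier patches in this area):

        for (i = start_sec; i <= end_sec; i++) {
                err = __add_section(nid, section_nr_to_pfn(i), want_memblock);
                /* EEXIST is harmless: the section was already present */
                if (err && (err != -EEXIST))
                        break;
                err = 0;
                /* one section is a bounded amount of work; yielding here keeps
                 * !PREEMPT kernels from soft lockups on huge (nvdimm) ranges */
                cond_resched();
        }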
      
      Link: http://lkml.kernel.org/r/20170918121410.24466-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Johannes Thumshirn <jthumshirn@suse.de>
      Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Dan Williams <dan.j.williams@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 09 September 2017, 2 commits
    • mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory · 5042db43
      Committed by Jérôme Glisse
      HMM (heterogeneous memory management) needs struct page to support
      migration from system main memory to device memory.  The reasons for HMM
      and for migration to device memory are explained in the HMM core patch.
      
      This patch deals with device memory that is unaddressable (i.e. the CPU
      cannot access it).  Hence we do not want those struct pages to be managed
      like regular memory.  That is why we extend ZONE_DEVICE to support
      different types of memory.
      
      A persistent memory type is defined for existing users of ZONE_DEVICE and
      a new device unaddressable type is added for the unaddressable memory
      type.  There is a clear separation between what is expected from each
      memory type, and existing users of ZONE_DEVICE are unaffected by the new
      requirements and the new use of the unaddressable type.  All
      type-specific code paths are protected by tests against the memory type.
      
      Because the memory is unaddressable, we use a new special swap type for
      when a page is migrated to device memory (this reduces the maximum
      number of swap files).
      
      The two main additions besides the memory type for ZONE_DEVICE are two
      callbacks.  The first one, page_free(), is called whenever the page
      refcount reaches 1 (which means the page is free, as ZONE_DEVICE pages
      never reach a refcount of 0).  This allows the device driver to manage
      its memory and the associated struct pages.
      
      The second callback, page_fault(), is invoked when there is a CPU access
      to an address that is backed by a device page (which is unaddressable by
      the CPU).  This callback is responsible for migrating the page back to
      system main memory.  The device driver cannot block migration back to
      system memory; HMM makes sure that such pages cannot be pinned into
      device memory.
      
      If the device is in an error condition and cannot migrate the memory
      back, then a CPU page fault to device memory should end with SIGBUS.
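
      A hedged illustration of the two callbacks (the struct and enum names
      below are illustrative only, not the actual kernel definitions):

        enum device_memory_type_example {
                DEVICE_MEMORY_PERSISTENT,    /* existing ZONE_DEVICE users (pmem) */
                DEVICE_MEMORY_UNADDRESSABLE, /* new CPU-unaddressable device memory */
        };

        struct device_page_callbacks_example {
                /* called when a ZONE_DEVICE page refcount drops to 1, i.e. the
                 * page is free (such pages never reach a refcount of 0) */
                void (*page_free)(struct page *page, void *data);

                /* called on a CPU fault to an address backed by an unaddressable
                 * device page; must migrate the page back to system memory,
                 * otherwise the fault ends with SIGBUS */
                int (*page_fault)(struct vm_area_struct *vma, unsigned long addr,
                                  struct page *page, unsigned int flags);
        };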
      
      [arnd@arndb.de: fix warning]
        Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
      Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
      Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memory_hotplug: memory hotremove supports thp migration · 8135d892
      Committed by Naoya Horiguchi
      This patch enables thp migration for memory hotremove.
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-11-zi.yan@sent.com
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 07 September 2017, 5 commits
    • mm, memory_hotplug: get rid of zonelists_mutex · b93e0f32
      Committed by Michal Hocko
      zonelists_mutex was introduced by commit 4eaf3f64 ("mem-hotplug: fix
      potential race while building zonelist for new populated zone") to
      protect zonelist building from races.  This is no longer needed though
      because both memory online and offline are fully serialized.  New users
      have grown since then.
      
      Notably setup_per_zone_wmarks wants to prevent races between memory
      hotplug, khugepaged setup and manual min_free_kbytes updates via sysctl
      (see cfd3da1e ("mm: Serialize access to min_free_kbytes")).  Let's
      add a private lock for that purpose.  This will not prevent us from
      seeing a halfway-through memory hotplug operation, but that shouldn't be
      a big deal because memory hotplug will update watermarks explicitly, so
      we will eventually get the full picture.  The lock just makes sure we
      won't race when updating watermarks, which could lead to weird results.
      
      Also __build_all_zonelists manipulates global data so add a private lock
      for it as well.  This doesn't seem to be necessary today but it is more
      robust to have a lock there.
      
      While we are at it make sure we document that memory online/offline
      depends on a full serialization either via mem_hotplug_begin() or
      device_lock.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-9-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Haicheng Li <haicheng.li@linux.intel.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: remove explicit build_all_zonelists from try_online_node · 34ad1296
      Committed by Michal Hocko
      try_online_node calls hotadd_new_pgdat which already calls
      build_all_zonelists.  So the additional call is redundant.  Even though
      hotadd_new_pgdat will only initialize zonelists of the new node this is
      the right thing to do because such a node doesn't have any memory so
      other zonelists would ignore all the zones from this node anyway.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop zone from build_all_zonelists · 72675e13
      Committed by Michal Hocko
      build_all_zonelists gets a zone parameter to initialize zone's pagesets.
      There is only a single user which gives a non-NULL zone parameter and
      that one doesn't really need the rest of the build_all_zonelists (see
      commit 6dcd73d7 ("memory-hotplug: allocate zone's pcp before
      onlining pages")).
      
      Therefore remove setup_zone_pageset from build_all_zonelists and call it
      from its only user directly.  This will also remove a pointless zonelists
      rebuilding, which is always good.
      
      Link: http://lkml.kernel.org/r/20170721143915.14161-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Shaohua Li <shaohua.li@intel.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: remove zone restrictions · c6f03e29
      Committed by Michal Hocko
      Historically we have enforced that any kernel zone (e.g. ZONE_NORMAL) has
      to precede the Movable zone in the physical memory range.  The purpose
      of the movable zone is, however, not bound to any physical memory
      restriction.  It merely defines a class of migratable and reclaimable
      memory.
      
      There are users (e.g.  CMA) who might want to reserve specific physical
      memory ranges for their own purpose.  Moreover our pfn walkers have to
      be prepared for zones overlapping in the physical range already because
      we do support interleaving NUMA nodes and therefore zones can interleave
      as well.  This means we can allow each memory block to be associated
      with a different zone.
      
      Loosen the current onlining semantic and allow an explicit onlining type
      on any memblock.  That means that online_{kernel,movable} will be allowed
      regardless of the physical address of the memblock, as long as it is
      offline of course.  This might result in the movable zone overlapping
      with other kernel zones.  Default onlining then becomes a bit tricky but
      still sensible.  echo online > memoryXY/state will online the given
      block to
      
      	1) the default zone if the given range is outside of any zone
      	2) the enclosing zone if such a zone doesn't interleave with
      	   any other zone
              3) the default zone if more zones interleave for this range
      
      where default zone is movable zone only if movable_node is enabled
      otherwise it is a kernel zone.
      
      Here is an example of the semantic (movable_node is not present but it
      works in an analogous way).  We start with the following memblocks, all
      of them offline:
      
        memory34/valid_zones:Normal Movable
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Normal Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal Movable
        memory40/valid_zones:Normal Movable
        memory41/valid_zones:Normal Movable
      
      Now, we online block 34 in default mode and block 37 as movable
      
        root@test1:/sys/devices/system/node/node1# echo online > memory34/state
        root@test1:/sys/devices/system/node/node1# echo online_movable > memory37/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal Movable
        memory40/valid_zones:Normal Movable
        memory41/valid_zones:Normal Movable
      
      As we can see all other blocks can still be onlined both into Normal and
      Movable zones and the Normal is default because the Movable zone spans
      only block37 now.
      
        root@test1:/sys/devices/system/node/node1# echo online_movable > memory41/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Movable Normal
        memory39/valid_zones:Movable Normal
        memory40/valid_zones:Movable Normal
        memory41/valid_zones:Movable
      
      Now the default zone for blocks 37-41 has changed because movable zone
      spans that range.
      
        root@test1:/sys/devices/system/node/node1# echo online_kernel > memory39/state
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal Movable
        memory39/valid_zones:Normal
        memory40/valid_zones:Movable Normal
        memory41/valid_zones:Movable
      
      Note that block 39 now belongs to the zone Normal and so block 38
      falls into Normal by default as well.
      
      For completeness:
      
        root@test1:/sys/devices/system/node/node1# for i in memory[34]?
        do
      	echo online > $i/state 2>/dev/null
        done
      
        memory34/valid_zones:Normal
        memory35/valid_zones:Normal
        memory36/valid_zones:Normal
        memory37/valid_zones:Movable
        memory38/valid_zones:Normal
        memory39/valid_zones:Normal
        memory40/valid_zones:Movable
        memory41/valid_zones:Movable
      
      Implementation-wise the change is quite straightforward.  We can get rid
      of allow_online_pfn_range altogether.  online_pages allows only offline
      nodes already.  The original default_zone_for_pfn will become
      default_kernel_zone_for_pfn.  A new default_zone_for_pfn implements the
      above semantic.  zone_for_pfn_range is slightly reorganized to implement
      the kernel and movable online types explicitly and MMOP_ONLINE_KEEP
      becomes a catch-all default behavior.
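
      A hedged sketch of the default-zone decision described above (helper
      names such as default_kernel_zone_for_pfn and zone_intersects come from
      this patch series; the exact mainline body may differ):

        static struct zone *default_zone_for_pfn(int nid, unsigned long start_pfn,
                                                 unsigned long nr_pages)
        {
                struct zone *kernel_zone =
                        default_kernel_zone_for_pfn(nid, start_pfn, nr_pages);
                struct zone *movable_zone =
                        &NODE_DATA(nid)->node_zones[ZONE_MOVABLE];
                bool in_kernel = zone_intersects(kernel_zone, start_pfn, nr_pages);
                bool in_movable = zone_intersects(movable_zone, start_pfn, nr_pages);

                /* 2) exactly one zone encloses the range: inherit it */
                if (in_kernel ^ in_movable)
                        return in_kernel ? kernel_zone : movable_zone;

                /* 1) outside of any zone, or 3) zones interleave: use the
                 * default, which is movable only with movable_node enabled */
                return movable_node_enabled ? movable_zone : kernel_zone;
        }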
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: <linux-api@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: display allowed zones in the preferred ordering · e5e68930
      Committed by Michal Hocko
      Prior to commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") we used to allow changing the
      valid zone types of a memory block if it was adjacent to a different zone
      type.
      
      This fact was reflected in memoryNN/valid_zones by the ordering of
      printed zones.  The first one was the default (echo online > memoryNN/state)
      and the other one could be onlined explicitly by online_{movable,kernel}.
      
      This behavior was removed by the said patch and as such the ordering was
      not all that important.  In most cases a kernel zone would be default
      anyway.  The only exception is movable_node handled by "mm,
      memory_hotplug: support movable_node for hotpluggable nodes".
      
      Let's reintroduce this behavior again because a later patch will remove
      the zone overlap restriction and so users will be allowed to online a
      kernel resp. movable block regardless of its placement.  The original
      behavior will then become significant again because it would be
      non-trivial for users to see what the default zone to online into is.
      
      The implementation is really simple.  Pull the zone selection out of
      move_pfn_range into a zone_for_pfn_range helper and use it in
      show_valid_zones to display the zone for default onlining and then both
      kernel and movable ones if they are allowed.  The default online zone is
      not duplicated.
      
      Link: http://lkml.kernel.org/r/20170714121233.16861-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 11 July 2017, 7 commits
    • mm/memory-hotplug: switch locking to a percpu rwsem · 3f906ba2
      Committed by Thomas Gleixner
      Andrey reported a potential deadlock with the memory hotplug lock and
      the cpu hotplug lock.
      
      The reason is that memory hotplug takes the memory hotplug lock and then
      calls stop_machine(), which calls get_online_cpus().  That's the reverse
      lock order to get_online_cpus(); get_online_mems(); in mm/slab_common.c.
      
      The problem has been there forever.  The reason why this was never
      reported is that the cpu hotplug locking had this homebrew recursive
      reader-writer semaphore construct which, due to the recursion, evaded
      the full lockdep coverage.  The memory hotplug code copied that construct
      verbatim and therefore has similar issues.
      
      Three steps to fix this:
      
      1) Convert the memory hotplug locking to a per cpu rwsem so the
         potential issues get reported properly by lockdep.
      
      2) Lock the online cpus in mem_hotplug_begin() before taking the memory
         hotplug rwsem and use stop_machine_cpuslocked() in the page_alloc
         code to avoid recursive locking.
      
      3) The cpu hotplug locking in #2 causes a recursive locking of the cpu
         hotplug lock via __offline_pages() -> lru_add_drain_all(). Solve this
         by invoking lru_add_drain_all_cpuslocked() instead.
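
      A sketch of steps 1 and 2, assuming the percpu rwsem and the
      cpus_read_lock() naming of that period (not necessarily the exact
      mainline code):

        DEFINE_STATIC_PERCPU_RWSEM(mem_hotplug_lock);

        void mem_hotplug_begin(void)
        {
                /* take the cpu hotplug lock first, then the memory hotplug
                 * rwsem, so the lock order is always cpu -> memory */
                cpus_read_lock();
                percpu_down_write(&mem_hotplug_lock);
        }

        void mem_hotplug_done(void)
        {
                percpu_up_write(&mem_hotplug_lock);
                cpus_read_unlock();
        }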
      
      Link: http://lkml.kernel.org/r/20170704093421.506836322@linutronix.de
      Reported-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug.c: remove unused local zone_type from __remove_zone() · a52149f1
      Committed by John Hubbard
      __remove_zone() sets up zone_type, but never uses it for anything.
      This does not cause a warning, due to the (necessary) use of
      -Wno-unused-but-set-variable.  However, it's noise, so just delete it.
      
      Link: http://lkml.kernel.org/r/20170624043421.24465-2-jhubbard@nvidia.com
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: unify new_node_page and alloc_migrate_target · 8b913238
      Committed by Michal Hocko
      Commit 394e31d2 ("mem-hotplug: alloc new page from a nearest
      neighbor node when mem-offline") has duplicated a large part of
      alloc_migrate_target with some hotplug-specific special casing.
      
      To be more precise, it tried to enforce the allocation from a different
      node than the original page.  As a result the two functions diverged in
      their shared logic, e.g.  the hugetlb allocation strategy.
      
      Let's unify the two and express different NUMA requirements by the given
      nodemask.  new_node_page will simply exclude the node it doesn't care
      about and alloc_migrate_target will use all the available nodes.
      alloc_migrate_target will then learn to migrate hugetlb pages more
      sanely and use preallocated pool when possible.
      
      Please note that alloc_migrate_target used to call alloc_page resp.
      alloc_pages_current and so used the memory policy of the current context,
      which is quite strange when we consider that it is used in the context of
      alloc_contig_range, which just tries to migrate pages which stand in the
      way.
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb, memory_hotplug: prefer to use reserved pages for migration · 4db9b2ef
      Committed by Michal Hocko
      new_node_page will try to use the origin's next NUMA node as the
      migration destination for hugetlb pages.  If such a node doesn't have
      any preallocated pool it falls back to __alloc_buddy_huge_page_no_mpol
      to allocate a surplus page instead.  This is quite suboptimal for any
      configuration where hugetlb pages are not distributed to all NUMA nodes
      evenly.  Say we have a hotpluggable node 4 and spare hugetlb pages on
      node 0:
      
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node2/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node3/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node4/hugepages/hugepages-2048kB/nr_hugepages:10000
        /sys/devices/system/node/node5/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node6/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node7/hugepages/hugepages-2048kB/nr_hugepages:0
      
      Now we consume the whole pool on node 4 and try to offline this node.
      All the allocated pages should be moved to node0 which has enough
      preallocated pages to hold them.  With the current implementation
      offlining very likely fails because hugetlb allocations during runtime
      are much less reliable.
      
      Fix this by reusing the nodemask which excludes the migration source and
      try to find a node which has a page in the preallocated pool first,
      falling back to __alloc_buddy_huge_page_no_mpol only when the whole pool
      is consumed.
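
      A hedged sketch of the fallback order only; pool_page_on_node() and
      alloc_runtime_surplus_page() are hypothetical helpers standing in for
      the real hugetlb internals:

        static struct page *huge_migration_target(struct hstate *h,
                                                  const nodemask_t *allowed)
        {
                int node;

                /* 1) prefer any allowed node that still has preallocated
                 *    pages sitting in its hugetlb pool */
                for_each_node_mask(node, *allowed) {
                        struct page *page = pool_page_on_node(h, node);
                        if (page)
                                return page;
                }

                /* 2) only when the whole pool is consumed, fall back to a
                 *    runtime (surplus) allocation */
                return alloc_runtime_surplus_page(h, allowed);
        }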
      
      [akpm@linux-foundation.org: remove bogus arg from alloc_huge_page_nodemask() stub]
      Link: http://lkml.kernel.org/r/20170608074553.22152-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: simplify empty node mask handling in new_node_page · 7f252f27
      Committed by Michal Hocko
      new_node_page tries to allocate the target page on a different NUMA node
      than the source page.  This makes sense in most cases during the hotplug
      because we are likely to offline the whole NUMA node.  But there are
      cases where there are no other nodes to fall back to (e.g.  when offlining
      parts of the only existing node) and we have to fall back to allocating
      from the source node.  The current code does that but it can be
      simplified by checking the nmask and updating it before we even try to
      allocate, rather than special-casing it.
      
      This patch shouldn't introduce any functional change.
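
      A minimal sketch of the simplified helper; alloc_migration_target_sketch()
      is a hypothetical stand-in for the actual allocation call, the point is
      preparing the nodemask up front instead of special-casing the empty case
      later:

        static struct page *new_node_page(struct page *page, unsigned long private,
                                          int **result)
        {
                int nid = page_to_nid(page);
                nodemask_t nmask = node_states[N_MEMORY];

                /* prefer any node other than the one being offlined ... */
                node_clear(nid, nmask);
                /* ... but if that leaves nothing, allow the source node again */
                if (nodes_empty(nmask))
                        node_set(nid, nmask);

                return alloc_migration_target_sketch(page, nid, &nmask);
        }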
      
      Link: http://lkml.kernel.org/r/20170608074553.22152-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: zhong jiang <zhongjiang@huawei.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: support movable_node for hotpluggable nodes · 9f123ab5
      Committed by Michal Hocko
      The movable_node kernel parameter allows hotpluggable NUMA nodes to put
      all their hotpluggable memory into the movable zone, which allows a more
      or less reliable memory hotremove.  At least this is the case for the
      NUMA nodes present during boot (see find_zone_movable_pfns_for_nodes).
      
      This is not the case for the memory hotplug, though.
      
      	echo online > /sys/devices/system/memory/memoryXYZ/state
      
      will default to a kernel zone (usually ZONE_NORMAL) unless the
      particular memblock is already in the movable zone range, which is not
      the case normally when onlining the memory from the udev rule context
      for a freshly hotadded NUMA node.  The only option currently is to have
      a special udev rule to echo online_movable to all memblocks belonging to
      such a node, which is rather clumsy.  Not to mention this is inconsistent
      as well because what ended up in the movable zone during boot will
      end up in a kernel zone after hotremove & hotadd without special care.
      
      It would be nice to reuse memblock_is_hotpluggable but the runtime
      hotplug doesn't have that information available because the boot and
      hotplug paths are not shared and it would be really non-trivial to make
      them use the same code path because the runtime hotplug doesn't play
      with the memblock allocator at all.
      
      Teach move_pfn_range that MMOP_ONLINE_KEEP can use the movable zone if
      movable_node is enabled and the range doesn't overlap with the existing
      normal zone.  This should provide a reasonable default onlining
      strategy.
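
      A hedged sketch of the resulting default (MMOP_ONLINE_KEEP) decision;
      range_overlaps_kernel_zone() is a hypothetical helper standing in for
      the real overlap check:

        static struct zone *default_online_zone(int nid, unsigned long start_pfn,
                                                unsigned long nr_pages)
        {
                pg_data_t *pgdat = NODE_DATA(nid);

                /* with movable_node enabled, "echo online" prefers ZONE_MOVABLE
                 * unless the range already overlaps an existing kernel zone */
                if (movable_node_enabled &&
                    !range_overlaps_kernel_zone(nid, start_pfn, nr_pages))
                        return &pgdat->node_zones[ZONE_MOVABLE];

                return &pgdat->node_zones[ZONE_NORMAL];
        }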
      
      Strictly speaking the semantic is not identical with the boot time
      initialization because find_zone_movable_pfns_for_nodes covers only the
      hotpluggable range as described by the BIOS/FW.  From my experience this
      is usually a full node though (except for Node0 which is special and
      never goes away completely).  If this turns out to be a problem in
      real life we can tweak the code to store the hotplug flag into memblocks
      but let's keep this simple now.
      
      Link: http://lkml.kernel.org/r/20170612111227.GI7476@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memory_hotplug.c: add NULL check to avoid potential NULL pointer dereference · dbac61a3
      Committed by Gustavo A. R. Silva
      The NULL check at line 1226: if (!pgdat), implies that pointer pgdat
      might be NULL.
      
      rollback_node_hotadd() dereferences this pointer.  Add a NULL check to
      avoid a potential NULL pointer dereference.
      
      Addresses-Coverity-ID: 1369133
      Link: http://lkml.kernel.org/r/20170530212436.GA6195@embeddedgus
      Signed-off-by: Gustavo A. R. Silva <garsilva@embeddedor.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 07 July 2017, 15 commits
    • mm, memory_hotplug: move movable_node to the hotplug proper · 4932381e
      Committed by Michal Hocko
      movable_node_is_enabled is defined in memblock proper while it is
      initialized from the memory hotplug proper.  This is quite messy and it
      makes a dependency between the two so move movable_node along with the
      helper functions to memory_hotplug.
      
      To make it more entertaining the kernel parameter is ignored unless
      CONFIG_HAVE_MEMBLOCK_NODE_MAP=y because we do not have the node
      information for each memblock otherwise.  So let's warn when the option
      is disabled.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-4-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop CONFIG_MOVABLE_NODE · f70029bb
      Committed by Michal Hocko
      Commit 20b2f52b ("numa: add CONFIG_MOVABLE_NODE for
      movable-dedicated node") has introduced CONFIG_MOVABLE_NODE without a
      good explanation of why it is actually useful.
      
      It makes a lot of sense to make the movable node semantic opt-in, but we
      already have that because the feature has to be explicitly enabled on
      the kernel command line.  A config option on top only makes the
      configuration space larger without a good reason.  It also adds
      additional ifdefery that pollutes the code.
      
      Just drop the config option and make it de facto always enabled.  This
      shouldn't introduce any change to the semantics.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: drop artificial restriction on online/offline · 57c0a172
      Committed by Michal Hocko
      Patch series "remove CONFIG_MOVABLE_NODE".
      
      I am continuing to clean up the memory hotplug code and
      CONFIG_MOVABLE_NODE seems dubious at best.  The following two patches
      simply remove the flag and make it de facto always enabled.
      
      The current semantic of the config option is twofold: 1) it automatically
      binds hotpluggable nodes to have their memory in zone_movable by default
      when movable_node is enabled and 2) it forbids memory hotplug from
      onlining all the memory as movable when !CONFIG_MOVABLE_NODE.
      
      The latter restriction is quite dubious because there is no clear cut of
      how much normal memory we need for a reasonable system operation.  A
      single memory block which is sufficient to allow further movable onlines
      is far from sufficient (e.g. a node with >2GB and 128MB memblocks will
      fill up this zone with struct pages, leaving nothing for other
      allocations).  Removing the config option will not only reduce the
      configuration space, it also removes quite some code.
      
      The semantic of the movable_node command line parameter is preserved.
      
      The first patch removes the restriction mentioned above and the second
      one simply removes all the CONFIG_MOVABLE_NODE related stuff.  The last
      patch moves movable_node flag handling to memory_hotplug proper where it
      belongs.
      
      [1] http://lkml.kernel.org/r/20170524122411.25212-1-mhocko@kernel.org
      
      This patch (of 3):
      
      Commit 74d42d8f ("memory_hotplug: ensure every online node has
      NORMAL memory") has introduced a restriction that every NUMA node has to
      have at least some memory in !movable zones before the first movable
      memory can be onlined if !CONFIG_MOVABLE_NODE.
      
      Likewise can_offline_normal checks the amount of normal memory in
      !movable zones and disallows offlining memory if there is no normal
      memory left, with the justification that "memory-management acts bad when
      we have nodes which is online but don't have any normal memory".
      
      While it is true that not having _any_ memory for kernel allocations on
      a NUMA node is far from great, and such a node would be quite suboptimal
      because all kernel allocations would have to fall back to another NUMA
      node, there is no reason to disallow such a configuration in
      principle.
      
      Besides that, there is not really a big difference between having one
      memblock available for ZONE_NORMAL and having none.  With 128MB sized
      memblocks the system might thrash on kernel allocation requests anyway.
      It is really hard to draw a line on how much normal memory is really
      sufficient, so we have to rely on the administrator to configure the
      system sanely; therefore drop the artificial restriction and remove
      can_offline_normal and can_online_high_movable altogether.
      
      Link: http://lkml.kernel.org/r/20170529114141.536-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Yasuaki Ishimatsu <yasu.isimatu@gmail.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Kani Toshimitsu <toshi.kani@hpe.com>
      Cc: Chen Yucong <slaoub@gmail.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, page_alloc: pass preferred nid instead of zonelist to allocator · 04ec6264
      Committed by Vlastimil Babka
      The main allocator function __alloc_pages_nodemask() takes a zonelist
      pointer as one of its parameters.  All of its callers directly or
      indirectly obtain the zonelist via node_zonelist() using a preferred
      node id and gfp_mask.  We can make the code a bit simpler by doing the
      zonelist lookup in __alloc_pages_nodemask(), passing it a preferred node
      id instead (gfp_mask is already another parameter).
      
      There are some code size benefits thanks to removal of inlined
      node_zonelist():
      
        bloat-o-meter add/remove: 2/2 grow/shrink: 4/36 up/down: 399/-1351 (-952)
      
      This will also make things simpler if we proceed with converting cpusets
      to zonelists.
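
      A sketch of the interface change at a call site (only the call shapes are
      shown; the variable names are illustrative):

        /* before: the caller resolved the zonelist itself */
        page = __alloc_pages_nodemask(gfp_mask, order,
                                      node_zonelist(nid, gfp_mask), nodemask);

        /* after: pass the preferred node id and let the allocator do the
         * node_zonelist() lookup internally */
        page = __alloc_pages_nodemask(gfp_mask, order, nid, nodemask);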
      
      Link: http://lkml.kernel.org/r/20170517081140.30654-4-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Christoph Lameter <cl@linux.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Li Zefan <lizefan@huawei.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: remove unused cruft after memory hotplug rework · 559bfc7d
      Committed by Michal Hocko
      zone_for_memory no longer has any users, and neither does the whole zone
      shifting infrastructure, so drop them all.
      
      This shouldn't introduce any functional changes.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-15-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: fix the section mismatch warning · cdf72f25
      Committed by Michal Hocko
      Tobias has reported the following section mismatches introduced by "mm,
      memory_hotplug: do not associate hotadded memory to zones until online".
      
        WARNING: mm/built-in.o(.text+0x5a1c2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit memmap_init_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of memmap_init_zone is wrong.
      
        WARNING: mm/built-in.o(.text+0x5a25b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit init_currently_empty_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of init_currently_empty_zone is wrong.
      
        WARNING: vmlinux.o(.text+0x188aa2): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:memmap_init_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit memmap_init_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of memmap_init_zone is wrong.
      
        WARNING: vmlinux.o(.text+0x188b3b): Section mismatch in reference from the function move_pfn_range_to_zone() to the function .meminit.text:init_currently_empty_zone()
        The function move_pfn_range_to_zone() references
        the function __meminit init_currently_empty_zone().
        This is often because move_pfn_range_to_zone lacks a __meminit
        annotation or the annotation of init_currently_empty_zone is wrong.
      
      Both memmap_init_zone and init_currently_empty_zone are marked __meminit
      but move_pfn_range_to_zone is used outside of __meminit sections (e.g.
      devm_memremap_pages) so we have to hide it from the checker with a __ref
      annotation.
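
      A reduced sketch of the fix pattern (the body is elided; only the __ref
      annotation and the __meminit callee matter here):

        /* move_pfn_range_to_zone() legitimately calls __meminit code after
         * boot (e.g. via devm_memremap_pages), so __ref tells modpost that
         * the cross-section reference is intentional. */
        void __ref move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
                                          unsigned long nr_pages)
        {
                /* ... resize the zone/node spans ... */
                memmap_init_zone(nr_pages, zone_to_nid(zone), zone_idx(zone),
                                 start_pfn, MEMMAP_HOTPLUG);
        }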
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-14-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: replace for_device by want_memblock in arch_add_memory · 3d79a728
      Committed by Michal Hocko
      arch_add_memory gets a for_device argument which then controls whether we
      want to create memblocks for the created memory sections.  Simplify the
      logic by saying whether we want memblocks directly rather than going
      through a pointless negation.  This also makes the API easier to
      understand because it is clear what we want, rather than a say-nothing
      for_device which can mean anything.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-13-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: do not assume ZONE_NORMAL is default kernel zone · c246a213
      Committed by Michal Hocko
      Heiko Carstens has noticed that he can generate overlapping zones for
      ZONE_DMA and ZONE_NORMAL:
      
        DMA      [mem 0x0000000000000000-0x000000007fffffff]
        Normal   [mem 0x0000000080000000-0x000000017fffffff]
      
        $ cat /sys/devices/system/memory/block_size_bytes
        10000000
        $ cat /sys/devices/system/memory/memory5/valid_zones
        DMA
        $ echo 0 > /sys/devices/system/memory/memory5/online
        $ cat /sys/devices/system/memory/memory5/valid_zones
        Normal
        $ echo 1 > /sys/devices/system/memory/memory5/online
        Normal
      
        $ cat /proc/zoneinfo
        Node 0, zone      DMA
        spanned  524288        <-----
        present  458752
        managed  455078
        start_pfn:           0 <-----
      
        Node 0, zone   Normal
        spanned  720896
        present  589824
        managed  571648
        start_pfn:           327680 <-----
      
      The reason is that we assume that the default zone for kernel onlining
      is ZONE_NORMAL.  This was a simplification introduced by the memory
      hotplug rework and it is easily fixable by checking the range overlap in
      the zone order and considering the first matching zone as the default
      one.  If there is no such zone then assume ZONE_NORMAL as we have been
      doing so far.
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: fix MMOP_ONLINE_KEEP behavior · a69578a1
      Committed by Michal Hocko
      Heiko Carstens has noticed that MMOP_ONLINE_KEEP is currently broken:
      
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Normal Movable
        memory35/valid_zones:Normal Movable
        memory36/valid_zones:Normal Movable
        memory37/valid_zones:Normal Movable
      
        $ echo online_movable > memory34/state
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Movable
        memory35/valid_zones:Movable
        memory36/valid_zones:Movable
        memory37/valid_zones:Movable
      
        $ echo online > memory36/state
        $ grep . memory3?/valid_zones
        memory34/valid_zones:Movable
        memory36/valid_zones:Normal
        memory37/valid_zones:Movable
      
      so we have effectively punched a hole into the movable zone.
      
      The problem is that the move_pfn_range() check for MMOP_ONLINE_KEEP is
      wrong.  It only checks whether the given range is already part of the
      movable zone, which is not the case here as only memory34 is in the zone.
      Fix this by using allow_online_pfn_range(..., MMOP_ONLINE_KERNEL); if
      that is false then we can be sure that movable onlining is the right
      thing to do.
      
      Fixes: "mm, memory_hotplug: do not associate hotadded memory to zones until online"
      Link: http://lkml.kernel.org/r/20170601083746.4924-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memory_hotplug: do not associate hotadded memory to zones until online · f1dd2cd1
      Committed by Michal Hocko
      The current memory hotplug implementation relies on having all the
      struct pages associated with a zone/node during the physical hotplug
      phase (arch_add_memory->__add_pages->__add_section->__add_zone).  In the
      vast majority of cases this means that they are added to ZONE_NORMAL.
      This has been so since 9d99aaa3 ("[PATCH] x86_64: Support memory
      hotadd without sparsemem") and it wasn't a big deal back then because
      movable onlining didn't exist yet.
      
      Much later memory hotplug wanted to (ab)use ZONE_MOVABLE for movable
      onlining 511c2aba ("mm, memory-hotplug: dynamic configure movable
      memory and portion memory") and then things got more complicated.
      Rather than reconsidering the zone association which was no longer
      needed (because the memory hotplug already depended on SPARSEMEM) a
      convoluted semantic of zone shifting has been developed.  Only the
      currently last memblock or the one adjacent to the zone_movable can be
      onlined movable.  This essentially means that the online type changes as
      the new memblocks are added.
      
      Let's simulate memory hot online manually
        $ echo 0x100000000 > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory32/valid_zones
        Normal Movable
      
        $ echo $((0x100000000+(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        $ echo $((0x100000000+2*(128<<20))) > /sys/devices/system/memory/probe
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        $ echo online_movable > /sys/devices/system/memory/memory34/state
        $ grep . /sys/devices/system/memory/memory3?/valid_zones
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable Normal
      
      This is an awkward semantic because a udev event is sent as soon as the
      block is onlined and a udev handler might want to online it based on
      some policy (e.g.  association with a node) but it will inherently race
      with new blocks showing up.
      
      This patch changes the physical online phase to not associate pages with
      any zone at all.  All the pages are just marked reserved and wait for
      the onlining phase to be associated with the zone as per the online
      request.  There are only two requirements
      
      	- existing ZONE_NORMAL and ZONE_MOVABLE cannot overlap
      
      	- ZONE_NORMAL precedes ZONE_MOVABLE in physical addresses
      
      the latter one is not an inherent requirement and can be changed in the
      future.  It preserves the current behavior and made the code slightly
      simpler.  This is subject to change in future.
      
      This means that the same physical online steps as above will lead to the
      following state: Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
      
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
      
      Implementation:
      The current move_pfn_range is reimplemented to check the above
      requirements (allow_online_pfn_range) and then update the respective
      zone (move_pfn_range_to_zone), the pgdat and link all the pages in the
      pfn range with the zone/node.  __add_pages is updated to not require the
      zone and only initializes sections in the range.  This allowed
      simplifying the arch_add_memory code (s390 could get rid of quite some
      code).
      
      devm_memremap_pages is the only user of arch_add_memory which relies on
      the zone association because it hooks into the memory hotplug only
      half way.  It uses it to associate the new memory with ZONE_DEVICE but
      doesn't allow it to be {on,off}lined via sysfs.  This means that this
      particular code path has to call move_pfn_range_to_zone explicitly.
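
      A hedged sketch of that ZONE_DEVICE path after this change (variable
      names are illustrative): arch_add_memory() no longer picks a zone, so
      devm_memremap_pages() associates the range with ZONE_DEVICE explicitly.

        mem_hotplug_begin();
        /* false = do not create memblock sysfs entries for this memory */
        error = arch_add_memory(nid, res_start, res_size, false);
        if (!error)
                /* what onlining would normally do, done by hand here because
                 * this memory is never onlined via sysfs */
                move_pfn_range_to_zone(&NODE_DATA(nid)->node_zones[ZONE_DEVICE],
                                       res_start >> PAGE_SHIFT,
                                       res_size >> PAGE_SHIFT);
        mem_hotplug_done();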
      
      The original zone shifting code is kept in place and will be removed in
      the follow up patch for an easier review.
      
      Please note that this patch also changes the original behavior:
      offlining a memory block adjacent to another zone (Normal vs.  Movable)
      used to allow changing its movable type.  This will be handled later.
      
      [richard.weiyang@gmail.com: simplify zone_intersects()]
        Link: http://lkml.kernel.org/r/20170616092335.5177-1-richard.weiyang@gmail.com
      [richard.weiyang@gmail.com: remove duplicate call for set_page_links]
        Link: http://lkml.kernel.org/r/20170616092335.5177-2-richard.weiyang@gmail.com
      [akpm@linux-foundation.org: remove unused local `i']
      Link: http://lkml.kernel.org/r/20170515085827.16474-12-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Tested-by: Reza Arbab <arbab@linux.vnet.ibm.com>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # For s390 bits
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1dd2cd1
    • M
      mm: consider zone which is not fully populated to have holes · 2d070eab
      Michal Hocko committed
      __pageblock_pfn_to_page has two users currently, set_zone_contiguous
      which checks whether the given zone contains holes and
      pageblock_pfn_to_page which then carefully returns a first valid page
      from the given pfn range for the given zone.  This doesn't handle zones
      which are not fully populated though.  Memory pageblocks can be offlined
      or might not have been onlined yet.  In such a case the zone should be
      considered to have holes, otherwise pfn walkers can touch and play with
      offline pages.
      
      Current callers of pageblock_pfn_to_page in compaction seem to work
      properly right now because they only isolate PageBuddy
      (isolate_freepages_block) or PageLRU resp.  __PageMovable
      (isolate_migratepages_block) which will always be false for these pages.
      It would be safer to skip these pages altogether, though.
      
      In order to do this, the patch adds a new memory section state
      (SECTION_IS_ONLINE) which is set in memory_present (during boot time) or
      in online_pages_range during the memory hotplug.  Similarly
      offline_mem_sections clears the bit and it is called when the memory
      range is offlined.
      
      A pfn_to_online_page helper is then added which checks the mem section
      and only returns a page if it is onlined already.
      
      Use the new helper in __pageblock_pfn_to_page and skip the whole page
      block in such a case.
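
      Conceptually the helper is just a section-state check in front of
      pfn_to_page(); a rough sketch:

        static inline struct page *pfn_to_online_page(unsigned long pfn)
        {
                unsigned long nr = pfn_to_section_nr(pfn);

                if (nr < NR_MEM_SECTIONS && online_section_nr(nr))
                        return pfn_to_page(pfn);

                return NULL;    /* a hole or a not-yet-onlined section */
        }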
      
      [mhocko@suse.com: check valid section number in pfn_to_online_page (Vlastimil),
       mark sections online after all struct pages are initialized in
       online_pages_range (Vlastimil)]
        Link: http://lkml.kernel.org/r/20170518164210.GD18333@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170515085827.16474-8-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2d070eab
    • M
      mm, memory_hotplug: split up register_one_node() · 9037a993
      Michal Hocko committed
      Memory hotplug (add_memory_resource) has to reinitialize node
      infrastructure if the node is offline (one which went through the
      complete add_memory(); remove_memory() cycle).  That involves node
      registration to the kobj infrastructure (register_node), the proper
      association with cpus (register_cpu_under_node) and finally creation of
      node<->memblock symlinks (link_mem_sections).
      
      The last part requires knowing node_start_pfn and node_spanned_pages,
      which we currently have, but a later patch will postpone this
      initialization to the onlining phase which happens later.  In fact we
      do not need to rely on the early pgdat initialization even now because
      the hot added pfn range is already known.
      
      Split register_one_node into a core part (__register_one_node) which
      does all the work common to the boot time NUMA initialization and to
      hotplug.  register_one_node keeps the full initialization while hotplug
      calls __register_one_node and then calls link_mem_sections manually for
      the proper range.
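
      A rough sketch of the boot time side after the split (error handling
      trimmed; the exact link_mem_sections argument list has changed between
      kernel versions, here it takes the start pfn and the number of pages):

        int register_one_node(int nid)
        {
                int error;
                unsigned long start_pfn = NODE_DATA(nid)->node_start_pfn;
                unsigned long nr_pages = NODE_DATA(nid)->node_spanned_pages;

                error = __register_one_node(nid);   /* kobject + cpu links */
                if (error)
                        return error;

                /* boot time: the node span is already known */
                return link_mem_sections(nid, start_pfn, nr_pages);
        }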
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9037a993
    • M
      mm, memory_hotplug: get rid of is_zone_device_section · 1b862aec
      Michal Hocko committed
      Device memory hotplug hooks into regular memory hotplug only half way.
      It needs memory sections to track struct pages but there is no
      need/desire to associate those sections with memory blocks and export
      them to the userspace via sysfs because they cannot be onlined anyway.
      
      This is currently expressed by the for_device argument to
      arch_add_memory which then makes sure to associate the given memory
      range with ZONE_DEVICE.  register_new_memory then relies on
      is_zone_device_section to distinguish special memory hotplug from the
      regular one.  While this works now, later patches in this series want
      to move __add_zone outside of the arch_add_memory path, so we have to
      come up with something else.
      
      Add want_memblock down the __add_pages path and use it to control
      whether the section->memblock association should be done.
      arch_add_memory then just trivially wants a memblock for everything but
      for_device hotplug.
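
      A simplified sketch of the resulting per-section logic (helper argument
      lists are abbreviated here and do not match any single kernel version
      exactly):

        static int __meminit __add_section(int nid, unsigned long phys_start_pfn,
                                           bool want_memblock)
        {
                int ret = sparse_add_one_section(NODE_DATA(nid), phys_start_pfn);

                if (ret < 0)
                        return ret;

                if (!want_memblock)     /* e.g. device memory: no sysfs block */
                        return 0;

                return register_new_memory(nid, __pfn_to_section(phys_start_pfn));
        }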
      
      remove_memory_section doesn't need is_zone_device_section either.  We
      can simply skip all the memblock specific cleanup if there is no
      memblock for the given section.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-5-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Tested-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b862aec
    • M
      mm, memory_hotplug: use node instead of zone in can_online_high_movable · c8f95657
      Michal Hocko committed
      The primary purpose of this helper is to query the node state so use the
      node id directly.  This is a preparatory patch for later changes.
      
      This shouldn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c8f95657
    • M
      mm: remove return value from init_currently_empty_zone · dc0bbf3b
      Michal Hocko committed
      Patch series "mm: make movable onlining suck less", v4.
      
      Movable onlining is a real hack with many downsides - mainly
      reintroduction of lowmem/highmem issues we used to have on 32b systems -
      but it is the only way to make the memory hotremove more reliable which
      is something that people are asking for.
      
      The current semantics of movable memory onlining are really cumbersome,
      however.  The main reason for this is that the udev driven approach is
      basically unusable because udev races with the memory probing while
      only the last memory block or the one adjacent to the existing
      zone_movable is allowed to be onlined movable.  In short, the criterion
      for a successful online_movable changes under udev's feet.  A reliable
      udev approach would require a 2-phase scheme where the first successful
      movable online would have to check all the previous blocks and online
      them in descending order.  This can hardly be considered sane.
      
      This patchset aims at making the onlining semantics more usable.  First
      of all it allows onlining memory as movable as long as it doesn't clash
      with the existing ZONE_NORMAL.  That means that ZONE_NORMAL and
      ZONE_MOVABLE cannot overlap.  Currently I preserve the original
      ordering semantics so the normal zone always precedes the movable zone,
      but I have plans to remove this restriction in the future because it is
      not really necessary.
      
      First 3 patches are cleanups which should be ready to be merged right
      away (unless I have missed something subtle of course).
      
      Patch 4 deals with ZONE_DEVICE dependencies down the __add_pages path.
      
      Patch 5 deals with implicit assumptions of register_one_node on pgdat
      initialization.
      
      Patches 6-10 deal with offline holes in the zone for pfn walkers.  I
      hope I got all of them right but people familiar with compaction should
      double check this.
      
      Patch 11 is the core of the change.  In order to make it easier to
      review I have tried to keep it as minimalistic as possible, and the
      large code removal is moved to patch 14.
      
      Patch 12 is a trivial follow up cleanup.  Patch 13 fixes sparse warnings
      and finally patch 14 removes the unused code.
      
      I have tested the patches in kvm:
        # qemu-system-x86_64 -enable-kvm -monitor pty -m 2G,slots=4,maxmem=4G -numa node,mem=1G -numa node,mem=1G ...
      
      and then probed the additional memory by
        (qemu) object_add memory-backend-ram,id=mem1,size=1G
        (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
      
      Then I have used this simple script to probe the memory block by hand
        # cat probe_memblock.sh
        #!/bin/sh
      
        BLOCK_NR=$1
      
        echo $((0x100000000+$BLOCK_NR*(128<<20))) > /sys/devices/system/memory/probe
      
        # for i in $(seq 10); do sh probe_memblock.sh $i; done
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Normal Movable
        /sys/devices/system/memory/memory35/valid_zones:Normal Movable
        /sys/devices/system/memory/memory36/valid_zones:Normal Movable
        /sys/devices/system/memory/memory37/valid_zones:Normal Movable
        /sys/devices/system/memory/memory38/valid_zones:Normal Movable
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      The main difference from the original implementation is that all new
      memblocks can initially be both online_kernel and online_movable
      because there is obviously no clash.  For comparison, the original
      implementation would have
      
        /sys/devices/system/memory/memory33/valid_zones:Normal
        /sys/devices/system/memory/memory34/valid_zones:Normal
        /sys/devices/system/memory/memory35/valid_zones:Normal
        /sys/devices/system/memory/memory36/valid_zones:Normal
        /sys/devices/system/memory/memory37/valid_zones:Normal
        /sys/devices/system/memory/memory38/valid_zones:Normal
        /sys/devices/system/memory/memory39/valid_zones:Normal Movable
      
      Now
        # echo online_movable > /sys/devices/system/memory/memory34/state
        # grep . /sys/devices/system/memory/memory3?/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
        /sys/devices/system/memory/memory36/valid_zones:Movable
        /sys/devices/system/memory/memory37/valid_zones:Movable
        /sys/devices/system/memory/memory38/valid_zones:Movable
        /sys/devices/system/memory/memory39/valid_zones:Movable
      
      Block 33 can still be onlined both kernel and movable while all the
      remaining ones can only be onlined movable.
      
      /proc/zoneinfo says
        Node 0, zone   Normal
          pages free     0
                min      0
                low      0
                high     0
                spanned  0
                present  0
        --
        Node 0, zone  Movable
          pages free     32753
                min      85
                low      117
                high     149
                spanned  32768
                present  32768
      
      Probing memory at a lower address will result in a new memblock (32)
      which will still allow both Normal and Movable.
      
        # sh probe_memblock.sh 0
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal Movable
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      and online_kernel will properly convert it to the Normal zone while 33
      can still be onlined both ways.
      
        # echo online_kernel > /sys/devices/system/memory/memory32/state
        # grep . /sys/devices/system/memory/memory3[2-5]/valid_zones 2>/dev/null
        /sys/devices/system/memory/memory32/valid_zones:Normal
        /sys/devices/system/memory/memory33/valid_zones:Normal Movable
        /sys/devices/system/memory/memory34/valid_zones:Movable
        /sys/devices/system/memory/memory35/valid_zones:Movable
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     65441
                min      165
                low      230
                high     295
                spanned  65536
                present  65536
        --
        Node 0, zone  Movable
          pages free     32740
                min      82
                low      114
                high     146
                spanned  32768
                present  32768
      
      so both zones have one memblock spanned and present.
      
      Onlining 39 should associate this block with the movable zone
      
        # echo online > /sys/devices/system/memory/memory39/state
      
      /proc/zoneinfo will now tell
        Node 0, zone   Normal
          pages free     32765
                min      80
                low      112
                high     144
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     65501
                min      160
                low      225
                high     290
                spanned  196608
                present  65536
      
      so we will have a movable zone which spans 6 memblocks, 2 present and 4
      representing a hole.
      
      Offlining both movable blocks will lead to a zone with no present
      pages, which I believe is the expected behavior.
      
        # echo offline > /sys/devices/system/memory/memory39/state
        # echo offline > /sys/devices/system/memory/memory34/state
        # grep -A6 "Movable\|Normal" /proc/zoneinfo
        Node 0, zone   Normal
          pages free     32735
                min      90
                low      122
                high     154
                spanned  32768
                present  32768
        --
        Node 0, zone  Movable
          pages free     0
                min      0
                low      0
                high     0
                spanned  196608
                present  0
      
      As a bonus we will get a nice cleanup in the memory hotplug codebase.
      
      This patch (of 16):
      
      init_currently_empty_zone doesn't have any error to return, yet it
      still returns an int and callers try to be defensive and handle a
      potential error.  Remove this nonsense and simplify all callers.

      This patch shouldn't have any visible effect.
      
      Link: http://lkml.kernel.org/r/20170515085827.16474-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Daniel Kiper <daniel.kiper@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Tobias Regnery <tobias.regnery@gmail.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc0bbf3b
  6. 04 May, 2017 1 commit
    • M
      mm, vmscan: prevent kswapd sleeping prematurely due to mismatched classzone_idx · e716f2eb
      Mel Gorman committed
      kswapd is woken to reclaim a node based on a failed allocation request
      from any eligible zone.  Once reclaiming in balance_pgdat(), it will
      continue reclaiming until there is an eligible zone available for the
      zone it was woken for.  kswapd tracks what zone it was recently woken
      for in pgdat->kswapd_classzone_idx.  If it has not been woken recently,
      this zone will be 0.
      
      However, the decision on whether to sleep is made on
      kswapd_classzone_idx, which is 0 without a recent wakeup request, and
      that classzone does not account for lowmem reserves.  This allows
      kswapd to sleep when a small low zone such as ZONE_DMA is balanced for
      a GFP_DMA request even if a stream of allocations cannot use that zone.
      While kswapd may be woken again in the near future there are two
      consequences -- the pgdat bits that control congestion are cleared
      prematurely and direct reclaim is more likely because kswapd slept
      prematurely.
      
      This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an
      invalid index) when there have been no recent wakeups.  If there are no
      wakeups, it'll decide whether to sleep based on the highest possible
      zone available (MAX_NR_ZONES - 1).  It then becomes critical that the
      "pgdat balanced" decisions during reclaim and when deciding to sleep are
      the same.  If there is a mismatch, kswapd can stay awake continually
      trying to balance tiny zones.
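
      The core of the change can be sketched as a small helper that falls
      back to the requested classzone when no wakeup has recorded one (names
      as in the patch, body simplified):

        static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
                                                   enum zone_type classzone_idx)
        {
                /* MAX_NR_ZONES means no wakeup has recorded a classzone */
                if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
                        return classzone_idx;

                return max(pgdat->kswapd_classzone_idx, classzone_idx);
        }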
      
      simoop was used to evaluate it again.  Two of the preparation patches
      regressed the workload so they are included as the second set of
      results.  Otherwise this patch looks artificially excellent.
      
                                               4.11.0-rc1            4.11.0-rc1            4.11.0-rc1
                                                  vanilla              clear-v2          keepawake-v2
      Amean    p50-Read             21670074.18 (  0.00%) 19786774.76 (  8.69%) 22668332.52 ( -4.61%)
      Amean    p95-Read             25456267.64 (  0.00%) 24101956.27 (  5.32%) 26738688.00 ( -5.04%)
      Amean    p99-Read             29369064.73 (  0.00%) 27691872.71 (  5.71%) 30991404.52 ( -5.52%)
      Amean    p50-Write                1390.30 (  0.00%)     1011.91 ( 27.22%)      924.91 ( 33.47%)
      Amean    p95-Write              412901.57 (  0.00%)    34874.98 ( 91.55%)     1362.62 ( 99.67%)
      Amean    p99-Write             6668722.09 (  0.00%)   575449.60 ( 91.37%)    16854.04 ( 99.75%)
      Amean    p50-Allocation          78714.31 (  0.00%)    84246.26 ( -7.03%)    74729.74 (  5.06%)
      Amean    p95-Allocation         175533.51 (  0.00%)   400058.43 (-127.91%)   101609.74 ( 42.11%)
      Amean    p99-Allocation         247003.02 (  0.00%) 10905600.00 (-4315.17%)   125765.57 ( 49.08%)
      
      With this patch on top, write and allocation latencies are massively
      improved.  The read latencies are slightly impaired but it's worth
      noting that this is mostly due to the IO scheduler and not directly
      related to reclaim.  The vmstats are a bit of a mix but the relevant
      ones are as follows;
      
                                  4.10.0-rc7  4.10.0-rc7  4.10.0-rc7
                          mmots-20170209  clear-v1r25  keepawake-v1r25
      Swap Ins                             0           0           0
      Swap Outs                            0         608           0
      Direct pages scanned           6910672     3132699     6357298
      Kswapd pages scanned          57036946    82488665    56986286
      Kswapd pages reclaimed        55993488    63474329    55939113
      Direct pages reclaimed         6905990     2964843     6352115
      Kswapd efficiency                  98%         76%         98%
      Kswapd velocity              12494.375   17597.507   12488.065
      Direct efficiency                  99%         94%         99%
      Direct velocity               1513.835     668.306    1393.148
      Page writes by reclaim           0.000 4410243.000       0.000
      Page writes file                     0     4409635           0
      Page writes anon                     0         608           0
      Page reclaim immediate         1036792    14175203     1042571
      
                                  4.11.0-rc1  4.11.0-rc1  4.11.0-rc1
                                     vanilla  clear-v2  keepawake-v2
      Swap Ins                             0          12           0
      Swap Outs                            0         838           0
      Direct pages scanned           6579706     3237270     6256811
      Kswapd pages scanned          61853702    79961486    54837791
      Kswapd pages reclaimed        60768764    60755788    53849586
      Direct pages reclaimed         6579055     2987453     6256151
      Kswapd efficiency                  98%         75%         98%
      Page writes by reclaim           0.000 4389496.000       0.000
      Page writes file                     0     4388658           0
      Page writes anon                     0         838           0
      Page reclaim immediate         1073573    14473009      982507
      
      Swap-outs are equivalent to baseline.
      
      Direct reclaim is reduced but not eliminated.  It's worth noting that
      there are two periods of direct reclaim for this workload.  The first is
      when it switches from preparing the files for the actual test itself.
      It's a lot of file IO followed by a lot of allocs that reclaims heavily
      for a brief window.  While direct reclaim is lower with clear-v2, it is
      due to kswapd scanning aggressively and trying to reclaim the world
      which is not the right thing to do.  With the patches applied, there is
      still direct reclaim, but it happens during the phase change from
      "creating work files" to starting multiple threads that allocate a lot
      of anonymous memory faster than kswapd can reclaim it.
      
      Scanning/reclaim efficiency is restored by this patch.
      
      Page writes from reclaim context are back at 0 which is ideal.
      
      Pages immediately reclaimed after IO completes is slightly improved but
      it is expected this will vary slightly.
      
      On UMA, there is almost no change so this is not expected to be a
      universal win.
      
      [mgorman@suse.de: fix ->kswapd_classzone_idx initialization]
        Link: http://lkml.kernel.org/r/20170406174538.5msrznj6nt6qpbx5@suse.de
      Link: http://lkml.kernel.org/r/20170309075657.25121-4-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shantanu Goel <sgoel01@yahoo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e716f2eb
  7. 17 Mar, 2017 1 commit
    • H
      mm: add private lock to serialize memory hotplug operations · 55adc1d0
      Heiko Carstens committed
      Commit bfc8c901 ("mem-hotplug: implement get/put_online_mems")
      introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
      in order to allow similar semantics for memory hotplug like for cpu
      hotplug.
      
      The corresponding functions for cpu hotplug are get/put_online_cpus()
      and cpu_hotplug_begin/done().
      
      The commit, however, failed to introduce functions that would serialize
      memory hotplug operations the way cpu_maps_update_begin/done() does for
      cpu hotplug.
      
      This basically leaves mem_hotplug.active_writer unprotected and allows
      concurrent writers to modify it, which may lead to problems as outlined
      by commit f931ab47 ("mm: fix devm_memremap_pages crash, use
      mem_hotplug_{begin, done}").
      
      That commit was extended again with commit b5d24fda ("mm,
      devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
      done}") which serializes memory hotplug operations for some call sites
      by using the device_hotplug lock.
      
      In addition, commit 3fc21924 ("mm: validate device_hotplug is held for
      memory hotplug") added a sanity check to mem_hotplug_begin() to verify
      that the device_hotplug lock is held.
      
      This in turn triggers the following warning on s390:
      
      WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
       Call Trace:
        assert_held_device_hotplug+0x40/0x58)
        mem_hotplug_begin+0x34/0xc8
        add_memory_resource+0x7e/0x1f8
        add_memory+0xda/0x130
        add_memory_merged+0x15c/0x178
        sclp_detect_standby_memory+0x2ae/0x2f8
        do_one_initcall+0xa2/0x150
        kernel_init_freeable+0x228/0x2d8
        kernel_init+0x2a/0x140
        kernel_thread_starter+0x6/0xc
      
      One possible fix would be to add more lock_device_hotplug() and
      unlock_device_hotplug() calls around each call site of
      mem_hotplug_begin/end().  But that would give the device_hotplug lock
      additional semantics it better should not have (serialize memory hotplug
      operations).
      
      Instead add a new memory_add_remove_lock which has semantics similar to
      cpu_add_remove_lock for cpu hotplug.
      
      To keep things hopefully a bit easier the lock will be locked and unlocked
      within the mem_hotplug_begin/end() functions.
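
      In sketch form (the pre-existing refcount/active_writer handling inside
      the two functions is elided here):

        static DEFINE_MUTEX(memory_add_remove_lock);

        void mem_hotplug_begin(void)
        {
                mutex_lock(&memory_add_remove_lock);
                /* ... existing active_writer / reader wait logic ... */
        }

        void mem_hotplug_done(void)
        {
                /* ... existing cleanup ... */
                mutex_unlock(&memory_add_remove_lock);
        }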
      
      Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      55adc1d0
  8. 02 Mar, 2017 1 commit
  9. 25 Feb, 2017 5 commits
  10. 23 Feb, 2017 1 commit
    • Y
      mm/memory_hotplug: set magic number to page->freelist instead of page->lru.next · ddffe98d
      Yasuaki Ishimatsu committed
      To identify that page-table pages were allocated from the bootmem
      allocator, a magic number is set in page->lru.next.

      But the page->lru list is initialized in reserve_bootmem_region().  So
      when free_pagetable() is called, it cannot find the magic number on the
      pages, and it frees them with free_reserved_page() instead of
      put_page_bootmem().

      But if the pages were allocated from the bootmem allocator and used as
      a page table, they have the private flag set.  So before freeing the
      pages, we should clear the private flag via put_page_bootmem().
      
      Before applying the commit 7bfec6f4 ("mm, page_alloc: check multiple
      page fields with a single branch"), we could find the following visible
      issue:
      
        BUG: Bad page state in process kworker/u1024:1
        page:ffffea103cfd8040 count:0 mapcount:0 mappi
        flags: 0x6fffff80000800(private)
        page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
        bad because of flags: 0x800(private)
        <snip>
        Call Trace:
        [...] dump_stack+0x63/0x87
        [...] bad_page+0x114/0x130
        [...] free_pages_prepare+0x299/0x2d0
        [...] free_hot_cold_page+0x31/0x150
        [...] __free_pages+0x25/0x30
        [...] free_pagetable+0x6f/0xb4
        [...] remove_pagetable+0x379/0x7ff
        [...] vmemmap_free+0x10/0x20
        [...] sparse_remove_one_section+0x149/0x180
        [...] __remove_pages+0x2e9/0x4f0
        [...] arch_remove_memory+0x63/0xc0
        [...] remove_memory+0x8c/0xc0
        [...] acpi_memory_device_remove+0x79/0xa5
        [...] acpi_bus_trim+0x5a/0x8d
        [...] acpi_bus_trim+0x38/0x8d
        [...] acpi_device_hotplug+0x1b7/0x418
        [...] acpi_hotplug_work_fn+0x1e/0x29
        [...] process_one_work+0x152/0x400
        [...] worker_thread+0x125/0x4b0
        [...] kthread+0xd8/0xf0
        [...] ret_from_fork+0x22/0x40
      
      And the issue still silently occurs.
      
      Until the page-table pages allocated from the bootmem allocator are
      freed, page->freelist is never used.  So the patch sets the magic
      number in page->freelist instead of page->lru.next.
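
      A rough sketch of the get/put_page_bootmem pair after the change
      (simplified; the real code also sanity checks the bootmem type):

        void get_page_bootmem(unsigned long info, struct page *page,
                              unsigned long type)
        {
                page->freelist = (void *)type;  /* magic number lives here now */
                SetPagePrivate(page);
                set_page_private(page, info);
                page_ref_inc(page);
        }

        void put_page_bootmem(struct page *page)
        {
                unsigned long type = (unsigned long)page->freelist;

                if (page_ref_dec_return(page) == 1) {
                        page->freelist = NULL;
                        ClearPagePrivate(page);
                        set_page_private(page, 0);
                        INIT_LIST_HEAD(&page->lru);
                        free_reserved_page(page);
                }
        }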
      
      [isimatu.yasuaki@jp.fujitsu.com: fix merge issue]
        Link: http://lkml.kernel.org/r/722b1cc4-93ac-dd8b-2be2-7a7e313b3b0b@gmail.com
      Link: http://lkml.kernel.org/r/2c29bd9f-5b67-02d0-18a3-8828e78bbb6f@gmail.com
      Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ddffe98d
  11. 04 Feb, 2017 1 commit
    • T
      base/memory, hotplug: fix a kernel oops in show_valid_zones() · a96dfddb
      Toshi Kani committed
      Reading a sysfs "memoryN/valid_zones" file leads to the following oops
      when the first page of a range is not backed by struct page.
      show_valid_zones() assumes that 'start_pfn' is always valid for
      page_zone().
      
       BUG: unable to handle kernel paging request at ffffea017a000000
       IP: show_valid_zones+0x6f/0x160
      
      This issue may happen on x86-64 systems with 64GiB or more memory since
      their memory block size is bumped up to 2GiB.  [1] An example of such a
      system is described below.  0x3240000000 is only aligned to 1GiB, so
      this memory block starts from 0x3200000000, which is not backed by
      struct page.
      
       BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable
      
      Since test_pages_in_a_zone() already checks holes, fix this issue by
      extending this function to return 'valid_start' and 'valid_end' for a
      given range.  show_valid_zones() then proceeds with the valid range.
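
      A sketch of the fixed caller side in show_valid_zones() (a simplified
      fragment, not the complete function):

        unsigned long valid_start, valid_end;
        struct zone *zone;

        if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages,
                                  &valid_start, &valid_end))
                return sprintf(buf, "none\n");  /* no usable pages in range */

        /* safe now: valid_start is backed by struct page */
        zone = page_zone(pfn_to_page(valid_start));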
      
      [1] 'Commit bdee237c ("x86: mm: Use 2GB memory block size on
          large-memory x86-64 systems")'
      
      Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
      Signed-off-by: Toshi Kani <toshi.kani@hpe.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Zhang Zhen <zhenzhang.zhang@huawei.com>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: <stable@vger.kernel.org>	[4.4+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a96dfddb