1. 27 December 2019 (4 commits)
    • pagecache: add Kconfig to enable/disable the feature · 862e2308
      zhongjiang authored
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      Just add a Kconfig option to enable/disable the feature.
      Signed-off-by: zhongjiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • pagecache: add sysctl interface to limit pagecache · 6174ecb5
      zhong jiang authored
      euleros inclusion
      category: feature
      feature: pagecache limit
      
      Add a proc sysctl interface to set the pagecache limit for memory reclaim.
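
      The commit message does not name the knob, so the following is only a
      rough, hypothetical sketch of how such a limit is typically exposed
      through the sysctl machinery; the procname, variable and the choice of
      proc_doulongvec_minmax below are assumptions for illustration, not the
      actual EulerOS interface.

        /* Hypothetical sketch; identifiers are NOT taken from the real patch. */
        #include <linux/sysctl.h>

        static unsigned long vm_pagecache_limit_pages;  /* hypothetical knob */

        static struct ctl_table pagecache_limit_table[] = {
                {
                        .procname     = "pagecache_limit_pages", /* hypothetical */
                        .data         = &vm_pagecache_limit_pages,
                        .maxlen       = sizeof(vm_pagecache_limit_pages),
                        .mode         = 0644,
                        .proc_handler = proc_doulongvec_minmax,
                },
                { }
        };
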
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init · d453a476
      Waiman Long authored
      [ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]
      
      When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
      pages initialization can take a long time.  Below were the reported init
      times on an 8-socket 96-core 4TB IvyBridge system.
      
        1) Non-debug kernel without CONFIG_KASAN
           [    8.764222] node 1 initialised, 132086516 pages in 7027ms
      
        2) Debug kernel with CONFIG_KASAN
           [  146.288115] node 1 initialised, 132075466 pages in 143052ms
      
      So the page init time in a debug kernel was 20X that of the non-debug
      kernel.  The long init time can be problematic as the page initialization
      is done with interrupts disabled.  In this particular case, it caused the
      appearance of the following warning messages as well as NMI backtraces of
      all the cores that were doing the initialization.
      
      [   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
      [   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
      [   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
      [   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
      [   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
      [   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
      [   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
      [   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)
      
      The long init time was mainly caused by the call to kasan_free_pages() to
      poison the newly initialized pages.  On a 4TB system, we are talking about
      almost 500GB of memory probably on the same node.
      
      In reality, we may not need to poison the newly initialized pages before
      they are ever allocated.  So KASAN poisoning of freed pages before the
      completion of deferred memory initialization is now disabled.  Those pages
      will be properly poisoned when they are allocated or freed after deferred
      pages initialization is done.
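
      A simplified sketch of the guard (per the upstream commit referenced
      above; details in the backport may differ): poisoning on free is routed
      through a helper that skips kasan_free_pages() while the deferred-init
      static key is still active and poisons normally once deferred
      initialization has completed.

        #if defined(CONFIG_KASAN) && defined(CONFIG_DEFERRED_STRUCT_PAGE_INIT)
        DECLARE_STATIC_KEY_TRUE(deferred_pages);

        static inline void kasan_free_nondeferred_pages(struct page *page, int order)
        {
                /* deferred_pages is disabled once deferred init has finished */
                if (!static_branch_unlikely(&deferred_pages))
                        kasan_free_pages(page, order);
        }
        #else
        #define kasan_free_nondeferred_pages(p, o)  kasan_free_pages(p, o)
        #endif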
      
      With this change, the new page initialization time became:
      
      [   21.948010] node 1 initialised, 132075466 pages in 18702ms
      
      This was still about double the non-debug kernel time, but was much
      better than before.
      
      Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 79d3b9b9
      Michal Hocko authored
      commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.
      
      This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
      
      The underlying assumption that one sparse section belongs to a single
      NUMA node doesn't really hold. Robert Shteynfeld has reported a boot
      failure. The boot log was not captured but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node0 starts in the middle of a memory section which is
      also in node1.  memmap_init_zone tries to initialize padding of a
      section even when it is outside of the given pfn range because there are
      code paths (e.g.  memory hotplug) which assume that a memory section is
      always initialized in full.
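
      A quick back-of-the-envelope check (assuming the usual x86-64 geometry of
      4 KiB pages and 128 MiB sparse sections, i.e. SECTION_SIZE_BITS = 27)
      confirms that node0 begins half-way through a section:

        #include <stdio.h>

        int main(void)
        {
                /* assumptions: 4 KiB pages, 128 MiB sparse sections */
                unsigned long long pages_per_section = 1ULL << (27 - 12); /* 0x8000 pfns */
                unsigned long long node0_start_pfn = 0x1424000000ULL >> 12; /* 0x1424000 */

                /* prints 0x4000: node0 starts exactly half a section past a
                 * section boundary; the first half of that section is node1's */
                printf("offset into section: 0x%llx pfns\n",
                       node0_start_pfn % pages_per_section);
                return 0;
        }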
      
      In this particular case, though, such a range is already initialized and
      most likely already managed by the page allocator.  Scribbling over
      those pages corrupts the internal state and likely blows up when any of
      those pages gets used.
      Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 29 December 2018 (2 commits)
    • mm, page_alloc: fix has_unmovable_pages for HugePages · e27666dd
      Oscar Salvador authored
      commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream.
      
      While playing with gigantic hugepages and memory_hotplug, I triggered
      the following #PF when "cat memoryX/removable":
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 1481 Comm: cat Tainted: G            E     4.20.0-rc6-mm1-1-default+ #18
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:has_unmovable_pages+0x154/0x210
        Call Trace:
         is_mem_section_removable+0x7d/0x100
         removable_show+0x90/0xb0
         dev_attr_show+0x1c/0x50
         sysfs_kf_seq_show+0xca/0x1b0
         seq_read+0x133/0x380
         __vfs_read+0x26/0x180
         vfs_read+0x89/0x140
         ksys_read+0x42/0x90
         do_syscall_64+0x5b/0x180
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that we do not pass the head page to page_hstate(), and so
      the call to compound_order() in page_hstate() returns 0, so we end up
      checking every hstate's size against PAGE_SIZE.
      
      Obviously, we do not find any hstate matching that size, and we return
      NULL.  Then, we dereference that NULL pointer in
      hugepage_migration_supported() and we get the #PF from above.
      
      Fix that by getting the head page before calling page_hstate().
      
      Also, since gigantic pages span several pageblocks, re-adjust the logic
      for skipping pages.  While at it, we can also get rid of the
      round_up().
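
      The relevant hunk in has_unmovable_pages() then looks roughly like the
      following (simplified from the upstream commit; the exact context in the
      backport may differ):

        if (PageHuge(page)) {
                struct page *head = compound_head(page);
                unsigned int skip_pages;

                /* use the head page so page_hstate() sees the real compound
                 * order instead of 0 for a tail page */
                if (!hugepage_migration_supported(page_hstate(head)))
                        goto unmovable;

                /* gigantic pages span several pageblocks, so skip the whole
                 * compound page rather than a single pageblock */
                skip_pages = (1 << compound_order(head)) - (page - head);
                iter += skip_pages - 1;
                continue;
        }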
      
      [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
        Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
      Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, memory_hotplug: initialize struct pages for the full memory section · 7592dbfa
      Mikhail Zaslonko authored
      commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream.
      
      If memory end is not aligned with the sparse memory section boundary,
      the mapping of such a section is only partly initialized.  This may lead
      to a VM_BUG_ON due to uninitialized struct page access from the
      is_mem_section_removable() or test_pages_in_a_zone() functions, triggered
      by memory_hotplug sysfs handlers:
      
      Here are the panic examples:
       CONFIG_DEBUG_VM=y
       CONFIG_DEBUG_VM_PGFLAGS=y
      
       kernel parameter mem=2050M
       --------------------------
       page:000003d082008000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( test_pages_in_a_zone+0xde/0x160)
         show_valid_zones+0x5c/0x190
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         test_pages_in_a_zone+0xde/0x160
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
       kernel parameter mem=3075M
       --------------------------
       page:000003d08300c000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( is_mem_section_removable+0xb4/0x190)
         show_mem_removable+0x9a/0xd8
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         is_mem_section_removable+0xb4/0x190
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      Fix the problem by initializing the last memory section of each zone in
      memmap_init_zone() till the very end, even if it goes beyond the zone end.
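
      The change being backported boils down to something like the following at
      the end of memmap_init_zone() (a sketch of the upstream hunk; see the
      original commit for the exact context):

        #ifdef CONFIG_SPARSEMEM
                /*
                 * If the zone does not span the rest of the section then we
                 * should at least initialize those pages. Otherwise we could
                 * blow up on a poisoned page in some paths which depend on
                 * full sections being initialized (e.g. memory hotplug).
                 */
                while (end_pfn % PAGES_PER_SECTION) {
                        __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
                        end_pfn++;
                }
        #endif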
      
      Michal said:
      
      : This has always been a problem AFAIU.  It just went unnoticed because we
      : have zeroed memmaps during allocation before f7f99100 ("mm: stop
      : zeroing memory during allocation in vmemmap") and so the above test
      : would simply skip these ranges as belonging to zone 0 or provided
      : garbage.
      :
      : So I guess we do care for post f7f99100 kernels mostly and
      : therefore Fixes: f7f99100 ("mm: stop zeroing memory during
      : allocation in vmemmap")
      
      Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 17 December 2018 (1 commit)
    • mm/page_alloc.c: fix calculation of pgdat->nr_zones · 505bc9f3
      Wei Yang authored
      [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]
      
      init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
      'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
      case, but not exact in the hot-plug situation.
      
      This function is used in two places:
      
        * free_area_init_core()
        * move_pfn_range_to_zone()
      
      In the first case, we are sure the zone index increases monotonically.
      In the second one, this is under the user's control.
      
      One way to reproduce this is:
      ----------------------------
      
      1. create a virtual machine with empty node1
      
         -m 4G,slots=32,maxmem=32G \
         -smp 4,maxcpus=8          \
         -numa node,nodeid=0,mem=4G,cpus=0-3 \
         -numa node,nodeid=1,mem=0G,cpus=4-7
      
      2. hot-add cpu 3-7
      
         cpu-add [3-7]
      
      3. hot-add memory to node1
      
         object_add memory-backend-ram,id=ram0,size=1G
         device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      
      4. online memory in the following order
      
         echo online_movable > memory47/state
         echo online > memory40/state
      
      After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
      instead of (ZONE_MOVABLE + 1).
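
      The fix makes init_currently_empty_zone() only ever grow nr_zones,
      presumably along these lines (a sketch of the upstream change):

        /* before: pgdat->nr_zones = zone_idx(zone) + 1;   (unconditional)
         * after:  only grow nr_zones, so onlining a lower zone later can no
         *         longer shrink it below an already-onlined higher zone */
        if (zone_idx(zone) + 1 > pgdat->nr_zones)
                pgdat->nr_zones = zone_idx(zone) + 1;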
      
      Michal said:
       "Having an incorrect nr_zones might result in all sorts of problems
        which would be quite hard to debug (e.g. reclaim not considering the
        movable zone). I do not expect many users would suffer from this it
        but still this is trivial and obviously right thing to do so
        backporting to the stable tree shouldn't be harmful (last famous
        words)"
      
      Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  4. 01 December 2018 (2 commits)
    • mm, page_alloc: check for max order in hot path · 9dec3855
      Michal Hocko authored
      [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]
      
      Konstantin has noticed that kvmalloc might trigger the following
      warning:
      
        WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
        [...]
        Call Trace:
         fragmentation_index+0x76/0x90
         compaction_suitable+0x4f/0xf0
         shrink_node+0x295/0x310
         node_reclaim+0x205/0x250
         get_page_from_freelist+0x649/0xad0
         __alloc_pages_nodemask+0x12a/0x2a0
         kmalloc_large_node+0x47/0x90
         __kmalloc_node+0x22b/0x2e0
         kvmalloc_node+0x3e/0x70
         xt_alloc_table_info+0x3a/0x80 [x_tables]
         do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
         nf_setsockopt+0x44/0x60
         SyS_setsockopt+0x6f/0xc0
         do_syscall_64+0x67/0x120
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The problem is that we only check for an out-of-bounds order in the slow
      path and the node reclaim might happen from the fast path already.  This
      is fixable by making sure that kvmalloc doesn't ever use kmalloc for
      requests that are larger than KMALLOC_MAX_SIZE but this also shows that
      the code is rather fragile.  A recent UBSAN report just underlines that
      with the following report:
      
        UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
        shift exponent 51 is too large for 32-bit type 'int'
        CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0xd2/0x148 lib/dump_stack.c:113
         ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
         __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
         __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
         zone_watermark_fast mm/page_alloc.c:3216 [inline]
         get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
         __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
         alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
         alloc_pages include/linux/gfp.h:509 [inline]
         __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
         dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
         raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
         raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
         fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
         fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
         __blkdev_driver_ioctl block/ioctl.c:303 [inline]
         blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
         block_ioctl+0x105/0x150 fs/block_dev.c:1883
         vfs_ioctl fs/ioctl.c:46 [inline]
         do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
         ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
         __do_sys_ioctl fs/ioctl.c:709 [inline]
         __se_sys_ioctl fs/ioctl.c:707 [inline]
         __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
         do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note that this is not a kvmalloc path.  It is just that the fast path
      really depends on having a sanitized order as well.  Therefore move the
      order check to the fast path.
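
      In practice this means the out-of-bounds check moves to the very top of
      the allocator's fast path, roughly as follows (per the upstream commit;
      the backported context may differ slightly):

        /*
         * There are several places where we assume that the order value is
         * sane so bail out early if the request is out of bound.
         */
        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
                return NULL;
        }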
      
      Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reported-by: Kyungtae Kim <kt0755@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Byoungyoung Lee <lifeasageek@gmail.com>
      Cc: "Dae R. Jeong" <threeearcat@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm, memory_hotplug: check zone_movable in has_unmovable_pages · b44fd126
      Michal Hocko authored
      [ Upstream commit 9d7899999c62c1a81129b76d2a6ecbc4655e1597 ]
      
      Page state checks are racy.  Under a heavy memory workload (e.g.  stress
      -m 200 -t 2h) it is quite easy to hit a race window when the page is
      allocated but its state is not fully populated yet.  A debugging patch to
      dump the struct page state shows
      
        has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
        page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
        flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
      
      Note that the state has been checked for both PageLRU and PageSwapBacked
      already.  Closing this race completely would require some sort of retry
      logic.  This can be tricky and error prone (think of potential endless
      or long-running loops).
      
      Work around this problem for movable zones at least.  Such a zone should
      only contain movable pages.  Commit 15c30bc0 ("mm, memory_hotplug:
      make has_unmovable_pages more robust") has told us that this is not
      strictly true though.  Bootmem pages should be marked reserved though, so
      we can move the original check after the PageReserved check.  Pages from
      other zones are still prone to races, but we do not even pretend that
      memory hotremove works for those, so premature failure doesn't hurt that
      much.
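
      Concretely, the workaround adds an early continue inside the
      has_unmovable_pages() loop, right after the PageReserved check, along
      these lines (a sketch of the upstream hunk):

        if (PageReserved(page))
                goto unmovable;

        /*
         * If the zone is movable and we have ruled out all reserved pages
         * then it should be reasonably safe to assume the rest is movable.
         */
        if (zone_idx(zone) == ZONE_MOVABLE)
                continue;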
      
      Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
      Fixes: 15c30bc0 ("mm, memory_hotplug: make has_unmovable_pages more robust")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  5. 09 October 2018 (1 commit)
  6. 02 October 2018 (1 commit)
    • mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Mel Gorman authored
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or no performance difference,
      implying that there is not a heavy reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)

      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  7. 05 September 2018 (1 commit)
  8. 30 August 2018 (1 commit)
  9. 24 August 2018 (1 commit)
  10. 23 August 2018 (5 commits)
  11. 18 August 2018 (4 commits)
    • mm, page_alloc: double zone's batchsize · d8a759b5
      Aaron Lu authored
      To improve the page allocator's performance for order-0 pages, each CPU
      has a Per-CPU Pageset (PCP) per zone.  Whenever an order-0 page is needed,
      the PCP will be checked first before asking Buddy for pages.  When the PCP
      is used up, a batch of pages will be fetched from Buddy to improve
      performance, and the size of the batch can affect performance.
      
      The zone's batch size was last doubled by commit ba56e91c ("mm:
      page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
      then, CPUs have evolved a lot and CPU cache sizes have also increased.
      
      Dave Hansen is concerned the current batch size doesn't fit well with
      modern hardware and suggested I do two things: first, use a page
      allocator intensive benchmark, e.g.  will-it-scale/page_fault1, to find
      out how performance changes with different batch sizes on various
      machines and then choose a new default batch size; second, see how this
      new batch size works with other workloads.
      
      In the first test, we saw performance gains on high-core-count systems
      and little to no effect on older systems with more modest core counts.
      From this phase's test data, two candidates, 63 and 127, were chosen.
      
      In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
      and more will-it-scale sub-tests were run to see how these two
      candidates work with these workloads, and a new default was decided
      according to the results.
      
      Most test results are flat.  will-it-scale/page_fault2 process mode has a
      10%-18% performance increase on 4-socket Skylake and Broadwell.
      vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for
      4-socket servers, while for 2-socket servers it caused a 3%-8% performance
      drop.  Further analysis showed that, with a larger pcp->batch and thus a
      larger pcp->high (the relationship pcp->high = 6 * pcp->batch is
      maintained in this patch), zone lock contention shifted to LRU add side
      lock contention and that caused the performance drop.  This performance
      drop might be mitigated by others' work on optimizing the LRU lock.
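
      For a sense of scale, the per-CPU cache implied by the pcp->high =
      6 * pcp->batch relationship above works out as follows (illustration
      only, assuming 4 KiB pages):

        #include <stdio.h>

        int main(void)
        {
                const unsigned long page_kib = 4;       /* assume 4 KiB pages */
                const unsigned int batches[] = { 31, 63, 127 };

                for (unsigned int i = 0; i < 3; i++) {
                        unsigned int batch = batches[i];
                        unsigned int high = 6 * batch;  /* pcp->high = 6 * pcp->batch */

                        /* prints 744 KiB, 1512 KiB and 3048 KiB respectively */
                        printf("batch=%3u high=%3u (~%lu KiB cached per CPU per zone)\n",
                               batch, high, high * page_kib);
                }
                return 0;
        }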
      
      Another downside of increasing pcp->batch is that, when the PCP is used up
      and needs to fetch a batch of pages from Buddy, since the batch is larger,
      that can take longer than before.  My understanding is this doesn't affect
      the slowpath, where direct reclaim and compaction dominate.  For the
      fastpath, throughput is a win (according to will-it-scale/page_fault1) but
      worst-case latency can be larger now.
      
      Overall, I think doubling the batch size from 31 to 63 is relatively safe
      and provides a good performance boost for high-core-count systems.
      
      The two phases' test results are listed below (all tests were done with
      THP disabled).
      
      Phase one(will-it-scale/page_fault1) test results:
      
      Skylake-EX: the increased batch size has a good effect on zone->lock
      contention, though LRU contention rises at the same time and
      limits the final performance increase.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   15345900    +0.00%       64%                 8%           72%
       53   17903847   +16.67%       32%                38%           70%
       63   17992886   +17.25%       24%                45%           69%
       73   18022825   +17.44%       10%                61%           71%
      119   18023401   +17.45%        4%                66%           70%
      127   18029012   +17.48%        3%                66%           69%
      137   18036075   +17.53%        4%                66%           70%
      165   18035964   +17.53%        2%                67%           69%
      188   18101105   +17.95%        2%                67%           69%
      223   18130951   +18.15%        2%                67%           69%
      255   18118898   +18.07%        2%                67%           69%
      267   18101559   +17.96%        2%                67%           69%
      299   18160468   +18.34%        2%                68%           70%
      320   18139845   +18.21%        2%                67%           69%
      393   18160869   +18.34%        2%                68%           70%
      424   18170999   +18.41%        2%                68%           70%
      458   18144868   +18.24%        2%                68%           70%
      467   18142366   +18.22%        2%                68%           70%
      498   18154549   +18.30%        1%                68%           69%
      511   18134525   +18.17%        1%                69%           70%
      
      Broadwell-EX: similar pattern as Skylake-EX.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   16703983    +0.00%       67%                 7%           74%
       53   18195393    +8.93%       43%                28%           71%
       63   18288885    +9.49%       38%                33%           71%
       73   18344329    +9.82%       35%                37%           72%
      119   18535529   +10.96%       24%                46%           70%
      127   18513596   +10.83%       23%                48%           71%
      137   18514327   +10.84%       23%                48%           71%
      165   18511840   +10.82%       22%                49%           71%
      188   18593478   +11.31%       17%                53%           70%
      223   18601667   +11.36%       17%                52%           69%
      255   18774825   +12.40%       12%                58%           70%
      267   18754781   +12.28%        9%                60%           69%
      299   18892265   +13.10%        7%                63%           70%
      320   18873812   +12.99%        8%                62%           70%
      393   18891174   +13.09%        6%                64%           70%
      424   18975108   +13.60%        6%                64%           70%
      458   18932364   +13.34%        8%                62%           70%
      467   18960891   +13.51%        5%                65%           70%
      498   18944526   +13.41%        5%                64%           69%
      511   18960839   +13.51%        5%                64%           69%
      
      Skylake-EP: although the increased batch reduced zone->lock contention,
      the effect is not as good as on EX: zone->lock contention is still as high
      as 20% with a very high batch value instead of 1% on Skylake-EX or 5% on
      Broadwell-EX.  Also, total_contention actually decreased with a higher
      batch but that doesn't translate to a performance increase.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   9554867    +0.00%       66%                 3%           69%
       53   9855486    +3.15%       63%                 3%           66%
       63   9980145    +4.45%       62%                 4%           66%
       73   10092774   +5.63%       62%                 5%           67%
      119   10310061   +7.90%       45%                19%           64%
      127   10342019   +8.24%       42%                19%           61%
      137   10358182   +8.41%       42%                21%           63%
      165   10397060   +8.81%       37%                24%           61%
      188   10341808   +8.24%       34%                26%           60%
      223   10349135   +8.31%       31%                27%           58%
      255   10327189   +8.08%       28%                29%           57%
      267   10344204   +8.26%       27%                29%           56%
      299   10325043   +8.06%       25%                30%           55%
      320   10310325   +7.91%       25%                31%           56%
      393   10293274   +7.73%       21%                31%           52%
      424   10311099   +7.91%       21%                32%           53%
      458   10321375   +8.02%       21%                32%           53%
      467   10303881   +7.84%       21%                32%           53%
      498   10332462   +8.14%       20%                33%           53%
      511   10325016   +8.06%       20%                32%           52%
      
      Broadwell-EP: zone->lock and lru lock had an agreement to make sure
      performance doesn't increase and they successfully managed to keep total
      contention at 70%.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   10121178   +0.00%       19%                50%           69%
       53   10142366   +0.21%        6%                63%           69%
       63   10117984   -0.03%       11%                58%           69%
       73   10123330   +0.02%        7%                63%           70%
      119   10108791   -0.12%        2%                67%           69%
      127   10166074   +0.44%        3%                66%           69%
      137   10141574   +0.20%        3%                66%           69%
      165   10154499   +0.33%        2%                68%           70%
      188   10124921   +0.04%        2%                67%           69%
      223   10137399   +0.16%        2%                67%           69%
      255   10143289   +0.22%        0%                68%           68%
      267   10123535   +0.02%        1%                68%           69%
      299   10140952   +0.20%        0%                68%           68%
      320   10163170   +0.41%        0%                68%           68%
      393   10000633   -1.19%        0%                69%           69%
      424   10087998   -0.33%        0%                69%           69%
      458   10187116   +0.65%        0%                69%           69%
      467   10146790   +0.25%        0%                69%           69%
      498   10197958   +0.76%        0%                69%           69%
      511   10152326   +0.31%        0%                69%           69%
      
      Haswell-EP: similar to Broadwell-EP.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   10442205   +0.00%       14%                48%           62%
       53   10442255   +0.00%        5%                57%           62%
       63   10452059   +0.09%        6%                57%           63%
       73   10482349   +0.38%        5%                59%           64%
      119   10454644   +0.12%        3%                60%           63%
      127   10431514   -0.10%        3%                59%           62%
      137   10423785   -0.18%        3%                60%           63%
      165   10481216   +0.37%        2%                61%           63%
      188   10448755   +0.06%        2%                61%           63%
      223   10467144   +0.24%        2%                61%           63%
      255   10480215   +0.36%        2%                61%           63%
      267   10484279   +0.40%        2%                61%           63%
      299   10466450   +0.23%        2%                61%           63%
      320   10452578   +0.10%        2%                61%           63%
      393   10499678   +0.55%        1%                62%           63%
      424   10481454   +0.38%        1%                62%           63%
      458   10473562   +0.30%        1%                62%           63%
      467   10484269   +0.40%        0%                62%           62%
      498   10505599   +0.61%        0%                62%           62%
      511   10483395   +0.39%        0%                62%           62%
      
      Westmere-EP: contention is pretty small so not interesting.  Note too high
      a batch value could hurt performance.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   4831523   +0.00%        2%                 3%            5%
       53   4834086   +0.05%        2%                 4%            6%
       63   4834262   +0.06%        2%                 3%            5%
       73   48328518   +0.03%        2%                 4%            6%
      119   4830534   -0.02%        1%                 3%            4%
      127   4827461   -0.08%        1%                 4%            5%
      137   4827459   -0.08%        1%                 3%            4%
      165   4820534   -0.23%        0%                 4%            4%
      188   4817947   -0.28%        0%                 3%            3%
      223   4809671   -0.45%        0%                 3%            3%
      255   4802463   -0.60%        0%                 4%            4%
      267   4801634   -0.62%        0%                 3%            3%
      299   4798047   -0.69%        0%                 3%            3%
      320   4793084   -0.80%        0%                 3%            3%
      393   4785877   -0.94%        0%                 3%            3%
      424   4782911   -1.01%        0%                 3%            3%
      458   4779346   -1.08%        0%                 3%            3%
      467   4780306   -1.06%        0%                 3%            3%
      498   4780589   -1.05%        0%                 3%            3%
      511   4773724   -1.20%        0%                 3%            3%
      
      Skylake-Desktop: similar to Westmere-EP, nothing interesting.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3906608   +0.00%        2%                 3%            5%
       53   3940164   +0.86%        2%                 3%            5%
       63   3937289   +0.79%        2%                 3%            5%
       73   3940201   +0.86%        2%                 3%            5%
      119   3933240   +0.68%        2%                 3%            5%
      127   3930514   +0.61%        2%                 4%            6%
      137   3938639   +0.82%        0%                 3%            3%
      165   3908755   +0.05%        0%                 3%            3%
      188   3905621   -0.03%        0%                 3%            3%
      223   3903015   -0.09%        0%                 4%            4%
      255   3889480   -0.44%        0%                 3%            3%
      267   3891669   -0.38%        0%                 4%            4%
      299   3898728   -0.20%        0%                 4%            4%
      320   3894547   -0.31%        0%                 4%            4%
      393   3875137   -0.81%        0%                 4%            4%
      424   3874521   -0.82%        0%                 3%            3%
      458   3880432   -0.67%        0%                 4%            4%
      467   3888715   -0.46%        0%                 3%            3%
      498   3888633   -0.46%        0%                 4%            4%
      511   3875305   -0.80%        0%                 5%            5%
      
      Haswell-Desktop: zone->lock contention is pretty low, as on other desktops,
      though lru contention is higher than on other desktops.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3511158   +0.00%        2%                 5%            7%
       53   3555445   +1.26%        2%                 6%            8%
       63   3561082   +1.42%        2%                 6%            8%
       73   3547218   +1.03%        2%                 6%            8%
      119   3571319   +1.71%        1%                 7%            8%
      127   3549375   +1.09%        0%                 6%            6%
      137   3560233   +1.40%        0%                 6%            6%
      165   3555176   +1.25%        2%                 6%            8%
      188   3551501   +1.15%        0%                 8%            8%
      223   3531462   +0.58%        0%                 7%            7%
      255   3570400   +1.69%        0%                 7%            7%
      267   3532235   +0.60%        1%                 8%            9%
      299   3562326   +1.46%        0%                 6%            6%
      320   3553569   +1.21%        0%                 8%            8%
      393   3539519   +0.81%        0%                 7%            7%
      424   3549271   +1.09%        0%                 8%            8%
      458   3528885   +0.50%        0%                 8%            8%
      467   3526554   +0.44%        0%                 7%            7%
      498   3525302   +0.40%        0%                 9%            9%
      511   3527556   +0.47%        0%                 8%            8%
      
      Sandybridge-Desktop: the 0% contention isn't accurate but is caused by the
      dropped fractional part. Since multiple contention paths' contentions
      are all under 1% here, with some arithmetic operations like addition, the
      final deviation could be as large as 3%.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   1744495   +0.00%        0%                 0%            0%
       53   1755341   +0.62%        0%                 0%            0%
       63   1758469   +0.80%        0%                 0%            0%
       73   1759626   +0.87%        0%                 0%            0%
      119   1770417   +1.49%        0%                 0%            0%
      127   1768252   +1.36%        0%                 0%            0%
      137   1767848   +1.34%        0%                 0%            0%
      165   1765088   +1.18%        0%                 0%            0%
      188   1766918   +1.29%        0%                 0%            0%
      223   1767866   +1.34%        0%                 0%            0%
      255   1768074   +1.35%        0%                 0%            0%
      267   1763187   +1.07%        0%                 0%            0%
      299   1765620   +1.21%        0%                 0%            0%
      320   1767603   +1.32%        0%                 0%            0%
      393   1764612   +1.15%        0%                 0%            0%
      424   1758476   +0.80%        0%                 0%            0%
      458   1758593   +0.81%        0%                 0%            0%
      467   1757915   +0.77%        0%                 0%            0%
      498   1753363   +0.51%        0%                 0%            0%
      511   1755548   +0.63%        0%                 0%            0%
      
      Phase two test results:
      Note: all percentage changes are against the base (batch=31).
      
      ebizzy.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
      lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
      lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
      lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
      lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
      lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
      lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
      lkp-sb02          98937          98937    +0.0%       99030 +0.1%
      
      kbuild.buildtime (less is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
      lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
      lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
      lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
      lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
      lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
      lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
      
      netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
      lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
      lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
      lk-bdw-ep2       365764        365933  +0.0%        365951 +0.1%
      lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
      lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
      lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
      lkp-sb02          29990         30137  +0.5%         30103 +0.4%
      
      oltp.transactions (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
      lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
      lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
      lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
      lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
      lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
      lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
      
      pigz.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
      lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
      lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
      lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
      lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
      lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
      lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
      lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
      
      will-it-scale.malloc1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
      lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
      lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
      lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
      lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
      lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
      lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
      lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
      lkp-sb02          589286       589284 -0.0%         588101 -0.2%
      
      will-it-scale.malloc1.threads (higher is better)
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
      lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
      lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
      lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
      lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
      lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
      lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
      lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
      lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
      
      will-it-scale.malloc2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
      lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
      lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
      lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
      lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
      lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
      lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
      lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
      lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
      
      will-it-scale.malloc2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
      lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
      lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
      lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
      lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
      lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
      lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
      lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
      lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
      
      will-it-scale.page_fault2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
      lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
      lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
      lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
      lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
      lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
      lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
      lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
      lkp-sb02          842616        837844  -0.6%         833921  -1.0%
      
      will-it-scale.page_fault2.threads
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
      lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
      lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
      lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
      lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
      lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
      lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
      lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
      lkp-sb02          750626        748615    -0.3%      746621    -0.5%
      
      will-it-scale.page_fault3.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
      lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
      lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
      lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
      lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
      lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
      lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
      lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
      lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
      
      will-it-scale.page_fault3.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
      lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
      lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
      lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
      lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
      lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
      lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
      lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
      lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
      
      will-it-scale.read1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
      lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
      lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
      lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
      lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
      lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
      lkp-skl-d01     10969487      10972650     +0.0%    no data
      lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
      lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
      
      will-it-scale.read1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
      lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
      lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
      lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
      lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
      lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
      lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
      lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
      lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
      
      will-it-scale.write1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
      lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
      lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
      lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
      lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
      lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
      lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
      lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
      lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
      
      will-it-scale.write1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
      lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
      lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
      lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
      lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
      lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
      lkp-skl-d01       8346174       8235695     -1.3%     no data
      lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
      lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
      
      vm-scalability.anon-r-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
      lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
      lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
      lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
      lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
      lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
      lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
      lkp-sb02          570572        567854     -0.5%     568449     -0.4%
      
      vm-scalability.anon-r-rand-mt.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
      lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
      lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
      lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
      lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
      lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
      lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
      lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
      lkp-sb02          567137        567346     +0.0%     566144     -0.2%
      
      vm-scalability.lru-file-mmap-read.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
      lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
      lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
      lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
      lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
      lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
      lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
      lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
      lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
      
      vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
      lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
      lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
      lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
      lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
      lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
      lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
      lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
      lkp-sb02          258939        258787  -0.1%        258199  -0.3%
      
       Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
       Signed-off-by: NAaron Lu <aaron.lu@intel.com>
      Suggested-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8a759b5
    • M
      mm: drop VM_BUG_ON from __get_free_pages · 9ea9a680
       Committed by Michal Hocko
      There is no real reason to blow up just because the caller doesn't know
      that __get_free_pages cannot return highmem pages.  Simply fix that up
       silently.  Even if we have some confused users, such a fixup will not be
      harmful.
      
      [akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
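       
       A minimal sketch of what that silent fixup amounts to (illustrative only;
       based on the akpm note above, with the mask applied before the underlying
       alloc_pages() call):
       
           unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
           {
               struct page *page;
       
               /*
                * Highmem pages have no permanent kernel mapping, so a plain
                * virtual address cannot be returned for them.  Instead of
                * VM_BUG_ON(gfp_mask & __GFP_HIGHMEM), silently drop the flag.
                */
               page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
               if (!page)
                   return 0;
               return (unsigned long)page_address(page);
           }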
       Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
       Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Jiankang Chen <chenjiankang1@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ea9a680
    • V
      mm, page_alloc: actually ignore mempolicies for high priority allocations · d6a24df0
       Committed by Vlastimil Babka
      __alloc_pages_slowpath() has for a long time contained code to ignore
      node restrictions from memory policies for high priority allocations.
      The current code that resets the zonelist iterator however does
      effectively nothing after commit 7810e678 ("mm, page_alloc: do not
      break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
      Even before that commit, mempolicy restrictions were still not ignored,
      as they are passed in ac->nodemask which is untouched by the code.
      
      We can either remove the code, or make it work as intended.  Since
      ac->nodemask can be set from task's mempolicy via alloc_pages_current()
      and thus also alloc_pages(), it may indeed affect kernel allocations,
      and it makes sense to ignore it to allow progress for high priority
      allocations.
      
      Thus, this patch resets ac->nodemask to NULL in such cases.  This
      assumes all callers can handle it (i.e.  there are no guarantees as in
       the case of __GFP_THISNODE), which seems to be the case.  The same
       assumption has been present in check_retry_cpuset() for some time.
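       
       As a rough sketch of the reset (illustrative; the variable names follow
       the customary mm/page_alloc.c alloc_context, and the exact triggering
       condition inside __alloc_pages_slowpath() is an assumption here):
       
           /* once the allocation may dip into reserves / ignore watermarks */
           if (alloc_flags & ALLOC_OOM) {          /* illustrative condition */
               /*
                * Drop any mempolicy-derived node restriction so a high
                * priority allocation can make progress on any node.  Callers
                * must tolerate this (no __GFP_THISNODE-style guarantee).
                */
               ac->nodemask = NULL;
               ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                       ac->high_zoneidx, ac->nodemask);
           }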
      
      The expected effect is that high priority kernel allocations in the
      context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
       will have a higher chance to succeed if they are restricted to nodes with
      depleted memory, while there are other nodes with free memory left.
      
      It's not a new intention, but for the first time the code will match the
      intention, AFAICS.  It was intended by commit 183f6371 ("mm: ignore
      mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
      really worked, as mempolicy restriction was already encoded in nodemask,
      not zonelist, at that time.
      
      So originally that was for ALLOC_NO_WATERMARK only.  Then it was
      adjusted by e46e7b77 ("mm, page_alloc: recalculate the preferred
      zoneref if the context can ignore memory policies") and cd04ae1e
      ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
      current state.  So even GFP_ATOMIC would now ignore mempolicies after
       the initial attempts fail - if the code worked as people thought it
       did.
      
       Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
       Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6a24df0
    • P
      mm: skip invalid pages block at a time in zero_resv_unresv() · 720e14eb
       Committed by Pavel Tatashin
       The role of zero_resv_unavail() is to make sure that every struct page
       that is allocated but not backed by memory accessible to the kernel is
       zeroed rather than left in some uninitialized state.
      
       Since struct pages are allocated in blocks (2M pages in the x86 case),
       we can skip pageblock_nr_pages at a time when the first page of a block
       is found to be invalid.
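       
       As a sketch (loosely modelled on the zero_resv_unavail() loop; the
       block-rounding details are an assumption), the skip looks like:
       
           for (pfn = start; pfn < end; pfn++) {
               if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
                   /*
                    * The whole block is invalid: jump to the next pageblock
                    * instead of testing every pfn in it individually.
                    */
                   pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
                         + pageblock_nr_pages - 1;
                   continue;
               }
               mm_zero_struct_page(pfn_to_page(pfn));
               pgcnt++;
           }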
      
      This optimization may help since now on x86 every hole in e820 maps is
      marked as reserved in memblock, and thus will go through this function.
      
      This function is called before sched_clock() is initialized, so I used
      my x86 early boot clock patches to measure the performance improvement.
      
       With a 1T hole on an i7-8700, this path currently takes 0.606918s of
       boot time; with this optimization it takes 0.001103s.
      
       Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
       Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      720e14eb
   12. 06 Aug, 2018 2 commits
    • P
      PM / reboot: Eliminate race between reboot and suspend · 55f2503c
       Committed by Pingfan Liu
       At present, "systemctl suspend" and "shutdown" can run in parallel. A
       system can suspend after devices_shutdown() and then resume, after which
       the shutdown task goes on to power off. This causes many devices to not
       be really shut off. Hence, replace reboot_mutex with
       system_transition_mutex (renamed from pm_mutex) to achieve the
       exclusion. Renaming pm_mutex to system_transition_mutex also better
       reflects the purpose of the mutex.
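       
       A rough illustration of the exclusion (a sketch only, not the exact
       diff; suspend_devices_and_enter() stands in for the full suspend path):
       
           /* kernel/reboot.c: reboot/power-off path */
           mutex_lock(&system_transition_mutex);   /* was reboot_mutex */
           kernel_power_off();
           mutex_unlock(&system_transition_mutex);
       
           /* kernel/power/suspend.c: suspend serialises on the same lock */
           if (!mutex_trylock(&system_transition_mutex))   /* was pm_mutex */
               return -EBUSY;
           error = suspend_devices_and_enter(state);
           mutex_unlock(&system_transition_mutex);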
      Signed-off-by: NPingfan Liu <kernelfans@gmail.com>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      55f2503c
    • D
      mm: Allow non-direct-map arguments to free_reserved_area() · 0d834328
       Committed by Dave Hansen
      free_reserved_area() takes pointers as arguments to show which addresses
      should be freed.  However, it does this in a somewhat ambiguous way.  If it
      gets a kernel direct map address, it always works.  However, if it gets an
      address that is part of the kernel image alias mapping, it can fail.
      
      It fails if all of the following happen:
       * The specified address is part of the kernel image alias
       * Poisoning is requested (forcing a memset())
       * The address is in a read-only portion of the kernel image
      
      The memset() fails on the read-only mapping, of course.
      free_reserved_area() *is* called both on the direct map and on kernel image
      alias addresses.  We've just lucked out thus far that the kernel image
      alias areas it gets used on are read-write.  I'm fairly sure this has been
      just a happy accident.
      
      It is quite easy to make free_reserved_area() work for all cases: just
      convert the address to a direct map address before doing the memset(), and
      do this unconditionally.  There is little chance of a regression here
      because we previously did a virt_to_page() on the address for the memset,
      so we know these are not highmem pages for which virt_to_page() would fail.
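       
       A condensed sketch of that conversion (illustrative; only the per-page
       part of free_reserved_area() is shown):
       
           for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
               struct page *page = virt_to_page(pos);
               /*
                * page_address() yields the direct map alias, which is always
                * writable, while 'pos' may point into a read-only kernel
                * image alias.  Poison through the direct map unconditionally.
                */
               void *direct_map_addr = page_address(page);
       
               if ((unsigned int)poison <= 0xFF)
                   memset(direct_map_addr, poison, PAGE_SIZE);
       
               free_reserved_page(page);
           }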
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
      0d834328
   13. 17 Jul, 2018 1 commit
   14. 15 Jul, 2018 1 commit
    • P
      mm: zero unavailable pages before memmap init · e181ae0c
       Committed by Pavel Tatashin
       We must zero struct pages for memory that is not backed by physical
       memory, or that the kernel does not have access to.
      
      Recently, there was a change which zeroed all memmap for all holes in
      e820.  Unfortunately, it introduced a bug that is discussed here:
      
        https://www.spinics.net/lists/linux-mm/msg156764.html
      
      Linus, also saw this bug on his machine, and confirmed that reverting
      commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") fixes the issue.
      
       The problem is that we incorrectly zero some struct pages after they
       were set up.
      
      The fix is to zero unavailable struct pages prior to initializing of
      struct pages.
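       
       A condensed sketch of the resulting ordering (illustrative; the zone
       boundary setup in free_area_init_nodes() is omitted, and the exact
       placement of the call is an assumption):
       
           void __init free_area_init_nodes(unsigned long *max_zone_pfn)
           {
               int nid;
       
               /*
                * Zero the struct pages of ranges with no backing memory
                * *before* the regular memmap init runs, so an already
                * initialized struct page is never zeroed again afterwards.
                */
               zero_resv_unavail();
       
               for_each_online_node(nid)
                   free_area_init_node(nid, NULL,
                                       find_min_pfn_for_node(nid), NULL);
           }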
      
      A more detailed fix should come later that would avoid double zeroing
      cases: one in __init_single_page(), the other one in
      zero_resv_unavail().
      
      Fixes: 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e181ae0c
   15. 15 Jun, 2018 1 commit
   16. 08 Jun, 2018 8 commits
   17. 26 May, 2018 1 commit
    • M
      mm, memory_hotplug: make has_unmovable_pages more robust · 15c30bc0
       Committed by Michal Hocko
      Oscar has reported:
      : Due to an unfortunate setting with movablecore, memblocks containing bootmem
      : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
      : So while trying to remove that memory, the system failed in do_migrate_range
      : and __offline_pages never returned.
      :
      : This can be reproduced by running
      : qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
      : and movablecore=4G kernel command line
      :
      : linux kernel: BIOS-provided physical RAM map:
      : linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
      : linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
      : linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
      : linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
      : linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
      : linux kernel: NX (Execute Disable) protection: active
      : linux kernel: SMBIOS 2.8 present.
      : linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
      : linux kernel: Hypervisor detected: KVM
      : linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
      : linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
      : linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
      :
      : linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
      : linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
      : linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
      : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
      : linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
      : linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
      : linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
      : linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
      :
      : zoneinfo shows that the zone movable is placed into both numa nodes:
      : Node 0, zone  Movable
      :   pages free     160140
      :         min      1823
      :         low      2278
      :         high     2733
      :         spanned  262144
      :         present  262144
      :         managed  245670
      : Node 1, zone  Movable
      :   pages free     448427
      :         min      3827
      :         low      4783
      :         high     5739
      :         spanned  524288
      :         present  524288
      :         managed  515766
      
       Note how only Node 0 has a hotpluggable memory region, which would rule
       it out from the early memblock allocations (most likely memmap).  Node 1
       will surely contain memmaps on the same node, and those would prevent
       offlining from succeeding.  So this is arguably a configuration issue,
       although one could argue that we should be more clever and rule out
       early allocations from the movable zone.  That would be correct but
       probably not worth the effort considering what a hack movablecore is.
      
       Anyway, we could do better for those cases.  We rely on
       start_isolate_page_range() resp. has_unmovable_pages() to do their job.
       The first one isolates the whole range to be offlined so that we do not
       allocate from it anymore, and the latter makes sure we are not stumbling
       over non-migratable pages.
      
       has_unmovable_pages() is overly optimistic, however.  It doesn't check
       all the pages if we are within zone_movable, because we rely on those
       pages always being migratable.  As it turns out, we are still not
       perfect there.  While bootmem pages in zone_movable sound like a clear
       bug which should be fixed, let's remove the optimization for now and
       warn if we encounter unmovable pages in zone_movable in the meantime.
       That should help for now at least.
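       
       Sketched, the change amounts to dropping the early "everything in
       ZONE_MOVABLE is fine" exit from has_unmovable_pages() and warning
       instead when an unmovable page is actually found there (illustrative;
       the reporting detail is an assumption):
       
           /* before: trust ZONE_MOVABLE unconditionally -- this exit is removed */
           if (zone_idx(zone) == ZONE_MOVABLE)
               return false;
       
           /* ... per-page movability checks run as before ... */
       
           /* after: an unmovable page in ZONE_MOVABLE is unexpected, say so */
           WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
           return true;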
      
      Btw.  this wasn't a real problem until commit 72b39cfc ("mm,
      memory_hotplug: do not fail offlining too early") because we used to
      have a small number of retries and then failed.  This turned out to be
      too fragile though.
      
       Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org
       Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NOscar Salvador <osalvador@techadventures.net>
      Tested-by: NOscar Salvador <osalvador@techadventures.net>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Reza Arbab <arbab@linux.vnet.ibm.com>
      Cc: Igor Mammedov <imammedo@redhat.com>
      Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15c30bc0
   18. 25 May, 2018 1 commit
    • J
      Revert "mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE" · d883c6cf
       Committed by Joonsoo Kim
      This reverts the following commits that change CMA design in MM.
      
       3d2054ad ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
      
       1d47a3ec ("mm/cma: remove ALLOC_CMA")
      
       bad8c6c0 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
      
      Ville reported a following error on i386.
      
        Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
        microcode: microcode updated early to revision 0x4, date = 2013-06-28
        Initializing CPU#0
        Initializing HighMem for node 0 (000377fe:00118000)
        Initializing Movable for node 0 (00000001:00118000)
        BUG: Bad page state in process swapper  pfn:377fe
        page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
        flags: 0x80000000()
        raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
        page dumped because: nonzero mapcount
        Modules linked in:
        CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
        Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
        Call Trace:
         dump_stack+0x60/0x96
         bad_page+0x9a/0x100
         free_pages_check_bad+0x3f/0x60
         free_pcppages_bulk+0x29d/0x5b0
         free_unref_page_commit+0x84/0xb0
         free_unref_page+0x3e/0x70
         __free_pages+0x1d/0x20
         free_highmem_page+0x19/0x40
         add_highpages_with_active_regions+0xab/0xeb
         set_highmem_pages_init+0x66/0x73
         mem_init+0x1b/0x1d7
         start_kernel+0x17a/0x363
         i386_start_kernel+0x95/0x99
         startup_32_smp+0x164/0x168
      
       The reason for this error is that the span of ZONE_MOVABLE is extended
       to the whole node span for future CMA initialization, and normal memory
       is wrongly freed here.  I submitted the fix and it seems to work, but
       another problem happened.
      
       It is too late to fix the latter problem, so I decided to revert the
       series.
      Reported-by: NVille Syrjälä <ville.syrjala@linux.intel.com>
      Acked-by: NLaura Abbott <labbott@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d883c6cf
   19. 12 Apr, 2018 2 commits