1. 22 12月, 2018 2 次提交
    • O
      mm, page_alloc: fix has_unmovable_pages for HugePages · 17e2e7d7
      Oscar Salvador 提交于
      While playing with gigantic hugepages and memory_hotplug, I triggered
      the following #PF when "cat memoryX/removable":
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 1481 Comm: cat Tainted: G            E     4.20.0-rc6-mm1-1-default+ #18
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:has_unmovable_pages+0x154/0x210
        Call Trace:
         is_mem_section_removable+0x7d/0x100
         removable_show+0x90/0xb0
         dev_attr_show+0x1c/0x50
         sysfs_kf_seq_show+0xca/0x1b0
         seq_read+0x133/0x380
         __vfs_read+0x26/0x180
         vfs_read+0x89/0x140
         ksys_read+0x42/0x90
         do_syscall_64+0x5b/0x180
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is we do not pass the Head to page_hstate(), and so, the call
      to compound_order() in page_hstate() returns 0, so we end up checking
      all hstates's size to match PAGE_SIZE.
      
      Obviously, we do not find any hstate matching that size, and we return
      NULL.  Then, we dereference that NULL pointer in
      hugepage_migration_supported() and we got the #PF from above.
      
      Fix that by getting the head page before calling page_hstate().
      
      Also, since gigantic pages span several pageblocks, re-adjust the logic
      for skipping pages.  While are it, we can also get rid of the
      round_up().
      
      [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
        Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
      Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.deSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      17e2e7d7
    • M
      mm, memory_hotplug: initialize struct pages for the full memory section · 2830bf6f
      Mikhail Zaslonko 提交于
      If memory end is not aligned with the sparse memory section boundary,
      the mapping of such a section is only partly initialized.  This may lead
      to VM_BUG_ON due to uninitialized struct page access from
      is_mem_section_removable() or test_pages_in_a_zone() function triggered
      by memory_hotplug sysfs handlers:
      
      Here are the the panic examples:
       CONFIG_DEBUG_VM=y
       CONFIG_DEBUG_VM_PGFLAGS=y
      
       kernel parameter mem=2050M
       --------------------------
       page:000003d082008000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( test_pages_in_a_zone+0xde/0x160)
         show_valid_zones+0x5c/0x190
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         test_pages_in_a_zone+0xde/0x160
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
       kernel parameter mem=3075M
       --------------------------
       page:000003d08300c000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( is_mem_section_removable+0xb4/0x190)
         show_mem_removable+0x9a/0xd8
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         is_mem_section_removable+0xb4/0x190
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      Fix the problem by initializing the last memory section of each zone in
      memmap_init_zone() till the very end, even if it goes beyond the zone end.
      
      Michal said:
      
      : This has alwways been problem AFAIU.  It just went unnoticed because we
      : have zeroed memmaps during allocation before f7f99100 ("mm: stop
      : zeroing memory during allocation in vmemmap") and so the above test
      : would simply skip these ranges as belonging to zone 0 or provided a
      : garbage.
      :
      : So I guess we do care for post f7f99100 kernels mostly and
      : therefore Fixes: f7f99100 ("mm: stop zeroing memory during
      : allocation in vmemmap")
      
      Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: NMikhail Zaslonko <zaslonko@linux.ibm.com>
      Reviewed-by: NGerald Schaefer <gerald.schaefer@de.ibm.com>
      Suggested-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Tested-by: NMikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2830bf6f
  2. 01 12月, 2018 1 次提交
    • W
      mm/page_alloc.c: fix calculation of pgdat->nr_zones · 8f416836
      Wei Yang 提交于
      init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
      'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
      case, while not exact in hot-plug situation.
      
      This function is used in two places:
      
        * free_area_init_core()
        * move_pfn_range_to_zone()
      
      In the first case, we are sure zone index increase monotonically.  While
      in the second one, this is under users control.
      
      One way to reproduce this is:
      ----------------------------
      
      1. create a virtual machine with empty node1
      
         -m 4G,slots=32,maxmem=32G \
         -smp 4,maxcpus=8          \
         -numa node,nodeid=0,mem=4G,cpus=0-3 \
         -numa node,nodeid=1,mem=0G,cpus=4-7
      
      2. hot-add cpu 3-7
      
         cpu-add [3-7]
      
      2. hot-add memory to nod1
      
         object_add memory-backend-ram,id=ram0,size=1G
         device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      
      3. online memory with following order
      
         echo online_movable > memory47/state
         echo online > memory40/state
      
      After this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1)
      instead of (ZONE_MOVABLE + 1).
      
      Michal said:
       "Having an incorrect nr_zones might result in all sorts of problems
        which would be quite hard to debug (e.g. reclaim not considering the
        movable zone). I do not expect many users would suffer from this it
        but still this is trivial and obviously right thing to do so
        backporting to the stable tree shouldn't be harmful (last famous
        words)"
      
      Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8f416836
  3. 19 11月, 2018 2 次提交
    • M
      mm, page_alloc: check for max order in hot path · c63ae43b
      Michal Hocko 提交于
      Konstantin has noticed that kvmalloc might trigger the following
      warning:
      
        WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
        [...]
        Call Trace:
         fragmentation_index+0x76/0x90
         compaction_suitable+0x4f/0xf0
         shrink_node+0x295/0x310
         node_reclaim+0x205/0x250
         get_page_from_freelist+0x649/0xad0
         __alloc_pages_nodemask+0x12a/0x2a0
         kmalloc_large_node+0x47/0x90
         __kmalloc_node+0x22b/0x2e0
         kvmalloc_node+0x3e/0x70
         xt_alloc_table_info+0x3a/0x80 [x_tables]
         do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
         nf_setsockopt+0x44/0x60
         SyS_setsockopt+0x6f/0xc0
         do_syscall_64+0x67/0x120
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      the problem is that we only check for an out of bound order in the slow
      path and the node reclaim might happen from the fast path already.  This
      is fixable by making sure that kvmalloc doesn't ever use kmalloc for
      requests that are larger than KMALLOC_MAX_SIZE but this also shows that
      the code is rather fragile.  A recent UBSAN report just underlines that
      by the following report
      
        UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
        shift exponent 51 is too large for 32-bit type 'int'
        CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0xd2/0x148 lib/dump_stack.c:113
         ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
         __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
         __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
         zone_watermark_fast mm/page_alloc.c:3216 [inline]
         get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
         __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
         alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
         alloc_pages include/linux/gfp.h:509 [inline]
         __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
         dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
         raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
         raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
         fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
         fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
         __blkdev_driver_ioctl block/ioctl.c:303 [inline]
         blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
         block_ioctl+0x105/0x150 fs/block_dev.c:1883
         vfs_ioctl fs/ioctl.c:46 [inline]
         do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
         ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
         __do_sys_ioctl fs/ioctl.c:709 [inline]
         __se_sys_ioctl fs/ioctl.c:707 [inline]
         __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
         do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note that this is not a kvmalloc path.  It is just that the fast path
      really depends on having sanitzed order as well.  Therefore move the
      order check to the fast path.
      
      Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reported-by: NKyungtae Kim <kt0755@gmail.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Byoungyoung Lee <lifeasageek@gmail.com>
      Cc: "Dae R. Jeong" <threeearcat@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c63ae43b
    • M
      mm, memory_hotplug: check zone_movable in has_unmovable_pages · 9d789999
      Michal Hocko 提交于
      Page state checks are racy.  Under a heavy memory workload (e.g.  stress
      -m 200 -t 2h) it is quite easy to hit a race window when the page is
      allocated but its state is not fully populated yet.  A debugging patch to
      dump the struct page state shows
      
        has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
        page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
        flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
      
      Note that the state has been checked for both PageLRU and PageSwapBacked
      already.  Closing this race completely would require some sort of retry
      logic.  This can be tricky and error prone (think of potential endless
      or long taking loops).
      
      Workaround this problem for movable zones at least.  Such a zone should
      only contain movable pages.  Commit 15c30bc0 ("mm, memory_hotplug:
      make has_unmovable_pages more robust") has told us that this is not
      strictly true though.  Bootmem pages should be marked reserved though so
      we can move the original check after the PageReserved check.  Pages from
      other zones are still prone to races but we even do not pretend that
      memory hotremove works for those so pre-mature failure doesn't hurt that
      much.
      
      Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
      Fixes: 15c30bc0 ("mm, memory_hotplug: make has_unmovable_pages more robust")
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NBaoquan He <bhe@redhat.com>
      Tested-by: NBaoquan He <bhe@redhat.com>
      Acked-by: NBaoquan He <bhe@redhat.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Acked-by: NBalbir Singh <bsingharora@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9d789999
  4. 31 10月, 2018 6 次提交
    • M
      memblock: stop using implicit alignment to SMP_CACHE_BYTES · 7e1c4e27
      Mike Rapoport 提交于
      When a memblock allocation APIs are called with align = 0, the alignment
      is implicitly set to SMP_CACHE_BYTES.
      
      Implicit alignment is done deep in the memblock allocator and it can
      come as a surprise.  Not that such an alignment would be wrong even
      when used incorrectly but it is better to be explicit for the sake of
      clarity and the prinicple of the least surprise.
      
      Replace all such uses of memblock APIs with the 'align' parameter
      explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
      in the memblock internal allocation functions.
      
      For the case when memblock APIs are used via helper functions, e.g.  like
      iommu_arena_new_node() in Alpha, the helper functions were detected with
      Coccinelle's help and then manually examined and updated where
      appropriate.
      
      The direct memblock APIs users were updated using the semantic patch below:
      
      @@
      expression size, min_addr, max_addr, nid;
      @@
      (
      |
      - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
      + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
      nid)
      |
      - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
      + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
      nid)
      |
      - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
      + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
      |
      - memblock_alloc(size, 0)
      + memblock_alloc(size, SMP_CACHE_BYTES)
      |
      - memblock_alloc_raw(size, 0)
      + memblock_alloc_raw(size, SMP_CACHE_BYTES)
      |
      - memblock_alloc_from(size, 0, min_addr)
      + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
      |
      - memblock_alloc_nopanic(size, 0)
      + memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
      |
      - memblock_alloc_low(size, 0)
      + memblock_alloc_low(size, SMP_CACHE_BYTES)
      |
      - memblock_alloc_low_nopanic(size, 0)
      + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
      |
      - memblock_alloc_from_nopanic(size, 0, min_addr)
      + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
      |
      - memblock_alloc_node(size, 0, nid)
      + memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
      )
      
      [mhocko@suse.com: changelog update]
      [akpm@linux-foundation.org: coding-style fixes]
      [rppt@linux.ibm.com: fix missed uses of implicit alignment]
        Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
      Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: Paul Burton <paul.burton@mips.com>	[MIPS]
      Acked-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7e1c4e27
    • M
      mm: remove include/linux/bootmem.h · 57c8a661
      Mike Rapoport 提交于
      Move remaining definitions and declarations from include/linux/bootmem.h
      into include/linux/memblock.h and remove the redundant header.
      
      The includes were replaced with the semantic patch below and then
      semi-automated removal of duplicated '#include <linux/memblock.h>
      
      @@
      @@
      - #include <linux/bootmem.h>
      + #include <linux/memblock.h>
      
      [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
      [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
        Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
      [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
        Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
      Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      57c8a661
    • M
      memblock: rename __free_pages_bootmem to memblock_free_pages · 7c2ee349
      Mike Rapoport 提交于
      The conversion is done using
      
      sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
          $(git grep -l __free_pages_bootmem)
      
      Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c2ee349
    • M
      memblock: rename free_all_bootmem to memblock_free_all · c6ffc5ca
      Mike Rapoport 提交于
      The conversion is done using
      
      sed -i 's@free_all_bootmem@memblock_free_all@' \
          $(git grep -l free_all_bootmem)
      
      Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c6ffc5ca
    • M
      memblock: remove _virt from APIs returning virtual address · eb31d559
      Mike Rapoport 提交于
      The conversion is done using
      
      sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
      	$(git grep -l memblock_virt_alloc)
      
      Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb31d559
    • M
      mm: remove CONFIG_HAVE_MEMBLOCK · aca52c39
      Mike Rapoport 提交于
      All architecures use memblock for early memory management. There is no need
      for the CONFIG_HAVE_MEMBLOCK configuration option.
      
      [rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
        Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
      [rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
        Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
      [rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
        Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
      Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Tested-by: NJonathan Cameron <jonathan.cameron@huawei.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Greentime Hu <green.hu@gmail.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Guan Xuetao <gxt@pku.edu.cn>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Jonas Bonn <jonas@southpole.se>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Mark Salter <msalter@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Simek <monstr@monstr.eu>
      Cc: Palmer Dabbelt <palmer@sifive.com>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Serge Semin <fancer.lancer@gmail.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aca52c39
  5. 27 10月, 2018 13 次提交
    • P
      mm: return zero_resv_unavail optimization · ec393a0f
      Pavel Tatashin 提交于
      When checking for valid pfns in zero_resv_unavail(), it is not necessary
      to verify that pfns within pageblock_nr_pages ranges are valid, only the
      first one needs to be checked.  This is because memory for pages are
      allocated in contiguous chunks that contain pageblock_nr_pages struct
      pages.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.comSigned-off-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec393a0f
    • N
      mm: zero remaining unavailable struct pages · 907ec5fc
      Naoya Horiguchi 提交于
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So, the node is marked as Normal zone even if the SRAT
      has Hot pluggable affinity.
      
      First and second patch fix the original issue which the commit tried to
      fix, then revert the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This causes shrinking node 0's pfn range because it is calculated by the
      address range of memblock.memory.  So some of struct pages in the gap
      range are left uninitialized.
      
      We have a function zero_resv_unavail() which does zeroing the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable range, which fixes the reported
      issue.
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: NOscar Salvador <osalvador@suse.de>
      Tested-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      907ec5fc
    • P
      mm: move mirrored memory specific code outside of memmap_init_zone · a9a9e77f
      Pavel Tatashin 提交于
      memmap_init_zone, is getting complex, because it is called from different
      contexts: hotplug, and during boot, and also because it must handle some
      architecture quirks.  One of them is mirrored memory.
      
      Move the code that decides whether to skip mirrored memory outside of
      memmap_init_zone, into a separate function.
      
      [pasha.tatashin@oracle.com: uninline overlap_memmap_init()]
        Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
      Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a9a9e77f
    • P
      mm: calculate deferred pages after skipping mirrored memory · d3035be4
      Pavel Tatashin 提交于
      update_defer_init() should be called only when struct page is about to be
      initialized. Because it counts number of initialized struct pages, but
      there we may skip struct pages if there is some mirrored memory.
      
      So move, update_defer_init() after checking for mirrored memory.
      
      Also, rename update_defer_init() to defer_init() and reverse the return
      boolean to emphasize that this is a boolean function, that tells that the
      reset of memmap initialization should be deferred.
      
      Make this function self-contained: do not pass number of already
      initialized pages in this zone by using static counters.
      
      I found this bug by reading the code.  The effect is that fewer than
      expected struct pages are initialized early in boot, and it is possible
      that in some corner cases we may fail to boot when mirrored pages are
      used.  The deferred on demand code should somewhat mitigate this.  But
      this still brings some inconsistencies compared to when booting without
      mirrored pages, so it is better to fix.
      
      [pasha.tatashin@oracle.com: add comment about defer_init's lack of locking]
        Link: http://lkml.kernel.org/r/20180726193509.3326-3-pasha.tatashin@oracle.com
      [akpm@linux-foundation.org: make defer_init non-inline, __meminit]
      Link: http://lkml.kernel.org/r/20180724235520.10200-3-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d3035be4
    • P
      mm: make memmap_init a proper function · dfb3ccd0
      Pavel Tatashin 提交于
      memmap_init is sometimes a macro sometimes a function based on
      __HAVE_ARCH_MEMMAP_INIT.  It is only a function on ia64.  Make memmap_init
      a weak function instead, and let ia64 redefine it.
      
      Link: http://lkml.kernel.org/r/20180724235520.10200-2-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dfb3ccd0
    • D
      mm/page_alloc.c: initialize num_movable in move_freepages() · 4a222127
      David Rientjes 提交于
      If move_freepages_block() returns 0 because !zone_spans_pfn(),
      *num_movable can hold the value from the stack because it does not get
      initialized in move_freepages().
      
      Move the initialization to move_freepages_block() to guarantee the value
      actually makes sense.
      
      This currently doesn't affect its only caller where num_movable != NULL,
      so no bug fix, but just more robust.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1810051355490.212229@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4a222127
    • A
      mm: defer ZONE_DEVICE page initialization to the point where we init pgmap · 966cf44f
      Alexander Duyck 提交于
      The ZONE_DEVICE pages were being initialized in two locations.  One was
      with the memory_hotplug lock held and another was outside of that lock.
      The problem with this is that it was nearly doubling the memory
      initialization time.  Instead of doing this twice, once while holding a
      global lock and once without, I am opting to defer the initialization to
      the one outside of the lock.  This allows us to avoid serializing the
      overhead for memory init and we can instead focus on per-node init times.
      
      One issue I encountered is that devm_memremap_pages and
      hmm_devmmem_pages_create were initializing only the pgmap field the same
      way.  One wasn't initializing hmm_data, and the other was initializing it
      to a poison value.  Since this is something that is exposed to the driver
      in the case of hmm I am opting for a third option and just initializing
      hmm_data to 0 since this is going to be exposed to unknown third party
      drivers.
      
      [alexander.h.duyck@linux.intel.com: fix reference count for pgmap in devm_memremap_pages]
        Link: http://lkml.kernel.org/r/20181008233404.1909.37302.stgit@localhost.localdomain
      Link: http://lkml.kernel.org/r/20180925202053.3576.66039.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Tested-by: NDan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      966cf44f
    • A
      mm: create non-atomic version of SetPageReserved for init use · d483da5b
      Alexander Duyck 提交于
      It doesn't make much sense to use the atomic SetPageReserved at init time
      when we are using memset to clear the memory and manipulating the page
      flags via simple "&=" and "|=" operations in __init_single_page.
      
      This patch adds a non-atomic version __SetPageReserved that can be used
      during page init and shows about a 10% improvement in initialization times
      on the systems I have available for testing.  On those systems I saw
      initialization times drop from around 35 seconds to around 32 seconds to
      initialize a 3TB block of persistent memory.  I believe the main advantage
      of this is that it allows for more compiler optimization as the __set_bit
      operation can be reordered whereas the atomic version cannot.
      
      I tried adding a bit of documentation based on f1dd2cd1 ("mm,
      memory_hotplug: do not associate hotadded memory to zones until online").
      
      Ideally the reserved flag should be set earlier since there is a brief
      window where the page is initialization via __init_single_page and we have
      not set the PG_Reserved flag.  I'm leaving that for a future patch set as
      that will require a more significant refactor.
      
      Link: http://lkml.kernel.org/r/20180925202018.3576.11607.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d483da5b
    • M
      mm, page_alloc: drop should_suppress_show_mem · 2c029a1e
      Michal Hocko 提交于
      should_suppress_show_mem() was introduced to reduce the overhead of
      show_mem on large NUMA systems.  Things have changed since then though.
      Namely c78e9363 ("mm: do not walk all of system memory during
      show_mem") has reduced the overhead considerably.
      
      Moreover warn_alloc_show_mem clears SHOW_MEM_FILTER_NODES when called from
      the IRQ context already so we are not printing per node stats.
      
      Remove should_suppress_show_mem because we are losing potentially
      interesting information about allocation failures.  We have seen a bug
      report where system gets unresponsive under memory pressure and there is
      only
      
      kernel: [2032243.696888] qlge 0000:8b:00.1 ql1: Could not get a page chunk, i=8, clean_idx =200 .
      kernel: [2032243.710725] swapper/7: page allocation failure: order:1, mode:0x1084120(GFP_ATOMIC|__GFP_COLD|__GFP_COMP)
      
      without an additional information for debugging.  It would be great to see
      the state of the page allocator at the moment.
      
      Link: http://lkml.kernel.org/r/20180907114334.7088-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c029a1e
    • J
      psi: pressure stall information for CPU, memory, and IO · eb414681
      Johannes Weiner 提交于
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      eb414681
    • V
      mm: rename and change semantics of nr_indirectly_reclaimable_bytes · b29940c1
      Vlastimil Babka 提交于
      The vmstat counter NR_INDIRECTLY_RECLAIMABLE_BYTES was introduced by
      commit eb592546 ("mm: introduce NR_INDIRECTLY_RECLAIMABLE_BYTES") with
      the goal of accounting objects that can be reclaimed, but cannot be
      allocated via a SLAB_RECLAIM_ACCOUNT cache.  This is now possible via
      kmalloc() with __GFP_RECLAIMABLE flag, and the dcache external names user
      is converted.
      
      The counter is however still useful for accounting direct page allocations
      (i.e.  not slab) with a shrinker, such as the ION page pool.  So keep it,
      and:
      
      - change granularity to pages to be more like other counters; sub-page
        allocations should be able to use kmalloc
      - rename the counter to NR_KERNEL_MISC_RECLAIMABLE
      - expose the counter again in vmstat as "nr_kernel_misc_reclaimable"; we can
        again remove the check for not printing "hidden" counters
      
      Link: http://lkml.kernel.org/r/20180731090649.16028-5-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Vijayanand Jitta <vjitta@codeaurora.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Sumit Semwal <sumit.semwal@linaro.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b29940c1
    • O
      mm/page_alloc.c: clean up check_for_memory() · 7b0e0c0e
      Oscar Salvador 提交于
      check_for_memory() looks a bit confusing.  First of all, we have this:
      
      if (N_MEMORY == N_NORMAL_MEMORY)
      	return;
      
      Checking the ENUM declaration, looks like N_MEMORY canot be equal to
      N_NORMAL_MEMORY.
      
      I could not find where N_MEMORY is set to N_NORMAL_MEMORY, or the other
      way around either, so unless I am missing something, this condition will
      never evaluate to true.  It makes sense to get rid of it.
      
      Moving forward, the operations within the loop look a bit confusing as
      well.
      
      We set N_HIGH_MEMORY unconditionally, and then we set N_NORMAL_MEMORY in
      case we have CONFIG_HIGHMEM (N_NORMAL_MEMORY != N_HIGH_MEMORY) and zone <=
      ZONE_NORMAL.  (N_HIGH_MEMORY falls back to N_NORMAL_MEMORY on
      !CONFIG_HIGHMEM systems, and that is why we can just go ahead and set
      N_HIGH_MEMORY unconditionally)
      
      Although this works, it is a bit subtle.
      
      I think that this could be easier to follow:
      
      First, we should only set N_HIGH_MEMORY in case we have CONFIG_HIGHMEM.
      And then we should set N_NORMAL_MEMORY in case zone <= ZONE_NORMAL,
      without further checking whether we have CONFIG_HIGHMEM or not.
      
      Link: http://lkml.kernel.org/r/20180828210158.4617-1-osalvador@techadventures.netSigned-off-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Michael Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7b0e0c0e
    • M
      mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry() · 15f570bf
      Michal Hocko 提交于
      Tetsuo Handa has reported that it is possible to bypass the short sleep
      for PF_WQ_WORKER threads which was introduced by commit 373ccbe5
      ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
      any progress") and lock up the system if OOM.
      
      The primary reason is that WQ_MEM_RECLAIM WQs are not guaranteed to run
      even when they have a rescuer available.  Those workers might be essential
      for reclaim to make a forward progress, however.  If we are too unlucky
      all the allocations requests can get stuck waiting for a WQ_MEM_RECLAIM
      work item and the system is essentially stuck in an OOM condition without
      much hope to move on.  Tetsuo has seen the reclaim stuck on
      drain_local_pages_wq or xlog_cil_push_work (xfs).  There might be others.
      
      Since should_reclaim_retry() should be a natural reschedule point,
      let's do the short sleep for PF_WQ_WORKER threads unconditionally in
      order to guarantee that other pending work items are started.  This
      will workaround this problem and it is less fragile than hunting down
      when the sleep is missed.  Having a single sleeping point is more
      robust.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols to save a couple of lines]
      Link: http://lkml.kernel.org/r/20180827135101.15700-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Debugged-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      15f570bf
  6. 09 10月, 2018 1 次提交
  7. 02 10月, 2018 1 次提交
    • M
      mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Mel Gorman 提交于
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or not performance difference
      implying that there is not a heavily reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Acked-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.netSigned-off-by: NIngo Molnar <mingo@kernel.org>
      efaffc5e
  8. 05 9月, 2018 1 次提交
  9. 30 8月, 2018 1 次提交
  10. 24 8月, 2018 1 次提交
  11. 23 8月, 2018 5 次提交
  12. 18 8月, 2018 4 次提交
    • A
      mm, page_alloc: double zone's batchsize · d8a759b5
      Aaron Lu 提交于
      To improve page allocator's performance for order-0 pages, each CPU has
      a Per-CPU-Pageset(PCP) per zone.  Whenever an order-0 page is needed,
      PCP will be checked first before asking pages from Buddy.  When PCP is
      used up, a batch of pages will be fetched from Buddy to improve
      performance and the size of batch can affect performance.
      
      zone's batch size gets doubled last time by commit ba56e91c("mm:
      page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
      then, CPU has envolved a lot and CPU's cache sizes also increased.
      
      Dave Hansen is concerned the current batch size doesn't fit well with
      modern hardware and suggested me to do two things: first, use a page
      allocator intensive benchmark, e.g.  will-it-scale/page_fault1 to find
      out how performance changes with different batch sizes on various
      machines and then choose a new default batch size; second, see how this
      new batch size work with other workloads.
      
      In the first test, we saw performance gains on high-core-count systems
      and little to no effect on older systems with more modest core counts.
      In this phase's test data, two candidates: 63 and 127 are chosen.
      
      In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
      and more will-it-scale sub-tests are tested to see how these two
      candidates work with these workloads and decides a new default according
      to their results.
      
      Most test results are flat.  will-it-scale/page_fault2 process mode has
      10%-18% performance increase on 4-sockets Skylake and Broadwell.
      vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
      4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
      drop.  Further analysis showed that, with a larger pcp->batch and thus
      larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
      maintained in this patch), zone lock contention shifted to LRU add side
      lock contention and that caused performance drop.  This performance drop
      might be mitigated by others' work on optimizing LRU lock.
      
      Another downside of increasing pcp->batch is, when PCP is used up and need
      to fetch a batch of pages from Buddy, since batch is increased, that time
      can be longer than before.  My understanding is, this doesn't affect
      slowpath where direct reclaim and compaction dominates.  For fastpath,
      throughput is a win(according to will-it-scale/page_fault1) but worst
      latency can be larger now.
      
      Overall, I think double the batch size from 31 to 63 is relatively safe
      and provide good performance boost for high-core-count systems.
      
      The two phase's test results are listed below(all tests are done with THP
      disabled).
      
      Phase one(will-it-scale/page_fault1) test results:
      
      Skylake-EX: increased batch size has a good effect on zone->lock
      contention, though LRU contention will rise at the same time and
      limited the final performance increase.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   15345900    +0.00%       64%                 8%           72%
       53   17903847   +16.67%       32%                38%           70%
       63   17992886   +17.25%       24%                45%           69%
       73   18022825   +17.44%       10%                61%           71%
      119   18023401   +17.45%        4%                66%           70%
      127   18029012   +17.48%        3%                66%           69%
      137   18036075   +17.53%        4%                66%           70%
      165   18035964   +17.53%        2%                67%           69%
      188   18101105   +17.95%        2%                67%           69%
      223   18130951   +18.15%        2%                67%           69%
      255   18118898   +18.07%        2%                67%           69%
      267   18101559   +17.96%        2%                67%           69%
      299   18160468   +18.34%        2%                68%           70%
      320   18139845   +18.21%        2%                67%           69%
      393   18160869   +18.34%        2%                68%           70%
      424   18170999   +18.41%        2%                68%           70%
      458   18144868   +18.24%        2%                68%           70%
      467   18142366   +18.22%        2%                68%           70%
      498   18154549   +18.30%        1%                68%           69%
      511   18134525   +18.17%        1%                69%           70%
      
      Broadwell-EX: similar pattern as Skylake-EX.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   16703983    +0.00%       67%                 7%           74%
       53   18195393    +8.93%       43%                28%           71%
       63   18288885    +9.49%       38%                33%           71%
       73   18344329    +9.82%       35%                37%           72%
      119   18535529   +10.96%       24%                46%           70%
      127   18513596   +10.83%       23%                48%           71%
      137   18514327   +10.84%       23%                48%           71%
      165   18511840   +10.82%       22%                49%           71%
      188   18593478   +11.31%       17%                53%           70%
      223   18601667   +11.36%       17%                52%           69%
      255   18774825   +12.40%       12%                58%           70%
      267   18754781   +12.28%        9%                60%           69%
      299   18892265   +13.10%        7%                63%           70%
      320   18873812   +12.99%        8%                62%           70%
      393   18891174   +13.09%        6%                64%           70%
      424   18975108   +13.60%        6%                64%           70%
      458   18932364   +13.34%        8%                62%           70%
      467   18960891   +13.51%        5%                65%           70%
      498   18944526   +13.41%        5%                64%           69%
      511   18960839   +13.51%        5%                64%           69%
      
      Skylake-EP: although increased batch reduced zone->lock contention, but
      the effect is not as good as EX: zone->lock contention is still as high as
      20% with a very high batch value instead of 1% on Skylake-EX or 5% on
      Broadwell-EX.  Also, total_contention actually decreased with a higher
      batch but that doesn't translate to performance increase.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   9554867    +0.00%       66%                 3%           69%
       53   9855486    +3.15%       63%                 3%           66%
       63   9980145    +4.45%       62%                 4%           66%
       73   10092774   +5.63%       62%                 5%           67%
      119   10310061   +7.90%       45%                19%           64%
      127   10342019   +8.24%       42%                19%           61%
      137   10358182   +8.41%       42%                21%           63%
      165   10397060   +8.81%       37%                24%           61%
      188   10341808   +8.24%       34%                26%           60%
      223   10349135   +8.31%       31%                27%           58%
      255   10327189   +8.08%       28%                29%           57%
      267   10344204   +8.26%       27%                29%           56%
      299   10325043   +8.06%       25%                30%           55%
      320   10310325   +7.91%       25%                31%           56%
      393   10293274   +7.73%       21%                31%           52%
      424   10311099   +7.91%       21%                32%           53%
      458   10321375   +8.02%       21%                32%           53%
      467   10303881   +7.84%       21%                32%           53%
      498   10332462   +8.14%       20%                33%           53%
      511   10325016   +8.06%       20%                32%           52%
      
      Broadwell-EP: zone->lock and lru lock had an agreement to make sure
      performance doesn't increase and they successfully managed to keep total
      contention at 70%.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   10121178   +0.00%       19%                50%           69%
       53   10142366   +0.21%        6%                63%           69%
       63   10117984   -0.03%       11%                58%           69%
       73   10123330   +0.02%        7%                63%           70%
      119   10108791   -0.12%        2%                67%           69%
      127   10166074   +0.44%        3%                66%           69%
      137   10141574   +0.20%        3%                66%           69%
      165   10154499   +0.33%        2%                68%           70%
      188   10124921   +0.04%        2%                67%           69%
      223   10137399   +0.16%        2%                67%           69%
      255   10143289   +0.22%        0%                68%           68%
      267   10123535   +0.02%        1%                68%           69%
      299   10140952   +0.20%        0%                68%           68%
      320   10163170   +0.41%        0%                68%           68%
      393   10000633   -1.19%        0%                69%           69%
      424   10087998   -0.33%        0%                69%           69%
      458   10187116   +0.65%        0%                69%           69%
      467   10146790   +0.25%        0%                69%           69%
      498   10197958   +0.76%        0%                69%           69%
      511   10152326   +0.31%        0%                69%           69%
      
      Haswell-EP: similar to Broadwell-EP.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   10442205   +0.00%       14%                48%           62%
       53   10442255   +0.00%        5%                57%           62%
       63   10452059   +0.09%        6%                57%           63%
       73   10482349   +0.38%        5%                59%           64%
      119   10454644   +0.12%        3%                60%           63%
      127   10431514   -0.10%        3%                59%           62%
      137   10423785   -0.18%        3%                60%           63%
      165   10481216   +0.37%        2%                61%           63%
      188   10448755   +0.06%        2%                61%           63%
      223   10467144   +0.24%        2%                61%           63%
      255   10480215   +0.36%        2%                61%           63%
      267   10484279   +0.40%        2%                61%           63%
      299   10466450   +0.23%        2%                61%           63%
      320   10452578   +0.10%        2%                61%           63%
      393   10499678   +0.55%        1%                62%           63%
      424   10481454   +0.38%        1%                62%           63%
      458   10473562   +0.30%        1%                62%           63%
      467   10484269   +0.40%        0%                62%           62%
      498   10505599   +0.61%        0%                62%           62%
      511   10483395   +0.39%        0%                62%           62%
      
      Westmere-EP: contention is pretty small so not interesting.  Note too high
      a batch value could hurt performance.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   4831523   +0.00%        2%                 3%            5%
       53   4834086   +0.05%        2%                 4%            6%
       63   4834262   +0.06%        2%                 3%            5%
       73   48328518   +0.03%        2%                 4%            6%
      119   4830534   -0.02%        1%                 3%            4%
      127   4827461   -0.08%        1%                 4%            5%
      137   4827459   -0.08%        1%                 3%            4%
      165   4820534   -0.23%        0%                 4%            4%
      188   4817947   -0.28%        0%                 3%            3%
      223   48096710   -0.45%        0%                 3%            3%
      255   4802463   -0.60%        0%                 4%            4%
      267   4801634   -0.62%        0%                 3%            3%
      299   4798047   -0.69%        0%                 3%            3%
      320   4793084   -0.80%        0%                 3%            3%
      393   4785877   -0.94%        0%                 3%            3%
      424   4782911   -1.01%        0%                 3%            3%
      458   4779346   -1.08%        0%                 3%            3%
      467   4780306   -1.06%        0%                 3%            3%
      498   4780589   -1.05%        0%                 3%            3%
      511   4773724   -1.20%        0%                 3%            3%
      
      Skylake-Desktop: similar to Westmere-EP, nothing interesting.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3906608   +0.00%        2%                 3%            5%
       53   3940164   +0.86%        2%                 3%            5%
       63   3937289   +0.79%        2%                 3%            5%
       73   3940201   +0.86%        2%                 3%            5%
      119   3933240   +0.68%        2%                 3%            5%
      127   3930514   +0.61%        2%                 4%            6%
      137   3938639   +0.82%        0%                 3%            3%
      165   3908755   +0.05%        0%                 3%            3%
      188   3905621   -0.03%        0%                 3%            3%
      223   3903015   -0.09%        0%                 4%            4%
      255   3889480   -0.44%        0%                 3%            3%
      267   3891669   -0.38%        0%                 4%            4%
      299   3898728   -0.20%        0%                 4%            4%
      320   3894547   -0.31%        0%                 4%            4%
      393   3875137   -0.81%        0%                 4%            4%
      424   3874521   -0.82%        0%                 3%            3%
      458   3880432   -0.67%        0%                 4%            4%
      467   3888715   -0.46%        0%                 3%            3%
      498   3888633   -0.46%        0%                 4%            4%
      511   3875305   -0.80%        0%                 5%            5%
      
      Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
      contention is higher than other desktops.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3511158   +0.00%        2%                 5%            7%
       53   3555445   +1.26%        2%                 6%            8%
       63   3561082   +1.42%        2%                 6%            8%
       73   3547218   +1.03%        2%                 6%            8%
      119   3571319   +1.71%        1%                 7%            8%
      127   3549375   +1.09%        0%                 6%            6%
      137   3560233   +1.40%        0%                 6%            6%
      165   3555176   +1.25%        2%                 6%            8%
      188   3551501   +1.15%        0%                 8%            8%
      223   3531462   +0.58%        0%                 7%            7%
      255   3570400   +1.69%        0%                 7%            7%
      267   3532235   +0.60%        1%                 8%            9%
      299   3562326   +1.46%        0%                 6%            6%
      320   3553569   +1.21%        0%                 8%            8%
      393   3539519   +0.81%        0%                 7%            7%
      424   3549271   +1.09%        0%                 8%            8%
      458   3528885   +0.50%        0%                 8%            8%
      467   3526554   +0.44%        0%                 7%            7%
      498   3525302   +0.40%        0%                 9%            9%
      511   3527556   +0.47%        0%                 8%            8%
      
      Sandybridge-Desktop: the 0% contention isn't accurate but caused by
      dropped fractional part. Since multiple contention path's contentions
      are all under 1% here, with some arithmetic operations like add, the
      final deviation could be as large as 3%.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   1744495   +0.00%        0%                 0%            0%
       53   1755341   +0.62%        0%                 0%            0%
       63   1758469   +0.80%        0%                 0%            0%
       73   1759626   +0.87%        0%                 0%            0%
      119   1770417   +1.49%        0%                 0%            0%
      127   1768252   +1.36%        0%                 0%            0%
      137   1767848   +1.34%        0%                 0%            0%
      165   1765088   +1.18%        0%                 0%            0%
      188   1766918   +1.29%        0%                 0%            0%
      223   1767866   +1.34%        0%                 0%            0%
      255   1768074   +1.35%        0%                 0%            0%
      267   1763187   +1.07%        0%                 0%            0%
      299   1765620   +1.21%        0%                 0%            0%
      320   1767603   +1.32%        0%                 0%            0%
      393   1764612   +1.15%        0%                 0%            0%
      424   1758476   +0.80%        0%                 0%            0%
      458   1758593   +0.81%        0%                 0%            0%
      467   1757915   +0.77%        0%                 0%            0%
      498   1753363   +0.51%        0%                 0%            0%
      511   1755548   +0.63%        0%                 0%            0%
      
      Phase two test results:
      Note: all percent change is against base(batch=31).
      
      ebizzy.throughput (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
      lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
      lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
      lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
      lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
      lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
      lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
      lkp-sb02          98937          98937    +0.0%       99030 +0.1%
      
      kbuild.buildtime (less is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
      lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
      lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
      lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
      lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
      lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
      lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
      
      netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
      lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
      lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
      lk-bdw-ep2       365764        365933  +0.0%        365951 +0.1%
      lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
      lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
      lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
      lkp-sb02          29990         30137  +0.5%         30103 +0.4%
      
      oltp.transactions (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
      lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
      lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
      lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
      lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
      lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
      lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
      
      pigz.throughput (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
      lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
      lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
      lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
      lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
      lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
      lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
      lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
      
      will-it-scale.malloc1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
      lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
      lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
      lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
      lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
      lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
      lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
      lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
      lkp-sb02          589286       589284 -0.0%         588101 -0.2%
      
      will-it-scale.malloc1.threads (higher is better)
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
      lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
      lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
      lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
      lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
      lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
      lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
      lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
      lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
      
      will-it-scale.malloc2.processes (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
      lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
      lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
      lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
      lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
      lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
      lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
      lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
      lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
      
      will-it-scale.malloc2.threads (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
      lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
      lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
      lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
      lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
      lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
      lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
      lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
      lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
      
      will-it-scale.page_fault2.processes (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
      lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
      lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
      lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
      lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
      lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
      lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
      lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
      lkp-sb02          842616        837844  -0.6%         833921  -1.0%
      
      will-it-scale.page_fault2.threads
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
      lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
      lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
      lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
      lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
      lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
      lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
      lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
      lkp-sb02          750626        748615    -0.3%      746621    -0.5%
      
      will-it-scale.page_fault3.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
      lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
      lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
      lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
      lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
      lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
      lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
      lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
      lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
      
      will-it-scale.page_fault3.threads (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
      lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
      lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
      lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
      lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
      lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
      lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
      lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
      lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
      
      will-it-scale.read1.processes (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
      lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
      lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
      lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
      lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
      lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
      lkp-skl-d01     10969487      10972650     +0.0%    no data
      lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
      lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
      
      will-it-scale.read1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
      lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
      lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
      lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
      lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
      lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
      lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
      lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
      lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
      
      will-it-scale.write1.processes (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
      lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
      lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
      lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
      lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
      lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
      lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
      lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
      lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
      
      will-it-scale.write1.threads (higer is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
      lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
      lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
      lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
      lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
      lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
      lkp-skl-d01       8346174       8235695     -1.3%     no data
      lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
      lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
      
      vm-scalability.anon-r-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
      lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
      lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
      lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
      lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
      lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
      lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
      lkp-sb02          570572        567854     -0.5%     568449     -0.4%
      
      vm-scalability.anon-r-rand-mt.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
      lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
      lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
      lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
      lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
      lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
      lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
      lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
      lkp-sb02          567137        567346     +0.0%     566144     -0.2%
      
      vm-scalability.lru-file-mmap-read.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
      lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
      lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
      lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
      lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
      lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
      lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
      lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
      lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
      
      vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
      lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
      lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
      lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
      lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
      lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
      lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
      lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
      lkp-sb02          258939        258787  -0.1%        258199  -0.3%
      
      Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.comSigned-off-by: NAaron Lu <aaron.lu@intel.com>
      Suggested-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8a759b5
    • M
      mm: drop VM_BUG_ON from __get_free_pages · 9ea9a680
      Michal Hocko 提交于
      There is no real reason to blow up just because the caller doesn't know
      that __get_free_pages cannot return highmem pages.  Simply fix that up
      silently.  Even if we have some confused users such a fixup will not be
      harmful.
      
      [akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
      Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Jiankang Chen <chenjiankang1@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ea9a680
    • V
      mm, page_alloc: actually ignore mempolicies for high priority allocations · d6a24df0
      Vlastimil Babka 提交于
      __alloc_pages_slowpath() has for a long time contained code to ignore
      node restrictions from memory policies for high priority allocations.
      The current code that resets the zonelist iterator however does
      effectively nothing after commit 7810e678 ("mm, page_alloc: do not
      break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
      Even before that commit, mempolicy restrictions were still not ignored,
      as they are passed in ac->nodemask which is untouched by the code.
      
      We can either remove the code, or make it work as intended.  Since
      ac->nodemask can be set from task's mempolicy via alloc_pages_current()
      and thus also alloc_pages(), it may indeed affect kernel allocations,
      and it makes sense to ignore it to allow progress for high priority
      allocations.
      
      Thus, this patch resets ac->nodemask to NULL in such cases.  This
      assumes all callers can handle it (i.e.  there are no guarantees as in
      the case of __GFP_THISNODE) which seems to be the case.  The same
      assumption is already present in check_retry_cpuset() for some time.
      
      The expected effect is that high priority kernel allocations in the
      context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
      will have higher chance to succeed if they are restricted to nodes with
      depleted memory, while there are other nodes with free memory left.
      
      It's not a new intention, but for the first time the code will match the
      intention, AFAICS.  It was intended by commit 183f6371 ("mm: ignore
      mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
      really worked, as mempolicy restriction was already encoded in nodemask,
      not zonelist, at that time.
      
      So originally that was for ALLOC_NO_WATERMARK only.  Then it was
      adjusted by e46e7b77 ("mm, page_alloc: recalculate the preferred
      zoneref if the context can ignore memory policies") and cd04ae1e
      ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
      current state.  So even GFP_ATOMIC would now ignore mempolicies after
      the initial attempts fail - if the code worked as people thought it
      does.
      
      Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6a24df0
    • P
      mm: skip invalid pages block at a time in zero_resv_unresv() · 720e14eb
      Pavel Tatashin 提交于
      The role of zero_resv_unavail() is to make sure that every struct page
      that is allocated but is not backed by memory that is accessible by
      kernel is zeroed and not in some uninitialized state.
      
      Since struct pages are allocated in blocks (2M pages in x86 case), we
      can skip pageblock_nr_pages at a time, when the first one is found to be
      invalid.
      
      This optimization may help since now on x86 every hole in e820 maps is
      marked as reserved in memblock, and thus will go through this function.
      
      This function is called before sched_clock() is initialized, so I used
      my x86 early boot clock patches to measure the performance improvement.
      
      With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
      but with this optimization 0.001103s.
      
      Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.comSigned-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      720e14eb
  13. 06 8月, 2018 2 次提交
    • P
      PM / reboot: Eliminate race between reboot and suspend · 55f2503c
      Pingfan Liu 提交于
      At present, "systemctl suspend" and "shutdown" can run in parrallel. A
      system can suspend after devices_shutdown(), and resume. Then the shutdown
      task goes on to power off. This causes many devices are not really shut
      off. Hence replacing reboot_mutex with system_transition_mutex (renamed
      from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
      system_transition_mutex can be better to reflect the purpose of the mutex.
      Signed-off-by: NPingfan Liu <kernelfans@gmail.com>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      55f2503c
    • D
      mm: Allow non-direct-map arguments to free_reserved_area() · 0d834328
      Dave Hansen 提交于
      free_reserved_area() takes pointers as arguments to show which addresses
      should be freed.  However, it does this in a somewhat ambiguous way.  If it
      gets a kernel direct map address, it always works.  However, if it gets an
      address that is part of the kernel image alias mapping, it can fail.
      
      It fails if all of the following happen:
       * The specified address is part of the kernel image alias
       * Poisoning is requested (forcing a memset())
       * The address is in a read-only portion of the kernel image
      
      The memset() fails on the read-only mapping, of course.
      free_reserved_area() *is* called both on the direct map and on kernel image
      alias addresses.  We've just lucked out thus far that the kernel image
      alias areas it gets used on are read-write.  I'm fairly sure this has been
      just a happy accident.
      
      It is quite easy to make free_reserved_area() work for all cases: just
      convert the address to a direct map address before doing the memset(), and
      do this unconditionally.  There is little chance of a regression here
      because we previously did a virt_to_page() on the address for the memset,
      so we know these are not highmem pages for which virt_to_page() would fail.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
      0d834328