1. 08 December 2019, 2 commits
    • mm: thp: extract split_queue_* into a struct · c9acf2bd
      Committed by Yang Shi
      commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream
      
      Patch series "Make deferred split shrinker memcg aware", v6.
      
      Currently the THP deferred split shrinker is not memcg aware; this may
      cause premature OOM with some configurations.  For example, the test
      below runs into premature OOM easily:
      
      $ cgcreate -g memory:thp
      $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
      $ cgexec -g memory:thp transhuge-stress 4000
      
      transhuge-stress comes from kernel selftest.
      
      It is easy to hit OOM, but there are still a lot of THPs on the deferred
      split queue; memcg direct reclaim can't touch them since the deferred
      split shrinker is not memcg aware.
      
      Make the deferred split shrinker memcg aware by introducing a per-memcg
      deferred split queue.  A THP should be on either the per-node or the
      per-memcg deferred split queue if it belongs to a memcg.  When the page
      is migrated to another memcg, it is moved to the target memcg's
      deferred split queue as well.
      
      Reuse the second tail page's deferred_list for per memcg list since the
      same THP can't be on multiple deferred split queues.
      
      Make the deferred split shrinker not depend on memcg kmem since THPs are
      not slab objects.  It doesn't make sense to skip shrinking THPs just
      because memcg kmem is disabled.
      
      With the above changes, the test shown above no longer triggers OOM,
      even with cgroup.memory=nokmem.
      
      This patch (of 4):
      
      Put split_queue, split_queue_lock and split_queue_len into a struct in
      order to reduce code duplication when we convert deferred_split to memcg
      aware in the later patches.
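      For reference, the extracted structure looks roughly like this (a sketch
      based on the description above; field order and placement may differ
      from the actual patch):

      struct deferred_split {
              spinlock_t split_queue_lock;
              struct list_head split_queue;
              unsigned long split_queue_len;
      };

      /* struct pglist_data then embeds one instance instead of carrying the
       * three fields directly, e.g. a "struct deferred_split
       * deferred_split_queue;" member under CONFIG_TRANSPARENT_HUGEPAGE. */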
      
      Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      c9acf2bd
    • alios: mm: Support kidled · fd952d8c
      Committed by Gavin Shan
      This enables scanning pages at a fixed interval to determine their access
      frequency (hot/cold).  The result is exported to user space per memory
      cgroup through "memory.idle_page_stats".  The design is highlighted
      below:
      
         * A kernel thread is spawned when this feature is enabled by writing a
           non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
           The thread sequentially scans the nodes and the pages chained on
           their LRU lists.
      
         * For each page, its age information is stored in the page flags or in
           a per-node array.  The age represents the number of scanning
           intervals in which the page wasn't accessed.  The page flag
           (PG_idle) is leveraged: the page's age is increased by one if the
           idle flag isn't cleared in two consecutive scans; otherwise the
           page's age is reset.  The age information is also cleared when the
           page is freed, so that stale age information won't be fetched when
           the page is allocated again.
      
         * Initially, the flag is set while the access bit in the page's PTE is
           cleared by the thread.  In the next scanning period, the PTE access
           bit is synchronized with the page flag: the flag is cleared if the
           access bit is set, and kept otherwise.  For unmapped pages, the
           flag is cleared when the page is accessed.  (A sketch of this aging
           step follows this list.)
      
         * Eventually, the page's age information is accumulated into the
           unstable bucket of its memory cgroup as statistics.  The unstable
           bucket is copied to the stable bucket once all pages in all nodes
           have been scanned.  The stable bucket (statistics) is exported to
           user space through "memory.idle_page_stats".
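      A minimal sketch of the per-page aging step described above (kidled is an
      out-of-tree feature, so the kidled_*_page_age() helpers below are
      placeholders; only page_is_idle()/set_page_idle() are the standard
      page-idle helpers):

         /* Hypothetical sketch of one aging step; not the actual kidled code. */
         static void kidled_age_page(struct page *page)
         {
                 if (page_is_idle(page)) {
                         /* Idle flag survived a full interval: not accessed. */
                         kidled_inc_page_age(page);      /* placeholder helper */
                 } else {
                         /* Page was referenced since the last scan: reset age. */
                         kidled_set_page_age(page, 0);   /* placeholder helper */
                 }
                 /* Re-arm for the next interval; for mapped pages the PTE
                  * access bits are cleared as well so the flag can be
                  * synchronized on the next scan. */
                 set_page_idle(page);
         }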
      
      TESTING
      =======
      
         * cgroup1, unmapped pagecache
      
           # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
           #
           # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
           # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
           # mkdir -p /cgroup/memory
           # mount -tcgroup -o memory /cgroup/memory
           # echo 1 > /cgroup/memory/memory.use_hierarchy
           # mkdir -p /cgroup/memory/test
           # echo 1 > /cgroup/memory/test/memory.use_hierarchy
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # dd if=/ext4/test.data of=/dev/null bs=1M count=128
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
      
         * cgroup1, mapped pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and access the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
      
         * cgroup1, mapped and locked pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and mlock the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
      
         * cgroup1, anonymous and locked area
      
           # < create memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap anonymous area and mlock it >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
      
         * The above test cases were rerun in cgroup2 and the results show no
           exceptions.  However, the cgroups are populated in a different way,
           as below:
      
           # mkdir -p /cgroup
           # mount -tcgroup2 none /cgroup
           # echo "+memory" > /cgroup/cgroup.subtree_control
           # mkdir -p /cgroup/test
      Signed-off-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      fd952d8c
  2. 30 October 2019, 1 commit
  3. 15 June 2019, 1 commit
    • mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE · 5094a85d
      Committed by Linxu Fang
      [ Upstream commit 299c83dce9ea3a79bb4b5511d2cb996b6b8e5111 ]
      
      342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
      later patches rewrote the calculation of node spanned pages.

      e506b996 ("mem-hotplug: fix node spanned pages when we have a movable
      node") fixed part of this, but the current code still has a problem:

      when we have a node with only ZONE_MOVABLE and the node id is not zero,
      the node's spanned pages are counted twice.

      That's because such a node has an empty normal zone whose zone_start_pfn
      or zone_end_pfn lies outside [arch_zone_lowest_possible_pfn,
      arch_zone_highest_possible_pfn], so we need to clamp the range, just
      like commit <96e907d1> ("bootmem: Reimplement
      __absent_pages_in_range() using for_each_mem_pfn_range()") did.
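      A rough sketch of the fix (assuming it follows the clamp() approach of
      the commit referenced above; the exact hunk may differ):

      /* In zone_spanned_pages_in_node() / zone_absent_pages_in_node(): clamp
       * the zone's architectural range to this node's range before
       * intersecting, so an empty zone on a non-zero node contributes zero
       * spanned pages instead of the whole node range. */
      *zone_start_pfn = clamp(*zone_start_pfn, node_start_pfn, node_end_pfn);
      *zone_end_pfn   = clamp(*zone_end_pfn, node_start_pfn, node_end_pfn);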
      
      e.g.
      Zone ranges:
        DMA      [mem 0x0000000000001000-0x0000000000ffffff]
        DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
        Normal   [mem 0x0000000100000000-0x000000023fffffff]
      Movable zone start for each node
        Node 0: 0x0000000100000000
        Node 1: 0x0000000140000000
      Early memory node ranges
        node   0: [mem 0x0000000000001000-0x000000000009efff]
        node   0: [mem 0x0000000000100000-0x00000000bffdffff]
        node   0: [mem 0x0000000100000000-0x000000013fffffff]
        node   1: [mem 0x0000000140000000-0x000000023fffffff]
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0x100000    present:0x100000	absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 2097152
      node_spanned_pages:2097152
      Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
      cma-reserved)
      
      It shows that node 1's memory is currently counted twice.
      After this patch, the problem is fixed:
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0	    present:0		absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 1048576
      node_spanned_pages:1048576
      memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
      cma-reserved)
      
      Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
      Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5094a85d
  4. 06 April 2019, 1 commit
    • page_poison: play nicely with KASAN · a6c56bf6
      Committed by Qian Cai
      [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]
      
      KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
      It triggers false positives in the allocation path:
      
        BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
        Read of size 8 at addr ffff88881f800000 by task swapper/0
        CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         __asan_report_load8_noabort+0x19/0x20
         memchr_inv+0x2ea/0x330
         kernel_poison_pages+0x103/0x3d5
         get_page_from_freelist+0x15e7/0x4d90
      
      because KASAN has not yet unpoisoned the shadow page for the allocation
      by the time kernel_poison_pages() runs memchr_inv(), so the check only
      finds a stale poison pattern.
      
      There are also false positives in the free path:
      
        BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
        Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
        CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
        Call Trace:
         dump_stack+0xe0/0x19a
         print_address_description.cold.2+0x9/0x28b
         kasan_report.cold.3+0x7a/0xb5
         check_memory_region+0x22d/0x250
         memset+0x28/0x40
         kernel_poison_pages+0x29e/0x3d5
         __free_pages_ok+0x75f/0x13e0
      
      because KASAN adds poisoned redzones around slab objects, while page
      poisoning needs to poison the whole page.
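      The fix is to silence KASAN around the whole-page poison/verify step; a
      rough sketch of the idea (poison_pages() stands in for the page_poison
      helpers, and the actual patch may structure this differently):

      /* Page poisoning legitimately touches redzoned/unallocated bytes, so
       * suppress KASAN reports while poisoning or checking whole pages. */
      kasan_disable_current();
      poison_pages(page, n);          /* or the memchr_inv() check on alloc */
      kasan_enable_current();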
      
      Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      a6c56bf6
  5. 24 March 2019, 1 commit
    • mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · 33e83ea3
      Committed by Jann Horn
      [ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ]
      
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
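      A rough sketch of the resulting accounting when refilling the frag cache
      (conceptual; the actual patch expresses the size via its cache-size
      constant):

      /* Reserve one reference per possible fragment plus the reference the
       * cache itself keeps on the page. */
      page_ref_add(page, size);               /* was: page_ref_add(page, size - 1) */
      nc->pagecnt_bias = size + 1;            /* was: nc->pagecnt_bias = size      */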
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      33e83ea3
  6. 13 February 2019, 1 commit
    • mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init · f73c7753
      Committed by Waiman Long
      [ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]
      
      When CONFIG_KASAN is enabled on large-memory SMP systems, the deferred
      page initialization can take a long time.  Below are the reported init
      times on an 8-socket, 96-core, 4TB IvyBridge system.
      
        1) Non-debug kernel without CONFIG_KASAN
           [    8.764222] node 1 initialised, 132086516 pages in 7027ms
      
        2) Debug kernel with CONFIG_KASAN
           [  146.288115] node 1 initialised, 132075466 pages in 143052ms
      
      So the page init time in a debug kernel was 20X that of the non-debug
      kernel.  The long init time can be problematic as the page initialization
      is done with interrupts disabled.  In this particular case, it caused the
      warning messages below as well as NMI backtraces of all the cores that
      were doing the initialization.
      
      [   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
      [   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
      [   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
      [   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
      [   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
      [   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
      [   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
      [   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)
      
      The long init time was mainly caused by the call to kasan_free_pages() to
      poison the newly initialized pages.  On a 4TB system, we are talking about
      almost 500GB of memory probably on the same node.
      
      In reality, we may not need to poison the newly initialized pages before
      they are ever allocated.  So KASAN poisoning of freed pages before the
      completion of deferred memory initialization is now disabled.  Those pages
      will be properly poisoned when they are allocated or freed after deferred
      page initialization is done.
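      A rough sketch of the approach, assuming a static key that stays enabled
      until deferred memory initialization completes (names follow the
      description; the actual helper may differ):

      static inline void kasan_free_nondeferred_pages(struct page *page, int order)
      {
              /* Skip KASAN poisoning of freed pages while deferred memory
               * initialization is still running; poison as usual afterwards. */
              if (!static_branch_unlikely(&deferred_pages))
                      kasan_free_pages(page, order);
      }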
      
      With this change, the new page initialization time became:
      
      [   21.948010] node 1 initialised, 132075466 pages in 18702ms
      
      This was still about double the non-debug kernel time, but was much
      better than before.
      
      Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      f73c7753
  7. 31 January 2019, 1 commit
    • Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 6bab9573
      Committed by Michal Hocko
      commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.
      
      This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
      
      The underlying assumption that one sparse section belongs to a single
      NUMA node doesn't really hold.  Robert Shteynfeld has reported a boot
      failure.  The boot log was not captured, but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node 0 starts in the middle of a memory section which
      also belongs to node 1.  memmap_init_zone tries to initialize the padding
      of a section even when it is outside of the given pfn range, because
      there are code paths (e.g. memory hotplug) which assume that a full
      memory section is always initialized.

      In this particular case, though, such a range is already initialized and
      most likely already managed by the page allocator.  Scribbling over
      those pages corrupts the internal state and likely blows up when any of
      those pages gets used.
      Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      6bab9573
  8. 29 December 2018, 2 commits
    • mm, page_alloc: fix has_unmovable_pages for HugePages · e27666dd
      Committed by Oscar Salvador
      commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream.
      
      While playing with gigantic hugepages and memory_hotplug, I triggered
      the following #PF when "cat memoryX/removable":
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 1481 Comm: cat Tainted: G            E     4.20.0-rc6-mm1-1-default+ #18
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:has_unmovable_pages+0x154/0x210
        Call Trace:
         is_mem_section_removable+0x7d/0x100
         removable_show+0x90/0xb0
         dev_attr_show+0x1c/0x50
         sysfs_kf_seq_show+0xca/0x1b0
         seq_read+0x133/0x380
         __vfs_read+0x26/0x180
         vfs_read+0x89/0x140
         ksys_read+0x42/0x90
         do_syscall_64+0x5b/0x180
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that we do not pass the head page to page_hstate(), so the
      call to compound_order() in page_hstate() returns 0 and we end up
      comparing every hstate's size against PAGE_SIZE.

      Obviously, we do not find any hstate matching that size, and we return
      NULL.  Then, we dereference that NULL pointer in
      hugepage_migration_supported() and we get the #PF from above.
      
      Fix that by getting the head page before calling page_hstate().
      
      Also, since gigantic pages span several pageblocks, re-adjust the logic
      for skipping pages.  While at it, we can also get rid of the
      round_up().
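      A rough sketch of the fixed hunk in has_unmovable_pages(), based on the
      description above (the actual patch may differ in detail):

      if (PageHuge(page)) {
              struct page *head = compound_head(page);
              unsigned int skip_pages;

              /* Use the head page so page_hstate() sees the real order. */
              if (!hugepage_migration_supported(page_hstate(head)))
                      goto unmovable;

              /* Gigantic pages span several pageblocks: skip the whole page. */
              skip_pages = (1 << compound_order(head)) - (page - head);
              iter += skip_pages - 1;
              continue;
      }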
      
      [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
        Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
      Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e27666dd
    • mm, memory_hotplug: initialize struct pages for the full memory section · 7592dbfa
      Committed by Mikhail Zaslonko
      commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream.
      
      If the memory end is not aligned with the sparse memory section boundary,
      the memmap of such a section is only partly initialized.  This may lead
      to a VM_BUG_ON due to uninitialized struct page access from
      is_mem_section_removable() or test_pages_in_a_zone(), triggered by the
      memory_hotplug sysfs handlers:

      Here are the panic examples:
       CONFIG_DEBUG_VM=y
       CONFIG_DEBUG_VM_PGFLAGS=y
      
       kernel parameter mem=2050M
       --------------------------
       page:000003d082008000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( test_pages_in_a_zone+0xde/0x160)
         show_valid_zones+0x5c/0x190
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         test_pages_in_a_zone+0xde/0x160
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
       kernel parameter mem=3075M
       --------------------------
       page:000003d08300c000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( is_mem_section_removable+0xb4/0x190)
         show_mem_removable+0x9a/0xd8
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         is_mem_section_removable+0xb4/0x190
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      Fix the problem by initializing the last memory section of each zone in
      memmap_init_zone() till the very end, even if it goes beyond the zone end.
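      A rough sketch of the idea in memmap_init_zone() (illustrative only; note
      that this commit is later reverted, see the entry above):

      #ifdef CONFIG_SPARSEMEM
              /* Keep initializing struct pages past the zone end until the
               * enclosing memory section is fully covered, so hotplug paths
               * that walk whole sections never see uninitialized pages. */
              while (end_pfn % PAGES_PER_SECTION) {
                      __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
                      end_pfn++;
              }
      #endif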
      
      Michal said:
      
      : This has always been a problem AFAIU.  It just went unnoticed because we
      : have zeroed memmaps during allocation before f7f99100 ("mm: stop
      : zeroing memory during allocation in vmemmap") and so the above test
      : would simply skip these ranges as belonging to zone 0 or provided
      : garbage.
      :
      : So I guess we do care for post f7f99100 kernels mostly and
      : therefore Fixes: f7f99100 ("mm: stop zeroing memory during
      : allocation in vmemmap")
      
      Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7592dbfa
  9. 17 December 2018, 1 commit
    • mm/page_alloc.c: fix calculation of pgdat->nr_zones · 505bc9f3
      Committed by Wei Yang
      [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]
      
      init_currently_empty_zone() adjusts pgdat->nr_zones and sets it to
      'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
      case, but not in the hot-plug situation.
      
      This function is used in two places:
      
        * free_area_init_core()
        * move_pfn_range_to_zone()
      
      In the first case, we are sure the zone index increases monotonically.
      In the second one, it is under the user's control.
      
      One way to reproduce this is:
      ----------------------------

      1. create a virtual machine with an empty node1

         -m 4G,slots=32,maxmem=32G \
         -smp 4,maxcpus=8          \
         -numa node,nodeid=0,mem=4G,cpus=0-3 \
         -numa node,nodeid=1,mem=0G,cpus=4-7

      2. hot-add cpus 3-7

         cpu-add [3-7]

      3. hot-add memory to node1

         object_add memory-backend-ram,id=ram0,size=1G
         device_add pc-dimm,id=dimm0,memdev=ram0,node=1

      4. online the memory in the following order

         echo online_movable > memory47/state
         echo online > memory40/state

      After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
      instead of (ZONE_MOVABLE + 1).
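      A rough sketch of the fix in init_currently_empty_zone(), based on the
      description (the exact form may differ):

      /* Only grow nr_zones; onlining a lower zone later must not shrink it. */
      if (zone_idx(zone) + 1 > pgdat->nr_zones)
              pgdat->nr_zones = zone_idx(zone) + 1;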
      
      Michal said:
       "Having an incorrect nr_zones might result in all sorts of problems
        which would be quite hard to debug (e.g. reclaim not considering the
        movable zone). I do not expect many users would suffer from this it
        but still this is trivial and obviously right thing to do so
        backporting to the stable tree shouldn't be harmful (last famous
        words)"
      
      Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      505bc9f3
  10. 01 December 2018, 2 commits
    • mm, page_alloc: check for max order in hot path · 9dec3855
      Committed by Michal Hocko
      [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]
      
      Konstantin has noticed that kvmalloc might trigger the following
      warning:
      
        WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
        [...]
        Call Trace:
         fragmentation_index+0x76/0x90
         compaction_suitable+0x4f/0xf0
         shrink_node+0x295/0x310
         node_reclaim+0x205/0x250
         get_page_from_freelist+0x649/0xad0
         __alloc_pages_nodemask+0x12a/0x2a0
         kmalloc_large_node+0x47/0x90
         __kmalloc_node+0x22b/0x2e0
         kvmalloc_node+0x3e/0x70
         xt_alloc_table_info+0x3a/0x80 [x_tables]
         do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
         nf_setsockopt+0x44/0x60
         SyS_setsockopt+0x6f/0xc0
         do_syscall_64+0x67/0x120
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The problem is that we only check for an out-of-bound order in the slow
      path, and node reclaim might happen from the fast path already.  This
      is fixable by making sure that kvmalloc doesn't ever use kmalloc for
      requests that are larger than KMALLOC_MAX_SIZE, but it also shows that
      the code is rather fragile.  A recent UBSAN report underlines that:
      
        UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
        shift exponent 51 is too large for 32-bit type 'int'
        CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0xd2/0x148 lib/dump_stack.c:113
         ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
         __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
         __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
         zone_watermark_fast mm/page_alloc.c:3216 [inline]
         get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
         __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
         alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
         alloc_pages include/linux/gfp.h:509 [inline]
         __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
         dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
         raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
         raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
         fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
         fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
         __blkdev_driver_ioctl block/ioctl.c:303 [inline]
         blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
         block_ioctl+0x105/0x150 fs/block_dev.c:1883
         vfs_ioctl fs/ioctl.c:46 [inline]
         do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
         ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
         __do_sys_ioctl fs/ioctl.c:709 [inline]
         __se_sys_ioctl fs/ioctl.c:707 [inline]
         __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
         do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note that this is not a kvmalloc path.  It is just that the fast path
      really depends on having a sanitized order as well.  Therefore move the
      order check to the fast path.
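      A rough sketch of the check moved into the fast path of
      __alloc_pages_nodemask() (based on the description; exact placement may
      differ):

      /* Several fast-path helpers assume a sane order, so bail out early
       * instead of only catching this in the slow path. */
      if (unlikely(order >= MAX_ORDER)) {
              WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
              return NULL;
      }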
      
      Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reported-by: Kyungtae Kim <kt0755@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Byoungyoung Lee <lifeasageek@gmail.com>
      Cc: "Dae R. Jeong" <threeearcat@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9dec3855
    • mm, memory_hotplug: check zone_movable in has_unmovable_pages · b44fd126
      Committed by Michal Hocko
      [ Upstream commit 9d7899999c62c1a81129b76d2a6ecbc4655e1597 ]
      
      Page state checks are racy.  Under a heavy memory workload (e.g.  stress
      -m 200 -t 2h) it is quite easy to hit a race window when the page is
      allocated but its state is not fully populated yet.  A debugging patch to
      dump the struct page state shows
      
        has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
        page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
        flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
      
      Note that the state has been checked for both PageLRU and PageSwapBacked
      already.  Closing this race completely would require some sort of retry
      logic.  This can be tricky and error prone (think of potential endless
      or long taking loops).
      
      Work around this problem for movable zones at least.  Such a zone should
      only contain movable pages.  Commit 15c30bc0 ("mm, memory_hotplug:
      make has_unmovable_pages more robust") has told us that this is not
      strictly true, though.  Bootmem pages should be marked reserved, so we
      can move the original check after the PageReserved check.  Pages from
      other zones are still prone to races, but we do not even pretend that
      memory hotremove works for those, so a premature failure doesn't hurt
      that much.
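      A rough sketch of the resulting hunk in has_unmovable_pages(), based on
      the description above (comment and placement may differ):

      if (PageReserved(page))
              goto unmovable;

      /* After ruling out reserved (bootmem) pages, it is reasonably safe to
       * assume everything else in a movable zone is movable. */
      if (zone_idx(zone) == ZONE_MOVABLE)
              continue;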
      
      Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
      Fixes: 15c30bc0 ("mm, memory_hotplug: make has_unmovable_pages more robust")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b44fd126
  11. 09 October 2018, 1 commit
  12. 02 October 2018, 1 commit
    • mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Committed by Mel Gorman
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16.  In local tests, this
      patch improved STREAM performance relative to the baseline, but it is
      somewhat machine-dependent.  Most workloads show little or no performance
      difference, implying that there is no heavy reliance on the throttling
      mechanism and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      efaffc5e
  13. 05 September 2018, 1 commit
  14. 30 August 2018, 1 commit
  15. 24 August 2018, 1 commit
  16. 23 August 2018, 5 commits
  17. 18 August 2018, 4 commits
    • mm, page_alloc: double zone's batchsize · d8a759b5
      Committed by Aaron Lu
      To improve the page allocator's performance for order-0 pages, each CPU
      has a Per-CPU Pageset (PCP) per zone.  Whenever an order-0 page is
      needed, the PCP is checked first before asking Buddy for pages.  When the
      PCP is used up, a batch of pages is fetched from Buddy, and the size of
      that batch affects performance.

      The zone's batch size was last doubled by commit ba56e91c ("mm:
      page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
      then, CPUs have evolved a lot and their cache sizes have also increased.
      
      Dave Hansen is concerned that the current batch size doesn't fit well
      with modern hardware and suggested I do two things: first, use a page
      allocator intensive benchmark, e.g. will-it-scale/page_fault1, to find
      out how performance changes with different batch sizes on various
      machines and then choose a new default batch size; second, see how this
      new batch size works with other workloads.

      In the first test, we saw performance gains on high-core-count systems
      and little to no effect on older systems with more modest core counts.
      From this phase's test data, two candidates were chosen: 63 and 127.

      In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
      and more will-it-scale sub-tests were run to see how these two candidates
      work with these workloads, and a new default was decided according to
      the results.
      
      Most test results are flat.  will-it-scale/page_fault2 process mode has a
      10%-18% performance increase on 4-socket Skylake and Broadwell.
      vm-scalability/lru-file-mmap-read has a 17%-47% performance increase on
      4-socket servers, while on 2-socket servers it caused a 3%-8% performance
      drop.  Further analysis showed that, with a larger pcp->batch and thus a
      larger pcp->high (the relationship pcp->high = 6 * pcp->batch is
      maintained in this patch), zone lock contention shifts to LRU-add lock
      contention, and that caused the performance drop.  This performance drop
      might be mitigated by others' work on optimizing the LRU lock.

      Another downside of increasing pcp->batch is that when the PCP is used up
      and a batch of pages needs to be fetched from Buddy, that refill can take
      longer than before since the batch is larger.  My understanding is that
      this doesn't affect the slowpath, where direct reclaim and compaction
      dominate.  For the fastpath, throughput is a win (according to
      will-it-scale/page_fault1) but worst-case latency can be larger now.
      
      Overall, I think doubling the batch size from 31 to 63 is relatively safe
      and provides a good performance boost for high-core-count systems; a
      rough sketch of the change is shown below.
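      A minimal sketch of the change in zone_batchsize(), assuming the doubling
      is done by raising the per-batch size cap (the actual hunk may differ):

      /* Allow roughly twice as many pages per PCP batch: raising the cap
       * makes the effective default batch 63 instead of 31, while pcp->high
       * stays at 6 * pcp->batch. */
      batch = zone->managed_pages / 1024;
      if (batch * PAGE_SIZE > 1024 * 1024)    /* was 512 * 1024 */
              batch = (1024 * 1024) / PAGE_SIZE;
      batch /= 4;     /* We effectively *= 4 below */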
      
      The test results for the two phases are listed below (all tests were done
      with THP disabled).
      
      Phase one(will-it-scale/page_fault1) test results:
      
      Skylake-EX: an increased batch size has a good effect on zone->lock
      contention, though LRU contention rises at the same time and limits the
      final performance increase.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   15345900    +0.00%       64%                 8%           72%
       53   17903847   +16.67%       32%                38%           70%
       63   17992886   +17.25%       24%                45%           69%
       73   18022825   +17.44%       10%                61%           71%
      119   18023401   +17.45%        4%                66%           70%
      127   18029012   +17.48%        3%                66%           69%
      137   18036075   +17.53%        4%                66%           70%
      165   18035964   +17.53%        2%                67%           69%
      188   18101105   +17.95%        2%                67%           69%
      223   18130951   +18.15%        2%                67%           69%
      255   18118898   +18.07%        2%                67%           69%
      267   18101559   +17.96%        2%                67%           69%
      299   18160468   +18.34%        2%                68%           70%
      320   18139845   +18.21%        2%                67%           69%
      393   18160869   +18.34%        2%                68%           70%
      424   18170999   +18.41%        2%                68%           70%
      458   18144868   +18.24%        2%                68%           70%
      467   18142366   +18.22%        2%                68%           70%
      498   18154549   +18.30%        1%                68%           69%
      511   18134525   +18.17%        1%                69%           70%
      
      Broadwell-EX: similar pattern as Skylake-EX.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   16703983    +0.00%       67%                 7%           74%
       53   18195393    +8.93%       43%                28%           71%
       63   18288885    +9.49%       38%                33%           71%
       73   18344329    +9.82%       35%                37%           72%
      119   18535529   +10.96%       24%                46%           70%
      127   18513596   +10.83%       23%                48%           71%
      137   18514327   +10.84%       23%                48%           71%
      165   18511840   +10.82%       22%                49%           71%
      188   18593478   +11.31%       17%                53%           70%
      223   18601667   +11.36%       17%                52%           69%
      255   18774825   +12.40%       12%                58%           70%
      267   18754781   +12.28%        9%                60%           69%
      299   18892265   +13.10%        7%                63%           70%
      320   18873812   +12.99%        8%                62%           70%
      393   18891174   +13.09%        6%                64%           70%
      424   18975108   +13.60%        6%                64%           70%
      458   18932364   +13.34%        8%                62%           70%
      467   18960891   +13.51%        5%                65%           70%
      498   18944526   +13.41%        5%                64%           69%
      511   18960839   +13.51%        5%                64%           69%
      
      Skylake-EP: although the increased batch reduced zone->lock contention,
      the effect is not as good as on EX: zone->lock contention is still as
      high as 20% with a very high batch value, instead of 1% on Skylake-EX or
      5% on Broadwell-EX.  Also, total_contention actually decreased with a
      higher batch, but that doesn't translate into a performance increase.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   9554867    +0.00%       66%                 3%           69%
       53   9855486    +3.15%       63%                 3%           66%
       63   9980145    +4.45%       62%                 4%           66%
       73   10092774   +5.63%       62%                 5%           67%
      119   10310061   +7.90%       45%                19%           64%
      127   10342019   +8.24%       42%                19%           61%
      137   10358182   +8.41%       42%                21%           63%
      165   10397060   +8.81%       37%                24%           61%
      188   10341808   +8.24%       34%                26%           60%
      223   10349135   +8.31%       31%                27%           58%
      255   10327189   +8.08%       28%                29%           57%
      267   10344204   +8.26%       27%                29%           56%
      299   10325043   +8.06%       25%                30%           55%
      320   10310325   +7.91%       25%                31%           56%
      393   10293274   +7.73%       21%                31%           52%
      424   10311099   +7.91%       21%                32%           53%
      458   10321375   +8.02%       21%                32%           53%
      467   10303881   +7.84%       21%                32%           53%
      498   10332462   +8.14%       20%                33%           53%
      511   10325016   +8.06%       20%                32%           52%
      
      Broadwell-EP: zone->lock and lru lock had an agreement to make sure
      performance doesn't increase and they successfully managed to keep total
      contention at 70%.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   10121178   +0.00%       19%                50%           69%
       53   10142366   +0.21%        6%                63%           69%
       63   10117984   -0.03%       11%                58%           69%
       73   10123330   +0.02%        7%                63%           70%
      119   10108791   -0.12%        2%                67%           69%
      127   10166074   +0.44%        3%                66%           69%
      137   10141574   +0.20%        3%                66%           69%
      165   10154499   +0.33%        2%                68%           70%
      188   10124921   +0.04%        2%                67%           69%
      223   10137399   +0.16%        2%                67%           69%
      255   10143289   +0.22%        0%                68%           68%
      267   10123535   +0.02%        1%                68%           69%
      299   10140952   +0.20%        0%                68%           68%
      320   10163170   +0.41%        0%                68%           68%
      393   10000633   -1.19%        0%                69%           69%
      424   10087998   -0.33%        0%                69%           69%
      458   10187116   +0.65%        0%                69%           69%
      467   10146790   +0.25%        0%                69%           69%
      498   10197958   +0.76%        0%                69%           69%
      511   10152326   +0.31%        0%                69%           69%
      
      Haswell-EP: similar to Broadwell-EP.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   10442205   +0.00%       14%                48%           62%
       53   10442255   +0.00%        5%                57%           62%
       63   10452059   +0.09%        6%                57%           63%
       73   10482349   +0.38%        5%                59%           64%
      119   10454644   +0.12%        3%                60%           63%
      127   10431514   -0.10%        3%                59%           62%
      137   10423785   -0.18%        3%                60%           63%
      165   10481216   +0.37%        2%                61%           63%
      188   10448755   +0.06%        2%                61%           63%
      223   10467144   +0.24%        2%                61%           63%
      255   10480215   +0.36%        2%                61%           63%
      267   10484279   +0.40%        2%                61%           63%
      299   10466450   +0.23%        2%                61%           63%
      320   10452578   +0.10%        2%                61%           63%
      393   10499678   +0.55%        1%                62%           63%
      424   10481454   +0.38%        1%                62%           63%
      458   10473562   +0.30%        1%                62%           63%
      467   10484269   +0.40%        0%                62%           62%
      498   10505599   +0.61%        0%                62%           62%
      511   10483395   +0.39%        0%                62%           62%
      
      Westmere-EP: contention is pretty small, so not interesting.  Note that
      too high a batch value could hurt performance.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   4831523   +0.00%        2%                 3%            5%
       53   4834086   +0.05%        2%                 4%            6%
       63   4834262   +0.06%        2%                 3%            5%
        73   4832851   +0.03%        2%                 4%            6%
      119   4830534   -0.02%        1%                 3%            4%
      127   4827461   -0.08%        1%                 4%            5%
      137   4827459   -0.08%        1%                 3%            4%
      165   4820534   -0.23%        0%                 4%            4%
      188   4817947   -0.28%        0%                 3%            3%
      223   4809671   -0.45%        0%                 3%            3%
      255   4802463   -0.60%        0%                 4%            4%
      267   4801634   -0.62%        0%                 3%            3%
      299   4798047   -0.69%        0%                 3%            3%
      320   4793084   -0.80%        0%                 3%            3%
      393   4785877   -0.94%        0%                 3%            3%
      424   4782911   -1.01%        0%                 3%            3%
      458   4779346   -1.08%        0%                 3%            3%
      467   4780306   -1.06%        0%                 3%            3%
      498   4780589   -1.05%        0%                 3%            3%
      511   4773724   -1.20%        0%                 3%            3%
      
      Skylake-Desktop: similar to Westmere-EP, nothing interesting.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3906608   +0.00%        2%                 3%            5%
       53   3940164   +0.86%        2%                 3%            5%
       63   3937289   +0.79%        2%                 3%            5%
       73   3940201   +0.86%        2%                 3%            5%
      119   3933240   +0.68%        2%                 3%            5%
      127   3930514   +0.61%        2%                 4%            6%
      137   3938639   +0.82%        0%                 3%            3%
      165   3908755   +0.05%        0%                 3%            3%
      188   3905621   -0.03%        0%                 3%            3%
      223   3903015   -0.09%        0%                 4%            4%
      255   3889480   -0.44%        0%                 3%            3%
      267   3891669   -0.38%        0%                 4%            4%
      299   3898728   -0.20%        0%                 4%            4%
      320   3894547   -0.31%        0%                 4%            4%
      393   3875137   -0.81%        0%                 4%            4%
      424   3874521   -0.82%        0%                 3%            3%
      458   3880432   -0.67%        0%                 4%            4%
      467   3888715   -0.46%        0%                 3%            3%
      498   3888633   -0.46%        0%                 4%            4%
      511   3875305   -0.80%        0%                 5%            5%
      
      Haswell-Desktop: zone->lock contention is as low as on the other
      desktops, though lru contention is higher than on the other desktops.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3511158   +0.00%        2%                 5%            7%
       53   3555445   +1.26%        2%                 6%            8%
       63   3561082   +1.42%        2%                 6%            8%
       73   3547218   +1.03%        2%                 6%            8%
      119   3571319   +1.71%        1%                 7%            8%
      127   3549375   +1.09%        0%                 6%            6%
      137   3560233   +1.40%        0%                 6%            6%
      165   3555176   +1.25%        2%                 6%            8%
      188   3551501   +1.15%        0%                 8%            8%
      223   3531462   +0.58%        0%                 7%            7%
      255   3570400   +1.69%        0%                 7%            7%
      267   3532235   +0.60%        1%                 8%            9%
      299   3562326   +1.46%        0%                 6%            6%
      320   3553569   +1.21%        0%                 8%            8%
      393   3539519   +0.81%        0%                 7%            7%
      424   3549271   +1.09%        0%                 8%            8%
      458   3528885   +0.50%        0%                 8%            8%
      467   3526554   +0.44%        0%                 7%            7%
      498   3525302   +0.40%        0%                 9%            9%
      511   3527556   +0.47%        0%                 8%            8%
      
      Sandybridge-Desktop: the 0% contention isn't accurate; it is caused by
      the dropped fractional part.  Since each contention path's contention is
      under 1% here, with some arithmetic operations like addition the final
      deviation could be as large as 3%.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   1744495   +0.00%        0%                 0%            0%
       53   1755341   +0.62%        0%                 0%            0%
       63   1758469   +0.80%        0%                 0%            0%
       73   1759626   +0.87%        0%                 0%            0%
      119   1770417   +1.49%        0%                 0%            0%
      127   1768252   +1.36%        0%                 0%            0%
      137   1767848   +1.34%        0%                 0%            0%
      165   1765088   +1.18%        0%                 0%            0%
      188   1766918   +1.29%        0%                 0%            0%
      223   1767866   +1.34%        0%                 0%            0%
      255   1768074   +1.35%        0%                 0%            0%
      267   1763187   +1.07%        0%                 0%            0%
      299   1765620   +1.21%        0%                 0%            0%
      320   1767603   +1.32%        0%                 0%            0%
      393   1764612   +1.15%        0%                 0%            0%
      424   1758476   +0.80%        0%                 0%            0%
      458   1758593   +0.81%        0%                 0%            0%
      467   1757915   +0.77%        0%                 0%            0%
      498   1753363   +0.51%        0%                 0%            0%
      511   1755548   +0.63%        0%                 0%            0%
      
      Phase two test results:
      Note: all percentage changes are relative to the base (batch=31).
      
      ebizzy.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
      lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
      lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
      lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
      lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
      lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
      lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
      lkp-sb02          98937          98937    +0.0%       99030 +0.1%
      
      kbuild.buildtime (less is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
      lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
      lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
      lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
      lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
      lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
      lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
      
      netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
      lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
      lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
      lkp-bdw-ep2      365764        365933  +0.0%        365951 +0.1%
      lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
      lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
      lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
      lkp-sb02          29990         30137  +0.5%         30103 +0.4%
      
      oltp.transactions (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
      lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
      lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
      lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
      lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
      lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
      lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
      
      pigz.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
      lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
      lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
      lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
      lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
      lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
      lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
      lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
      
      will-it-scale.malloc1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
      lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
      lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
      lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
      lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
      lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
      lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
      lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
      lkp-sb02          589286       589284 -0.0%         588101 -0.2%
      
      will-it-scale.malloc1.threads (higher is better)
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
      lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
      lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
      lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
      lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
      lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
      lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
      lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
      lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
      
      will-it-scale.malloc2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
      lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
      lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
      lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
      lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
      lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
      lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
      lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
      lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
      
      will-it-scale.malloc2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
      lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
      lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
      lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
      lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
      lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
      lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
      lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
      lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
      
      will-it-scale.page_fault2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
      lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
      lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
      lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
      lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
      lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
      lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
      lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
      lkp-sb02          842616        837844  -0.6%         833921  -1.0%
      
      will-it-scale.page_fault2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
      lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
      lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
      lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
      lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
      lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
      lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
      lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
      lkp-sb02          750626        748615    -0.3%      746621    -0.5%
      
      will-it-scale.page_fault3.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
      lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
      lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
      lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
      lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
      lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
      lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
      lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
      lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
      
      will-it-scale.page_fault3.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
      lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
      lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
      lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
      lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
      lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
      lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
      lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
      lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
      
      will-it-scale.read1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
      lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
      lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
      lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
      lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
      lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
      lkp-skl-d01     10969487      10972650     +0.0%    no data
      lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
      lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
      
      will-it-scale.read1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
      lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
      lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
      lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
      lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
      lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
      lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
      lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
      lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
      
      will-it-scale.write1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
      lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
      lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
      lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
      lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
      lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
      lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
      lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
      lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
      
      will-it-scale.write1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
      lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
      lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
      lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
      lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
      lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
      lkp-skl-d01       8346174       8235695     -1.3%     no data
      lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
      lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
      
      vm-scalability.anon-r-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
      lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
      lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
      lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
      lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
      lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
      lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
      lkp-sb02          570572        567854     -0.5%     568449     -0.4%
      
      vm-scalability.anon-r-rand-mt.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
      lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
      lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
      lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
      lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
      lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
      lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
      lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
      lkp-sb02          567137        567346     +0.0%     566144     -0.2%
      
      vm-scalability.lru-file-mmap-read.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
      lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
      lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
      lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
      lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
      lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
      lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
      lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
      lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
      
      vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
      lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
      lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
      lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
      lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
      lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
      lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
      lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
      lkp-sb02          258939        258787  -0.1%        258199  -0.3%
      
      Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
      Signed-off-by: Aaron Lu <aaron.lu@intel.com>
      Suggested-by: Dave Hansen <dave.hansen@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d8a759b5
    • M
      mm: drop VM_BUG_ON from __get_free_pages · 9ea9a680
      Michal Hocko committed
      There is no real reason to blow up just because the caller doesn't know
      that __get_free_pages cannot return highmem pages.  Simply fix that up
      silently.  Even if we have some confused users, such a fixup will not be
      harmful.
      
      [akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
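      
      Roughly, the resulting helper then looks like the sketch below (based on the
      description above; treat it as an illustration rather than the exact
      upstream code):
      
      unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
      {
              struct page *page;
      
              /*
               * This helper returns a kernel virtual address, so a highmem
               * page would be useless here; silently clear the flag instead
               * of asserting on confused callers.
               */
              page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
              if (!page)
                      return 0;
      
              return (unsigned long) page_address(page);
      }
      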
      Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jiankang Chen <chenjiankang1@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ea9a680
    • V
      mm, page_alloc: actually ignore mempolicies for high priority allocations · d6a24df0
      Vlastimil Babka committed
      __alloc_pages_slowpath() has for a long time contained code to ignore
      node restrictions from memory policies for high priority allocations.
      The current code that resets the zonelist iterator however does
      effectively nothing after commit 7810e678 ("mm, page_alloc: do not
      break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
      Even before that commit, mempolicy restrictions were still not ignored,
      as they are passed in ac->nodemask which is untouched by the code.
      
      We can either remove the code, or make it work as intended.  Since
      ac->nodemask can be set from task's mempolicy via alloc_pages_current()
      and thus also alloc_pages(), it may indeed affect kernel allocations,
      and it makes sense to ignore it to allow progress for high priority
      allocations.
      
      Thus, this patch resets ac->nodemask to NULL in such cases.  This
      assumes all callers can handle it (i.e.  there are no guarantees as in
      the case of __GFP_THISNODE) which seems to be the case.  The same
      assumption is already present in check_retry_cpuset() for some time.
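      
      For illustration, the reset in __alloc_pages_slowpath() looks roughly like
      the snippet below (a sketch; the surrounding context and variable names
      follow the kernels of that era and may differ in detail):
      
              /*
               * Reset the nodemask and zonelist iterators if memory policies
               * can be ignored.  These allocations are high priority and
               * system rather than user oriented.
               */
              if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
                      ac->nodemask = NULL;
                      ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                      ac->high_zoneidx, ac->nodemask);
              }
      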
      
      The expected effect is that high priority kernel allocations in the
      context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
      will have higher chance to succeed if they are restricted to nodes with
      depleted memory, while there are other nodes with free memory left.
      
      It's not a new intention, but for the first time the code will match the
      intention, AFAICS.  It was intended by commit 183f6371 ("mm: ignore
      mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
      really worked, as mempolicy restriction was already encoded in nodemask,
      not zonelist, at that time.
      
      So originally that was for ALLOC_NO_WATERMARK only.  Then it was
      adjusted by e46e7b77 ("mm, page_alloc: recalculate the preferred
      zoneref if the context can ignore memory policies") and cd04ae1e
      ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
      current state.  So even GFP_ATOMIC would now ignore mempolicies after
      the initial attempts fail - if the code worked as people thought it
      does.
      
      Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d6a24df0
    • P
      mm: skip invalid pages block at a time in zero_resv_unresv() · 720e14eb
      Pavel Tatashin committed
      The role of zero_resv_unavail() is to make sure that every struct page
      that is allocated but not backed by memory accessible to the kernel is
      zeroed rather than left in some uninitialized state.
      
      Since struct pages are allocated in blocks (2M pages in the x86 case), we
      can skip pageblock_nr_pages at a time when the first page of a block is
      found to be invalid.
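      
      The inner loop of zero_resv_unavail() then looks roughly like this (a
      sketch of the idea; the outer loop walking the reserved ranges is
      unchanged):
      
              for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                      if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
                              /*
                               * Struct pages come in blocks; if the block's
                               * first pfn is invalid there is nothing to zero
                               * in it, so jump past the whole block.
                               */
                              pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
                                      + pageblock_nr_pages - 1;
                              continue;
                      }
                      mm_zero_struct_page(pfn_to_page(pfn));
                      pgcnt++;
              }
      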
      
      This optimization may help since now on x86 every hole in e820 maps is
      marked as reserved in memblock, and thus will go through this function.
      
      This function is called before sched_clock() is initialized, so I used
      my x86 early boot clock patches to measure the performance improvement.
      
      With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
      but with this optimization 0.001103s.
      
      Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      720e14eb
  18. 06 Aug 2018, 2 commits
    • P
      PM / reboot: Eliminate race between reboot and suspend · 55f2503c
      Pingfan Liu committed
      At present, "systemctl suspend" and "shutdown" can run in parrallel. A
      system can suspend after devices_shutdown(), and resume. Then the shutdown
      task goes on to power off. This causes many devices are not really shut
      off. Hence replacing reboot_mutex with system_transition_mutex (renamed
      from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
      system_transition_mutex can be better to reflect the purpose of the mutex.
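      
      A rough sketch of where each path now takes the lock (illustrative only;
      enter_state() is the suspend entry point in kernel/power/suspend.c, and the
      second fragment stands in for the reboot/shutdown syscall path):
      
      /* suspend path */
      static int enter_state(suspend_state_t state)
      {
              if (!mutex_trylock(&system_transition_mutex))
                      return -EBUSY;
              /* ... suspend/resume work ... */
              mutex_unlock(&system_transition_mutex);
              return 0;
      }
      
      /* reboot/shutdown path, now serialized on the same mutex */
      SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
                      void __user *, arg)
      {
              mutex_lock(&system_transition_mutex);
              /* ... kernel_power_off() / kernel_restart() ... */
              mutex_unlock(&system_transition_mutex);
              return 0;
      }
      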
      Signed-off-by: Pingfan Liu <kernelfans@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      55f2503c
    • D
      mm: Allow non-direct-map arguments to free_reserved_area() · 0d834328
      Dave Hansen committed
      free_reserved_area() takes pointers as arguments to show which addresses
      should be freed.  However, it does this in a somewhat ambiguous way.  If it
      gets a kernel direct map address, it always works, but if it gets an
      address that is part of the kernel image alias mapping, it can fail.
      
      It fails if all of the following happen:
       * The specified address is part of the kernel image alias
       * Poisoning is requested (forcing a memset())
       * The address is in a read-only portion of the kernel image
      
      The memset() fails on the read-only mapping, of course.
      free_reserved_area() *is* called both on the direct map and on kernel image
      alias addresses.  We've just lucked out thus far that the kernel image
      alias areas it gets used on are read-write.  I'm fairly sure this has been
      just a happy accident.
      
      It is quite easy to make free_reserved_area() work for all cases: just
      convert the address to a direct map address before doing the memset(), and
      do this unconditionally.  There is little chance of a regression here
      because we previously did a virt_to_page() on the address for the memset,
      so we know these are not highmem pages for which virt_to_page() would fail.
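      
      Roughly, the per-page loop then becomes the sketch below (an illustration of
      the fix; the real function also counts the freed pages and prints a
      summary):
      
              for (pos = start; pos < end; pos += PAGE_SIZE) {
                      struct page *page = virt_to_page(pos);
                      void *direct_map_addr;
      
                      /*
                       * direct_map_addr might differ from pos because some
                       * architectures' virt_to_page() also accepts kernel
                       * image aliases; page_address() hands back a writeable
                       * direct map alias for the memset().
                       */
                      direct_map_addr = page_address(page);
                      if ((unsigned int)poison <= 0xFF)
                              memset(direct_map_addr, poison, PAGE_SIZE);
      
                      free_reserved_page(page);
              }
      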
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
      0d834328
  19. 17 Jul 2018, 1 commit
  20. 15 Jul 2018, 1 commit
    • P
      mm: zero unavailable pages before memmap init · e181ae0c
      Pavel Tatashin committed
      We must zero struct pages for memory that is not backed by physical
      memory, or that the kernel does not have access to.
      
      Recently, there was a change which zeroed all memmap for all holes in
      e820.  Unfortunately, it introduced a bug that is discussed here:
      
        https://www.spinics.net/lists/linux-mm/msg156764.html
      
      Linus, also saw this bug on his machine, and confirmed that reverting
      commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") fixes the issue.
      
      The problem is that we incorrectly zero some struct pages after they
      were set up.
      
      The fix is to zero the unavailable struct pages prior to initializing the
      struct pages.
      
      A more detailed fix should come later that would avoid double zeroing
      cases: one in __init_single_page(), the other one in
      zero_resv_unavail().
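      
      The ordering change itself is small; roughly (a sketch based on the
      description above, assuming free_area_init_nodes() as the call site):
      
              /*
               * Zero the unavailable ranges first, instead of after the loop
               * where it could overwrite struct pages that
               * free_area_init_node() had already initialized.
               */
              zero_resv_unavail();
              for_each_online_node(nid) {
                      pg_data_t *pgdat = NODE_DATA(nid);
      
                      free_area_init_node(nid, NULL,
                                          find_min_pfn_for_node(nid), NULL);
                      if (pgdat->node_present_pages)
                              node_set_state(nid, N_MEMORY);
                      check_for_memory(pgdat, nid);
              }
      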
      
      Fixes: 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
      Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e181ae0c
  21. 15 Jun 2018, 1 commit
  22. 08 Jun 2018, 8 commits