1. 27 December 2019, 14 commits
    • mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE · 148b50b7
      Committed by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      __GFP_THISNODE specifically asks for the memory to be allocated from the
      given node. Not all requests that end up in __alloc_pages_nodemask()
      originate from process context, where cpuset enforcement makes sense. The
      current condition applies the cpuset limitation to every allocation,
      whether it originates from process context or not, which prevents
      __GFP_THISNODE-mandated allocations from coming from the specified node.
      For a coherent device memory node, which is isolated from every cpuset
      nodemask in the system, this blocks the only way to allocate from it;
      this patch changes that.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      148b50b7
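      A minimal user-space sketch of the check described in this entry (the
      flag value, mask layout, and function name are illustrative, not the
      kernel's __alloc_pages_nodemask() code):

        #include <stdbool.h>
        #include <stdio.h>

        #define GFP_THISNODE (1u << 0)   /* stand-in for __GFP_THISNODE */

        /* Return true when the allocation may be served from node 'nid'. */
        static bool node_allowed(unsigned gfp, int nid, unsigned cpuset_mask)
        {
            if (gfp & GFP_THISNODE)      /* explicit node request        */
                return true;             /* bypass the cpuset limitation */
            return cpuset_mask & (1u << nid);
        }

        int main(void)
        {
            /* CDM node 2 is outside every cpuset mask (mask = nodes 0-1). */
            printf("%d\n", node_allowed(GFP_THISNODE, 2, 0x3)); /* 1: allowed  */
            printf("%d\n", node_allowed(0, 2, 0x3));            /* 0: filtered */
            return 0;
        }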
    • mm: Enable Buddy allocation isolation for CDM nodes · 8877e9e4
      Committed by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      This implements allocation isolation for CDM nodes in the buddy allocator
      by always discarding CDM memory zones, except when the gfp flags include
      __GFP_THISNODE or when a non-NULL nodemask contains CDM nodes (an
      explicit allocation request in the kernel, or a user-process MPOL_BIND
      policy request).
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8877e9e4
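      A toy model of the zone filtering rule above (user-space C with
      illustrative names; the node masks are plain bitmasks, not the kernel's
      nodemask_t):

        #include <stdbool.h>
        #include <stdio.h>

        #define GFP_THISNODE (1u << 0)

        /* May a CDM zone be considered for this allocation request? */
        static bool use_cdm_zone(unsigned gfp, const unsigned *nodemask,
                                 unsigned cdm_nodes)
        {
            if (gfp & GFP_THISNODE)
                return true;                     /* explicit node request   */
            if (nodemask && (*nodemask & cdm_nodes))
                return true;                     /* e.g. MPOL_BIND on CDM   */
            return false;                        /* implicit: skip CDM zone */
        }

        int main(void)
        {
            unsigned bind_cdm = 0x4;             /* nodemask = { node 2 }   */

            printf("%d\n", use_cdm_zone(0, NULL, 0x4));            /* 0 */
            printf("%d\n", use_cdm_zone(GFP_THISNODE, NULL, 0x4)); /* 1 */
            printf("%d\n", use_cdm_zone(0, &bind_cdm, 0x4));       /* 1 */
            return 0;
        }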
    • mm: Change generic FALLBACK zonelist creation process · 023d1127
      Committed by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      Kernel allocations to a CDM node are already prevented by putting its
      entire memory in ZONE_MOVABLE. But CDM nodes must also be isolated from
      implicit allocations happening on the system.

      An isolation-seeking CDM node requires isolation from implicit memory
      allocations from user space, but at the same time there must also be an
      explicit way to allocate its memory.

      A node's two zonelists determine where the memory comes from when there
      is an allocation request. To achieve the two objectives stated above, the
      zonelist building process has to change, as both zonelists (i.e. FALLBACK
      and NOFALLBACK) give access to the node's memory zones during any kind of
      memory allocation. The following changes are implemented in this regard.
      
      * CDM node's zones are not part of any other node's FALLBACK zonelist
      * CDM node's FALLBACK list contains its own memory zones followed by
        all system RAM zones in regular order as before
      * CDM node's zones are part of its own NOFALLBACK zonelist
      
      These changes ensure the following, which in turn isolates the CDM
      nodes as desired.

      * There won't be any implicit memory allocation ending up in the CDM node
      * Only __GFP_THISNODE marked allocations will come from the CDM node
      * CDM node memory can be allocated through the mbind(MPOL_BIND) interface
      * System RAM will be used as a fallback option, in regular order, in
        case the CDM memory is insufficient during a targeted allocation request
      
      Sample zonelist configuration:
      
      [NODE (0)]						RAM
              ZONELIST_FALLBACK (0xc00000000140da00)
                      (0) (node 0) (DMA     0xc00000000140c000)
                      (1) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc000000001411a10)
                      (0) (node 0) (DMA     0xc00000000140c000)
      [NODE (1)]						RAM
              ZONELIST_FALLBACK (0xc000000100001a00)
                      (0) (node 1) (DMA     0xc000000100000000)
                      (1) (node 0) (DMA     0xc00000000140c000)
              ZONELIST_NOFALLBACK (0xc000000100005a10)
                      (0) (node 1) (DMA     0xc000000100000000)
      [NODE (2)]						CDM
              ZONELIST_FALLBACK (0xc000000001427700)
                      (0) (node 2) (Movable 0xc000000001427080)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc00000000142b710)
                      (0) (node 2) (Movable 0xc000000001427080)
      [NODE (3)]						CDM
              ZONELIST_FALLBACK (0xc000000001431400)
                      (0) (node 3) (Movable 0xc000000001430d80)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc000000001435410)
                      (0) (node 3) (Movable 0xc000000001430d80)
      [NODE (4)]						CDM
              ZONELIST_FALLBACK (0xc00000000143b100)
                      (0) (node 4) (Movable 0xc00000000143aa80)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc00000000143f110)
                      (0) (node 4) (Movable 0xc00000000143aa80)
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      023d1127
    • mm: Define coherent device memory (CDM) node · 4886e905
      Committed by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      Certain devices such as specialized accelerators, GPU cards, network
      cards, and FPGA cards may contain onboard memory that is coherent with
      the existing system RAM when accessed either from the CPU or from the
      device. This memory shares some properties with normal system RAM but
      can also differ from it in other respects.

      User applications might want to use this kind of coherent device memory
      explicitly or implicitly alongside the system RAM, utilizing all the
      usual core memory functions such as anon mapping (LRU), file mapping
      (LRU), page cache (LRU), driver managed (non-LRU), HW poisoning, NUMA
      migrations, etc. To achieve this kind of tight integration with the core
      memory subsystem, the device's onboard coherent memory must be
      represented as a memory-only NUMA node. At the same time, the
      architecture must export a function to identify such a node as coherent
      device memory rather than any other regular CPU-less, memory-only NUMA
      node.
      
      After the integration with the core memory subsystem is achieved,
      coherent device memory might still need special consideration inside the
      kernel. There can be a variety of coherent memory nodes with different
      expectations of the core kernel memory. But right now only one kind of
      special treatment is considered, which requires a certain isolation.

      Now consider a coherent device memory node type that requires isolation.
      This kind of coherent memory is onboard an external device attached to
      the system through a link where there is always a chance of a link
      failure taking down the entire memory node with it. Moreover, the memory
      might also have a higher chance of ECC failure compared to the system
      RAM. Hence allocations into this kind of coherent memory node should be
      regulated. Kernel allocations must not come here. Normal user space
      allocations too should not come here implicitly (without the user
      application knowing about it). This summarizes the isolation requirement
      of a certain kind of coherent device memory node as an example. There
      can be different kinds of isolation requirements as well.

      Some coherent memory devices might not require isolation altogether.
      Other coherent memory devices might require some other special treatment
      after becoming part of the core memory representation. For now, only
      isolation-seeking coherent device memory nodes are considered, not the
      other ones.
      
      To implement the integration as well as the isolation, the coherent
      memory node must be present in N_MEMORY and in a new N_COHERENT_DEVICE
      node mask inside the node_states[] array. During memory hotplug
      operations, the new nodemask N_COHERENT_DEVICE is updated along with
      N_MEMORY for these coherent device memory nodes. This also creates the
      following new sysfs-based interface to list all the coherent memory
      nodes of the system.

      	/sys/devices/system/node/is_cdm_node

      Architectures must export the function arch_check_node_cdm(), which
      identifies any coherent device memory node, when they enable
      CONFIG_COHERENT_DEVICE.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      [Backported to 4.19
      -remove set or clear node state for memory_hotplug
      -separate CONFIG_COHERENT and CPUSET]
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4886e905
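      A user-space toy of the node-state bookkeeping described above; the arch
      hook and the mask variables only mimic the idea, they are not the
      kernel's node_states[] implementation:

        #include <stdbool.h>
        #include <stdio.h>

        static unsigned n_memory;            /* nodes that have memory */
        static unsigned n_coherent_device;   /* subset that is CDM     */

        /* Stand-in for the arch hook: pretend nodes 2+ are device memory. */
        static bool arch_check_node_cdm(int nid)
        {
            return nid >= 2;
        }

        static void node_set_memory_state(int nid)
        {
            n_memory |= 1u << nid;
            if (arch_check_node_cdm(nid))
                n_coherent_device |= 1u << nid;
        }

        int main(void)
        {
            for (int nid = 0; nid < 4; nid++)
                node_set_memory_state(nid);
            printf("N_MEMORY=0x%x N_COHERENT_DEVICE=0x%x\n",
                   n_memory, n_coherent_device);   /* 0xf and 0xc */
            return 0;
        }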
    • mm/hotplug: treat CMA pages as unmovable · 4700cf13
      Committed by Qian Cai
      mainline inclusion
      from mainline-5.1-rc6
      commit 1a9f2191
      category: bugfix
      bugzilla: 14055
      CVE: NA
      
      -------------------------------------------------
      
      has_unmovable_pages() is used when allocating CMA and gigantic pages as
      well as by memory hotplug.  The latter doesn't currently know how to
      offline a CMA pool properly, but if an unused (free) CMA page is
      encountered, has_unmovable_pages() happily considers it free memory and
      propagates this up the call chain.  The memory offlining code then frees
      the page without a proper CMA teardown, which leads to accounting issues.
      Moreover, if the same memory range is onlined again, the memory never
      gets back to the CMA pool.
      
      State after memory offline:
      
       # grep cma /proc/vmstat
       nr_free_cma 205824
      
       # cat /sys/kernel/debug/cma/cma-kvm_cma/count
       209920
      
      Also, kmemleak still thinks those memory addresses are reserved (see
      below) although they have already been handed to the buddy allocator
      after onlining.  This patch fixes the situation by treating CMA
      pageblocks as unmovable except when has_unmovable_pages() is called as
      part of a CMA allocation.
      
        Offlined Pages 4096
        kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
        Call Trace:
          dump_stack+0xb0/0xf4 (unreliable)
          create_object+0x344/0x380
          __kmalloc_node+0x3ec/0x860
          kvmalloc_node+0x58/0x110
          seq_read+0x41c/0x620
          __vfs_read+0x3c/0x70
          vfs_read+0xbc/0x1a0
          ksys_read+0x7c/0x140
          system_call+0x5c/0x70
        kmemleak: Kernel memory leak detector disabled
        kmemleak: Object 0xc000201cc8000000 (size 13757317120):
        kmemleak:   comm "swapper/0", pid 0, jiffies 4294937297
        kmemleak:   min_count = -1
        kmemleak:   count = 0
        kmemleak:   flags = 0x5
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             cma_declare_contiguous+0x2a4/0x3b0
             kvm_cma_reserve+0x11c/0x134
             setup_arch+0x300/0x3f8
             start_kernel+0x9c/0x6e8
             start_here_common+0x1c/0x4b0
        kmemleak: Automatic memory scanning thread ended
      
      [cai@lca.pw: use is_migrate_cma_page() and update commit log]
        Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4700cf13
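      A toy model of the rule this patch adds: a CMA pageblock counts as
      unmovable for every caller except the CMA allocation path itself.  Flag
      and migratetype values here are illustrative, not the kernel's:

        #include <stdbool.h>
        #include <stdio.h>

        #define MIGRATE_CMA    1
        #define FLAG_CMA_ALLOC (1u << 0)   /* "called from CMA allocation" */

        static bool page_is_unmovable(int migratetype, unsigned flags)
        {
            if (migratetype == MIGRATE_CMA)
                return !(flags & FLAG_CMA_ALLOC);  /* hotplug: unmovable  */
            return false;                          /* other checks elided */
        }

        int main(void)
        {
            printf("%d\n", page_is_unmovable(MIGRATE_CMA, 0));              /* 1 */
            printf("%d\n", page_is_unmovable(MIGRATE_CMA, FLAG_CMA_ALLOC)); /* 0 */
            return 0;
        }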
    • mm/hotplug: fix offline undo_isolate_page_range() · 47669159
      Committed by Qian Cai
      mainline inclusion
      from mainline-5.1-rc3
      commit 9b7ea46a82b31c74a37e6ff1c2a1df7d53e392ab
      category: bugfix
      bugzilla: 13472
      CVE: NA
      
      -------------------------------------------------
      
      Commit f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded
      memory to zones until online") introduced move_pfn_range_to_zone(), which
      calls memmap_init_zone() while onlining a memory block.
      memmap_init_zone() resets the pagetype flags and sets the migrate type
      to MOVABLE.

      However, __offline_pages() also calls undo_isolate_page_range() after
      offline_isolated_pages() to do the same thing.  Because commit
      2ce13640 ("mm: __first_valid_page skip over offline pages") changed
      __first_valid_page() to skip offline pages, undo_isolate_page_range()
      here just wastes CPU cycles looping over the offlining PFN range while
      doing nothing, because __first_valid_page() returns NULL:
      offline_isolated_pages() has already marked all memory sections within
      the pfn range as offline via offline_mem_sections().
      
      Also, after calling the "useless" undo_isolate_page_range() here, the
      code reaches the point of no return by notifying MEM_OFFLINE.  Those
      pages will be marked MIGRATE_MOVABLE again once they are onlined.  The
      only thing left to do is to decrease the zone counter of isolated
      pageblocks, which would otherwise make some page allocation paths
      slower, as the above commit introduced.

      Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
      on ppc64, an "int" should still be enough to represent the number of
      pageblocks there.  Fix an incorrect comment along the way.
      
      [cai@lca.pw: v4]
        Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
      Fixes: 2ce13640 ("mm: __first_valid_page skip over offline pages")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      47669159
    • mm: only report isolation failures when offlining memory · 2a5141d5
      Committed by Michal Hocko
      mainline inclusion
      from mainline-v5.0-rc1
      commit d381c54760dcfad23743da40516e7e003d73952a
      category: bugfix
      bugzilla: 13472
      CVE: NA
      
      ------------------------------------------------
      
      Heiko has complained that his log is swamped by warnings from
      has_unmovable_pages
      
      [   20.536664] page dumped because: has_unmovable_pages
      [   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
      [   20.536794] flags: 0x3fffe0000010200(slab|head)
      [   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
      [   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
      [   20.536797] page dumped because: has_unmovable_pages
      [   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
      [   20.536815] flags: 0x7fffe0000000000()
      [   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
      [   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
      
      which are not triggered by memory hotplug but rather by the CMA
      allocator.  The original idea behind dumping the page state for all call
      paths was that these messages would be helpful for debugging failures.
      From the above it seems that this is not the case for the CMA path
      because much more context is missing.  E.g. the second reported page
      might be a CMA-allocated page.  It is still interesting to see a slab
      page in the CMA area, but it is hard to tell whether this is a bug from
      the above output alone.

      Address this issue by dumping the page state only on request.  Both
      start_isolate_page_range and has_unmovable_pages already have an argument
      to ignore hwpoison pages, so make this argument more generic, turn it
      into flags, and allow callers to combine non-default modes into a mask.
      While we are at it, reporting the failure from the has_unmovable_pages
      call in is_pageblock_removable_nolock (the sysfs removable file) is
      questionable, so drop it from there as well.
      
      Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2a5141d5
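      A sketch of the interface change described above: the former boolean
      "skip hwpoisoned pages" argument becomes a flags mask so callers can
      also opt in to failure reporting.  The flag names mirror the intent, but
      this user-space toy is not the kernel function:

        #include <stdbool.h>
        #include <stdio.h>

        #define SKIP_HWPOISON  (1u << 0)
        #define REPORT_FAILURE (1u << 1)

        static bool has_unmovable_pages(unsigned flags)
        {
            bool unmovable = true;   /* pretend an unmovable page was found */

            if (unmovable && (flags & REPORT_FAILURE))
                printf("page dumped because: unmovable\n"); /* dump_page() */
            return unmovable;
        }

        int main(void)
        {
            has_unmovable_pages(SKIP_HWPOISON);                  /* CMA: silent  */
            has_unmovable_pages(SKIP_HWPOISON | REPORT_FAILURE); /* hotplug: log */
            return 0;
        }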
    • mm, memory_hotplug: be more verbose for memory offline failures · 4a5f2575
      Committed by Michal Hocko
      mainline inclusion
      from mainline-5.0-rc1
      commit 2932c8b0
      category: bugfix
      bugzilla: 13472
      CVE: NA
      
      ------------------------------------------------
      
      There is only very limited information printed when the memory offlining
      fails:
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the failure was triggered by userspace intervention,
      but it doesn't tell us much more about the underlying reason.  It might
      be that page migration fails repeatedly and the userspace timeout
      expires and sends a signal, or it might be that some of the earlier
      steps (isolation, memory notifier) take too long.

      If the migration fails, it would be really helpful to see which page
      failed and its state.  The same applies to the isolation phase.  If we
      fail to isolate a page from the allocator, then knowing the state of the
      page would be helpful as well.
      
      Dump the page state that fails to get isolated or migrated.  This will
      tell us more about the failure and what to focus on during debugging.
      
      [akpm@linux-foundation.org: add missing printk arg]
      [mhocko@suse.com: tweak dump_page() `reason' text]
        Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4a5f2575
    • mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · 92834760
      Committed by Jann Horn
      [ Upstream commit 2c2ade81 ]
      
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      92834760
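      A toy of the reference arithmetic above: every fragment handed out
      transfers one pre-charged reference to its caller, and the frag cache
      needs one extra reference of its own to be able to recycle the page, so
      the bias must be size + 1.  The numbers are illustrative:

        #include <stdio.h>

        /* References the cache still owns after all fragments are handed out. */
        static int refs_kept_by_cache(int size, int bias)
        {
            return bias - size;
        }

        int main(void)
        {
            int size = 4;

            printf("bias = size:     %d ref left (bug)\n",
                   refs_kept_by_cache(size, size));
            printf("bias = size + 1: %d ref left (fix)\n",
                   refs_kept_by_cache(size, size + 1));
            return 0;
        }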
    • page_poison: play nicely with KASAN · 15129a33
      Committed by Qian Cai
      mainline inclusion
      from mainline-5.0
      commit 4117992df66a
      category: bugfix
      bugzilla: 11620
      CVE: NA
      
      ------------------------------------------------
      
      KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
      It triggers false positives in the allocation path,
      
      BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
      Read of size 8 at addr ffff88881f800000 by task swapper/0
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
      Call Trace:
       dump_stack+0xe0/0x19a
       print_address_description.cold.2+0x9/0x28b
       kasan_report.cold.3+0x7a/0xb5
       __asan_report_load8_noabort+0x19/0x20
       memchr_inv+0x2ea/0x330
       kernel_poison_pages+0x103/0x3d5
       get_page_from_freelist+0x15e7/0x4d90
      
      because KASAN has not yet unpoisoned the shadow page for the allocation
      before kernel_poison_pages() checks it with memchr_inv(), so only a
      stale poison pattern is found.

      There are also false positives in the free path,
      
      BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
      Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
      CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
      Call Trace:
       dump_stack+0xe0/0x19a
       print_address_description.cold.2+0x9/0x28b
       kasan_report.cold.3+0x7a/0xb5
       check_memory_region+0x22d/0x250
       memset+0x28/0x40
       kernel_poison_pages+0x29e/0x3d5
       __free_pages_ok+0x75f/0x13e0
      
      because KASAN adds poisoned redzones around slab objects, while page
      poisoning needs to poison the whole page.
      
      Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      15129a33
    • pagecache: add Kconfig to enable/disable the feature · 862e2308
      Committed by zhongjiang
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      Just add a Kconfig option for the feature.
      Signed-off-by: zhongjiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      862e2308
    • pagecache: add sysctl interface to limit pagecache · 6174ecb5
      Committed by zhong jiang
      euleros inclusion
      category: feature
      feature: pagecache limit
      
      Add a proc sysctl interface to set the pagecache limit for reclaiming memory.
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6174ecb5
    • mm/page_alloc.c: don't call kasan_free_pages() at deferred mem init · d453a476
      Committed by Waiman Long
      [ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]
      
      When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
      pages initialization can take a long time.  Below were the reported init
      times on an 8-socket 96-core 4TB IvyBridge system.
      
        1) Non-debug kernel without CONFIG_KASAN
           [    8.764222] node 1 initialised, 132086516 pages in 7027ms
      
        2) Debug kernel with CONFIG_KASAN
           [  146.288115] node 1 initialised, 132075466 pages in 143052ms
      
      So the page init time in a debug kernel was 20X that of the non-debug
      kernel.  The long init time can be problematic because the page
      initialization is done with interrupts disabled.  In this particular
      case, it caused the following warning messages to appear, as well as NMI
      backtraces on all the cores that were doing the initialization.
      
      [   68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
      [   68.241000] rcu: 	25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
      [   68.241000] rcu: 	44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
      [   68.241000] rcu: 	54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
      [   68.241000] rcu: 	60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
      [   68.241000] rcu: 	72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
      [   68.241000] rcu: 	84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
      [   68.241000] rcu: 	111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
      [   68.241000] rcu: 	(detected by 13, t=65018 jiffies, g=249, q=2)
      
      The long init time was mainly caused by the call to kasan_free_pages() to
      poison the newly initialized pages.  On a 4TB system, we are talking about
      almost 500GB of memory probably on the same node.
      
      In reality, we may not need to poison the newly initialized pages before
      they are ever allocated.  So KASAN poisoning of freed pages before the
      completion of deferred memory initialization is now disabled.  Those pages
      will be properly poisoned when they are allocated or freed after deferred
      pages initialization is done.
      
      With this change, the new page initialization time became:
      
      [   21.948010] node 1 initialised, 132075466 pages in 18702ms
      
      This was still about double the non-debug kernel time, but was much
      better than before.
      
      Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
      Signed-off-by: Waiman Long <longman@redhat.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d453a476
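      A toy model of the change: frees that happen before deferred memmap
      initialization has completed skip the expensive KASAN poisoning; the
      flag and function names here are illustrative only:

        #include <stdbool.h>
        #include <stdio.h>

        static bool deferred_init_done;

        static void kasan_poison_free(void) { puts("poisoning shadow (slow)"); }

        static void free_page_during_boot(void)
        {
            if (deferred_init_done)
                kasan_poison_free();     /* normal operation               */
            else
                puts("deferred init in progress: poison skipped");
        }

        int main(void)
        {
            free_page_during_boot();     /* early boot: skipped            */
            deferred_init_done = true;
            free_page_during_boot();     /* later: poisoned as usual       */
            return 0;
        }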
    • Revert "mm, memory_hotplug: initialize struct pages for the full memory section" · 79d3b9b9
      Committed by Michal Hocko
      commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.
      
      This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
      
      The underlying assumption that one sparse section belongs to a single
      NUMA node doesn't really hold. Robert Shteynfeld has reported a boot
      failure. The boot log was not captured, but his memory layout is as
      follows:
      
        Early memory node ranges
          node   1: [mem 0x0000000000001000-0x0000000000090fff]
          node   1: [mem 0x0000000000100000-0x00000000dbdf8fff]
          node   1: [mem 0x0000000100000000-0x0000001423ffffff]
          node   0: [mem 0x0000001424000000-0x0000002023ffffff]
      
      This means that node0 starts in the middle of a memory section which is
      also in node1.  memmap_init_zone tries to initialize the padding of a
      section even when it is outside of the given pfn range, because there are
      code paths (e.g.  memory hotplug) which assume that the full worth of a
      memory section is always initialized.

      In this particular case, though, such a range is already initialized and
      most likely already managed by the page allocator.  Scribbling over
      those pages corrupts the internal state and likely blows up when any of
      those pages gets used.
      Reported-by: Robert Shteynfeld <robert.shteynfeld@gmail.com>
      Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
      Cc: stable@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      79d3b9b9
  2. 29 December 2018, 2 commits
    • mm, page_alloc: fix has_unmovable_pages for HugePages · e27666dd
      Committed by Oscar Salvador
      commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream.
      
      While playing with gigantic hugepages and memory_hotplug, I triggered
      the following #PF when "cat memoryX/removable":
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP PTI
        CPU: 1 PID: 1481 Comm: cat Tainted: G            E     4.20.0-rc6-mm1-1-default+ #18
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:has_unmovable_pages+0x154/0x210
        Call Trace:
         is_mem_section_removable+0x7d/0x100
         removable_show+0x90/0xb0
         dev_attr_show+0x1c/0x50
         sysfs_kf_seq_show+0xca/0x1b0
         seq_read+0x133/0x380
         __vfs_read+0x26/0x180
         vfs_read+0x89/0x140
         ksys_read+0x42/0x90
         do_syscall_64+0x5b/0x180
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The reason is that we do not pass the head page to page_hstate(), so the
      call to compound_order() in page_hstate() returns 0, and we end up
      checking every hstate's size against PAGE_SIZE.

      Obviously, we do not find any hstate matching that size, and we return
      NULL.  Then, we dereference that NULL pointer in
      hugepage_migration_supported() and we get the #PF from above.
      
      Fix that by getting the head page before calling page_hstate().
      
      Also, since gigantic pages span several pageblocks, re-adjust the logic
      for skipping pages.  While at it, we can also get rid of the
      round_up().
      
      [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
        Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
      Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      e27666dd
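      A toy illustration of the head-page fix: only the head of a compound
      page stores the order, so a size lookup done on a tail page sees order
      0.  The struct layout and order value are illustrative, not kernel code:

        #include <stdio.h>

        struct page {
            struct page *head;   /* compound head this page belongs to */
            unsigned order;      /* valid on the head page only        */
        };

        static unsigned compound_order(const struct page *p)
        {
            return p->order;
        }

        int main(void)
        {
            struct page head = { .head = &head, .order = 18 }; /* 16GB, 64K pages */
            struct page tail = { .head = &head, .order = 0  };

            printf("buggy (tail): order %u\n", compound_order(&tail));     /* 0  */
            printf("fixed (head): order %u\n", compound_order(tail.head)); /* 18 */
            return 0;
        }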
    • mm, memory_hotplug: initialize struct pages for the full memory section · 7592dbfa
      Committed by Mikhail Zaslonko
      commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream.
      
      If the memory end is not aligned with the sparse memory section boundary,
      the mapping of such a section is only partly initialized.  This may lead
      to a VM_BUG_ON due to uninitialized struct page access from the
      is_mem_section_removable() or test_pages_in_a_zone() functions, triggered
      by the memory_hotplug sysfs handlers:
      
      Here are the panic examples:
       CONFIG_DEBUG_VM=y
       CONFIG_DEBUG_VM_PGFLAGS=y
      
       kernel parameter mem=2050M
       --------------------------
       page:000003d082008000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( test_pages_in_a_zone+0xde/0x160)
         show_valid_zones+0x5c/0x190
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         test_pages_in_a_zone+0xde/0x160
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
       kernel parameter mem=3075M
       --------------------------
       page:000003d08300c000 is uninitialized and poisoned
       page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
       Call Trace:
       ( is_mem_section_removable+0xb4/0x190)
         show_mem_removable+0x9a/0xd8
         dev_attr_show+0x34/0x70
         sysfs_kf_seq_show+0xc8/0x148
         seq_read+0x204/0x480
         __vfs_read+0x32/0x178
         vfs_read+0x82/0x138
         ksys_read+0x5a/0xb0
         system_call+0xdc/0x2d8
       Last Breaking-Event-Address:
         is_mem_section_removable+0xb4/0x190
       Kernel panic - not syncing: Fatal exception: panic_on_oops
      
      Fix the problem by initializing the last memory section of each zone in
      memmap_init_zone() till the very end, even if it goes beyond the zone end.
      
      Michal said:
      
      : This has alwways been problem AFAIU.  It just went unnoticed because we
      : have zeroed memmaps during allocation before f7f99100 ("mm: stop
      : zeroing memory during allocation in vmemmap") and so the above test
      : would simply skip these ranges as belonging to zone 0 or provided a
      : garbage.
      :
      : So I guess we do care for post f7f99100 kernels mostly and
      : therefore Fixes: f7f99100 ("mm: stop zeroing memory during
      : allocation in vmemmap")
      
      Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
      Reviewed-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Tested-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7592dbfa
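      A toy of the alignment arithmetic behind the fix: the memmap of the last
      section is initialized up to the section boundary even when the memory
      end stops short of it.  The section size below is an example value, not
      every architecture's:

        #include <stdio.h>

        #define PAGES_PER_SECTION (1ul << 15)   /* 128MB sections, 4K pages */

        static unsigned long section_align_up(unsigned long pfn)
        {
            return (pfn + PAGES_PER_SECTION - 1) & ~(PAGES_PER_SECTION - 1);
        }

        int main(void)
        {
            unsigned long end_pfn = (2050ul << 20) >> 12;   /* mem=2050M */

            printf("memory end pfn: %lu\n", end_pfn);
            printf("init up to pfn: %lu\n", section_align_up(end_pfn));
            return 0;
        }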
  3. 17 December 2018, 1 commit
    • mm/page_alloc.c: fix calculation of pgdat->nr_zones · 505bc9f3
      Committed by Wei Yang
      [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]
      
      init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
      'zone_idx(zone) + 1' unconditionally.  This is correct in the normal
      case, but not exact in the hot-plug situation.
      
      This function is used in two places:
      
        * free_area_init_core()
        * move_pfn_range_to_zone()
      
      In the first case, we are sure the zone index increases monotonically.
      In the second one, it is under the user's control.
      
      One way to reproduce this is:
      ----------------------------
      
      1. create a virtual machine with empty node1
      
         -m 4G,slots=32,maxmem=32G \
         -smp 4,maxcpus=8          \
         -numa node,nodeid=0,mem=4G,cpus=0-3 \
         -numa node,nodeid=1,mem=0G,cpus=4-7
      
      2. hot-add cpu 3-7
      
         cpu-add [3-7]
      
3. hot-add memory to node1
      
         object_add memory-backend-ram,id=ram0,size=1G
         device_add pc-dimm,id=dimm0,memdev=ram0,node=1
      
4. online memory in the following order
      
         echo online_movable > memory47/state
         echo online > memory40/state
      
      After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
      instead of (ZONE_MOVABLE + 1).
      
      Michal said:
       "Having an incorrect nr_zones might result in all sorts of problems
        which would be quite hard to debug (e.g. reclaim not considering the
        movable zone). I do not expect many users would suffer from this it
        but still this is trivial and obviously right thing to do so
        backporting to the stable tree shouldn't be harmful (last famous
        words)"
      
      Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      505bc9f3
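      A toy of the fix: when onlining hot-added memory, pgdat->nr_zones must
      only ever grow, so it is clamped instead of being overwritten with
      zone_idx + 1 unconditionally.  The values and the function shape are
      illustrative, not the kernel's implementation:

        #include <stdio.h>

        #define ZONE_NORMAL  2
        #define ZONE_MOVABLE 3

        static int nr_zones;

        static void init_currently_empty_zone(int zone_idx)
        {
            if (zone_idx + 1 > nr_zones)   /* the fix: never shrink nr_zones */
                nr_zones = zone_idx + 1;
        }

        int main(void)
        {
            init_currently_empty_zone(ZONE_MOVABLE);  /* online_movable first */
            init_currently_empty_zone(ZONE_NORMAL);   /* then plain online    */
            printf("nr_zones = %d\n", nr_zones);      /* 4, not 3             */
            return 0;
        }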
  4. 01 December 2018, 2 commits
    • mm, page_alloc: check for max order in hot path · 9dec3855
      Committed by Michal Hocko
      [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]
      
      Konstantin has noticed that kvmalloc might trigger the following
      warning:
      
        WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
        [...]
        Call Trace:
         fragmentation_index+0x76/0x90
         compaction_suitable+0x4f/0xf0
         shrink_node+0x295/0x310
         node_reclaim+0x205/0x250
         get_page_from_freelist+0x649/0xad0
         __alloc_pages_nodemask+0x12a/0x2a0
         kmalloc_large_node+0x47/0x90
         __kmalloc_node+0x22b/0x2e0
         kvmalloc_node+0x3e/0x70
         xt_alloc_table_info+0x3a/0x80 [x_tables]
         do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
         nf_setsockopt+0x44/0x60
         SyS_setsockopt+0x6f/0xc0
         do_syscall_64+0x67/0x120
         entry_SYSCALL_64_after_hwframe+0x3d/0xa2
      
      The problem is that we only check for an out-of-bounds order in the slow
      path, while node reclaim might already happen from the fast path.  This
      is fixable by making sure that kvmalloc never uses kmalloc for requests
      that are larger than KMALLOC_MAX_SIZE, but it also shows that the code is
      rather fragile.  A recent UBSAN report just underlines that with the
      following report
      
        UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
        shift exponent 51 is too large for 32-bit type 'int'
        CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
        Call Trace:
         __dump_stack lib/dump_stack.c:77 [inline]
         dump_stack+0xd2/0x148 lib/dump_stack.c:113
         ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
         __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
         __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
         zone_watermark_fast mm/page_alloc.c:3216 [inline]
         get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
         __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
         alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
         alloc_pages include/linux/gfp.h:509 [inline]
         __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
         dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
         raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
         raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
         fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
         fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
         __blkdev_driver_ioctl block/ioctl.c:303 [inline]
         blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
         block_ioctl+0x105/0x150 fs/block_dev.c:1883
         vfs_ioctl fs/ioctl.c:46 [inline]
         do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
         ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
         __do_sys_ioctl fs/ioctl.c:709 [inline]
         __se_sys_ioctl fs/ioctl.c:707 [inline]
         __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
         do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Note that this is not a kvmalloc path.  It is just that the fast path
      really depends on having a sanitized order as well.  Therefore move the
      order check to the fast path.
      
      Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reported-by: Kyungtae Kim <kt0755@gmail.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Byoungyoung Lee <lifeasageek@gmail.com>
      Cc: "Dae R. Jeong" <threeearcat@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      9dec3855
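      A sketch of the hardening: reject an out-of-range order at the top of
      the allocator fast path rather than only in the slow path.  MAX_ORDER
      and the function shape are illustrative, not the kernel code:

        #include <stdbool.h>
        #include <stdio.h>

        #define MAX_ORDER 11

        static bool alloc_pages_fastpath(unsigned order)
        {
            if (order >= MAX_ORDER) {    /* checked before any zone walk */
                fprintf(stderr, "order %u out of range\n", order);
                return false;            /* kernel: WARN_ON_ONCE + NULL  */
            }
            return true;                 /* proceed with the allocation  */
        }

        int main(void)
        {
            printf("%d\n", alloc_pages_fastpath(3));   /* 1 */
            printf("%d\n", alloc_pages_fastpath(51));  /* 0 */
            return 0;
        }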
    • mm, memory_hotplug: check zone_movable in has_unmovable_pages · b44fd126
      Committed by Michal Hocko
      [ Upstream commit 9d7899999c62c1a81129b76d2a6ecbc4655e1597 ]
      
      Page state checks are racy.  Under a heavy memory workload (e.g.  stress
      -m 200 -t 2h) it is quite easy to hit a race window when the page is
      allocated but its state is not fully populated yet.  A debugging patch to
      dump the struct page state shows
      
        has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
        page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
        flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)
      
      Note that the state has been checked for both PageLRU and PageSwapBacked
      already.  Closing this race completely would require some sort of retry
      logic.  This can be tricky and error prone (think of potential endless
      or long taking loops).
      
      Work around this problem for movable zones at least.  Such a zone should
      only contain movable pages.  Commit 15c30bc0 ("mm, memory_hotplug:
      make has_unmovable_pages more robust") has told us that this is not
      strictly true, though.  Bootmem pages should be marked reserved, however,
      so we can move the original check after the PageReserved check.  Pages
      from other zones are still prone to races, but we do not even pretend
      that memory hot-remove works for those, so premature failure doesn't
      hurt that much.
      
      Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
      Fixes: 15c30bc0 ("mm, memory_hotplug: make has_unmovable_pages more robust")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Baoquan He <bhe@redhat.com>
      Tested-by: Baoquan He <bhe@redhat.com>
      Acked-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Balbir Singh <bsingharora@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      b44fd126
  5. 09 October 2018, 1 commit
  6. 02 October 2018, 1 commit
    • mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration · efaffc5e
      Committed by Mel Gorman
      Rate limiting of page migrations due to automatic NUMA balancing was
      introduced to mitigate the worst-case scenario of migrating at high
      frequency due to false sharing or slowly ping-ponging between nodes.
      Since then, a lot of effort was spent on correctly identifying these
      pages and avoiding unnecessary migrations and the safety net may no longer
      be required.
      
      Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
      avoids spreading STREAM tasks wide prematurely. However, once the task
      was properly placed, it delayed migrating the memory due to rate limiting.
      Increasing the limit fixed the problem for him.
      
      Currently, the limit is hard-coded and does not account for the real
      capabilities of the hardware. Even if an estimate was attempted, it would
      not properly account for the number of memory controllers and it could
      not account for the amount of bandwidth used for normal accesses. Rather
      than fudging, this patch simply eliminates the rate limiting.
      
      However, Jirka reports that a STREAM configuration using multiple
      processes achieved similar performance to 4.16. In local tests, this patch
      improved performance of STREAM relative to the baseline but it is somewhat
      machine-dependent. Most workloads show little or no performance difference,
      implying that there is no heavy reliance on the throttling mechanism
      and it is safe to remove.
      
      STREAM on 2-socket machine
                               4.19.0-rc5             4.19.0-rc5
                               numab-v1r1       noratelimit-v1r1
      MB/sec copy     43298.52 (   0.00%)    44673.38 (   3.18%)
      MB/sec scale    30115.06 (   0.00%)    31293.06 (   3.91%)
      MB/sec add      32825.12 (   0.00%)    34883.62 (   6.27%)
      MB/sec triad    32549.52 (   0.00%)    34906.60 (   7.24%)
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jirka Hladky <jhladky@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Linux-MM <linux-mm@kvack.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      efaffc5e
  7. 05 September 2018, 1 commit
  8. 30 August 2018, 1 commit
  9. 24 August 2018, 1 commit
  10. 23 August 2018, 5 commits
  11. 18 August 2018, 4 commits
    • mm, page_alloc: double zone's batchsize · d8a759b5
      Committed by Aaron Lu
      To improve the page allocator's performance for order-0 pages, each CPU
      has a Per-CPU-Pageset (PCP) per zone.  Whenever an order-0 page is
      needed, the PCP will be checked first before asking Buddy for pages.
      When the PCP is used up, a batch of pages will be fetched from Buddy to
      improve performance, and the size of that batch can affect performance.
      
      The zone's batch size was last doubled by commit ba56e91c ("mm:
      page_alloc: increase size of per-cpu-pages") over ten years ago.  Since
      then, CPUs have evolved a lot and CPU cache sizes have also increased.

      Dave Hansen is concerned that the current batch size doesn't fit well
      with modern hardware and suggested that I do two things: first, use a
      page-allocator-intensive benchmark, e.g.  will-it-scale/page_fault1, to
      find out how performance changes with different batch sizes on various
      machines and then choose a new default batch size; second, see how this
      new batch size works with other workloads.
      
      In the first test, we saw performance gains on high-core-count systems
      and little to no effect on older systems with more modest core counts.
      In this phase's test data, two candidates: 63 and 127 are chosen.
      
      In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
      and more will-it-scale sub-tests are tested to see how these two
      candidates work with these workloads, and a new default is decided
      according to their results.
      
      Most test results are flat.  will-it-scale/page_fault2 process mode has
      10%-18% performance increase on 4-sockets Skylake and Broadwell.
      vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
      4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
      drop.  Further analysis showed that, with a larger pcp->batch and thus
      larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
      maintained in this patch), zone lock contention shifted to LRU add side
      lock contention and that caused performance drop.  This performance drop
      might be mitigated by others' work on optimizing LRU lock.
      
      Another downside of increasing pcp->batch is that when the PCP is used up
      and a batch of pages needs to be fetched from Buddy, that refill can take
      longer than before since the batch is larger.  My understanding is that
      this doesn't affect the slowpath, where direct reclaim and compaction
      dominate.  For the fastpath, throughput is a win (according to
      will-it-scale/page_fault1) but worst-case latency can be larger now.

      Overall, I think doubling the batch size from 31 to 63 is relatively safe
      and provides a good performance boost for high-core-count systems.
      
      The two phase's test results are listed below(all tests are done with THP
      disabled).
      
      Phase one(will-it-scale/page_fault1) test results:
      
      Skylake-EX: an increased batch size has a good effect on zone->lock
      contention, though LRU contention rises at the same time and limits the
      final performance increase.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   15345900    +0.00%       64%                 8%           72%
       53   17903847   +16.67%       32%                38%           70%
       63   17992886   +17.25%       24%                45%           69%
       73   18022825   +17.44%       10%                61%           71%
      119   18023401   +17.45%        4%                66%           70%
      127   18029012   +17.48%        3%                66%           69%
      137   18036075   +17.53%        4%                66%           70%
      165   18035964   +17.53%        2%                67%           69%
      188   18101105   +17.95%        2%                67%           69%
      223   18130951   +18.15%        2%                67%           69%
      255   18118898   +18.07%        2%                67%           69%
      267   18101559   +17.96%        2%                67%           69%
      299   18160468   +18.34%        2%                68%           70%
      320   18139845   +18.21%        2%                67%           69%
      393   18160869   +18.34%        2%                68%           70%
      424   18170999   +18.41%        2%                68%           70%
      458   18144868   +18.24%        2%                68%           70%
      467   18142366   +18.22%        2%                68%           70%
      498   18154549   +18.30%        1%                68%           69%
      511   18134525   +18.17%        1%                69%           70%
      
      Broadwell-EX: similar pattern as Skylake-EX.
      
      batch   score     change   zone_contention   lru_contention   total_contention
       31   16703983    +0.00%       67%                 7%           74%
       53   18195393    +8.93%       43%                28%           71%
       63   18288885    +9.49%       38%                33%           71%
       73   18344329    +9.82%       35%                37%           72%
      119   18535529   +10.96%       24%                46%           70%
      127   18513596   +10.83%       23%                48%           71%
      137   18514327   +10.84%       23%                48%           71%
      165   18511840   +10.82%       22%                49%           71%
      188   18593478   +11.31%       17%                53%           70%
      223   18601667   +11.36%       17%                52%           69%
      255   18774825   +12.40%       12%                58%           70%
      267   18754781   +12.28%        9%                60%           69%
      299   18892265   +13.10%        7%                63%           70%
      320   18873812   +12.99%        8%                62%           70%
      393   18891174   +13.09%        6%                64%           70%
      424   18975108   +13.60%        6%                64%           70%
      458   18932364   +13.34%        8%                62%           70%
      467   18960891   +13.51%        5%                65%           70%
      498   18944526   +13.41%        5%                64%           69%
      511   18960839   +13.51%        5%                64%           69%
      
      Skylake-EP: although an increased batch reduced zone->lock contention,
      the effect is not as good as on EX: zone->lock contention is still as
      high as 20% with a very high batch value, instead of 1% on Skylake-EX or
      5% on Broadwell-EX.  Also, total_contention actually decreased with a
      higher batch, but that doesn't translate into a performance increase.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   9554867    +0.00%       66%                 3%           69%
       53   9855486    +3.15%       63%                 3%           66%
       63   9980145    +4.45%       62%                 4%           66%
       73   10092774   +5.63%       62%                 5%           67%
      119   10310061   +7.90%       45%                19%           64%
      127   10342019   +8.24%       42%                19%           61%
      137   10358182   +8.41%       42%                21%           63%
      165   10397060   +8.81%       37%                24%           61%
      188   10341808   +8.24%       34%                26%           60%
      223   10349135   +8.31%       31%                27%           58%
      255   10327189   +8.08%       28%                29%           57%
      267   10344204   +8.26%       27%                29%           56%
      299   10325043   +8.06%       25%                30%           55%
      320   10310325   +7.91%       25%                31%           56%
      393   10293274   +7.73%       21%                31%           52%
      424   10311099   +7.91%       21%                32%           53%
      458   10321375   +8.02%       21%                32%           53%
      467   10303881   +7.84%       21%                32%           53%
      498   10332462   +8.14%       20%                33%           53%
      511   10325016   +8.06%       20%                32%           52%
      
      Broadwell-EP: whatever contention the larger batch takes off zone->lock
      moves straight to the LRU lock, so total contention stays at about 70%
      and performance does not improve.
      
      batch   score    change   zone_contention   lru_contention   total_contention
       31   10121178   +0.00%       19%                50%           69%
       53   10142366   +0.21%        6%                63%           69%
       63   10117984   -0.03%       11%                58%           69%
       73   10123330   +0.02%        7%                63%           70%
      119   10108791   -0.12%        2%                67%           69%
      127   10166074   +0.44%        3%                66%           69%
      137   10141574   +0.20%        3%                66%           69%
      165   10154499   +0.33%        2%                68%           70%
      188   10124921   +0.04%        2%                67%           69%
      223   10137399   +0.16%        2%                67%           69%
      255   10143289   +0.22%        0%                68%           68%
      267   10123535   +0.02%        1%                68%           69%
      299   10140952   +0.20%        0%                68%           68%
      320   10163170   +0.41%        0%                68%           68%
      393   10000633   -1.19%        0%                69%           69%
      424   10087998   -0.33%        0%                69%           69%
      458   10187116   +0.65%        0%                69%           69%
      467   10146790   +0.25%        0%                69%           69%
      498   10197958   +0.76%        0%                69%           69%
      511   10152326   +0.31%        0%                69%           69%
      
      Haswell-EP: similar to Broadwell-EP.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   10442205   +0.00%       14%                48%           62%
       53   10442255   +0.00%        5%                57%           62%
       63   10452059   +0.09%        6%                57%           63%
       73   10482349   +0.38%        5%                59%           64%
      119   10454644   +0.12%        3%                60%           63%
      127   10431514   -0.10%        3%                59%           62%
      137   10423785   -0.18%        3%                60%           63%
      165   10481216   +0.37%        2%                61%           63%
      188   10448755   +0.06%        2%                61%           63%
      223   10467144   +0.24%        2%                61%           63%
      255   10480215   +0.36%        2%                61%           63%
      267   10484279   +0.40%        2%                61%           63%
      299   10466450   +0.23%        2%                61%           63%
      320   10452578   +0.10%        2%                61%           63%
      393   10499678   +0.55%        1%                62%           63%
      424   10481454   +0.38%        1%                62%           63%
      458   10473562   +0.30%        1%                62%           63%
      467   10484269   +0.40%        0%                62%           62%
      498   10505599   +0.61%        0%                62%           62%
      511   10483395   +0.39%        0%                62%           62%
      
      Westmere-EP: contention is small enough to be uninteresting.  Note that
      too high a batch value can actually hurt performance.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   4831523   +0.00%        2%                 3%            5%
       53   4834086   +0.05%        2%                 4%            6%
       63   4834262   +0.06%        2%                 3%            5%
       73   48328518   +0.03%        2%                 4%            6%
      119   4830534   -0.02%        1%                 3%            4%
      127   4827461   -0.08%        1%                 4%            5%
      137   4827459   -0.08%        1%                 3%            4%
      165   4820534   -0.23%        0%                 4%            4%
      188   4817947   -0.28%        0%                 3%            3%
      223   4809671   -0.45%        0%                 3%            3%
      255   4802463   -0.60%        0%                 4%            4%
      267   4801634   -0.62%        0%                 3%            3%
      299   4798047   -0.69%        0%                 3%            3%
      320   4793084   -0.80%        0%                 3%            3%
      393   4785877   -0.94%        0%                 3%            3%
      424   4782911   -1.01%        0%                 3%            3%
      458   4779346   -1.08%        0%                 3%            3%
      467   4780306   -1.06%        0%                 3%            3%
      498   4780589   -1.05%        0%                 3%            3%
      511   4773724   -1.20%        0%                 3%            3%
      
      Skylake-Desktop: similar to Westmere-EP, nothing interesting.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3906608   +0.00%        2%                 3%            5%
       53   3940164   +0.86%        2%                 3%            5%
       63   3937289   +0.79%        2%                 3%            5%
       73   3940201   +0.86%        2%                 3%            5%
      119   3933240   +0.68%        2%                 3%            5%
      127   3930514   +0.61%        2%                 4%            6%
      137   3938639   +0.82%        0%                 3%            3%
      165   3908755   +0.05%        0%                 3%            3%
      188   3905621   -0.03%        0%                 3%            3%
      223   3903015   -0.09%        0%                 4%            4%
      255   3889480   -0.44%        0%                 3%            3%
      267   3891669   -0.38%        0%                 4%            4%
      299   3898728   -0.20%        0%                 4%            4%
      320   3894547   -0.31%        0%                 4%            4%
      393   3875137   -0.81%        0%                 4%            4%
      424   3874521   -0.82%        0%                 3%            3%
      458   3880432   -0.67%        0%                 4%            4%
      467   3888715   -0.46%        0%                 3%            3%
      498   3888633   -0.46%        0%                 4%            4%
      511   3875305   -0.80%        0%                 5%            5%
      
      Haswell-Desktop: zone->lock contention is as low as on the other
      desktops, though LRU contention is somewhat higher than on them.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   3511158   +0.00%        2%                 5%            7%
       53   3555445   +1.26%        2%                 6%            8%
       63   3561082   +1.42%        2%                 6%            8%
       73   3547218   +1.03%        2%                 6%            8%
      119   3571319   +1.71%        1%                 7%            8%
      127   3549375   +1.09%        0%                 6%            6%
      137   3560233   +1.40%        0%                 6%            6%
      165   3555176   +1.25%        2%                 6%            8%
      188   3551501   +1.15%        0%                 8%            8%
      223   3531462   +0.58%        0%                 7%            7%
      255   3570400   +1.69%        0%                 7%            7%
      267   3532235   +0.60%        1%                 8%            9%
      299   3562326   +1.46%        0%                 6%            6%
      320   3553569   +1.21%        0%                 8%            8%
      393   3539519   +0.81%        0%                 7%            7%
      424   3549271   +1.09%        0%                 8%            8%
      458   3528885   +0.50%        0%                 8%            8%
      467   3526554   +0.44%        0%                 7%            7%
      498   3525302   +0.40%        0%                 9%            9%
      511   3527556   +0.47%        0%                 8%            8%
      
      Sandybridge-Desktop: the 0% contention figures are not exact; they come
      from dropping the fractional part.  Since the individual contention
      paths here are each under 1%, summing them can leave a deviation of up
      to about 3%.  For example, three paths of 0.9% each all display as 0%
      while actually contributing almost 3% in total.
      
      batch   score   change   zone_contention   lru_contention   total_contention
       31   1744495   +0.00%        0%                 0%            0%
       53   1755341   +0.62%        0%                 0%            0%
       63   1758469   +0.80%        0%                 0%            0%
       73   1759626   +0.87%        0%                 0%            0%
      119   1770417   +1.49%        0%                 0%            0%
      127   1768252   +1.36%        0%                 0%            0%
      137   1767848   +1.34%        0%                 0%            0%
      165   1765088   +1.18%        0%                 0%            0%
      188   1766918   +1.29%        0%                 0%            0%
      223   1767866   +1.34%        0%                 0%            0%
      255   1768074   +1.35%        0%                 0%            0%
      267   1763187   +1.07%        0%                 0%            0%
      299   1765620   +1.21%        0%                 0%            0%
      320   1767603   +1.32%        0%                 0%            0%
      393   1764612   +1.15%        0%                 0%            0%
      424   1758476   +0.80%        0%                 0%            0%
      458   1758593   +0.81%        0%                 0%            0%
      467   1757915   +0.77%        0%                 0%            0%
      498   1753363   +0.51%        0%                 0%            0%
      511   1755548   +0.63%        0%                 0%            0%
      
      Phase two test results:
      Note: all percentage changes are against the base (batch=31).
      
      ebizzy.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    2410037±7%     2600451±2% +7.9%     2602878 +8.0%
      lkp-bdw-ex1     1493328        1489243    -0.3%     1492145 -0.1%
      lkp-skl-2sp2    1329674        1345891    +1.2%     1351056 +1.6%
      lkp-bdw-ep2      711511         711511     0.0%      710708 -0.1%
      lkp-wsm-ep2       75750          75528    -0.3%       75441 -0.4%
      lkp-skl-d01      264126         262791    -0.5%      264113 +0.0%
      lkp-hsw-d01      176601         176328    -0.2%      176368 -0.1%
      lkp-sb02          98937          98937    +0.0%       99030 +0.1%
      
      kbuild.buildtime (less is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     107.00        107.67  +0.6%        107.11  +0.1%
      lkp-bdw-ex1       97.33         97.33  +0.0%         97.42  +0.1%
      lkp-skl-2sp2     180.00        179.83  -0.1%        179.83  -0.1%
      lkp-bdw-ep2      178.17        179.17  +0.6%        177.50  -0.4%
      lkp-wsm-ep2      737.00        738.00  +0.1%        738.00  +0.1%
      lkp-skl-d01      642.00        653.00  +1.7%        653.00  +1.7%
      lkp-hsw-d01     1310.00       1316.00  +0.5%       1311.00  +0.1%
      
      netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     948790        947144  -0.2%        948333 -0.0%
      lkp-bdw-ex1      904224        904366  +0.0%        904926 +0.1%
      lkp-skl-2sp2     239731        239607  -0.1%        239565 -0.1%
      lkp-bdw-ep2      365764        365933  +0.0%        365951 +0.1%
      lkp-wsm-ep2       93736         93803  +0.1%         93808 +0.1%
      lkp-skl-d01       77314         77303  -0.0%         77375 +0.1%
      lkp-hsw-d01       58617         60387  +3.0%         60208 +2.7%
      lkp-sb02          29990         30137  +0.5%         30103 +0.4%
      
      oltp.transactions (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-bdw-ex1      9073276       9100377     +0.3%    9036344     -0.4%
      lkp-skl-2sp2     8898717       8852054     -0.5%    8894459     -0.0%
      lkp-bdw-ep2     13426155      13384654     -0.3%   13333637     -0.7%
      lkp-hsw-ep2     13146314      13232784     +0.7%   13193163     +0.4%
      lkp-wsm-ep2      5035355       5019348     -0.3%    5033418     -0.0%
      lkp-skl-d01       418485       4413339     -0.1%    4419039     +0.0%
      lkp-hsw-d01      3517817±5%    3396120±3%  -3.5%    3455138±3%  -1.8%
      
      pigz.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.513e+08     1.507e+08 -0.4%      1.511e+08 -0.2%
      lkp-bdw-ex1     2.060e+08     2.052e+08 -0.4%      2.044e+08 -0.8%
      lkp-skl-2sp2    8.836e+08     8.845e+08 +0.1%      8.836e+08 -0.0%
      lkp-bdw-ep2     8.275e+08     8.464e+08 +2.3%      8.330e+08 +0.7%
      lkp-wsm-ep2     2.224e+08     2.221e+08 -0.2%      2.218e+08 -0.3%
      lkp-skl-d01     1.177e+08     1.177e+08 -0.0%      1.176e+08 -0.1%
      lkp-hsw-d01     1.154e+08     1.154e+08 +0.1%      1.154e+08 -0.0%
      lkp-sb02        0.633e+08     0.633e+08 +0.1%      0.633e+08 +0.0%
      
      will-it-scale.malloc1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1      620181       620484 +0.0%         620240 +0.0%
      lkp-bdw-ex1      1403610      1401201 -0.2%        1417900 +1.0%
      lkp-skl-2sp2     1288097      1284145 -0.3%        1283907 -0.3%
      lkp-bdw-ep2      1427879      1427675 -0.0%        1428266 +0.0%
      lkp-hsw-ep2      1362546      1353965 -0.6%        1354759 -0.6%
      lkp-wsm-ep2      2099657      2107576 +0.4%        2100226 +0.0%
      lkp-skl-d01      1476835      1476358 -0.0%        1474487 -0.2%
      lkp-hsw-d01      1308810      1303429 -0.4%        1301299 -0.6%
      lkp-sb02          589286       589284 -0.0%         588101 -0.2%
      
      will-it-scale.malloc1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     21289         21125     -0.8%      21241     -0.2%
      lkp-bdw-ex1      28114         28089     -0.1%      28007     -0.4%
      lkp-skl-2sp2     91866         91946     +0.1%      92723     +0.9%
      lkp-bdw-ep2      37637         37501     -0.4%      37317     -0.9%
      lkp-hsw-ep2      43673         43590     -0.2%      43754     +0.2%
      lkp-wsm-ep2      28577         28298     -1.0%      28545     -0.1%
      lkp-skl-d01     175277        173343     -1.1%     173082     -1.3%
      lkp-hsw-d01     130303        129566     -0.6%     129250     -0.8%
      lkp-sb02        113742±3%     116911     +2.8%     116417±3%  +2.4%
      
      will-it-scale.malloc2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.206e+09     1.206e+09 -0.0%      1.206e+09 +0.0%
      lkp-bdw-ex1     1.319e+09     1.319e+09 -0.0%      1.319e+09 +0.0%
      lkp-skl-2sp2    8.000e+08     8.021e+08 +0.3%      7.995e+08 -0.1%
      lkp-bdw-ep2     6.582e+08     6.634e+08 +0.8%      6.513e+08 -1.1%
      lkp-hsw-ep2     6.671e+08     6.669e+08 -0.0%      6.665e+08 -0.1%
      lkp-wsm-ep2     1.805e+08     1.806e+08 +0.0%      1.804e+08 -0.1%
      lkp-skl-d01     1.611e+08     1.611e+08 -0.0%      1.610e+08 -0.0%
      lkp-hsw-d01     1.333e+08     1.332e+08 -0.0%      1.332e+08 -0.0%
      lkp-sb02         82485104      82478206 -0.0%       82473546 -0.0%
      
      will-it-scale.malloc2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.574e+09     1.574e+09 -0.0%      1.574e+09 -0.0%
      lkp-bdw-ex1     1.737e+09     1.737e+09 +0.0%      1.737e+09 -0.0%
      lkp-skl-2sp2    9.161e+08     9.162e+08 +0.0%      9.181e+08 +0.2%
      lkp-bdw-ep2     7.856e+08     8.015e+08 +2.0%      8.113e+08 +3.3%
      lkp-hsw-ep2     6.908e+08     6.904e+08 -0.1%      6.907e+08 -0.0%
      lkp-wsm-ep2     2.409e+08     2.409e+08 +0.0%      2.409e+08 -0.0%
      lkp-skl-d01     1.199e+08     1.199e+08 -0.0%      1.199e+08 -0.0%
      lkp-hsw-d01     1.029e+08     1.029e+08 -0.0%      1.029e+08 +0.0%
      lkp-sb02         68081213      68061423 -0.0%       68076037 -0.0%
      
      will-it-scale.page_fault2.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    14509125±4%   16472364 +13.5%       17123117 +18.0%
      lkp-bdw-ex1     14736381      16196588  +9.9%       16364011 +11.0%
      lkp-skl-2sp2     6354925       6435444  +1.3%        6436644  +1.3%
      lkp-bdw-ep2      8749584       8834422  +1.0%        8827179  +0.9%
      lkp-hsw-ep2      8762591       8845920  +1.0%        8825697  +0.7%
      lkp-wsm-ep2      3036083       3030428  -0.2%        3021741  -0.5%
      lkp-skl-d01      2307834       2304731  -0.1%        2286142  -0.9%
      lkp-hsw-d01      1806237       1800786  -0.3%        1795943  -0.6%
      lkp-sb02          842616        837844  -0.6%         833921  -1.0%
      
      will-it-scale.page_fault2.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1623294       1615132±2% -0.5%     1656777    +2.1%
      lkp-bdw-ex1      1995714       2025948    +1.5%     2113753±3% +5.9%
      lkp-skl-2sp2     2346708       2415591    +2.9%     2416919    +3.0%
      lkp-bdw-ep2      2342564       2344882    +0.1%     2300206    -1.8%
      lkp-hsw-ep2      1820658       1831681    +0.6%     1844057    +1.3%
      lkp-wsm-ep2      1725482       1733774    +0.5%     1740517    +0.9%
      lkp-skl-d01      1832833       1823628    -0.5%     1806489    -1.4%
      lkp-hsw-d01      1427913       1427287    -0.0%     1420226    -0.5%
      lkp-sb02          750626        748615    -0.3%      746621    -0.5%
      
      will-it-scale.page_fault3.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    24382726      24400317 +0.1%       24668774 +1.2%
      lkp-bdw-ex1     35399750      35683124 +0.8%       35829492 +1.2%
      lkp-skl-2sp2    28136820      28068248 -0.2%       28147989 +0.0%
      lkp-bdw-ep2     37269077      37459490 +0.5%       37373073 +0.3%
      lkp-hsw-ep2     36224967      36114085 -0.3%       36104908 -0.3%
      lkp-wsm-ep2     16820457      16911005 +0.5%       16968596 +0.9%
      lkp-skl-d01      7721138       7725904 +0.1%        7756740 +0.5%
      lkp-hsw-d01      7611979       7650928 +0.5%        7651323 +0.5%
      lkp-sb02         3781546       3796502 +0.4%        3796827 +0.4%
      
      will-it-scale.page_fault3.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1865820±3%   1900917±2%  +1.9%     1826245±4%  -2.1%
      lkp-bdw-ex1      3094060      3148326     +1.8%     3150036     +1.8%
      lkp-skl-2sp2     3952940      3953898     +0.0%     3989360     +0.9%
      lkp-bdw-ep2      3420373±3%   3643964     +6.5%     3644910±5%  +6.6%
      lkp-hsw-ep2      2609635±2%   2582310±3%  -1.0%     2780459     +6.5%
      lkp-wsm-ep2      4395001      4417196     +0.5%     4432499     +0.9%
      lkp-skl-d01      5363977      5400003     +0.7%     5411370     +0.9%
      lkp-hsw-d01      5274131      5311294     +0.7%     5319359     +0.9%
      lkp-sb02         2917314      2913004     -0.1%     2935286     +0.6%
      
      will-it-scale.read1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    73762279±14%  69322519±10% -6.0%    69349855±13%  -6.0% (result unstable)
      lkp-bdw-ex1     1.701e+08     1.704e+08    +0.1%    1.705e+08     +0.2%
      lkp-skl-2sp2    63111570      63113953     +0.0%    63836573      +1.1%
      lkp-bdw-ep2     79247409      79424610     +0.2%    78012656      -1.6%
      lkp-hsw-ep2     67677026      68308800     +0.9%    67539106      -0.2%
      lkp-wsm-ep2     13339630      13939817     +4.5%    13766865      +3.2%
      lkp-skl-d01     10969487      10972650     +0.0%    no data
      lkp-hsw-d01     9857342±2%    10080592±2%  +2.3%    10131560      +2.8%
      lkp-sb02        5189076        5197473     +0.2%    5163253       -0.5%
      
      will-it-scale.read1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    62468045±12%  73666726±7% +17.9%    79553123±12% +27.4% (result unstable)
      lkp-bdw-ex1     1.62e+08      1.624e+08    +0.3%    1.614e+08     -0.3%
      lkp-skl-2sp2    58319780      59181032     +1.5%    59821353      +2.6%
      lkp-bdw-ep2     74057992      75698171     +2.2%    74990869      +1.3%
      lkp-hsw-ep2     63672959      63639652     -0.1%    64387051      +1.1%
      lkp-wsm-ep2     13489943      13526058     +0.3%    13259032      -1.7%
      lkp-skl-d01     10297906      10338796     +0.4%    10407328      +1.1%
      lkp-hsw-d01      9636721       9667376     +0.3%     9341147      -3.1%
      lkp-sb02         4801938       4804496     +0.1%     4802290      +0.0%
      
      will-it-scale.write1.processes (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.111e+08     1.104e+08±2%  -0.7%   1.122e+08±2%  +1.0%
      lkp-bdw-ex1     1.392e+08     1.399e+08     +0.5%   1.397e+08     +0.4%
      lkp-skl-2sp2     59369233      58994841     -0.6%    58715168     -1.1%
      lkp-bdw-ep2      61820979      CPU throttle          63593123     +2.9%
      lkp-hsw-ep2      57897587      57435605     -0.8%    56347450     -2.7%
      lkp-wsm-ep2       7814203       7918017±2%  +1.3%     7669068     -1.9%
      lkp-skl-d01       8886557       8971422     +1.0%     8818366     -0.8%
      lkp-hsw-d01       9171001±5%    9189915     +0.2%     9483909     +3.4%
      lkp-sb02          4475406       4475294     -0.0%     4501756     +0.6%
      
      will-it-scale.write1.threads (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    1.058e+08     1.055e+08±2%  -0.2%   1.065e+08  +0.7%
      lkp-bdw-ex1     1.316e+08     1.300e+08     -1.2%   1.308e+08  -0.6%
      lkp-skl-2sp2     54492421      56086678     +2.9%    55975657  +2.7%
      lkp-bdw-ep2      59360449      59003957     -0.6%    58101262  -2.1%
      lkp-hsw-ep2      53346346±2%   52530876     -1.5%    52902487  -0.8%
      lkp-wsm-ep2       7774006       7800092±2%  +0.3%     7558833  -2.8%
      lkp-skl-d01       8346174       8235695     -1.3%     no data
      lkp-hsw-d01       8636244       8655731     +0.2%     8658868  +0.3%
      lkp-sb02          4181820       4204107     +0.5%     4182992  +0.0%
      
      vm-scalability.anon-r-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    11933873±3%   12356544±2%  +3.5%   12188624     +2.1%
      lkp-bdw-ex1      7114424±2%    7330949±2%  +3.0%    7392419     +3.9%
      lkp-skl-2sp2     6773277±5%    6492332±8%  -4.1%    6543962     -3.4%
      lkp-bdw-ep2      7133846±4%    7233508     +1.4%    7013518±3%  -1.7%
      lkp-hsw-ep2      4576626       4527098     -1.1%    4551679     -0.5%
      lkp-wsm-ep2      2583599       2592492     +0.3%    2588039     +0.2%
      lkp-hsw-d01       998199±2%    1028311     +3.0%    1006460±2%  +0.8%
      lkp-sb02          570572        567854     -0.5%     568449     -0.4%
      
      vm-scalability.anon-r-rand-mt.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1789419       1787830     -0.1%    1788208     -0.1%
      lkp-bdw-ex1      3492595±2%    3554966±2%  +1.8%    3558835±3%  +1.9%
      lkp-skl-2sp2     3856238±2%    3975403±4%  +3.1%    3994600     +3.6%
      lkp-bdw-ep2      3726963±11%   3809292±6%  +2.2%    3871924±4%  +3.9%
      lkp-hsw-ep2      2131760±3%    2033578±4%  -4.6%    2130727±6%  -0.0%
      lkp-wsm-ep2      2369731       2368384     -0.1%    2370252     +0.0%
      lkp-skl-d01      1207128       1206220     -0.1%    1205801     -0.1%
      lkp-hsw-d01       964317        992329±2%  +2.9%     992099±2%  +2.9%
      lkp-sb02          567137        567346     +0.0%     566144     -0.2%
      
      vm-scalability.lru-file-mmap-read.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1    19560469±6%   23018999     +17.7%   23418800     +19.7%
      lkp-bdw-ex1     17769135±14%  26141676±3%  +47.1%   26284723±5%  +47.9%
      lkp-skl-2sp2    14056512      13578884      -3.4%   13146214      -6.5%
      lkp-bdw-ep2     15336542      14737654      -3.9%   14088159      -8.1%
      lkp-hsw-ep2     16275498      15756296      -3.2%   15018090      -7.7%
      lkp-wsm-ep2     11272160      11237231      -0.3%   11310047      +0.3%
      lkp-skl-d01      7322119       7324569      +0.0%    7184148      -1.9%
      lkp-hsw-d01      6449234       6404542      -0.7%    6356141      -1.4%
      lkp-sb02         3517943       3520668      +0.1%    3527309      +0.3%
      
      vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
      
      machine         batch=31      batch=63             batch=127
      lkp-skl-4sp1     1689052       1697553  +0.5%       1698726  +0.6%
      lkp-bdw-ex1      1675246       1699764  +1.5%       1712226  +2.2%
      lkp-skl-2sp2     1800533       1799749  -0.0%       1800581  +0.0%
      lkp-bdw-ep2      1807422       1807758  +0.0%       1804932  -0.1%
      lkp-hsw-ep2      1809807       1808781  -0.1%       1807811  -0.1%
      lkp-wsm-ep2      1800198       1802434  +0.1%       1801236  +0.1%
      lkp-skl-d01       696689        695537  -0.2%        694106  -0.4%
      lkp-hsw-d01       698364        698666  +0.0%        696686  -0.2%
      lkp-sb02          258939        258787  -0.1%        258199  -0.3%
      
      Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
      Signed-off-by: NAaron Lu <aaron.lu@intel.com>
      Suggested-by: NDave Hansen <dave.hansen@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Kemi Wang <kemi.wang@intel.com>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d8a759b5
    • M
      mm: drop VM_BUG_ON from __get_free_pages · 9ea9a680
      Committed by Michal Hocko
      There is no real reason to blow up just because the caller doesn't know
      that __get_free_pages cannot return highmem pages.  Simply fix that up
      silently.  Even if we have some confused users, such a fixup will not
      be harmful.
      
      [akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
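      
      A minimal sketch of what the fixup amounts to, based on the changelog
      and the note above rather than the verbatim upstream diff: instead of
      asserting that __GFP_HIGHMEM is absent, mask the flag off before
      allocating, since the function has to return a lowmem virtual address
      anyway.
      
        unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
        {
                struct page *page;
      
                /* Never return a highmem page here: mask the flag off
                 * instead of tripping a VM_BUG_ON() for confused callers. */
                page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
                if (!page)
                        return 0;
                return (unsigned long) page_address(page);
        }
      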
      Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Jiankang Chen <chenjiankang1@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yisheng Xie <xieyisheng1@huawei.com>
      Cc: Hanjun Guo <guohanjun@huawei.com>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9ea9a680
    • V
      mm, page_alloc: actually ignore mempolicies for high priority allocations · d6a24df0
      Committed by Vlastimil Babka
      __alloc_pages_slowpath() has for a long time contained code to ignore
      node restrictions from memory policies for high priority allocations.
      The current code that resets the zonelist iterator however does
      effectively nothing after commit 7810e678 ("mm, page_alloc: do not
      break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
      Even before that commit, mempolicy restrictions were still not ignored,
      as they are passed in ac->nodemask which is untouched by the code.
      
      We can either remove the code, or make it work as intended.  Since
      ac->nodemask can be set from task's mempolicy via alloc_pages_current()
      and thus also alloc_pages(), it may indeed affect kernel allocations,
      and it makes sense to ignore it to allow progress for high priority
      allocations.
      
      Thus, this patch resets ac->nodemask to NULL in such cases.  This
      assumes all callers can handle it (i.e.  there are no guarantees as in
      the case of __GFP_THISNODE) which seems to be the case.  The same
      assumption is already present in check_retry_cpuset() for some time.
      
      The expected effect is that high priority kernel allocations in the
      context of userspace tasks (e.g.  OOM victims) restricted by mempolicies
      will have higher chance to succeed if they are restricted to nodes with
      depleted memory, while there are other nodes with free memory left.
      
      It's not a new intention, but for the first time the code will match the
      intention, AFAICS.  It was intended by commit 183f6371 ("mm: ignore
      mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
      really worked, as mempolicy restriction was already encoded in nodemask,
      not zonelist, at that time.
      
      So originally that was for ALLOC_NO_WATERMARK only.  Then it was
      adjusted by e46e7b77 ("mm, page_alloc: recalculate the preferred
      zoneref if the context can ignore memory policies") and cd04ae1e
      ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
      current state.  So even GFP_ATOMIC would now ignore mempolicies after
      the initial attempts fail - if the code worked as people thought it
      does.
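      
      For reference, a sketch of the reset this patch adds (flag, field and
      helper names are assumed from the mm code of that era, so treat this as
      an illustration rather than the exact upstream hunk):
      
        /* __alloc_pages_slowpath(): once memory policies may be ignored,
         * also drop the nodemask and recompute the starting zoneref. */
        if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
                ac->nodemask = NULL;
                ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
        }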
      
      Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d6a24df0
    • P
      mm: skip invalid pages block at a time in zero_resv_unresv() · 720e14eb
      Committed by Pavel Tatashin
      The role of zero_resv_unavail() is to make sure that every struct page
      that is allocated but not backed by memory accessible to the kernel is
      zeroed and not left in some uninitialized state.
      
      Since struct pages are allocated in blocks (2M pages in the x86 case),
      we can skip pageblock_nr_pages at a time once the first page of a block
      is found to be invalid.
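      
      A sketch of the resulting inner loop (helper names such as PFN_DOWN(),
      ALIGN_DOWN() and mm_zero_struct_page() are assumed from that era's mm
      code; treat this as an illustration rather than the exact hunk):
      
        for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
                /* If the block's first page is invalid, jump to its end. */
                if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
                        pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
                                + pageblock_nr_pages - 1;
                        continue;
                }
                mm_zero_struct_page(pfn_to_page(pfn));
                pgcnt++;
        }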
      
      This optimization may help since now on x86 every hole in e820 maps is
      marked as reserved in memblock, and thus will go through this function.
      
      This function is called before sched_clock() is initialized, so I used
      my x86 early boot clock patches to measure the performance improvement.
      
      With a 1T hole on an i7-8700, this currently takes 0.606918s of boot
      time; with this optimization it takes 0.001103s.
      
      Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Huang, Ying" <ying.huang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      720e14eb
  12. 06 Aug 2018: 2 commits
    • P
      PM / reboot: Eliminate race between reboot and suspend · 55f2503c
      Committed by Pingfan Liu
      At present, "systemctl suspend" and "shutdown" can run in parallel. A
      system can suspend after devices_shutdown() and then resume, after
      which the shutdown task goes on to power off. As a result, many devices
      are not really shut off. Hence, replace reboot_mutex with
      system_transition_mutex (renamed from pm_mutex) to achieve the
      exclusion. Renaming pm_mutex to system_transition_mutex also better
      reflects the purpose of the mutex.
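      
      Conceptually the change looks like the following (call sites abridged;
      only the lock name and the serialization idea are taken from the
      changelog above):
      
        /* kernel/reboot.c, abridged: the shutdown path now serializes on
         * the same mutex that the suspend/hibernate code holds, instead
         * of its own reboot_mutex, so the two transitions cannot overlap. */
        mutex_lock(&system_transition_mutex);
        /* ... kernel_power_off() -> devices_shutdown() -> machine power-off ... */
        mutex_unlock(&system_transition_mutex);
      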
      Signed-off-by: NPingfan Liu <kernelfans@gmail.com>
      Acked-by: NPavel Machek <pavel@ucw.cz>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      55f2503c
    • D
      mm: Allow non-direct-map arguments to free_reserved_area() · 0d834328
      Committed by Dave Hansen
      free_reserved_area() takes pointers as arguments to show which addresses
      should be freed.  However, it does this in a somewhat ambiguous way.  If it
      gets a kernel direct map address, it always works.  However, if it gets an
      address that is part of the kernel image alias mapping, it can fail.
      
      It fails if all of the following happen:
       * The specified address is part of the kernel image alias
       * Poisoning is requested (forcing a memset())
       * The address is in a read-only portion of the kernel image
      
      The memset() fails on the read-only mapping, of course.
      free_reserved_area() *is* called both on the direct map and on kernel image
      alias addresses.  We've just lucked out thus far that the kernel image
      alias areas it gets used on are read-write.  I'm fairly sure this has been
      just a happy accident.
      
      It is quite easy to make free_reserved_area() work for all cases: just
      convert the address to a direct map address before doing the memset(), and
      do this unconditionally.  There is little chance of a regression here
      because we previously did a virt_to_page() on the address for the memset,
      so we know these are not highmem pages for which virt_to_page() would fail.
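      
      A sketch of the per-page loop after the conversion (variable names are
      illustrative and the function is abridged; the point is that the
      memset() always targets the direct map alias):
      
        for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
                struct page *page = virt_to_page(pos);
                /* 'pos' may be a kernel-image alias; page_address() gives
                 * the writable direct map address for the same page. */
                void *direct_map_addr = page_address(page);
      
                if ((unsigned int)poison <= 0xFF)
                        memset(direct_map_addr, poison, PAGE_SIZE);
      
                free_reserved_page(page);
        }
      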
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: keescook@google.com
      Cc: aarcange@redhat.com
      Cc: jgross@suse.com
      Cc: jpoimboe@redhat.com
      Cc: gregkh@linuxfoundation.org
      Cc: peterz@infradead.org
      Cc: hughd@google.com
      Cc: torvalds@linux-foundation.org
      Cc: bp@alien8.de
      Cc: luto@kernel.org
      Cc: ak@linux.intel.com
      Cc: Kees Cook <keescook@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Andi Kleen <ak@linux.intel.com>
      Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
      0d834328
  13. 17 Jul 2018: 1 commit
  14. 15 Jul 2018: 1 commit
    • P
      mm: zero unavailable pages before memmap init · e181ae0c
      Committed by Pavel Tatashin
      We must zero struct pages for memory that is not backed by physical
      memory, or that the kernel does not have access to.
      
      Recently, there was a change which zeroed all memmap for all holes in
      e820.  Unfortunately, it introduced a bug that is discussed here:
      
        https://www.spinics.net/lists/linux-mm/msg156764.html
      
      Linus, also saw this bug on his machine, and confirmed that reverting
      commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into
      memblock.reserved") fixes the issue.
      
      The problem is that we incorrectly zero some struct pages after they
      have been set up.
      
      The fix is to zero the unavailable struct pages prior to memmap
      initialization.
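      
      In outline, the ordering change looks like this (a heavily abridged
      sketch of free_area_init_nodes() based on the description above, with
      helper names assumed from that era's page_alloc.c, not the literal
      diff):
      
        zero_resv_unavail();            /* zero the holes first ...          */
        for_each_online_node(nid)       /* ... then init the real memmap     */
                free_area_init_node(nid, NULL,
                                    find_min_pfn_for_node(nid), NULL);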
      
      A more detailed fix should come later that would avoid double zeroing
      cases: one in __init_single_page(), the other one in
      zero_resv_unavail().
      
      Fixes: 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
      Signed-off-by: NPavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e181ae0c
  15. 15 Jun 2018: 1 commit
  16. 08 Jun 2018: 2 commits
    • V
      mm, page_alloc: do not break __GFP_THISNODE by zonelist reset · 7810e678
      Committed by Vlastimil Babka
      In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
      allocations that can ignore memory policies.  The zonelist is obtained
      from current CPU's node.  This is a problem for __GFP_THISNODE
      allocations that want to allocate on a different node, e.g.  because the
      allocating thread has been migrated to a different CPU.
      
      This has been observed to break SLAB in our 4.4-based kernel, because
      there it relies on __GFP_THISNODE working as intended.  If a slab page
      is put on wrong node's list, then further list manipulations may corrupt
      the list because page_to_nid() is used to determine which node's
      list_lock should be locked and thus we may take a wrong lock and race.
      
      Current SLAB implementation seems to be immune by luck thanks to commit
      511e3a05 ("mm/slab: make cache_grow() handle the page allocated on
      arbitrary node") but there may be others assuming that __GFP_THISNODE
      works as promised.
      
      We can fix it by simply removing the zonelist reset completely.  There
      is actually no reason to reset it, because memory policies and cpusets
      don't affect the zonelist choice in the first place.  This was different
      when commit 183f6371 ("mm: ignore mempolicies when using
      ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
      own restricted zonelists.
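      
      For context, the reset being removed looked roughly like this (an
      abridged sketch with field and helper names assumed from that era's
      __alloc_pages_slowpath(); the patch drops the node_zonelist() line and
      keeps only the zoneref recalculation):
      
        if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
                /* Rebuilding the zonelist from numa_node_id() ignores the
                 * node a __GFP_THISNODE caller explicitly asked for. */
                ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
                ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                        ac->high_zoneidx, ac->nodemask);
        }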
      
      We might consider this for 4.17 although I don't know if there's
      anything currently broken.
      
      SLAB is currently not affected, but in kernels older than 4.7 that don't
      yet have 511e3a05 ("mm/slab: make cache_grow() handle the page
      allocated on arbitrary node") it is.  That's at least 4.4 LTS.  Older
      ones I'll have to check.
      
      So stable backports should be more important, but will have to be
      reviewed carefully, as the code went through many changes.  BTW I think
      that also the ac->preferred_zoneref reset is currently useless if we
      don't also reset ac->nodemask from a mempolicy to NULL first (which we
      probably should for the OOM victims etc?), but I would leave that for a
      separate patch.
      
      Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Fixes: 183f6371 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7810e678
    • M
      mm: combine LRU and main union in struct page · 4da1984e
      Committed by Matthew Wilcox
      This gives us five words of space in a single union in struct page.  The
      compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
      systems, but that does not seem likely to cause any trouble.
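      
      For orientation, a simplified sketch of the shape this gives struct
      page (field list trimmed to the page-cache/anonymous member; an
      illustration of the five-word union, not the full 4.18 definition):
      
        struct page {
                unsigned long flags;
                union {                         /* five words per member */
                        struct {                /* page cache / anonymous */
                                struct list_head lru;   /* two words */
                                struct address_space *mapping;
                                pgoff_t index;
                                unsigned long private;
                        };
                        /* ... slab, compound-tail and the other users of
                         * the same five words live here ... */
                };
                /* mapcount/refcount fields follow */
        };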
      
      Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
      Signed-off-by: NMatthew Wilcox <mawilcox@microsoft.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4da1984e