1. 15 10月, 2019 1 次提交
    • D
      mm, hugetlb: allow hugepage allocations to reclaim as needed · 3f36d866
      David Rientjes 提交于
      Commit b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when
      compaction may not succeed") has chnaged the allocator to bail out from
      the allocator early to prevent from a potentially excessive memory
      reclaim.  __GFP_RETRY_MAYFAIL is designed to retry the allocation,
      reclaim and compaction loop as long as there is a reasonable chance to
      make forward progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
      the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
      
      The most obvious affected subsystem is hugetlbfs which allocates huge
      pages based on an admin request (or via admin configured overcommit).  I
      have done a simple test which tries to allocate half of the memory for
      hugetlb pages while the memory is full of a clean page cache.  This is
      not an unusual situation because we try to cache as much of the memory
      as possible and sysctl/sysfs interface to allocate huge pages is there
      for flexibility to allocate hugetlb pages at any time.
      
      System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
      after the memory is prefilled by a clean page cache:
      
        root@test1:~# cat hugetlb_test.sh
      
        set -x
        echo 0 > /proc/sys/vm/nr_hugepages
        echo 3 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
        dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
        TS=$(date +%s)
        echo 256 > /proc/sys/vm/nr_hugepages
        cat /proc/sys/vm/nr_hugepages
      
      The results for 2 consecutive runs on clean 5.3
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
        + date +%s
        + TS=1569905284
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
        + date +%s
        + TS=1569905311
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
      
      Now with b39d0ee2 applied
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
        + date +%s
        + TS=1569905516
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        11
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
        + date +%s
        + TS=1569905541
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        12
      
      The success rate went down by factor of 20!
      
      Although hugetlb allocation requests might fail and it is reasonable to
      expect them to under extremely fragmented memory or when the memory is
      under a heavy pressure but the above situation is not that case.
      
      Fix the regression by reverting back to the previous behavior for
      __GFP_RETRY_MAYFAIL requests and disable the beail out heuristic for
      those requests.
      
      Mike said:
      
      : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
      : boot where this may not be as much of an issue.  However, I am aware of at
      : least three use cases where allocations are made after the system has been
      : up and running for quite some time:
      :
      : - DB reconfiguration.  If sysctl/sysfs fails to get required number of
      :   huge pages, system is rebooted to perform allocation after boot.
      :
      : - VM provisioning.  If unable get required number of huge pages, fall
      :   back to base pages.
      :
      : - An application that does not preallocate pool, but rather allocates
      :   pages at fault time for optimal NUMA locality.
      :
      : In all cases, I would expect b39d0ee2 to cause regressions and
      : noticable behavior changes.
      :
      : My quick/limited testing in
      : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
      : was insufficient.  It was also mentioned that if something like
      : b39d0ee2 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
      : requests as in this patch.
      
      [mhocko@suse.com: reworded changelog]
      Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f36d866
  2. 08 10月, 2019 1 次提交
    • Q
      mm/page_alloc.c: fix a crash in free_pages_prepare() · 234fdce8
      Qian Cai 提交于
      On architectures like s390, arch_free_page() could mark the page unused
      (set_page_unused()) and any access later would trigger a kernel panic.
      Fix it by moving arch_free_page() after all possible accessing calls.
      
       Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
       Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
                  R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
       Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
                  000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
                  00000000275cca48 0000000000000100 0000000000000008 000003d080010000
                  00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
       Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
                  0000000026c2b962: e32014000036 pfd 2,1024(%r1)
                 #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
                 >0000000026c2b96e: 41101100  la %r1,256(%r1)
                  0000000026c2b972: a737fff8  brctg %r3,26c2b962
                  0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
                  0000000026c2b97c: e31003400004 lg %r1,832
                  0000000026c2b982: ebff1430016a asi 5168(%r1),-1
       Call Trace:
       __free_pages_ok+0x16a/0x5d8)
       memblock_free_all+0x206/0x290
       mem_init+0x58/0x120
       start_kernel+0x2b0/0x570
       startup_continue+0x6a/0xc0
       INFO: lockdep is turned off.
       Last Breaking-Event-Address:
       __free_pages_ok+0x372/0x5d8
       Kernel panic - not syncing: Fatal exception: panic_on_oops
       00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C
      
      In the past, only kernel_poison_pages() would trigger this but it needs
      "page_poison=on" kernel cmdline, and I suspect nobody tested that on
      s390.  Recently, kernel_init_free_pages() (commit 6471384a ("mm:
      security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
      was added and could trigger this as well.
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
      Fixes: 8823b1db ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      234fdce8
  3. 29 9月, 2019 1 次提交
    • D
      mm, page_alloc: avoid expensive reclaim when compaction may not succeed · b39d0ee2
      David Rientjes 提交于
      Memory compaction has a couple significant drawbacks as the allocation
      order increases, specifically:
      
       - isolate_freepages() is responsible for finding free pages to use as
         migration targets and is implemented as a linear scan of memory
         starting at the end of a zone,
      
       - failing order-0 watermark checks in memory compaction does not account
         for how far below the watermarks the zone actually is: to enable
         migration, there must be *some* free memory available.  Per the above,
         watermarks are not always suffficient if isolate_freepages() cannot
         find the free memory but it could require hundreds of MBs of reclaim to
         even reach this threshold (read: potentially very expensive reclaim with
         no indication compaction can be successful), and
      
       - if compaction at this order has failed recently so that it does not even
         run as a result of deferred compaction, looping through reclaim can often
         be pointless.
      
      For hugepage allocations, these are quite substantial drawbacks because
      these are very high order allocations (order-9 on x86) and falling back to
      doing reclaim can potentially be *very* expensive without any indication
      that compaction would even be successful.
      
      Reclaim itself is unlikely to free entire pageblocks and certainly no
      reliance should be put on it to do so in isolation (recall lumpy reclaim).
      This means we should avoid reclaim and simply fail hugepage allocation if
      compaction is deferred.
      
      It is also not helpful to thrash a zone by doing excessive reclaim if
      compaction may not be able to access that memory.  If order-0 watermarks
      fail and the allocation order is sufficiently large, it is likely better
      to fail the allocation rather than thrashing the zone.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b39d0ee2
  4. 25 9月, 2019 4 次提交
  5. 03 9月, 2019 1 次提交
    • M
      sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming 提交于
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domains with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE). The idea being that it's expensive to balance
      across domains that far apart.
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes on different sockets -- 2 hops apart.
      
      For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
      (CPUs 32-39) like so,
      
        $ numactl -C 0-7,32-39 ./spinner 16
      
      causes all threads to fork and remain on node 0 until the active
      balancer kicks in after a few seconds and forcibly moves some threads
      to node 4.
      
      Override node_reclaim_distance for AMD Zen.
      Signed-off-by: NMatt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.ukSigned-off-by: NIngo Molnar <mingo@kernel.org>
      a55c7454
  6. 25 8月, 2019 1 次提交
    • D
      mm, page_alloc: move_freepages should not examine struct page of reserved memory · cd961038
      David Rientjes 提交于
      After commit 907ec5fc ("mm: zero remaining unavailable struct
      pages"), struct page of reserved memory is zeroed.  This causes
      page->flags to be 0 and fixes issues related to reading
      /proc/kpageflags, for example, of reserved memory.
      
      The VM_BUG_ON() in move_freepages_block(), however, assumes that
      page_zone() is meaningful even for reserved memory.  That assumption is
      no longer true after the aforementioned commit.
      
      There's no reason why move_freepages_block() should be testing the
      legitimacy of page_zone() for reserved memory; its scope is limited only
      to pages on the zone's freelist.
      
      Note that pfn_valid() can be true for reserved memory: there is a
      backing struct page.  The check for page_to_nid(page) is also buggy but
      reserved memory normally only appears on node 0 so the zeroing doesn't
      affect this.
      
      Move the debug checks to after verifying PageBuddy is true.  This
      isolates the scope of the checks to only be for buddy pages which are on
      the zone's freelist which move_freepages_block() is operating on.  In
      this case, an incorrect node or zone is a bug worthy of being warned
      about (and the examination of struct page is acceptable bcause this
      memory is not reserved).
      
      Why does move_freepages_block() gets called on reserved memory? It's
      simply math after finding a valid free page from the per-zone free area
      to use as fallback.  We find the beginning and end of the pageblock of
      the valid page and that can bring us into memory that was reserved per
      the e820.  pfn_valid() is still true (it's backed by a struct page), but
      since it's zero'd we shouldn't make any inferences here about comparing
      its node or zone.  The current node check just happens to succeed most
      of the time by luck because reserved memory typically appears on node 0.
      
      The fix here is to validate that we actually have buddy pages before
      testing if there's any type of zone or node strangeness going on.
      
      We noticed it almost immediately after bringing 907ec5fc in on
      CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
      the per-zone free area where the math in move_freepages() will bring the
      start or end pfn into reserved memory and wanting to claim that entire
      pageblock as a new migratetype.  So the path will be rare, require
      CONFIG_DEBUG_VM, and require fallback to a different migratetype.
      
      Some struct pages were already zeroed from reserve pages before
      907ec5fca3c so it theoretically could trigger before this commit.  I
      think it's rare enough under a config option that most people don't run
      that others may not have noticed.  I wouldn't argue against a stable tag
      and the backport should be easy enough, but probably wouldn't single out
      a commit that this is fixing.
      
      Mel said:
      
      : The overhead of the debugging check is higher with this patch although
      : it'll only affect debug builds and the path is not particularly hot.
      : If this was a concern, I think it would be reasonable to simply remove
      : the debugging check as the zone boundaries are checked in
      : move_freepages_block and we never expect a zone/node to be smaller than
      : a pageblock and stuck in the middle of another zone.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.comSigned-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cd961038
  7. 20 8月, 2019 1 次提交
  8. 19 7月, 2019 4 次提交
    • D
      mm/sparsemem: support sub-section hotplug · ba72b4c8
      Dan Williams 提交于
      The libnvdimm sub-system has suffered a series of hacks and broken
      workarounds for the memory-hotplug implementation's awkward
      section-aligned (128MB) granularity.
      
      For example the following backtrace is emitted when attempting
      arch_add_memory() with physical address ranges that intersect 'System
      RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
      
          # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
          100000000-1ffffffff : System RAM
          200000000-303ffffff : Persistent Memory (legacy)
          304000000-43fffffff : System RAM
          440000000-23ffffffff : Persistent Memory
          2400000000-43bfffffff : Persistent Memory
            2400000000-43bfffffff : namespace2.0
      
          WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
          [..]
          RIP: 0010:add_pages+0x5c/0x60
          [..]
          Call Trace:
           devm_memremap_pages+0x460/0x6e0
           pmem_attach_disk+0x29e/0x680 [nd_pmem]
           ? nd_dax_probe+0xfc/0x120 [libnvdimm]
           nvdimm_bus_probe+0x66/0x160 [libnvdimm]
      
      It was discovered that the problem goes beyond RAM vs PMEM collisions as
      some platform produce PMEM vs PMEM collisions within a given section.
      The libnvdimm workaround for that case revealed that the libnvdimm
      section-alignment-padding implementation has been broken for a long
      while.
      
      A fix for that long-standing breakage introduces as many problems as it
      solves as it would require a backward-incompatible change to the
      namespace metadata interpretation.  Instead of that dubious route [1],
      address the root problem in the memory-hotplug implementation.
      
      Note that EEXIST is no longer treated as success as that is how
      sparse_add_section() reports subsection collisions, it was also obviated
      by recent changes to perform the request_region() for 'System RAM'
      before arch_add_memory() in the add_memory() sequence.
      
      [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
      
      [osalvador@suse.de: fix deactivate_section for early sections]
        Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
      Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NOscar Salvador <osalvador@suse.de>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ba72b4c8
    • D
      mm: kill is_dev_zone() helper · 46d945ae
      Dan Williams 提交于
      Given there are no more usages of is_dev_zone() outside of 'ifdef
      CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.
      
      Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NWei Yang <richardw.yang@linux.intel.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46d945ae
    • D
      mm/sparsemem: add helpers track active portions of a section at boot · f46edbd1
      Dan Williams 提交于
      Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
      sub-section active bitmask, each bit representing a PMD_SIZE span of the
      architecture's memory hotplug section size.
      
      The implications of a partially populated section is that pfn_valid()
      needs to go beyond a valid_section() check and either determine that the
      section is an "early section", or read the sub-section active ranges
      from the bitmask.  The expectation is that the bitmask (subsection_map)
      fits in the same cacheline as the valid_section() / early_section()
      data, so the incremental performance overhead to pfn_valid() should be
      negligible.
      
      The rationale for using early_section() to short-ciruit the
      subsection_map check is that there are legacy code paths that use
      pfn_valid() at section granularity before validating the pfn against
      pgdat data.  So, the early_section() check allows those traditional
      assumptions to persist while also permitting subsection_map to tell the
      truth for purposes of populating the unused portions of early sections
      with PMEM and other ZONE_DEVICE mappings.
      
      Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: NQian Cai <cai@lca.pw>
      Tested-by: NJane Chu <jane.chu@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f46edbd1
    • D
      mm/sparsemem: introduce struct mem_section_usage · f1eca35a
      Dan Williams 提交于
      Patch series "mm: Sub-section memory hotplug support", v10.
      
      The memory hotplug section is an arbitrary / convenient unit for memory
      hotplug.  'Section-size' units have bled into the user interface
      ('memblock' sysfs) and can not be changed without breaking existing
      userspace.  The section-size constraint, while mostly benign for typical
      memory hotplug, has and continues to wreak havoc with 'device-memory'
      use cases, persistent memory (pmem) in particular.  Recall that pmem
      uses devm_memremap_pages(), and subsequently arch_add_memory(), to
      allocate a 'struct page' memmap for pmem.  However, it does not use the
      'bottom half' of memory hotplug, i.e.  never marks pmem pages online and
      never exposes the userspace memblock interface for pmem.  This leaves an
      opening to redress the section-size constraint.
      
      To date, the libnvdimm subsystem has attempted to inject padding to
      satisfy the internal constraints of arch_add_memory().  Beyond
      complicating the code, leading to bugs [2], wasting memory, and limiting
      configuration flexibility, the padding hack is broken when the platform
      changes this physical memory alignment of pmem from one boot to the
      next.  Device failure (intermittent or permanent) and physical
      reconfiguration are events that can cause the platform firmware to
      change the physical placement of pmem on a subsequent boot, and device
      failure is an everyday event in a data-center.
      
      It turns out that sections are only a hard requirement of the
      user-facing interface for memory hotplug and with a bit more
      infrastructure sub-section arch_add_memory() support can be added for
      kernel internal usages like devm_memremap_pages().  Here is an analysis
      of the current design assumptions in the current code and how they are
      addressed in the new implementation:
      
      Current design assumptions:
      
       - Sections that describe boot memory (early sections) are never
         unplugged / removed.
      
       - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y, case devolves to a
         valid_section() check
      
       - __add_pages() and helper routines assume all operations occur in
         PAGES_PER_SECTION units.
      
       - The memblock sysfs interface only comprehends full sections
      
      New design assumptions:
      
       - Sections are instrumented with a sub-section bitmask to track (on
         x86) individual 2MB sub-divisions of a 128MB section.
      
       - Partially populated early sections can be extended with additional
         sub-sections, and those sub-sections can be removed with
         arch_remove_memory(). With this in place we no longer lose usable
         memory capacity to padding.
      
       - pfn_valid() is updated to look deeper than valid_section() to also
         check the active-sub-section mask. This indication is in the same
         cacheline as the valid_section() so the performance impact is
         expected to be negligible. So far the lkp robot has not reported any
         regressions.
      
       - Outside of the core vmemmap population routines which are replaced,
         other helper routines like shrink_{zone,pgdat}_span() are updated to
         handle the smaller granularity. Core memory hotplug routines that
         deal with online memory are not touched.
      
       - The existing memblock sysfs user api guarantees / assumptions are not
         touched since this capability is limited to !online
         !memblock-sysfs-accessible sections.
      
      Meanwhile the issue reports continue to roll in from users that do not
      understand when and how the 128MB constraint will bite them.  The current
      implementation relied on being able to support at least one misaligned
      namespace, but that immediately falls over on any moderately complex
      namespace creation attempt.  Beyond the initial problem of 'System RAM'
      colliding with pmem, and the unsolvable problem of physical alignment
      changes, Linux is now being exposed to platforms that collide pmem ranges
      with other pmem ranges by default [3].  In short, devm_memremap_pages()
      has pushed the venerable section-size constraint past the breaking point,
      and the simplicity of section-aligned arch_add_memory() is no longer
      tenable.
      
      These patches are exposed to the kbuild robot on a subsection-v10 branch
      [4], and a preview of the unit test for this functionality is available
      on the 'subsection-pending' branch of ndctl [5].
      
      [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
      [3]: https://github.com/pmem/ndctl/issues/76
      [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
      [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
      
      This patch (of 13):
      
      Towards enabling memory hotplug to track partial population of a section,
      introduce 'struct mem_section_usage'.
      
      A pointer to a 'struct mem_section_usage' instance replaces the existing
      pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
      'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
      a new 'subsection_map' bitmap.  The new bitmap enables the memory
      hot{plug,remove} implementation to act on incremental sub-divisions of a
      section.
      
      SUBSECTION_SHIFT is defined as global constant instead of per-architecture
      value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
      subsection users.  Specifically a common subsection size allows for the
      possibility that persistent memory namespace configurations be made
      compatible across architectures.
      
      The primary motivation for this functionality is to support platforms that
      mix "System RAM" and "Persistent Memory" within a single section, or
      multiple PMEM ranges with different mapping lifetimes within a single
      section.  The section restriction for hotplug has caused an ongoing saga
      of hacks and bugs for devm_memremap_pages() users.
      
      Beyond the fixups to teach existing paths how to retrieve the 'usemap'
      from a section, and updates to usemap allocation path, there are no
      expected behavior changes.
      
      Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Reviewed-by: NWei Yang <richardw.yang@linux.intel.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f1eca35a
  9. 17 7月, 2019 1 次提交
  10. 13 7月, 2019 7 次提交
    • A
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko 提交于
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let the heap users to
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com>
      Acked-by: NKees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts
      Acked-by: James Morris <jamorris@linux.microsoft.com>]
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • N
      mm/large system hash: clear hashdist when only one node with memory is booted · e03a5125
      Nicholas Piggin 提交于
      CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
      when booting on single node machines.  This causes the large system hashes
      to be allocated with vmalloc, and mapped with small pages.
      
      This change clears hashdist if only one node has come up with memory.
      
      This results in the important large inode and dentry hashes using memblock
      allocations.  All others are within 4MB size up to about 128GB of RAM,
      which allows them to be allocated from the linear map on most non-NUMA
      images.
      
      Other big hashes like futex and TCP should eventually be moved over to the
      same style of allocation as those vfs caches that use HASH_EARLY if
      !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.
      
      This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
      ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
      (performance is in the noise, under 1% difference, page tables are likely
      to be well cached for this workload).
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e03a5125
    • N
      mm/large system hash: use vmalloc for size > MAX_ORDER when !hashdist · ec11408a
      Nicholas Piggin 提交于
      The kernel currently clamps large system hashes to MAX_ORDER when hashdist
      is not set, which is rather arbitrary.
      
      vmalloc space is limited on 32-bit machines, but this shouldn't result in
      much more used because of small physical memory limiting system hash
      sizes.
      
      Include "vmalloc" or "linear" in the kernel log message.
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.comSigned-off-by: NNicholas Piggin <npiggin@gmail.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec11408a
    • V
      mm, debug_pagealloc: use a page type instead of page_ext flag · 3972f6bb
      Vlastimil Babka 提交于
      When debug_pagealloc is enabled, we currently allocate the page_ext
      array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag.  Now that
      we have the page_type field in struct page, we can use that instead, as
      guard pages are neither PageSlab nor mapped to userspace.  This reduces
      memory overhead when debug_pagealloc is enabled and there are no other
      features requiring the page_ext array.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3972f6bb
    • V
      mm, page_alloc: more extensive free page checking with debug_pagealloc · 4462b32c
      Vlastimil Babka 提交于
      The page allocator checks struct pages for expected state (mapcount,
      flags etc) as pages are being allocated (check_new_page()) and freed
      (free_pages_check()) to provide some defense against errors in page
      allocator users.
      
      Prior commits 479f854a ("mm, page_alloc: defer debugging checks of
      pages allocated from the PCP") and 4db7548c ("mm, page_alloc: defer
      debugging checks of freed pages until a PCP drain") this has happened
      for order-0 pages as they were allocated from or freed to the per-cpu
      caches (pcplists).  Since those are fast paths, the checks are now
      performed only when pages are moved between pcplists and global free
      lists.  This however lowers the chances of catching errors soon enough.
      
      In order to increase the chances of the checks to catch errors, the
      kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
      multiple other internal debug checks (VM_BUG_ON() etc), which is
      suboptimal when the goal is to catch errors in mm users, not in mm code
      itself.
      
      To catch some wrong users of the page allocator we have
      CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
      unless enabled at boot time.  Memory corruptions when writing to freed
      pages have often the same underlying errors (use-after-free, double free)
      as corrupting the corresponding struct pages, so this existing debugging
      functionality is a good fit to extend by also perform struct page checks
      at least as often as if CONFIG_DEBUG_VM was enabled.
      
      Specifically, after this patch, when debug_pagealloc is enabled on boot,
      and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
      freed to the pcplists *in addition* to being moved between pcplists and
      free lists.  When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
      pages are checked when being moved between pcplists and free lists *in
      addition* to when allocated from or freed to the pcplists.
      
      When debug_pagealloc is not enabled on boot, the overhead in fast paths
      should be virtually none thanks to the use of static key.
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4462b32c
    • V
      mm, debug_pagelloc: use static keys to enable debugging · 96a2b03f
      Vlastimil Babka 提交于
      Patch series "debug_pagealloc improvements".
      
      I have been recently debugging some pcplist corruptions, where it would be
      useful to perform struct page checks immediately as pages are allocated
      from and freed to pcplists, which is now only possible by rebuilding the
      kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).
      
      To make this kind of debugging simpler in future on a distro kernel, I
      have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
      when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
      and extended it to perform the struct page checks more often when enabled
      (Patch 2).  Now it can be configured in when building a distro kernel
      without extra overhead, and debugging page use after free or double free
      can be enabled simply by rebooting with debug_pagealloc=on.
      
      This patch (of 3):
      
      CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc574
      ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
      allow being always enabled in a distro kernel, but only perform its
      expensive functionality when booted with debug_pagelloc=on.  We can
      further reduce the overhead when not boot-enabled (including page
      allocator fast paths) using static keys.  This patch introduces one for
      debug_pagealloc core functionality, and another for the optional guard
      page functionality (enabled by booting with debug_guardpage_minorder=X).
      
      Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96a2b03f
    • D
      mm: remove the exporting of totalram_pages · 98ef2046
      Denis Efremov 提交于
      Previously totalram_pages was the global variable.  Currently,
      totalram_pages is the static inline function from the include/linux/mm.h
      However, the function is also marked as EXPORT_SYMBOL, which is at best an
      odd combination.  Because there is no point for the static inline function
      from a public header to be exported, this commit removes the
      EXPORT_SYMBOL() marking.  It will be still possible to use the function in
      modules because all the symbols it depends on are exported.
      
      Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
      Fixes: ca79b0c2 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
      Signed-off-by: NDenis Efremov <efremov@linux.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      98ef2046
  11. 05 7月, 2019 1 次提交
  12. 03 7月, 2019 2 次提交
  13. 21 5月, 2019 1 次提交
  14. 15 5月, 2019 14 次提交
    • D
      mm: maintain randomization of page free lists · 97500a4a
      Dan Williams 提交于
      When freeing a page with an order >= shuffle_page_order randomly select
      the front or back of the list for insertion.
      
      While the mm tries to defragment physical pages into huge pages this can
      tend to make the page allocator more predictable over time.  Inject the
      front-back randomness to preserve the initial randomness established by
      shuffle_free_memory() when the kernel was booted.
      
      The overhead of this manipulation is constrained by only being applied
      for MAX_ORDER sized pages by default.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97500a4a
    • D
      mm: move buddy list manipulations into helpers · b03641af
      Dan Williams 提交于
      In preparation for runtime randomization of the zone lists, take all
      (well, most of) the list_*() functions in the buddy allocator and put
      them in helper functions.  Provide a common control point for injecting
      additional behavior when freeing pages.
      
      [dan.j.williams@intel.com: fix buddy list helpers]
        Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
      [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
        Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
      Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b03641af
    • D
      mm: shuffle initial free memory to improve memory-side-cache utilization · e900a918
      Dan Williams 提交于
      Patch series "mm: Randomize free memory", v10.
      
      This patch (of 3):
      
      Randomization of the page allocator improves the average utilization of
      a direct-mapped memory-side-cache.  Memory side caching is a platform
      capability that Linux has been previously exposed to in HPC
      (high-performance computing) environments on specialty platforms.  In
      that instance it was a smaller pool of high-bandwidth-memory relative to
      higher-capacity / lower-bandwidth DRAM.  Now, this capability is going
      to be found on general purpose server platforms where DRAM is a cache in
      front of higher latency persistent memory [1].
      
      Robert offered an explanation of the state of the art of Linux
      interactions with memory-side-caches [2], and I copy it here:
      
          It's been a problem in the HPC space:
          http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/
      
          A kernel module called zonesort is available to try to help:
          https://software.intel.com/en-us/articles/xeon-phi-software
      
          and this abandoned patch series proposed that for the kernel:
          https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com
      
          Dan's patch series doesn't attempt to ensure buffers won't conflict, but
          also reduces the chance that the buffers will. This will make performance
          more consistent, albeit slower than "optimal" (which is near impossible
          to attain in a general-purpose kernel).  That's better than forcing
          users to deploy remedies like:
              "To eliminate this gradual degradation, we have added a Stream
               measurement to the Node Health Check that follows each job;
               nodes are rebooted whenever their measured memory bandwidth
               falls below 300 GB/s."
      
      A replacement for zonesort was merged upstream in commit cc9aec03
      ("x86/numa_emulation: Introduce uniform split capability").  With this
      numa_emulation capability, memory can be split into cache sized
      ("near-memory" sized) numa nodes.  A bind operation to such a node, and
      disabling workloads on other nodes, enables full cache performance.
      However, once the workload exceeds the cache size then cache conflicts
      are unavoidable.  While HPC environments might be able to tolerate
      time-scheduling of cache sized workloads, for general purpose server
      platforms, the oversubscribed cache case will be the common case.
      
      The worst case scenario is that a server system owner benchmarks a
      workload at boot with an un-contended cache only to see that performance
      degrade over time, even below the average cache performance due to
      excessive conflicts.  Randomization clips the peaks and fills in the
      valleys of cache utilization to yield steady average performance.
      
      Here are some performance impact details of the patches:
      
      1/ An Intel internal synthetic memory bandwidth measurement tool, saw a
         3X speedup in a contrived case that tries to force cache conflicts.
         The contrived cased used the numa_emulation capability to force an
         instance of the benchmark to be run in two of the near-memory sized
         numa nodes.  If both instances were placed on the same emulated they
         would fit and cause zero conflicts.  While on separate emulated nodes
         without randomization they underutilized the cache and conflicted
         unnecessarily due to the in-order allocation per node.
      
      2/ A well known Java server application benchmark was run with a heap
         size that exceeded cache size by 3X.  The cache conflict rate was 8%
         for the first run and degraded to 21% after page allocator aging.  With
         randomization enabled the rate levelled out at 11%.
      
      3/ A MongoDB workload did not observe measurable difference in
         cache-conflict rates, but the overall throughput dropped by 7% with
         randomization in one case.
      
      4/ Mel Gorman ran his suite of performance workloads with randomization
         enabled on platforms without a memory-side-cache and saw a mix of some
         improvements and some losses [3].
      
      While there is potentially significant improvement for applications that
      depend on low latency access across a wide working-set, the performance
      may be negligible to negative for other workloads.  For this reason the
      shuffle capability defaults to off unless a direct-mapped
      memory-side-cache is detected.  Even then, the page_alloc.shuffle=0
      parameter can be specified to disable the randomization on those systems.
      
      Outside of memory-side-cache utilization concerns there is potentially
      security benefit from randomization.  Some data exfiltration and
      return-oriented-programming attacks rely on the ability to infer the
      location of sensitive data objects.  The kernel page allocator, especially
      early in system boot, has predictable first-in-first out behavior for
      physical pages.  Pages are freed in physical address order when first
      onlined.
      
      Quoting Kees:
          "While we already have a base-address randomization
           (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
           memory layouts would certainly be using the predictability of
           allocation ordering (i.e. for attacks where the base address isn't
           important: only the relative positions between allocated memory).
           This is common in lots of heap-style attacks. They try to gain
           control over ordering by spraying allocations, etc.
      
           I'd really like to see this because it gives us something similar
           to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."
      
      While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
      caches it leaves vast bulk of memory to be predictably in order allocated.
      However, it should be noted, the concrete security benefits are hard to
      quantify, and no known CVE is mitigated by this randomization.
      
      Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
      a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
      are initially populated with free memory at boot and at hotplug time.  Do
      this based on either the presence of a page_alloc.shuffle=Y command line
      parameter, or autodetection of a memory-side-cache (to be added in a
      follow-on patch).
      
      The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
      pages where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1 i.e.  10,
      4MB this trades off randomization granularity for time spent shuffling.
      MAX_ORDER-1 was chosen to be minimally invasive to the page allocator
      while still showing memory-side cache behavior improvements, and the
      expectation that the security implications of finer granularity
      randomization is mitigated by CONFIG_SLAB_FREELIST_RANDOM.  The
      performance impact of the shuffling appears to be in the noise compared to
      other memory initialization work.
      
      This initial randomization can be undone over time so a follow-on patch is
      introduced to inject entropy on page free decisions.  It is reasonable to
      ask if the page free entropy is sufficient, but it is not enough due to
      the in-order initial freeing of pages.  At the start of that process
      putting page1 in front or behind page0 still keeps them close together,
      page2 is still near page1 and has a high chance of being adjacent.  As
      more pages are added ordering diversity improves, but there is still high
      page locality for the low address pages and this leads to no significant
      impact to the cache conflict rate.
      
      [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
      [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
      [3]: https://lkml.org/lkml/2018/10/12/309
      
      [dan.j.williams@intel.com: fix shuffle enable]
        Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
      [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
        Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.comSigned-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NKees Cook <keescook@chromium.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Robert Elliott <elliott@hpe.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e900a918
    • B
      mm: update references to page _refcount · 136ac591
      Baruch Siach 提交于
      Commit 0139aa7b ("mm: rename _count, field of the struct page, to
      _refcount") left out a couple of references to the old field name.  Fix
      that.
      
      Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
      Fixes: 0139aa7b ("mm: rename _count, field of the struct page, to _refcount")
      Signed-off-by: NBaruch Siach <baruch@tkos.co.il>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      136ac591
    • M
      mm: memblock: make keeping memblock memory opt-in rather than opt-out · 350e88ba
      Mike Rapoport 提交于
      Most architectures do not need the memblock memory after the page
      allocator is initialized, but only few enable ARCH_DISCARD_MEMBLOCK in the
      arch Kconfig.
      
      Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
      logic makes it clear which architectures actually use memblock after
      system initialization and skips the necessity to add ARCH_DISCARD_MEMBLOCK
      to the architectures that are still missing that option.
      
      Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.comSigned-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc)
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Richard Kuo <rkuo@codeaurora.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Paul Burton <paul.burton@mips.com>
      Cc: James Hogan <jhogan@kernel.org>
      Cc: Ley Foon Tan <lftan@altera.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      350e88ba
    • Y
      mm/page_alloc.c: remove unnecessary parameter in rmqueue_pcplist · 1c52e6d0
      Yafang Shao 提交于
      Because rmqueue_pcplist() is only called when order is 0, we don't need to
      use order as a parameter.
      
      Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NYafang Shao <laoar.shao@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NPankaj Gupta <pagupta@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1c52e6d0
    • M
      mm, memory_hotplug: cleanup memory offline path · 5557c766
      Michal Hocko 提交于
      check_pages_isolated_cb currently accounts the whole pfn range as being
      offlined if test_pages_isolated suceeds on the range.  This is based on
      the assumption that all pages in the range are freed which is currently
      the case in most cases but it won't be with later changes, as pages marked
      as vmemmap won't be isolated.
      
      Move the offlined pages counting to offline_isolated_pages_cb and rely on
      __offline_isolated_pages to return the correct value.
      check_pages_isolated_cb will still do it's primary job and check the pfn
      range.
      
      While we are at it remove check_pages_isolated and offline_isolated_pages
      and use directly walk_system_ram_range as do in online_pages.
      
      Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.deReviewed-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NOscar Salvador <osalvador@suse.de>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5557c766
    • A
      mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections · 0e56acae
      Alexander Duyck 提交于
      Add yet another iterator, for_each_free_mem_range_in_zone_from, and then
      use it to support initializing and freeing pages in groups no larger than
      MAX_ORDER_NR_PAGES.  By doing this we can greatly improve the cache
      locality of the pages while we do several loops over them in the init and
      freeing process.
      
      We are able to tighten the loops further as a result of the "from"
      iterator as we can perform the initial checks for first_init_pfn in our
      first call to the iterator, and continue without the need for those checks
      via the "from" iterator.  I have added this functionality in the function
      called deferred_init_mem_pfn_range_in_zone that primes the iterator and
      causes us to exit if we encounter any failure.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 1.85s to 1.38s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: <yi.z.zhang@linux.intel.com>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0e56acae
    • A
      mm: implement new zone specific memblock iterator · 837566e7
      Alexander Duyck 提交于
      Introduce a new iterator for_each_free_mem_pfn_range_in_zone.
      
      This iterator will take care of making sure a given memory range provided
      is in fact contained within a zone.  It takes are of all the bounds
      checking we were doing in deferred_grow_zone, and deferred_init_memmap.
      In addition it should help to speed up the search a bit by iterating until
      the end of a range is greater than the start of the zone pfn range, and
      will exit completely if the start is beyond the end of the zone.
      
      Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      837566e7
    • A
      mm: drop meminit_pfn_in_nid as it is redundant · 56ec43d8
      Alexander Duyck 提交于
      As best as I can tell the meminit_pfn_in_nid call is completely redundant.
      The deferred memory initialization is already making use of
      for_each_free_mem_range which in turn will call into __next_mem_range
      which will only return a memory range if it matches the node ID provided
      assuming it is not NUMA_NO_NODE.
      
      I am operating on the assumption that there are no zones or pgdata_t
      structures that have a NUMA node of NUMA_NO_NODE associated with them.  If
      that is the case then __next_mem_range will never return a memory range
      that doesn't match the zone's node ID and as such the check is redundant.
      
      So one piece I would like to verify on this is if this works for ia64.
      Technically it was using a different approach to get the node ID, but it
      seems to have the node ID also encoded into the memblock.  So I am
      assuming this is okay, but would like to get confirmation on that.
      
      On my x86_64 test system with 384GB of memory per node I saw a reduction
      in initialization time from 2.80s to 1.85s as a result of this patch.
      
      Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdomainSigned-off-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <yi.z.zhang@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      56ec43d8
    • L
      mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE · 299c83dc
      Linxu Fang 提交于
      342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
      later patches rewrote the calculation of node spanned pages.
      
      e506b996 ("mem-hotplug: fix node spanned pages when we have a movable
      node"), but the current code still has problems,
      
      When we have a node with only zone_movable and the node id is not zero,
      the size of node spanned pages is double added.
      
      That's because we have an empty normal zone, and zone_start_pfn or
      zone_end_pfn is not between arch_zone_lowest_possible_pfn and
      arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
      range just like the commit <96e907d1> (bootmem: Reimplement
      __absent_pages_in_range() using for_each_mem_pfn_range()).
      
      e.g.
      Zone ranges:
        DMA      [mem 0x0000000000001000-0x0000000000ffffff]
        DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
        Normal   [mem 0x0000000100000000-0x000000023fffffff]
      Movable zone start for each node
        Node 0: 0x0000000100000000
        Node 1: 0x0000000140000000
      Early memory node ranges
        node   0: [mem 0x0000000000001000-0x000000000009efff]
        node   0: [mem 0x0000000000100000-0x00000000bffdffff]
        node   0: [mem 0x0000000100000000-0x000000013fffffff]
        node   1: [mem 0x0000000140000000-0x000000023fffffff]
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0x100000    present:0x100000	absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 2097152
      node_spanned_pages:2097152
      Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
      cma-reserved)
      
      It shows that the current memory of node 1 is double added.
      After this patch, the problem is fixed.
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0	    present:0		absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 1048576
      node_spanned_pages:1048576
      memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
      cma-reserved)
      
      Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.comSigned-off-by: NLinxu Fang <fanglinxu@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      299c83dc
    • A
      hugetlb: allow to free gigantic pages regardless of the configuration · 4eb0716e
      Alexandre Ghiti 提交于
      On systems without CONTIG_ALLOC activated but that support gigantic pages,
      boottime reserved gigantic pages can not be freed at all.  This patch
      simply enables the possibility to hand back those pages to memory
      allocator.
      
      Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.frSigned-off-by: NAlexandre Ghiti <alex@ghiti.fr>
      Acked-by: David S. Miller <davem@davemloft.net> [sparc]
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4eb0716e
    • A
      mm: simplify MEMORY_ISOLATION && COMPACTION || CMA into CONTIG_ALLOC · 8df995f6
      Alexandre Ghiti 提交于
      This condition allows to define alloc_contig_range, so simplify it into a
      more accurate naming.
      
      Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.frSigned-off-by: NAlexandre Ghiti <alex@ghiti.fr>
      Suggested-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andy Lutomirsky <luto@kernel.org>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H . Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8df995f6
    • V
      mm, page_alloc: disallow __GFP_COMP in alloc_pages_exact() · 63931eb9
      Vlastimil Babka 提交于
      alloc_pages_exact*() allocates a page of sufficient order and then splits
      it to return only the number of pages requested.  That makes it
      incompatible with __GFP_COMP, because compound pages cannot be split.
      
      As shown by [1] things may silently work until the requested size
      (possibly depending on user) stops being power of two.  Then for
      CONFIG_DEBUG_VM, BUG_ON() triggers in split_page().  Without
      CONFIG_DEBUG_VM, consequences are unclear.
      
      There are several options here, none of them great:
      
      1) Don't do the splitting when __GFP_COMP is passed, and return the
         whole compound page.  However if caller then returns it via
         free_pages_exact(), that will be unexpected and the freeing actions
         there will be wrong.
      
      2) Warn and remove __GFP_COMP from the flags.  But the caller may have
         really wanted it, so things may break later somewhere.
      
      3) Warn and return NULL.  However NULL may be unexpected, especially
         for small sizes.
      
      This patch picks option 2, because as Michal Hocko put it: "callers wanted
      it" is much less probable than "caller is simply confused and more gfp
      flags is surely better than fewer".
      
      [1] https://lore.kernel.org/lkml/20181126002805.GI18977@shao2-debian/T/#u
      
      Link: http://lkml.kernel.org/r/0c6393eb-b28d-4607-c386-862a71f09de6@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      63931eb9