1. 03 Apr, 2020 4 commits
  2. 04 Feb, 2020 5 commits
    • A
      mm/memmap_init: update variable name in memmap_init_zone · 1f8d75c1
      Aneesh Kumar K.V committed
      Patch series "mm/memory_hotplug: Shrink zones before removing memory", v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespaces using
      memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect the
      ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount of
      code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, especially, on memory unplug of DIMMs
      at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows zones to be properly shrunk on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  0
              present  0
              managed  0
      :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
      :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  98304
              present  65536
              managed  65536
      :/# echo 0 > /sys/devices/system/memory/memory43/online
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  32768
              present  32768
              managed  32768
      :/# echo 0 > /sys/devices/system/memory/memory41/online
      :/# cat /proc/zoneinfo
      Node 1, zone  Movable
              spanned  0
              present  0
              managed  0
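The spanned and present counts in this example can be reproduced with a small userspace model. This is a Python sketch, not kernel code; it assumes 32768 pages per memory block (128 MiB blocks with 4 KiB pages, which is consistent with the output above) and that the zone holds only the listed blocks. `zone_stats` is a hypothetical helper.

```python
PAGES_PER_BLOCK = 32768  # assumption: 128 MiB memory blocks, 4 KiB pages


def zone_stats(online_blocks):
    """Return (spanned, present) page counts for a set of online block ids."""
    if not online_blocks:
        return 0, 0
    first, last = min(online_blocks), max(online_blocks)
    # The span covers everything from the first to the last online block,
    # including any offline block (hole) in between.
    spanned = (last - first + 1) * PAGES_PER_BLOCK
    present = len(online_blocks) * PAGES_PER_BLOCK
    return spanned, present
```

With blocks 41 and 43 online the span includes the offline block 42, which is why spanned exceeds present; offlining block 43 shrinks the span back to a single block.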
      
      This patch (of 6):
      
      The third argument is actually the number of pages.  Change the variable name
      from size to nr_pages to indicate this better.
      
      No functional change in this patch.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-3-david@redhat.com
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pankaj Gupta <pagupta@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: factor out next_present_section_nr() · 4c605881
      David Hildenbrand committed
      Let's move it to the header and use the shorter variant from
      mm/page_alloc.c (the original one will also check
      "__highest_present_section_nr + 1", which is not necessary).  While at
      it, make the section_nr in next_pfn() const.
      
      In next_pfn(), we now return section_nr_to_pfn(-1) instead of -1 once we
      exceed __highest_present_section_nr, which doesn't make a difference in
      the caller as it is big enough (>= all sane end_pfn).
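The behaviour described above can be modelled in userspace. This is a Python sketch based only on the description in this message, not the kernel source; it assumes a 64-bit unsigned long and 2^15 pages per section, and it passes the set of present sections and the highest present section number as explicit parameters instead of kernel globals.

```python
ULONG_MAX = (1 << 64) - 1    # models (unsigned long)-1 on a 64-bit kernel
PFN_SECTION_SHIFT = 15       # assumption: 128 MiB sections with 4 KiB pages


def section_nr_to_pfn(section_nr):
    # Shift with 64-bit wraparound, as unsigned long arithmetic would.
    return (section_nr << PFN_SECTION_SHIFT) & ULONG_MAX


def next_present_section_nr(section_nr, present, highest_present):
    """Return the next present section number, or ULONG_MAX (i.e. -1)."""
    section_nr += 1
    while section_nr <= highest_present:
        if section_nr in present:
            return section_nr
        section_nr += 1
    return ULONG_MAX


def next_pfn(pfn, present, highest_present):
    """Advance to the next pfn that lies in a present section."""
    pfn += 1
    section_nr = pfn >> PFN_SECTION_SHIFT
    if section_nr in present:
        return pfn
    return section_nr_to_pfn(
        next_present_section_nr(section_nr, present, highest_present))
```

Once no present section remains, `next_pfn()` yields `section_nr_to_pfn(ULONG_MAX)`, which wraps to 0xFFFFFFFFFFFF8000 in this model and so still compares greater than any realistic end_pfn.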
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/page_alloc: fix and rework pfn handling in memmap_init_zone() · 948c436e
      David Hildenbrand committed
      Let's update the pfn manually whenever we continue the loop.  This makes
      the code easier to read but also less error prone (and we can directly fix
      one issue).
      
      When overlap_memmap_init() returns true, pfn is updated to
      "memblock_region_memory_end_pfn(r)".  So it already points at the *next*
      pfn to process.  Incrementing the pfn another time is wrong; we might
      leave one pfn uninitialized.  I spotted this by inspecting the code, so I have
      no idea if this is relevant in practice (with kernelcore=mirror).
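The off-by-one can be illustrated with a small loop model. This is a Python sketch, not the kernel loop; `init_range` and its parameters are hypothetical, and the skipped range `[skip_start, skip_end)` stands in for the region where overlap_memmap_init() returns true.

```python
def init_range(start_pfn, end_pfn, skip_start, skip_end, buggy):
    """Return the pfns initialized when [skip_start, skip_end) is skipped."""
    initialized = []
    pfn = start_pfn
    while pfn < end_pfn:
        if skip_start <= pfn < skip_end:
            # Models overlap_memmap_init() returning true: pfn now already
            # points at the next pfn to process.
            pfn = skip_end
            if buggy:
                pfn += 1  # the extra increment a "for (...; pfn++)" adds
            continue
        initialized.append(pfn)
        pfn += 1
    return initialized
```

With `buggy=True` the first pfn after the skipped range is never initialized; updating the pfn manually on every `continue` avoids that.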
      
      Link: http://lkml.kernel.org/r/20200113144035.10848-2-david@redhat.com
      Fixes: a9a9e77f ("mm: move mirrored memory specific code outside of memmap_init_zone")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/page_alloc.c: initialize memmap of unavailable memory directly · 4b094b78
      David Hildenbrand committed
      Let's make sure that all memory holes are actually marked PageReserved(),
      that page_to_pfn() produces reliable results, and that these pages are not
      detected as "mmap" pages due to the mapcount.
      
      E.g., booting an x86-64 QEMU guest with 4160 MB:
      
      [    0.010585] Early memory node ranges
      [    0.010586]   node   0: [mem 0x0000000000001000-0x000000000009efff]
      [    0.010588]   node   0: [mem 0x0000000000100000-0x00000000bffdefff]
      [    0.010589]   node   0: [mem 0x0000000100000000-0x0000000143ffffff]
      
      max_pfn is 0x144000.
      
      Before this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000000000800           16384       64  ___________M_______________________________        mmap
                   total           16384       64
      
      After this change:
      
      [root@localhost ~]# ./page-types -r -a 0x144000,
                   flags      page-count       MB  symbolic-flags                     long-symbolic-flags
      0x0000000100000000           16384       64  ___________________________r_______________        reserved
                   total           16384       64
      
      IOW, especially the unavailable physical memory ("memory hole") in the
      last section would not get properly marked PageReserved() and is indicated
      to be "mmap" memory.
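The size of the hole reported by page-types follows from rounding max_pfn up to the next section boundary. This is a Python sketch of that arithmetic, assuming 4 KiB pages and 128 MiB sections (2^15 pages); `hole_in_last_section` is a hypothetical helper.

```python
PAGE_SIZE = 4096                 # assumption: 4 KiB pages
PAGES_PER_SECTION = 1 << 15      # assumption: 128 MiB sections


def hole_in_last_section(max_pfn):
    """Return (first_pfn, nr_pages) of the hole between max_pfn and the
    end of the section that contains it."""
    # Round max_pfn up to the next multiple of PAGES_PER_SECTION.
    section_end = -(-max_pfn // PAGES_PER_SECTION) * PAGES_PER_SECTION
    return max_pfn, section_end - max_pfn


start, nr_pages = hole_in_last_section(0x144000)
print(hex(start), nr_pages, nr_pages * PAGE_SIZE // (1 << 20), "MB")
```

For max_pfn 0x144000 the section ends at 0x148000, giving 16384 pages (64 MB), the same count shown in the page-types output.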
      
      Drop the trace of that function from include/linux/mm.h - nobody else
      needs it, and rename it accordingly.
      
      Note: The fake zone/node might not be covered by the zone/node span.  This
      is not an urgent issue (for now, we had the same node/zone due to the
      zeroing).  We'll need a clean way to mark memory holes (e.g., using a page
      type PageHole() if possible or a fake ZONE_INVALID) and eventually stop
      marking these memory holes PageReserved().
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · e822969c
      David Hildenbrand committed
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for an x86-64 guest, I discovered that
      some memmaps (highest section if max_mem does not fall on the section
      boundary) are marked as being valid and online, but contain garbage.  We
      have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW, we can easily have this scenario
      (esp., one of the main reasons sub-section hotadd of devmem was added).
      
      The issue is that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much more easily via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zero-ing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 01 Feb, 2020 4 commits
    • Q
      mm/page_isolation: fix potential warning from user · 3d680bdf
      Qian Cai committed
      It makes sense to call the WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE)
      from start_isolate_page_range(), but we should avoid triggering it from
      userspace, i.e., from is_mem_section_removable(), because a non-root user
      could crash the system if panic_on_warn is set.
      
      While at it, simplify the code a bit by removing an unnecessary jump
      label.
      
      Link: http://lkml.kernel.org/r/20200120163915.1469-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Suggested-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Q
      mm/hotplug: silence a lockdep splat with printk() · 4a55c047
      Qian Cai committed
      It is not that hard to trigger lockdep splats by calling printk from
      under zone->lock.  Most of them are false positives caused by lock
      chains introduced early in the boot process and they do not cause any
      real problems (although most of the early boot lock dependencies could
      happen after boot as well).  There are some console drivers which do
      allocate from the printk context as well and those should be fixed.  In
      any case, false positives are not that trivial to work around and it is
      far from optimal to lose lockdep functionality for something that is a
      non-issue.
      
      So change has_unmovable_pages() so that it no longer calls dump_page()
      itself - instead it returns a "struct page *" of the unmovable page back
      to the caller so that in the case of a has_unmovable_pages() failure,
      the caller can call dump_page() after releasing zone->lock.  Also, make
      dump_page() able to report a CMA page as well, so the reason string
      from has_unmovable_pages() can be removed.
      
      Even though has_unmovable_pages doesn't hold any reference to the
      returned page this should be reasonably safe for the purpose of
      reporting the page (dump_page) because it cannot be hotremoved in the
      context of memory unplug.  The state of the page might change but that
      is the case even with the existing code as zone->lock only plays a role
      for free pages.
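The pattern described here (scan under the lock, report only after dropping it) can be sketched in userspace. This is a Python model, not the kernel code: `zone_lock` stands in for zone->lock, `isolate_range` and the page dictionaries are hypothetical, and `report` plays the role of dump_page().

```python
import threading

zone_lock = threading.Lock()  # stands in for zone->lock


def has_unmovable_pages(pages):
    """Return the first unmovable page, or None. Never prints anything."""
    for page in pages:
        if not page["movable"]:
            return page
    return None


def isolate_range(pages, report):
    """Scan under the lock, but call report() only after releasing it."""
    with zone_lock:
        unmovable = has_unmovable_pages(pages)
    if unmovable is not None:
        # Models calling dump_page() once zone->lock has been released;
        # the second argument records whether the lock is still held.
        report(unmovable, zone_lock.locked())
        return -16  # -EBUSY
    return 0
```

Because nothing is printed while the lock is held, the console locks are never taken underneath zone->lock in this model, which is the dependency the splat below complains about.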
      
      While at it, remove a similar but unnecessary debug-only printk() as
      well.  A sample of one of those lockdep splats is,
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        test.sh/8653 is trying to acquire lock:
        ffffffff865a4460 (console_owner){-.-.}, at:
        console_unlock+0x207/0x750
      
        but task is already holding lock:
        ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
        __offline_isolated_pages+0x179/0x3e0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #3 (&(&zone->lock)->rlock){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock+0x2f/0x40
               rmqueue_bulk.constprop.21+0xb6/0x1160
               get_page_from_freelist+0x898/0x22c0
               __alloc_pages_nodemask+0x2f3/0x1cd0
               alloc_pages_current+0x9c/0x110
               allocate_slab+0x4c6/0x19c0
               new_slab+0x46/0x70
               ___slab_alloc+0x58b/0x960
               __slab_alloc+0x43/0x70
               __kmalloc+0x3ad/0x4b0
               __tty_buffer_request_room+0x100/0x250
               tty_insert_flip_string_fixed_flag+0x67/0x110
               pty_write+0xa2/0xf0
               n_tty_write+0x36b/0x7b0
               tty_write+0x284/0x4c0
               __vfs_write+0x50/0xa0
               vfs_write+0x105/0x290
               redirected_tty_write+0x6a/0xc0
               do_iter_write+0x248/0x2a0
               vfs_writev+0x106/0x1e0
               do_writev+0xd4/0x180
               __x64_sys_writev+0x45/0x50
               do_syscall_64+0xcc/0x76c
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        -> #2 (&(&port->lock)->rlock){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock_irqsave+0x3a/0x50
               tty_port_tty_get+0x20/0x60
               tty_port_default_wakeup+0xf/0x30
               tty_port_tty_wakeup+0x39/0x40
               uart_write_wakeup+0x2a/0x40
               serial8250_tx_chars+0x22e/0x440
               serial8250_handle_irq.part.8+0x14a/0x170
               serial8250_default_handle_irq+0x5c/0x90
               serial8250_interrupt+0xa6/0x130
               __handle_irq_event_percpu+0x78/0x4f0
               handle_irq_event_percpu+0x70/0x100
               handle_irq_event+0x5a/0x8b
               handle_edge_irq+0x117/0x370
               do_IRQ+0x9e/0x1e0
               ret_from_intr+0x0/0x2a
               cpuidle_enter_state+0x156/0x8e0
               cpuidle_enter+0x41/0x70
               call_cpuidle+0x5e/0x90
               do_idle+0x333/0x370
               cpu_startup_entry+0x1d/0x1f
               start_secondary+0x290/0x330
               secondary_startup_64+0xb6/0xc0
      
        -> #1 (&port_lock_key){-.-.}:
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               _raw_spin_lock_irqsave+0x3a/0x50
               serial8250_console_write+0x3e4/0x450
               univ8250_console_write+0x4b/0x60
               console_unlock+0x501/0x750
               vprintk_emit+0x10d/0x340
               vprintk_default+0x1f/0x30
               vprintk_func+0x44/0xd4
               printk+0x9f/0xc5
      
        -> #0 (console_owner){-.-.}:
               check_prev_add+0x107/0xea0
               validate_chain+0x8fc/0x1200
               __lock_acquire+0x5b3/0xb40
               lock_acquire+0x126/0x280
               console_unlock+0x269/0x750
               vprintk_emit+0x10d/0x340
               vprintk_default+0x1f/0x30
               vprintk_func+0x44/0xd4
               printk+0x9f/0xc5
               __offline_isolated_pages.cold.52+0x2f/0x30a
               offline_isolated_pages_cb+0x17/0x30
               walk_system_ram_range+0xda/0x160
               __offline_pages+0x79c/0xa10
               offline_pages+0x11/0x20
               memory_subsys_offline+0x7e/0xc0
               device_offline+0xd5/0x110
               state_store+0xc6/0xe0
               dev_attr_store+0x3f/0x60
               sysfs_kf_write+0x89/0xb0
               kernfs_fop_write+0x188/0x240
               __vfs_write+0x50/0xa0
               vfs_write+0x105/0x290
               ksys_write+0xc6/0x160
               __x64_sys_write+0x43/0x50
               do_syscall_64+0xcc/0x76c
               entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
        other info that might help us debug this:
      
        Chain exists of:
          console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&(&zone->lock)->rlock);
                                       lock(&(&port->lock)->rlock);
                                       lock(&(&zone->lock)->rlock);
          lock(console_owner);
      
         *** DEADLOCK ***
      
        9 locks held by test.sh/8653:
         #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
        vfs_write+0x25f/0x290
         #1: ffff888277618880 (&of->mutex){+.+.}, at:
        kernfs_fop_write+0x128/0x240
         #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
        kernfs_fop_write+0x138/0x240
         #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
        lock_device_hotplug_sysfs+0x16/0x50
         #4: ffff8884374f4990 (&dev->mutex){....}, at:
        device_offline+0x70/0x110
         #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
        __offline_pages+0xbf/0xa10
         #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
        percpu_down_write+0x87/0x2f0
         #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
        __offline_isolated_pages+0x179/0x3e0
         #8: ffffffff865a4920 (console_lock){+.+.}, at:
        vprintk_emit+0x100/0x340
      
        stack backtrace:
        Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
        BIOS U34 05/21/2019
        Call Trace:
         dump_stack+0x86/0xca
         print_circular_bug.cold.31+0x243/0x26e
         check_noncircular+0x29e/0x2e0
         check_prev_add+0x107/0xea0
         validate_chain+0x8fc/0x1200
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         console_unlock+0x269/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5
         __offline_isolated_pages.cold.52+0x2f/0x30a
         offline_isolated_pages_cb+0x17/0x30
         walk_system_ram_range+0xda/0x160
         __offline_pages+0x79c/0xa10
         offline_pages+0x11/0x20
         memory_subsys_offline+0x7e/0xc0
         device_offline+0xd5/0x110
         state_store+0xc6/0xe0
         dev_attr_store+0x3f/0x60
         sysfs_kf_write+0x89/0xb0
         kernfs_fop_write+0x188/0x240
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         ksys_write+0xc6/0x160
         __x64_sys_write+0x43/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Link: http://lkml.kernel.org/r/20200117181200.20299-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • D
      mm: remove "count" parameter from has_unmovable_pages() · fe4c86c9
      David Hildenbrand committed
      Now that the memory isolate notifier is gone, the parameter is always 0.
      Drop it and cleanup has_unmovable_pages().
      
      Link: http://lkml.kernel.org/r/20191114131911.11783-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Pingfan Liu <kernelfans@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Arun KS <arunks@codeaurora.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • K
      mm/page_alloc: skip non present sections on zone initialization · 3f135355
      Kirill A. Shutemov committed
      memmap_init_zone() can be called on ranges with holes during
      boot.  It will skip any non-valid PFNs one-by-one.  It works fine as
      long as holes are not too big.
      
      But huge holes in the memory map cause a problem.  It takes over 20
      seconds to walk a 32TiB hole.  x86-64 with 5-level paging allows for much
      larger holes in the memory map which would practically hang the system.
      
      Deferred struct page init doesn't help here.  It only works on the
      present ranges.
      
      Skipping non-present sections would fix the issue.
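A rough count shows why skipping whole sections helps. This is a Python sketch of the arithmetic only, assuming 4 KiB pages and 128 MiB sections (2^15 pages per section); `hole_walk_cost` is a hypothetical helper and says nothing about actual timings.

```python
PAGE_SHIFT = 12          # assumption: 4 KiB pages
PFN_SECTION_SHIFT = 15   # assumption: 128 MiB sections


def hole_walk_cost(hole_bytes):
    """Return (pfn_checks, section_checks) needed to cross a hole."""
    pfns = hole_bytes >> PAGE_SHIFT          # one check per pfn
    sections = pfns >> PFN_SECTION_SHIFT     # one check per section
    return pfns, sections


pfns, sections = hole_walk_cost(32 << 40)  # the 32 TiB hole from the text
print(pfns, sections)
```

Under these assumptions, crossing the 32 TiB hole takes about 8.6 billion per-pfn checks but only 262144 per-section checks.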
      
      Link: http://lkml.kernel.org/r/20191230093828.24613-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Jin, Zhi" <zhi.jin@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 14 Jan, 2020 2 commits
    • V
      mm, debug_pagealloc: don't rely on static keys too early · 8e57f8ac
      Vlastimil Babka committed
      Commit 96a2b03f ("mm, debug_pagelloc: use static keys to enable
      debugging") has introduced a static key to reduce overhead when
      debug_pagealloc is compiled in but not enabled.  It relied on the
      assumption that jump_label_init() is called before parse_early_param()
      as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
      it is safe to enable the static key.
      
      However, it turns out multiple architectures call parse_early_param()
      earlier from their setup_arch().  x86 also calls jump_label_init() even
      earlier, so no issue was found while testing the commit, but the same is not
      true for, e.g., ppc64 and s390, where the kernel would not boot with
      debug_pagealloc=on as found by our QA.
      
      To fix this without tricky changes to init code of multiple
      architectures, this patch partially reverts the static key conversion
      from 96a2b03f.  Init-time and non-fastpath calls (such as in arch
      code) of debug_pagealloc_enabled() will again test a simple bool
      variable.  Fastpath mm code is converted to a new
      debug_pagealloc_enabled_static() variant that relies on the static key,
      which is enabled in a well-defined point in mm_init() where it's
      guaranteed that jump_label_init() has been called, regardless of
      architecture.
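The split between the early bool and the later static key can be modelled as follows. This is a Python sketch of the ordering only; the class and its method names are hypothetical stand-ins for the kernel functions named above, and the static key is reduced to a plain flag.

```python
class DebugPagealloc:
    def __init__(self):
        self.enabled_early = False      # plain bool, safe to set at any time
        self.jump_labels_ready = False  # set once jump_label_init() has run
        self.static_key = False         # models the static key

    def parse_early_param(self, value):
        # Only touches the bool, so it works before jump_label_init().
        self.enabled_early = (value == "on")

    def jump_label_init(self):
        self.jump_labels_ready = True

    def mm_init(self):
        # Well-defined point where flipping the static key is safe.
        assert self.jump_labels_ready
        if self.enabled_early:
            self.static_key = True

    def enabled(self):
        # Init-time and non-fastpath callers test the bool.
        return self.enabled_early

    def enabled_static(self):
        # Fastpath callers test the static key.
        return self.static_key
```

In this model the order of `parse_early_param()` and `jump_label_init()` no longer matters, because the key is only touched in `mm_init()`.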
      
      [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
        Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
      Fixes: 96a2b03f ("mm, debug_pagelloc: use static keys to enable debugging")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • V
      mm, thp: tweak reclaim/compaction effort of local-only and all-node allocations · cc638f32
      Vlastimil Babka committed
      THP page faults now attempt a __GFP_THISNODE allocation first, which
      should only compact existing free memory, followed by another attempt
      that can allocate from any node using reclaim/compaction effort
      specified by global defrag setting and madvise.
      
      This patch makes the following changes to the scheme:
      
       - Before the patch, the first allocation relies on a check for
         pageblock order and __GFP_IO to prevent excessive reclaim. This
         however affects also the second attempt, which is not limited to
         single node.
      
         Instead of that, reuse the existing check for costly order
         __GFP_NORETRY allocations, and make sure the first THP attempt uses
         __GFP_NORETRY. As a side-effect, all costly order __GFP_NORETRY
         allocations will bail out if compaction needs reclaim, while
         previously they only bailed out when compaction was deferred due to
         previous failures.
      
         This should be still acceptable within the __GFP_NORETRY semantics.
      
       - Before the patch, the second allocation attempt (on all nodes) was
         passing __GFP_NORETRY. This is redundant as the check for pageblock
         order (discussed above) was stronger. It's also contrary to
         madvise(MADV_HUGEPAGE) which means some effort to allocate THP is
         requested.
      
         After this patch, the second attempt doesn't pass __GFP_THISNODE nor
         __GFP_NORETRY.
      
      To sum up, THP page faults now try the following attempts:
      
      1. local node only THP allocation with no reclaim, just compaction.
      2. for madvised VMAs, or when synchronous compaction is always enabled -
         THP allocation from any node with effort determined by the global
         defrag setting and VMA madvise
      3. fallback to base pages on any node
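
      The three attempts above can be read as a small decision function.
      What follows is an illustrative Python model, not kernel code: the
      function name, its boolean inputs and the Outcome enum are all
      hypothetical, and the real fault path checks many more conditions.

```python
from enum import Enum, auto


class Outcome(Enum):
    THP_LOCAL = auto()      # attempt 1 succeeded
    THP_ANY_NODE = auto()   # attempt 2 succeeded
    BASE_PAGES = auto()     # attempt 3, fallback


def thp_fault_outcome(local_compaction_ok, vma_madvised,
                      defrag_always, any_node_ok):
    """Model the order in which a THP fault tries its allocations.

    All four arguments are hypothetical booleans standing in for what the
    allocator would discover at runtime.
    """
    # 1. Local node only: compaction of existing free memory, no reclaim.
    if local_compaction_ok:
        return Outcome.THP_LOCAL
    # 2. Any node, but only for madvised VMAs or when the global defrag
    #    setting asks for synchronous compaction always.
    if (vma_madvised or defrag_always) and any_node_ok:
        return Outcome.THP_ANY_NODE
    # 3. Fall back to base pages on any node.
    return Outcome.BASE_PAGES
```

      In this model a fault in a non-madvised VMA with the default defrag
      setting goes straight from attempt 1 to base pages, which matches the
      condition attached to attempt 2 in the list.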
      
      Link: http://lkml.kernel.org/r/08a3f4dd-c3ce-0009-86c5-9ee51aba8557@suse.cz
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cc638f32
  5. 02 Dec, 2019 6 commits
  6. 07 Nov, 2019 2 commits
    • J
      mm/page_alloc.c: ratelimit allocation failure warnings more aggressively · 1be334e5
      Johannes Weiner committed
      While investigating a bug related to higher atomic allocation failures,
      we noticed the failure warnings positively drowning the console and, in
      our case, triggering lockup warnings because the serial console was too
      slow to handle all that output.
      
      But even if we had a faster console, it's unclear what additional
      information the current level of repetition provides.
      
      Allocation failures happen for three reasons: The machine is OOM, the VM
      is failing to handle reasonable requests, or somebody is making
      unreasonable requests (and didn't acknowledge their opportunism with
      __GFP_NOWARN).  Having the memory dump, a callstack, and the ratelimit
      stats on skipped failure warnings should provide enough information to
      let users/admins/developers know whether something is wrong and point
      them in the right direction for debugging, bpftracing etc.
      
      Limit allocation failure warnings to one spew every ten seconds.
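
      As a rough illustration of "one spew every ten seconds", here is a
      minimal Python model of an interval/burst rate limiter that also counts
      the warnings it skipped.  The class and method names are hypothetical
      and the kernel's own ratelimit implementation differs in detail; only
      the ten-second interval and the single warning per interval come from
      the text above.

```python
class WarnRateLimit:
    """Allow at most `burst` warnings per `interval_s` seconds."""

    def __init__(self, interval_s=10.0, burst=1):
        self.interval_s = interval_s
        self.burst = burst
        self.window_start = None   # time the current window opened
        self.printed = 0           # warnings emitted in this window
        self.missed = 0            # warnings suppressed in this window

    def allow(self, now):
        """Return (allowed, suppressed_in_previous_window)."""
        suppressed = 0
        if (self.window_start is None
                or now - self.window_start >= self.interval_s):
            # Open a new window and report how many warnings were skipped.
            suppressed = self.missed
            self.window_start = now
            self.printed = 0
            self.missed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True, suppressed
        self.missed += 1
        return False, 0
```

      The suppressed count returned with the first allowed warning of a new
      window plays the role of the "ratelimit stats on skipped failure
      warnings" mentioned above.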
      
      Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be334e5
    • M
      mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Mel Gorman committed
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max
      Configuration was modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before the patch, the breakdown of batch and high values over all
      zones was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
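
      The batch/high pairs in both breakdowns are consistent with
      high = 6 * batch (378 = 6 * 63, 42 = 6 * 7).  The Python sketch below
      shows how a too-small managed page count can produce the low pair.  It
      is modelled on the allocator's batch sizing as best understood for
      kernels of this era; the 1 MiB cap, the division by four and the 4 KiB
      page size are assumptions, and the function names are hypothetical.

```python
def rounddown_pow_of_two(n):
    """Largest power of two that is <= n, for n >= 1."""
    return 1 << (n.bit_length() - 1)


def zone_batchsize(managed_pages, page_size=4096):
    """Assumed batch sizing: ~1/1024 of the zone, capped at 1 MiB worth."""
    batch = managed_pages // 1024
    if batch * page_size > 1024 * 1024:
        batch = (1024 * 1024) // page_size
    batch //= 4
    if batch < 1:
        batch = 1
    # Clamp to a 2^n - 1 value.
    return rounddown_pow_of_two(batch + batch // 2) - 1


def pageset_limits(batch):
    """Return (batch, high) as reported in the breakdown above."""
    return max(1, batch), 6 * batch


# A zone whose managed_pages is already fully accounted:
print(pageset_limits(zone_batchsize(1 << 20)))   # (63, 378)
# The same code run while only 32768 pages are accounted:
print(pageset_limits(zone_batchsize(32768)))     # (7, 42)
```

      Under these assumptions an empty zone yields the (1, 0) pair, so the
      512 pagesets that moved from 7/42 to 63/378 are the ones whose zone
      was sized before deferred initialisation had finished.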
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
      Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e8fc007
  7. 15 Oct, 2019 1 commit
    • D
      mm, hugetlb: allow hugepage allocations to reclaim as needed · 3f36d866
      David Rientjes committed
      Commit b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when
      compaction may not succeed") has changed the allocator to bail out
      early in order to prevent potentially excessive memory reclaim.
      __GFP_RETRY_MAYFAIL is designed to retry the allocation, reclaim and
      compaction loop as long as there is a reasonable chance to make forward
      progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at the
      INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
      
      The most obvious affected subsystem is hugetlbfs which allocates huge
      pages based on an admin request (or via admin configured overcommit).  I
      have done a simple test which tries to allocate half of the memory for
      hugetlb pages while the memory is full of a clean page cache.  This is
      not an unusual situation because we try to cache as much of the memory
      as possible and sysctl/sysfs interface to allocate huge pages is there
      for flexibility to allocate hugetlb pages at any time.
      
      System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
      after the memory is prefilled by a clean page cache:
      
        root@test1:~# cat hugetlb_test.sh
      
        set -x
        echo 0 > /proc/sys/vm/nr_hugepages
        echo 3 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
        dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
        TS=$(date +%s)
        echo 256 > /proc/sys/vm/nr_hugepages
        cat /proc/sys/vm/nr_hugepages
      
      The results for 2 consecutive runs on clean 5.3
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
        + date +%s
        + TS=1569905284
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
        + date +%s
        + TS=1569905311
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
      
      Now with b39d0ee2 applied
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
        + date +%s
        + TS=1569905516
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        11
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
        + date +%s
        + TS=1569905541
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        12
      
      The success rate went down by a factor of 20!
      
      Hugetlb allocation requests might fail, and it is reasonable to expect
      them to fail when memory is extremely fragmented or under heavy
      pressure, but the above situation is not such a case.
      
      Fix the regression by reverting back to the previous behavior for
      __GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
      those requests.
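
      The resulting rule can be sketched as a predicate.  This is a
      simplified Python model with hypothetical names; it leaves out every
      other condition the allocator slow path checks and only captures the
      exemption described above.

```python
from enum import Enum, auto


class CompactResult(Enum):
    SKIPPED = auto()    # stands in for COMPACT_SKIPPED
    DEFERRED = auto()   # stands in for COMPACT_DEFERRED
    OTHER = auto()      # any other compaction outcome


def should_bail_out_early(hugepage_sized_order, compact_result,
                          retry_mayfail):
    """Decide whether to fail the allocation instead of reclaiming.

    `retry_mayfail` models a request carrying __GFP_RETRY_MAYFAIL; such
    requests are exempt from the heuristic and keep retrying.
    """
    if retry_mayfail:
        return False
    if not hugepage_sized_order:
        return False
    return compact_result in (CompactResult.SKIPPED,
                              CompactResult.DEFERRED)
```

      In this model a hugetlb pool resize, which uses __GFP_RETRY_MAYFAIL,
      never takes the early exit, so it can reclaim the clean page cache as
      it did on 5.3.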
      
      Mike said:
      
      : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
      : boot where this may not be as much of an issue.  However, I am aware of at
      : least three use cases where allocations are made after the system has been
      : up and running for quite some time:
      :
      : - DB reconfiguration.  If sysctl/sysfs fails to get required number of
      :   huge pages, system is rebooted to perform allocation after boot.
      :
      : - VM provisioning.  If unable to get the required number of huge
      :   pages, fall back to base pages.
      :
      : - An application that does not preallocate pool, but rather allocates
      :   pages at fault time for optimal NUMA locality.
      :
      : In all cases, I would expect b39d0ee2 to cause regressions and
      : noticeable behavior changes.
      :
      : My quick/limited testing in
      : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
      : was insufficient.  It was also mentioned that if something like
      : b39d0ee2 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
      : requests as in this patch.
      
      [mhocko@suse.com: reworded changelog]
      Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f36d866
  8. 08 Oct, 2019 1 commit
    • Q
      mm/page_alloc.c: fix a crash in free_pages_prepare() · 234fdce8
      Qian Cai committed
      On architectures like s390, arch_free_page() could mark the page unused
      (set_page_unused()) and any access later would trigger a kernel panic.
      Fix it by moving arch_free_page() after all possible accessing calls.
      
       Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
       Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
                  R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
       Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
                  000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
                  00000000275cca48 0000000000000100 0000000000000008 000003d080010000
                  00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
       Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
                  0000000026c2b962: e32014000036 pfd 2,1024(%r1)
                 #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
                 >0000000026c2b96e: 41101100  la %r1,256(%r1)
                  0000000026c2b972: a737fff8  brctg %r3,26c2b962
                  0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
                  0000000026c2b97c: e31003400004 lg %r1,832
                  0000000026c2b982: ebff1430016a asi 5168(%r1),-1
       Call Trace:
       __free_pages_ok+0x16a/0x5d8)
       memblock_free_all+0x206/0x290
       mem_init+0x58/0x120
       start_kernel+0x2b0/0x570
       startup_continue+0x6a/0xc0
       INFO: lockdep is turned off.
       Last Breaking-Event-Address:
       __free_pages_ok+0x372/0x5d8
       Kernel panic - not syncing: Fatal exception: panic_on_oops
       00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C
      
      In the past, only kernel_poison_pages() would trigger this but it needs
      "page_poison=on" kernel cmdline, and I suspect nobody tested that on
      s390.  Recently, kernel_init_free_pages() (commit 6471384a ("mm:
      security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
      was added and could trigger this as well.
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
      Fixes: 8823b1db ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      234fdce8
  9. 29 Sep, 2019 1 commit
    • D
      mm, page_alloc: avoid expensive reclaim when compaction may not succeed · b39d0ee2
      David Rientjes committed
      Memory compaction has a couple of significant drawbacks as the allocation
      order increases, specifically:
      
       - isolate_freepages() is responsible for finding free pages to use as
         migration targets and is implemented as a linear scan of memory
         starting at the end of a zone,
      
       - failing order-0 watermark checks in memory compaction does not account
         for how far below the watermarks the zone actually is: to enable
         migration, there must be *some* free memory available.  Per the above,
         watermarks are not always sufficient if isolate_freepages() cannot
         find the free memory but it could require hundreds of MBs of reclaim to
         even reach this threshold (read: potentially very expensive reclaim with
         no indication compaction can be successful), and
      
       - if compaction at this order has failed recently so that it does not even
         run as a result of deferred compaction, looping through reclaim can often
         be pointless.
      
      For hugepage allocations, these are quite substantial drawbacks because
      these are very high order allocations (order-9 on x86) and falling back to
      doing reclaim can potentially be *very* expensive without any indication
      that compaction would even be successful.
      
      Reclaim itself is unlikely to free entire pageblocks and certainly no
      reliance should be put on it to do so in isolation (recall lumpy reclaim).
      This means we should avoid reclaim and simply fail hugepage allocation if
      compaction is deferred.
      
      It is also not helpful to thrash a zone by doing excessive reclaim if
      compaction may not be able to access that memory.  If order-0 watermarks
      fail and the allocation order is sufficiently large, it is likely better
      to fail the allocation rather than thrashing the zone.
      
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b39d0ee2
  10. 25 Sep, 2019 4 commits
  11. 03 Sep, 2019 1 commit
    • M
      sched/topology: Improve load balancing on AMD EPYC systems · a55c7454
      Matt Fleming committed
      SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
      for any sched domains with a NUMA distance greater than 2 hops
      (RECLAIM_DISTANCE). The idea being that it's expensive to balance
      across domains that far apart.
      
      However, as is rather unfortunately explained in:
      
        commit 32e45ff4 ("mm: increase RECLAIM_DISTANCE to 30")
      
      the value for RECLAIM_DISTANCE is based on node distance tables from
      2011-era hardware.
      
      Current AMD EPYC machines have the following NUMA node distances:
      
       node distances:
       node   0   1   2   3   4   5   6   7
         0:  10  16  16  16  32  32  32  32
         1:  16  10  16  16  32  32  32  32
         2:  16  16  10  16  32  32  32  32
         3:  16  16  16  10  32  32  32  32
         4:  32  32  32  32  10  16  16  16
         5:  32  32  32  32  16  10  16  16
         6:  32  32  32  32  16  16  10  16
         7:  32  32  32  32  16  16  16  10
      
      where 2 hops is 32.
      
      The result is that the scheduler fails to load balance properly across
      NUMA nodes on different sockets -- 2 hops apart.
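
      The effect can be reproduced with a few lines of Python.  This is only
      a model of the check described above: the function name and the flag
      strings are illustrative, the threshold of 30 comes from the title of
      the commit quoted earlier, and the raised threshold of 32 used at the
      end is an assumed example value rather than something stated here.

```python
BALANCE_FLAGS = frozenset({"SD_BALANCE_FORK", "SD_BALANCE_EXEC",
                           "SD_WAKE_AFFINE"})


def domain_balance_flags(numa_distance, reclaim_distance):
    """Return the balancing flags left on a NUMA sched domain.

    The flags are stripped when the domain's node distance is strictly
    greater than the reclaim distance threshold.
    """
    if numa_distance > reclaim_distance:
        return frozenset()
    return BALANCE_FLAGS


# Distances from node 0 in the EPYC table above.
same_socket, other_socket = 16, 32

print(domain_balance_flags(same_socket, 30))    # all three flags kept
print(domain_balance_flags(other_socket, 30))   # frozenset(): stripped
print(domain_balance_flags(other_socket, 32))   # kept once threshold >= 32
```

      With the threshold at 30, the cross-socket domain (distance 32) loses
      fork/exec balancing and wake affinity, which is why the spinner
      threads stay on node 0 until the active balancer intervenes.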
      
      For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
      (CPUs 32-39) like so,
      
        $ numactl -C 0-7,32-39 ./spinner 16
      
      causes all threads to fork and remain on node 0 until the active
      balancer kicks in after a few seconds and forcibly moves some threads
      to node 4.
      
      Override node_reclaim_distance for AMD Zen.
      
      Signed-off-by: Matt Fleming <matt@codeblueprint.co.uk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Suravee.Suthikulpanit@amd.com
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Thomas.Lendacky@amd.com
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a55c7454
  12. 25 Aug, 2019 1 commit
    • D
      mm, page_alloc: move_freepages should not examine struct page of reserved memory · cd961038
      David Rientjes committed
      After commit 907ec5fc ("mm: zero remaining unavailable struct
      pages"), struct page of reserved memory is zeroed.  This causes
      page->flags to be 0 and fixes issues related to reading
      /proc/kpageflags, for example, of reserved memory.
      
      The VM_BUG_ON() in move_freepages_block(), however, assumes that
      page_zone() is meaningful even for reserved memory.  That assumption is
      no longer true after the aforementioned commit.
      
      There's no reason why move_freepages_block() should be testing the
      legitimacy of page_zone() for reserved memory; its scope is limited only
      to pages on the zone's freelist.
      
      Note that pfn_valid() can be true for reserved memory: there is a
      backing struct page.  The check for page_to_nid(page) is also buggy but
      reserved memory normally only appears on node 0 so the zeroing doesn't
      affect this.
      
      Move the debug checks to after verifying PageBuddy is true.  This
      isolates the scope of the checks to only be for buddy pages which are on
      the zone's freelist which move_freepages_block() is operating on.  In
      this case, an incorrect node or zone is a bug worthy of being warned
      about (and the examination of struct page is acceptable because this
      memory is not reserved).
      
      Why does move_freepages_block() get called on reserved memory? It's
      simply math after finding a valid free page from the per-zone free area
      to use as fallback.  We find the beginning and end of the pageblock of
      the valid page and that can bring us into memory that was reserved per
      the e820.  pfn_valid() is still true (it's backed by a struct page), but
      since it's zero'd we shouldn't make any inferences here about comparing
      its node or zone.  The current node check just happens to succeed most
      of the time by luck because reserved memory typically appears on node 0.
      
      The fix here is to validate that we actually have buddy pages before
      testing if there's any type of zone or node strangeness going on.
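
      A small Python model of both points, the pageblock rounding and the
      order of the checks, is sketched below.  The dict-based page
      representation, the function names and the 512-page pageblock size
      are assumptions made for illustration; this is not the kernel's data
      layout.

```python
PAGEBLOCK_NR_PAGES = 512  # assumption: order-9 pageblocks of 4 KiB pages


def pageblock_range(pfn, nr_pages=PAGEBLOCK_NR_PAGES):
    """Round a pfn down/up to the first and last pfn of its pageblock."""
    start = pfn & ~(nr_pages - 1)
    return start, start + nr_pages - 1


def count_buddy_pages(pages, zone_id, start_pfn, end_pfn):
    """Count free (buddy) pages in a range, checking the zone only on them.

    `pages` maps pfn -> {"buddy": bool, "zone": int}.  Reserved pages have
    a zeroed entry (buddy False, zone 0), so their zone must not be
    trusted.
    """
    count = 0
    for pfn in range(start_pfn, end_pfn + 1):
        page = pages.get(pfn)
        if page is None:          # models !pfn_valid()
            continue
        if not page["buddy"]:     # reserved or in use: skip before checking
            continue
        if page["zone"] != zone_id:
            raise ValueError(f"buddy page {pfn} is in the wrong zone")
        count += 1
    return count
```

      Running the zone comparison before the buddy test would raise on the
      zeroed reserved entry even though nothing is wrong, which is the false
      positive the patch removes.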
      
      We noticed it almost immediately after bringing 907ec5fc in on
      CONFIG_DEBUG_VM builds.  It depends on finding specific free pages in
      the per-zone free area where the math in move_freepages() will bring the
      start or end pfn into reserved memory and wanting to claim that entire
      pageblock as a new migratetype.  So the path will be rare, require
      CONFIG_DEBUG_VM, and require fallback to a different migratetype.
      
      Some struct pages were already zeroed from reserve pages before
      907ec5fca3c so it theoretically could trigger before this commit.  I
      think it's rare enough under a config option that most people don't run
      that others may not have noticed.  I wouldn't argue against a stable tag
      and the backport should be easy enough, but probably wouldn't single out
      a commit that this is fixing.
      
      Mel said:
      
      : The overhead of the debugging check is higher with this patch although
      : it'll only affect debug builds and the path is not particularly hot.
      : If this was a concern, I think it would be reasonable to simply remove
      : the debugging check as the zone boundaries are checked in
      : move_freepages_block and we never expect a zone/node to be smaller than
      : a pageblock and stuck in the middle of another zone.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd961038
  13. 20 Aug, 2019 1 commit
  14. 19 Jul, 2019 4 commits
    • D
      mm/sparsemem: support sub-section hotplug · ba72b4c8
      Dan Williams committed
      The libnvdimm sub-system has suffered a series of hacks and broken
      workarounds for the memory-hotplug implementation's awkward
      section-aligned (128MB) granularity.
      
      For example the following backtrace is emitted when attempting
      arch_add_memory() with physical address ranges that intersect 'System
      RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:
      
          # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
          100000000-1ffffffff : System RAM
          200000000-303ffffff : Persistent Memory (legacy)
          304000000-43fffffff : System RAM
          440000000-23ffffffff : Persistent Memory
          2400000000-43bfffffff : Persistent Memory
            2400000000-43bfffffff : namespace2.0
      
          WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
          [..]
          RIP: 0010:add_pages+0x5c/0x60
          [..]
          Call Trace:
           devm_memremap_pages+0x460/0x6e0
           pmem_attach_disk+0x29e/0x680 [nd_pmem]
           ? nd_dax_probe+0xfc/0x120 [libnvdimm]
           nvdimm_bus_probe+0x66/0x160 [libnvdimm]
      
      It was discovered that the problem goes beyond RAM vs PMEM collisions as
      some platforms produce PMEM vs PMEM collisions within a given section.
      The libnvdimm workaround for that case revealed that the libnvdimm
      section-alignment-padding implementation has been broken for a long
      while.
      
      A fix for that long-standing breakage introduces as many problems as it
      solves as it would require a backward-incompatible change to the
      namespace metadata interpretation.  Instead of that dubious route [1],
      address the root problem in the memory-hotplug implementation.
      
      Note that EEXIST is no longer treated as success, as that is how
      sparse_add_section() reports subsection collisions.  It was also obviated
      by recent changes to perform the request_region() for 'System RAM'
      before arch_add_memory() in the add_memory() sequence.
      
      [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
      
      [osalvador@suse.de: fix deactivate_section for early sections]
        Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
      Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ba72b4c8
    • D
      mm: kill is_dev_zone() helper · 46d945ae
      Dan Williams committed
      Given there are no more usages of is_dev_zone() outside of 'ifdef
      CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.
      
      Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      46d945ae
    • D
      mm/sparsemem: add helpers track active portions of a section at boot · f46edbd1
      Dan Williams committed
      Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
      sub-section active bitmask, each bit representing a PMD_SIZE span of the
      architecture's memory hotplug section size.
      
      The implication of a partially populated section is that pfn_valid()
      needs to go beyond a valid_section() check and either determine that the
      section is an "early section", or read the sub-section active ranges
      from the bitmask.  The expectation is that the bitmask (subsection_map)
      fits in the same cacheline as the valid_section() / early_section()
      data, so the incremental performance overhead to pfn_valid() should be
      negligible.
      
      The rationale for using early_section() to short-circuit the
      subsection_map check is that there are legacy code paths that use
      pfn_valid() at section granularity before validating the pfn against
      pgdat data.  So, the early_section() check allows those traditional
      assumptions to persist while also permitting subsection_map to tell the
      truth for purposes of populating the unused portions of early sections
      with PMEM and other ZONE_DEVICE mappings.
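
      The lookup order described above can be modelled in a few lines of
      Python.  This is a sketch under stated assumptions: 4 KiB pages,
      128MB sections and a 2 MiB PMD_SIZE (so 64 subsection bits per
      section), with a hypothetical Section class and a dict standing in for
      the section table.

```python
PAGE_SHIFT = 12            # assumption: 4 KiB pages
SECTION_SIZE_BITS = 27     # 128MB sections
SUBSECTION_SHIFT = 21      # assumption: PMD_SIZE is 2 MiB

PFN_SECTION_SHIFT = SECTION_SIZE_BITS - PAGE_SHIFT
PAGES_PER_SECTION = 1 << PFN_SECTION_SHIFT
PAGES_PER_SUBSECTION = 1 << (SUBSECTION_SHIFT - PAGE_SHIFT)
SUBSECTIONS_PER_SECTION = 1 << (SECTION_SIZE_BITS - SUBSECTION_SHIFT)


class Section:
    """Hypothetical stand-in for a memory section and its usage data."""

    def __init__(self, early=False, subsection_map=0):
        self.early = early                    # section describes boot memory
        self.subsection_map = subsection_map  # one bit per active subsection


def pfn_valid(sections, pfn):
    """Section check first, then early-section shortcut, then the bitmap."""
    section = sections.get(pfn >> PFN_SECTION_SHIFT)
    if section is None:
        return False
    if section.early:
        # Legacy callers assume section granularity for boot memory.
        return True
    index = (pfn & (PAGES_PER_SECTION - 1)) // PAGES_PER_SUBSECTION
    return bool((section.subsection_map >> index) & 1)
```

      For a hot-added section with only its first subsection active, the
      model reports pfns 0-511 of that section as valid and the rest as
      invalid, while every pfn of an early section stays valid.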
      
      Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Qian Cai <cai@lca.pw>
      Tested-by: Jane Chu <jane.chu@oracle.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f46edbd1
    • D
      mm/sparsemem: introduce struct mem_section_usage · f1eca35a
      Dan Williams committed
      Patch series "mm: Sub-section memory hotplug support", v10.
      
      The memory hotplug section is an arbitrary / convenient unit for memory
      hotplug.  'Section-size' units have bled into the user interface
      ('memblock' sysfs) and can not be changed without breaking existing
      userspace.  The section-size constraint, while mostly benign for typical
      memory hotplug, has and continues to wreak havoc with 'device-memory'
      use cases, persistent memory (pmem) in particular.  Recall that pmem
      uses devm_memremap_pages(), and subsequently arch_add_memory(), to
      allocate a 'struct page' memmap for pmem.  However, it does not use the
      'bottom half' of memory hotplug, i.e.  never marks pmem pages online and
      never exposes the userspace memblock interface for pmem.  This leaves an
      opening to redress the section-size constraint.
      
      To date, the libnvdimm subsystem has attempted to inject padding to
      satisfy the internal constraints of arch_add_memory().  Beyond
      complicating the code, leading to bugs [2], wasting memory, and limiting
      configuration flexibility, the padding hack is broken when the platform
      changes the physical memory alignment of pmem from one boot to the
      next.  Device failure (intermittent or permanent) and physical
      reconfiguration are events that can cause the platform firmware to
      change the physical placement of pmem on a subsequent boot, and device
      failure is an everyday event in a data-center.
      
      It turns out that sections are only a hard requirement of the
      user-facing interface for memory hotplug and with a bit more
      infrastructure sub-section arch_add_memory() support can be added for
      kernel internal usages like devm_memremap_pages().  Here is an analysis
      of the current design assumptions in the current code and how they are
      addressed in the new implementation:
      
      Current design assumptions:
      
       - Sections that describe boot memory (early sections) are never
         unplugged / removed.
      
       - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
         valid_section() check.
      
       - __add_pages() and helper routines assume all operations occur in
         PAGES_PER_SECTION units.
      
       - The memblock sysfs interface only comprehends full sections
      
      New design assumptions:
      
       - Sections are instrumented with a sub-section bitmask to track (on
         x86) individual 2MB sub-divisions of a 128MB section.
      
       - Partially populated early sections can be extended with additional
         sub-sections, and those sub-sections can be removed with
         arch_remove_memory(). With this in place we no longer lose usable
         memory capacity to padding.
      
       - pfn_valid() is updated to look deeper than valid_section() to also
         check the active-sub-section mask. This indication is in the same
         cacheline as the valid_section() so the performance impact is
         expected to be negligible. So far the lkp robot has not reported any
         regressions.
      
       - Outside of the core vmemmap population routines which are replaced,
         other helper routines like shrink_{zone,pgdat}_span() are updated to
         handle the smaller granularity. Core memory hotplug routines that
         deal with online memory are not touched.
      
       - The existing memblock sysfs user api guarantees / assumptions are not
         touched since this capability is limited to !online
         !memblock-sysfs-accessible sections.
      
      Meanwhile the issue reports continue to roll in from users that do not
      understand when and how the 128MB constraint will bite them.  The current
      implementation relied on being able to support at least one misaligned
      namespace, but that immediately falls over on any moderately complex
      namespace creation attempt.  Beyond the initial problem of 'System RAM'
      colliding with pmem, and the unsolvable problem of physical alignment
      changes, Linux is now being exposed to platforms that collide pmem ranges
      with other pmem ranges by default [3].  In short, devm_memremap_pages()
      has pushed the venerable section-size constraint past the breaking point,
      and the simplicity of section-aligned arch_add_memory() is no longer
      tenable.
      
      These patches are exposed to the kbuild robot on a subsection-v10 branch
      [4], and a preview of the unit test for this functionality is available
      on the 'subsection-pending' branch of ndctl [5].
      
      [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
      [3]: https://github.com/pmem/ndctl/issues/76
      [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
      [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c
      
      This patch (of 13):
      
      Towards enabling memory hotplug to track partial population of a section,
      introduce 'struct mem_section_usage'.
      
      A pointer to a 'struct mem_section_usage' instance replaces the existing
      pointer to a 'pageblock_flags' bitmap.  Effectively it adds one more
      'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
      a new 'subsection_map' bitmap.  The new bitmap enables the memory
      hot{plug,remove} implementation to act on incremental sub-divisions of a
      section.
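
      As a rough illustration of that layout, the sketch below places one
      bitmap word in front of a variable-length usemap and marks ranges of
      sub-sections populated or removed.  All names ('struct
      toy_section_usage', toy_usage_alloc(), toy_subsection_fill(),
      toy_subsection_clear()) are hypothetical, the bitmap is assumed to be
      a single 64-bit word, and plain calloc() stands in for the kernel's
      usemap allocation path.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_SUBSECTIONS_PER_SECTION 64

/* Hypothetical layout: one bitmap word ahead of the existing usemap. */
struct toy_section_usage {
	uint64_t subsection_map;          /* the one extra word */
	unsigned long pageblock_flags[];  /* the pre-existing usemap */
};

/* Allocate a zeroed usage record with room for 'usemap_longs' words. */
static struct toy_section_usage *toy_usage_alloc(size_t usemap_longs)
{
	return calloc(1, sizeof(struct toy_section_usage) +
			 usemap_longs * sizeof(unsigned long));
}

/* Build a mask of 'count' bits starting at bit 'first'. */
static uint64_t toy_subsection_mask(unsigned int first, unsigned int count)
{
	assert(count > 0 && first + count <= TOY_SUBSECTIONS_PER_SECTION);
	/* Shifting a 64-bit value by 64 is undefined, so special-case it. */
	uint64_t mask = (count == 64) ? ~0ULL : ((1ULL << count) - 1);
	return mask << first;
}

/* Mark sub-sections [first, first + count) as populated. */
static void toy_subsection_fill(struct toy_section_usage *u,
				unsigned int first, unsigned int count)
{
	u->subsection_map |= toy_subsection_mask(first, count);
}

/* Mark sub-sections [first, first + count) as removed. */
static void toy_subsection_clear(struct toy_section_usage *u,
				 unsigned int first, unsigned int count)
{
	u->subsection_map &= ~toy_subsection_mask(first, count);
}
```

      Keeping the bitmap in the same allocation as the usemap means that a
      section still carries a single pointer, which is the property the
      commit message relies on.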
      
      SUBSECTION_SHIFT is defined as a global constant instead of a
      per-architecture value like SECTION_SIZE_BITS in order to allow
      cross-arch compatibility of
      subsection users.  Specifically a common subsection size allows for the
      possibility that persistent memory namespace configurations be made
      compatible across architectures.
      
      The primary motivation for this functionality is to support platforms that
      mix "System RAM" and "Persistent Memory" within a single section, or
      multiple PMEM ranges with different mapping lifetimes within a single
      section.  The section restriction for hotplug has caused an ongoing saga
      of hacks and bugs for devm_memremap_pages() users.
      
      Beyond the fixups to teach existing paths how to retrieve the 'usemap'
      from a section, and updates to usemap allocation path, there are no
      expected behavior changes.
      
      Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richardw.yang@linux.intel.com>
      Tested-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>	[ppc64]
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Jane Chu <jane.chu@oracle.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Toshi Kani <toshi.kani@hpe.com>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f1eca35a
  15. 17 Jul, 2019 1 commit
  16. 13 Jul, 2019 2 commits
    • A
      mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options · 6471384a
      Alexander Potapenko committed on
      Patch series "add init_on_alloc/init_on_free boot options", v10.
      
      Provide init_on_alloc and init_on_free boot options.
      
      These are aimed at preventing possible information leaks and making the
      control-flow bugs that depend on uninitialized values more deterministic.
      
      Enabling either of the options guarantees that the memory returned by the
      page allocator and SL[AU]B is initialized with zeroes.  SLOB allocator
      isn't supported at the moment, as its emulation of kmem caches complicates
      handling of SLAB_TYPESAFE_BY_RCU caches correctly.
      
      Enabling init_on_free also guarantees that pages and heap objects are
      initialized right after they're freed, so it won't be possible to access
      stale data by using a dangling pointer.
      
      As suggested by Michal Hocko, right now we don't let heap users
      disable initialization for certain allocations.  There's not enough
      evidence that doing so can speed up real-life cases, and introducing ways
      to opt-out may result in things going out of control.
      
      This patch (of 2):
      
      The new options are needed to prevent possible information leaks and make
      control-flow bugs that depend on uninitialized values more deterministic.
      
      This is expected to be on-by-default on Android and Chrome OS.  And it
      gives the opportunity for anyone else to use it under distros too via the
      boot args.  (The init_on_free feature is regularly requested by folks
      where memory forensics is included in their threat models.)
      
      init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
      objects with zeroes.  Initialization is done at allocation time at the
      places where checks for __GFP_ZERO are performed.
      
      init_on_free=1 makes the kernel initialize freed pages and heap objects
      with zeroes upon their deletion.  This helps to ensure sensitive data
      doesn't leak via use-after-free accesses.
      
      Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
      returns zeroed memory.  The two exceptions are slab caches with
      constructors and SLAB_TYPESAFE_BY_RCU flag.  Those are never
      zero-initialized to preserve their semantics.
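
      The two behaviours can be modelled with a small user-space object
      cache.  This is only an analogue under stated assumptions, not the
      kernel code path: the fixed-size pool, the toy_init_on_alloc and
      toy_init_on_free flags, and the toy_cache_*() functions are all
      hypothetical, and the constructor and SLAB_TYPESAFE_BY_RCU exceptions
      are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_OBJ_SIZE 32
#define TOY_NR_OBJS  4

/* Hypothetical stand-ins for the two boot options; both default to off. */
static bool toy_init_on_alloc = false;
static bool toy_init_on_free = false;

/* A tiny fixed-size object cache backed by static storage. */
static unsigned char toy_pool[TOY_NR_OBJS][TOY_OBJ_SIZE];
static bool toy_in_use[TOY_NR_OBJS];

/* Hand out the first free slot, zeroing it if init-on-alloc is set. */
static void *toy_cache_alloc(void)
{
	for (size_t i = 0; i < TOY_NR_OBJS; i++) {
		if (!toy_in_use[i]) {
			toy_in_use[i] = true;
			if (toy_init_on_alloc)
				memset(toy_pool[i], 0, TOY_OBJ_SIZE);
			return toy_pool[i];
		}
	}
	return NULL;  /* pool exhausted */
}

/* Release a slot, wiping it first if init-on-free is set. */
static void toy_cache_free(void *obj)
{
	size_t off = (size_t)((unsigned char *)obj - &toy_pool[0][0]);
	size_t i = off / TOY_OBJ_SIZE;

	assert(off % TOY_OBJ_SIZE == 0 && i < TOY_NR_OBJS && toy_in_use[i]);
	if (toy_init_on_free)
		memset(toy_pool[i], 0, TOY_OBJ_SIZE);
	toy_in_use[i] = false;
}
```

      With both flags off, a freed slot keeps its old bytes and the next
      allocation sees them.  Setting toy_init_on_free wipes the slot as soon
      as it is released, while toy_init_on_alloc delays the wipe until the
      slot is handed out again.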
      
      Both init_on_alloc and init_on_free default to zero, but those defaults
      can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
      CONFIG_INIT_ON_FREE_DEFAULT_ON.
      
      If either SLUB poisoning or page poisoning is enabled, those options take
      precedence over init_on_alloc and init_on_free: initialization is only
      applied to unpoisoned allocations.
      
      Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:
      
      hackbench, init_on_free=1:  +7.62% sys time (st.err 0.74%)
      hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)
      
      Linux build with -j12, init_on_free=1:  +8.38% wall time (st.err 0.39%)
      Linux build with -j12, init_on_free=1:  +24.42% sys time (st.err 0.52%)
      Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
      Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)
      
      The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
      is within the standard error.
      
      The new features are also going to pave the way for hardware memory
      tagging (e.g.  arm64's MTE), which will require both on_alloc and on_free
      hooks to set the tags for heap objects.  With MTE, tagging will have the
      same cost as memory initialization.
      
      Although init_on_free is rather costly, there are paranoid use-cases where
      in-memory data lifetime is desired to be minimized.  There are various
      arguments for/against the realism of the associated threat models, but
      given that we'll need the infrastructure for MTE anyway, and there are
      people who want wipe-on-free behavior no matter what the performance cost,
      it seems reasonable to include it in this series.
      
      [glider@google.com: v8]
        Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
      [glider@google.com: v9]
        Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
      [glider@google.com: v10]
        Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
      Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
      Signed-off-by: Alexander Potapenko <glider@google.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>		[page and dmapool parts]
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Serge E. Hallyn" <serge@hallyn.com>
      Cc: Nick Desaulniers <ndesaulniers@google.com>
      Cc: Kostya Serebryany <kcc@google.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Sandeep Patil <sspatil@android.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6471384a
    • N
      mm/large system hash: clear hashdist when only one node with memory is booted · e03a5125
      Nicholas Piggin committed on
      CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
      when booting on single node machines.  This causes the large system hashes
      to be allocated with vmalloc, and mapped with small pages.
      
      This change clears hashdist if only one node has come up with memory.
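
      The decision itself is small and can be sketched in user space as
      follows.  This is an assumption-laden model rather than the patch:
      toy_hashdist, toy_nodes_with_memory() and toy_fixup_hashdist() are
      hypothetical names, and the per-node page counts are passed in as a
      plain array instead of being read from kernel node state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag; starts enabled, as described for 64-bit NUMA builds. */
static bool toy_hashdist = true;

/* Count the nodes that came up with at least one page of memory. */
static size_t toy_nodes_with_memory(const unsigned long *node_pages,
				    size_t nr_nodes)
{
	size_t count = 0;

	assert(node_pages != NULL);
	for (size_t i = 0; i < nr_nodes; i++)
		if (node_pages[i] != 0)
			count++;
	return count;
}

/* Clear the flag when exactly one node has memory. */
static void toy_fixup_hashdist(const unsigned long *node_pages,
			       size_t nr_nodes)
{
	if (toy_nodes_with_memory(node_pages, nr_nodes) == 1)
		toy_hashdist = false;
}
```

      A machine with two populated nodes keeps the flag set, while a
      machine that reports several possible nodes but only one with memory
      ends up with it cleared.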
      
      This results in the important large inode and dentry hashes using memblock
      allocations.  All others are within 4MB size up to about 128GB of RAM,
      which allows them to be allocated from the linear map on most non-NUMA
      images.
      
      Other big hashes like futex and TCP should eventually be moved over to the
      same style of allocation as those vfs caches that use HASH_EARLY if
      !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.
      
      This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
      ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
      (performance is in the noise, under 1% difference, page tables are likely
      to be well cached for this workload).
      
      Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com
      Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e03a5125