1. 07 November 2019, 6 commits
    • mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y · ec649c9d
      Authored by Ville Syrjälä
      I got some khugepaged spew on a 32bit x86:
      
        BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
        INFO: lockdep is turned off.
        CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
        Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203    07/08/2009
        Call Trace:
         dump_stack+0x66/0x8e
         ___might_sleep.cold.96+0x95/0xa6
         __might_sleep+0x2e/0x80
         collapse_huge_page.isra.51+0x5ac/0x1360
         khugepaged+0x9a9/0x20f0
         kthread+0xf5/0x110
         ret_from_fork+0x2e/0x38
      
      Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
      vs.  mmu_notifier_invalidate_range_start().  Let's do the naive approach
      and just reorder the two operations.
      
      Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
      Fixes: 810e24e0 ("mm/mmu_notifiers: annotate with might_sleep()")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec649c9d
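      The fix is essentially an ordering change: call the (possibly sleeping)
      notifier before entering the atomic kmap.  A simplified sketch of the
      intended order (illustrative, not the literal upstream diff):

        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
                                address, address + HPAGE_PMD_SIZE);
        mmu_notifier_invalidate_range_start(&range);    /* may sleep */

        pte = pte_offset_map(pmd, address);     /* kmap_atomic() if CONFIG_HIGHPTE=y */
        pte_ptl = pte_lockptr(mm, pmd);
        /* ... collapse work ... */
        pte_unmap(pte);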
    • mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo · 93b3a674
      Authored by Michal Hocko
      pagetypeinfo_showfree_print is called with zone->lock held and interrupts
      disabled.  This is not really nice because it blocks both any interrupts
      on that CPU and the page allocator.  On large machines this might even
      trigger the hard lockup detector.
      
      Considering that pagetypeinfo is a debugging tool, we do not really need
      exact numbers here.  The primary reason to look at the output is to see
      how pageblocks are spread among different migratetypes, and a low number
      of pages is much more interesting, so putting a bound on the number of
      pages counted per free_list sounds like a reasonable tradeoff.
      
      The new output will simply tell
        [...]
        Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
      
      instead of
        Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
      
      The limit has been chosen arbitrarily and can be changed in the future
      should there be a need for that.
      
      While we are at it, also drop the zone lock after each free_list
      iteration which will help with the IRQ and page allocator responsiveness
      even further as the IRQ lock held time is always bound to those 100k
      pages.
      
      [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
      Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93b3a674
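      A rough sketch of the bounded counting described above (variable names
      and the surrounding per-migratetype loop are assumed context, not the
      verbatim upstream code):

        /* inside the per-migratetype loop, with zone->lock held */
        freecount = 0;
        overflow = false;
        list_for_each(curr, &area->free_list[mtype]) {
                if (++freecount >= 100000) {    /* cap the walk */
                        overflow = true;
                        break;
                }
        }
        seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);
        spin_unlock_irq(&zone->lock);   /* let IRQs and the allocator in */
        cond_resched();
        spin_lock_irq(&zone->lock);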
    • mm, vmstat: hide /proc/pagetypeinfo from normal users · abaed011
      Authored by Michal Hocko
      /proc/pagetypeinfo is a debugging tool to examine internal page
      allocator state with respect to fragmentation.  It is not very useful
      for anything else, so normal users really do not need to read this file.
      
      Waiman Long has noticed that reading this file can have negative side
      effects because zone->lock is necessary for gathering data and that a)
      interferes with the page allocator and its users and b) can lead to hard
      lockups on large machines which have very long free_list.
      
      Reduce both issues by simply not exporting the file to regular users.
      
      Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Waiman Long <longman@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jann Horn <jannh@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abaed011
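      The change itself is tiny: register the proc entry readable by root
      only.  Roughly (sketch; the seq_operations symbol name is assumed):

        /* mm/vmstat.c init code: mode 0400 instead of 0444 */
        proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);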
    • mm/mmu_notifiers: use the right return code for WARN_ON · df2ec764
      Authored by Jason Gunthorpe
      The return code from the op callback is actually in _ret, while the
      WARN_ON was checking ret which causes it to misfire.
      
      Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
      Fixes: 8402ce61 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df2ec764
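      The bug is a classic shadowed-variable mixup.  In outline (simplified,
      not the exact upstream code):

        int ret = 0;

        /* for each registered notifier ... */
        if (mn->ops->invalidate_range_start) {
                int _ret = mn->ops->invalidate_range_start(mn, range);

                if (_ret) {
                        /* the WARN_ON used to test "ret", which is still 0 here */
                        WARN_ON(mmu_notifier_range_blockable(range) ||
                                _ret != -EAGAIN);
                        ret = _ret;
                }
        }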
    • mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Authored by Mel Gorman
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max Configuration was
      modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before the patch, the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
      Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e8fc007
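      The fix boils down to refreshing the per-cpu pageset limits once
      deferred init has finished, along these lines (sketch; placed at the
      end of page_alloc_init_late()):

        /* zone->managed_pages is finally accurate, recompute pcp limits */
        for_each_populated_zone(zone)
                zone_pcp_update(zone);  /* recalculates pcp->batch and pcp->high */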
    • mm: memcontrol: fix NULL-ptr deref in percpu stats flush · 7961eee3
      Authored by Shakeel Butt
      __mem_cgroup_free() can be called on the failure path in
      mem_cgroup_alloc().  However memcg_flush_percpu_vmstats() and
      memcg_flush_percpu_vmevents(), which are called from __mem_cgroup_free(),
      access fields of the memcg which can potentially be NULL when called
      from the failure path of mem_cgroup_alloc().  Indeed syzbot has reported
      the following crash:
      
      	kasan: CONFIG_KASAN_INLINE enabled
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
      	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
      	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
      	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
      	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
      	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
      	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
      	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
      	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
      	mem_cgroup_free mm/memcontrol.c:5033 [inline]
      	mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
      	css_create kernel/cgroup/cgroup.c:5156 [inline]
      	cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
      	cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
      	kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
      	vfs_mkdir+0x42e/0x670 fs/namei.c:3807
      	do_mkdirat+0x234/0x2a0 fs/namei.c:3830
      	__do_sys_mkdir fs/namei.c:3846 [inline]
      	__se_sys_mkdir fs/namei.c:3844 [inline]
      	__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
      	do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
      	entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by moving the flush to mem_cgroup_free(), as there is no need
      to flush anything if we see a failure in mem_cgroup_alloc().
      
      Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
      Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
      Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7961eee3
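      In outline, the flush moves from __mem_cgroup_free() into
      mem_cgroup_free(), which is only reached for fully constructed memcgs
      (sketch; argument lists simplified):

        static void mem_cgroup_free(struct mem_cgroup *memcg)
        {
                memcg_wb_domain_exit(memcg);
                /*
                 * Flush here rather than in __mem_cgroup_free(), so the
                 * alloc-failure path never touches half-initialized state.
                 */
                memcg_flush_percpu_vmstats(memcg);
                memcg_flush_percpu_vmevents(memcg);
                __mem_cgroup_free(memcg);
        }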
  2. 19 October 2019, 16 commits
    • mm/thp: allow dropping THP from page cache · ef18a1ca
      Authored by Kirill A. Shutemov
      Once a THP is added to the page cache, it cannot be dropped via
      /proc/sys/vm/drop_caches.  Fix this issue with proper handling in
      invalidate_mapping_pages().
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef18a1ca
    • mm/vmscan.c: support removing arbitrary sized pages from mapping · 906d278d
      Authored by William Kucharski
      __remove_mapping() assumes that pages can only be either base pages or
      HPAGE_PMD_SIZE.  Ask the page what size it is.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      906d278d
    • mm/thp: fix node page state in split_huge_page_to_list() · 06d3eff6
      Authored by Kirill A. Shutemov
      Make sure split_huge_page_to_list() handles the state of shmem THP and
      file THP properly.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
      Fixes: 60fbf0ab ("mm,thp: stats for file backed THP")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06d3eff6
    • mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch · a2ae8c05
      Authored by Ben Dooks (Codethink)
      mm_init.c needs to include <linux/mman.h> for the definition of
      vm_committed_as_batch.  Fixes the following sparse warning:
      
        mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2ae8c05
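      The fix is the one-line include the message describes:

        #include <linux/mman.h>    /* declares vm_committed_as_batch */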
    • mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition · d0e6a582
      Authored by Ben Dooks
      The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
      fix the following warning:
      
        mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0e6a582
    • mm: include <linux/huge_mm.h> for is_vma_temporary_stack · 444f84fd
      Authored by Ben Dooks
      Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
      to fix the following sparse warning:
      
        mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      444f84fd
    • mm/memcontrol: update lruvec counters in mem_cgroup_move_account · ae8af438
      Authored by Konstantin Khlebnikov
      Mapped, dirty and writeback pages are also counted in per-lruvec stats.
      These counters need to be updated when a page is moved between cgroups.

      Currently nobody is *consuming* the lruvec versions of these counters,
      so there is no user-visible effect.
      
      Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
      Fixes: 00f3ca2c ("mm: memcontrol: per-lruvec stats infrastructure")
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ae8af438
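      A sketch of the kind of change involved (the counters touched are
      NR_FILE_MAPPED, NR_FILE_DIRTY and NR_WRITEBACK; the lruvec lookup shown
      is an assumption about the surrounding code, not the verbatim diff):

        struct lruvec *from_vec = mem_cgroup_lruvec(pgdat, from);
        struct lruvec *to_vec = mem_cgroup_lruvec(pgdat, to);

        /* update the lruvec-level counters, not only the memcg-level ones */
        __mod_lruvec_state(from_vec, NR_FILE_MAPPED, -nr_pages);
        __mod_lruvec_state(to_vec, NR_FILE_MAPPED, nr_pages);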
    • hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() · f231fe42
      Authored by David Hildenbrand
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      Let's make sure that we only consider online memory (managed by the
      buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.
      
      page_zone() will call page_to_nid(), which will trigger
      VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
      and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
      can be the case when an offline memory block (e.g., never onlined) is
      spanned by a zone.
      
      Note: As explained by Michal in [1], alloc_contig_range() will verify
      the range.  So it boils down to the wrong access in this function.
      
      [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
      
      Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f231fe42
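      The check in pfn_range_valid_gigantic() becomes, roughly (sketch):

        for (i = start_pfn; i < end_pfn; i++) {
                struct page *page = pfn_to_online_page(i);

                /* NULL for offline sections and uninitialized memmaps */
                if (!page)
                        return false;
                if (PageReserved(page) || page_count(page) > 0 || PageHuge(page))
                        return false;
        }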
    • mm: memblock: do not enforce current limit for memblock_phys* family · f3057ad7
      Authored by Mike Rapoport
      Until commit 92d12f95 ("memblock: refactor internal allocation
      functions") the maximal address for memblock allocations was forced to
      memblock.current_limit only for the allocation functions returning
      virtual address.  The changes introduced by that commit moved the limit
      enforcement into the allocation core and as a result the allocation
      functions returning physical address also started to limit allocations
      to memblock.current_limit.
      
      This caused breakage of etnaviv GPU driver:
      
        etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
        etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
        etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
        etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
        etnaviv-gpu 130000.gpu: command buffer outside valid memory window
        etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
        etnaviv-gpu 134000.gpu: command buffer outside valid memory window
        etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
        etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
      
      Restore the behaviour of memblock_phys* family so that these functions
      will not enforce memblock.current_limit.
      
      Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
      Fixes: 92d12f95 ("memblock: refactor internal allocation functions")
      Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
      Reported-by: Adam Ford <aford173@gmail.com>
      Tested-by: Adam Ford <aford173@gmail.com>	[imx6q-logicpd]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Fabio Estevam <festevam@gmail.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f3057ad7
    • mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size · b11edebb
      Authored by Honglei Wang
      Commit 1a61ab80 ("mm: memcontrol: replace zone summing with
      lruvec_page_state()") made lruvec_page_state() use per-cpu counters
      instead of calculating the value directly from lru_zone_size, with the
      idea that this would be more efficient.
      
      Tim has reported that this is not really the case for their database
      benchmark, which shows the opposite result: lruvec_page_state is taking
      up a huge chunk of CPU cycles (about 25% of the system time, which is
      roughly 7% of total CPU cycles) on 5.3 kernels.  The workload is running
      on a large machine (96 CPUs), it has many cgroups (500) and it is
      heavily direct-reclaim bound.
      
      Tim Chen said:
      
      : The problem can also be reproduced by running simple multi-threaded
      : pmbench benchmark with a fast Optane SSD swap (see profile below).
      :
      :
      : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
      :             |
      :             |--3.07%--lruvec_lru_size
      :             |          |
      :             |          |--2.11%--cpumask_next
      :             |          |          |
      :             |          |           --1.66%--find_next_bit
      :             |          |
      :             |           --0.57%--call_function_interrupt
      :             |                     |
      :             |                      --0.55%--smp_call_function_interrupt
      :             |
      :             |--1.59%--0x441f0fc3d009
      :             |          _ops_rdtsc_init_base_freq
      :             |          access_histogram
      :             |          page_fault
      :             |          __do_page_fault
      :             |          handle_mm_fault
      :             |          __handle_mm_fault
      :             |          |
      :             |           --1.54%--do_swap_page
      :             |                     swapin_readahead
      :             |                     swap_cluster_readahead
      :             |                     |
      :             |                      --1.53%--read_swap_cache_async
      :             |                                __read_swap_cache_async
      :             |                                alloc_pages_vma
      :             |                                __alloc_pages_nodemask
      :             |                                __alloc_pages_slowpath
      :             |                                try_to_free_pages
      :             |                                do_try_to_free_pages
      :             |                                shrink_node
      :             |                                shrink_node_memcg
      :             |                                |
      :             |                                |--0.77%--lruvec_lru_size
      :             |                                |
      :             |                                 --0.76%--inactive_list_is_low
      :             |                                           |
      :             |                                            --0.76%--lruvec_lru_size
      :             |
      :              --1.50%--measure_read
      :                        page_fault
      :                        __do_page_fault
      :                        handle_mm_fault
      :                        __handle_mm_fault
      :                        do_swap_page
      :                        swapin_readahead
      :                        swap_cluster_readahead
      :                        |
      :                         --1.48%--read_swap_cache_async
      :                                   __read_swap_cache_async
      :                                   alloc_pages_vma
      :                                   __alloc_pages_nodemask
      :                                   __alloc_pages_slowpath
      :                                   try_to_free_pages
      :                                   do_try_to_free_pages
      :                                   shrink_node
      :                                   shrink_node_memcg
      :                                   |
      :                                   |--0.75%--inactive_list_is_low
      :                                   |          |
      :                                   |           --0.75%--lruvec_lru_size
      :                                   |
      :                                    --0.73%--lruvec_lru_size
      
      The likely culprit is the cache traffic the lruvec_page_state_local
      generates.  Dave Hansen says:
      
      : I was thinking purely of the cache footprint.  If it's reading
      : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
      : bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
      : be 18k of data for the whole system and the caching would be pretty
      : efficient and all 18k would probably survive a tight page fault loop in
      : the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
      : fit in the L1 and probably wouldn't survive a tight page fault loop if
      : both logical threads were banging on different cgroups.
      :
      : It's just a theory, but it's why I noted the number of cgroups when I
      : initially saw this show up in profiles
      
      Fix the regression by partially reverting the said commit and calculate
      the lru size explicitly.
      
      Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
      Fixes: 1a61ab80 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
      Signed-off-by: Honglei Wang <honglei.wang@oracle.com>
      Reported-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Tim Chen <tim.c.chen@linux.intel.com>
      Tested-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b11edebb
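      The partial revert makes lruvec_lru_size() sum the per-zone LRU sizes
      again instead of reading the per-cpu lruvec counter, roughly (sketch):

        unsigned long size = 0;
        int zid;

        for (zid = 0; zid <= zone_idx && zid < MAX_NR_ZONES; zid++) {
                struct zone *zone = &lruvec_pgdat(lruvec)->node_zones[zid];

                if (!managed_zone(zone))
                        continue;

                if (!mem_cgroup_disabled())
                        size += mem_cgroup_get_zone_lru_size(lruvec, lru, zid);
                else
                        size += zone_page_state(zone, NR_ZONE_LRU_BASE + lru);
        }
        return size;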
    • mm/gup: fix a misnamed "write" argument, and a related bug · 0cd22afd
      Authored by John Hubbard
      In several routines, the "flags" argument is incorrectly named "write".
      Change it to "flags".
      
      Also, in one place, the misnaming led to an actual bug:
      "flags & FOLL_WRITE" is required, rather than just "flags".
      (That problem was flagged by krobot, in v1 of this patch.)
      
      Also, change the flags argument from int, to unsigned int.
      
      You can see that this was a simple oversight, because the
      calling code passes "flags" to the fifth argument:
      
      gup_pgd_range():
          ...
          if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
      		    PGDIR_SHIFT, next, flags, pages, nr))
      
      ...which, until this patch, the callees referred to as "write".
      
      Also, change two lines to avoid checkpatch line length
      complaints, and another line to fix another oversight
      that checkpatch called out: missing "int" on pdshift.
      
      Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com
      Fixes: b798bec4 ("mm/gup: change write parameter to flags in fast walk")
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Reported-by: kbuild test robot <lkp@intel.com>
      Suggested-by: Kirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: Ira Weiny <ira.weiny@intel.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Ira Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cd22afd
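      The essence of the fix, in outline (gup_hugepte() is one of the
      affected fast-walk helpers; details abridged):

        static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
                               unsigned long end, unsigned int flags,
                               struct page **pages, int *nr)
        {
                pte_t pte = READ_ONCE(*ptep);

                /* was "write"; must test the FOLL_WRITE bit of the flags mask */
                if (!pte_access_permitted(pte, flags & FOLL_WRITE))
                        return 0;
                /* ... */
        }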
    • mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release · b749ecfa
      Authored by Roman Gushchin
      Karsten reported the following panic in __free_slab() happening on a s390x
      machine:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0000000000000000 TEID: 0000000000000483
        Fault in home space mode while using kernel ASCE.
        AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
        Oops: 0004 ilc:3 [#1] PREEMPT SMP
        Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
        Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
        Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
        Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
                   0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
        Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
                   00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
                  #00000000003cadb2: a7f4fe36            brc     15,3caa1e
                  >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
                   00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
                   00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
                   00000000003cadc6: a7f4fe2c            brc     15,3caa1e
                   00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
        Call Trace:
        (<00000000003cabcc> __free_slab+0x49c/0x6b0)
         <00000000001f5886> rcu_core+0x5a6/0x7e0
         <0000000000ca2dea> __do_softirq+0xf2/0x5c0
         <0000000000152644> irq_exit+0x104/0x130
         <000000000010d222> do_IRQ+0x9a/0xf0
         <0000000000ca2344> ext_int_handler+0x130/0x134
         <0000000000103648> enabled_wait+0x58/0x128
        (<0000000000103634> enabled_wait+0x44/0x128)
         <0000000000103b00> arch_cpu_idle+0x40/0x58
         <0000000000ca0544> default_idle_call+0x3c/0x68
         <000000000018eaa4> do_idle+0xec/0x1c0
         <000000000018ee0e> cpu_startup_entry+0x36/0x40
         <000000000122df34> arch_call_rest_init+0x5c/0x88
         <0000000000000000> 0x0
        INFO: lockdep is turned off.
        Last Breaking-Event-Address:
         <00000000003ca8f4> __free_slab+0x1c4/0x6b0
        Kernel panic - not syncing: Fatal exception in interrupt
      
      The kernel panics on an attempt to dereference the NULL memcg pointer.
      When shutdown_cache() is called from the kmem_cache_destroy() context, a
      memcg kmem_cache might have empty slab pages in a partial list, which are
      still charged to the memory cgroup.
      
      These pages are released by free_partial() at the beginning of
      shutdown_cache(): either directly or by scheduling a RCU-delayed work
      (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag).  The latter case
      is when the reported panic can happen: memcg_unlink_cache() is called
      immediately after shrinking partial lists, without waiting for scheduled
      RCU works.  It sets the kmem_cache->memcg_params.memcg pointer to NULL,
      and the following attempt to dereference it by __free_slab() from the
      RCU work context causes the panic.
      
      To fix the issue, let's postpone the release of the memcg pointer to
      destroy_memcg_params().  It's called from a separate work context by
      slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
      This guarantees that all scheduled page release RCU works will complete
      before the memcg pointer will be zeroed.
      
      Big thanks to Karsten for the perfect report containing all the
      necessary information, for his help with the analysis of the problem,
      and for testing the fix.
      
      Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
      Fixes: fb2f2b0a ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Karsten Graul <kgraul@linux.ibm.com>
      Tested-by: Karsten Graul <kgraul@linux.ibm.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b749ecfa
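      In outline, the memcg reference drop moves into destroy_memcg_params(),
      which runs only after the rcu_barrier() in
      slab_caches_to_rcu_destroy_workfn() (sketch; surrounding details
      omitted):

        static void destroy_memcg_params(struct kmem_cache *s)
        {
                if (is_root_cache(s)) {
                        kvfree(rcu_access_pointer(s->memcg_params.memcg_caches));
                } else {
                        /* safe now: all __free_slab() RCU work has completed */
                        mem_cgroup_put(s->memcg_params.memcg);
                        WRITE_ONCE(s->memcg_params.memcg, NULL);
                }
        }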
    • mm/memunmap: don't access uninitialized memmap in memunmap_pages() · 77e080e7
      Authored by Aneesh Kumar K.V
      Patch series "mm/memory_hotplug: Shrink zones before removing memory",
      v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespace using
      memunmap_pages() - ZONE_DEVICE, reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect
      the ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones, however, I reduced the amount
      of code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, especially, on memory unplug of
      DIMMs at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows to properly shrink zones on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
        :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
        :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  98304
                present  65536
                managed  65536
        :/# echo 0 > /sys/devices/system/memory/memory43/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  32768
                present  32768
                managed  32768
        :/# echo 0 > /sys/devices/system/memory/memory41/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
      
      This patch (of 10):
      
      With an altmap, the memmap falling into the reserved altmap space are not
      initialized and, therefore, contain a garbage NID and a garbage zone.
      Make sure to read the NID/zone from a memmap that was initialized.
      
      This fixes a kernel crash that is observed when destroying a namespace:
      
        kernel BUG at include/linux/mm.h:1107!
        cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
            pc: c0000000004b9728: memunmap_pages+0x238/0x340
            lr: c0000000004b9724: memunmap_pages+0x234/0x340
        ...
            pid   = 3669, comm = ndctl
        kernel BUG at include/linux/mm.h:1107!
          devm_action_release+0x30/0x50
          release_nodes+0x268/0x2d0
          device_release_driver_internal+0x174/0x240
          unbind_store+0x13c/0x190
          drv_attr_store+0x44/0x60
          sysfs_kf_write+0x70/0xa0
          kernfs_fop_write+0x1ac/0x290
          __vfs_write+0x3c/0x70
          vfs_write+0xe4/0x200
          ksys_write+0x7c/0x140
          system_call+0x5c/0x68
      
      The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f ("mm,
      devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
      think we will never have driver reserved memory with
      MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
      
      [david@redhat.com: minimize code changes, rephrase description]
      Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
      Fixes: 2c2a5af6 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      77e080e7
    • mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span() · 00d6c019
      Authored by David Hildenbrand
      We might use the nid of memmaps that were never initialized.  For
      example, if the memmap was poisoned, we will crash the kernel in
      pfn_to_nid() right now.  Let's use the calculated boundaries of the
      separate zones instead.  This now also avoids having to iterate over a
      whole bunch of subsections again, after shrinking one zone.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized to 0 and the node was set to the
      right value.  After that commit, the node might be garbage.
      
      We'll have to fix shrink_zone_span() next.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00d6c019
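      The replacement derives the node span from the already-known zone spans
      instead of walking (possibly uninitialized) memmaps, roughly (sketch):

        static void update_pgdat_span(struct pglist_data *pgdat)
        {
                unsigned long node_start_pfn = 0, node_end_pfn = 0;
                struct zone *zone;

                for (zone = pgdat->node_zones;
                     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
                        unsigned long end = zone_end_pfn(zone);

                        if (!zone->spanned_pages)
                                continue;
                        if (!node_end_pfn) {            /* first populated zone */
                                node_start_pfn = zone->zone_start_pfn;
                                node_end_pfn = end;
                                continue;
                        }
                        node_end_pfn = max(node_end_pfn, end);
                        node_start_pfn = min(node_start_pfn, zone->zone_start_pfn);
                }

                pgdat->node_start_pfn = node_start_pfn;
                pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;
        }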
    • mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo · a26ee565
      Authored by Qian Cai
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      For example, when not onlining a memory block that is spanned by a zone
      and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
      CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
      
        :/# echo 1 > /sys/devices/system/memory/memory40/online
        :/# echo 1 > /sys/devices/system/memory/memory42/online
        :/# cat /proc/pagetypeinfo > test.file
         page:fffff2c585200000 is uninitialized and poisoned
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
         There is not page extension available.
         ------------[ cut here ]------------
         kernel BUG at include/linux/mm.h:1107!
         invalid opcode: 0000 [#1] SMP NOPTI
      
      Please note that this change does not affect ZONE_DEVICE, because
      pagetypeinfo_showmixedcount_print() is called from
      mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
      ZONE_DEVICE is never populated (zone->present_pages always 0).
      
      [david@redhat.com: move check to outer loop, add comment, rephrase description]
      Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a26ee565
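      The guard added to the outer pageblock loop is essentially (sketch):

        for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
                /* skip whole pageblocks whose memmap was never initialized */
                page = pfn_to_online_page(pfn);
                if (!page)
                        continue;
                /* ... existing per-pageblock mixed-count accounting ... */
        }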
    • mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() · 96c804a6
      Authored by David Hildenbrand
      We should check for pfn_to_online_page() to not access uninitialized
      memmaps.  Reshuffle the code so we don't have to duplicate the error
      message.
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      96c804a6
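      The reshuffled check at the top of memory_failure() looks roughly like
      this (sketch; the error string is approximate):

        p = pfn_to_online_page(pfn);
        if (!p) {
                /* covers both "no memmap at all" and "memmap not initialized" */
                pr_err("Memory failure: %#lx: memory outside kernel control\n",
                       pfn);
                return -ENXIO;
        }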
  3. 15 October 2019, 9 commits
    • mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once · 3d7fed4a
      Authored by Jane Chu
      Mmap /dev/dax more than once, then read the poisoned location using the
      address from one of the mappings.  The other mappings, which do not have
      the page mapped in, cause SIGKILLs to be delivered to the process.
      SIGKILL takes precedence over SIGBUS, so the user process loses the
      opportunity to handle the UE.
      
      Although one may add MAP_POPULATE to mmap(2) to work around the issue,
      MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
      isn't always an option.
      
      Details -
      
        ndctl inject-error --block=10 --count=1 namespace6.0
      
        ./read_poison -x dax6.0 -o 5120 -m 2
        mmaped address 0x7f5bb6600000
        mmaped address 0x7f3cf3600000
        doing local read at address 0x7f3cf3601400
        Killed
      
      Console messages in instrumented kernel -
      
        mce: Uncorrected hardware memory error in user-access at edbe201400
        Memory failure: tk->addr = 7f5bb6601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        dev_pagemap_mapping_shift: page edbe201: no PUD
        Memory failure: tk->size_shift == 0
        Memory failure: Unable to find user space address edbe201 in read_poison
        Memory failure: tk->addr = 7f3cf3601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        Memory failure: tk->size_shift = 21
        Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
          => to deliver SIGKILL
        Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
          => to deliver SIGBUS
      
      Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d7fed4a
    • mm/slab.c: fix kernel-doc warning for __ksize() · 87bf4f71
      Authored by Randy Dunlap
      Fix kernel-doc warning in mm/slab.c:
      
        mm/slab.c:4215: warning: Function parameter or member 'objp' not described in '__ksize'
      
      Also add Return: documentation section for this function.
      
      Link: http://lkml.kernel.org/r/68c9fd7d-f09e-d376-e292-c7b2bdf1774d@infradead.org
      Fixes: 10d1f8cb ("mm/slab: refactor common ksize KASAN logic into slab_common.c")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Acked-by: Marco Elver <elver@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      87bf4f71
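      The added kernel-doc is along these lines (wording approximate):

        /**
         * __ksize -- Uninstrumented ksize.
         * @objp: pointer to the object
         *
         * Unlike ksize(), __ksize() is uninstrumented, and does not provide the
         * same safety checks as ksize() with KASAN instrumentation enabled.
         *
         * Return: size of the actual memory used by @objp in bytes
         */
        size_t __ksize(const void *objp);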
    • mm, compaction: fix wrong pfn handling in __reset_isolation_pfn() · a2e9a5af
      Authored by Vlastimil Babka
      Florian and Dave reported [1] a NULL pointer dereference in
      __reset_isolation_pfn().  While the exact cause is unclear, staring at
      the code revealed two bugs, which might be related.
      
      One bug is that if a zone starts in the middle of a pageblock, block_page
      might correspond to a different pfn than block_pfn, and then the
      pfn_valid_within() checks will check different pfns than those accessed
      via struct page.  This might result in accessing an uninitialized page in
      CONFIG_HOLES_IN_ZONE configs.
      
      The other bug is that end_page refers to the first page of next
      pageblock and not last page of current pageblock.  The online and valid
      check is then wrong and with sections, the while (page < end_page) loop
      might wander off actual struct page arrays.
      
      [1] https://lore.kernel.org/linux-xfs/87o8z1fvqu.fsf@mid.deneb.enyo.de/
      
      Link: http://lkml.kernel.org/r/20191008152915.24704-1-vbabka@suse.cz
      Fixes: 6b0868c8 ("mm/compaction.c: correct zone boundary handling when resetting pageblock skip hints")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Florian Weimer <fw@deneb.enyo.de>
      Reported-by: Dave Chinner <david@fromorbit.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2e9a5af
    • mm, hugetlb: allow hugepage allocations to reclaim as needed · 3f36d866
      Authored by David Rientjes
      Commit b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when
      compaction may not succeed") has changed the allocator to bail out
      early in order to prevent a potentially excessive memory
      reclaim.  __GFP_RETRY_MAYFAIL is designed to retry the allocation,
      reclaim and compaction loop as long as there is a reasonable chance to
      make forward progress.  Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
      the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.
      
      The most obviously affected subsystem is hugetlbfs, which allocates huge
      pages based on an admin request (or via admin-configured overcommit).  I
      have done a simple test which tries to allocate half of the memory for
      hugetlb pages while the memory is full of a clean page cache.  This is
      not an unusual situation because we try to cache as much of the memory
      as possible, and the sysctl/sysfs interface for allocating huge pages is
      there precisely so that hugetlb pages can be allocated at any time.
      
      The system has 1GB of RAM and we are requesting 512MB worth of hugetlb pages
      after the memory is prefilled by a clean page cache:
      
        root@test1:~# cat hugetlb_test.sh
      
        set -x
        echo 0 > /proc/sys/vm/nr_hugepages
        echo 3 > /proc/sys/vm/drop_caches
        echo 1 > /proc/sys/vm/compact_memory
        dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
        TS=$(date +%s)
        echo 256 > /proc/sys/vm/nr_hugepages
        cat /proc/sys/vm/nr_hugepages
      
      The results for 2 consecutive runs on clean 5.3
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
        + date +%s
        + TS=1569905284
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
        + date +%s
        + TS=1569905311
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        256
      
      Now with b39d0ee2 applied
      
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
        + date +%s
        + TS=1569905516
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        11
        root@test1:~# sh hugetlb_test.sh
        + echo 0
        + echo 3
        + echo 1
        + dd if=/mnt/data/file-1G of=/dev/null bs=4096
        262144+0 records in
        262144+0 records out
        1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
        + date +%s
        + TS=1569905541
        + echo 256
        + cat /proc/sys/vm/nr_hugepages
        12
      
      The success rate went down by a factor of 20!
      
      Hugetlb allocation requests might fail, and it is reasonable to expect
      them to under extremely fragmented memory or heavy memory pressure, but
      the situation above is not that case.
      
      Fix the regression by reverting to the previous behavior for
      __GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
      those requests.
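
      A rough sketch of the shape of the resulting check in the allocator
      slowpath (simplified; the real code in __alloc_pages_slowpath() is
      structured differently):

        /*
         * Bail out early for costly orders only when compaction cannot help
         * AND the caller did not ask to keep retrying.
         */
        if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL) &&
            (compact_result == COMPACT_SKIPPED ||
             compact_result == COMPACT_DEFERRED))
                goto nopage;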
      
      Mike said:
      
      : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
      : boot where this may not be as much of an issue.  However, I am aware of at
      : least three use cases where allocations are made after the system has been
      : up and running for quite some time:
      :
      : - DB reconfiguration.  If sysctl/sysfs fails to get required number of
      :   huge pages, system is rebooted to perform allocation after boot.
      :
      : - VM provisioning.  If unable to get the required number of huge pages,
      :   fall back to base pages.
      :
      : - An application that does not preallocate pool, but rather allocates
      :   pages at fault time for optimal NUMA locality.
      :
      : In all cases, I would expect b39d0ee2 to cause regressions and
      : noticeable behavior changes.
      :
      : My quick/limited testing in
      : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
      : was insufficient.  It was also mentioned that if something like
      : b39d0ee2 went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
      : requests as in this patch.
      
      [mhocko@suse.com: reworded changelog]
      Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
      Fixes: b39d0ee2 ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f36d866
    • A
      mm/slub.c: init_on_free=1 should wipe freelist ptr for bulk allocations · 0f181f9f
      Alexander Potapenko 提交于
      slab_alloc_node() already zeroed out the freelist pointer if
      init_on_free was on.  Thibaut Sautereau noticed that the same needs to
      be done for kmem_cache_alloc_bulk(), which performs the allocations
      separately.
      
      kmem_cache_alloc_bulk() is currently used in two places in the kernel,
      so this change is unlikely to have a major performance impact.
      
      SLAB doesn't require a similar change, as auto-initialization makes the
      allocator store the freelist pointers off-slab.
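
      A hedged sketch of the idea, with a hypothetical helper name; the real
      change wires the same check into kmem_cache_alloc_bulk()'s fast and slow
      paths in mm/slub.c:

        /* zero the in-object freelist pointer if init_on_free is enabled */
        static __always_inline void wipe_obj_freeptr(struct kmem_cache *s,
                                                     void *obj)
        {
                if (unlikely(slab_want_init_on_free(s)) && obj)
                        memset(obj + s->offset, 0, sizeof(void *));
        }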
      
      Link: http://lkml.kernel.org/r/20191007091605.30530-1-glider@google.com
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: NAlexander Potapenko <glider@google.com>
      Reported-by: NThibaut Sautereau <thibaut@sautereau.fr>
      Reported-by: NKees Cook <keescook@chromium.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Laura Abbott <labbott@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f181f9f
    • Q
      mm/slub: fix a deadlock in show_slab_objects() · e4f8e513
      Qian Cai 提交于
      A long time ago we fixed a similar deadlock in show_slab_objects() [1].
      However, apparently due to commits like 01fb58bc ("slab: remove
      synchronous synchronize_sched() from memcg cache deactivation path") and
      03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}"),
      this kind of deadlock is back: just reading files in /sys/kernel/slab
      will generate the lockdep splat below.
      
      Since the "mem_hotplug_lock" here is only used to obtain a stable online
      node mask while racing with NUMA node hotplug, in the worst case the
      results may be miscalculated during NUMA node hotplug, but they shall be
      corrected by later reads of the same files.
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        cat/5224 is trying to acquire lock:
        ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
        show_slab_objects+0x94/0x3a8
      
        but task is already holding lock:
        b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (kn->count#45){++++}:
               lock_acquire+0x31c/0x360
               __kernfs_remove+0x290/0x490
               kernfs_remove+0x30/0x44
               sysfs_remove_dir+0x70/0x88
               kobject_del+0x50/0xb0
               sysfs_slab_unlink+0x2c/0x38
               shutdown_cache+0xa0/0xf0
               kmemcg_cache_shutdown_fn+0x1c/0x34
               kmemcg_workfn+0x44/0x64
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #1 (slab_mutex){+.+.}:
               lock_acquire+0x31c/0x360
               __mutex_lock_common+0x16c/0xf78
               mutex_lock_nested+0x40/0x50
               memcg_create_kmem_cache+0x38/0x16c
               memcg_kmem_cache_create_func+0x3c/0x70
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #0 (mem_hotplug_lock.rw_sem){++++}:
               validate_chain+0xd10/0x2bcc
               __lock_acquire+0x7f4/0xb8c
               lock_acquire+0x31c/0x360
               get_online_mems+0x54/0x150
               show_slab_objects+0x94/0x3a8
               total_objects_show+0x28/0x34
               slab_attr_show+0x38/0x54
               sysfs_kf_seq_show+0x198/0x2d4
               kernfs_seq_show+0xa4/0xcc
               seq_read+0x30c/0x8a8
               kernfs_fop_read+0xa8/0x314
               __vfs_read+0x88/0x20c
               vfs_read+0xd8/0x10c
               ksys_read+0xb0/0x120
               __arm64_sys_read+0x54/0x88
               el0_svc_handler+0x170/0x240
               el0_svc+0x8/0xc
      
        other info that might help us debug this:
      
        Chain exists of:
          mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(kn->count#45);
                                       lock(slab_mutex);
                                       lock(kn->count#45);
          lock(mem_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        3 locks held by cat/5224:
         #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
         #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
         #2: b8ff009693eee398 (kn->count#45){++++}, at:
        kernfs_seq_start+0x44/0xf0
      
        stack backtrace:
        Call trace:
         dump_backtrace+0x0/0x248
         show_stack+0x20/0x2c
         dump_stack+0xd0/0x140
         print_circular_bug+0x368/0x380
         check_noncircular+0x248/0x250
         validate_chain+0xd10/0x2bcc
         __lock_acquire+0x7f4/0xb8c
         lock_acquire+0x31c/0x360
         get_online_mems+0x54/0x150
         show_slab_objects+0x94/0x3a8
         total_objects_show+0x28/0x34
         slab_attr_show+0x38/0x54
         sysfs_kf_seq_show+0x198/0x2d4
         kernfs_seq_show+0xa4/0xcc
         seq_read+0x30c/0x8a8
         kernfs_fop_read+0xa8/0x314
         __vfs_read+0x88/0x20c
         vfs_read+0xd8/0x10c
         ksys_read+0xb0/0x120
         __arm64_sys_read+0x54/0x88
         el0_svc_handler+0x170/0x240
         el0_svc+0x8/0xc
      
      I think it is important to mention that this doesn't expose
      show_slab_objects() to a use-after-free.  There is only a single path
      that might really race here, and that is the slab hotplug notifier
      callback __kmem_cache_shrink() (via slab_mem_going_offline_callback()),
      but that path doesn't really destroy kmem_cache_node data structures.
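
      Schematically, the fix amounts to dropping the
      get_online_mems()/put_online_mems() pair around the node walk; a sketch
      (the real function also handles the SO_ALL/SO_OBJECTS flags and per-cpu
      slabs):

        /*
         * No lock against memory hotplug: a transiently stale node mask only
         * skews the numbers; a later read of the same file corrects them.
         */
        for_each_online_node(node) {
                struct kmem_cache_node *n = get_node(s, node);

                if (n)
                        x += atomic_long_read(&n->total_objects);
        }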
      
      [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
      
      [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
      Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
      Fixes: 01fb58bc ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
      Fixes: 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e4f8e513
    • V
      mm, page_owner: rename flag indicating that page is allocated · fdf3bf80
      Vlastimil Babka 提交于
      Commit 37389167 ("mm, page_owner: keep owner info when freeing the
      page") has introduced a flag PAGE_EXT_OWNER_ACTIVE to indicate that page
      is tracked as being allocated.  Kirill suggested naming it
      PAGE_EXT_OWNER_ALLOCATED to make it clearer, as "active is somewhat
      loaded term for a page".
      
      Link: http://lkml.kernel.org/r/20190930122916.14969-4-vbabka@suse.cz
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Suggested-by: NKirill A. Shutemov <kirill@shutemov.name>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fdf3bf80
    • V
      mm, page_owner: decouple freeing stack trace from debug_pagealloc · 0fe9a448
      Vlastimil Babka 提交于
      Commit 8974558f ("mm, page_owner, debug_pagealloc: save and dump
      freeing stack trace") enhanced page_owner to also store freeing stack
      trace, when debug_pagealloc is also enabled.  KASAN would also like to
      do this [1] to improve error reports to debug e.g. UAF issues.
      
      Kirill has suggested that it should also be possible to enable freeing
      stack trace saving separately from KASAN or debug_pagealloc, i.e.  with
      an extra boot option.  Qian argued that we have enough options
      already, and avoiding the extra overhead is not worth the complications
      in the case of a debugging option.  Kirill noted that the extra stack
      handle in struct page_owner requires 0.1% of memory.
      
      This patch therefore enables free stack saving whenever page_owner is
      enabled, regardless of whether debug_pagealloc or KASAN is also enabled.
      KASAN kernels booted with page_owner=on will thus benefit from the
      improved error reports.
      
      [1] https://bugzilla.kernel.org/show_bug.cgi?id=203967
      
      [vbabka@suse.cz: v3]
        Link: http://lkml.kernel.org/r/20191007091808.7096-3-vbabka@suse.cz
      Link: http://lkml.kernel.org/r/20190930122916.14969-3-vbabka@suse.cz
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NQian Cai <cai@lca.pw>
      Suggested-by: NDmitry Vyukov <dvyukov@google.com>
      Suggested-by: NWalter Wu <walter-zh.wu@mediatek.com>
      Suggested-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Suggested-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0fe9a448
    • V
      mm, page_owner: fix off-by-one error in __set_page_owner_handle() · 5556cfe8
      Vlastimil Babka 提交于
      Patch series "followups to debug_pagealloc improvements through
      page_owner", v3.
      
      These are followups to [1] which made it to Linus meanwhile.  Patches 1
      and 3 are based on Kirill's review, patch 2 on KASAN request [2].  It
      would be nice if all of this made it to 5.4 with [1] already there (or
      at least Patch 1).
      
      This patch (of 3):
      
      As noted by Kirill, commit 7e2f2a0c ("mm, page_owner: record page
      owner for each subpage") has introduced an off-by-one error in
      __set_page_owner_handle() when looking up page_ext for subpages.  As a
      result, the head page page_owner info is set twice, while for the last
      tail page, it's not set at all.
      
      Fix this and also make the code more efficient by advancing the page_ext
      pointer we already have, instead of calling lookup_page_ext() for each
      subpage.  Since the full size of struct page_ext is not known at compile
      time, we can't use a simple page_ext++ statement, so introduce a
      page_ext_next() inline function for that.
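
      A sketch of such a helper (the name of the runtime size variable below
      is assumed; the real struct page_ext size depends on which debug
      features are compiled in and enabled):

        static inline struct page_ext *page_ext_next(struct page_ext *curr)
        {
                void *next = curr;

                /* the full page_ext size is only known at boot, so no ++ */
                next += page_ext_size;
                return next;
        }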
      
      Link: http://lkml.kernel.org/r/20190930122916.14969-2-vbabka@suse.cz
      Fixes: 7e2f2a0c ("mm, page_owner: record page owner for each subpage")
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reported-by: NKirill A. Shutemov <kirill@shutemov.name>
      Reported-by: NMiles Chen <miles.chen@mediatek.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Walter Wu <walter-zh.wu@mediatek.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5556cfe8
  4. 14 10月, 2019 1 次提交
  5. 10 10月, 2019 1 次提交
  6. 08 10月, 2019 7 次提交
    • V
      mm, sl[aou]b: guarantee natural alignment for kmalloc(power-of-two) · 59bb4798
      Vlastimil Babka 提交于
      In most configurations, kmalloc() happens to return naturally aligned
      (i.e.  aligned to the block size itself) blocks for power of two sizes.
      
      That means some kmalloc() users might unknowingly rely on that
      alignment, until stuff breaks when the kernel is built with e.g.
      CONFIG_SLUB_DEBUG or CONFIG_SLOB, and blocks stop being aligned.  Then
      developers have to devise workarounds such as their own kmem caches with
      specified alignment [1], which is not always practical, as recently
      evidenced in [2].
      
      The topic has been discussed at LSF/MM 2019 [3].  Adding a
      'kmalloc_aligned()' variant would not help with code unknowingly relying
      on the implicit alignment.  For slab implementations it would either
      require creating more kmalloc caches, or allocate a larger size and only
      give back part of it.  That would be wasteful, especially with a generic
      alignment parameter (in contrast with a fixed alignment to size).
      
      Ideally we should provide to mm users what they need without difficult
      workarounds or their own reimplementations, so let's make the kmalloc()
      alignment to size explicitly guaranteed for power-of-two sizes under all
      configurations.  What does this mean for the three available allocators?
      
      * SLAB object layout happens to be mostly unchanged by the patch.  The
        implicitly provided alignment could be compromised with
        CONFIG_DEBUG_SLAB due to redzoning, however SLAB disables redzoning for
        caches with alignment larger than unsigned long long.  Practically on at
        least x86 this includes kmalloc caches as they use cache line alignment,
        which is larger than that.  Still, this patch ensures alignment on all
        arches and cache sizes.
      
      * SLUB layout is also unchanged unless redzoning is enabled through
        CONFIG_SLUB_DEBUG and boot parameter for the particular kmalloc cache.
        With this patch, explicit alignment is guaranteed with redzoning as
        well.  This will result in more memory being wasted, but that should be
        acceptable in a debugging scenario.
      
      * SLOB has no implicit alignment so this patch adds it explicitly for
        kmalloc().  The potential downside is increased fragmentation.  While
        pathological allocation scenarios are certainly possible, in my testing,
        after booting an x86_64 kernel+userspace with virtme, around 16MB memory
        was consumed by slab pages both before and after the patch, with the
        difference in the noise.
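
      As a usage-level illustration (not part of the patch), callers may now
      rely on natural alignment for power-of-two sizes under any of the three
      allocators:

        void *buf = kmalloc(4096, GFP_KERNEL);

        /* guaranteed by this patch: a 4096-byte allocation is 4096-aligned */
        WARN_ON(buf && ((unsigned long)buf & (4096 - 1)));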
      
      [1] https://lore.kernel.org/linux-btrfs/c3157c8e8e0e7588312b40c853f65c02fe6c957a.1566399731.git.christophe.leroy@c-s.fr/
      [2] https://lore.kernel.org/linux-fsdevel/20190225040904.5557-1-ming.lei@redhat.com/
      [3] https://lwn.net/Articles/787740/
      
      [akpm@linux-foundation.org: documentation fixlet, per Matthew]
      Link: http://lkml.kernel.org/r/20190826111627.7505-3-vbabka@suse.cz
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NChristoph Hellwig <hch@lst.de>
      Cc: David Sterba <dsterba@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      59bb4798
    • V
      mm, sl[ou]b: improve memory accounting · 6a486c0a
      Vlastimil Babka 提交于
      Patch series "guarantee natural alignment for kmalloc()", v2.
      
      This patch (of 2):
      
      SLOB currently doesn't account its pages at all, so in /proc/meminfo the
      Slab field shows zero.  Modifying a counter on page allocation and
      freeing should be acceptable even for the small system scenarios SLOB is
      intended for.  Since reclaimable caches are not separated in SLOB,
      account everything as unreclaimable.
      
      SLUB currently doesn't account kmalloc() and kmalloc_node() allocations
      larger than order-1 page, that are passed directly to the page
      allocator.  As they also don't appear in /proc/slabinfo, it might look
      like a memory leak.  For consistency, account them as well.  (SLAB
      doesn't actually use page allocator directly, so no change there).
      
      Ideally SLOB and SLUB would be handled in separate patches, but due to
      the shared kmalloc_order() function and different kfree()
      implementations, it's easier to patch both at once to prevent
      inconsistencies.
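
      Schematically, the accounting boils down to adjusting the node's
      NR_SLAB_UNRECLAIMABLE counter wherever SLOB and SLUB hand pages straight
      to or from the page allocator (a sketch; the exact hooks sit in SLOB's
      page helpers and in kmalloc_order()/kfree()):

        /* allocation side: an order-N page obtained from the page allocator */
        mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
                            1 << order);

        /* matching decrement when the page goes back to the page allocator */
        mod_node_page_state(page_pgdat(page), NR_SLAB_UNRECLAIMABLE,
                            -(1 << order));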
      
      Link: http://lkml.kernel.org/r/20190826111627.7505-2-vbabka@suse.cz
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: "Darrick J . Wong" <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6a486c0a
    • C
      mm, memcg: make scan aggression always exclude protection · 1bc63fb1
      Chris Down 提交于
      This patch is an incremental improvement on the existing
      memory.{low,min} relative reclaim work to base its scan pressure
      calculations on how much protection is available compared to the current
      usage, rather than how much the current usage is over some protection
      threshold.
      
      In the normal case this change doesn't alter the user experience much.
      One benefit is that it replaces the (somewhat arbitrary)
      100% cutoff with an indefinite slope, which makes it easier to ballpark
      a memory.low value.
      
      As well as this, the old methodology doesn't quite apply generically to
      machines with varying amounts of physical memory.  Let's say we have a
      top level cgroup, workload.slice, and another top level cgroup,
      system-management.slice.  We want to roughly give 12G to
      system-management.slice, so on a 32GB machine we set memory.low to 20GB
      in workload.slice, and on a 64GB machine we set memory.low to 52GB.
      However, because these are relative amounts to the total machine size,
      while the amount of memory we want to generally be willing to yield to
      system.slice is absolute (12G), we end up putting more pressure on
      system.slice just because we have a larger machine and a larger workload
      to fill it, which seems fairly unintuitive.  With this new behaviour, we
      don't end up with this unintended side effect.
      
      Previously, the way memory.low protection worked was that if you were
      50% over a certain baseline, you got 50% of your normal scan pressure.
      This is certainly better than the previous cliff-edge behaviour, but it
      can be improved even further by always considering memory under the
      currently enforced protection threshold to be out of bounds.  This means
      that we can set relatively low memory.low thresholds for variable or
      bursty workloads while still getting a reasonable level of protection,
      whereas with the previous version we may still trivially hit the 100%
      clamp.  The previous 100% clamp is also somewhat arbitrary, whereas this
      one is more concretely based on the currently enforced protection
      threshold, which is likely easier to reason about.
      
      There is also a subtle issue with the way that proportional reclaim
      worked previously -- it promotes having no memory.low, since it makes
      pressure higher during low reclaim.  This happens because we base our
      scan pressure modulation on how far memory.current is between memory.min
      and memory.low, but if memory.low is unset, we only use the overage
      method.  In most cromulent configurations, this then means that we end
      up with *more* pressure than with no memory.low at all when we're in low
      reclaim, which is not really very usable or expected.
      
      With this patch, memory.low and memory.min affect reclaim pressure in a
      more understandable and composable way.  For example, from a user
      standpoint, "protected" memory now remains untouchable from a reclaim
      aggression standpoint, and users can also have more confidence that
      bursty workloads will still receive some amount of guaranteed
      protection.
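
      The intended effect can be summarised by a simple rule (an illustration
      of the behaviour, not the literal vmscan.c arithmetic): memory under the
      currently enforced protection never contributes to the scan target, and
      pressure grows with the overage only:

        /* illustrative only: protection = enforced memory.min/memory.low */
        if (usage > protection)
                scan = lruvec_size * (usage - protection) / usage;
        else
                scan = 0;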
      
      Link: http://lkml.kernel.org/r/20190322160307.GA3316@chrisdown.name
      Signed-off-by: NChris Down <chris@chrisdown.name>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1bc63fb1
    • C
      mm, memcg: make memory.emin the baseline for utilisation determination · 9de7ca46
      Chris Down 提交于
      Roman points out that when we do the low reclaim pass, we scale the
      reclaim pressure relative to position between 0 and the maximum
      protection threshold.
      
      However, if the maximum protection is based on memory.elow, and
      memory.emin is above zero, this means we still may get binary behaviour
      on second-pass low reclaim.  This is because we scale starting at 0, not
      starting at memory.emin, and since we don't scan at all below emin, we
      end up with cliff behaviour.
      
      This should be a fairly uncommon case since usually we don't go into the
      second pass, but it makes sense to scale our low reclaim pressure
      starting at emin.
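
      Conceptually (not the kernel's exact arithmetic), the second-pass
      pressure now ramps over the emin..elow range instead of the 0..elow
      range:

        /* conceptual second-pass (low reclaim) scaling, assuming elow > emin */
        if (usage <= emin)
                scan = 0;                       /* never scan below emin */
        else
                scan = min(lruvec_size,
                           lruvec_size * (usage - emin) / (elow - emin));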
      
      You can test this by catting two large sparse files, one in a cgroup
      with emin set to some moderate size compared to physical RAM, and
      another cgroup without any emin.  In both cgroups, set an elow larger
      than 50% of physical RAM.  The one with emin will have less page
      scanning, as reclaim pressure is lower.
      
      Rebase on top of and apply the same idea as what was applied to handle
      cgroup_memory=disable properly for the original proportional patch
      http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name ("mm,
      memcg: Handle cgroup_disable=memory when getting memcg protection").
      
      Link: http://lkml.kernel.org/r/20190201051810.GA18895@chrisdown.name
      Signed-off-by: NChris Down <chris@chrisdown.name>
      Suggested-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9de7ca46
    • C
      mm, memcg: proportional memory.{low,min} reclaim · 9783aa99
      Chris Down 提交于
      cgroup v2 introduces two memory protection thresholds: memory.low
      (best-effort) and memory.min (hard protection).  While they generally do
      what they say on the tin, there is a limitation in their implementation
      that makes them difficult to use effectively: cliff behaviour often
      manifests when they become eligible for reclaim.  This patch implements
      more intuitive and usable behaviour, where we gradually mount more
      reclaim pressure as cgroups further and further exceed their protection
      thresholds.
      
      This cliff edge behaviour happens because we only choose whether or not
      to reclaim based on whether the memcg is within its protection limits
      (see the use of mem_cgroup_protected in shrink_node), but we don't vary
      our reclaim behaviour based on this information.  Imagine the following
      timeline, with the numbers the lruvec size in this zone:
      
      1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
      2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
      3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
         scanned. (?!)
      
      * Of course, we won't usually scan all available pages in the zone even
        without this patch because of scan control priority, over-reclaim
        protection, etc.  However, as shown by the tests at the end, these
        techniques don't sufficiently throttle such an extreme change in input,
        so cliff-like behaviour isn't really averted by their existence alone.
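
      To make the contrast concrete, here is the same timeline (memory.low =
      1000000) pushed through a proportional rule of the form scan ~=
      lruvec_size * overage / usage, a simplified stand-in for the patch's
      arithmetic:

        memory.current=1000001: scan ~= lruvec_size *       1/1000001  (~ 0%)
        memory.current=1100000: scan ~= lruvec_size *  100000/1100000  (~ 9%)
        memory.current=2000000: scan ~= lruvec_size * 1000000/2000000  (= 50%)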
      
      Here's an example of how this plays out in practice.  At Facebook, we are
      trying to protect various workloads from "system" software, like
      configuration management tools, metric collectors, etc (see this[0] case
      study).  In order to find a suitable memory.low value, we start by
      determining the expected memory range within which the workload will be
      comfortable operating.  This isn't an exact science -- memory usage deemed
      "comfortable" will vary over time due to user behaviour, differences in
      composition of work, etc, etc.  As such we need to ballpark memory.low,
      but doing this is currently problematic:
      
      1. If we end up setting it too low for the workload, it won't have
         *any* effect (see discussion above).  The group will receive the full
         weight of reclaim and won't have any priority while competing with the
         less important system software, as if we had no memory.low configured
         at all.
      
      2. Because of this behaviour, we end up erring on the side of setting
         it too high, such that the comfort range is reliably covered.  However,
         protected memory is completely unavailable to the rest of the system,
         so we might cause undue memory and IO pressure there when we *know* we
         have some elasticity in the workload.
      
      3. Even if we get the value totally right, smack in the middle of the
         comfort zone, we get extreme jumps between no pressure and full
         pressure that cause unpredictable pressure spikes in the workload due
         to the current binary reclaim behaviour.
      
      With this patch, we can set it to our ballpark estimation without too much
      worry.  Any undesirable behaviour, such as too much or too little reclaim
      pressure on the workload or system will be proportional to how far our
      estimation is off.  This means we can set memory.low much more
      conservatively and thus waste less resources *without* the risk of the
      workload falling off a cliff if we overshoot.
      
      As a more abstract technical description, this unintuitive behaviour
      results in having to give high-priority workloads a large protection
      buffer on top of their expected usage to function reliably, as otherwise
      we have abrupt periods of dramatically increased memory pressure which
      hamper performance.  Having to set these thresholds so high wastes
      resources and generally works against the principle of work conservation.
      In addition, having proportional memory reclaim behaviour has other
      benefits.  Most notably, before this patch it's basically mandatory to set
      memory.low to a higher than desirable value because otherwise as soon as
      you exceed memory.low, all protection is lost, and all pages are eligible
      to scan again.  By contrast, having a gradual ramp in reclaim pressure
      means that you now still get some protection when thresholds are exceeded,
      which means that one can now be more comfortable setting memory.low to
      lower values without worrying that all protection will be lost.  This is
      important because workingset size is really hard to know exactly,
      especially with variable workloads, so at least getting *some* protection
      if your workingset size grows larger than you expect increases user
      confidence in setting memory.low without a huge buffer on top being
      needed.
      
      Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
      assistance in thinking about how to make this work better.
      
      In testing these changes, I intended to verify that:
      
      1. Changes in page scanning become gradual and proportional instead of
         binary.
      
         To test this, I experimented stepping further and further down
         memory.low protection on a workload that floats around 19G workingset
         when under memory.low protection, watching page scan rates for the
         workload cgroup:
      
         +------------+-----------------+--------------------+--------------+
         | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
         +------------+-----------------+--------------------+--------------+
         |        21G |               0 |                  0 | N/A          |
         |        17G |             867 |               3799 | 23%          |
         |        12G |            1203 |               3543 | 34%          |
         |         8G |            2534 |               3979 | 64%          |
         |         4G |            3980 |               4147 | 96%          |
         |          0 |            3799 |               3980 | 95%          |
         +------------+-----------------+--------------------+--------------+
      
         As you can see, the test kernel (with a kernel containing this
         patch) ramps up page scanning significantly more gradually than the
         control kernel (without this patch).
      
      2. More gradual ramp up in reclaim aggression doesn't result in
         premature OOMs.
      
         To test this, I wrote a script that slowly increments the number of
         pages held by stress(1)'s --vm-keep mode until a production system
         entered severe overall memory contention.  This script runs in a highly
         protected slice taking up the majority of available system memory.
         Watching vmstat revealed that page scanning continued essentially
         nominally between test and control, without causing forward reclaim
         progress to become arrested.
      
      [0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
      
      [akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
      [chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
        Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
      Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
      Signed-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NRoman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dennis Zhou <dennis@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9783aa99
    • D
      mm/vmpressure.c: fix a signedness bug in vmpressure_register_event() · 518a8671
      Dan Carpenter 提交于
      The "mode" and "level" variables are enums and in this context GCC will
      treat them as unsigned ints so the error handling is never triggered.
      
      I also removed the bogus initializer because it isn't required any more
      and it's sort of confusing.
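
      The bug pattern in miniature, together with the shape of the fix
      (simplified; the real code parses the mode and level strings passed to
      vmpressure_register_event()):

        enum vmpressure_levels level;
        int ret;

        /* buggy: the enum is unsigned here, so "< 0" can never be true */
        level = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
        if (level < 0)
                return -EINVAL;

        /* fixed: keep the return value in a signed int before converting */
        ret = match_string(vmpressure_str_levels, VMPRESSURE_NUM_LEVELS, token);
        if (ret < 0)
                return ret;
        level = ret;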
      
      [akpm@linux-foundation.org: reduce implicit and explicit typecasting]
      [akpm@linux-foundation.org: fix return value, add comment, per Matthew]
      Link: http://lkml.kernel.org/r/20190925110449.GO3264@mwanda
      Fixes: 3cadfa2b ("mm/vmpressure.c: convert to use match_string() helper")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Reviewed-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Enrico Weigelt <info@metux.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      518a8671
    • Q
      mm/page_alloc.c: fix a crash in free_pages_prepare() · 234fdce8
      Qian Cai 提交于
      On architectures like s390, arch_free_page() could mark the page unused
      (set_page_unused()) and any later access would trigger a kernel panic.
      Fix it by moving arch_free_page() after all calls that may still access
      the page.
      
       Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
       Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
                  R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
       Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
                  000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
                  00000000275cca48 0000000000000100 0000000000000008 000003d080010000
                  00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
       Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
                  0000000026c2b962: e32014000036 pfd 2,1024(%r1)
                 #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
                 >0000000026c2b96e: 41101100  la %r1,256(%r1)
                  0000000026c2b972: a737fff8  brctg %r3,26c2b962
                  0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
                  0000000026c2b97c: e31003400004 lg %r1,832
                  0000000026c2b982: ebff1430016a asi 5168(%r1),-1
       Call Trace:
       __free_pages_ok+0x16a/0x5d8)
       memblock_free_all+0x206/0x290
       mem_init+0x58/0x120
       start_kernel+0x2b0/0x570
       startup_continue+0x6a/0xc0
       INFO: lockdep is turned off.
       Last Breaking-Event-Address:
       __free_pages_ok+0x372/0x5d8
       Kernel panic - not syncing: Fatal exception: panic_on_oops
       00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C
      
      In the past, only kernel_poison_pages() would trigger this, but it needs
      the "page_poison=on" kernel cmdline, and I suspect nobody tested that on
      s390.  Recently, kernel_init_free_pages() (commit 6471384a ("mm:
      security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
      was added and could trigger this as well.
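
      Schematically, the fix keeps everything that still touches the page's
      memory ahead of arch_free_page() in free_pages_prepare() (a sketch of
      the ordering, not the full function):

        if (want_init_on_free())
                kernel_init_free_pages(page, 1 << order); /* writes the page */
        kernel_poison_pages(page, 1 << order, 0);         /* writes the page */

        /*
         * arch_free_page() may mark the page unused (s390 set_page_unused()),
         * after which any access faults -- so it must come last.
         */
        arch_free_page(page, order);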
      
      [akpm@linux-foundation.org: add comment]
      Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
      Fixes: 8823b1db ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
      Fixes: 6471384a ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NHeiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: NChristian Borntraeger <borntraeger@de.ibm.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.3+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      234fdce8