1. 23 November 2019, 2 commits
    • mm/ksm.c: don't WARN if page is still mapped in remove_stable_node() · 9a63236f
      Authored by Andrey Ryabinin
      It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
      remove_stable_node() when it races with __mmput() and squeezes in
      between ksm_exit() and exit_mmap().
      
        WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150
      
        Call Trace:
         remove_all_stable_nodes+0x12b/0x330
         run_store+0x4ef/0x7b0
         kernfs_fop_write+0x200/0x420
         vfs_write+0x154/0x450
         ksys_write+0xf9/0x1d0
         do_syscall_64+0x99/0x510
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Remove the warning as there is nothing scary going on.
      
      Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
      Fixes: cbf86cfe ("ksm: remove old stable nodes more thoroughly")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a63236f
    • mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span() · 7ce700bf
      Authored by David Hildenbrand
      Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
      We should never try to touch the memmap of offline sections where we
      could have uninitialized memmaps and could trigger BUGs when calling
      page_to_nid() on poisoned pages.
      
      There is no reliable way to distinguish an uninitialized memmap from an
      initialized memmap that belongs to ZONE_DEVICE, as we don't have
      anything like SECTION_IS_ONLINE that we could use for ZONE_DEVICE
      memory the way pfn_to_online_section() is used for !ZONE_DEVICE memory.
      
      E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
      and will therefore never mark a ZONE_DEVICE zone as contiguous.  No
      longer shrinking ZONE_DEVICE zones therefore results in no observable
      changes, besides /proc/zoneinfo indicating different boundaries -
      something we can totally live with.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized with 0 and the node with the right
      value.  So the zone might be wrong but not garbage.  After that commit,
      both the zone and the node will be garbage when touching uninitialized
      memmaps.
      
      Toshiki reported a BUG (race between delayed initialization of
      ZONE_DEVICE memmaps without holding the memory hotplug lock and
      concurrent zone shrinking).
      
        https://lkml.org/lkml/2019/11/14/1040
      
      "Iteration of create and destroy namespace causes the panic as below:
      
            kernel BUG at mm/page_alloc.c:535!
            CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
            RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
            Call Trace:
             memmap_init_zone_device+0x165/0x17c
             memremap_pages+0x4c1/0x540
             devm_memremap_pages+0x1d/0x60
             pmem_attach_disk+0x16b/0x600 [nd_pmem]
             nvdimm_bus_probe+0x69/0x1c0
             really_probe+0x1c2/0x3e0
             driver_probe_device+0xb4/0x100
             device_driver_attach+0x4f/0x60
             bind_store+0xc9/0x110
             kernfs_fop_write+0x116/0x190
             vfs_write+0xa5/0x1a0
             ksys_write+0x59/0xd0
             do_syscall_64+0x5b/0x180
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        While creating a namespace and initializing memmap, if you destroy the
        namespace and shrink the zone, it will initialize the memmap outside
        the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
        pfn), page) in set_pfnblock_flags_mask()."
      
      This BUG is also mitigated by this commit, where we for now stop
      shrinking the ZONE_DEVICE zone until we can do it in a safe and clean way.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ce700bf
  2. 16 November 2019, 10 commits
    • mm/debug.c: PageAnon() is true for PageKsm() pages · 6855ac4a
      Authored by Ralph Campbell
      PageAnon() and PageKsm() use the low two bits of the page->mapping
      pointer to indicate the page type.  PageAnon() only checks the LSB while
      PageKsm() checks that the least significant 2 bits equal 3.
      
      Therefore, PageAnon() is also true for KSM pages, and __dump_page() will
      incorrectly never print "ksm" because it checks PageAnon() first.  Fix
      this by checking PageKsm() first.
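
      The ordering matters because of how the mapping pointer is tagged; here
      is a small stand-alone sketch of the idea (the bit values mirror
      PAGE_MAPPING_ANON and PAGE_MAPPING_KSM, everything else is illustrative):

        #include <stdio.h>
        #include <stdint.h>

        #define MAPPING_ANON 0x1UL          /* anon page: low bit set      */
        #define MAPPING_KSM  0x3UL          /* KSM page: both low bits set */

        static int is_anon(uintptr_t m) { return (m & MAPPING_ANON) != 0; }
        static int is_ksm(uintptr_t m)  { return (m & MAPPING_KSM) == MAPPING_KSM; }

        int main(void)
        {
                uintptr_t ksm_mapping = 0x1000UL | MAPPING_KSM;

                /* Testing the anon bit first would make the "ksm" branch
                 * unreachable, which is exactly the bug in __dump_page(). */
                if (is_ksm(ksm_mapping))
                        printf("ksm\n");
                else if (is_anon(ksm_mapping))
                        printf("anon\n");
                return 0;
        }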
      
      Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
      Fixes: 1c6fb1d8 ("mm: print more information about mapping in __dump_page")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6855ac4a
    • mm/debug.c: __dump_page() prints an extra line · 76a1850e
      Authored by Ralph Campbell
      When dumping struct page information, __dump_page() prints the page type
      with a trailing blank followed by the page flags on a separate line:
      
        anon
        flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
      
      It looks like the intent was to use pr_cont() for printing "flags:" but
      pr_cont() usage is discouraged so fix this by extending the format to
      include the flags into a single line:
      
        anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
      
      If the page is file backed, the name might be long so use two lines:
      
        shmem_aops name:"dev/zero"
        flags: 0x10000000008000c(uptodate|dirty|swapbacked)
      
      Also eliminate the pr_cont() usage for appending compound_mapcount.
      
      Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76a1850e
    • mm/page_io.c: do not free shared swap slots · 5df373e9
      Authored by Vinayak Menon
      The following race is observed, due to which a process faulting on a
      swap entry finds the page neither in the swapcache nor in swap.  This causes
      zram to give a zero filled page that gets mapped to the process,
      resulting in a user space crash later.
      
      Consider parent and child processes Pa and Pb sharing the same swap slot
      with swap_count 2.  Swap is on zram with SWP_SYNCHRONOUS_IO set.
      Virtual address 'VA' of Pa and Pb points to the shared swap entry.
      
      Pa                                       Pb
      
      fault on VA                              fault on VA
      do_swap_page                             do_swap_page
      lookup_swap_cache fails                  lookup_swap_cache fails
                                               Pb scheduled out
      swapin_readahead (deletes zram entry)
      swap_free (makes swap_count 1)
                                               Pb scheduled in
                                               swap_readpage (swap_count == 1)
                                               Takes SWP_SYNCHRONOUS_IO path
                                               zram entry absent
                                               zram gives a zero filled page
      
      Fix this by making sure that swap slot is freed only when swap count
      drops down to one.
      
      Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
      Fixes: aa8d22a1 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Suggested-by: Minchan Kim <minchan@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df373e9
    • mm/memory_hotplug: fix try_offline_node() · 2c91f8fc
      Authored by David Hildenbrand
      try_offline_node() is pretty much broken right now:
      
       - The node span is updated when onlining memory, not when adding it. We
         ignore memory that was never onlined. Bad.
      
       - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
         trigger a kernel panic. Bad for memory that is offline but also bad
         for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
         first PFN of a section might contain garbage.
      
       - Sections belonging to mixed nodes are not properly considered.
      
      As memory blocks might belong to multiple nodes, we would have to walk
      all pageblocks (or at least subsections) within present sections.
      However, we don't have a way to identify whether a memmap that is not
      online was initialized (relevant for ZONE_DEVICE).  This makes things
      more complicated.
      
      Luckily, we can piggyback on the node span and the nid stored in memory
      blocks.  Currently, the node span is grown when calling
      move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
      removing memory, before calling try_offline_node().  Sysfs links are
      created via link_mem_sections(), e.g., during boot or when adding
      memory.
      
      If the node still spans memory or if any memory block belongs to the
      nid, we don't set the node offline.  As memory blocks that span multiple
      nodes cannot get offlined, the nid stored in memory blocks is reliable
      enough (for such online memory blocks, the node still spans the memory).
      
      Introduce for_each_memory_block() to efficiently walk all memory blocks.
      
      Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
      when removing ZONE_DEVICE memory to fix similar issues (access of
      garbage memmaps) - until we have a reliable way to identify whether
      these memmaps were properly initialized.  This implies later, that once
      a node had ZONE_DEVICE memory, we won't be able to set a node offline -
      which should be acceptable.
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") memory that is added is not
      associated with a zone/node (memmap not initialized).  The introducing
      commit 60a5a19e ("memory-hotplug: remove sysfs file of node")
      already missed that we could have multiple nodes for a section and that
      the zone/node span is updated when onlining pages, not when adding them.
      
      I tested this by hotplugging two DIMMs to a memory-less and cpu-less
      NUMA node.  The node is properly onlined when adding the DIMMs.  When
      removing the DIMMs, the node is properly offlined.
      
      Masayoshi Mizuma reported:
      
      : Without this patch, memory hotplug fails as panic:
      :
      :  BUG: kernel NULL pointer dereference, address: 0000000000000000
      :  ...
      :  Call Trace:
      :   remove_memory_block_devices+0x81/0xc0
      :   try_remove_memory+0xb4/0x130
      :   __remove_memory+0xa/0x20
      :   acpi_memory_device_remove+0x84/0x100
      :   acpi_bus_trim+0x57/0x90
      :   acpi_bus_trim+0x2e/0x90
      :   acpi_device_hotplug+0x2b2/0x4d0
      :   acpi_hotplug_work_fn+0x1a/0x30
      :   process_one_work+0x171/0x380
      :   worker_thread+0x49/0x3f0
      :   kthread+0xf8/0x130
      :   ret_from_fork+0x35/0x40
      
      [david@redhat.com: v3]
        Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
      Fixes: 60a5a19e ("memory-hotplug: remove sysfs file of node")
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e8
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Nayna Jain <nayna@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c91f8fc
    • mm,thp: recheck each page before collapsing file THP · 4655e5e5
      Authored by Song Liu
      In collapse_file(), for the !is_shmem case, the current check cannot guarantee
      the locked page is up-to-date.  Specifically, xas_unlock_irq() should
      not be called before lock_page() and get_page(); and it is necessary to
      recheck PageUptodate() after locking the page.
      
      With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
      may contain corrupted data.  This is because khugepaged mistakenly
      collapses some not up-to-date sub pages into a huge page, and assumes
      the huge page is up-to-date.  This will NOT corrupt data in the disk,
      because the page is read-only and never written back.  Fix this by
      properly checking PageUptodate() after locking the page.  This check
      replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
      
      Also, move PageDirty() check after locking the page.  Current khugepaged
      should not try to collapse dirty file THP, because it is limited to
      read-only .text.  The only case we hit a dirty page here is when the
      page hasn't been written since write.  Bail out and retry when this
      happens.
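
      A rough sketch of the corrected ordering (hedged: the labels and flow
      are illustrative, following the description above rather than the exact
      hunk in collapse_file()):

        /* Pin and trylock the page while still holding the xarray lock ... */
        if (!trylock_page(page)) {
                result = SCAN_PAGE_LOCK;        /* illustrative bail-out */
                goto xa_locked;
        }
        get_page(page);
        xas_unlock_irq(&xas);

        /* ... and only trust its state once the page lock is held. */
        if (!PageUptodate(page) || PageDirty(page)) {
                unlock_page(page);
                put_page(page);
                goto out_unlock;                /* bail out and retry later */
        }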
      
      syzbot reported a bug in a previous version of this patch.
      
      Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4655e5e5
    • mm: slub: really fix slab walking for init_on_free · aea4df4c
      Authored by Laura Abbott
      Commit 1b7e816f ("mm: slub: Fix slab walking for init_on_free")
      fixed one problem with the slab walking but missed a key detail: When
      walking the list, the head and tail pointers need to be updated since we
      end up reversing the list as a result.  Without doing this, bulk free is
      broken.
      
      One way this is exposed is a NULL pointer with slub_debug=F:
      
        =============================================================================
        BUG skbuff_head_cache (Tainted: G                T): Object already free
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        Oops: 0000 [#1] PREEMPT SMP PTI
        RIP: 0010:print_trailer+0x70/0x1d5
        Call Trace:
         <IRQ>
         free_debug_processing.cold.37+0xc9/0x149
         __slab_free+0x22a/0x3d0
         kmem_cache_free_bulk+0x415/0x420
         __kfree_skb_flush+0x30/0x40
         net_rx_action+0x2dd/0x480
         __do_softirq+0xf0/0x246
         irq_exit+0x93/0xb0
         do_IRQ+0xa0/0x110
         common_interrupt+0xf/0xf
         </IRQ>
      
      Given we're now almost identical to the existing debugging code which
      correctly walks the list, combine with that.
      
      Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net
      Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com
      Fixes: 1b7e816f ("mm: slub: Fix slab walking for init_on_free")
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <clipos@ssi.gouv.fr>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aea4df4c
    • mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() · 0362f326
      Authored by Roman Gushchin
      An exiting task might belong to an offline cgroup.  In this case an
      attempt to grab a cgroup reference from the task can end up with an
      infinite loop in hugetlb_cgroup_charge_cgroup(), because the cgroup
      will never become online, nor will the task be migrated to a live
      cgroup.
      
      Fix this by switching over to css_tryget().  As css_tryget_online()
      can't guarantee that the cgroup won't go offline, in most cases the
      check doesn't make sense.  In this particular case users of
      hugetlb_cgroup_charge_cgroup() are not affected by this change.
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0362f326
    • mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() · 00d484f3
      Authored by Roman Gushchin
      We've encountered a rcu stall in get_mem_cgroup_from_mm():
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
        (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
        <...>
        RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
        <...>
         __memcg_kmem_charge+0x55/0x140
         __alloc_pages_nodemask+0x267/0x320
         pipe_write+0x1ad/0x400
         new_sync_write+0x127/0x1c0
         __kernel_write+0x4f/0xf0
         dump_emit+0x91/0xc0
         writenote+0xa0/0xc0
         elf_core_dump+0x11af/0x1430
         do_coredump+0xc65/0xee0
         get_signal+0x132/0x7c0
         do_signal+0x36/0x640
         exit_to_usermode_loop+0x61/0xd0
         do_syscall_64+0xd4/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is caused by an exiting task which is associated with an
      offline memcg.  We're iterating over and over in the do {} while
      (!css_tryget_online()) loop, but obviously the memcg won't become online
      and the exiting task won't be migrated to a live memcg.
      
      Let's fix it by switching from css_tryget_online() to css_tryget().
      
      As css_tryget_online() cannot guarantee that the memcg won't go offline,
      the check is usually useless, except in some rare cases, for example when
      it determines whether something should be presented to a user.
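
      The shape of the fixed lookup loop, sketched from the description above
      (assuming the usual structure of get_mem_cgroup_from_mm(); not a
      verbatim excerpt):

        rcu_read_lock();
        do {
                memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
                if (unlikely(!memcg))
                        memcg = root_mem_cgroup;
        } while (!css_tryget(&memcg->css));     /* was css_tryget_online() */
        rcu_read_unlock();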
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Johannes:
      
      : The bug aside, it doesn't matter whether the cgroup is online for the
      : callers.  It used to matter when offlining needed to evacuate all charges
      : from the memcg, and so needed to prevent new ones from showing up, but we
      : don't care now.
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00d484f3
    • mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 82072962
      Authored by zhong jiang
      Recently, I hit the following issue when running upstream.
      
        kernel BUG at mm/vmscan.c:1521!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
        RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
        Call Trace:
         reclaim_pages+0x499/0x800 mm/vmscan.c:2188
         madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
         walk_pmd_range mm/pagewalk.c:53 [inline]
         walk_pud_range mm/pagewalk.c:112 [inline]
         walk_p4d_range mm/pagewalk.c:139 [inline]
         walk_pgd_range mm/pagewalk.c:166 [inline]
         __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
         walk_page_range+0x179/0x310 mm/pagewalk.c:349
         madvise_pageout_page_range mm/madvise.c:506 [inline]
         madvise_pageout+0x1f0/0x330 mm/madvise.c:542
         madvise_vma mm/madvise.c:931 [inline]
         __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
         do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      madvise_pageout() accesses the specified range of the vma, isolates the
      pages, and then runs shrink_page_list() to reclaim them.  But it also
      isolates unevictable pages for reclaim, which is how we end up catching
      these cases in shrink_page_list().
      
      The root cause is that we scan the page tables instead of a specific LRU
      list, so we need to filter out the unevictable LRU pages ourselves.
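
      A hedged sketch of that filtering in the madvise page-table walk
      (illustrative check, not the exact hunk):

        /* Only evictable LRU pages are valid candidates for madvise_pageout();
         * skip anything else instead of handing it to shrink_page_list(). */
        if (!PageLRU(page) || PageUnevictable(page))
                continue;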
      
      Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82072962
    • mm: mempolicy: fix the wrong return value and potential pages leak of mbind · a85dfc30
      Authored by Yang Shi
      Commit d8835445 ("mm: mempolicy: make the behavior consistent when
      MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
      of mbind() for a couple of corner cases.  But it altered the errno for
      some other cases; for example, mbind() should return -EFAULT when part
      or all of the memory range specified by nodemask and maxnode points
      outside the accessible address space, or when there is an unmapped hole
      in the memory range specified by addr and len.
      
      Fix this by preserving the errno returned by queue_pages_range().  Also,
      the pagelist may not be empty even when queue_pages_range() returns an
      error; put those pages back on the LRU, since mbind_range() is not called
      to actually apply the policy and the pages should therefore not be
      migrated.  This is also the old behavior before the problematic commit.
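
      The fix roughly takes this shape (a sketch based on the description
      above; the label name is illustrative):

        ret = queue_pages_range(mm, start, end, nmask,
                                flags | MPOL_MF_INVERT, &pagelist);
        if (ret < 0) {
                err = ret;                      /* keep -EFAULT and friends */
                goto up_out;
        }

        /* ... mbind_range() and page migration on success ... */

        up_out:
                if (!list_empty(&pagelist))
                        putback_movable_pages(&pagelist);  /* don't leak isolated pages */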
      
      Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: d8835445 ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[4.19 and 5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85dfc30
  3. 07 November 2019, 10 commits
    • mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 869712fd
      Authored by Johannes Weiner
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accommodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaiming on behalf of atomic allocations onto the regular
      allocations that can block.
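
      Mechanically, this is a small escape hatch in the charge path, roughly
      (hedged sketch around try_charge(); the 'force' label is the existing
      overcharge path):

        /*
         * Atomic allocations cannot reclaim or wait; charge them past the
         * limit instead of failing, and let blockable allocations pay it back.
         */
        if (gfp_mask & __GFP_ATOMIC)
                goto force;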
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6 ("net, mm: account sock objects to kmemcg")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      869712fd
    • mm/memory_hotplug: fix updating the node span · 656d5711
      Authored by David Hildenbrand
      We recently started updating the node span based on the zone span to
      avoid touching uninitialized memmaps.
      
      Currently, we will always detect the node span to start at 0, meaning a
      node can easily span too many pages.  pgdat_is_empty() will still work
      correctly if all zones span no pages.  We should skip over all zones
      without spanned pages and properly handle the first detected zone that
      spans pages.
      
      Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
      span cannot easily be inspected and tested.  The node span gives no real
      guarantees when an architecture supports memory hotplug, meaning it can
      easily contain holes or span pages of different nodes.
      
      The node span is not really used after init on architectures that
      support memory hotplug.
      
      E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
      mm/kmemleak.c:kmemleak_scan().  These users seem to be fine.
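
      A condensed sketch of the corrected span computation (hedged; based on
      the description above rather than a verbatim excerpt of the update
      helper):

        unsigned long node_start_pfn = 0, node_end_pfn = 0;
        struct zone *zone;

        for (zone = pgdat->node_zones;
             zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
                unsigned long end_pfn = zone_end_pfn(zone);

                /* Skip zones that span no pages instead of letting them
                 * drag the node start down to 0. */
                if (!zone->spanned_pages)
                        continue;
                if (!node_end_pfn) {
                        node_start_pfn = zone->zone_start_pfn;
                        node_end_pfn = end_pfn;
                        continue;
                }
                node_end_pfn = max(node_end_pfn, end_pfn);
                node_start_pfn = min(node_start_pfn, zone->zone_start_pfn);
        }

        pgdat->node_start_pfn = node_start_pfn;
        pgdat->node_spanned_pages = node_end_pfn - node_start_pfn;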
      
      Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
      Fixes: 00d6c019 ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      656d5711
    • mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly · 221ec5c0
      Authored by Roman Gushchin
      page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
      slab pages, because it depends on PgHead AND PgSlab flags to be set to
      determine the memory cgroup from the kmem_cache.  It's correct for
      compound pages, but not for generic small pages.  Those don't have PgHead
      set, so it ends up returning zero.
      
      Fix this by changing the condition to PageSlab() && !PageTail().
      
      Before this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	        38        0  _______S___________________________________	slab
      
      After this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	       147        0  _______S___________________________________	slab
      
      Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
      to filter error injection events based on memcg.  So if
      page_cgroup_ino() fails to return memcg pointer, we just fail to inject
      memory error.  Considering that hwpoison filter is for testing, affected
      users are limited and the impact should be marginal.
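
      For reference, a hedged sketch of the corrected lookup in
      page_cgroup_ino() (the memcg_from_slab_page() helper name follows the
      slab accounting code of that era; treat the hunk as illustrative):

        rcu_read_lock();
        /* Small slab pages are not compound, so do not require PageHead():
         * any slab page that is not a tail page carries the kmem_cache. */
        if (PageSlab(page) && !PageTail(page))
                memcg = memcg_from_slab_page(page);
        else
                memcg = READ_ONCE(page->mem_cgroup);
        /* ... resolve the cgroup inode from memcg as before ... */
        rcu_read_unlock();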
      
      [n-horiguchi@ah.jp.nec.com: changelog additions]
      Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      221ec5c0
    • mm/page_alloc.c: ratelimit allocation failure warnings more aggressively · 1be334e5
      Authored by Johannes Weiner
      While investigating a bug related to higher atomic allocation failures,
      we noticed the failure warnings positively drowning the console, and in
      our case trigger lockup warnings because of a serial console too slow to
      handle all that output.
      
      But even if we had a faster console, it's unclear what additional
      information the current level of repetition provides.
      
      Allocation failures happen for three reasons: The machine is OOM, the VM
      is failing to handle reasonable requests, or somebody is making
      unreasonable requests (and didn't acknowledge their opportunism with
      __GFP_NOWARN).  Having the memory dump, a callstack, and the ratelimit
      stats on skipped failure warnings should provide enough information to
      let users/admins/developers know whether something is wrong and point
      them in the right direction for debugging, bpftracing etc.
      
      Limit allocation failure warnings to one spew every ten seconds.
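
      Conceptually the change is just a stricter rate limit around the
      warning, roughly like this (hedged sketch using the generic ratelimit
      helpers; names are illustrative, not the exact ones in mm/page_alloc.c):

        #include <linux/gfp.h>
        #include <linux/printk.h>
        #include <linux/ratelimit.h>

        /* At most one allocation-failure warning every ten seconds. */
        static DEFINE_RATELIMIT_STATE(alloc_fail_rs, 10 * HZ, 1);

        static void report_alloc_failure(gfp_t gfp_mask)
        {
                if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&alloc_fail_rs))
                        return;
                pr_warn("page allocation failure: gfp=%#x\n", gfp_mask);
        }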
      
      Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be334e5
    • mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y · ec649c9d
      Authored by Ville Syrjälä
      I got some khugepaged spew on a 32bit x86:
      
        BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
        INFO: lockdep is turned off.
        CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
        Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203    07/08/2009
        Call Trace:
         dump_stack+0x66/0x8e
         ___might_sleep.cold.96+0x95/0xa6
         __might_sleep+0x2e/0x80
         collapse_huge_page.isra.51+0x5ac/0x1360
         khugepaged+0x9a9/0x20f0
         kthread+0xf5/0x110
         ret_from_fork+0x2e/0x38
      
      Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
      vs.  mmu_notifier_invalidate_range_start().  Let's do the naive approach
      and just reorder the two operations.
      
      Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
      Fixes: 810e24e0 ("mm/mmu_notifiers: annotate with might_sleep()")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec649c9d
    • mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo · 93b3a674
      Authored by Michal Hocko
      pagetypeinfo_showfree_print is called with zone->lock held in IRQ mode.
      This is not really nice because it blocks both any interrupts on that
      cpu and the page allocator.  On large machines this might even trigger
      the hard lockup detector.
      
      Considering that pagetypeinfo is a debugging tool, we do not really need
      exact numbers here.  The primary reason to look at the output is to see
      how pageblocks are spread among different migratetypes, and a low number
      of pages is much more interesting, so putting a bound on the number of
      pages counted per free_list sounds like a reasonable tradeoff.
      
      The new output will simply tell
        [...]
        Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
      
      instead of
        Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
      
      The limit has been chosen arbitrarily and is a subject of a future
      change should there be a need for that.
      
      While we are at it, also drop the zone lock after each free_list
      iteration, which will help IRQ and page allocator responsiveness even
      further, as the time the lock is held with IRQs disabled is now bounded
      by those 100k pages.
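
      Conceptually, the bounded walk looks like this (hedged sketch; the buddy
      free_list layout is real, the cap handling and printing are
      illustrative):

        unsigned long freecount = 0;
        bool overflow = false;
        struct free_area *area = &zone->free_area[order];
        struct list_head *curr;

        spin_lock_irq(&zone->lock);
        list_for_each(curr, &area->free_list[mtype]) {
                if (++freecount >= 100000) {
                        overflow = true;        /* print ">100000" instead */
                        break;
                }
        }
        spin_unlock_irq(&zone->lock);           /* bound the IRQ-off time */
        seq_printf(m, overflow ? ">%6lu " : "%6lu ", freecount);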
      
      [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
      Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93b3a674
    • mm, vmstat: hide /proc/pagetypeinfo from normal users · abaed011
      Authored by Michal Hocko
      /proc/pagetypeinfo is a debugging tool to examine internal page
      allocator state with respect to fragmentation.  It is not very useful for
      any other purpose, so normal users really do not need to read this file.
      
      Waiman Long has noticed that reading this file can have negative side
      effects because zone->lock is necessary for gathering data and that a)
      interferes with the page allocator and its users and b) can lead to hard
      lockups on large machines which have very long free_list.
      
      Reduce both issues by simply not exporting the file to regular users.
      
      Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Waiman Long <longman@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jann Horn <jannh@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abaed011
    • mm/mmu_notifiers: use the right return code for WARN_ON · df2ec764
      Authored by Jason Gunthorpe
      The return code from the op callback is actually in _ret, while the
      WARN_ON was checking ret which causes it to misfire.
      
      Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
      Fixes: 8402ce61 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df2ec764
    • mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Authored by Mel Gorman
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
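
      The recalculation step itself is small; roughly (hedged sketch of what
      runs once deferred init has finished, e.g. from page_alloc_init_late()):

        struct zone *zone;

        /* zone->managed_pages is finally accurate - refresh the per-cpu
         * batch and high limits that were computed too early. */
        for_each_populated_zone(zone)
                zone_pcp_update(zone);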
      
      mmtests configuration: config-workload-kernbench-max Configuration was
      modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before the patch, the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
      Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e8fc007
    • mm: memcontrol: fix NULL-ptr deref in percpu stats flush · 7961eee3
      Authored by Shakeel Butt
      __mem_cgroup_free() can be called on the failure path in
      mem_cgroup_alloc().  However memcg_flush_percpu_vmstats() and
      memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
      access the fields of memcg which can potentially be NULL if called from
      the failure path of mem_cgroup_alloc().  Indeed, syzbot has reported the
      following crash:
      
      	kasan: CONFIG_KASAN_INLINE enabled
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
      	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
      	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
      	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
      	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
      	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
      	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
      	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
      	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
      	mem_cgroup_free mm/memcontrol.c:5033 [inline]
      	mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
      	css_create kernel/cgroup/cgroup.c:5156 [inline]
      	cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
      	cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
      	kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
      	vfs_mkdir+0x42e/0x670 fs/namei.c:3807
      	do_mkdirat+0x234/0x2a0 fs/namei.c:3830
      	__do_sys_mkdir fs/namei.c:3846 [inline]
      	__se_sys_mkdir fs/namei.c:3844 [inline]
      	__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
      	do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
      	entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by moving the flush to mem_cgroup_free(), as there is no need
      to flush anything if we see a failure in mem_cgroup_alloc().
      
      Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
      Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
      Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7961eee3
  4. 19 October 2019, 16 commits
    • mm/thp: allow dropping THP from page cache · ef18a1ca
      Authored by Kirill A. Shutemov
      Once a THP is added to the page cache, it cannot be dropped via
      /proc/sys/vm/drop_caches.  Fix this issue with proper handling in
      invalidate_mapping_pages().
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-5-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ef18a1ca
    • mm/vmscan.c: support removing arbitrary sized pages from mapping · 906d278d
      Authored by William Kucharski
      __remove_mapping() assumes that pages can only be either base pages or
      HPAGE_PMD_SIZE.  Ask the page what size it is.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-4-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      906d278d
    • mm/thp: fix node page state in split_huge_page_to_list() · 06d3eff6
      Authored by Kirill A. Shutemov
      Make sure split_huge_page_to_list() handles the state of shmem THP and
      file THP properly.
      
      Link: http://lkml.kernel.org/r/20191017164223.2762148-3-songliubraving@fb.com
      Fixes: 60fbf0ab ("mm,thp: stats for file backed THP")
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Tested-by: Song Liu <songliubraving@fb.com>
      Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06d3eff6
    • mm/init-mm.c: include <linux/mman.h> for vm_committed_as_batch · a2ae8c05
      Authored by Ben Dooks (Codethink)
      mm_init.c needs to include <linux/mman.h> for the definition of
      vm_committed_as_batch.  Fixes the following sparse warning:
      
        mm/mm_init.c:141:5: warning: symbol 'vm_committed_as_batch' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191016091509.26708-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a2ae8c05
    • mm/filemap.c: include <linux/ramfs.h> for generic_file_vm_ops definition · d0e6a582
      Authored by Ben Dooks
      The generic_file_vm_ops is defined in <linux/ramfs.h> so include it to
      fix the following warning:
      
        mm/filemap.c:2717:35: warning: symbol 'generic_file_vm_ops' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191008102311.25432-1-ben.dooks@codethink.co.uk
      Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d0e6a582
    • mm: include <linux/huge_mm.h> for is_vma_temporary_stack · 444f84fd
      Authored by Ben Dooks
      Include <linux/huge_mm.h> for the definition of is_vma_temporary_stack
      to fix the following sparse warning:
      
        mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?
      
      Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
      Signed-off-by: NBen Dooks <ben.dooks@codethink.co.uk>
      Reviewed-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      444f84fd
    • K
      mm/memcontrol: update lruvec counters in mem_cgroup_move_account · ae8af438
      Committed by Konstantin Khlebnikov
      Mapped, dirty and writeback pages are also counted in per-lruvec stats.
      These counters need to be updated when a page is moved between cgroups.
      
      Currently nobody is *consuming* the lruvec versions of these counters,
      so there is no user-visible effect.
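
      A minimal sketch of the intended accounting (a model, not the
      memcontrol code): every state bit that is tracked per lruvec has to be
      decremented on the source lruvec and incremented on the destination
      when the page moves:

        #include <stdbool.h>

        enum lruvec_stat_model { NR_FILE_MAPPED, NR_FILE_DIRTY, NR_WRITEBACK, NR_ITEMS };

        struct lruvec_model { long state[NR_ITEMS]; };

        struct page_model {
                bool mapped, dirty, writeback;
                long nr_pages;
        };

        static void move_stat(struct lruvec_model *from, struct lruvec_model *to,
                              enum lruvec_stat_model idx, long nr)
        {
                from->state[idx] -= nr;
                to->state[idx] += nr;
        }

        /* move the per-lruvec counters along with the page */
        static void move_account(struct lruvec_model *from, struct lruvec_model *to,
                                 const struct page_model *page)
        {
                if (page->mapped)
                        move_stat(from, to, NR_FILE_MAPPED, page->nr_pages);
                if (page->dirty)
                        move_stat(from, to, NR_FILE_DIRTY, page->nr_pages);
                if (page->writeback)
                        move_stat(from, to, NR_WRITEBACK, page->nr_pages);
        }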
      
      Link: http://lkml.kernel.org/r/157112699975.7360.1062614888388489788.stgit@buzz
      Fixes: 00f3ca2c ("mm: memcontrol: per-lruvec stats infrastructure")
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ae8af438
    • D
      hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() · f231fe42
      Committed by David Hildenbrand
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      Let's make sure that we only consider online memory (managed by the
      buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.
      
      page_zone() will call page_to_nid(), which will trigger
      VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
      and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
      can be the case when an offline memory block (e.g., never onlined) is
      spanned by a zone.
      
      Note: As explained by Michal in [1], alloc_contig_range() will verify
      the range.  So it boils down to the wrong access in this function.
      
      [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
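
      A standalone sketch of the check (illustrative; pfn_to_online_page_model()
      stands in for the kernel's pfn_to_online_page(), which returns NULL for
      pfns in offline sections):

        #include <stdbool.h>
        #include <stddef.h>

        struct page_model { bool reserved; };

        #define NR_ONLINE_PFNS 4096UL
        static struct page_model memmap_model[NR_ONLINE_PFNS];

        /* model of pfn_to_online_page(): NULL unless the pfn's section is online */
        static struct page_model *pfn_to_online_page_model(unsigned long pfn)
        {
                return pfn < NR_ONLINE_PFNS ? &memmap_model[pfn] : NULL;
        }

        /*
         * A pfn range is usable for a gigantic page only if every pfn has an
         * online, initialized and unreserved memmap entry.
         */
        static bool pfn_range_valid_gigantic_model(unsigned long start, unsigned long end)
        {
                unsigned long pfn;

                for (pfn = start; pfn < end; pfn++) {
                        struct page_model *page = pfn_to_online_page_model(pfn);

                        if (!page)              /* offline: memmap may be uninitialized */
                                return false;
                        if (page->reserved)
                                return false;
                }
                return true;
        }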
      
      Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f231fe42
    • M
      mm: memblock: do not enforce current limit for memblock_phys* family · f3057ad7
      Committed by Mike Rapoport
      Until commit 92d12f95 ("memblock: refactor internal allocation
      functions") the maximal address for memblock allocations was forced to
      memblock.current_limit only for the allocation functions returning
      virtual address.  The changes introduced by that commit moved the limit
      enforcement into the allocation core and as a result the allocation
      functions returning physical address also started to limit allocations
      to memblock.current_limit.
      
      This caused breakage of etnaviv GPU driver:
      
        etnaviv etnaviv: bound 130000.gpu (ops gpu_ops)
        etnaviv etnaviv: bound 134000.gpu (ops gpu_ops)
        etnaviv etnaviv: bound 2204000.gpu (ops gpu_ops)
        etnaviv-gpu 130000.gpu: model: GC2000, revision: 5108
        etnaviv-gpu 130000.gpu: command buffer outside valid memory window
        etnaviv-gpu 134000.gpu: model: GC320, revision: 5007
        etnaviv-gpu 134000.gpu: command buffer outside valid memory window
        etnaviv-gpu 2204000.gpu: model: GC355, revision: 1215
        etnaviv-gpu 2204000.gpu: Ignoring GPU with VG and FE2.0
      
      Restore the behaviour of memblock_phys* family so that these functions
      will not enforce memblock.current_limit.
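
      The distinction can be modelled with a standalone sketch (names
      hypothetical, not the memblock internals): only requests without an
      explicit limit are clamped to current_limit, while an explicit end
      supplied by a memblock_phys_* style caller is honoured as-is:

        typedef unsigned long long phys_addr_t;

        #define MEMBLOCK_ALLOC_ACCESSIBLE 0ULL  /* "no explicit limit" sentinel */

        struct memblock_model { phys_addr_t current_limit; };

        /*
         * Effective upper bound for one allocation request: requests without
         * an explicit limit (the virtual-address allocators) are clamped to
         * current_limit, while an explicit end from a physical allocator is
         * returned unchanged - the behaviour this commit restores.
         */
        static phys_addr_t effective_end(const struct memblock_model *mb,
                                         phys_addr_t requested_end)
        {
                if (requested_end == MEMBLOCK_ALLOC_ACCESSIBLE)
                        return mb->current_limit;
                return requested_end;
        }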
      
      Link: http://lkml.kernel.org/r/1570915861-17633-1-git-send-email-rppt@kernel.org
      Fixes: 92d12f95 ("memblock: refactor internal allocation functions")
      Signed-off-by: NMike Rapoport <rppt@linux.ibm.com>
      Reported-by: NAdam Ford <aford173@gmail.com>
      Tested-by: Adam Ford <aford173@gmail.com>	[imx6q-logicpd]
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Fabio Estevam <festevam@gmail.com>
      Cc: Lucas Stach <l.stach@pengutronix.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3057ad7
    • H
      mm: memcg: get number of pages on the LRU list in memcgroup base on lru_zone_size · b11edebb
      Committed by Honglei Wang
      Commit 1a61ab80 ("mm: memcontrol: replace zone summing with
      lruvec_page_state()") made lruvec_page_state() use per-cpu counters
      instead of calculating the value directly from lru_zone_size, on the
      assumption that this would be more efficient.
      
      Tim has reported that this is not the case for their database benchmark,
      which shows the opposite result: lruvec_page_state() takes up a huge
      chunk of CPU cycles (about 25% of the system time, which is roughly 7%
      of total CPU cycles) on 5.3 kernels.  The workload runs on a large
      machine (96 CPUs), has many cgroups (500) and is heavily direct-reclaim
      bound.
      
      Tim Chen said:
      
      : The problem can also be reproduced by running simple multi-threaded
      : pmbench benchmark with a fast Optane SSD swap (see profile below).
      :
      :
      : 6.15%     3.08%  pmbench          [kernel.vmlinux]            [k] lruvec_lru_size
      :             |
      :             |--3.07%--lruvec_lru_size
      :             |          |
      :             |          |--2.11%--cpumask_next
      :             |          |          |
      :             |          |           --1.66%--find_next_bit
      :             |          |
      :             |           --0.57%--call_function_interrupt
      :             |                     |
      :             |                      --0.55%--smp_call_function_interrupt
      :             |
      :             |--1.59%--0x441f0fc3d009
      :             |          _ops_rdtsc_init_base_freq
      :             |          access_histogram
      :             |          page_fault
      :             |          __do_page_fault
      :             |          handle_mm_fault
      :             |          __handle_mm_fault
      :             |          |
      :             |           --1.54%--do_swap_page
      :             |                     swapin_readahead
      :             |                     swap_cluster_readahead
      :             |                     |
      :             |                      --1.53%--read_swap_cache_async
      :             |                                __read_swap_cache_async
      :             |                                alloc_pages_vma
      :             |                                __alloc_pages_nodemask
      :             |                                __alloc_pages_slowpath
      :             |                                try_to_free_pages
      :             |                                do_try_to_free_pages
      :             |                                shrink_node
      :             |                                shrink_node_memcg
      :             |                                |
      :             |                                |--0.77%--lruvec_lru_size
      :             |                                |
      :             |                                 --0.76%--inactive_list_is_low
      :             |                                           |
      :             |                                            --0.76%--lruvec_lru_size
      :             |
      :              --1.50%--measure_read
      :                        page_fault
      :                        __do_page_fault
      :                        handle_mm_fault
      :                        __handle_mm_fault
      :                        do_swap_page
      :                        swapin_readahead
      :                        swap_cluster_readahead
      :                        |
      :                         --1.48%--read_swap_cache_async
      :                                   __read_swap_cache_async
      :                                   alloc_pages_vma
      :                                   __alloc_pages_nodemask
      :                                   __alloc_pages_slowpath
      :                                   try_to_free_pages
      :                                   do_try_to_free_pages
      :                                   shrink_node
      :                                   shrink_node_memcg
      :                                   |
      :                                   |--0.75%--inactive_list_is_low
      :                                   |          |
      :                                   |           --0.75%--lruvec_lru_size
      :                                   |
      :                                    --0.73%--lruvec_lru_size
      
      The likely culprit is the cache traffic that lruvec_page_state_local()
      generates.  Dave Hansen says:
      
      : I was thinking purely of the cache footprint.  If it's reading
      : pn->lruvec_stat_local->count[idx] is three separate cachelines, so 192
      : bytes of cache *96 CPUs = 18k of data, mostly read-only.  1 cgroup would
      : be 18k of data for the whole system and the caching would be pretty
      : efficient and all 18k would probably survive a tight page fault loop in
      : the L1.  500 cgroups would be ~90k of data per CPU thread which doesn't
      : fit in the L1 and probably wouldn't survive a tight page fault loop if
      : both logical threads were banging on different cgroups.
      :
      : It's just a theory, but it's why I noted the number of cgroups when I
      : initially saw this show up in profiles
      
      Fix the regression by partially reverting the said commit and
      calculating the LRU size explicitly.
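
      A sketch of the restored calculation (illustrative, not the exact kernel
      function): the LRU size is read from the per-zone lru_zone_size array
      that the memcg already maintains, instead of summing per-cpu counters:

        #define MAX_NR_ZONES_MODEL 5
        #define NR_LRU_LISTS_MODEL 5

        struct mem_cgroup_per_node_model {
                /* exact per-zone LRU sizes, maintained at add/remove time */
                unsigned long lru_zone_size[MAX_NR_ZONES_MODEL][NR_LRU_LISTS_MODEL];
        };

        /* size of one LRU list, summed over zones up to and including zone_idx */
        static unsigned long lruvec_lru_size_model(const struct mem_cgroup_per_node_model *pn,
                                                   int lru, int zone_idx)
        {
                unsigned long size = 0;
                int zid;

                for (zid = 0; zid <= zone_idx; zid++)
                        size += pn->lru_zone_size[zid][lru];    /* no per-cpu summing */

                return size;
        }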
      
      Link: http://lkml.kernel.org/r/20190905071034.16822-1-honglei.wang@oracle.com
      Fixes: 1a61ab80 ("mm: memcontrol: replace zone summing with lruvec_page_state()")
      Signed-off-by: NHonglei Wang <honglei.wang@oracle.com>
      Reported-by: NTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: NTim Chen <tim.c.chen@linux.intel.com>
      Tested-by: NTim Chen <tim.c.chen@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: <stable@vger.kernel.org>	[5.2+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b11edebb
    • J
      mm/gup: fix a misnamed "write" argument, and a related bug · 0cd22afd
      Committed by John Hubbard
      In several routines, the "flags" argument is incorrectly named "write".
      Change it to "flags".
      
      Also, in one place, the misnaming led to an actual bug:
      "flags & FOLL_WRITE" is required, rather than just "flags".
      (That problem was flagged by krobot, in v1 of this patch.)
      
      Also, change the flags argument from int to unsigned int.
      
      You can see that this was a simple oversight, because the
      calling code passes "flags" as the fifth argument:
      
      gup_pgd_range():
          ...
          if (!gup_huge_pd(__hugepd(pgd_val(pgd)), addr,
      		    PGDIR_SHIFT, next, flags, pages, nr))
      
      ...which, until this patch, the callees referred to as "write".
      
      Also, change two lines to avoid checkpatch line length
      complaints, and another line to fix another oversight
      that checkpatch called out: missing "int" on pdshift.
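
      The bug class is easy to show in reduced form (a sketch, not the
      kernel's gup routines): once the parameter is named flags, a
      writability check must mask FOLL_WRITE explicitly, because treating the
      whole flags word as a boolean makes any non-zero flag look like a write
      request:

        #include <stdbool.h>

        #define FOLL_WRITE 0x01
        #define FOLL_GET   0x04

        struct pte_model { bool writable; };

        /* correct: refuse the PTE only when a write was actually requested */
        static bool pte_allows_access(const struct pte_model *pte, unsigned int flags)
        {
                if ((flags & FOLL_WRITE) && !pte->writable)
                        return false;
                return true;
        }

        /*
         * The buggy pattern was effectively "if (flags && !pte->writable)":
         * with the argument still named "write" that read plausibly, but a
         * read-only lookup passing just FOLL_GET would already be rejected
         * as if it were a write.
         */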
      
      Link: http://lkml.kernel.org/r/20191014184639.1512873-3-jhubbard@nvidia.com
      Fixes: b798bec4 ("mm/gup: change write parameter to flags in fast walk")
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Suggested-by: NKirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: NIra Weiny <ira.weiny@intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0cd22afd
    • R
      mm: memcg/slab: fix panic in __free_slab() caused by premature memcg pointer release · b749ecfa
      Committed by Roman Gushchin
      Karsten reported the following panic in __free_slab() happening on an
      s390x machine:
      
        Unable to handle kernel pointer dereference in virtual kernel address space
        Failing address: 0000000000000000 TEID: 0000000000000483
        Fault in home space mode while using kernel ASCE.
        AS:00000000017d4007 R3:000000007fbd0007 S:000000007fbff000 P:000000000000003d
        Oops: 0004 ilc:3 [#1] PREEMPT SMP
        Modules linked in: tcp_diag inet_diag xt_tcpudp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_at nf_nat
        CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-05872-g6133e3e4bada-dirty #14
        Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
        Krnl PSW : 0704d00180000000 00000000003cadb6 (__free_slab+0x686/0x6b0)
                   R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:1 PM:0 RI:0 EA:3
        Krnl GPRS: 00000000f3a32928 0000000000000000 000000007fbf5d00 000000000117c4b8
                   0000000000000000 000000009e3291c1 0000000000000000 0000000000000000
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   0000000000000003 0000000000000008 000000002b478b00 000003d080a97600
                   000000000117ba00 000003e000057db0 00000000003cabcc 000003e000057c78
        Krnl Code: 00000000003cada6: e310a1400004        lg      %r1,320(%r10)
                   00000000003cadac: c0e50046c286        brasl   %r14,ca32b8
                  #00000000003cadb2: a7f4fe36            brc     15,3caa1e
                  >00000000003cadb6: e32060800024        stg     %r2,128(%r6)
                   00000000003cadbc: a7f4fd9e            brc     15,3ca8f8
                   00000000003cadc0: c0e50046790c        brasl   %r14,c99fd8
                   00000000003cadc6: a7f4fe2c            brc     15,3caa1e
                   00000000003cadca: ecb1ffff00d9        aghik   %r11,%r1,-1
        Call Trace:
        (<00000000003cabcc> __free_slab+0x49c/0x6b0)
         <00000000001f5886> rcu_core+0x5a6/0x7e0
         <0000000000ca2dea> __do_softirq+0xf2/0x5c0
         <0000000000152644> irq_exit+0x104/0x130
         <000000000010d222> do_IRQ+0x9a/0xf0
         <0000000000ca2344> ext_int_handler+0x130/0x134
         <0000000000103648> enabled_wait+0x58/0x128
        (<0000000000103634> enabled_wait+0x44/0x128)
         <0000000000103b00> arch_cpu_idle+0x40/0x58
         <0000000000ca0544> default_idle_call+0x3c/0x68
         <000000000018eaa4> do_idle+0xec/0x1c0
         <000000000018ee0e> cpu_startup_entry+0x36/0x40
         <000000000122df34> arch_call_rest_init+0x5c/0x88
         <0000000000000000> 0x0
        INFO: lockdep is turned off.
        Last Breaking-Event-Address:
         <00000000003ca8f4> __free_slab+0x1c4/0x6b0
        Kernel panic - not syncing: Fatal exception in interrupt
      
      The kernel panics on an attempt to dereference the NULL memcg pointer.
      When shutdown_cache() is called from the kmem_cache_destroy() context, a
      memcg kmem_cache might have empty slab pages in a partial list, which are
      still charged to the memory cgroup.
      
      These pages are released by free_partial() at the beginning of
      shutdown_cache(): either directly or by scheduling an RCU-delayed work
      (if the kmem_cache has the SLAB_TYPESAFE_BY_RCU flag).  The latter case
      is when the reported panic can happen: memcg_unlink_cache() is called
      immediately after shrinking partial lists, without waiting for scheduled
      RCU works.  It sets the kmem_cache->memcg_params.memcg pointer to NULL,
      and the following attempt to dereference it by __free_slab() from the
      RCU work context causes the panic.
      
      To fix the issue, let's postpone the release of the memcg pointer to
      destroy_memcg_params().  It's called from a separate work context by
      slab_caches_to_rcu_destroy_workfn(), which contains a full RCU barrier.
      This guarantees that all scheduled page-release RCU work will complete
      before the memcg pointer is zeroed.
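
      The required ordering can be sketched as follows (hypothetical helper
      names, not the kernel API): the memcg pointer is only cleared from the
      deferred destruction work, after an RCU barrier has flushed every
      delayed slab release:

        #include <stddef.h>

        struct kmem_cache_model {
                void *memcg;    /* still read by RCU-delayed slab releases */
        };

        static void rcu_barrier_model(void)
        {
                /* stands in for rcu_barrier(): wait for all pending RCU callbacks */
        }

        /*
         * Runs from the deferred destruction work, i.e. only after every
         * RCU-delayed slab release for this cache has finished.  The bug was
         * clearing the pointer right after shrinking the partial lists,
         * while such RCU work could still be pending.
         */
        static void release_memcg_pointer(struct kmem_cache_model *s)
        {
                rcu_barrier_model();
                s->memcg = NULL;  /* safe: no RCU work can dereference it anymore */
        }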
      
      Big thanks to Karsten for the perfect report containing all of the
      necessary information, for his help with the analysis of the problem,
      and for testing the fix.
      
      Link: http://lkml.kernel.org/r/20191010160549.1584316-1-guro@fb.com
      Fixes: fb2f2b0a ("mm: memcg/slab: reparent memcg kmem_caches on cgroup removal")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Reported-by: NKarsten Graul <kgraul@linux.ibm.com>
      Tested-by: NKarsten Graul <kgraul@linux.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Karsten Graul <kgraul@linux.ibm.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b749ecfa
    • A
      mm/memunmap: don't access uninitialized memmap in memunmap_pages() · 77e080e7
      Committed by Aneesh Kumar K.V
      Patch series "mm/memory_hotplug: Shrink zones before removing memory",
      v6.
      
      This series fixes the access of uninitialized memmaps when shrinking
      zones/nodes and when removing memory.  Also, it contains all fixes for
      crashes that can be triggered when removing certain namespaces using
      memunmap_pages() - ZONE_DEVICE - as reported by Aneesh.
      
      We stop trying to shrink ZONE_DEVICE, as it's buggy, fixing it would be
      more involved (we don't have SECTION_IS_ONLINE as an indicator), and
      shrinking is only of limited use (set_zone_contiguous() cannot detect
      the ZONE_DEVICE as contiguous).
      
      We continue shrinking !ZONE_DEVICE zones; however, I reduced the amount
      of code to a minimum.  Shrinking is especially necessary to keep
      zone->contiguous set where possible, in particular on memory unplug of
      DIMMs at zone boundaries.
      
      --------------------------------------------------------------------------
      
      Zones are now properly shrunk when offlining memory blocks or when
      onlining failed.  This allows zones to be properly shrunk on memory unplug
      even if the separate memory blocks of a DIMM were onlined to different
      zones or re-onlined to a different zone after offlining.
      
      Example:
      
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
        :/# echo "online_movable" > /sys/devices/system/memory/memory41/state
        :/# echo "online_movable" > /sys/devices/system/memory/memory43/state
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  98304
                present  65536
                managed  65536
        :/# echo 0 > /sys/devices/system/memory/memory43/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  32768
                present  32768
                managed  32768
        :/# echo 0 > /sys/devices/system/memory/memory41/online
        :/# cat /proc/zoneinfo
        Node 1, zone  Movable
                spanned  0
                present  0
                managed  0
      
      This patch (of 10):
      
      With an altmap, the memmaps falling into the reserved altmap space are
      not initialized and therefore contain a garbage NID and a garbage zone.
      Make sure to read the NID/zone from a memmap that was initialized.
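
      A sketch of the pfn selection (illustrative; the reserve field models
      what the kernel's vmem_altmap_offset() reports): the NID/zone must be
      read from the first pfn past the altmap reserve, because the reserved
      part of the memmap is never initialized:

        /* pfns at the start of the range consumed by the altmap itself */
        struct altmap_model { unsigned long reserve; };

        /* first pfn whose struct page was actually initialized */
        static unsigned long first_initialized_pfn(unsigned long range_start_pfn,
                                                   const struct altmap_model *altmap)
        {
                unsigned long offset = altmap ? altmap->reserve : 0;

                return range_start_pfn + offset;  /* skip the uninitialized reserve */
        }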
      
      This fixes a kernel crash that is observed when destroying a namespace:
      
        kernel BUG at include/linux/mm.h:1107!
        cpu 0x1: Vector: 700 (Program Check) at [c000000274087890]
            pc: c0000000004b9728: memunmap_pages+0x238/0x340
            lr: c0000000004b9724: memunmap_pages+0x234/0x340
        ...
            pid   = 3669, comm = ndctl
        kernel BUG at include/linux/mm.h:1107!
          devm_action_release+0x30/0x50
          release_nodes+0x268/0x2d0
          device_release_driver_internal+0x174/0x240
          unbind_store+0x13c/0x190
          drv_attr_store+0x44/0x60
          sysfs_kf_write+0x70/0xa0
          kernfs_fop_write+0x1ac/0x290
          __vfs_write+0x3c/0x70
          vfs_write+0xe4/0x200
          ksys_write+0x7c/0x140
          system_call+0x5c/0x68
      
      The "page_zone(pfn_to_page(pfn)" was introduced by 69324b8f ("mm,
      devm_memremap_pages: add MEMORY_DEVICE_PRIVATE support"), however, I
      think we will never have driver reserved memory with
      MEMORY_DEVICE_PRIVATE (no altmap AFAIKS).
      
      [david@redhat.com: minimze code changes, rephrase description]
      Link: http://lkml.kernel.org/r/20191006085646.5768-2-david@redhat.com
      Fixes: 2c2a5af6 ("mm, memory_hotplug: add nid parameter to arch_remove_memory")
      Signed-off-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      77e080e7
    • D
      mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span() · 00d6c019
      Committed by David Hildenbrand
      We might use the nid of memmaps that were never initialized.  For
      example, if the memmap was poisoned, we will crash the kernel in
      pfn_to_nid() right now.  Let's use the calculated boundaries of the
      separate zones instead.  This now also avoids having to iterate over a
      whole bunch of subsections again, after shrinking one zone.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized to 0 and the node was set to the
      right value.  After that commit, the node might be garbage.
      
      We'll have to fix shrink_zone_span() next.
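
      The resulting node-span calculation can be sketched as a standalone
      model (close in spirit to the new update_pgdat_span(), but simplified):
      the node boundaries are just the union of the already-updated zone
      boundaries, so no memmap is touched at all:

        struct zone_model { unsigned long start_pfn, spanned_pages; };

        #define MAX_ZONES_MODEL 4

        struct pgdat_model {
                struct zone_model zones[MAX_ZONES_MODEL];
                unsigned long node_start_pfn, node_spanned_pages;
        };

        /* recompute the node span purely from the (already updated) zone spans */
        static void update_pgdat_span_model(struct pgdat_model *pgdat)
        {
                unsigned long start = 0, end = 0;
                int i;

                for (i = 0; i < MAX_ZONES_MODEL; i++) {
                        const struct zone_model *z = &pgdat->zones[i];
                        unsigned long zone_end = z->start_pfn + z->spanned_pages;

                        if (!z->spanned_pages)
                                continue;       /* empty zones contribute nothing */
                        if (!end) {             /* first non-empty zone */
                                start = z->start_pfn;
                                end = zone_end;
                                continue;
                        }
                        if (z->start_pfn < start)
                                start = z->start_pfn;
                        if (zone_end > end)
                                end = zone_end;
                }
                pgdat->node_start_pfn = start;
                pgdat->node_spanned_pages = end - start;
        }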
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      00d6c019
    • Q
      mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo · a26ee565
      Committed by Qian Cai
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      For example, when not onlining a memory block that is spanned by a zone
      and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
      CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
      
        :/# echo 1 > /sys/devices/system/memory/memory40/online
        :/# echo 1 > /sys/devices/system/memory/memory42/online
        :/# cat /proc/pagetypeinfo > test.file
         page:fffff2c585200000 is uninitialized and poisoned
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
         There is not page extension available.
         ------------[ cut here ]------------
         kernel BUG at include/linux/mm.h:1107!
         invalid opcode: 0000 [#1] SMP NOPTI
      
      Please note that this change does not affect ZONE_DEVICE, because
      pagetypeinfo_showmixedcount_print() is called from
      mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
      ZONE_DEVICE is never populated (zone->present_pages always 0).
      
      [david@redhat.com: move check to outer loop, add comment, rephrase description]
      Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a26ee565
    • D
      mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() · 96c804a6
      Committed by David Hildenbrand
      Check pfn_to_online_page() so that we do not access uninitialized
      memmaps.  Reshuffle the code so we don't have to duplicate the error
      message.
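
      The reshuffled entry check can be sketched as follows (a simplified
      model; the helpers are shown as declarations only and their names are
      hypothetical): resolve the page via the online check first and fall
      through to one shared error message when the pfn is neither online nor
      device memory:

        #include <stdio.h>

        struct page_model;

        /* models of kernel helpers, shown as declarations only */
        struct page_model *pfn_to_online_page_model(unsigned long pfn); /* NULL if offline */
        int pfn_is_zone_device_model(unsigned long pfn);
        int memory_failure_dev_pagemap_model(unsigned long pfn, int flags);

        static int memory_failure_model(unsigned long pfn, int flags)
        {
                struct page_model *p = pfn_to_online_page_model(pfn);

                if (!p) {
                        if (pfn_is_zone_device_model(pfn))
                                return memory_failure_dev_pagemap_model(pfn, flags);

                        /* one shared copy of the error message */
                        fprintf(stderr, "memory_failure: %#lx has no usable memmap\n", pfn);
                        return -1;
                }

                /* ... normal poison handling on an initialized memmap ... */
                return 0;
        }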
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96c804a6
  5. 18 October 2019, 1 commit
    • J
      mm: fix double page fault on arm64 if PTE_AF is cleared · 83d116c5
      Committed by Jia He
      When we tested the pmdk unit test [1] vmmalloc_fork TEST3 on an arm64
      guest, there was a double page fault in __copy_from_user_inatomic() of
      cow_user_page().
      
      To reproduce the bug, run the following command after everything is
      deployed:
      
        make -C src/test/vmmalloc_fork/ TEST_TIME=60m check
      
      The call trace below is from arm64 do_page_fault(), for debugging purposes:
      [  110.016195] Call trace:
      [  110.016826]  do_page_fault+0x5a4/0x690
      [  110.017812]  do_mem_abort+0x50/0xb0
      [  110.018726]  el1_da+0x20/0xc4
      [  110.019492]  __arch_copy_from_user+0x180/0x280
      [  110.020646]  do_wp_page+0xb0/0x860
      [  110.021517]  __handle_mm_fault+0x994/0x1338
      [  110.022606]  handle_mm_fault+0xe8/0x180
      [  110.023584]  do_page_fault+0x240/0x690
      [  110.024535]  do_mem_abort+0x50/0xb0
      [  110.025423]  el0_da+0x20/0x24
      
      The pte info before __copy_from_user_inatomic is (PTE_AF is cleared):
      [ffff9b007000] pgd=000000023d4f8003, pud=000000023da9b003,
                     pmd=000000023d4b3003, pte=360000298607bd3
      
      As told by Catalin: "On arm64 without hardware Access Flag, copying from
      user will fail because the pte is old and cannot be marked young. So we
      always end up with zeroed page after fork() + CoW for pfn mappings. we
      don't always have a hardware-managed access flag on arm64."
      
      This patch fixes it by calling pte_mkyoung().  Also, the parameter is
      changed because vmf should be passed to cow_user_page().
      
      Add a WARN_ON_ONCE when __copy_from_user_inatomic() returns an error,
      in case there is some obscure use-case (suggested by Kirill).
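
      The resulting copy path can be sketched as a standalone model (condensed
      from the description above, not the exact mm/memory.c code): try the
      atomic copy, mark the PTE young under the PTE lock and retry, and only
      then warn and fall back to a zeroed page:

        #include <stdbool.h>
        #include <string.h>

        struct vm_fault_model {
                void *src_kaddr;        /* source user page, kernel-mapped in this model */
                bool pte_young;         /* software view of the access flag */
                bool pte_still_same;    /* pte unchanged after re-taking the lock */
        };

        /* model of __copy_from_user_inatomic(): fails while the pte is old */
        static unsigned long copy_inatomic_model(void *dst, struct vm_fault_model *vmf,
                                                 unsigned long len)
        {
                if (!vmf->pte_young)
                        return len;             /* nothing copied, like the arm64 case */
                memcpy(dst, vmf->src_kaddr, len);
                return 0;
        }

        static bool cow_copy_model(void *dst, struct vm_fault_model *vmf, unsigned long len)
        {
                if (copy_inatomic_model(dst, vmf, len) == 0)
                        return true;

                /* first attempt failed: take the pte lock, mark the pte young, retry */
                if (!vmf->pte_still_same)
                        return false;           /* pte changed, let the caller retry the fault */
                vmf->pte_young = true;          /* pte_mkyoung() + update_mmu_cache() in the kernel */

                if (copy_inatomic_model(dst, vmf, len) == 0)
                        return true;

                /* still failing: warn once in the kernel and fall back to a zeroed page */
                memset(dst, 0, len);
                return true;
        }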
      
      [1] https://github.com/pmem/pmdk/tree/master/src/test/vmmalloc_fork
      
      Signed-off-by: NJia He <justin.he@arm.com>
      Reported-by: NYibo Cai <Yibo.Cai@arm.com>
      Reviewed-by: NCatalin Marinas <catalin.marinas@arm.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      83d116c5
  6. 15 October 2019, 1 commit
    • J
      mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once · 3d7fed4a
      Committed by Jane Chu
      Mmap /dev/dax more than once, then read the poison location using the
      address from one of the mappings.  Because the other mappings do not
      have the page mapped in, SIGKILLs are delivered to the process.
      SIGKILL takes precedence over SIGBUS, so the user process loses the
      opportunity to handle the UE.
      
      Although one may add MAP_POPULATE to mmap(2) to work around the issue,
      MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
      slower, so it isn't always an option.
      
      Details -
      
        ndctl inject-error --block=10 --count=1 namespace6.0
      
        ./read_poison -x dax6.0 -o 5120 -m 2
        mmaped address 0x7f5bb6600000
        mmaped address 0x7f3cf3600000
        doing local read at address 0x7f3cf3601400
        Killed
      
      Console messages in instrumented kernel -
      
        mce: Uncorrected hardware memory error in user-access at edbe201400
        Memory failure: tk->addr = 7f5bb6601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        dev_pagemap_mapping_shift: page edbe201: no PUD
        Memory failure: tk->size_shift == 0
        Memory failure: Unable to find user space address edbe201 in read_poison
        Memory failure: tk->addr = 7f3cf3601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        Memory failure: tk->size_shift = 21
        Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
          => to deliver SIGKILL
        Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
          => to deliver SIGBUS
      
      Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
      Signed-off-by: NJane Chu <jane.chu@oracle.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3d7fed4a