1. 01 December 2019, 1 commit
  2. 26 November 2019, 1 commit
  3. 24 November 2019, 5 commits
  4. 23 November 2019, 2 commits
    • A
      mm/ksm.c: don't WARN if page is still mapped in remove_stable_node() · 9a63236f
      Authored by Andrey Ryabinin
      It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
      remove_stable_node() when it races with __mmput() and squeezes in
      between ksm_exit() and exit_mmap().
      
        WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150
      
        Call Trace:
         remove_all_stable_nodes+0x12b/0x330
         run_store+0x4ef/0x7b0
         kernfs_fop_write+0x200/0x420
         vfs_write+0x154/0x450
         ksys_write+0xf9/0x1d0
         do_syscall_64+0x99/0x510
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Remove the warning as there is nothing scary going on.
      
      Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
      Fixes: cbf86cfe ("ksm: remove old stable nodes more thoroughly")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9a63236f
    • D
      mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span() · 7ce700bf
      Authored by David Hildenbrand
      Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
      We should never try to touch the memmap of offline sections where we
      could have uninitialized memmaps and could trigger BUGs when calling
      page_to_nid() on poisoned pages.
      
      There is no reliable way to distinguish an uninitialized memmap from an
      initialized memmap that belongs to ZONE_DEVICE, as we don't have
      anything like SECTION_IS_ONLINE that we could use the same way
      pfn_to_online_section() is used for !ZONE_DEVICE memory.
      
      E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
      and will therefore never mark a ZONE_DEVICE zone contiguous.  No longer
      shrinking the ZONE_DEVICE zone therefore results in no observable
      changes, besides /proc/zoneinfo indicating different boundaries -
      something we can totally live with.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized with 0 and the node with the right
      value.  So the zone might be wrong but not garbage.  After that commit,
      both the zone and the node will be garbage when touching uninitialized
      memmaps.
      
      Toshiki reported a BUG (race between delayed initialization of
      ZONE_DEVICE memmaps without holding the memory hotplug lock and
      concurrent zone shrinking).
      
        https://lkml.org/lkml/2019/11/14/1040
      
      "Iteration of create and destroy namespace causes the panic as below:
      
            kernel BUG at mm/page_alloc.c:535!
            CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
            RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
            Call Trace:
             memmap_init_zone_device+0x165/0x17c
             memremap_pages+0x4c1/0x540
             devm_memremap_pages+0x1d/0x60
             pmem_attach_disk+0x16b/0x600 [nd_pmem]
             nvdimm_bus_probe+0x69/0x1c0
             really_probe+0x1c2/0x3e0
             driver_probe_device+0xb4/0x100
             device_driver_attach+0x4f/0x60
             bind_store+0xc9/0x110
             kernfs_fop_write+0x116/0x190
             vfs_write+0xa5/0x1a0
             ksys_write+0x59/0xd0
             do_syscall_64+0x5b/0x180
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        While creating a namespace and initializing memmap, if you destroy the
        namespace and shrink the zone, it will initialize the memmap outside
        the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
        pfn), page) in set_pfnblock_flags_mask()."
      
      This BUG is also mitigated by this commit, where we for now stop
      shrinking the ZONE_DEVICE zone until we can do it in a safe and clean way.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ce700bf
  5. 18 November 2019, 1 commit
    • A
      bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Authored by Andrii Nakryiko
      Add the ability to memory-map the contents of a BPF array map. This is
      extremely useful for working with BPF global data from userspace programs.
      It makes it possible to avoid typical bpf_map_{lookup,update}_elem
      operations, improving both performance and usability.
      
      There had to be special considerations for map freezing, to avoid having
      writable memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing is happening under mutex now:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accommodate this memory arrangement to free vmalloc()'ed memory correctly.
      
      One important consideration is how the memory-mapping subsystem functions.
      It provides a few optional callbacks, among them open()
      and close().  close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does
      initial refcnt bump, while open() will do any extra ones after that. Thus
      number of close() calls is equal to number of open() calls plus one more.
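
      As a rough userspace illustration (a sketch only, not part of the patch;
      it assumes a kernel and linux/bpf.h that provide BPF_F_MMAPABLE), creating
      an mmap-able array and writing to it directly might look like:

        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        int main(void)
        {
                union bpf_attr attr;
                unsigned int *data;
                int fd;

                memset(&attr, 0, sizeof(attr));
                attr.map_type = BPF_MAP_TYPE_ARRAY;
                attr.key_size = sizeof(int);
                attr.value_size = 4096;            /* one page worth of data */
                attr.max_entries = 1;
                attr.map_flags = BPF_F_MMAPABLE;   /* needs a kernel with this patch */

                fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
                if (fd < 0) {
                        perror("BPF_MAP_CREATE");
                        return 1;
                }

                /* Map the array contents instead of calling bpf_map_update_elem(). */
                data = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (data == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
                data[0] = 42;           /* immediately visible to BPF programs */
                munmap(data, 4096);
                close(fd);
                return 0;
        }
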
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
  6. 16 November 2019, 10 commits
    • R
      mm/debug.c: PageAnon() is true for PageKsm() pages · 6855ac4a
      Authored by Ralph Campbell
      PageAnon() and PageKsm() use the low two bits of the page->mapping
      pointer to indicate the page type.  PageAnon() only checks the LSB while
      PageKsm() checks the least significant 2 bits are equal to 3.
      
      Therefore, PageAnon() is true for KSM pages.  __dump_page() incorrectly
      will never print "ksm" because it checks PageAnon() first.  Fix this by
      checking PageKsm() first.
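
      A minimal sketch of the reordered check in __dump_page() (simplified from
      the surrounding mainline code):

                if (PageKsm(page))
                        type = "ksm ";
                else if (PageAnon(page))
                        type = "anon ";
                else if (mapping) {
                        /* dump the mapping-based information as before */
                }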
      
      Link: http://lkml.kernel.org/r/20191113000651.20677-1-rcampbell@nvidia.com
      Fixes: 1c6fb1d8 ("mm: print more information about mapping in __dump_page")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6855ac4a
    • R
      mm/debug.c: __dump_page() prints an extra line · 76a1850e
      Authored by Ralph Campbell
      When dumping struct page information, __dump_page() prints the page type
      with a trailing blank followed by the page flags on a separate line:
      
        anon
        flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
      
      It looks like the intent was to use pr_cont() for printing "flags:" but
      pr_cont() usage is discouraged so fix this by extending the format to
      include the flags into a single line:
      
        anon flags: 0x100000000090034(uptodate|lru|active|head|swapbacked)
      
      If the page is file backed, the name might be long so use two lines:
      
        shmem_aops name:"dev/zero"
        flags: 0x10000000008000c(uptodate|dirty|swapbacked)
      
      Eliminate pr_cont() usage as well for appending compound_mapcount.
      
      Link: http://lkml.kernel.org/r/20191112012608.16926-1-rcampbell@nvidia.com
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      76a1850e
    • V
      mm/page_io.c: do not free shared swap slots · 5df373e9
      Authored by Vinayak Menon
      The following race is observed, due to which a process faulting on a
      swap entry finds the page neither in the swapcache nor in swap.  This
      causes zram to hand out a zero-filled page that gets mapped to the
      process, resulting in a user space crash later.
      
      Consider parent and child processes Pa and Pb sharing the same swap slot
      with swap_count 2.  Swap is on zram with SWP_SYNCHRONOUS_IO set.
      Virtual address 'VA' of Pa and Pb points to the shared swap entry.
      
      Pa                                       Pb

      fault on VA                              fault on VA
      do_swap_page                             do_swap_page
      lookup_swap_cache fails                  lookup_swap_cache fails
                                               Pb scheduled out
      swapin_readahead (deletes zram entry)
      swap_free (makes swap_count 1)
                                               Pb scheduled in
                                               swap_readpage (swap_count == 1)
                                               Takes SWP_SYNCHRONOUS_IO path
                                               zram entry absent
                                               zram gives a zero filled page
      
      Fix this by making sure that swap slot is freed only when swap count
      drops down to one.
      
      Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
      Fixes: aa8d22a1 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Suggested-by: Minchan Kim <minchan@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5df373e9
    • D
      mm/memory_hotplug: fix try_offline_node() · 2c91f8fc
      Authored by David Hildenbrand
      try_offline_node() is pretty much broken right now:
      
       - The node span is updated when onlining memory, not when adding it. We
         ignore memory that was never onlined. Bad.
      
       - We touch possible garbage memmaps. The pfn_to_nid(pfn) can easily
         trigger a kernel panic. Bad for memory that is offline but also bad
         for subsection hotadd with ZONE_DEVICE, whereby the memmap of the
         first PFN of a section might contain garbage.
      
       - Sections belonging to mixed nodes are not properly considered.
      
      As memory blocks might belong to multiple nodes, we would have to walk
      all pageblocks (or at least subsections) within present sections.
      However, we don't have a way to identify whether a memmap that is not
      online was initialized (relevant for ZONE_DEVICE).  This makes things
      more complicated.
      
      Luckily, we can piggyback on the node span and the nid stored in memory
      blocks.  Currently, the node span is grown when calling
      move_pfn_range_to_zone() - e.g., when onlining memory, and shrunk when
      removing memory, before calling try_offline_node().  Sysfs links are
      created via link_mem_sections(), e.g., during boot or when adding
      memory.
      
      If the node still spans memory or if any memory block belongs to the
      nid, we don't set the node offline.  As memory blocks that span multiple
      nodes cannot get offlined, the nid stored in memory blocks is reliable
      enough (for such online memory blocks, the node still spans the memory).
      
      Introduce for_each_memory_block() to efficiently walk all memory blocks.
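
      A hedged sketch of how the walk can back try_offline_node() (simplified;
      the callback follows the idea rather than quoting the diff):

        static int check_no_memblock_for_node_cb(struct memory_block *mem, void *arg)
        {
                int nid = *(int *)arg;

                /*
                 * Blocks spanning multiple nodes are always online, so their
                 * stored nid is good enough for this check.
                 */
                return mem->nid == nid;
        }

        void try_offline_node(int nid)
        {
                pg_data_t *pgdat = NODE_DATA(nid);

                /* Node still spans memory (e.g. ZONE_DEVICE)? Keep it online. */
                if (pgdat->node_spanned_pages)
                        return;

                /* Any memory block still linked to this nid? Keep it online. */
                if (for_each_memory_block(&nid, check_no_memblock_for_node_cb))
                        return;

                node_set_offline(nid);
                unregister_one_node(nid);
        }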
      
      Note: We will soon stop shrinking the ZONE_DEVICE zone and the node span
      when removing ZONE_DEVICE memory to fix similar issues (access of
      garbage memmaps) - until we have a reliable way to identify whether
      these memmaps were properly initialized.  This implies later, that once
      a node had ZONE_DEVICE memory, we won't be able to set a node offline -
      which should be acceptable.
      
      Since commit f1dd2cd1 ("mm, memory_hotplug: do not associate
      hotadded memory to zones until online") memory that is added is not
      associated with a zone/node (memmap not initialized).  The introducing
      commit 60a5a19e ("memory-hotplug: remove sysfs file of node")
      already missed that we could have multiple nodes for a section and that
      the zone/node span is updated when onlining pages, not when adding them.
      
      I tested this by hotplugging two DIMMs to a memory-less and cpu-less
      NUMA node.  The node is properly onlined when adding the DIMMs.  When
      removing the DIMMs, the node is properly offlined.
      
      Masayoshi Mizuma reported:
      
      : Without this patch, memory hotplug fails as panic:
      :
      :  BUG: kernel NULL pointer dereference, address: 0000000000000000
      :  ...
      :  Call Trace:
      :   remove_memory_block_devices+0x81/0xc0
      :   try_remove_memory+0xb4/0x130
      :   __remove_memory+0xa/0x20
      :   acpi_memory_device_remove+0x84/0x100
      :   acpi_bus_trim+0x57/0x90
      :   acpi_bus_trim+0x2e/0x90
      :   acpi_device_hotplug+0x2b2/0x4d0
      :   acpi_hotplug_work_fn+0x1a/0x30
      :   process_one_work+0x171/0x380
      :   worker_thread+0x49/0x3f0
      :   kthread+0xf8/0x130
      :   ret_from_fork+0x35/0x40
      
      [david@redhat.com: v3]
        Link: http://lkml.kernel.org/r/20191102120221.7553-1-david@redhat.com
      Link: http://lkml.kernel.org/r/20191028105458.28320-1-david@redhat.com
      Fixes: 60a5a19e ("memory-hotplug: remove sysfs file of node")
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e8
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Cc: Tang Chen <tangchen@cn.fujitsu.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J. Wysocki" <rafael@kernel.org>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Jiri Olsa <jolsa@kernel.org>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Jani Nikula <jani.nikula@intel.com>
      Cc: Nayna Jain <nayna@linux.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2c91f8fc
    • S
      mm,thp: recheck each page before collapsing file THP · 4655e5e5
      Authored by Song Liu
      In collapse_file(), for !is_shmem case, current check cannot guarantee
      the locked page is up-to-date.  Specifically, xas_unlock_irq() should
      not be called before lock_page() and get_page(); and it is necessary to
      recheck PageUptodate() after locking the page.
      
      With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
      may contain corrupted data.  This is because khugepaged mistakenly
      collapses some not up-to-date sub pages into a huge page, and assumes
      the huge page is up-to-date.  This will NOT corrupt data in the disk,
      because the page is read-only and never written back.  Fix this by
      properly checking PageUptodate() after locking the page.  This check
      replaces "VM_BUG_ON_PAGE(!PageUptodate(page), page);".
      
      Also, move the PageDirty() check to after locking the page.  Current
      khugepaged should not try to collapse dirty file THPs, because it is
      limited to read-only .text.  The only case we hit a dirty page here is
      when the page has not yet been written back after a write.  Bail out and
      retry when this happens.
      
      syzbot reported a bug on a previous version of this patch.
      
      Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
      Fixes: 99cb0dbd ("mm,thp: add read-only THP support for (non-shmem) FS")
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4655e5e5
    • L
      mm: slub: really fix slab walking for init_on_free · aea4df4c
      Authored by Laura Abbott
      Commit 1b7e816f ("mm: slub: Fix slab walking for init_on_free")
      fixed one problem with the slab walking but missed a key detail: When
      walking the list, the head and tail pointers need to be updated since we
      end up reversing the list as a result.  Without doing this, bulk free is
      broken.
      
      One way this is exposed is a NULL pointer with slub_debug=F:
      
        =============================================================================
        BUG skbuff_head_cache (Tainted: G                T): Object already free
        -----------------------------------------------------------------------------
      
        INFO: Slab 0x000000000d2d2f8f objects=16 used=3 fp=0x0000000064309071 flags=0x3fff00000000201
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        Oops: 0000 [#1] PREEMPT SMP PTI
        RIP: 0010:print_trailer+0x70/0x1d5
        Call Trace:
         <IRQ>
         free_debug_processing.cold.37+0xc9/0x149
         __slab_free+0x22a/0x3d0
         kmem_cache_free_bulk+0x415/0x420
         __kfree_skb_flush+0x30/0x40
         net_rx_action+0x2dd/0x480
         __do_softirq+0xf0/0x246
         irq_exit+0x93/0xb0
         do_IRQ+0xa0/0x110
         common_interrupt+0xf/0xf
         </IRQ>
      
      Given we're now almost identical to the existing debugging code which
      correctly walks the list, combine with that.
      
      Link: https://lkml.kernel.org/r/20191104170303.GA50361@gandi.net
      Link: http://lkml.kernel.org/r/20191106222208.26815-1-labbott@redhat.com
      Fixes: 1b7e816f ("mm: slub: Fix slab walking for init_on_free")
      Signed-off-by: Laura Abbott <labbott@redhat.com>
      Reported-by: Thibaut Sautereau <thibaut.sautereau@clip-os.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Tested-by: Alexander Potapenko <glider@google.com>
      Acked-by: Alexander Potapenko <glider@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <clipos@ssi.gouv.fr>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      aea4df4c
    • R
      mm: hugetlb: switch to css_tryget() in hugetlb_cgroup_charge_cgroup() · 0362f326
      Authored by Roman Gushchin
      An exiting task might belong to an offline cgroup.  In this case an
      attempt to grab a cgroup reference from the task can end up with an
      infinite loop in hugetlb_cgroup_charge_cgroup(), because neither the
      cgroup will become online, neither the task will be migrated to a live
      cgroup.
      
      Fix this by switching over to css_tryget().  As css_tryget_online()
      can't guarantee that the cgroup won't go offline, in most cases the
      check doesn't make sense.  In this particular case users of
      hugetlb_cgroup_charge_cgroup() are not affected by this change.
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-2-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0362f326
    • R
      mm: memcg: switch to css_tryget() in get_mem_cgroup_from_mm() · 00d484f3
      Authored by Roman Gushchin
      We've encountered a rcu stall in get_mem_cgroup_from_mm():
      
        rcu: INFO: rcu_sched self-detected stall on CPU
        rcu: 33-....: (21000 ticks this GP) idle=6c6/1/0x4000000000000002 softirq=35441/35441 fqs=5017
        (t=21031 jiffies g=324821 q=95837) NMI backtrace for cpu 33
        <...>
        RIP: 0010:get_mem_cgroup_from_mm+0x2f/0x90
        <...>
         __memcg_kmem_charge+0x55/0x140
         __alloc_pages_nodemask+0x267/0x320
         pipe_write+0x1ad/0x400
         new_sync_write+0x127/0x1c0
         __kernel_write+0x4f/0xf0
         dump_emit+0x91/0xc0
         writenote+0xa0/0xc0
         elf_core_dump+0x11af/0x1430
         do_coredump+0xc65/0xee0
         get_signal+0x132/0x7c0
         do_signal+0x36/0x640
         exit_to_usermode_loop+0x61/0xd0
         do_syscall_64+0xd4/0x100
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The problem is caused by an exiting task which is associated with an
      offline memcg.  We're iterating over and over in the do {} while
      (!css_tryget_online()) loop, but obviously the memcg won't become online
      and the exiting task won't be migrated to a live memcg.
      
      Let's fix it by switching from css_tryget_online() to css_tryget().
      
      As css_tryget_online() cannot guarantee that the memcg won't go offline,
      the check is usually useless, except in some rare cases when, for example,
      it determines whether something should be presented to a user.
      
      A similar problem is described by commit 18fa84a2 ("cgroup: Use
      css_tryget() instead of css_tryget_online() in task_get_css()").
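
      A minimal sketch of the changed loop in get_mem_cgroup_from_mm()
      (simplified from mainline):

                rcu_read_lock();
                do {
                        memcg = mem_cgroup_from_task(rcu_dereference(mm->owner));
                        if (unlikely(!memcg))
                                memcg = root_mem_cgroup;
                /*
                 * css_tryget() succeeds even for an offline memcg, so an
                 * exiting task in a dead cgroup can no longer spin here.
                 */
                } while (!css_tryget(&memcg->css));
                rcu_read_unlock();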
      
      Johannes:
      
      : The bug aside, it doesn't matter whether the cgroup is online for the
      : callers.  It used to matter when offlining needed to evacuate all charges
      : from the memcg, and so needed to prevent new ones from showing up, but we
      : don't care now.
      
      Link: http://lkml.kernel.org/r/20191106225131.3543616-1-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      00d484f3
    • Z
      mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 82072962
      Authored by zhong jiang
      Recently, I hit the following issue when running upstream.
      
        kernel BUG at mm/vmscan.c:1521!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
        RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
        Call Trace:
         reclaim_pages+0x499/0x800 mm/vmscan.c:2188
         madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
         walk_pmd_range mm/pagewalk.c:53 [inline]
         walk_pud_range mm/pagewalk.c:112 [inline]
         walk_p4d_range mm/pagewalk.c:139 [inline]
         walk_pgd_range mm/pagewalk.c:166 [inline]
         __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
         walk_page_range+0x179/0x310 mm/pagewalk.c:349
         madvise_pageout_page_range mm/madvise.c:506 [inline]
         madvise_pageout+0x1f0/0x330 mm/madvise.c:542
         madvise_vma mm/madvise.c:931 [inline]
         __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
         do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      madvise_pageout() accesses the specified range of the vma, isolates the
      pages, and then runs shrink_page_list() to reclaim their memory.  But it
      also isolates unevictable pages for reclaim.  Hence, we hit the BUG in
      shrink_page_list().
      
      The root cause is that we scan the page tables instead of a specific LRU
      list, and so we need to filter out the unevictable LRU pages on our
      end.
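
      A hedged sketch of the filtering in madvise_cold_or_pageout_pte_range()
      (simplified; the upstream change touches both the huge-page and the
      pte-mapped paths):

                if (pageout) {
                        if (!isolate_lru_page(page)) {
                                /*
                                 * Unevictable pages (e.g. mlocked) must not be
                                 * fed to shrink_page_list(); put them back.
                                 */
                                if (PageUnevictable(page))
                                        putback_lru_page(page);
                                else
                                        list_add(&page->lru, &page_list);
                        }
                }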
      
      Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
      Fixes: 1a4e58cc ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      82072962
    • Y
      mm: mempolicy: fix the wrong return value and potential pages leak of mbind · a85dfc30
      Authored by Yang Shi
      Commit d8835445 ("mm: mempolicy: make the behavior consistent when
      MPOL_MF_MOVE* and MPOL_MF_STRICT were specified") fixed the return value
      of mbind() for a couple of corner cases.  But it altered the errno for
      some other cases; for example, mbind() should return -EFAULT when part
      or all of the memory range specified by nodemask and maxnode points
      outside your accessible address space, or when there was an unmapped hole
      in the memory range specified by addr and len.
      
      Fix this by preserving the errno returned by queue_pages_range().  Also,
      the pagelist may not be empty even though queue_pages_range() returns an
      error; put the pages back on the LRU since mbind_range() is not called to
      actually apply the policy, so those pages should not be migrated.  This is
      also the old behavior before the problematic commit.
      
      Link: http://lkml.kernel.org/r/1572454731-3925-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: d8835445 ("mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reported-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Reviewed-by: Li Xinhai <lixinhai.lxh@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[4.19 and 5.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a85dfc30
  7. 13 November 2019, 1 commit
  8. 07 November 2019, 10 commits
    • J
      mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 869712fd
      Authored by Johannes Weiner
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accommodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaiming on behalf of atomic allocations onto the regular
      allocations that can block.
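
      A minimal sketch of the idea in try_charge() (simplified; "force" is the
      existing label that charges the memcg without enforcing the limit):

                /*
                 * Atomic allocations cannot reclaim; charge them past the
                 * limit and let later blocking allocations pay the debt.
                 */
                if (gfp_mask & __GFP_ATOMIC)
                        goto force;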
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6 ("net, mm: account sock objects to kmemcg")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      869712fd
    • D
      mm/memory_hotplug: fix updating the node span · 656d5711
      Authored by David Hildenbrand
      We recently started updating the node span based on the zone span to
      avoid touching uninitialized memmaps.
      
      Currently, we will always detect the node span to start at 0, meaning a
      node can easily span too many pages.  pgdat_is_empty() will still work
      correctly if all zones span no pages.  We should skip over all zones
      without spanned pages and properly handle the first detected zone that
      spans pages.
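
      A hedged sketch of the corrected detection in update_pgdat_span()
      (simplified): zones without spanned pages no longer drag the detected
      node start down to 0.

                unsigned long node_start_pfn = 0, node_end_pfn = 0;
                struct zone *zone;

                for (zone = pgdat->node_zones;
                     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
                        unsigned long end = zone->zone_start_pfn +
                                            zone->spanned_pages;

                        /* Empty zones must not influence the node span. */
                        if (!zone->spanned_pages)
                                continue;
                        /* The first zone with pages initializes the span. */
                        if (!node_end_pfn) {
                                node_start_pfn = zone->zone_start_pfn;
                                node_end_pfn = end;
                                continue;
                        }
                        if (end > node_end_pfn)
                                node_end_pfn = end;
                        if (zone->zone_start_pfn < node_start_pfn)
                                node_start_pfn = zone->zone_start_pfn;
                }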
      
      Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
      span cannot easily be inspected and tested.  The node span gives no real
      guarantees when an architecture supports memory hotplug, meaning it can
      easily contain holes or span pages of different nodes.
      
      The node span is not really used after init on architectures that
      support memory hotplug.
      
      E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
      mm/kmemleak.c:kmemleak_scan().  These users seem to be fine.
      
      Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
      Fixes: 00d6c019 ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      656d5711
    • R
      mm: slab: make page_cgroup_ino() to recognize non-compound slab pages properly · 221ec5c0
      Authored by Roman Gushchin
      page_cgroup_ino() doesn't return a valid memcg pointer for non-compound
      slab pages, because it depends on PgHead AND PgSlab flags to be set to
      determine the memory cgroup from the kmem_cache.  It's correct for
      compound pages, but not for generic small pages.  Those don't have PgHead
      set, so it ends up returning zero.
      
      Fix this by replacing the condition to PageSlab() && !PageTail().
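
      A minimal sketch of the corrected condition in page_cgroup_ino()
      (simplified):

                /*
                 * Heads of compound slabs and plain (non-compound) slab pages
                 * both carry the kmem_cache information; only tails do not.
                 */
                if (PageSlab(page) && !PageTail(page))
                        memcg = memcg_from_slab_page(page);
                else
                        memcg = READ_ONCE(page->mem_cgroup);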
      
      Before this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	        38        0  _______S___________________________________	slab
      
      After this patch:
        [root@localhost ~]# ./page-types -c /sys/fs/cgroup/user.slice/user-0.slice/user@0.service/ | grep slab
        0x0000000000000080	       147        0  _______S___________________________________	slab
      
      Also, hwpoison_filter_task() uses output of page_cgroup_ino() in order
      to filter error injection events based on memcg.  So if
      page_cgroup_ino() fails to return memcg pointer, we just fail to inject
      memory error.  Considering that hwpoison filter is for testing, affected
      users are limited and the impact should be marginal.
      
      [n-horiguchi@ah.jp.nec.com: changelog additions]
      Link: http://lkml.kernel.org/r/20191031012151.2722280-1-guro@fb.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      221ec5c0
    • J
      mm/page_alloc.c: ratelimit allocation failure warnings more aggressively · 1be334e5
      Authored by Johannes Weiner
      While investigating a bug related to higher atomic allocation failures,
      we noticed the failure warnings positively drowning the console, and in
      our case trigger lockup warnings because of a serial console too slow to
      handle all that output.
      
      But even if we had a faster console, it's unclear what additional
      information the current level of repetition provides.
      
      Allocation failures happen for three reasons: The machine is OOM, the VM
      is failing to handle reasonable requests, or somebody is making
      unreasonable requests (and didn't acknowledge their opportunism with
      __GFP_NOWARN).  Having the memory dump, a callstack, and the ratelimit
      stats on skipped failure warnings should provide enough information to
      let users/admins/developers know whether something is wrong and point
      them in the right direction for debugging, bpftracing etc.
      
      Limit allocation failure warnings to one spew every ten seconds.
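
      A minimal sketch of the tightened rate limit in warn_alloc() (simplified;
      the ratelimit code already reports how many callbacks were suppressed):

                static DEFINE_RATELIMIT_STATE(nopage_rs, 10 * HZ, 1);

                if ((gfp_mask & __GFP_NOWARN) || !__ratelimit(&nopage_rs))
                        return;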
      
      Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1be334e5
    • V
      mm/khugepaged: fix might_sleep() warn with CONFIG_HIGHPTE=y · ec649c9d
      Authored by Ville Syrjälä
      I got some khugepaged spew on a 32bit x86:
      
        BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
        in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
        INFO: lockdep is turned off.
        CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
        Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203    07/08/2009
        Call Trace:
         dump_stack+0x66/0x8e
         ___might_sleep.cold.96+0x95/0xa6
         __might_sleep+0x2e/0x80
         collapse_huge_page.isra.51+0x5ac/0x1360
         khugepaged+0x9a9/0x20f0
         kthread+0xf5/0x110
         ret_from_fork+0x2e/0x38
      
      Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
      vs.  mmu_notifier_invalidate_range_start().  Let's do the naive approach
      and just reorder the two operations.
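
      A hedged sketch of the reordering in collapse_huge_page() (simplified):
      start the notifier, which may sleep, before taking the atomic kmap via
      pte_offset_map().

                mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, NULL, mm,
                                        address, address + HPAGE_PMD_SIZE);
                mmu_notifier_invalidate_range_start(&range);    /* may sleep */

                pte = pte_offset_map(pmd, address);  /* kmap_atomic() with HIGHPTE */
                pte_ptl = pte_lockptr(mm, pmd);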
      
      Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
      Fixes: 810e24e0 ("mm/mmu_notifiers: annotate with might_sleep()")
      Signed-off-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec649c9d
    • M
      mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo · 93b3a674
      Authored by Michal Hocko
      pagetypeinfo_showfree_print is called with zone->lock held in IRQ mode.
      This is not really nice because it blocks both any interrupts on that
      cpu and the page allocator.  On large machines this might even trigger
      the hard lockup detector.
      
      Considering that pagetypeinfo is a debugging tool, we do not really need
      exact numbers here.  The primary reason to look at the output is to see
      how pageblocks are spread among different migratetypes, and a low number
      of pages is much more interesting; therefore, putting a bound on the
      number of pages counted on each free_list sounds like a reasonable tradeoff.
      
      The new output will simply tell
        [...]
        Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648
      
      instead of
        Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648
      
      The limit has been chosen arbitrarily and is subject to future change
      should there be a need for that.
      
      While we are at it, also drop the zone lock after each free_list
      iteration, which will help the IRQ and page allocator responsiveness even
      further, as the lock hold time is now always bounded by those 100k
      pages.
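
      A hedged sketch of the capped walk in pagetypeinfo_showfree_print()
      (simplified):

                unsigned long freecount = 0;
                bool overflow = false;
                struct list_head *curr;

                list_for_each(curr, &area->free_list[mtype]) {
                        /*
                         * Bound the walk: this is a debugging interface, and a
                         * huge free_list under zone->lock can trigger lockups.
                         */
                        if (++freecount >= 100000) {
                                overflow = true;
                                break;
                        }
                }
                seq_printf(m, "%s%6lu ", overflow ? ">" : "", freecount);

                /* Let IRQs and the page allocator in between free_lists. */
                spin_unlock_irq(&zone->lock);
                cond_resched();
                spin_lock_irq(&zone->lock);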
      
      [akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
      Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Waiman Long <longman@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Jann Horn <jannh@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Song Liu <songliubraving@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93b3a674
    • M
      mm, vmstat: hide /proc/pagetypeinfo from normal users · abaed011
      Authored by Michal Hocko
      /proc/pagetypeinfo is a debugging tool to examine internal page
      allocator state with respect to fragmentation.  It is not very useful for
      anything else, so normal users really do not need to read this file.
      
      Waiman Long has noticed that reading this file can have negative side
      effects because zone->lock is necessary for gathering data and that a)
      interferes with the page allocator and its users and b) can lead to hard
      lockups on large machines which have very long free_list.
      
      Reduce both issues by simply not exporting the file to regular users.
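
      A minimal sketch of the change (assuming the existing proc_create_seq()
      registration in mm/vmstat.c): drop world readability so only root can
      trigger the expensive zone->lock walk.

                proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);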
      
      Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Waiman Long <longman@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jann Horn <jannh@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      abaed011
    • J
      mm/mmu_notifiers: use the right return code for WARN_ON · df2ec764
      Authored by Jason Gunthorpe
      The return code from the op callback is actually in _ret, while the
      WARN_ON was checking ret which causes it to misfire.
      
      Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
      Fixes: 8402ce61 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      df2ec764
    • M
      mm, meminit: recalculate pcpu batch and high limits after init completes · 3e8fc007
      Authored by Mel Gorman
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max.  The configuration
      was modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before the patch, the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
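
      A minimal sketch of the recalculation at the end of deferred
      initialisation (simplified from page_alloc_init_late()):

                /* Block until all deferred struct pages are initialised. */
                wait_for_completion(&pgdat_init_all_done_comp);

                /*
                 * managed_pages changed during deferred init, so recompute the
                 * per-cpu pageset batch and high limits for every zone.
                 */
                for_each_populated_zone(zone)
                        zone_pcp_update(zone);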
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
      Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e8fc007
    • S
      mm: memcontrol: fix NULL-ptr deref in percpu stats flush · 7961eee3
      Authored by Shakeel Butt
      __mem_cgroup_free() can be called on the failure path in
      mem_cgroup_alloc().  However memcg_flush_percpu_vmstats() and
      memcg_flush_percpu_vmevents() which are called from __mem_cgroup_free()
      access the fields of memcg which can potentially be null if called from
      failure path from mem_cgroup_alloc().  Indeed syzbot has reported the
      following crash:
      
      	kasan: CONFIG_KASAN_INLINE enabled
      	kasan: GPF could be caused by NULL-ptr deref or user memory access
      	general protection fault: 0000 [#1] PREEMPT SMP KASAN
      	CPU: 0 PID: 30393 Comm: syz-executor.1 Not tainted 5.4.0-rc2+ #0
      	Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      	RIP: 0010:memcg_flush_percpu_vmstats+0x4ae/0x930 mm/memcontrol.c:3436
      	Code: 05 41 89 c0 41 0f b6 04 24 41 38 c7 7c 08 84 c0 0f 85 5d 03 00 00 44 3b 05 33 d5 12 08 0f 83 e2 00 00 00 4c 89 f0 48 c1 e8 03 <42> 80 3c 28 00 0f 85 91 03 00 00 48 8b 85 10 fe ff ff 48 8b b0 90
      	RSP: 0018:ffff888095c27980 EFLAGS: 00010206
      	RAX: 0000000000000012 RBX: ffff888095c27b28 RCX: ffffc90008192000
      	RDX: 0000000000040000 RSI: ffffffff8340fae7 RDI: 0000000000000007
      	RBP: ffff888095c27be0 R08: 0000000000000000 R09: ffffed1013f0da33
      	R10: ffffed1013f0da32 R11: ffff88809f86d197 R12: fffffbfff138b760
      	R13: dffffc0000000000 R14: 0000000000000090 R15: 0000000000000007
      	FS:  00007f5027170700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      	CR2: 0000000000710158 CR3: 00000000a7b18000 CR4: 00000000001406f0
      	DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      	DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      	Call Trace:
      	__mem_cgroup_free+0x1a/0x190 mm/memcontrol.c:5021
      	mem_cgroup_free mm/memcontrol.c:5033 [inline]
      	mem_cgroup_css_alloc+0x3a1/0x1ae0 mm/memcontrol.c:5160
      	css_create kernel/cgroup/cgroup.c:5156 [inline]
      	cgroup_apply_control_enable+0x44d/0xc40 kernel/cgroup/cgroup.c:3119
      	cgroup_mkdir+0x899/0x11b0 kernel/cgroup/cgroup.c:5401
      	kernfs_iop_mkdir+0x14d/0x1d0 fs/kernfs/dir.c:1124
      	vfs_mkdir+0x42e/0x670 fs/namei.c:3807
      	do_mkdirat+0x234/0x2a0 fs/namei.c:3830
      	__do_sys_mkdir fs/namei.c:3846 [inline]
      	__se_sys_mkdir fs/namei.c:3844 [inline]
      	__x64_sys_mkdir+0x5c/0x80 fs/namei.c:3844
      	do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
      	entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Fix this by moving the flush to mem_cgroup_free(), as there is no need
      to flush anything if we see a failure in mem_cgroup_alloc().
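
      A minimal sketch of the move (simplified; helper signatures may differ
      slightly from mainline): flush from mem_cgroup_free(), which only runs on
      fully constructed memcgs, instead of from __mem_cgroup_free(), which is
      also reached on the allocation failure path.

                static void mem_cgroup_free(struct mem_cgroup *memcg)
                {
                        memcg_wb_domain_exit(memcg);
                        /* Safe here: allocation succeeded, percpu data exists. */
                        memcg_flush_percpu_vmstats(memcg);
                        memcg_flush_percpu_vmevents(memcg);
                        __mem_cgroup_free(memcg);
                }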
      
      Link: http://lkml.kernel.org/r/20191018165231.249872-1-shakeelb@google.com
      Fixes: bb65f89b ("mm: memcontrol: flush percpu vmevents before releasing memcg")
      Fixes: c350a99e ("mm: memcontrol: flush percpu vmstats before releasing memcg")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+515d5bcfe179cdf049b2@syzkaller.appspotmail.com
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7961eee3
  9. 06 November 2019, 3 commits
    • T
      mm: Add write-protect and clean utilities for address space ranges · c5acad84
      Authored by Thomas Hellstrom
      Add two utilities to 1) write-protect and 2) clean all ptes pointing into
      a range of an address space.
      The utilities are intended to aid in tracking dirty pages (either
      driver-allocated system memory or pci device memory).
      The write-protect utility should be used in conjunction with
      page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
      accesses. Typically one would want to use this on sparse accesses into
      large memory regions. The clean utility should be used to utilize
      hardware dirtying functionality and avoid the overhead of page-faults,
      typically on large accesses into small memory regions.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      c5acad84
    • T
      mm: Add a walk_page_mapping() function to the pagewalk code · ecaad8ac
      Authored by Thomas Hellstrom
      For users that want to traverse all page table entries pointing into a
      region of a struct address_space mapping, introduce a walk_page_mapping()
      function.
      
      The walk_page_mapping() function will initially be used for dirty-
      tracking in virtual graphics drivers.
      
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      ecaad8ac
    • T
      mm: pagewalk: Take the pagetable lock in walk_pte_range() · ace88f10
      Authored by Thomas Hellstrom
      Without the lock, anybody modifying a pte from within this function might
      have it concurrently modified by someone else.
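
      A minimal sketch of the change in walk_pte_range() (simplified): take the
      pte lock around the walk.

                spinlock_t *ptl;
                pte_t *pte;

                pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
                for (;;) {
                        err = ops->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
                        if (err || addr >= end - PAGE_SIZE)
                                break;
                        addr += PAGE_SIZE;
                        pte++;
                }
                pte_unmap_unlock(pte, ptl);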
      
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      ace88f10
  10. 03 November 2019, 2 commits
  11. 30 October 2019, 1 commit
  12. 19 October 2019, 3 commits