1. 27 Dec 2019 (1 commit)
  2. 18 Dec 2019 (2 commits)
    • mm, thp, proc: report THP eligibility for each vma · c76adee3
      Committed by Michal Hocko
      [ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
      The only way to deduce this now is to query for hg resp.  nh flags and
      confront the state with the global setting.  Except that there is also
      PR_SET_THP_DISABLE that might change the picture.  So the final logic is
      not trivial.  Moreover the eligibility of the vma depends on the type of
      VMA as well.  In the past we have supported only anonymous memory VMAs
      but things have changed and shmem based vmas are supported as well these
      days and the query logic gets even more complicated because the
      eligibility depends on the mount option and another global configuration
      knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
      __show_smap just uses the new transparent_hugepage_enabled which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
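
      Roughly, the resulting split looks like this (a sketch pieced together
      from the description above, not a verbatim copy of the patch; the
      shmem_huge_enabled() call stands in for the shmem mount-option check):

      	/* inline core checks: caller knows the vma type supports THP */
      	static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
      	{
      		/* global and per-vma enabled state, PR_SET_THP_DISABLE, ... */
      	}

      	/* out of line: also checks whether the vma itself is supported */
      	bool transparent_hugepage_enabled(struct vm_area_struct *vma)
      	{
      		if (vma_is_anonymous(vma))
      			return __transparent_hugepage_enabled(vma);
      		if (vma_is_shmem(vma))
      			return shmem_huge_enabled(vma);
      		return false;
      	}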
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/shmem.c: cast the type of unmap_start to u64 · 32b02bfd
      Committed by Chen Jun
      commit aa71ecd8d86500da6081a72da6b0b524007e0627 upstream.
      
      On a 64-bit system, sb->s_maxbytes of the shmem filesystem is
      MAX_LFS_FILESIZE, which equals LLONG_MAX.

      If offset > LLONG_MAX - PAGE_SIZE, offset + len < LLONG_MAX in
      shmem_fallocate, which will pass the check in vfs_fallocate.
      
      	/* Check for wrap through zero too */
      	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
      		return -EFBIG;
      
      loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate
      causes an overflow.

      Syzkaller reports an overflow problem in mm/shmem:
      
        UBSAN: Undefined behaviour in mm/shmem.c:2014:10
        signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
        CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
        Hardware name: linux, dummy-virt (DT)
        Call trace:
           dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
           show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
           __dump_stack lib/dump_stack.c:15 [inline]
           ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
           handle_overflow+0x158/0x1b0 lib/ubsan.c:195
           shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
           vfs_fallocate+0x238/0x428 fs/open.c:312
           SYSC_fallocate fs/open.c:335 [inline]
           SyS_fallocate+0x54/0xc8 fs/open.c:239
      
      The high bits of unmap_start get filled with the sign bit (overflow)
      when calculating shmem_falloc.start:
      
          shmem_falloc.start = unmap_start >> PAGE_SHIFT.
      
      Fix it by casting unmap_start to u64 when it is right shifted.
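
      The fix is a one-line cast so the shift is done on an unsigned 64-bit
      value (sketch of the fixed line in shmem_fallocate()):

      	shmem_falloc.start = (u64)unmap_start >> PAGE_SHIFT;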
      
      This bug was found in LTS Linux 4.1.  It also seems to exist in mainline.
      
      Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  3. 13 Dec 2019 (1 commit)
  4. 05 Dec 2019 (4 commits)
  5. 01 Dec 2019 (9 commits)
    • mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span() · 5779cbc9
      Committed by David Hildenbrand
      commit 7ce700bf11b5e2cb84e4352bbdf2123a7a239c84 upstream.
      
      Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
      We should never try to touch the memmap of offline sections where we
      could have uninitialized memmaps and could trigger BUGs when calling
      page_to_nid() on poisoned pages.
      
      There is no reliable way to distinguish an uninitialized memmap from an
      initialized memmap that belongs to ZONE_DEVICE, as we don't have
      anything like SECTION_IS_ONLINE we can use similar to
      pfn_to_online_section() for !ZONE_DEVICE memory.
      
      E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
      and will therefore never mark a ZONE_DEVICE zone contiguous.  Stopping
      to shrink the ZONE_DEVICE therefore results in no observable changes,
      besides /proc/zoneinfo indicating different boundaries - something we
      can totally live with.
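
      Concretely, zone shrinking now bails out early for ZONE_DEVICE (a
      sketch of the check at the top of shrink_zone_span()):

      	static void shrink_zone_span(struct zone *zone, unsigned long start_pfn,
      				     unsigned long end_pfn)
      	{
      		/* offline ZONE_DEVICE sections may have uninitialized memmaps */
      		if (zone_idx(zone) == ZONE_DEVICE)
      			return;
      		...
      	}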
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized with 0 and the node with the right
      value.  So the zone might be wrong but not garbage.  After that commit,
      both the zone and the node will be garbage when touching uninitialized
      memmaps.
      
      Toshiki reported a BUG (race between delayed initialization of
      ZONE_DEVICE memmaps without holding the memory hotplug lock and
      concurrent zone shrinking).
      
        https://lkml.org/lkml/2019/11/14/1040
      
      "Iteration of create and destroy namespace causes the panic as below:
      
            kernel BUG at mm/page_alloc.c:535!
            CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
            RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
            Call Trace:
             memmap_init_zone_device+0x165/0x17c
             memremap_pages+0x4c1/0x540
             devm_memremap_pages+0x1d/0x60
             pmem_attach_disk+0x16b/0x600 [nd_pmem]
             nvdimm_bus_probe+0x69/0x1c0
             really_probe+0x1c2/0x3e0
             driver_probe_device+0xb4/0x100
             device_driver_attach+0x4f/0x60
             bind_store+0xc9/0x110
             kernfs_fop_write+0x116/0x190
             vfs_write+0xa5/0x1a0
             ksys_write+0x59/0xd0
             do_syscall_64+0x5b/0x180
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        While creating a namespace and initializing memmap, if you destroy the
        namespace and shrink the zone, it will initialize the memmap outside
        the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
        pfn), page) in set_pfnblock_flags_mask()."
      
      This BUG is also mitigated by this commit, where we for now stop to
      shrink the ZONE_DEVICE zone until we can do it in a safe and clean way.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: Toshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/page_io.c: do not free shared swap slots · 006360ec
      Committed by Vinayak Menon
      [ Upstream commit 5df373e95689b9519b8557da7c5bd0db0856d776 ]
      
      The following race is observed due to which a processes faulting on a
      swap entry, finds the page neither in swapcache nor swap.  This causes
      zram to give a zero filled page that gets mapped to the process,
      resulting in a user space crash later.
      
      Consider parent and child processes Pa and Pb sharing the same swap slot
      with swap_count 2.  Swap is on zram with SWP_SYNCHRONOUS_IO set.
      Virtual address 'VA' of Pa and Pb points to the shared swap entry.
      
      Pa                                       Pb
      
      fault on VA                              fault on VA
      do_swap_page                             do_swap_page
      lookup_swap_cache fails                  lookup_swap_cache fails
                                               Pb scheduled out
      swapin_readahead (deletes zram entry)
      swap_free (makes swap_count 1)
                                               Pb scheduled in
                                               swap_readpage (swap_count == 1)
                                               Takes SWP_SYNCHRONOUS_IO path
                                               zram entry absent
                                               zram gives a zero filled page
      
      Fix this by making sure that the swap slot is freed only when the swap
      count drops to one.
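
      Conceptually (a hypothetical sketch of the idea, not the literal diff;
      page_swapcount() is an existing helper, and swap_slot_free_notify()
      stands in for the zram slot-free callback path):

      	/* only hand the slot back to the backing device (zram) once no
      	 * other process still references this swap entry */
      	if (page_swapcount(page) == 1)
      		swap_slot_free_notify(page);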
      
      Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
      Fixes: aa8d22a1 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
      Signed-off-by: Vinayak Menon <vinmenon@codeaurora.org>
      Suggested-by: Minchan Kim <minchan@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: handle no memcg case in memcg_kmem_charge() properly · 6a2245d8
      Committed by Roman Gushchin
      [ Upstream commit e68599a3c3ad0f3171a7cb4e48aa6f9a69381902 ]
      
      Mike Galbraith reported a regression caused by the commit 9b6f7e163cd0
      ("mm: rework memcg kernel stack accounting") on a system with
      "cgroup_disable=memory" boot option: the system panics with the following
      stack trace:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
        RIP: 0010:page_counter_try_charge+0x22/0xc0
        Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
        Call Trace:
         try_charge+0xcb/0x780
         memcg_kmem_charge_memcg+0x28/0x80
         memcg_kmem_charge+0x8b/0x1d0
         copy_process.part.41+0x1ca/0x2070
         _do_fork+0xd7/0x3d0
         do_syscall_64+0x5a/0x180
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      The problem occurs because get_mem_cgroup_from_current() returns the NULL
      pointer if memory controller is disabled.  Let's check if this is a case
      at the beginning of memcg_kmem_charge() and just return 0 if
      mem_cgroup_disabled() returns true.  This is how we handle this case in
      many other places in the memory controller code.
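
      The fix is a one-line addition to the bail-out at the top of
      memcg_kmem_charge() (sketch; memcg_kmem_bypass() is the pre-existing
      check):

      	int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
      	{
      		struct mem_cgroup *memcg;
      		int ret = 0;

      		/* with cgroup_disable=memory there is no memcg to charge */
      		if (mem_cgroup_disabled() || memcg_kmem_bypass())
      			return 0;
      		...
      	}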
      
      Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
      Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Reported-by: Mike Galbraith <efault@gmx.de>
      Acked-by: Rik van Riel <riel@surriel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock · 17523d7a
      Committed by David Hildenbrand
      [ Upstream commit 381eab4a6ee81266f8dddc62e57376c7e584e5b8 ]
      
      There seem to be some problems as result of 30467e0b ("mm, hotplug:
      fix concurrent memory hot-add deadlock"), which tried to fix a possible
      lock inversion reported and discussed in [1] due to the two locks
      	a) device_lock()
      	b) mem_hotplug_lock
      
      While add_memory() first takes b), followed by a) during
      bus_probe_device(), onlining of memory from user space first took a),
      followed by b), exposing a possible deadlock.
      
      In [1], it was decided to not make use of device_hotplug_lock, but
      rather to enforce a locking order.
      
      The problems I spotted related to this:
      
      1. Memory block device attributes: While .state first calls
         mem_hotplug_begin() and then calls device_online() - which takes
         device_lock() - .online no longer calls mem_hotplug_begin(), so it
         effectively calls online_pages() without mem_hotplug_lock.
      
      2. device_online() should be called under device_hotplug_lock, however
         onlining memory during add_memory() does not take care of that.
      
      In addition, I think there is also something wrong about the locking in
      
      3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
         without locks. This was introduced after 30467e0b. And skimming over
         the code, I assume it could need some more care in regards to locking
         (e.g. device_online() called without device_hotplug_lock). This will
         be addressed in the following patches.
      
      Now that we hold the device_hotplug_lock when
      - adding memory (e.g. via add_memory()/add_memory_resource())
      - removing memory (e.g. via remove_memory())
      - device_online()/device_offline()
      
      We can move mem_hotplug_lock usage back into
      online_pages()/offline_pages().
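
      With the lock ordering now guaranteed by device_hotplug_lock, both
      entry points can take mem_hotplug_lock themselves (sketch):

      	int online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
      	{
      		int ret;

      		mem_hotplug_begin();
      		/* ... actual onlining, under mem_hotplug_lock ... */
      		mem_hotplug_done();
      		return ret;
      	}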
      
      Why is mem_hotplug_lock still needed? Essentially to make
      get_online_mems()/put_online_mems() be very fast (relying on
      device_hotplug_lock would be very slow), and to serialize against
      addition of memory that does not create memory block devices (hmm).
      
      [1] http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-February/065324.html
      
      This patch is partly based on a patch by Vitaly Kuznetsov.
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/memory_hotplug: make add_memory() take the device_hotplug_lock · 02735d59
      Committed by David Hildenbrand
      [ Upstream commit 8df1d0e4a265f25dc1e7e7624ccdbcb4a6630c89 ]
      
      add_memory() currently does not take the device_hotplug_lock, however
      it is already called under the lock from
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      to synchronize against CPU hot-remove and similar.
      
      In general, we should hold the device_hotplug_lock when adding memory to
      synchronize against online/offline request (e.g.  from user space) - which
      already resulted in lock inversions due to device_lock() and
      mem_hotplug_lock - see 30467e0b ("mm, hotplug: fix concurrent memory
      hot-add deadlock").  add_memory()/add_memory_resource() will create memory
      block devices, so this really feels like the right thing to do.
      
      Holding the device_hotplug_lock makes sure that a memory block device
      can really only be accessed (e.g. via .online/.state) from user space,
      once the memory has been fully added to the system.
      
      The lock is not held yet in
      	drivers/xen/balloon.c
      	arch/powerpc/platforms/powernv/memtrace.c
      	drivers/s390/char/sclp_cmd.c
      	drivers/hv/hv_balloon.c
      So, let's either use the locked variants or take the lock.
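
      The locked variant is a thin wrapper around the unlocked worker (a
      sketch of the pattern; __add_memory() is the name the unlocked variant
      gets upstream):

      	/* caller must hold device_hotplug_lock */
      	int __ref __add_memory(int nid, u64 start, u64 size)
      	{
      		...
      	}

      	int add_memory(int nid, u64 start, u64 size)
      	{
      		int rc;

      		lock_device_hotplug();
      		rc = __add_memory(nid, start, size);
      		unlock_device_hotplug();

      		return rc;
      	}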
      
      Don't export add_memory_resource(), as it once was exported to be used by
      XEN, which is never built as a module.  If somebody requires it, we also
      have to export a locked variant (as device_hotplug_lock is never
      exported).
      
      Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: Rashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/gup_benchmark.c: prevent integer overflow in ioctl · 30598425
      Committed by Dan Carpenter
      [ Upstream commit 4b408c74ee5a0b74fc9265c2fe39b0e7dec7c056 ]
      
      The concern here is that "gup->size" is a u64 and "nr_pages" is unsigned
      long.  On 32 bit systems we could trick the kernel into allocating fewer
      pages than expected.
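
      The fix rejects sizes that do not fit in an unsigned long before
      nr_pages is computed (sketch; on 64-bit the check is a no-op, on
      32-bit it closes the truncation):

      	if (gup->size > ULONG_MAX)
      		return -EINVAL;

      	nr_pages = gup->size / PAGE_SIZE;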
      
      Link: http://lkml.kernel.org/r/20181025061546.hnhkv33diogf2uis@kili.mountain
      Fixes: 64c349f4 ("mm: add infrastructure for get_user_pages_fast() benchmarking")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · 4291e97c
      Committed by Andrea Arcangeli
      [ Upstream commit d7c3393413fe7e7dc54498ea200ea94742d61e18 ]
      
      Patch series "migrate_misplaced_transhuge_page race conditions".
      
      Aaron found a new instance of the THP MADV_DONTNEED race against
      pmdp_clear_flush* variants, that was apparently left unfixed.
      
      While looking into the race found by Aaron, I may have found two more
      issues in migrate_misplaced_transhuge_page.
      
      These race conditions would not cause kernel instability, but they'd
      corrupt userland data or leave data non zero after MADV_DONTNEED.
      
      I did only minor testing, and I don't expect to be able to reproduce this
      (especially the lack of ->invalidate_range before migrate_page_copy
      requires the latest iommu hardware or infiniband to reproduce).  The last
      patch is noop for x86 and it needs further review from maintainers of
      archs that implement flush_cache_range() (not in CC yet).
      
      To avoid confusion, it's not the first patch that introduces the bug fixed
      in the second patch, even before removing the
      pmdp_huge_clear_flush_notify, that _notify suffix was called after
      migrate_page_copy had already run.
      
      This patch (of 3):
      
      This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
      balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
      race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
      race").
      
      When the above three fixes were posted Dave asked
      https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
      but apparently this was missed.
      
      The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
      in a54a407f ("mm: Close races between THP migration and PMD numa
      clearing").
      
      The important part of such commit is only the part where the page lock is
      not released until the first do_huge_pmd_numa_page() finished disarming
      the pagenuma/protnone.
      
      The addition of pmdp_clear_flush() wasn't beneficial to such commit and
      there's no commentary about such an addition either.
      
      I guess the pmdp_clear_flush() in such commit was added just in case for
      safety, but it ended up introducing the MADV_DONTNEED race condition found
      by Aaron.
      
      At that point in time nobody thought of such kind of MADV_DONTNEED race
      conditions yet (they were fixed later) so the code may have looked more
      robust by adding the pmdp_clear_flush().
      
      This specific race condition won't destabilize the kernel, but it can
      confuse userland because after MADV_DONTNEED the memory won't be zeroed
      out.
      
      This also optimizes the code and removes a superfluous TLB flush.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
      Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Aaron Tomlin <atomlin@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · 2d9d6c09
      Committed by Dave Chinner
      [ Upstream commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 ]
      
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 2
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback is
      processing pages in descending index order as we hold higher index pages
      in a structure controlled from above write_cache_pages().  The
      write_cache_pages() caller needs to be able to submit these pages for IO
      before write_cache_pages restarts writeback at mapping index 0 to avoid
      wcp inverting the page lock/writeback wait order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has its own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
      at the code long enough to understand that it has a similar retry loop for
      range_cyclic writeback reaching the end of the file and then promptly ran
      away before my eyes bled too much.  I'll leave it for the ext4 developers
      to determine if their code actually has this deadlock and how to fix it
      if it has.
      
      There's a few ways I can see to avoid this deadlock.  There's probably
      more, but these are the first I've thought of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range cyclic has hit EOF. writepages implementations can
      then flush the current context and call wpc again to continue. i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offset from starving background writeback of
      rarely touched higher file offsets.
      
      #2 is simple, and I don't think it will have any impact on
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      #2a is pretty much "status quo without the deadlock". Moving the
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      #3 is really just a band-aid - it doesn't fix the access/wait
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      #3a is really an optimisation that just so happens to include the
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2.
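
      In write_cache_pages() terms, the in-call retry loop goes away: a
      range_cyclic pass scans from writeback_index to EOF once, and only
      resets the index for the next invocation (a rough sketch of the idea,
      not the literal diff):

      	if (wbc->range_cyclic) {
      		index = mapping->writeback_index; /* prev offset */
      		end = -1;			  /* scan to EOF, once */
      	}
      	/* ... single pass, no mid-call restart from index 0 ... */
      	if (wbc->range_cyclic && !done)
      		mapping->writeback_index = 0;	  /* next call wraps */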
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Jan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/ksm.c: don't WARN if page is still mapped in remove_stable_node() · e8d355be
      Committed by Andrey Ryabinin
      commit 9a63236f1ad82d71a98aa80320b6cb618fb32f44 upstream.
      
      It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
      remove_stable_node() when it races with __mmput() and squeezes in
      between ksm_exit() and exit_mmap().
      
        WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150
      
        Call Trace:
         remove_all_stable_nodes+0x12b/0x330
         run_store+0x4ef/0x7b0
         kernfs_fop_write+0x200/0x420
         vfs_write+0x154/0x450
         ksys_write+0xf9/0x1d0
         do_syscall_64+0x99/0x510
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      Remove the warning as there is nothing scary going on.
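
      The branch stays, only the warning goes (sketch):

      	} else if (page_mapped(page)) {
      		/*
      		 * Racing with __mmput() between ksm_exit() and exit_mmap():
      		 * nothing scary is going on, just report busy.
      		 */
      		err = -EBUSY;
      	}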
      
      Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
      Fixes: cbf86cfe ("ksm: remove old stable nodes more thoroughly")
      Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  6. 24 Nov 2019 (2 commits)
    • mm/memory_hotplug: fix updating the node span · f8b09a04
      Committed by David Hildenbrand
      commit 656d571193262a11c2daa4012e53e4d645bbce56 upstream.
      
      We recently started updating the node span based on the zone span to
      avoid touching uninitialized memmaps.
      
      Currently, we will always detect the node span to start at 0, meaning a
      node can easily span too many pages.  pgdat_is_empty() will still work
      correctly if all zones span no pages.  We should skip over all zones
      without spanned pages and properly handle the first detected zone that
      spans pages.
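
      A sketch of the corrected loop (following the update_pgdat_span()
      helper introduced by the commit being fixed):

      	for (zone = pgdat->node_zones;
      	     zone < pgdat->node_zones + MAX_NR_ZONES; zone++) {
      		unsigned long zone_end_pfn = zone->zone_start_pfn +
      					     zone->spanned_pages;

      		/* skip zones that span no pages at all */
      		if (!zone->spanned_pages)
      			continue;
      		/* the first zone that spans pages seeds the node span */
      		if (!node_end_pfn) {
      			node_start_pfn = zone->zone_start_pfn;
      			node_end_pfn = zone_end_pfn;
      			continue;
      		}

      		if (zone_end_pfn > node_end_pfn)
      			node_end_pfn = zone_end_pfn;
      		if (zone->zone_start_pfn < node_start_pfn)
      			node_start_pfn = zone->zone_start_pfn;
      	}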
      
      Unfortunately, in contrast to the zone span (/proc/zoneinfo), the node
      span cannot easily be inspected and tested.  The node span gives no real
      guarantees when an architecture supports memory hotplug, meaning it can
      easily contain holes or span pages of different nodes.
      
      The node span is not really used after init on architectures that
      support memory hotplug.
      
      E.g., we use it in mm/memory_hotplug.c:try_offline_node() and in
      mm/kmemleak.c:kmemleak_scan().  These users seem to be fine.
      
      Link: http://lkml.kernel.org/r/20191027222714.5313-1-david@redhat.com
      Fixes: 00d6c019b5bc ("mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span()")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/memory_hotplug: don't access uninitialized memmaps in shrink_pgdat_span() · 6631def3
      Committed by David Hildenbrand
      commit 00d6c019b5bc175cee3770e0e659f2b5f4804ea5 upstream.
      
      We might use the nid of memmaps that were never initialized.  For
      example, if the memmap was poisoned, we will crash the kernel in
      pfn_to_nid() right now.  Let's use the calculated boundaries of the
      separate zones instead.  This now also avoids having to iterate over a
      whole bunch of subsections again, after shrinking one zone.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized to 0 and the node was set to the
      right value.  After that commit, the node might be garbage.
      
      We'll have to fix shrink_zone_span() next.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-4-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
  7. 21 Nov 2019 (4 commits)
  8. 13 Nov 2019 (4 commits)
    • mm/filemap.c: don't initiate writeback if mapping has no dirty pages · d3b3c0a1
      Committed by Konstantin Khlebnikov
      commit c3aab9a0bd91b696a852169479b7db1ece6cbf8c upstream.
      
      Functions like filemap_write_and_wait_range() should do nothing if inode
      has no dirty pages or pages currently under writeback.  But they
      construct a struct writeback_control anyway, and this does some atomic
      operations if CONFIG_CGROUP_WRITEBACK=y - on the fast path it locks
      inode->i_lock and updates the state of writeback ownership, on the slow
      path there might be more work.  Currently this path is safely avoided
      only when the inode mapping has no pages.
      
      For example generic_file_read_iter() calls filemap_write_and_wait_range()
      at each O_DIRECT read - pretty hot path.
      
      This patch skips starting new writeback if mapping has no dirty tags set.
      If writeback is already in progress filemap_write_and_wait_range() will
      wait for it.
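
      The skip boils down to one extra condition in
      __filemap_fdatawrite_range() (sketch):

      	if (!mapping_cap_writeback_dirty(mapping) ||
      	    !mapping_tagged(mapping, PAGECACHE_TAG_DIRTY))
      		return 0;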
      
      Link: http://lkml.kernel.org/r/156378816804.1087.8607636317907921438.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, vmstat: hide /proc/pagetypeinfo from normal users · 6c944fc5
      Committed by Michal Hocko
      commit abaed0112c1db08be15a784a2c5c8a8b3063cdd3 upstream.
      
      /proc/pagetypeinfo is a debugging tool to examine internal page
      allocator state wrt fragmentation.  It is not very useful for anything
      else, so normal users really do not need to read this file.
      
      Waiman Long has noticed that reading this file can have negative side
      effects because zone->lock is necessary for gathering data and that a)
      interferes with the page allocator and its users and b) can lead to hard
      lockups on large machines which have very long free_list.
      
      Reduce both issues by simply not exporting the file to regular users.
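
      The change is just the proc file mode at registration time (sketch,
      assuming the proc_create_seq() registration used by recent kernels):

      	proc_create_seq("pagetypeinfo", 0400, NULL, &pagetypeinfo_op);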
      
      Link: http://lkml.kernel.org/r/20191025072610.18526-2-mhocko@kernel.org
      Fixes: 467c996c ("Print out statistics in relation to fragmentation avoidance to /proc/pagetypeinfo")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Waiman Long <longman@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Waiman Long <longman@redhat.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Jann Horn <jannh@google.com>
      Cc: Song Liu <songliubraving@fb.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm, meminit: recalculate pcpu batch and high limits after init completes · 7dfa51be
      Committed by Mel Gorman
      commit 3e8fc0075e24338b1117cdff6a79477427b8dbed upstream.
      
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
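
      The recalculation itself is a short pass at the end of deferred init
      (sketch; zone_pcp_update() re-derives the per-cpu batch and high
      values from the now-final zone->managed_pages):

      	/* in page_alloc_init_late(), once deferred struct pages are done */
      	for_each_populated_zone(zone)
      		zone_pcp_update(zone);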
      
      mmtests configuration: config-workload-kernbench-max Configuration was
      modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
      Before the patch, the breakdown of batch and high values over all zones
      was:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
        Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
      Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: memcontrol: fix network errors from failing __GFP_ATOMIC charges · 8e6bf4bc
      Committed by Johannes Weiner
      commit 869712fd3de5a90b7ba23ae1272278cddc66b37b upstream.
      
      While upgrading from 4.16 to 5.2, we noticed these allocation errors in
      the log of the new kernel:
      
        SLUB: Unable to allocate memory on node -1, gfp=0xa20(GFP_ATOMIC)
          cache: tw_sock_TCPv6(960:helper-logs), object size: 232, buffer size: 240, default order: 1, min order: 0
          node 0: slabs: 5, objs: 170, free: 0
      
              slab_out_of_memory+1
              ___slab_alloc+969
              __slab_alloc+14
              kmem_cache_alloc+346
              inet_twsk_alloc+60
              tcp_time_wait+46
              tcp_fin+206
              tcp_data_queue+2034
              tcp_rcv_state_process+784
              tcp_v6_do_rcv+405
              __release_sock+118
              tcp_close+385
              inet_release+46
              __sock_release+55
              sock_close+17
              __fput+170
              task_work_run+127
              exit_to_usermode_loop+191
              do_syscall_64+212
              entry_SYSCALL_64_after_hwframe+68
      
      accompanied by an increase in machines going completely radio silent
      under memory pressure.
      
      One thing that changed since 4.16 is e699e2c6 ("net, mm: account
      sock objects to kmemcg"), which made these slab caches subject to cgroup
      memory accounting and control.
      
      The problem with that is that cgroups, unlike the page allocator, do not
      maintain dedicated atomic reserves.  As a cgroup's usage hovers at its
      limit, atomic allocations - such as done during network rx - can fail
      consistently for extended periods of time.  The kernel is not able to
      operate under these conditions.
      
      We don't want to revert the culprit patch, because it indeed tracks a
      potentially substantial amount of memory used by a cgroup.
      
      We also don't want to implement dedicated atomic reserves for cgroups.
      There is no point in keeping a fixed margin of unused bytes in the
      cgroup's memory budget to accommodate a consumer that is impossible to
      predict - we'd be wasting memory and get into configuration headaches,
      not unlike what we have going with min_free_kbytes.  We do this for
      physical mem because we have to, but cgroups are an accounting game.
      
      Instead, account these privileged allocations to the cgroup, but let
      them bypass the configured limit if they have to.  This way, we get the
      benefits of accounting the consumed memory and have it exert pressure on
      the rest of the cgroup, but like with the page allocator, we shift the
      burden of reclaiming on behalf of atomic allocations onto the regular
      allocations that can block.
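
      In try_charge() this is a forced charge (sketch of the added branch):

      	/* atomic allocations cannot reclaim; charge past the limit and
      	 * let the overage exert pressure on blockable allocations */
      	if (gfp_mask & __GFP_ATOMIC)
      		goto force;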
      
      Link: http://lkml.kernel.org/r/20191022233708.365764-1-hannes@cmpxchg.org
      Fixes: e699e2c6 ("net, mm: account sock objects to kmemcg")
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Suleiman Souhlal <suleiman@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.18+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 29 Oct 2019 (6 commits)
    • mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once · 30cff8ab
      Committed by Jane Chu
      commit 3d7fed4ad8ccb691d217efbb0f934e6a4df5ef91 upstream.
      
      Mmap /dev/dax more than once, then read the poison location using
      address from one of the mappings.  The other mappings, due to not having
      the page mapped in, will cause SIGKILLs to be delivered to the process.
      SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
      handle the UE.
      
      Although one may add MAP_POPULATE to mmap(2) to work around the issue,
      MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
      isn't always an option.
      
      Details -
      
        ndctl inject-error --block=10 --count=1 namespace6.0
      
        ./read_poison -x dax6.0 -o 5120 -m 2
        mmaped address 0x7f5bb6600000
        mmaped address 0x7f3cf3600000
        doing local read at address 0x7f3cf3601400
        Killed
      
      Console messages in instrumented kernel -
      
        mce: Uncorrected hardware memory error in user-access at edbe201400
        Memory failure: tk->addr = 7f5bb6601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        dev_pagemap_mapping_shift: page edbe201: no PUD
        Memory failure: tk->size_shift == 0
        Memory failure: Unable to find user space address edbe201 in read_poison
        Memory failure: tk->addr = 7f3cf3601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        Memory failure: tk->size_shift = 21
        Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
          => to deliver SIGKILL
        Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
          => to deliver SIGBUS
      
      Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
      Signed-off-by: Jane Chu <jane.chu@oracle.com>
      Suggested-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() · 91eec769
      Committed by David Hildenbrand
      commit f231fe4235e22e18d847e05cbe705deaca56580a upstream.
      
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      Let's make sure that we only consider online memory (managed by the
      buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.
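
      pfn_range_valid_gigantic() therefore goes through pfn_to_online_page()
      instead of dereferencing arbitrary memmaps (sketch):

      	for (i = start_pfn; i < end_pfn; i++) {
      		page = pfn_to_online_page(i);
      		if (!page)	/* offline or ZONE_DEVICE: don't trust it */
      			return false;
      		...
      	}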
      
      page_zone() will call page_to_nid(), which will trigger
      VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
      and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
      can be the case when an offline memory block (e.g., never onlined) is
      spanned by a zone.
      
      Note: As explained by Michal in [1], alloc_contig_range() will verify
      the range.  So it boils down to the wrong access in this function.
      
      [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
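
      The shape of the fix, as a sketch (approximate; exact details may
      differ between kernel versions):

      	static bool pfn_range_valid_gigantic(struct zone *z,
      				unsigned long start_pfn, unsigned long nr_pages)
      	{
      		unsigned long i, end_pfn = start_pfn + nr_pages;
      		struct page *page;

      		for (i = start_pfn; i < end_pfn; i++) {
      			/*
      			 * Returns NULL for pfns without an initialized
      			 * memmap (offline sections, ZONE_DEVICE), so the
      			 * garbage is never dereferenced.
      			 */
      			page = pfn_to_online_page(i);
      			if (!page)
      				return false;

      			if (page_zone(page) != z)
      				return false;

      			if (PageReserved(page))
      				return false;

      			if (page_count(page) > 0)
      				return false;

      			if (PageHuge(page))
      				return false;
      		}

      		return true;
      	}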
      
      Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Reported-by: Michal Hocko <mhocko@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      91eec769
    • Q
      mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo · f712e306
      Submitted by Qian Cai
      commit a26ee565b6cd8dc2bf15ff6aa70bbb28f928b773 upstream.
      
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      For example, when not onlining a memory block that is spanned by a zone
      and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
      CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
      
        :/# echo 1 > /sys/devices/system/memory/memory40/online
        :/# echo 1 > /sys/devices/system/memory/memory42/online
        :/# cat /proc/pagetypeinfo > test.file
         page:fffff2c585200000 is uninitialized and poisoned
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
         There is not page extension available.
         ------------[ cut here ]------------
         kernel BUG at include/linux/mm.h:1107!
         invalid opcode: 0000 [#1] SMP NOPTI
      
      Please note that this change does not affect ZONE_DEVICE, because
      pagetypeinfo_showmixedcount_print() is called from
      mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
      ZONE_DEVICE is never populated (zone->present_pages always 0).
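
      Concretely, the fix resolves pfns through pfn_to_online_page() in the
      outer loop of pagetypeinfo_showmixedcount_print(); a sketch
      (approximate):

      	for (; pfn < end_pfn; pfn++) {
      		/*
      		 * pfn_to_online_page() returns NULL for pfns whose memmap
      		 * was never initialized (e.g. an offline memory block
      		 * spanned by the zone), so the poisoned struct pages above
      		 * are never dereferenced.
      		 */
      		page = pfn_to_online_page(pfn);
      		if (!page)
      			continue;

      		/* ... per-pageblock page_owner accounting as before ... */
      	}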
      
      [david@redhat.com: move check to outer loop, add comment, rephrase description]
      Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      f712e306
    • Q
      mm/slub: fix a deadlock in show_slab_objects() · bb6932c5
      Submitted by Qian Cai
      commit e4f8e513c3d353c134ad4eef9fd0bba12406c7c8 upstream.
      
      A long time ago we fixed a similar deadlock in show_slab_objects()
      [1].  However, apparently due to commits like 01fb58bc ("slab:
      remove synchronous synchronize_sched() from memcg cache deactivation
      path") and 03afc0e2 ("slab: get_online_mems for
      kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back:
      just reading files in /sys/kernel/slab will generate the lockdep
      splat below.

      Since the "mem_hotplug_lock" here is only taken to obtain a stable
      online node mask while racing with NUMA node hotplug, in the worst
      case the results may be miscalculated during a NUMA node hotplug,
      but they shall be corrected by later reads of the same files.
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        cat/5224 is trying to acquire lock:
        ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
        show_slab_objects+0x94/0x3a8
      
        but task is already holding lock:
        b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (kn->count#45){++++}:
               lock_acquire+0x31c/0x360
               __kernfs_remove+0x290/0x490
               kernfs_remove+0x30/0x44
               sysfs_remove_dir+0x70/0x88
               kobject_del+0x50/0xb0
               sysfs_slab_unlink+0x2c/0x38
               shutdown_cache+0xa0/0xf0
               kmemcg_cache_shutdown_fn+0x1c/0x34
               kmemcg_workfn+0x44/0x64
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #1 (slab_mutex){+.+.}:
               lock_acquire+0x31c/0x360
               __mutex_lock_common+0x16c/0xf78
               mutex_lock_nested+0x40/0x50
               memcg_create_kmem_cache+0x38/0x16c
               memcg_kmem_cache_create_func+0x3c/0x70
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #0 (mem_hotplug_lock.rw_sem){++++}:
               validate_chain+0xd10/0x2bcc
               __lock_acquire+0x7f4/0xb8c
               lock_acquire+0x31c/0x360
               get_online_mems+0x54/0x150
               show_slab_objects+0x94/0x3a8
               total_objects_show+0x28/0x34
               slab_attr_show+0x38/0x54
               sysfs_kf_seq_show+0x198/0x2d4
               kernfs_seq_show+0xa4/0xcc
               seq_read+0x30c/0x8a8
               kernfs_fop_read+0xa8/0x314
               __vfs_read+0x88/0x20c
               vfs_read+0xd8/0x10c
               ksys_read+0xb0/0x120
               __arm64_sys_read+0x54/0x88
               el0_svc_handler+0x170/0x240
               el0_svc+0x8/0xc
      
        other info that might help us debug this:
      
        Chain exists of:
          mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(kn->count#45);
                                       lock(slab_mutex);
                                       lock(kn->count#45);
          lock(mem_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        3 locks held by cat/5224:
         #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
         #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
         #2: b8ff009693eee398 (kn->count#45){++++}, at:
        kernfs_seq_start+0x44/0xf0
      
        stack backtrace:
        Call trace:
         dump_backtrace+0x0/0x248
         show_stack+0x20/0x2c
         dump_stack+0xd0/0x140
         print_circular_bug+0x368/0x380
         check_noncircular+0x248/0x250
         validate_chain+0xd10/0x2bcc
         __lock_acquire+0x7f4/0xb8c
         lock_acquire+0x31c/0x360
         get_online_mems+0x54/0x150
         show_slab_objects+0x94/0x3a8
         total_objects_show+0x28/0x34
         slab_attr_show+0x38/0x54
         sysfs_kf_seq_show+0x198/0x2d4
         kernfs_seq_show+0xa4/0xcc
         seq_read+0x30c/0x8a8
         kernfs_fop_read+0xa8/0x314
         __vfs_read+0x88/0x20c
         vfs_read+0xd8/0x10c
         ksys_read+0xb0/0x120
         __arm64_sys_read+0x54/0x88
         el0_svc_handler+0x170/0x240
         el0_svc+0x8/0xc
      
      I think it is important to mention that this doesn't expose
      show_slab_objects() to a use-after-free.  There is only a single path
      that might really race here and that is the slab hotplug notifier
      callback __kmem_cache_shrink() (via slab_mem_going_offline_callback()),
      but that path doesn't really destroy kmem_cache_node data structures.
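
      The fix therefore drops get_online_mems()/put_online_mems() around
      the node walk and accepts the benign race; a sketch of the resulting
      pattern in show_slab_objects() (approximate):

      	/*
      	 * mem_hotplug_lock is not taken here: a racing NUMA node hotplug
      	 * can at worst skew these counters, and a later read of the same
      	 * sysfs file self-corrects.
      	 */
      	for_each_online_node(node) {
      		struct kmem_cache_node *n = get_node(s, node);

      		if (!n)
      			continue;

      		x = atomic_long_read(&n->total_objects);
      		total += x;
      		nodes[node] += x;
      	}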
      
      [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
      
      [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
      Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
      Fixes: 01fb58bc ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
      Fixes: 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb6932c5
    • D
      mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() · 9792afbd
      Submitted by David Hildenbrand
      commit 96c804a6ae8c59a9092b3d5dd581198472063184 upstream.
      
      We should check via pfn_to_online_page() so that we do not access
      uninitialized memmaps.  Reshuffle the code so we don't have to
      duplicate the error message.
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      9792afbd
    • M
      memfd: Fix locking when tagging pins · 99b45e7a
      Submitted by Matthew Wilcox (Oracle)
      The RCU lock is insufficient to protect the radix tree iteration as
      a deletion from the tree can occur before we take the spinlock to
      tag the entry.  In 4.19, this has manifested as a bug with the following
      trace:
      
      kernel BUG at lib/radix-tree.c:1429!
      invalid opcode: 0000 [#1] SMP KASAN PTI
      CPU: 7 PID: 6935 Comm: syz-executor.2 Not tainted 4.19.36 #25
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:radix_tree_tag_set+0x200/0x2f0 lib/radix-tree.c:1429
      Code: 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 44 24 10 e8 a3 29 7e fe 48 8b 44 24 10 48 0f ab 03 e9 d2 fe ff ff e8 90 29 7e fe <0f> 0b 48 c7 c7 e0 5a 87 84 e8 f0 e7 08 ff 4c 89 ef e8 4a ff ac fe
      RSP: 0018:ffff88837b13fb60 EFLAGS: 00010016
      RAX: 0000000000040000 RBX: ffff8883c5515d58 RCX: ffffffff82cb2ef0
      RDX: 0000000000000b72 RSI: ffffc90004cf2000 RDI: ffff8883c5515d98
      RBP: ffff88837b13fb98 R08: ffffed106f627f7e R09: ffffed106f627f7e
      R10: 0000000000000001 R11: ffffed106f627f7d R12: 0000000000000004
      R13: ffffea000d7fea80 R14: 1ffff1106f627f6f R15: 0000000000000002
      FS:  00007fa1b8df2700(0000) GS:ffff8883e2fc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa1b8df1db8 CR3: 000000037d4d2001 CR4: 0000000000160ee0
      Call Trace:
       memfd_tag_pins mm/memfd.c:51 [inline]
       memfd_wait_for_pins+0x2c5/0x12d0 mm/memfd.c:81
       memfd_add_seals mm/memfd.c:215 [inline]
       memfd_fcntl+0x33d/0x4a0 mm/memfd.c:247
       do_fcntl+0x589/0xeb0 fs/fcntl.c:421
       __do_sys_fcntl fs/fcntl.c:463 [inline]
       __se_sys_fcntl fs/fcntl.c:448 [inline]
       __x64_sys_fcntl+0x12d/0x180 fs/fcntl.c:448
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:293
      
      The problem does not occur in mainline due to the XArray rewrite which
      changed the locking to exclude modification of the tree during iteration.
      At the time, nobody realised this was a bugfix.  Backport the locking
      changes to stable.
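
      The locking pattern of the backport, as a sketch (approximate; the
      actual stable patch differs in detail, e.g. it may drop and retake
      the lock periodically to bound hold times over large mappings):

      	xa_lock_irq(&mapping->i_pages);
      	radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, 0) {
      		page = radix_tree_deref_slot_protected(slot,
      					&mapping->i_pages.xa_lock);
      		if (!page || radix_tree_exception(page))
      			continue;
      		if (page_count(page) - page_mapcount(page) > 1)
      			/* Safe: deletion also takes the i_pages lock. */
      			radix_tree_tag_set(&mapping->i_pages, iter.index,
      					   MEMFD_TAG_PINNED);
      	}
      	xa_unlock_irq(&mapping->i_pages);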
      
      Cc: stable@vger.kernel.org
      Reported-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      99b45e7a
  10. 18 Oct, 2019: 1 commit
  11. 12 Oct, 2019: 1 commit
    • K
      usercopy: Avoid HIGHMEM pfn warning · 12c6c4a5
      Submitted by Kees Cook
      commit 314eed30ede02fa925990f535652254b5bad6b65 upstream.
      
      When running on a system with >512MB RAM with a 32-bit kernel built with:
      
      	CONFIG_DEBUG_VIRTUAL=y
      	CONFIG_HIGHMEM=y
      	CONFIG_HARDENED_USERCOPY=y
      
      all execve()s will fail because argv is copied into kmap()ed pages,
      and the usercopy checks, which ultimately call virt_to_page(), will
      trip over "bad" kmap (highmem) pointers due to CONFIG_DEBUG_VIRTUAL=y:
      
       ------------[ cut here ]------------
       kernel BUG at ../arch/x86/mm/physaddr.c:83!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc8 #6
       Hardware name: Dell Inc. Inspiron 1318/0C236D, BIOS A04 01/15/2009
       EIP: __phys_addr+0xaf/0x100
       ...
       Call Trace:
        __check_object_size+0xaf/0x3c0
        ? __might_sleep+0x80/0xa0
        copy_strings+0x1c2/0x370
        copy_strings_kernel+0x2b/0x40
        __do_execve_file+0x4ca/0x810
        ? kmem_cache_alloc+0x1c7/0x370
        do_execve+0x1b/0x20
        ...
      
      The check is from arch/x86/mm/physaddr.c:
      
      	VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
      
      Due to the kmap() in fs/exec.c:
      
      		kaddr = kmap(kmapped_page);
      	...
      	if (copy_from_user(kaddr+offset, str, bytes_to_copy)) ...
      
      Now we can fetch the correct page to avoid the pfn check. In both cases,
      hardened usercopy will need to walk the page-span checker (if enabled)
      to do sanity checking.
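
      A sketch of the change in check_heap_object() (approximate, based on
      the upstream fix):

      	if (!virt_addr_valid(ptr))
      		return;

      	/*
      	 * kmap_to_page() resolves a kmap() (highmem) address to its
      	 * backing page and falls back to virt_to_page() for lowmem
      	 * pointers, so DEBUG_VIRTUAL's pfn check is never tripped.
      	 */
      	page = kmap_to_page((void *)ptr);

      	/* ... slab and page-span checks continue as before ... */
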
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lore.kernel.org/r/201909171056.7F2FFD17@keescook
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      12c6c4a5
  12. 05 Oct, 2019: 3 commits
    • Y
      mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone · 4d8bdf7f
      Submitted by Yafang Shao
      [ Upstream commit a94b525241c0fff3598809131d7cfcfe1d572d8c ]
      
      total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED
      and COMPACTFREE_SCANNED in compact_zone().  We should clear them
      before scanning a new zone.  In proc-triggered compaction, we forgot
      to clear them.
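
      A sketch of the fix at the top of compact_zone() (approximate; per
      the notes below, the list_head initialization was squashed into the
      same place):

      	/*
      	 * These counters track activity for the whole zone scan, so reset
      	 * them here instead of trusting every caller to have zeroed the
      	 * reused compact_control.
      	 */
      	cc->total_migrate_scanned = 0;
      	cc->total_free_scanned = 0;
      	cc->nr_migratepages = 0;
      	cc->nr_freepages = 0;
      	INIT_LIST_HEAD(&cc->freepages);
      	INIT_LIST_HEAD(&cc->migratepages);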
      
      [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
        Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
      [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
      [vbabka@suse.cz: squash compact_zone() list_head init as well]
        Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
      [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
      Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
      Fixes: 7f354a54 ("mm, compaction: add vmstats for kcompactd work")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <shaoyafang@didiglobal.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      4d8bdf7f
    • M
      memcg, kmem: do not fail __GFP_NOFAIL charges · b4a734a5
      Submitted by Michal Hocko
      commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8 upstream.
      
      Thomas has noticed the following NULL ptr dereference when using cgroup
      v1 kmem limit:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 0
      P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
      Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
      RIP: 0010:create_empty_buffers+0x24/0x100
      Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
      RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
      RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
      RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
      R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
      FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
      Call Trace:
       create_page_buffers+0x4d/0x60
       __block_write_begin_int+0x8e/0x5a0
       ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
       ? jbd2__journal_start+0xd7/0x1f0
       ext4_da_write_begin+0x112/0x3d0
       generic_perform_write+0xf1/0x1b0
       ? file_update_time+0x70/0x140
       __generic_file_write_iter+0x141/0x1a0
       ext4_file_write_iter+0xef/0x3b0
       __vfs_write+0x17e/0x1e0
       vfs_write+0xa5/0x1a0
       ksys_write+0x57/0xd0
       do_syscall_64+0x55/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Tetsuo then noticed that this is because __memcg_kmem_charge_memcg()
      fails a __GFP_NOFAIL charge when the kmem limit is reached.  This is
      wrong behavior because nofail allocations are not allowed to fail.
      The normal charge path simply forces the charge even if that means
      crossing the limit.  Kmem accounting should do the same.
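
      A sketch of the fix in __memcg_kmem_charge_memcg() (approximate,
      based on the upstream change):

      	if (!cgroup_subsys_on_dfl(memory_cgrp_subsys) &&
      	    !page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
      		/*
      		 * Enforce __GFP_NOFAIL allocations: callers are not
      		 * prepared to see failures, so force the charge past the
      		 * kmem limit just as the normal charge path does.
      		 */
      		if (gfp & __GFP_NOFAIL) {
      			page_counter_charge(&memcg->kmem, nr_pages);
      			return 0;
      		}
      		cancel_charge(memcg, nr_pages);
      		return -ENOMEM;
      	}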
      
      Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
      Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4a734a5
    • T
      memcg, oom: don't require __GFP_FS when invoking memcg OOM killer · d40b3eaf
      Submitted by Tetsuo Handa
      commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.
      
      Masoud Sharbiani noticed that commit 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") broke memcg OOM called from
      __xfs_filemap_fault() path.  It turned out that try_charge() is retrying
      forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
      cannot invoke the OOM killer due to commit 3da88fb3 ("mm, oom:
      move GFP_NOFS check to out_of_memory").
      
      Allowing a forced charge because we are unable to invoke the memcg
      OOM killer would lead to a global OOM situation.  Also, just
      returning -ENOMEM would be risky because the OOM path is lost and
      some paths (e.g. get_user_pages()) will leak -ENOMEM.  Therefore,
      invoking the memcg OOM killer (despite GFP_NOFS) is the only choice
      we have for now.

      Until 29ef680a, we were able to invoke the memcg OOM killer when
      GFP_KERNEL reclaim failed [1].  But since 29ef680a, we need to
      invoke it when GFP_NOFS reclaim fails [2].  Although in the past we
      did invoke the memcg OOM killer for GFP_NOFS [3], we might get
      premature memcg OOM reports due to this patch.
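
      The fix itself is small: exempt memcg OOMs from the !__GFP_FS
      bail-out in out_of_memory().  A sketch (approximate, based on the
      upstream change):

      	/*
      	 * The OOM killer does not compensate for IO-less reclaim, but a
      	 * memcg OOM must not be left unresolved or try_charge() retries
      	 * forever, so memcg OOMs are exempted from the bail-out.
      	 */
      	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
      		return true;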
      
      [1]
      
       leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
       CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x10a/0x2c0
        mem_cgroup_out_of_memory+0x46/0x80
        mem_cgroup_oom_synchronize+0x2e4/0x310
        ? high_work_func+0x20/0x20
        pagefault_out_of_memory+0x31/0x76
        mm_fault_error+0x55/0x115
        ? handle_mm_fault+0xfd/0x220
        __do_page_fault+0x433/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 158965
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [2]
      
       leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
       CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x109/0x2d0
        mem_cgroup_out_of_memory+0x46/0x80
        try_charge+0x58d/0x650
        ? __radix_tree_replace+0x81/0x100
        mem_cgroup_try_charge+0x7a/0x100
        __add_to_page_cache_locked+0x92/0x180
        add_to_page_cache_lru+0x4d/0xf0
        iomap_readpages_actor+0xde/0x1b0
        ? iomap_zero_range_actor+0x1d0/0x1d0
        iomap_apply+0xaf/0x130
        iomap_readpages+0x9f/0x150
        ? iomap_zero_range_actor+0x1d0/0x1d0
        xfs_vm_readpages+0x18/0x20 [xfs]
        read_pages+0x60/0x140
        __do_page_cache_readahead+0x193/0x1b0
        ondemand_readahead+0x16d/0x2c0
        page_cache_async_readahead+0x9a/0xd0
        filemap_fault+0x403/0x620
        ? alloc_set_pte+0x12c/0x540
        ? _cond_resched+0x14/0x30
        __xfs_filemap_fault+0x66/0x180 [xfs]
        xfs_filemap_fault+0x27/0x30 [xfs]
        __do_fault+0x19/0x40
        __handle_mm_fault+0x8e8/0xb60
        handle_mm_fault+0xfd/0x220
        __do_page_fault+0x238/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 7221
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [3]
      
       leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
       leaker cpuset=/ mems_allowed=0
       CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        [<ffffffffaf364147>] dump_stack+0x19/0x1b
        [<ffffffffaf35eb6a>] dump_header+0x90/0x229
        [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
        [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
        [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
        [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
        [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
        [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
        [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
        [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
        [<ffffffffaf371925>] do_page_fault+0x35/0x90
        [<ffffffffaf36d768>] page_fault+0x28/0x30
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 20628
       memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
       Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
       Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB
      
      Bisected by Masoud Sharbiani.
      
      Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
      Fixes: 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680a]
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Masoud Sharbiani <msharbiani@apple.com>
      Tested-by: Masoud Sharbiani <msharbiani@apple.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      d40b3eaf
  13. 16 Sep, 2019: 1 commit
  14. 06 Sep, 2019: 1 commit