1. 03 Apr 2020, 7 commits
  2. 30 Mar 2020, 4 commits
    • mm/sparse: fix kernel crash with pfn_section_valid check · b943f045
      Committed by Aneesh Kumar K.V
      Fix the crash like this:
      
          BUG: Kernel NULL pointer dereference on read at 0x00000000
          Faulting instruction address: 0xc000000000c3447c
          Oops: Kernel access of bad area, sig: 11 [#1]
          LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
          CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
          ...
          NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
          LR [c000000000088354] vmemmap_free+0x144/0x320
          Call Trace:
             section_deactivate+0x220/0x240
             __remove_pages+0x118/0x170
             arch_remove_memory+0x3c/0x150
             memunmap_pages+0x1cc/0x2f0
             devm_action_release+0x30/0x50
             release_nodes+0x2f8/0x3e0
             device_release_driver_internal+0x168/0x270
             unbind_store+0x130/0x170
             drv_attr_store+0x44/0x60
             sysfs_kf_write+0x68/0x80
             kernfs_fop_write+0x100/0x290
             __vfs_write+0x3c/0x70
             vfs_write+0xcc/0x240
             ksys_write+0x7c/0x140
             system_call+0x5c/0x68
      
      The crash is a NULL pointer dereference at
      
      	test_bit(idx, ms->usage->subsection_map);
      
      because ms->usage == NULL in pfn_section_valid().
      
      With commit d41e2f3b ("mm/hotplug: fix hot remove failure in
      SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
      depopulate_section_memmap().  This was done so that pfn_to_page() can
      work correctly with a kernel config that disables SPARSEMEM_VMEMMAP.
      With that config pfn_to_page does
      
      	__section_mem_map_addr(__sec) + __pfn;
      
      where
      
        static inline struct page *__section_mem_map_addr(struct mem_section *section)
        {
      	unsigned long map = section->section_mem_map;
      	map &= SECTION_MAP_MASK;
      	return (struct page *)map;
        }
      
      Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map
      is used to check pfn validity (pfn_valid()).  Since section_deactivate()
      releases mem_section->usage if a section is fully deactivated, a
      pfn_valid() check after a subsection deactivation causes a kernel crash.
      
        static inline int pfn_valid(unsigned long pfn)
        {
        ...
      	return early_section(ms) || pfn_section_valid(ms, pfn);
        }
      
      where
      
        static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
        {
      	int idx = subsection_map_index(pfn);
      
      	return test_bit(idx, ms->usage->subsection_map);
        }
      
      Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
      freed.  For architectures like ppc64 where large pages are used for
      vmemmap mapping (16MB), a specific vmemmap mapping can cover multiple
      sections.  Hence before a vmemmap mapping page can be freed, the kernel
      needs to make sure there are no valid sections within that mapping.
      Clearing the section valid bit before depopulate_section_memmap()
      enables this.
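      
      A minimal sketch of the resulting ordering in section_deactivate()
      (simplified from the upstream change; the bootmem branch is kept only
      for context):
      
        /*
         * Mark the section invalid so that valid_section() returns false
         * and pfn_valid() never dereferences the soon-to-be-freed
         * ms->usage.
         */
        ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
      
        if (section_is_early && memmap)
                free_map_bootmem(memmap);
        else
                depopulate_section_memmap(pfn, nr_pages, altmap);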
      
      [aneesh.kumar@linux.ibm.com: add comment]
        Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
      Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
      Fixes: d41e2f3b ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
      Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fork: fix kernel_stack memcg stats for various stack implementations · 8380ce47
      Committed by Roman Gushchin
      Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
      space for task stacks can be allocated using __vmalloc_node_range(),
      alloc_pages_node() and kmem_cache_alloc_node().
      
      In the first and second cases the page->mem_cgroup pointer is set, but
      in the third it's not: memcg membership of a slab page should be
      determined using the memcg_from_slab_page() function, which looks at
      page->slab_cache->memcg_params.memcg.  In this case, using
      mod_memcg_page_state() (as in account_kernel_stack()) is incorrect:
      the page->mem_cgroup pointer is NULL even for pages charged to a
      non-root memory cgroup.
      
      It can lead to kernel_stack per-memcg counters permanently showing 0 on
      some architectures (depending on the configuration).
      
      In order to fix it, let's introduce a mod_memcg_obj_state() helper,
      which takes a pointer to a kernel object as its first argument, uses
      mem_cgroup_from_obj() to get an RCU-protected memcg pointer, and calls
      mod_memcg_state().  It allows handling all possible configurations
      (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
      spilling any memcg/kmem specifics into fork.c.
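      
      A sketch of such a helper, close to what the patch adds (treat the
      details as illustrative):
      
        void mod_memcg_obj_state(void *p, int idx, int val)
        {
                struct mem_cgroup *memcg;
      
                rcu_read_lock();
                /* resolves both slab objects and page-backed allocations */
                memcg = mem_cgroup_from_obj(p);
                if (memcg)
                        mod_memcg_state(memcg, idx, val);
                rcu_read_unlock();
        }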
      
      Note: This is a special version of the patch created for stable
      backports.  It contains code from the following two patches:
        - mm: memcg/slab: introduce mem_cgroup_from_obj()
        - mm: fork: fix kernel_stack memcg stats for various stack implementations
      
      [guro@fb.com: introduce mem_cgroup_from_obj()]
        Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
      Fixes: 4d96ba35 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Bharata B Rao <bharata@linux.ibm.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • hugetlb_cgroup: fix illegal access to memory · 726b7bbe
      Committed by Mina Almasry
      This appears to be a mistake in commit faced7e0 ("mm: hugetlb
      controller for cgroups v2").
      
      Essentially that commit calls hugetlb_cgroup_from_counter() assuming
      that page_counter_try_charge() has initialized counter.
      
      But if that call has failed, counter is left uninitialized, so
      hugetlb_cgroup_from_counter(counter) ends up pointing to random memory,
      causing KASAN to complain.
      
      The solution is to simply use 'h_cg', instead of
      hugetlb_cgroup_from_counter(counter), since that is a reference to the
      hugetlb_cgroup anyway.  After this change KASAN ceases to complain.
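      
      A sketch of the change in the charge path (names follow
      mm/hugetlb_cgroup.c around that commit; treat the details as
      illustrative):
      
        if (!page_counter_try_charge(&h_cg->hugepage[idx], nr_pages,
                                     &counter)) {
                ret = -ENOMEM;
                /*
                 * 'counter' may be uninitialized on failure, so report
                 * the event against h_cg, which we already hold, instead
                 * of deriving it via hugetlb_cgroup_from_counter().
                 */
                hugetlb_event(h_cg, idx, HUGETLB_MAX);
        }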
      
      Fixes: faced7e0 ("mm: hugetlb controller for cgroups v2")
      Reported-by: syzbot+cac0c4e204952cf449b1@syzkaller.appspotmail.com
      Signed-off-by: Mina Almasry <almasrymina@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: David Rientjes <rientjes@google.com>
      Link: http://lkml.kernel.org/r/20200313223920.124230-1-almasrymina@google.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/swapfile.c: move inode_lock out of claim_swapfile · d795a90e
      Committed by Naohiro Aota
      claim_swapfile() currently keeps the inode locked when it is
      successful, or when the file is already a swapfile (with -EBUSY).  On
      the other error cases, it does not lock the inode.
      
      This inconsistency of the lock state and return value is quite confusing
      and actually causing a bad unlock balance as below in the "bad_swap"
      section of __do_sys_swapon().
      
      This commit fixes this issue by moving the inode_lock() and IS_SWAPFILE
      check out of claim_swapfile().  The inode is unlocked in the
      "bad_swap_unlock_inode" section, so the inode is guaranteed to be
      unlocked at "bad_swap".  Thus, error handling code after the locking
      now jumps to "bad_swap_unlock_inode" instead of "bad_swap".
      
          =====================================
          WARNING: bad unlock balance detected!
          5.5.0-rc7+ #176 Not tainted
          -------------------------------------
          swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
          but there are no more locks to release!
      
          other info that might help us debug this:
          no locks held by swapon/4294.
      
          stack backtrace:
          CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
          Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
          Call Trace:
           dump_stack+0xa1/0xea
           print_unlock_imbalance_bug.cold+0x114/0x123
           lock_release+0x562/0xed0
           up_write+0x2d/0x490
           __do_sys_swapon+0x94b/0x3550
           __x64_sys_swapon+0x54/0x80
           do_syscall_64+0xa4/0x4b0
           entry_SYSCALL_64_after_hwframe+0x49/0xbe
          RIP: 0033:0x7f15da0a0dc7
      
      Fixes: 1638045c ("mm: set S_SWAPFILE on blockdev swap devices")
      Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Qais Yousef <qais.yousef@arm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 26 Mar 2020, 1 commit
  4. 22 Mar 2020, 8 commits
    • x86/mm: split vmalloc_sync_all() · 763802b5
      Committed by Joerg Roedel
      Commit 3f8fd02b ("mm/vmalloc: Sync unmappings in
      __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
      the vunmap() code-path.  While this change was necessary to maintain
      correctness on x86-32-pae kernels, it also adds additional cycles for
      architectures that don't need it.
      
      Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
      severe performance regressions in micro-benchmarks because it now also
      calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
      the vmalloc_sync_all() implementation on x86-64 is only needed for newly
      created mappings.
      
      To avoid the unnecessary work on x86-64 and to gain the performance
      back, split up vmalloc_sync_all() into two functions:
      
      	* vmalloc_sync_mappings(), and
      	* vmalloc_sync_unmappings()
      
      Most call-sites to vmalloc_sync_all() only care about new mappings being
      synchronized.  The only exception is the new call-site added in the
      above mentioned commit.
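      
      A sketch of the split as weak generic no-ops that architectures
      override (illustrative of the pattern in mm/vmalloc.c):
      
        /*
         * x86-32-pae must sync both mappings and unmappings; x86-64 only
         * needs to sync newly created mappings, so its unmappings variant
         * stays a no-op.
         */
        void __weak vmalloc_sync_mappings(void)
        {
        }
      
        void __weak vmalloc_sync_unmappings(void)
        {
        }
      
      The vunmap-side call-site added by 3f8fd02b then calls
      vmalloc_sync_unmappings(), while all other callers switch to
      vmalloc_sync_mappings().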
      
      Shile Zhang directed us to a report of an 80% regression in reaim
      throughput.
      
      Fixes: 3f8fd02b ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
      Reported-by: kernel test robot <oliver.sang@intel.com>
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Joerg Roedel <jroedel@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Borislav Petkov <bp@suse.de>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
      Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
      Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, slub: prevent kmalloc_node crashes and memory leaks · 0715e6c5
      Committed by Vlastimil Babka
      Sachin reports [1] a crash in SLUB __slab_alloc():
      
        BUG: Kernel NULL pointer dereference on read at 0x000073b0
        Faulting instruction address: 0xc0000000003d55f4
        Oops: Kernel access of bad area, sig: 11 [#1]
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in:
        CPU: 19 PID: 1 Comm: systemd Not tainted 5.6.0-rc2-next-20200218-autotest #1
        NIP:  c0000000003d55f4 LR: c0000000003d5b94 CTR: 0000000000000000
        REGS: c0000008b37836d0 TRAP: 0300   Not tainted  (5.6.0-rc2-next-20200218-autotest)
        MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24004844  XER: 00000000
        CFAR: c00000000000dec4 DAR: 00000000000073b0 DSISR: 40000000 IRQMASK: 1
        GPR00: c0000000003d5b94 c0000008b3783960 c00000000155d400 c0000008b301f500
        GPR04: 0000000000000dc0 0000000000000002 c0000000003443d8 c0000008bb398620
        GPR08: 00000008ba2f0000 0000000000000001 0000000000000000 0000000000000000
        GPR12: 0000000024004844 c00000001ec52a00 0000000000000000 0000000000000000
        GPR16: c0000008a1b20048 c000000001595898 c000000001750c18 0000000000000002
        GPR20: c000000001750c28 c000000001624470 0000000fffffffe0 5deadbeef0000122
        GPR24: 0000000000000001 0000000000000dc0 0000000000000002 c0000000003443d8
        GPR28: c0000008b301f500 c0000008bb398620 0000000000000000 c00c000002287180
        NIP ___slab_alloc+0x1f4/0x760
        LR __slab_alloc+0x34/0x60
        Call Trace:
          ___slab_alloc+0x334/0x760 (unreliable)
          __slab_alloc+0x34/0x60
          __kmalloc_node+0x110/0x490
          kvmalloc_node+0x58/0x110
          mem_cgroup_css_online+0x108/0x270
          online_css+0x48/0xd0
          cgroup_apply_control_enable+0x2ec/0x4d0
          cgroup_mkdir+0x228/0x5f0
          kernfs_iop_mkdir+0x90/0xf0
          vfs_mkdir+0x110/0x230
          do_mkdirat+0xb0/0x1a0
          system_call+0x5c/0x68
      
      This is a PowerPC platform with following NUMA topology:
      
        available: 2 nodes (0-1)
        node 0 cpus:
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
        node 1 size: 35247 MB
        node 1 free: 30907 MB
        node distances:
        node   0   1
          0:  10  40
          1:  40  10
      
        possible numa nodes: 0-31
      
      This only happens with a mmotm patch "mm/memcontrol.c: allocate
      shrinker_map on appropriate NUMA node" [2] which effectively calls
      kmalloc_node for each possible node.  SLUB however only allocates
      kmem_cache_node on online N_NORMAL_MEMORY nodes, and relies on
      node_to_mem_node to return such valid node for other nodes since commit
      a561ce00 ("slub: fall back to node_to_mem_node() node if allocating
      on memoryless node").  This is however not true in this configuration
      where the _node_numa_mem_ array is not initialized for nodes 0 and 2-31,
      thus it contains zeroes and get_partial() ends up accessing
      non-allocated kmem_cache_node.
      
      A related issue was reported by Bharata (originally by Ramachandran)
      [3] where a similar PowerPC configuration, but with a mainline kernel
      without patch [2], ends up allocating large amounts of pages from the
      kmalloc-1k and kmalloc-512 caches.  This seems to have the same
      underlying issue with node_to_mem_node() not behaving as expected, and
      might also lead to an infinite loop with CONFIG_SLUB_CPU_PARTIAL [4].
      
      This patch should fix both issues by not relying on node_to_mem_node()
      anymore and instead simply falling back to NUMA_NO_NODE, when
      kmalloc_node(node) is attempted for a node that's not online, or has no
      usable memory.  The "usable memory" condition is also changed from
      node_present_pages() to N_NORMAL_MEMORY node state, as that is exactly
      the condition that SLUB uses to allocate kmem_cache_node structures.
      The check in get_partial() is removed completely, as the checks in
      ___slab_alloc() are now sufficient to prevent get_partial() being
      reached with an invalid node.
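      
      A sketch of the fallback in ___slab_alloc() (simplified; the actual
      patch also adjusts the node_match() retry path and drops the
      get_partial() check):
      
        if (node != NUMA_NO_NODE &&
            !node_state(node, N_NORMAL_MEMORY)) {
                /*
                 * The requested node is offline or has no usable memory,
                 * so no kmem_cache_node was ever allocated for it: fall
                 * back instead of dereferencing it.
                 */
                node = NUMA_NO_NODE;
        }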
      
      [1] https://lore.kernel.org/linux-next/3381CD91-AB3D-4773-BA04-E7A072A63968@linux.vnet.ibm.com/
      [2] https://lore.kernel.org/linux-mm/fff0e636-4c36-ed10-281c-8cdb0687c839@virtuozzo.com/
      [3] https://lore.kernel.org/linux-mm/20200317092624.GB22538@in.ibm.com/
      [4] https://lore.kernel.org/linux-mm/088b5996-faae-8a56-ef9c-5b567125ae54@suse.cz/
      
      Fixes: a561ce00 ("slub: fall back to node_to_mem_node() node if allocating on memoryless node")
      Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Reported-by: PUVICHAKRAVARTHY RAMACHANDRAN <puvichakravarthy@in.ibm.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
      Tested-by: Bharata B Rao <bharata@linux.ibm.com>
      Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: linuxppc-dev@lists.ozlabs.org
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200320115533.9604-1-vbabka@suse.cz
      Debugged-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/mmu_notifier: silence PROVE_RCU_LIST warnings · 63886bad
      Committed by Qian Cai
      It is safe to traverse mm->notifier_subscriptions->list either under
      SRCU read lock or mm->notifier_subscriptions->lock using
      hlist_for_each_entry_rcu().  Silence the PROVE_RCU_LIST false positives,
      for example,
      
        WARNING: suspicious RCU usage
        -----------------------------
        mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!
      
        other info that might help us debug this:
      
        rcu_scheduler_active = 2, debug_locks = 1
        3 locks held by libvirtd/802:
         #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
         #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
         #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460
      
        stack backtrace:
        CPU: 7 PID: 802 Comm: libvirtd Tainted: G          I       5.6.0-rc6-next-20200317+ #2
        Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
        Call Trace:
          dump_stack+0xa4/0xfe
          lockdep_rcu_suspicious+0xeb/0xf5
          __mmu_notifier_invalidate_range_start+0x3ff/0x460
          change_p4d_range+0x746/0x800
          change_protection+0x1df/0x300
          mprotect_fixup+0x245/0x3e0
          do_mprotect_pkey+0x23b/0x3e0
          __x64_sys_mprotect+0x51/0x70
          do_syscall_64+0x91/0xae8
          entry_SYSCALL_64_after_hwframe+0x49/0xb3
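      
      The silencing boils down to passing the lock-held condition to the RCU
      list iterator (a sketch; the exact condition follows the upstream fix):
      
        hlist_for_each_entry_rcu(subscription,
                                 &mm->notifier_subscriptions->list, hlist,
                                 srcu_read_lock_held(&srcu)) {
                /* iteration body unchanged */
        }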
      Signed-off-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
      Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
      Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: do not allow MADV_PAGEOUT for CoW pages · 12e967fd
      Committed by Michal Hocko
      Jann has brought up a very interesting point [1].  While shared pages
      are excluded from MADV_PAGEOUT normally, CoW pages can be easily
      reclaimed that way.  This can lead to all sorts of hard to debug
      problems.  E.g.  performance problems outlined by Daniel [2].
      
      There are runtime environments where substantial memory is shared among
      security domains via CoW memory, and an easy-to-reclaim way of that
      memory, which MADV_{COLD,PAGEOUT} offers, can lead either to
      performance degradation for the parent process, which might be more
      privileged, or even open side channel attacks.
      
      The feasibility of the latter is not really clear to me TBH, but there
      is no real reason for exposure at this stage.  It seems there is no
      real use case that depends on reclaiming CoW memory via madvise at this
      stage, so it is much easier to simply disallow it, and this is what
      this patch does.  Put simply, MADV_{PAGEOUT,COLD} can operate only on
      exclusively owned memory, which is a straightforward semantic.
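      
      A sketch of the check the patch adds in the madvise cold/pageout page
      walker (mm/madvise.c; simplified):
      
        /*
         * Do not interfere with other mappings of this page: a mapcount
         * above 1 means the page is shared, either genuinely or via CoW.
         */
        if (page_mapcount(page) != 1)
                continue;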
      
      [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
      [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
      
      Fixes: 9c276cc6 ("mm: introduce MADV_COLD")
      Reported-by: Jann Horn <jannh@google.com>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: throttle allocators based on ancestral memory.high · e26733e0
      Committed by Chris Down
      Prior to this commit, we only directly check the affected cgroup's
      memory.high against its usage.  However, it's possible that we are being
      reclaimed as a result of hitting an ancestor memory.high and should be
      penalised based on that, instead.
      
      This patch changes memory.high overage throttling to use the largest
      overage in its ancestors when considering how many penalty jiffies to
      charge.  This makes sure that we penalise poorly behaving cgroups in the
      same way regardless of at what level of the hierarchy memory.high was
      breached.
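      
      A sketch of the ancestor walk the patch introduces (condensed from the
      helper it adds in mm/memcontrol.c; calculate_overage() stands for the
      shifted (usage - high) / high computation, returning 0 when not over):
      
        u64 overage, max_overage = 0;
      
        /* penalise based on the worst memory.high overage up the tree */
        do {
                overage = calculate_overage(page_counter_read(&memcg->memory),
                                            READ_ONCE(memcg->high));
                max_overage = max(overage, max_overage);
        } while ((memcg = parent_mem_cgroup(memcg)) &&
                 !mem_cgroup_is_root(memcg));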
      
      Fixes: 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high")
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[5.4.x+]
      Link: http://lkml.kernel.org/r/8cd132f84bd7e16cdb8fde3378cdbf05ba00d387.1584036142.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, memcg: fix corruption on 64-bit divisor in memory.high throttling · d397a45f
      Committed by Chris Down
      Commit 0e4b01df had a bunch of fixups to use the right division
      method.  However, it seems that after all that it still wasn't right --
      div_u64 takes a 32-bit divisor.
      
      The headroom is still large (2^32 pages), so on mundane systems you
      won't hit this, but this should definitely be fixed.
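      
      The fix is essentially a one-liner switching to the 64-by-64 division
      helper (a sketch of the affected expression in the memory.high
      throttling path):
      
        /*
         * div_u64() truncates its divisor to 32 bits; 'high' is in pages
         * and can exceed 2^32 on large machines.
         */
        overage = div64_u64((u64)(usage - high) << MEMCG_DELAY_PRECISION_SHIFT,
                            high);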
      
      Fixes: 0e4b01df ("mm, memcg: throttle allocators when failing reclaim over memory.high")
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: <stable@vger.kernel.org>	[5.4.x+]
      Link: http://lkml.kernel.org/r/80780887060514967d414b3cd91f9a316a16ab98.1584036142.git.chris@chrisdown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case · d41e2f3b
      Committed by Baoquan He
      In section_deactivate(), pfn_to_page() doesn't work any more after
      ms->section_mem_map is reset to NULL in the SPARSEMEM|!VMEMMAP case.
      It causes a hot remove failure:
      
        kernel BUG at mm/page_alloc.c:4806!
        invalid opcode: 0000 [#1] SMP PTI
        CPU: 3 PID: 8 Comm: kworker/u16:0 Tainted: G        W         5.5.0-next-20200205+ #340
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
        Workqueue: kacpi_hotplug acpi_hotplug_work_fn
        RIP: 0010:free_pages+0x85/0xa0
        Call Trace:
         __remove_pages+0x99/0xc0
         arch_remove_memory+0x23/0x4d
         try_remove_memory+0xc8/0x130
         __remove_memory+0xa/0x11
         acpi_memory_device_remove+0x72/0x100
         acpi_bus_trim+0x55/0x90
         acpi_device_hotplug+0x2eb/0x3d0
         acpi_hotplug_work_fn+0x1a/0x30
         process_one_work+0x1a7/0x370
         worker_thread+0x30/0x380
         kthread+0x112/0x130
         ret_from_fork+0x35/0x40
      
      Let's move the ->section_mem_map resetting after
      depopulate_section_memmap() to fix it.
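      
      A sketch of the resulting order in section_deactivate() (simplified):
      
        /* memmap, and thus pfn_to_page(), must stay valid until here */
        depopulate_section_memmap(pfn, nr_pages, altmap);
      
        /* only now is it safe to reset the section's mem_map encoding */
        ms->section_mem_map = (unsigned long)NULL;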
      
      [akpm@linux-foundation.org: remove unneeded initialization, per David]
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200307084229.28251-2-bhe@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix NULL pointer dereference in __mem_cgroup_usage_unregister_event · 7d36665a
      Committed by Chunguang Xu
      When an eventfd that monitors multiple memory thresholds of a cgroup is
      closed, the kernel deletes all events related to this eventfd.  If,
      before all events are deleted, another eventfd starts monitoring a
      memory threshold of this cgroup, it leads to a crash:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000004
        #PF: supervisor write access in kernel mode
        #PF: error_code(0x0002) - not-present page
        PGD 800000033058e067 P4D 800000033058e067 PUD 3355ce067 PMD 0
        Oops: 0002 [#1] SMP PTI
        CPU: 2 PID: 14012 Comm: kworker/2:6 Kdump: loaded Not tainted 5.6.0-rc4 #3
        Hardware name: LENOVO 20AWS01K00/20AWS01K00, BIOS GLET70WW (2.24 ) 05/21/2014
        Workqueue: events memcg_event_remove
        RIP: 0010:__mem_cgroup_usage_unregister_event+0xb3/0x190
        RSP: 0018:ffffb47e01c4fe18 EFLAGS: 00010202
        RAX: 0000000000000001 RBX: ffff8bb223a8a000 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: ffff8bb22fb83540 RDI: 0000000000000001
        RBP: ffffb47e01c4fe48 R08: 0000000000000000 R09: 0000000000000010
        R10: 000000000000000c R11: 071c71c71c71c71c R12: ffff8bb226aba880
        R13: ffff8bb223a8a480 R14: 0000000000000000 R15: 0000000000000000
        FS:  0000000000000000(0000) GS:ffff8bb242680000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000004 CR3: 000000032c29c003 CR4: 00000000001606e0
        Call Trace:
          memcg_event_remove+0x32/0x90
          process_one_work+0x172/0x380
          worker_thread+0x49/0x3f0
          kthread+0xf8/0x130
          ret_from_fork+0x35/0x40
        CR2: 0000000000000004
      
      We can reproduce this problem in the following ways:
      
      1. We create a new cgroup subdirectory and a new eventfd, and then we
         monitor multiple memory thresholds of the cgroup through this
         eventfd.
      
      2. On closing this eventfd, __mem_cgroup_usage_unregister_event() will
         be called multiple times to delete all events related to this
         eventfd.
      
      The first time __mem_cgroup_usage_unregister_event() is called, the
      kernel will clear all items related to this eventfd in
      thresholds->primary.
      
      Since there is currently only one eventfd, thresholds->primary becomes
      empty, so the kernel will set thresholds->primary and thresholds->spare
      to NULL.  If at this time the user creates a new eventfd and monitors
      the memory threshold of this cgroup, the kernel will re-initialize
      thresholds->primary.
      
      Then when __mem_cgroup_usage_unregister_event() is called for the
      second time, because thresholds->primary is not empty, the system will
      access thresholds->spare, but thresholds->spare is NULL, which will
      trigger a crash.
      
      In general, the longer it takes to delete all events related to this
      eventfd, the easier it is to trigger this problem.
      
      The solution is to check whether the thresholds associated with the
      eventfd has been cleared when deleting the event.  If so, we do nothing.
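      
      A sketch of the guard added in __mem_cgroup_usage_unregister_event()
      (matching the shape of the fix):
      
        mutex_lock(&memcg->thresholds_lock);
      
        /*
         * The events for this eventfd may already have been removed by
         * an earlier call that also freed both arrays; nothing to do.
         */
        if (!thresholds->primary)
                goto unlock;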
      
      [akpm@linux-foundation.org: fix comment, per Kirill]
      Fixes: 907860ed ("cgroups: make cftype.unregister_event() void-returning")
      Signed-off-by: Chunguang Xu <brookxu@tencent.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/077a6f67-aefa-4591-efec-f2f3af2b0b02@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 19 Mar 2020, 2 commits
    • mm: slub: be more careful about the double cmpxchg of freelist · 5076190d
      Committed by Linus Torvalds
      This is just a cleanup addition to Jann's fix to properly update the
      transaction ID for the slub slowpath in commit fd4d9c7d ("mm: slub:
      add missing TID bump..").
      
      The transaction ID is what protects us against any concurrent accesses,
      but we should really also make sure to make the 'freelist' comparison
      itself always use the same freelist value that we then used as the new
      next free pointer.
      
      Jann points out that if we do all of this carefully, we could skip the
      transaction ID update for all the paths that only remove entries from
      the lists, and only update the TID when adding entries (to avoid the ABA
      issue with cmpxchg and list handling re-adding a previously seen value).
      
      But this patch just does the "make sure to cmpxchg the same value we
      used" part, rather than trying to be clever.
      Acked-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: slub: add missing TID bump in kmem_cache_alloc_bulk() · fd4d9c7d
      Committed by Jann Horn
      When kmem_cache_alloc_bulk() attempts to allocate N objects from a percpu
      freelist of length M, and N > M > 0, it will first remove the M elements
      from the percpu freelist, then call ___slab_alloc() to allocate the next
      element and repopulate the percpu freelist. ___slab_alloc() can re-enable
      IRQs via allocate_slab(), so the TID must be bumped before ___slab_alloc()
      to properly commit the freelist head change.
      
      Fix it by unconditionally bumping c->tid when entering the slowpath.
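      
      A sketch of the bump at the entry of the bulk-alloc slowpath (this
      mirrors the shape of the fix):
      
        if (unlikely(!object)) {
                /*
                 * The fastpath may already have taken objects off
                 * c->freelist without bumping c->tid, and ___slab_alloc()
                 * can re-enable IRQs: commit the freelist head change
                 * before entering it.
                 */
                c->tid = next_tid(c->tid);
      
                /* ___slab_alloc() follows */
        }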
      
      Cc: stable@vger.kernel.org
      Fixes: ebe909e0 ("slub: improve bulk alloc strategy")
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 11 Mar 2020, 2 commits
    • net: memcg: late association of sock to memcg · d752a498
      Committed by Shakeel Butt
      If a TCP socket is allocated in IRQ context, or cloned from an
      unassociated (i.e. not associated to a memcg) socket in IRQ context,
      then it will remain unassociated for its whole life.  Almost half of
      the TCP sockets created on the system are created in IRQ context, so
      memory used by such sockets will not be accounted by the memcg.
      
      This issue is more widespread in cgroup v1 where network memory
      accounting is opt-in but it can happen in cgroup v2 if the source socket
      for the cloning was created in root memcg.
      
      To fix the issue, just do the association of the sockets at the accept()
      time in the process context and then force charge the memory buffer
      already used and reserved by the socket.
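      
      A sketch of the accept-time association (condensed from the patch;
      locking and the exact charge amount are simplified):
      
        if (mem_cgroup_sockets_enabled) {
                int amt;
      
                lock_sock(newsk);
      
                /* charge what the socket already consumed and reserved */
                amt = sk_mem_pages(newsk->sk_forward_alloc +
                                   atomic_read(&newsk->sk_rmem_alloc));
                mem_cgroup_sk_alloc(newsk);
                if (newsk->sk_memcg && amt)
                        mem_cgroup_charge_skmem(newsk->sk_memcg, amt);
      
                release_sock(newsk);
        }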
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • cgroup: memcg: net: do not associate sock with unrelated cgroup · e876ecc6
      Committed by Shakeel Butt
      We are testing network memory accounting in our setup and noticed
      inconsistent network memory usage, and often an unrelated cgroup's
      network usage correlates with the testing workload.  On further
      inspection, it seems like mem_cgroup_sk_alloc() and cgroup_sk_alloc()
      are broken in irq context, especially for cgroup v1.
      
      mem_cgroup_sk_alloc() and cgroup_sk_alloc() can be called in irq
      context and kind of assume that this can only happen from
      sk_clone_lock() and that the source sock object already has an
      associated cgroup.  However in cgroup v1, where network memory
      accounting is opt-in, the source sock can be unassociated with any
      cgroup and the new cloned sock can get associated with an unrelated,
      interrupted cgroup.
      
      Cgroup v2 can also suffer if the source sock object was created by a
      process in the root cgroup, or if sk_alloc() is called in irq context.
      The fix is to just do nothing in interrupt context.
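      
      A sketch of the early return (matching the shape of the fix; the same
      guard goes into cgroup_sk_alloc()):
      
        void mem_cgroup_sk_alloc(struct sock *sk)
        {
                if (!mem_cgroup_sockets_enabled)
                        return;
      
                /*
                 * In interrupt context 'current' is unrelated to the
                 * socket; associating here would charge a random,
                 * interrupted cgroup.
                 */
                if (in_interrupt())
                        return;
      
                /* existing association with current's memcg follows */
        }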
      
      WARNING: Please note that about half of the TCP sockets are allocated
      from the IRQ context, so, memory used by such sockets will not be
      accounted by the memcg.
      
      The stack trace of mem_cgroup_sk_alloc() from IRQ-context:
      
      CPU: 70 PID: 12720 Comm: ssh Tainted:  5.6.0-smp-DEV #1
      Hardware name: ...
      Call Trace:
       <IRQ>
       dump_stack+0x57/0x75
       mem_cgroup_sk_alloc+0xe9/0xf0
       sk_clone_lock+0x2a7/0x420
       inet_csk_clone_lock+0x1b/0x110
       tcp_create_openreq_child+0x23/0x3b0
       tcp_v6_syn_recv_sock+0x88/0x730
       tcp_check_req+0x429/0x560
       tcp_v6_rcv+0x72d/0xa40
       ip6_protocol_deliver_rcu+0xc9/0x400
       ip6_input+0x44/0xd0
       ? ip6_protocol_deliver_rcu+0x400/0x400
       ip6_rcv_finish+0x71/0x80
       ipv6_rcv+0x5b/0xe0
       ? ip6_sublist_rcv+0x2e0/0x2e0
       process_backlog+0x108/0x1e0
       net_rx_action+0x26b/0x460
       __do_softirq+0x104/0x2a6
       do_softirq_own_stack+0x2a/0x40
       </IRQ>
       do_softirq.part.19+0x40/0x50
       __local_bh_enable_ip+0x51/0x60
       ip6_finish_output2+0x23d/0x520
       ? ip6table_mangle_hook+0x55/0x160
       __ip6_finish_output+0xa1/0x100
       ip6_finish_output+0x30/0xd0
       ip6_output+0x73/0x120
       ? __ip6_finish_output+0x100/0x100
       ip6_xmit+0x2e3/0x600
       ? ipv6_anycast_cleanup+0x50/0x50
       ? inet6_csk_route_socket+0x136/0x1e0
       ? skb_free_head+0x1e/0x30
       inet6_csk_xmit+0x95/0xf0
       __tcp_transmit_skb+0x5b4/0xb20
       __tcp_send_ack.part.60+0xa3/0x110
       tcp_send_ack+0x1d/0x20
       tcp_rcv_state_process+0xe64/0xe80
       ? tcp_v6_connect+0x5d1/0x5f0
       tcp_v6_do_rcv+0x1b1/0x3f0
       ? tcp_v6_do_rcv+0x1b1/0x3f0
       __release_sock+0x7f/0xd0
       release_sock+0x30/0xa0
       __inet_stream_connect+0x1c3/0x3b0
       ? prepare_to_wait+0xb0/0xb0
       inet_stream_connect+0x3b/0x60
       __sys_connect+0x101/0x120
       ? __sys_getsockopt+0x11b/0x140
       __x64_sys_connect+0x1a/0x20
       do_syscall_64+0x51/0x200
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Fixes: 2d758073 ("mm: memcontrol: consolidate cgroup socket tracking")
      Fixes: d979a39d ("cgroup: duplicate cgroup reference when cloning sockets")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 06 Mar 2020, 5 commits
    • mm, hotplug: fix page online with DEBUG_PAGEALLOC compiled but not enabled · c87cbc1f
      Committed by Vlastimil Babka
      Commit cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      fixed memory hotplug with debug_pagealloc enabled, where onlining a page
      goes through page freeing, which removes the direct mapping.  Some arches
      don't like when the page is not mapped in the first place, so
      generic_online_page() maps it first.  This is somewhat wasteful, but
      better than special casing page freeing fast paths.
      
      The commit however missed that DEBUG_PAGEALLOC configured doesn't mean
      it's actually enabled.  One has to test debug_pagealloc_enabled() since
      031bc574 ("mm/debug-pagealloc: make debug-pagealloc boottime
      configurable"), or alternatively debug_pagealloc_enabled_static() since
      8e57f8ac ("mm, debug_pagealloc: don't rely on static keys too early"),
      but this is not done.
      
      As a result, a s390 kernel with DEBUG_PAGEALLOC configured but not enabled
      will crash:
      
      Unable to handle kernel pointer dereference in virtual kernel address space
      Failing address: 0000000000000000 TEID: 0000000000000483
      Fault in home space mode while using kernel ASCE.
      AS:0000001ece13400b R2:000003fff7fd000b R3:000003fff7fcc007 S:000003fff7fd7000 P:000000000000013d
      Oops: 0004 ilc:2 [#1] SMP
      CPU: 1 PID: 26015 Comm: chmem Kdump: loaded Tainted: GX 5.3.18-5-default #1 SLE15-SP2 (unreleased)
      Krnl PSW : 0704e00180000000 0000001ecd281b9e (__kernel_map_pages+0x166/0x188)
      R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
      Krnl GPRS: 0000000000000000 0000000000000800 0000400b00000000 0000000000000100
      0000000000000001 0000000000000000 0000000000000002 0000000000000100
      0000001ece139230 0000001ecdd98d40 0000400b00000100 0000000000000000
      000003ffa17e4000 001fffe0114f7d08 0000001ecd4d93ea 001fffe0114f7b20
      Krnl Code: 0000001ecd281b8e: ec17ffff00d8 ahik %r1,%r7,-1
      0000001ecd281b94: ec111dbc0355 risbg %r1,%r1,29,188,3
      >0000001ecd281b9e: 94fb5006 ni 6(%r5),251
      0000001ecd281ba2: 41505008 la %r5,8(%r5)
      0000001ecd281ba6: ec51fffc6064 cgrj %r5,%r1,6,1ecd281b9e
      0000001ecd281bac: 1a07 ar %r0,%r7
      0000001ecd281bae: ec03ff584076 crj %r0,%r3,4,1ecd281a5e
      Call Trace:
      [<0000001ecd281b9e>] __kernel_map_pages+0x166/0x188
      [<0000001ecd4d9516>] online_pages_range+0xf6/0x128
      [<0000001ecd2a8186>] walk_system_ram_range+0x7e/0xd8
      [<0000001ecda28aae>] online_pages+0x2fe/0x3f0
      [<0000001ecd7d02a6>] memory_subsys_online+0x8e/0xc0
      [<0000001ecd7add42>] device_online+0x5a/0xc8
      [<0000001ecd7d0430>] state_store+0x88/0x118
      [<0000001ecd5b9f62>] kernfs_fop_write+0xc2/0x200
      [<0000001ecd5064b6>] vfs_write+0x176/0x1e0
      [<0000001ecd50676a>] ksys_write+0xa2/0x100
      [<0000001ecda315d4>] system_call+0xd8/0x2c8
      
      Fix this by checking debug_pagealloc_enabled_static() before calling
      kernel_map_pages().  Backports for kernels before 5.5 should use
      debug_pagealloc_enabled() instead.  Also add comments.
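      
      A sketch of the guarded call in generic_online_page() (matching the
      shape of the fix):
      
        /*
         * Freeing the page with debug_pagealloc enabled will try to
         * unmap it, so map it first.  Only do so when the feature is
         * actually enabled at runtime, not merely compiled in.
         */
        if (debug_pagealloc_enabled_static())
                kernel_map_pages(page, 1 << order, 1);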
      
      Fixes: cd02cf1a ("mm/hotplug: fix an imbalance with DEBUG_PAGEALLOC")
      Reported-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200224094651.18257-1-vbabka@suse.cz
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/z3fold.c: do not include rwlock.h directly · a8198fed
      Committed by Sebastian Andrzej Siewior
      rwlock.h should not be included directly; linux/spinlock.h should be
      included instead.  Among other things, including rwlock.h directly
      breaks the RT build.
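      
      The change itself is a one-line include swap in mm/z3fold.c (sketch):
      
        #include <linux/spinlock.h>   /* was: #include <linux/rwlock.h> */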
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Vitaly Wool <vitaly.wool@konsulko.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: http://lkml.kernel.org/r/20200224133631.1510569-1-bigeasy@linutronix.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: avoid data corruption on CoW fault into PFN-mapped VMA · c3e5ea6e
      Committed by Kirill A. Shutemov
      Jeff Moyer has reported that one of xfstests triggers a warning when run
      on DAX-enabled filesystem:
      
      	WARNING: CPU: 76 PID: 51024 at mm/memory.c:2317 wp_page_copy+0xc40/0xd50
      	...
      	wp_page_copy+0x98c/0xd50 (unreliable)
      	do_wp_page+0xd8/0xad0
      	__handle_mm_fault+0x748/0x1b90
      	handle_mm_fault+0x120/0x1f0
      	__do_page_fault+0x240/0xd70
      	do_page_fault+0x38/0xd0
      	handle_page_fault+0x10/0x30
      
      The warning happens on failed __copy_from_user_inatomic() which tries to
      copy data into a CoW page.
      
      This happens because of a race between MADV_DONTNEED and the CoW page
      fault:
      
      	CPU0					CPU1
       handle_mm_fault()
         do_wp_page()
           wp_page_copy()
             do_wp_page()
      					madvise(MADV_DONTNEED)
      					  zap_page_range()
      					    zap_pte_range()
      					      ptep_get_and_clear_full()
      					      <TLB flush>
      	 __copy_from_user_inatomic()
      	 sees empty PTE and fails
      	 WARN_ON_ONCE(1)
      	 clear_page()
      
      The solution is to re-try __copy_from_user_inatomic() under the PTL
      after checking that the PTE matches orig_pte.
      
      The second copy attempt can still fail, for example due to a
      non-readable PTE, but there's nothing reasonable we can do about it
      except clearing the CoW page.
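      
      A sketch of the retry in cow_user_page() (heavily condensed; locking,
      kmap and accounting details are omitted):
      
        if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE)) {
                /* the PTE may have been zapped: recheck under the PTL */
                vmf->pte = pte_offset_map_lock(mm, vmf->pmd, addr, &vmf->ptl);
                if (!pte_same(*vmf->pte, vmf->orig_pte)) {
                        /* PTE changed: bail out, the fault will be retried */
                        ret = false;
                        goto pte_unlock;
                }
                /* retry the copy; on repeated failure fall back to clearing */
                if (__copy_from_user_inatomic(kaddr, uaddr, PAGE_SIZE))
                        clear_page(kaddr);
        }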
      Reported-by: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Jeff Moyer <jmoyer@redhat.com>
      Cc: <stable@vger.kernel.org>
      Cc: Justin He <Justin.He@arm.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Link: http://lkml.kernel.org/r/20200218154151.13349-1-kirill.shutemov@linux.intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix possible PMD dirty bit lost in set_pmd_migration_entry() · 8a8683ad
      Committed by Huang Ying
      In set_pmd_migration_entry(), pmdp_invalidate() is used to change the
      PMD atomically.  But the PMD is read before that with an ordinary
      memory read.  If the THP (transparent huge page) is written to between
      the PMD read and pmdp_invalidate(), the PMD dirty bit may be lost and
      cause data corruption.  The race window is quite small, but still
      possible in theory, so it needs to be fixed.
      
      The race is fixed via using the return value of pmdp_invalidate() to get
      the original content of PMD, which is a read/modify/write atomic
      operation.  So no THP writing can occur in between.
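      
      A sketch of the change (matching the shape of the fix in
      mm/huge_memory.c):
      
        /*
         * pmdp_invalidate() is an atomic read-modify-write and returns
         * the old PMD, so no write to the THP can slip in between.
         */
        pmdval = pmdp_invalidate(vma, address, pvmw->pmd);
        if (pmd_dirty(pmdval))
                set_page_dirty(page);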
      
      The race has been introduced when the THP migration support is added in
      the commit 616b8371 ("mm: thp: enable thp migration in generic path").
      But this fix depends on the commit d52605d7 ("mm: do not lose dirty
      and accessed bits in pmdp_invalidate()").  So it's easy to be backported
      after v4.16.  But the race window is really small, so it may be fine not
      to backport the fix at all.
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Zi Yan <ziy@nvidia.com>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Link: http://lkml.kernel.org/r/20200220075220.2327056-1-ying.huang@intel.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, numa: fix bad pmd by atomically check for pmd_trans_huge when marking page tables prot_numa · 8b272b3c
      Committed by Mel Gorman
      : A user reported a bug against a distribution kernel while running a
      : proprietary workload described as "memory intensive that is not swapping"
      : that is expected to apply to mainline kernels.  The workload is
      : read/write/modifying ranges of memory and checking the contents.  They
      : reported that within a few hours that a bad PMD would be reported followed
      : by a memory corruption where expected data was all zeros.  A partial
      : report of the bad PMD looked like
      :
      :   [ 5195.338482] ../mm/pgtable-generic.c:33: bad pmd ffff8888157ba008(000002e0396009e2)
      :   [ 5195.341184] ------------[ cut here ]------------
      :   [ 5195.356880] kernel BUG at ../mm/pgtable-generic.c:35!
      :   ....
      :   [ 5195.410033] Call Trace:
      :   [ 5195.410471]  [<ffffffff811bc75d>] change_protection_range+0x7dd/0x930
      :   [ 5195.410716]  [<ffffffff811d4be8>] change_prot_numa+0x18/0x30
      :   [ 5195.410918]  [<ffffffff810adefe>] task_numa_work+0x1fe/0x310
      :   [ 5195.411200]  [<ffffffff81098322>] task_work_run+0x72/0x90
      :   [ 5195.411246]  [<ffffffff81077139>] exit_to_usermode_loop+0x91/0xc2
      :   [ 5195.411494]  [<ffffffff81003a51>] prepare_exit_to_usermode+0x31/0x40
      :   [ 5195.411739]  [<ffffffff815e56af>] retint_user+0x8/0x10
      :
      : Decoding revealed that the PMD was a valid prot_numa PMD and the bad PMD
      : was a false detection.  The bug does not trigger if automatic NUMA
      : balancing or transparent huge pages is disabled.
      :
      : The bug is due to a race in change_pmd_range between a pmd_trans_huge
      : and pmd_none_or_clear_bad check without any locks held.  During the
      : pmd_trans_huge check, a parallel protection update under lock can have
      : cleared the PMD and filled it with a prot_numa entry between the
      : transhuge check and the pmd_none_or_clear_bad check.
      :
      : While this could be fixed with heavy locking, it's only necessary to
      : make a copy of the PMD on the stack during change_pmd_range and avoid
      : races.  A new helper is created for this as the check is quite subtle
      : and the existing similar helper is not suitable.  This passed 154 hours
      : of testing (usually triggers between 20 minutes and 24 hours) without
      : detecting bad PMDs or corruption.  A basic test of an autonuma-intensive
      : workload showed no significant change in behaviour.
      
      Although Mel withdrew the patch in the face of the LKML comment at
      https://lkml.org/lkml/2017/4/10/922, the aforementioned race window is
      still open, and we have reports of the Linpack test reporting bad
      residuals after the bad PMD warning is observed.  In addition to that,
      bad rss-counter and non-zero pgtables assertions are triggered on mm
      teardown for the task hitting the bad PMD.
      
       host kernel: mm/pgtable-generic.c:40: bad pmd 00000000b3152f68(8000000d2d2008e7)
       ....
       host kernel: BUG: Bad rss-counter state mm:00000000b583043d idx:1 val:512
       host kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
      
      The issue is observed on a v4.18-based distribution kernel, but the race
      window is expected to be applicable to mainline kernels, as well.
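      
      A sketch of the lockless helper the patch adds in mm/mprotect.c (names
      follow the upstream fix; treat the details as illustrative):
      
        static inline int
        pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
        {
                pmd_t pmdval = pmd_read_atomic(pmd);    /* one stable copy */
      
                if (pmd_none(pmdval))
                        return 1;
                if (pmd_trans_huge(pmdval))
                        return 0;       /* huge: not "bad", handled elsewhere */
                if (unlikely(pmd_bad(pmdval))) {
                        pmd_clear_bad(pmd);
                        return 1;
                }
                return 0;
        }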
      
      [akpm@linux-foundation.org: fix comment typo, per Rafael]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Link: http://lkml.kernel.org/r/20200216191800.22423-1-aquini@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 22 Feb 2020, 4 commits
    • mm/sparsemem: pfn_to_page is not valid yet on SPARSEMEM · 18e19f19
      Committed by Wei Yang
      When we use SPARSEMEM instead of SPARSEMEM_VMEMMAP, pfn_to_page()
      doesn't work before sparse_init_one_section() is called.
      
      This leads to a crash when hotplugging memory:
      
          BUG: unable to handle page fault for address: 0000000006400000
          #PF: supervisor write access in kernel mode
          #PF: error_code(0x0002) - not-present page
          PGD 0 P4D 0
          Oops: 0002 [#1] SMP PTI
          CPU: 3 PID: 221 Comm: kworker/u16:1 Tainted: G        W         5.5.0-next-20200205+ #343
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
          Workqueue: kacpi_hotplug acpi_hotplug_work_fn
          RIP: 0010:__memset+0x24/0x30
          Code: cc cc cc cc cc cc 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 03 40 0f b6 f6 48 b8 01 01 01 01 01 01 01 01 48 0f af c6 <f3> 48 ab 89 d1 f3 aa 4c 89 c8 c3 90 49 89 f9 40 88 f0 48 89 d1 f3
          RSP: 0018:ffffb43ac0373c80 EFLAGS: 00010a87
          RAX: ffffffffffffffff RBX: ffff8a1518800000 RCX: 0000000000050000
          RDX: 0000000000000000 RSI: 00000000000000ff RDI: 0000000006400000
          RBP: 0000000000140000 R08: 0000000000100000 R09: 0000000006400000
          R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
          R13: 0000000000000028 R14: 0000000000000000 R15: ffff8a153ffd9280
          FS:  0000000000000000(0000) GS:ffff8a153ab00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000006400000 CR3: 0000000136fca000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           sparse_add_section+0x1c9/0x26a
           __add_pages+0xbf/0x150
           add_pages+0x12/0x60
           add_memory_resource+0xc8/0x210
           __add_memory+0x62/0xb0
           acpi_memory_device_add+0x13f/0x300
           acpi_bus_attach+0xf6/0x200
           acpi_bus_scan+0x43/0x90
           acpi_device_hotplug+0x275/0x3d0
           acpi_hotplug_work_fn+0x1a/0x30
           process_one_work+0x1a7/0x370
           worker_thread+0x30/0x380
           kthread+0x112/0x130
           ret_from_fork+0x35/0x40
      
      We should use the memmap pointer directly, as the code did before.
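      
      A sketch of the one-line fix in sparse_add_section() (shape of the
      upstream change):
      
        /*
         * was: page_init_poison(pfn_to_page(start_pfn), ...);
         * pfn_to_page() cannot be used before sparse_init_one_section()
         * has run.
         */
        page_init_poison(memmap, sizeof(struct page) * nr_pages);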
      
      On x86 the impact is limited to x86_32 builds, or x86_64 configurations
      that override the default setting for SPARSEMEM_VMEMMAP.
      
      Other memory hotplug archs (arm64, ia64, and ppc) also default to
      SPARSEMEM_VMEMMAP=y.
      
      [dan.j.williams@intel.com: changelog update]
      [rppt@linux.ibm.com: changelog update]
      Link: http://lkml.kernel.org/r/20200219030454.4844-1-bhe@redhat.com
      Fixes: ba72b4c8 ("mm/sparsemem: support sub-section hotplug")
      Signed-off-by: Wei Yang <richardw.yang@linux.intel.com>
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/vmscan.c: don't round up scan size for online memory cgroup · 76073c64
      Committed by Gavin Shan
      Commit 68600f62 ("mm: don't miss the last page because of round-off
      error") makes the scan size round up to @denominator regardless of the
      memory cgroup's state, online or offline.  This affects the overall
      reclaiming behavior: previously, the corresponding LRU list was
      eligible for reclaiming only when its size logically right shifted by
      @sc->priority was bigger than zero.
      
      For example, the inactive anonymous LRU list should have at least 0x4000
      pages to be eligible for reclaiming when we have 60/12 for
      swappiness/priority and without taking scan/rotation ratio into account.
      
      After the roundup is applied, the inactive anonymous LRU list becomes
      eligible for reclaiming when its size is bigger than or equal to 0x1000
      in the same condition.
      
          (0x4000 >> 12) * 60 / (60 + 140 + 1) = 1
          ((0x1000 >> 12) * 60 + 200) / (60 + 140 + 1) = 1
      
      aarch64 has 512MB huge page size when the base page size is 64KB.  The
      memory cgroup that has a huge page is always eligible for reclaiming in
      that case.
      
      The reclaiming is likely to stop after the huge page is reclaimed,
      meaning the further iteration on @sc->priority and the sibling and
      child memory cgroups will be skipped.  The overall behaviour has been
      changed.  This fixes the issue by applying the roundup to offlined
      memory cgroups only, to give more preference to reclaiming memory from
      offlined memory cgroups.  It sounds reasonable as that memory is
      unlikely to be used by anyone.
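      
      A sketch of the resulting computation in the scan-size path (shape of
      the upstream change):
      
        /*
         * Round up only for offline memcgs, whose leftover memory is
         * unlikely to be used by anyone.
         */
        scan = mem_cgroup_online(memcg) ?
               div64_u64(scan * fraction[file], denominator) :
               DIV64_U64_ROUND_UP(scan * fraction[file], denominator);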
      
      The issue was found by starting up 8 VMs on an Ampere Mustang machine,
      which has 8 CPUs and 16 GB memory.  Each VM is given 2 vCPUs and 2GB
      memory.  It took 264 seconds for all VMs to be completely up, and 784MB
      of swap was consumed after that.  With this patch applied, it took 236
      seconds and 60MB of swap to do the same thing.  So there is a 10%
      performance improvement for my case.  Note that KSM is disabled while
      THP is enabled in the testing.
      
               total     used    free   shared  buff/cache   available
         Mem:  16196    10065    2049       16        4081        3749
         Swap:  8175      784    7391
               total     used    free   shared  buff/cache   available
         Mem:  16196    11324    3656       24        1215        2936
         Swap:  8175       60    8115
      
      Link: http://lkml.kernel.org/r/20200211024514.8730-1-gshan@redhat.com
      Fixes: 68600f62 ("mm: don't miss the last page because of round-off error")
      Signed-off-by: Gavin Shan <gshan@redhat.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/memcontrol.c: lost css_put in memcg_expand_shrinker_maps() · 75866af6
      Committed by Vasily Averin
      for_each_mem_cgroup() increases the css reference counter of each
      memory cgroup it visits and requires the use of mem_cgroup_iter_break()
      if the walk is cancelled.
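      
      A sketch of the fix in memcg_expand_shrinker_maps() (shape of the
      upstream change):
      
        for_each_mem_cgroup(memcg) {
                ret = memcg_expand_one_shrinker_map(memcg, size, old_size);
                if (ret) {
                        /* cancel the walk: drop the css ref the iterator holds */
                        mem_cgroup_iter_break(NULL, memcg);
                        goto unlock;
                }
        }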
      
      Link: http://lkml.kernel.org/r/c98414fb-7e1f-da0f-867a-9340ec4bd30b@virtuozzo.com
      Fixes: 0a4465d3 ("mm, memcg: assign memcg-aware shrinkers bitmap to memcg")
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 20 Feb 2020, 1 commit
  10. 19 Feb 2020, 1 commit
  11. 08 Feb 2020, 3 commits
  12. 07 Feb 2020, 2 commits