1. 29 October 2019, 2 commits
    • mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() · 9792afbd
      David Hildenbrand authored
      commit 96c804a6ae8c59a9092b3d5dd581198472063184 upstream.
      
      We should check for pfn_to_online_page() to not access uninitialized
      memmaps.  Reshuffle the code so we don't have to duplicate the error
      message.
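
      For reference, the reshuffled check looks roughly like this (a condensed
      sketch of the upstream change, not the literal 4.19 backport):

      	int memory_failure(unsigned long pfn, int flags)
      	{
      		struct page *p;
      		struct dev_pagemap *pgmap;
      		...
      		/* NULL for offline or not-yet-initialized sections */
      		p = pfn_to_online_page(pfn);
      		if (!p) {
      			if (pfn_valid(pfn)) {
      				pgmap = get_dev_pagemap(pfn, NULL);
      				if (pgmap)
      					return memory_failure_dev_pagemap(pfn,
      							flags, pgmap);
      			}
      			/* single copy of the error message */
      			pr_err("Memory failure: %#lx: memory outside kernel control\n",
      			       pfn);
      			return -ENXIO;
      		}
      		...
      	}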
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • memfd: Fix locking when tagging pins · 99b45e7a
      Matthew Wilcox (Oracle) authored
      The RCU lock is insufficient to protect the radix tree iteration as
      a deletion from the tree can occur before we take the spinlock to
      tag the entry.  In 4.19, this has manifested as a bug with the following
      trace:
      
      kernel BUG at lib/radix-tree.c:1429!
      invalid opcode: 0000 [#1] SMP KASAN PTI
      CPU: 7 PID: 6935 Comm: syz-executor.2 Not tainted 4.19.36 #25
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:radix_tree_tag_set+0x200/0x2f0 lib/radix-tree.c:1429
      Code: 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 44 24 10 e8 a3 29 7e fe 48 8b 44 24 10 48 0f ab 03 e9 d2 fe ff ff e8 90 29 7e fe <0f> 0b 48 c7 c7 e0 5a 87 84 e8 f0 e7 08 ff 4c 89 ef e8 4a ff ac fe
      RSP: 0018:ffff88837b13fb60 EFLAGS: 00010016
      RAX: 0000000000040000 RBX: ffff8883c5515d58 RCX: ffffffff82cb2ef0
      RDX: 0000000000000b72 RSI: ffffc90004cf2000 RDI: ffff8883c5515d98
      RBP: ffff88837b13fb98 R08: ffffed106f627f7e R09: ffffed106f627f7e
      R10: 0000000000000001 R11: ffffed106f627f7d R12: 0000000000000004
      R13: ffffea000d7fea80 R14: 1ffff1106f627f6f R15: 0000000000000002
      FS:  00007fa1b8df2700(0000) GS:ffff8883e2fc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa1b8df1db8 CR3: 000000037d4d2001 CR4: 0000000000160ee0
      Call Trace:
       memfd_tag_pins mm/memfd.c:51 [inline]
       memfd_wait_for_pins+0x2c5/0x12d0 mm/memfd.c:81
       memfd_add_seals mm/memfd.c:215 [inline]
       memfd_fcntl+0x33d/0x4a0 mm/memfd.c:247
       do_fcntl+0x589/0xeb0 fs/fcntl.c:421
       __do_sys_fcntl fs/fcntl.c:463 [inline]
       __se_sys_fcntl fs/fcntl.c:448 [inline]
       __x64_sys_fcntl+0x12d/0x180 fs/fcntl.c:448
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:293
      
      The problem does not occur in mainline due to the XArray rewrite which
      changed the locking to exclude modification of the tree during iteration.
      At the time, nobody realised this was a bugfix.  Backport the locking
      changes to stable.
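
      The shape of the fix, condensed (the iteration now runs entirely under
      the xa_lock instead of just the RCU read lock; a sketch, not the
      verbatim 4.19 patch):

      	static void memfd_tag_pins(struct address_space *mapping)
      	{
      		struct radix_tree_iter iter;
      		void __rcu **slot;
      		unsigned int tagged = 0;

      		lru_add_drain();

      		/* hold the lock across the walk so no entry can be
      		 * deleted between finding it and tagging it */
      		xa_lock_irq(&mapping->i_pages);
      		radix_tree_for_each_slot(slot, &mapping->i_pages, &iter, 0) {
      			struct page *page = radix_tree_deref_slot_protected(
      					slot, &mapping->i_pages.xa_lock);
      			if (page && page_count(page) - page_mapcount(page) > 1)
      				radix_tree_tag_set(&mapping->i_pages,
      						iter.index, MEMFD_TAG_PINNED);
      			/* drop the lock periodically to bound latency */
      			if (!(++tagged % 1024)) {
      				slot = radix_tree_iter_resume(slot, &iter);
      				xa_unlock_irq(&mapping->i_pages);
      				cond_resched();
      				xa_lock_irq(&mapping->i_pages);
      			}
      		}
      		xa_unlock_irq(&mapping->i_pages);
      	}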
      
      Cc: stable@vger.kernel.org
      Reported-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  2. 18 October 2019, 1 commit
  3. 12 October 2019, 1 commit
    • usercopy: Avoid HIGHMEM pfn warning · 12c6c4a5
      Kees Cook authored
      commit 314eed30ede02fa925990f535652254b5bad6b65 upstream.
      
      When running on a system with >512MB RAM with a 32-bit kernel built with:
      
      	CONFIG_DEBUG_VIRTUAL=y
      	CONFIG_HIGHMEM=y
      	CONFIG_HARDENED_USERCOPY=y
      
      all execve()s will fail because argv is copied into kmap()ed pages, and
      during the usercopy checks the resulting virt_to_page() calls will flag
      the kmap (highmem) pointers as "bad" due to CONFIG_DEBUG_VIRTUAL=y:
      
       ------------[ cut here ]------------
       kernel BUG at ../arch/x86/mm/physaddr.c:83!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc8 #6
       Hardware name: Dell Inc. Inspiron 1318/0C236D, BIOS A04 01/15/2009
       EIP: __phys_addr+0xaf/0x100
       ...
       Call Trace:
        __check_object_size+0xaf/0x3c0
        ? __might_sleep+0x80/0xa0
        copy_strings+0x1c2/0x370
        copy_strings_kernel+0x2b/0x40
        __do_execve_file+0x4ca/0x810
        ? kmem_cache_alloc+0x1c7/0x370
        do_execve+0x1b/0x20
        ...
      
      The check is from arch/x86/mm/physaddr.c:
      
      	VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
      
      Due to the kmap() in fs/exec.c:
      
      		kaddr = kmap(kmapped_page);
      	...
      	if (copy_from_user(kaddr+offset, str, bytes_to_copy)) ...
      
      Now we can fetch the correct page to avoid the pfn check. In both cases,
      hardened usercopy will need to walk the page-span checker (if enabled)
      to do sanity checking.
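
      The idea, as a condensed sketch of the change in check_heap_object()
      (mm/usercopy.c; simplified, not the verbatim diff):

      	static void check_heap_object(const void *ptr, unsigned long n,
      				      bool to_user)
      	{
      		struct page *page;

      		if (!virt_addr_valid(ptr))
      			return;

      		/*
      		 * kmap_to_page() resolves a kmap()ed highmem address to
      		 * its backing page, avoiding virt_to_page()'s lowmem-only
      		 * assumption and the DEBUG_VIRTUAL pfn check.
      		 */
      		page = kmap_to_page((void *)ptr);
      		...
      	}
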
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Tested-by: Randy Dunlap <rdunlap@infradead.org>
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lore.kernel.org/r/201909171056.7F2FFD17@keescook
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  4. 05 October 2019, 3 commits
    • mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone · 4d8bdf7f
      Yafang Shao authored
      [ Upstream commit a94b525241c0fff3598809131d7cfcfe1d572d8c ]
      
      total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
      COMPACTFREE_SCANNED in compact_zone().  We should clear them before
      scanning a new zone.  In the proc-triggered compaction path, we forgot
      to clear them.
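
      The fix initializes the counters (and list heads) in compact_zone()
      itself, so every caller, including the /proc trigger, starts from zero;
      roughly:

      	/* in compact_zone(), before scanning starts (sketch) */
      	cc->total_migrate_scanned = 0;
      	cc->total_free_scanned = 0;
      	cc->nr_migratepages = 0;
      	cc->nr_freepages = 0;
      	INIT_LIST_HEAD(&cc->freepages);
      	INIT_LIST_HEAD(&cc->migratepages);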
      
      [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
        Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
      [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
      [vbabka@suse.cz: squash compact_zone() list_head init as well]
        Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
      [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
      Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
      Fixes: 7f354a54 ("mm, compaction: add vmstats for kcompactd work")
      Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <shaoyafang@didiglobal.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • memcg, kmem: do not fail __GFP_NOFAIL charges · b4a734a5
      Michal Hocko authored
      commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8 upstream.
      
      Thomas has noticed the following NULL ptr dereference when using cgroup
      v1 kmem limit:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 0
      P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
      Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
      RIP: 0010:create_empty_buffers+0x24/0x100
      Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
      RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
      RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
      RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
      R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
      FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
      Call Trace:
       create_page_buffers+0x4d/0x60
       __block_write_begin_int+0x8e/0x5a0
       ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
       ? jbd2__journal_start+0xd7/0x1f0
       ext4_da_write_begin+0x112/0x3d0
       generic_perform_write+0xf1/0x1b0
       ? file_update_time+0x70/0x140
       __generic_file_write_iter+0x141/0x1a0
       ext4_file_write_iter+0xef/0x3b0
       __vfs_write+0x17e/0x1e0
       vfs_write+0xa5/0x1a0
       ksys_write+0x57/0xd0
       do_syscall_64+0x55/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Tetsuo then noticed that this is because __memcg_kmem_charge_memcg()
      fails a __GFP_NOFAIL charge when the kmem limit is reached.  This is
      wrong behavior because nofail allocations are not allowed to fail.  The
      normal charge path simply forces the charge even if that means crossing
      the limit.  Kmem accounting should do the same.
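
      The fix, condensed (in __memcg_kmem_charge_memcg(); a sketch, not the
      verbatim diff):

      	if (!page_counter_try_charge(&memcg->kmem, nr_pages, &counter)) {
      		/*
      		 * __GFP_NOFAIL callers have no failure handling; force
      		 * the charge past the kmem limit, like the normal
      		 * charge path does.
      		 */
      		if (gfp & __GFP_NOFAIL) {
      			page_counter_charge(&memcg->kmem, nr_pages);
      			return 0;
      		}
      		cancel_charge(memcg, nr_pages);
      		return -ENOMEM;
      	}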
      
      Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Thomas Lindroth <thomas.lindroth@gmail.com>
      Debugged-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • memcg, oom: don't require __GFP_FS when invoking memcg OOM killer · d40b3eaf
      Tetsuo Handa authored
      commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.
      
      Masoud Sharbiani noticed that commit 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") broke memcg OOM invoked from the
      __xfs_filemap_fault() path.  It turned out that try_charge() retries
      forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
      cannot invoke the OOM killer due to commit 3da88fb3 ("mm, oom:
      move GFP_NOFS check to out_of_memory").
      
      Allowing a forced charge because the memcg OOM killer cannot be invoked
      would lead to a global OOM situation.  Also, just returning -ENOMEM
      would be risky because the OOM path is lost and some paths (e.g.
      get_user_pages()) would leak -ENOMEM.  Therefore, invoking the memcg OOM
      killer (despite GFP_NOFS) is the only choice we have for now.
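
      The resulting check in out_of_memory() is roughly (sketch; memcg OOMs
      are exempted from the GFP_NOFS bail-out):

      	/*
      	 * The OOM killer does not compensate for IO-less reclaim, so a
      	 * !__GFP_FS allocation normally bails out here, but a memcg OOM
      	 * must still be able to invoke the killer.
      	 */
      	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS) && !is_memcg_oom(oc))
      		return true;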
      
      Until 29ef680a, we were able to invoke the memcg OOM killer when
      GFP_KERNEL reclaim failed [1].  But since 29ef680a, we need to
      invoke the memcg OOM killer when GFP_NOFS reclaim fails [2].  Although
      in the past we did invoke the memcg OOM killer for GFP_NOFS [3], we
      might get premature memcg OOM reports due to this patch.
      
      [1]
      
       leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
       CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x10a/0x2c0
        mem_cgroup_out_of_memory+0x46/0x80
        mem_cgroup_oom_synchronize+0x2e4/0x310
        ? high_work_func+0x20/0x20
        pagefault_out_of_memory+0x31/0x76
        mm_fault_error+0x55/0x115
        ? handle_mm_fault+0xfd/0x220
        __do_page_fault+0x433/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 158965
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [2]
      
       leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
       CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x109/0x2d0
        mem_cgroup_out_of_memory+0x46/0x80
        try_charge+0x58d/0x650
        ? __radix_tree_replace+0x81/0x100
        mem_cgroup_try_charge+0x7a/0x100
        __add_to_page_cache_locked+0x92/0x180
        add_to_page_cache_lru+0x4d/0xf0
        iomap_readpages_actor+0xde/0x1b0
        ? iomap_zero_range_actor+0x1d0/0x1d0
        iomap_apply+0xaf/0x130
        iomap_readpages+0x9f/0x150
        ? iomap_zero_range_actor+0x1d0/0x1d0
        xfs_vm_readpages+0x18/0x20 [xfs]
        read_pages+0x60/0x140
        __do_page_cache_readahead+0x193/0x1b0
        ondemand_readahead+0x16d/0x2c0
        page_cache_async_readahead+0x9a/0xd0
        filemap_fault+0x403/0x620
        ? alloc_set_pte+0x12c/0x540
        ? _cond_resched+0x14/0x30
        __xfs_filemap_fault+0x66/0x180 [xfs]
        xfs_filemap_fault+0x27/0x30 [xfs]
        __do_fault+0x19/0x40
        __handle_mm_fault+0x8e8/0xb60
        handle_mm_fault+0xfd/0x220
        __do_page_fault+0x238/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 7221
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [3]
      
       leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
       leaker cpuset=/ mems_allowed=0
       CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        [<ffffffffaf364147>] dump_stack+0x19/0x1b
        [<ffffffffaf35eb6a>] dump_header+0x90/0x229
        [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
        [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
        [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
        [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
        [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
        [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
        [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
        [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
        [<ffffffffaf371925>] do_page_fault+0x35/0x90
        [<ffffffffaf36d768>] page_fault+0x28/0x30
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 20628
       memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
       Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
       Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB
      
      Bisected by Masoud Sharbiani.
      
      Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
      Fixes: 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680a]
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Masoud Sharbiani <msharbiani@apple.com>
      Tested-by: Masoud Sharbiani <msharbiani@apple.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  5. 16 September 2019, 1 commit
  6. 06 September 2019, 1 commit
  7. 29 August 2019, 3 commits
  8. 25 August 2019, 6 commits
    • Revert "kmemleak: allow to coexist with fault injection" · 01d8d08f
      Yang Shi authored
      [ Upstream commit df9576def004d2cd5beedc00cb6e8901427634b9 ]
      
      When running ltp's oom test with kmemleak enabled, the warning below was
      triggered because the kernel detects that __GFP_NOFAIL is passed in with
      __GFP_DIRECT_RECLAIM cleared:
      
        WARNING: CPU: 105 PID: 2138 at mm/page_alloc.c:4608 __alloc_pages_nodemask+0x1c31/0x1d50
        Modules linked in: loop dax_pmem dax_pmem_core ip_tables x_tables xfs virtio_net net_failover virtio_blk failover ata_generic virtio_pci virtio_ring virtio libata
        CPU: 105 PID: 2138 Comm: oom01 Not tainted 5.2.0-next-20190710+ #7
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
        RIP: 0010:__alloc_pages_nodemask+0x1c31/0x1d50
        ...
         kmemleak_alloc+0x4e/0xb0
         kmem_cache_alloc+0x2a7/0x3e0
         mempool_alloc_slab+0x2d/0x40
         mempool_alloc+0x118/0x2b0
         bio_alloc_bioset+0x19d/0x350
         get_swap_bio+0x80/0x230
         __swap_writepage+0x5ff/0xb20
      
      mempool_alloc_slab() clears __GFP_DIRECT_RECLAIM, but kmemleak has had
      __GFP_NOFAIL set all the time since d9570ee3 ("kmemleak:
      allow to coexist with fault injection").  It doesn't make any sense to
      specify __GFP_NOFAIL together with a cleared __GFP_DIRECT_RECLAIM.

      According to the discussion on the mailing list, the commit should be
      reverted as a short-term solution.  Catalin Marinas will follow up with
      a better long-term solution.

      The failure rate of kmemleak metadata allocation may increase in some
      circumstances, but this is an expected side effect.
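
      After the revert, kmemleak's allocation mask no longer forces
      __GFP_NOFAIL, so it can no longer contradict a caller that cleared
      __GFP_DIRECT_RECLAIM; roughly:

      	/* mm/kmemleak.c after the revert (sketch) */
      	#define gfp_kmemleak_mask(gfp)	(((gfp) & (GFP_KERNEL | GFP_ATOMIC)) | \
      					 __GFP_NORETRY | __GFP_NOMEMALLOC | \
      					 __GFP_NOWARN)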
      
      Link: http://lkml.kernel.org/r/1563299431-111710-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: d9570ee3 ("kmemleak: allow to coexist with fault injection")
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: Catalin Marinas <catalin.marinas@arm.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/usercopy: use memory range to be accessed for wraparound check · 056368fc
      Isaac J. Manjarres authored
      commit 951531691c4bcaa59f56a316e018bc2ff1ddf855 upstream.
      
      Currently, when checking to see if accessing n bytes starting at address
      "ptr" will cause a wraparound in the memory addresses, the check in
      check_bogus_address() adds an extra byte, which is incorrect, as the
      range of addresses that will be accessed is [ptr, ptr + (n - 1)].
      
      This can lead to incorrectly detecting a wraparound in the memory
      address when trying to read 4 KB from memory that is mapped to the last
      possible page in the virtual address space, when in fact accessing that
      range of memory would not cause a wraparound to occur.
      
      Use the memory range that will actually be accessed when considering
      whether accessing a certain number of bytes will cause the memory
      address to wrap around.
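
      The off-by-one in isolation (a self-contained sketch of the check in
      check_bogus_address(); ptr is the start address and n the access
      length):

      	/* before: also rejects an access that ends exactly at the top
      	 * of the address space, because ptr + n wraps to 0 */
      	if (ptr + n < ptr)
      		return "<wrapped address>";

      	/* after: only reject if a byte that is actually accessed,
      	 * i.e. one of [ptr, ptr + (n - 1)], wraps */
      	if (ptr + (n - 1) < ptr)
      		return "<wrapped address>";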
      
      Link: http://lkml.kernel.org/r/1564509253-23287-1-git-send-email-isaacm@codeaurora.org
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Signed-off-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Signed-off-by: Isaac J. Manjarres <isaacm@codeaurora.org>
      Co-developed-by: Prasad Sodagudi <psodagud@codeaurora.org>
      Reviewed-by: William Kucharski <william.kucharski@oracle.com>
      Acked-by: Kees Cook <keescook@chromium.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Trilok Soni <tsoni@codeaurora.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/memcontrol.c: fix use after free in mem_cgroup_iter() · c8282f1b
      Miles Chen authored
      commit 54a83d6bcbf8f4700013766b974bf9190d40b689 upstream.
      
      This patch reports a use-after-free in mem_cgroup_iter() that persists
      after merging commit be2657752e9e ("mm: memcg: fix use after free in
      mem_cgroup_iter()").

      I work with the Android kernel trees (4.9 & 4.14), and commit
      be2657752e9e ("mm: memcg: fix use after free in mem_cgroup_iter()") has
      been merged into those trees.  However, I can still observe the
      use-after-free issues addressed by that commit (on low-end devices, a
      few times this month).
      
      backtrace:
              css_tryget <- crash here
              mem_cgroup_iter
              shrink_node
              shrink_zones
              do_try_to_free_pages
              try_to_free_pages
              __perform_reclaim
              __alloc_pages_direct_reclaim
              __alloc_pages_slowpath
              __alloc_pages_nodemask
      
      To debug, I poisoned mem_cgroup before freeing it:
      
        static void __mem_cgroup_free(struct mem_cgroup *memcg)
        {
              int node;

              for_each_node(node)
                      free_mem_cgroup_per_node_info(memcg, node);
              free_percpu(memcg->stat);
        +     /* poison memcg before freeing it */
        +     memset(memcg, 0x78, sizeof(struct mem_cgroup));
              kfree(memcg);
        }
      
      The coredump shows the position=0xdbbc2a00 is freed.
      
        (gdb) p/x ((struct mem_cgroup_per_node *)0xe5009e00)->iter[8]
        $13 = {position = 0xdbbc2a00, generation = 0x2efd}
      
        0xdbbc2a00:     0xdbbc2e00      0x00000000      0xdbbc2800      0x00000100
        0xdbbc2a10:     0x00000200      0x78787878      0x00026218      0x00000000
        0xdbbc2a20:     0xdcad6000      0x00000001      0x78787800      0x00000000
        0xdbbc2a30:     0x78780000      0x00000000      0x0068fb84      0x78787878
        0xdbbc2a40:     0x78787878      0x78787878      0x78787878      0xe3fa5cc0
        0xdbbc2a50:     0x78787878      0x78787878      0x00000000      0x00000000
        0xdbbc2a60:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a70:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a80:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2a90:     0x00000001      0x00000000      0x00000000      0x00100000
        0xdbbc2aa0:     0x00000001      0xdbbc2ac8      0x00000000      0x00000000
        0xdbbc2ab0:     0x00000000      0x00000000      0x00000000      0x00000000
        0xdbbc2ac0:     0x00000000      0x00000000      0xe5b02618      0x00001000
        0xdbbc2ad0:     0x00000000      0x78787878      0x78787878      0x78787878
        0xdbbc2ae0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2af0:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b00:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b10:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b20:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b30:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b40:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b50:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b60:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b70:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2b80:     0x78787878      0x78787878      0x00000000      0x78787878
        0xdbbc2b90:     0x78787878      0x78787878      0x78787878      0x78787878
        0xdbbc2ba0:     0x78787878      0x78787878      0x78787878      0x78787878
      
      In the reclaim path, try_to_free_pages() does not setup
      sc.target_mem_cgroup and sc is passed to do_try_to_free_pages(), ...,
      shrink_node().
      
      In mem_cgroup_iter(), root is set to root_mem_cgroup because
      sc->target_mem_cgroup is NULL.  It is possible to assign a memcg to
      root_mem_cgroup.nodeinfo.iter in mem_cgroup_iter().
      
              try_to_free_pages
              	struct scan_control sc = {...}, target_mem_cgroup is 0x0;
              do_try_to_free_pages
              shrink_zones
              shrink_node
              	 mem_cgroup *root = sc->target_mem_cgroup;
              	 memcg = mem_cgroup_iter(root, NULL, &reclaim);
              mem_cgroup_iter()
              	if (!root)
              		root = root_mem_cgroup;
              	...
      
              	css = css_next_descendant_pre(css, &root->css);
              	memcg = mem_cgroup_from_css(css);
              	cmpxchg(&iter->position, pos, memcg);
      
      My device uses memcg non-hierarchical mode.  When we release a memcg:
      invalidate_reclaim_iterators() reaches only dead_memcg and its parents.
      If non-hierarchical mode is used, invalidate_reclaim_iterators() never
      reaches root_mem_cgroup.
      
        static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
        {
              struct mem_cgroup *memcg = dead_memcg;

              for (; memcg; memcg = parent_mem_cgroup(memcg))
              ...
        }
      
      So the use after free scenario looks like:
      
        CPU1						CPU2
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            css = css_next_descendant_pre(css, &root->css);
            memcg = mem_cgroup_from_css(css);
            cmpxchg(&iter->position, pos, memcg);
      
              				invalidate_reclaim_iterators(memcg);
              				...
              				__mem_cgroup_free()
              					kfree(memcg);
      
        try_to_free_pages
        do_try_to_free_pages
        shrink_zones
        shrink_node
        mem_cgroup_iter()
            if (!root)
            	root = root_mem_cgroup;
            ...
            mz = mem_cgroup_nodeinfo(root, reclaim->pgdat->node_id);
            iter = &mz->iter[reclaim->priority];
            pos = READ_ONCE(iter->position);
            css_tryget(&pos->css) <- use after free
      
      To avoid this, we should also invalidate root_mem_cgroup.nodeinfo.iter
      in invalidate_reclaim_iterators().
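
      The fix, condensed (a sketch of the upstream change;
      __invalidate_reclaim_iterators() is the per-memcg helper the patch
      factors out):

      	static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
      	{
      		struct mem_cgroup *memcg = dead_memcg;
      		struct mem_cgroup *last;

      		do {
      			__invalidate_reclaim_iterators(memcg, dead_memcg);
      			last = memcg;
      		} while ((memcg = parent_mem_cgroup(memcg)));

      		/*
      		 * In cgroup1 non-hierarchical mode, parent_mem_cgroup()
      		 * does not walk up to root_mem_cgroup, so invalidate it
      		 * explicitly.
      		 */
      		if (last != root_mem_cgroup)
      			__invalidate_reclaim_iterators(root_mem_cgroup,
      						       dead_memcg);
      	}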
      
      [cai@lca.pw: fix -Wparentheses compilation warning]
        Link: http://lkml.kernel.org/r/1564580753-17531-1-git-send-email-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190730015729.4406-1-miles.chen@mediatek.com
      Fixes: 5ac8fb31 ("mm: memcontrol: convert reclaim iterator to simple css refcounting")
      Signed-off-by: Miles Chen <miles.chen@mediatek.com>
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: mempolicy: handle vma with unmovable pages mapped correctly in mbind · 3c0cb90e
      Yang Shi authored
      commit a53190a4aaa36494f4d7209fd1fcc6f2ee08e0e0 upstream.
      
      When running syzkaller internally, we ran into the below bug on 4.9.x
      kernel:
      
        kernel BUG at mm/huge_memory.c:2124!
        invalid opcode: 0000 [#1] SMP KASAN
        CPU: 0 PID: 1518 Comm: syz-executor107 Not tainted 4.9.168+ #2
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.5.1 01/01/2011
        task: ffff880067b34900 task.stack: ffff880068998000
        RIP: split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
        Call Trace:
          split_huge_page include/linux/huge_mm.h:100 [inline]
          queue_pages_pte_range+0x7e1/0x1480 mm/mempolicy.c:538
          walk_pmd_range mm/pagewalk.c:50 [inline]
          walk_pud_range mm/pagewalk.c:90 [inline]
          walk_pgd_range mm/pagewalk.c:116 [inline]
          __walk_page_range+0x44a/0xdb0 mm/pagewalk.c:208
          walk_page_range+0x154/0x370 mm/pagewalk.c:285
          queue_pages_range+0x115/0x150 mm/mempolicy.c:694
          do_mbind mm/mempolicy.c:1241 [inline]
          SYSC_mbind+0x3c3/0x1030 mm/mempolicy.c:1370
          SyS_mbind+0x46/0x60 mm/mempolicy.c:1352
          do_syscall_64+0x1d2/0x600 arch/x86/entry/common.c:282
          entry_SYSCALL_64_after_swapgs+0x5d/0xdb
        Code: c7 80 1c 02 00 e8 26 0a 76 01 <0f> 0b 48 c7 c7 40 46 45 84 e8 4c
        RIP  [<ffffffff81895d6b>] split_huge_page_to_list+0x8fb/0x1030 mm/huge_memory.c:2124
         RSP <ffff88006899f980>
      
      with the below test:
      
        uint64_t r[1] = {0xffffffffffffffff};
      
        int main(void)
        {
              syscall(__NR_mmap, 0x20000000, 0x1000000, 3, 0x32, -1, 0);
                                      intptr_t res = 0;
              res = syscall(__NR_socket, 0x11, 3, 0x300);
              if (res != -1)
                      r[0] = res;
              *(uint32_t*)0x20000040 = 0x10000;
              *(uint32_t*)0x20000044 = 1;
              *(uint32_t*)0x20000048 = 0xc520;
              *(uint32_t*)0x2000004c = 1;
              syscall(__NR_setsockopt, r[0], 0x107, 0xd, 0x20000040, 0x10);
              syscall(__NR_mmap, 0x20fed000, 0x10000, 0, 0x8811, r[0], 0);
              *(uint64_t*)0x20000340 = 2;
              syscall(__NR_mbind, 0x20ff9000, 0x4000, 0x4002, 0x20000340, 0x45d4, 3);
              return 0;
        }
      
      Actually the test does:
      
        mmap(0x20000000, 16777216, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x20000000
        socket(AF_PACKET, SOCK_RAW, 768)        = 3
        setsockopt(3, SOL_PACKET, PACKET_TX_RING, {block_size=65536, block_nr=1, frame_size=50464, frame_nr=1}, 16) = 0
        mmap(0x20fed000, 65536, PROT_NONE, MAP_SHARED|MAP_FIXED|MAP_POPULATE|MAP_DENYWRITE, 3, 0) = 0x20fed000
        mbind(..., MPOL_MF_STRICT|MPOL_MF_MOVE) = 0
      
      The setsockopt() would allocate compound pages (16 pages in this test)
      for packet tx ring, then the mmap() would call packet_mmap() to map the
      pages into the user address space specified by the mmap() call.
      
      When calling mbind(), it would scan the vma to queue the pages for
      migration to the new node.  It would split any huge page since 4.9
      doesn't support THP migration; however, the packet tx ring compound
      pages are not THP and are not even movable.  So, the above bug is
      triggered.
      
      However, the later kernel is not hit by this issue due to commit
      d44d363f ("mm: don't assume anonymous pages have SwapBacked flag"),
      which just removes the PageSwapBacked check for a different reason.
      
      But, there is a deeper issue.  According to the semantics of mbind(), it
      should return -EIO if MPOL_MF_MOVE or MPOL_MF_MOVE_ALL was specified
      together with MPOL_MF_STRICT, but the kernel was unable to move all
      existing pages in the range.  The tx ring of the packet socket is
      definitely not movable; however, mbind() returns success for this case.
      
      Although most socket files are associated with non-movable pages, XDP
      may have movable pages from gup.  So it is not sufficient to just check
      the underlying file type of the vma in vma_migratable().

      Change migrate_page_add() to check whether the page is movable; if it is
      unmovable, just return -EIO.  But do not abort the pte walk immediately,
      since there may be pages that are only temporarily off the LRU.  We
      should still migrate the other pages if MPOL_MF_MOVE* is specified.  Set
      the has_unmovable flag if some pages could not be moved, then eventually
      return -EIO from mbind().
      
      With this change the above test would return -EIO as expected.
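
      A condensed sketch of the reworked migrate_page_add() (the return value
      now reports an unmovable page instead of silently succeeding; stats
      accounting omitted):

      	static int migrate_page_add(struct page *page,
      			struct list_head *pagelist, unsigned long flags)
      	{
      		struct page *head = compound_head(page);

      		/* avoid migrating a page that is shared with others */
      		if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(head) == 1) {
      			if (!isolate_lru_page(head)) {
      				list_add_tail(&head->lru, pagelist);
      			} else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) {
      				/*
      				 * Isolation failed: a non-movable or
      				 * temporarily off-LRU page. Report it;
      				 * the caller keeps walking and returns
      				 * -EIO at the end.
      				 */
      				return -EIO;
      			}
      		}
      		return 0;
      	}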
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-3-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-3-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm: mempolicy: make the behavior consistent when MPOL_MF_MOVE* and MPOL_MF_STRICT were specified · cd825d87
      Yang Shi authored
      commit d883544515aae54842c21730b880172e7894fde9 upstream.
      
      When both MPOL_MF_MOVE* and MPOL_MF_STRICT were specified, mbind()
      should try its best to migrate misplaced pages; if some of the pages
      could not be migrated, it should return -EIO.
      
      There are three different sub-cases:
       1. vma is not migratable
       2. vma is migratable, but there are unmovable pages
       3. vma is migratable, pages are movable, but migrate_pages() fails
      
      If #1 happens, the kernel would just abort immediately and return -EIO,
      since a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when
      MPOL_MF_STRICT is specified").

      If #3 happens, the kernel would set the policy and migrate pages with
      best effort, but won't roll back the migrated pages or reset the policy.
      
      Before that commit, they behaved in the same way, and it is better to
      keep their behavior consistent.  But rolling back the migrated pages and
      resetting the policy is not feasible, so just make #1 behave the same
      as #3.
      
      Userspace will know that not everything was successfully migrated (via
      -EIO), and can take whatever steps it deems necessary - attempt
      rollback, determine which exact page(s) are violating the policy, etc.
      
      Make queue_pages_range() return 1 to indicate there are unmovable pages
      or vma is not migratable.
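
      So do_mbind() can distinguish the cases roughly like this (a simplified
      sketch; error-path details omitted):

      	ret = queue_pages_range(mm, start, end, nmask,
      				flags | MPOL_MF_INVERT, &pagelist);
      	if (ret < 0) {
      		err = ret;	/* hard error */
      		goto up_out;
      	}
      	...
      	nr_failed = migrate_pages(&pagelist, ...);
      	/* ret > 0: unmovable pages or an unmigratable vma were seen */
      	if ((ret > 0) || (nr_failed && (flags & MPOL_MF_STRICT)))
      		err = -EIO;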
      
      Case #2 is not handled correctly in the current kernel; the following
      patch will fix it.
      
      [yang.shi@linux.alibaba.com: fix review comments from Vlastimil]
        Link: http://lkml.kernel.org/r/1563556862-54056-2-git-send-email-yang.shi@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1561162809-59140-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • mm/hmm: fix bad subpage pointer in try_to_unmap_one · f0fed828
      Ralph Campbell authored
      commit 1de13ee59225dfc98d483f8cce7d83f97c0b31de upstream.
      
      When migrating an anonymous private page to a ZONE_DEVICE private page,
      the source page->mapping and page->index fields are copied to the
      destination ZONE_DEVICE struct page and the page_mapcount() is
      increased.  This is so rmap_walk() can be used to unmap and migrate the
      page back to system memory.
      
      However, try_to_unmap_one() computes the subpage pointer from a swap
      pte, which yields an invalid page pointer, and a kernel panic results,
      such as:
      
        BUG: unable to handle page fault for address: ffffea1fffffffc8
      
      Currently, only single pages can be migrated to device private memory so
      no subpage computation is needed and it can be set to "page".
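
      The fix itself is a one-line assignment in try_to_unmap_one(), with the
      comment added by the follow-up (sketch):

      	/*
      	 * subpage was computed from a swap pte above, which yields an
      	 * invalid pointer for a device private page. Only PAGE_SIZE
      	 * pages can currently be migrated to device private memory,
      	 * so the subpage is simply the page itself.
      	 */
      	subpage = page;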
      
      [rcampbell@nvidia.com: add comment]
        Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
      Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
      Fixes: a5430dda ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
      Signed-off-by: Ralph Campbell <rcampbell@nvidia.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Lai Jiangshan <jiangshanlai@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  9. 16 August 2019, 1 commit
  10. 07 August 2019, 2 commits
  11. 31 July 2019, 6 commits
    • mm: use down_read_killable for locking mmap_sem in access_remote_vm · b0768724
      Konstantin Khlebnikov authored
      [ Upstream commit 1e426fe28261b03f297992e89da3320b42816f4e ]
      
      This function is used by ptrace and proc files like /proc/pid/cmdline and
      /proc/pid/environ.
      
      access_remote_vm() never returns error codes: all errors are ignored and
      only the size of successfully read data is returned.  So, if the current
      task is killed, we simply return 0 (bytes read).

      mmap_sem could be held for a long time, or forever, if something goes
      wrong.  Using a killable lock permits cleanup of stuck tasks and
      simplifies investigation.
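
      The change, condensed (a sketch of __access_remote_vm(); a fatal signal
      now aborts the wait and is reported as 0 bytes read):

      	if (down_read_killable(&mm->mmap_sem))
      		return 0;	/* killed while waiting for the lock */
      	...
      	up_read(&mm->mmap_sem);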
      
      Link: http://lkml.kernel.org/r/156007494202.3335.16782303099589302087.stgit@buzz
      Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Reviewed-by: Michal Koutný <mkoutny@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/mmu_notifier: use hlist_add_head_rcu() · a8c568fc
      Jean-Philippe Brucker authored
      [ Upstream commit 543bdb2d825fe2400d6e951f1786d92139a16931 ]
      
      Make mmu_notifier_register() safer by issuing a memory barrier before
      registering a new notifier.  This fixes a theoretical bug on weakly
      ordered CPUs.  For example, take this simplified use of notifiers by a
      driver:
      
      	my_struct->mn.ops = &my_ops; /* (1) */
      	mmu_notifier_register(&my_struct->mn, mm)
      		...
      		hlist_add_head(&mn->hlist, &mm->mmu_notifiers); /* (2) */
      		...
      
      Once mmu_notifier_register() releases the mm locks, another thread can
      invalidate a range:
      
      	mmu_notifier_invalidate_range()
      		...
      		hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
      			if (mn->ops->invalidate_range)
      
      The read side relies on the data dependency between mn and ops to ensure
      that the pointer is properly initialized.  But the write side doesn't have
      any dependency between (1) and (2), so they could be reordered and the
      readers could dereference an invalid mn->ops.  mmu_notifier_register()
      does take all the mm locks before adding to the hlist, but those have
      acquire semantics which isn't sufficient.
      
      By calling hlist_add_head_rcu() instead of hlist_add_head() we update
      the hlist using a store-release, ensuring that readers see prior
      initialization of my_struct.  This situation is better illustrated by
      the litmus test MP+onceassign+derefonce.
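
      The fix itself is a one-liner in mmu_notifier_register() (sketch):

      	/* store-release: publishes my_struct->mn.ops before the node */
      	hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);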
      
      Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
      Fixes: cddb8a5c ("mmu-notifiers: core")
      Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/gup.c: remove some BUG_ONs from get_gate_page() · 041b127d
      Andy Lutomirski authored
      [ Upstream commit b5d1c39f34d1c9bca0c4b9ae2e339fbbe264a9c7 ]
      
      If we end up without a PGD or PUD entry backing the gate area, don't BUG
      -- just fail gracefully.
      
      It's not entirely implausible that this could happen some day on x86.  It
      doesn't right now even with an execute-only emulated vsyscall page because
      the fixmap shares the PUD, but the core mm code shouldn't rely on that
      particular detail to avoid OOPSing.
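
      A sketch of the change in get_gate_page() (graceful -EFAULT instead of
      BUG_ON):

      	pgd = pgd_offset_gate(mm, address);
      	if (pgd_none(*pgd))
      		return -EFAULT;		/* was: BUG_ON(pgd_none(*pgd)) */
      	p4d = p4d_offset(pgd, address);
      	if (p4d_none(*p4d))
      		return -EFAULT;		/* was: BUG_ON(p4d_none(*p4d)) */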
      
      Link: http://lkml.kernel.org/r/a1d9f4efb75b9d464e59fd6af00104b21c58f6f7.1561610798.git.luto@kernel.org
      Signed-off-by: Andy Lutomirski <luto@kernel.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/gup.c: mark undo_dev_pagemap as __maybe_unused · fa099d6d
      Guenter Roeck authored
      [ Upstream commit 790c73690c2bbecb3f6f8becbdb11ddc9bcff8cc ]
      
      Several mips builds generate the following build warning.
      
        mm/gup.c:1788:13: warning: 'undo_dev_pagemap' defined but not used
      
      The function is declared unconditionally but only called from behind
      various ifdefs. Mark it __maybe_unused.
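
      That is, roughly:

      	/*
      	 * Only called from behind CONFIG_* ifdefs; __maybe_unused
      	 * silences -Wunused-function on configs that never call it.
      	 */
      	static void __maybe_unused undo_dev_pagemap(int *nr, int nr_start,
      						    struct page **pages)
      	{
      		...
      	}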
      
      Link: http://lkml.kernel.org/r/1562072523-22311-1-git-send-email-linux@roeck-us.net
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/kmemleak.c: fix check for softirq context · 071f2135
      Dmitry Vyukov authored
      [ Upstream commit 6ef9056952532c3b746de46aa10d45b4d7797bd8 ]
      
      in_softirq() is the wrong predicate to check whether we are in a softirq
      context.  It also returns true when BH is disabled, so objects are
      falsely stamped with the "softirq" comm.  The correct predicate is
      in_serving_softirq().
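
      The predicate swap in kmemleak's create_object(), condensed (sketch):

      	if (in_irq()) {
      		object->pid = 0;
      		strncpy(object->comm, "hardirq", sizeof(object->comm));
      	} else if (in_serving_softirq()) {	/* was: in_softirq() */
      		object->pid = 0;
      		strncpy(object->comm, "softirq", sizeof(object->comm));
      	} else {
      		object->pid = current->pid;
      		strncpy(object->comm, current->comm, sizeof(object->comm));
      	}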
      
      If a user cats /sys/kernel/debug/kmemleak, previously they would see
      entries like the following, which is clearly wrong since this is
      system-call context (see the comm):
      
      unreferenced object 0xffff88805bd661c0 (size 64):
        comm "softirq", pid 0, jiffies 4294942959 (age 12.400s)
        hex dump (first 32 bytes):
          00 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  ................
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<0000000007dcb30c>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<0000000007dcb30c>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<0000000007dcb30c>] slab_alloc mm/slab.c:3326 [inline]
          [<0000000007dcb30c>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<00000000969722b7>] kmalloc include/linux/slab.h:547 [inline]
          [<00000000969722b7>] kzalloc include/linux/slab.h:742 [inline]
          [<00000000969722b7>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<00000000969722b7>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<00000000a4134b5f>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<00000000d20248ad>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
          [<000000003d367be7>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<000000003c7c76af>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<000000000c1aeb23>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
          [<000000000157b92b>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
          [<00000000a9f3d058>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000a9f3d058>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000a9f3d058>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<000000001b8da885>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
          [<00000000ba770c62>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      now they will see this:
      
      unreferenced object 0xffff88805413c800 (size 64):
        comm "syz-executor.4", pid 8960, jiffies 4294994003 (age 14.350s)
        hex dump (first 32 bytes):
          00 7a 8a 57 80 88 ff ff e0 00 00 01 00 00 00 00  .z.W............
          00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
        backtrace:
          [<00000000c5d3be64>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
          [<00000000c5d3be64>] slab_post_alloc_hook mm/slab.h:439 [inline]
          [<00000000c5d3be64>] slab_alloc mm/slab.c:3326 [inline]
          [<00000000c5d3be64>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
          [<0000000023865be2>] kmalloc include/linux/slab.h:547 [inline]
          [<0000000023865be2>] kzalloc include/linux/slab.h:742 [inline]
          [<0000000023865be2>] ip_mc_add1_src net/ipv4/igmp.c:1961 [inline]
          [<0000000023865be2>] ip_mc_add_src+0x36b/0x400 net/ipv4/igmp.c:2085
          [<000000003029a9d4>] ip_mc_msfilter+0x22d/0x310 net/ipv4/igmp.c:2475
          [<00000000ccd0a87c>] do_ip_setsockopt.isra.0+0x19fe/0x1c00 net/ipv4/ip_sockglue.c:957
          [<00000000a85a3785>] ip_setsockopt+0x3b/0xb0 net/ipv4/ip_sockglue.c:1246
          [<00000000ec13c18d>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
          [<0000000052d748e3>] sock_common_setsockopt+0x3e/0x50 net/core/sock.c:3130
          [<00000000512f1014>] __sys_setsockopt+0x9e/0x120 net/socket.c:2078
          [<00000000181758bc>] __do_sys_setsockopt net/socket.c:2089 [inline]
          [<00000000181758bc>] __se_sys_setsockopt net/socket.c:2086 [inline]
          [<00000000181758bc>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
          [<00000000d4b73623>] do_syscall_64+0x7c/0x1a0 arch/x86/entry/common.c:301
          [<00000000c1098bec>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Link: http://lkml.kernel.org/r/20190517171507.96046-1-dvyukov@gmail.com
      Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • mm/swap: fix release_pages() when releasing devmap pages · 30edc7c1
      Ira Weiny authored
      [ Upstream commit c5d6c45e90c49150670346967971e14576afd7f1 ]
      
      release_pages() is an optimized version of a loop around put_page().
      Unfortunately for devmap pages the logic is not entirely correct in
      release_pages().  This is because device pages can be of more types than
      just MEMORY_DEVICE_PUBLIC.  There are in fact four types: private,
      public, FS DAX, and PCI P2PDMA.  Some of these have specific needs to
      "put" the page while others do not.
      
      This logic to handle any special needs is contained in
      put_devmap_managed_page().  Therefore all devmap pages should be processed
      by this function where we can contain the correct logic for a page put.
      
      Handle all device-type pages within release_pages() by calling
      put_devmap_managed_page() on all devmap pages.  If
      put_devmap_managed_page() returns true, the page has been put and we
      continue with the next page.  A false return from
      put_devmap_managed_page() means the page did not require special
      processing and should fall back to "normal" processing.
      
      This was found via code inspection while determining if release_pages()
      and the new put_user_pages() could be interchangeable.[1]
      
      [1] https://lkml.kernel.org/r/20190523172852.GA27175@iweiny-DESK2.sc.intel.com
      
      Link: https://lkml.kernel.org/r/20190605214922.17684-1-ira.weiny@intel.com
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Ira Weiny <ira.weiny@intel.com>
      Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  12. 28 July 2019, 2 commits
    • mm: vmscan: scan anonymous pages on file refaults · c1d98b76
      Kuo-Hsin Yang authored
      commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa upstream.
      
      When file refaults are detected and there are many inactive file pages,
      the system never reclaims anonymous pages: the file pages are dropped
      aggressively while there are still a lot of cold anonymous pages, and
      the system thrashes.  This issue impacts the performance of applications
      with large executables, e.g. chrome.
      
      With this patch, when file refault is detected, inactive_list_is_low()
      always returns true for file pages in get_scan_count() to enable
      scanning anonymous pages.
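
      Conceptually (a heavily simplified sketch of the refault check in
      inactive_list_is_low(); the real patch threads a flag through
      get_scan_count() so this path is taken there as well):

      	/*
      	 * When refaults are being observed, a new workingset is being
      	 * established: stop protecting the active file list, so that
      	 * anonymous pages become eligible for scanning too.
      	 */
      	refaults = lruvec_page_state(lruvec, WORKINGSET_ACTIVATE);
      	if (file && lruvec->refaults != refaults)
      		inactive_ratio = 0;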
      
      The problem can be reproduced by the following test program.
      
      ---8<---
      /* Headers needed to build the test standalone. */
      #include <fcntl.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <sys/stat.h>
      #include <sys/time.h>
      #include <unistd.h>

      void fallocate_file(const char *filename, off_t size)
      {
      	struct stat st;
      	int fd;
      
      	if (!stat(filename, &st) && st.st_size >= size)
      		return;
      
      	fd = open(filename, O_WRONLY | O_CREAT, 0600);
      	if (fd < 0) {
      		perror("create file");
      		exit(1);
      	}
      	if (posix_fallocate(fd, 0, size)) {
      		perror("fallocate");
      		exit(1);
      	}
      	close(fd);
      }
      
      long *alloc_anon(long size)
      {
      	long *start = malloc(size);
      	memset(start, 1, size);
      	return start;
      }
      
      long access_file(const char *filename, long size, long rounds)
      {
      	int fd, i;
      	volatile char *start1, *end1, *start2;
      	const int page_size = getpagesize();
      	long sum = 0;
      
      	fd = open(filename, O_RDONLY);
      	if (fd == -1) {
      		perror("open");
      		exit(1);
      	}
      
      	/*
      	 * Some applications, e.g. chrome, use a lot of executable file
      	 * pages, map some of the pages with PROT_EXEC flag to simulate
      	 * the behavior.
      	 */
      	start1 = mmap(NULL, size / 2, PROT_READ | PROT_EXEC, MAP_SHARED,
      		      fd, 0);
      	if (start1 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      	end1 = start1 + size / 2;
      
      	start2 = mmap(NULL, size / 2, PROT_READ, MAP_SHARED, fd, size / 2);
      	if (start2 == MAP_FAILED) {
      		perror("mmap");
      		exit(1);
      	}
      
      	for (i = 0; i < rounds; ++i) {
      		struct timeval before, after;
      		volatile char *ptr1 = start1, *ptr2 = start2;
      		gettimeofday(&before, NULL);
      		for (; ptr1 < end1; ptr1 += page_size, ptr2 += page_size)
      			sum += *ptr1 + *ptr2;
      		gettimeofday(&after, NULL);
      		printf("File access time, round %d: %f (sec)\n", i,
      		       (after.tv_sec - before.tv_sec) +
      		       (after.tv_usec - before.tv_usec) / 1000000.0);
      	}
      	return sum;
      }
      
      int main(int argc, char *argv[])
      {
      	const long MB = 1024 * 1024;
      	long anon_mb, file_mb, file_rounds;
      	const char filename[] = "large";
      	long *ret1;
      	long ret2;
      
      	if (argc != 4) {
      		printf("usage: thrash ANON_MB FILE_MB FILE_ROUNDS\n");
      		exit(0);
      	}
      	anon_mb = atoi(argv[1]);
      	file_mb = atoi(argv[2]);
      	file_rounds = atoi(argv[3]);
      
      	fallocate_file(filename, file_mb * MB);
      	printf("Allocate %ld MB anonymous pages\n", anon_mb);
      	ret1 = alloc_anon(anon_mb * MB);
      	printf("Access %ld MB file pages\n", file_mb);
      	ret2 = access_file(filename, file_mb * MB, file_rounds);
      	printf("Print result to prevent optimization: %ld\n",
      	       *ret1 + ret2);
      	return 0;
      }
      ---8<---
      
      Running the test program on a 2 GB RAM VM with kernel 5.2.0-rc5, the
      program fills RAM with 2048 MB of anonymous memory and accesses a 200 MB
      file 10 times.  Without this patch, the file cache is dropped
      aggressively and every access to the file is served from disk.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.489316 (sec)
        File access time, round 1: 2.581277 (sec)
        File access time, round 2: 2.487624 (sec)
        File access time, round 3: 2.449100 (sec)
        File access time, round 4: 2.420423 (sec)
        File access time, round 5: 2.343411 (sec)
        File access time, round 6: 2.454833 (sec)
        File access time, round 7: 2.483398 (sec)
        File access time, round 8: 2.572701 (sec)
        File access time, round 9: 2.493014 (sec)
      
      With this patch, these file pages can be cached.
      
        $ ./thrash 2048 200 10
        Allocate 2048 MB anonymous pages
        Access 200 MB file pages
        File access time, round 0: 2.475189 (sec)
        File access time, round 1: 2.440777 (sec)
        File access time, round 2: 2.411671 (sec)
        File access time, round 3: 1.955267 (sec)
        File access time, round 4: 0.029924 (sec)
        File access time, round 5: 0.000808 (sec)
        File access time, round 6: 0.000771 (sec)
        File access time, round 7: 0.000746 (sec)
        File access time, round 8: 0.000738 (sec)
        File access time, round 9: 0.000747 (sec)
      
      Checking the swap-out stats during the test [1]: 19006 pages were
      swapped out with this patch and 3418 pages without it.  There is more
      swap-out, but I think it is within a reasonable range when the
      file-backed data set doesn't fit into memory.
      
      $ ./thrash 2000 100 2100 5 1 # ANON_MB FILE_EXEC FILE_NOEXEC ROUNDS PROCESSES
      Allocate 2000 MB anonymous pages
      active_anon: 1613644, inactive_anon: 348656, active_file: 892, inactive_file: 1384 (kB)
      pswpout: 7972443, pgpgin: 478615246
      Access 100 MB executable file pages
      Access 2100 MB regular file pages
      File access time, round 0: 12.165, (sec)
      active_anon: 1433788, inactive_anon: 478116, active_file: 17896, inactive_file: 24328 (kB)
      File access time, round 1: 11.493, (sec)
      active_anon: 1430576, inactive_anon: 477144, active_file: 25440, inactive_file: 26172 (kB)
      File access time, round 2: 11.455, (sec)
      active_anon: 1427436, inactive_anon: 476060, active_file: 21112, inactive_file: 28808 (kB)
      File access time, round 3: 11.454, (sec)
      active_anon: 1420444, inactive_anon: 473632, active_file: 23216, inactive_file: 35036 (kB)
      File access time, round 4: 11.479, (sec)
      active_anon: 1413964, inactive_anon: 471460, active_file: 31728, inactive_file: 32224 (kB)
      pswpout: 7991449 (+ 19006), pgpgin: 489924366 (+ 11309120)
      
      With 4 processes accessing non-overlapping parts of a large file, 30316
      pages were swapped out with this patch and 5152 pages without it.  The
      swap-out number is small compared to pgpgin.
      
      [1]: https://github.com/vovo/testing/blob/master/mem_thrash.c
      
      Link: http://lkml.kernel.org/r/20190701081038.GA83398@google.com
      Fixes: e9868505 ("mm,vmscan: only evict file pages when we have plenty")
      Fixes: 7c5bd705 ("mm: memcg: only evict file pages when we have plenty")
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [backported to 4.14.y, 4.19.y, 5.1.y: adjust context]
      Signed-off-by: Kuo-Hsin Yang <vovoy@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      c1d98b76
    • mm: add filemap_fdatawait_range_keep_errors() · 4becd6c1
      Committed by Ross Zwisler
      commit aa0bfcd939c30617385ffa28682c062d78050eba upstream.
      
      In the spirit of filemap_fdatawait_range() and
      filemap_fdatawait_keep_errors(), introduce
      filemap_fdatawait_range_keep_errors() which both takes a range upon
      which to wait and does not clear errors from the address space.
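
      A hedged usage sketch (the range-wait call is the new helper; pairing
      it with file_check_and_advance_wb_err() is one plausible caller
      pattern, not something this patch adds):

      	/* Wait for writeback on [start, end] without consuming the
      	 * address_space error state... */
      	filemap_fdatawait_range_keep_errors(inode->i_mapping, start, end);

      	/* ...so a later check, e.g. at fsync time, still sees it. */
      	err = file_check_and_advance_wb_err(file);
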
      Signed-off-by: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      4becd6c1
  13. 10 July 2019, 2 commits
  14. 03 July 2019, 4 commits
  15. 22 June 2019, 1 commit
    • coredump: fix race condition between collapse_huge_page() and core dumping · 465ce9a5
      Committed by Andrea Arcangeli
      commit 59ea6d06cfa9247b586a695c21f94afa7183af74 upstream.
      
      When fixing the race conditions between the coredump and the mmap_sem
      holders outside the context of the process, we focused on
      mmget_not_zero()/get_task_mm() callers in 04f5866e41fb70 ("coredump: fix
      race condition between mmget_not_zero()/get_task_mm() and core
      dumping"), but those aren't the only cases where the mmap_sem can be
      taken outside of the context of the process as Michal Hocko noticed
      while backporting that commit to older -stable kernels.
      
      If mmgrab() is called in the context of the process, but then the
      mm_count reference is transferred outside the context of the process,
      that can also be a problem if the mmap_sem has to be taken for writing
      through that mm_count reference.
      
      khugepaged registration calls mmgrab() in the context of the process,
      but the mmap_sem for writing is taken later in the context of the
      khugepaged kernel thread.
      
      collapse_huge_page(), after taking the mmap_sem for writing, doesn't
      modify any vma, so it's not obvious that it could cause a problem for
      the coredump, but it happens to modify the pmd in a way that breaks an
      invariant that pmd_trans_huge_lock() relies upon.  collapse_huge_page()
      needs the mmap_sem for writing just to block concurrent page faults that
      call pmd_trans_huge_lock().
      
      Specifically, the invariant that "!pmd_trans_huge()" cannot become
      "pmd_trans_huge()" does not hold while collapse_huge_page() runs.
      
      The coredump will call __get_user_pages() without the mmap_sem for
      reading, which can eventually invoke a lockless page fault that needs a
      functional pmd_trans_huge_lock().
      
      So collapse_huge_page() needs to use mmget_still_valid() to check it's
      not running concurrently with the coredump...  as long as the coredump
      can invoke page faults without holding the mmap_sem for reading.
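
      A sketch of the resulting guard in collapse_huge_page()
      (mmget_still_valid() and hugepage_vma_revalidate() are real helpers;
      the result-code handling is abridged):

      	down_write(&mm->mmap_sem);
      	/* A coredump may be faulting on this mm without the mmap_sem:
      	 * bail out instead of breaking pmd_trans_huge_lock()'s invariant. */
      	if (!mmget_still_valid(mm))
      		goto out;
      	result = hugepage_vma_revalidate(mm, address, &vma);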
      
      This has "Fixes: khugepaged" to facilitate backporting, but in my view
      it's more a bug in the coredump code that will eventually have to be
      rewritten to stop invoking page faults without the mmap_sem for reading.
      So the long term plan is still to drop all mmget_still_valid().
      
      Link: http://lkml.kernel.org/r/20190607161558.32104-1-aarcange@redhat.com
      Fixes: ba76149f ("thp: khugepaged")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      465ce9a5
  16. 19 June 2019, 2 commits
    • mm/vmscan.c: fix trying to reclaim unevictable LRU page · 54a20289
      Committed by Minchan Kim
      commit a58f2cef26e1ca44182c8b22f4f4395e702a5795 upstream.
      
      There was the following bug report from Wu Fangsuo.
      
      On the CMA allocation path, isolate_migratepages_range() can isolate
      unevictable LRU pages, and reclaim_clean_pages_from_list() can then try
      to reclaim them if they are clean file-backed pages.
      
        page:ffffffbf02f33b40 count:86 mapcount:84 mapping:ffffffc08fa7a810 index:0x24
        flags: 0x19040c(referenced|uptodate|arch_1|mappedtodisk|unevictable|mlocked)
        raw: 000000000019040c ffffffc08fa7a810 0000000000000024 0000005600000053
        raw: ffffffc009b05b20 ffffffc009b05b20 0000000000000000 ffffffc09bf3ee80
        page dumped because: VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page))
        page->mem_cgroup:ffffffc09bf3ee80
        ------------[ cut here ]------------
        kernel BUG at /home/build/farmland/adroid9.0/kernel/linux/mm/vmscan.c:1350!
        Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
        Modules linked in:
        CPU: 0 PID: 7125 Comm: syz-executor Tainted: G S              4.14.81 #3
        Hardware name: ASR AQUILAC EVB (DT)
        task: ffffffc00a54cd00 task.stack: ffffffc009b00000
        PC is at shrink_page_list+0x1998/0x3240
        LR is at shrink_page_list+0x1998/0x3240
        pc : [<ffffff90083a2158>] lr : [<ffffff90083a2158>] pstate: 60400045
        sp : ffffffc009b05940
        ..
           shrink_page_list+0x1998/0x3240
           reclaim_clean_pages_from_list+0x3c0/0x4f0
           alloc_contig_range+0x3bc/0x650
           cma_alloc+0x214/0x668
           ion_cma_allocate+0x98/0x1d8
           ion_alloc+0x200/0x7e0
           ion_ioctl+0x18c/0x378
           do_vfs_ioctl+0x17c/0x1780
           SyS_ioctl+0xac/0xc0
      
      Wu found that it is due to commit ad6b6704 ("mm: remove SWAP_MLOCK in
      ttu").  Before that commit, unevictable pages went to cull_mlocked, so
      the VM_BUG_ON_PAGE line could not be reached.
      
      To fix the issue, this patch filters unevictable LRU pages out of
      reclaim_clean_pages_from_list() on the CMA path.
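
      The filter, approximately (the predicate mirrors the era's
      reclaim_clean_pages_from_list(); only the PageUnevictable() test is
      new):

      	list_for_each_entry_safe(page, next, page_list, lru) {
      		if (page_is_file_cache(page) && !PageDirty(page) &&
      		    !__PageMovable(page) && !PageUnevictable(page)) {
      			ClearPageActive(page);
      			list_move(&page->lru, &clean_pages);
      		}
      	}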
      
      Link: http://lkml.kernel.org/r/20190524071114.74202-1-minchan@kernel.org
      Fixes: ad6b6704 ("mm: remove SWAP_MLOCK in ttu")
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Debugged-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Tested-by: Wu Fangsuo <fangsuowu@asrmicro.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Pankaj Suryawanshi <pankaj.suryawanshi@einfochips.com>
      Cc: <stable@vger.kernel.org>	[4.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      54a20289
    • mm/list_lru.c: fix memory leak in __memcg_init_list_lru_node · 553a1f0d
      Committed by Shakeel Butt
      commit 3510955b327176fd4cbab5baa75b449f077722a2 upstream.
      
      Syzbot reported the following memory leak:
      
      BUG: memory leak
      unreferenced object 0xffff888114f26040 (size 32):
        comm "syz-executor626", pid 7056, jiffies 4294948701 (age 39.410s)
        hex dump (first 32 bytes):
          40 60 f2 14 81 88 ff ff 40 60 f2 14 81 88 ff ff  @`......@`......
          00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
        backtrace:
           slab_post_alloc_hook mm/slab.h:439 [inline]
           slab_alloc mm/slab.c:3326 [inline]
           kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
           kmalloc include/linux/slab.h:547 [inline]
           __memcg_init_list_lru_node+0x58/0xf0 mm/list_lru.c:352
           memcg_init_list_lru_node mm/list_lru.c:375 [inline]
           memcg_init_list_lru mm/list_lru.c:459 [inline]
           __list_lru_init+0x193/0x2a0 mm/list_lru.c:626
           alloc_super+0x2e0/0x310 fs/super.c:269
           sget_userns+0x94/0x2a0 fs/super.c:609
           sget+0x8d/0xb0 fs/super.c:660
           mount_nodev+0x31/0xb0 fs/super.c:1387
           fuse_mount+0x2d/0x40 fs/fuse/inode.c:1236
           legacy_get_tree+0x27/0x80 fs/fs_context.c:661
           vfs_get_tree+0x2e/0x120 fs/super.c:1476
           do_new_mount fs/namespace.c:2790 [inline]
           do_mount+0x932/0xc50 fs/namespace.c:3110
           ksys_mount+0xab/0x120 fs/namespace.c:3319
           __do_sys_mount fs/namespace.c:3333 [inline]
           __se_sys_mount fs/namespace.c:3330 [inline]
           __x64_sys_mount+0x26/0x30 fs/namespace.c:3330
           do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      This is a simple off-by-one bug on the error path.
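
      Schematically (a generic illustration, not the kernel source;
      destroy_range() is a hypothetical helper that frees entries in
      [begin, end)): when the allocation at index i fails, entries
      [begin, i) are live, so the cleanup must free up to i, not i - 1:

      	for (i = begin; i < end; i++) {
      		table[i] = kmalloc(sizeof(*table[i]), GFP_KERNEL);
      		if (!table[i])
      			goto fail;
      	}
      	return 0;
      fail:
      	/* The bug passed i - 1 here, leaking the entry at i - 1. */
      	destroy_range(table, begin, i);
      	return -ENOMEM;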
      
      Link: http://lkml.kernel.org/r/20190528043202.99980-1-shakeelb@google.com
      Fixes: 60d3fd32 ("list_lru: introduce per-memcg lists")
      Reported-by: syzbot+f90a420dfe2b1b03cb2c@syzkaller.appspotmail.com
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: <stable@vger.kernel.org>	[4.0+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      553a1f0d
  17. 15 June 2019, 2 commits
    • percpu: do not search past bitmap when allocating an area · 526972e9
      Committed by Dennis Zhou
      [ Upstream commit 8c43004af01635cc9fbb11031d070e5e0d327ef2 ]
      
      pcpu_find_block_fit() guarantees that a fit is found within
      PCPU_BITMAP_BLOCK_BITS.  Iteration is used to determine the first fit,
      comparing against the block's contig_hint.  This can lead to incorrectly
      scanning past the end of the bitmap.  The behavior was okay given the
      subsequent check for bit_off >= end and the correctness of the hints
      from pcpu_find_block_fit().
      
      This patch fixes this by bounding the end offset by the number of bits
      in a chunk.
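
      A hedged sketch of the bound (pcpu_chunk_map_bits() is the real
      helper; the surrounding first-fit scan is elided):

      	/* Never let the scan's end offset run past the chunk's bitmap. */
      	end = min_t(int, start + alloc_bits + PCPU_BITMAP_BLOCK_BITS,
      		    pcpu_chunk_map_bits(chunk));
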
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Reviewed-by: Peng Fan <peng.fan@nxp.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      526972e9
    • percpu: remove spurious lock dependency between percpu and sched · 5329dcaf
      Committed by John Sperbeck
      [ Upstream commit 198790d9a3aeaef5792d33a560020861126edc22 ]
      
      In free_percpu() we sometimes call pcpu_schedule_balance_work() to
      queue a work item (which does a wakeup) while holding pcpu_lock.
      This creates an unnecessary lock dependency between pcpu_lock and
      the scheduler's pi_lock.  There are other places where we call
      pcpu_schedule_balance_work() without holding pcpu_lock, and this case
      doesn't need to be different.
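
      A sketch of the reordering (the fully-free check is illustrative;
      the point is that the wakeup happens only after pcpu_lock is
      dropped):

      	bool need_balance = false;

      	spin_lock_irqsave(&pcpu_lock, flags);
      	pcpu_free_area(chunk, off);
      	if (chunk_is_fully_free(chunk))		/* illustrative check */
      		need_balance = true;
      	spin_unlock_irqrestore(&pcpu_lock, flags);

      	if (need_balance)
      		pcpu_schedule_balance_work();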
      
      Moving the call outside the lock prevents the following lockdep splat
      when running tools/testing/selftests/bpf/{test_maps,test_progs} in
      sequence with lockdep enabled:
      
      ======================================================
      WARNING: possible circular locking dependency detected
      5.1.0-dbg-DEV #1 Not tainted
      ------------------------------------------------------
      kworker/23:255/18872 is trying to acquire lock:
      000000000bc79290 (&(&pool->lock)->rlock){-.-.}, at: __queue_work+0xb2/0x520
      
      but task is already holding lock:
      00000000e3e7a6aa (pcpu_lock){..-.}, at: free_percpu+0x36/0x260
      
      which lock already depends on the new lock.
      
      the existing dependency chain (in reverse order) is:
      
      -> #4 (pcpu_lock){..-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             pcpu_alloc+0xfa/0x780
             __alloc_percpu_gfp+0x12/0x20
             alloc_htab_elem+0x184/0x2b0
             __htab_percpu_map_update_elem+0x252/0x290
             bpf_percpu_hash_update+0x7c/0x130
             __do_sys_bpf+0x1912/0x1be0
             __x64_sys_bpf+0x1a/0x20
             do_syscall_64+0x59/0x400
             entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      -> #3 (&htab->buckets[i].lock){....}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             htab_map_update_elem+0x1af/0x3a0
      
      -> #2 (&rq->lock){-.-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock+0x2f/0x40
             task_fork_fair+0x37/0x160
             sched_fork+0x211/0x310
             copy_process.part.43+0x7b1/0x2160
             _do_fork+0xda/0x6b0
             kernel_thread+0x29/0x30
             rest_init+0x22/0x260
             arch_call_rest_init+0xe/0x10
             start_kernel+0x4fd/0x520
             x86_64_start_reservations+0x24/0x26
             x86_64_start_kernel+0x6f/0x72
             secondary_startup_64+0xa4/0xb0
      
      -> #1 (&p->pi_lock){-.-.}:
             lock_acquire+0x9e/0x180
             _raw_spin_lock_irqsave+0x3a/0x50
             try_to_wake_up+0x41/0x600
             wake_up_process+0x15/0x20
             create_worker+0x16b/0x1e0
             workqueue_init+0x279/0x2ee
             kernel_init_freeable+0xf7/0x288
             kernel_init+0xf/0x180
             ret_from_fork+0x24/0x30
      
      -> #0 (&(&pool->lock)->rlock){-.-.}:
             __lock_acquire+0x101f/0x12a0
             lock_acquire+0x9e/0x180
             _raw_spin_lock+0x2f/0x40
             __queue_work+0xb2/0x520
             queue_work_on+0x38/0x80
             free_percpu+0x221/0x260
             pcpu_freelist_destroy+0x11/0x20
             stack_map_free+0x2a/0x40
             bpf_map_free_deferred+0x3c/0x50
             process_one_work+0x1f7/0x580
             worker_thread+0x54/0x410
             kthread+0x10f/0x150
             ret_from_fork+0x24/0x30
      
      other info that might help us debug this:
      
      Chain exists of:
        &(&pool->lock)->rlock --> &htab->buckets[i].lock --> pcpu_lock
      
       Possible unsafe locking scenario:
      
             CPU0                    CPU1
             ----                    ----
        lock(pcpu_lock);
                                     lock(&htab->buckets[i].lock);
                                     lock(pcpu_lock);
        lock(&(&pool->lock)->rlock);
      
       *** DEADLOCK ***
      
      3 locks held by kworker/23:255/18872:
       #0: 00000000b36a6e16 ((wq_completion)events){+.+.},
           at: process_one_work+0x17a/0x580
       #1: 00000000dfd966f0 ((work_completion)(&map->work)){+.+.},
           at: process_one_work+0x17a/0x580
       #2: 00000000e3e7a6aa (pcpu_lock){..-.},
           at: free_percpu+0x36/0x260
      
      stack backtrace:
      CPU: 23 PID: 18872 Comm: kworker/23:255 Not tainted 5.1.0-dbg-DEV #1
      Hardware name: ...
      Workqueue: events bpf_map_free_deferred
      Call Trace:
       dump_stack+0x67/0x95
       print_circular_bug.isra.38+0x1c6/0x220
       check_prev_add.constprop.50+0x9f6/0xd20
       __lock_acquire+0x101f/0x12a0
       lock_acquire+0x9e/0x180
       _raw_spin_lock+0x2f/0x40
       __queue_work+0xb2/0x520
       queue_work_on+0x38/0x80
       free_percpu+0x221/0x260
       pcpu_freelist_destroy+0x11/0x20
       stack_map_free+0x2a/0x40
       bpf_map_free_deferred+0x3c/0x50
       process_one_work+0x1f7/0x580
       worker_thread+0x54/0x410
       kthread+0x10f/0x150
       ret_from_fork+0x24/0x30
      Signed-off-by: John Sperbeck <jsperbeck@google.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      5329dcaf