1. 30 June 2021, 1 commit
    • mm, oom: reorganize the oom report in dump_header · 0201217c
      Authored by yuzhoujian
      mainline inclusion
      from mainline-5.0-rc1
      commit ef8444ea
      category: bugfix
      bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
      CVE: NA
      
      -------------------------------------------------
      The OOM report contains several sections.  The first is the allocation
      context that triggered the OOM.  Then come the cpuset context and the
      stack trace of the OOM path.  The third is the OOM memory information,
      followed by the current memory state of all system tasks.  Finally, we
      show the oom-eligible tasks and the information about the chosen oom
      victim.
      
      One thing that makes parsing more awkward than necessary is that we do
      not have a single, easily parsable line about the oom context.  This
      patch reorganizes the oom report into:
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (constraints and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context on a single line, which
      makes parsing much easier.
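The single-line record above can be modeled in a few lines of user-space C. This is an illustrative sketch of the line format only (the function name and signature are hypothetical, not the kernel's dump_oom_summary()):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical user-space model of the one-line oom context record.
 * Field names follow the example output in the commit message; this is
 * an illustration of the format, not kernel code. */
static int format_oom_context(char *buf, size_t len,
                              const char *constraint, const char *nodemask,
                              const char *cpuset, const char *mems_allowed,
                              const char *task, int pid, int uid)
{
    return snprintf(buf, len,
                    "oom-kill:constraint=%s,nodemask=%s,cpuset=%s,"
                    "mems_allowed=%s,task=%s,pid=%d,uid=%d",
                    constraint, nodemask, cpuset, mems_allowed,
                    task, pid, uid);
}
```

Because every field sits on one line with `key=value` pairs, a log parser needs only a single regular expression to extract the full context.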
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit ef8444ea)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      (cherry picked from commit 985eab72d54b5ac73189d609486526b5e30125ac)
      Signed-off-by: Lu Jialin <lujialin4@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  2. 17 May 2021, 1 commit
  3. 22 February 2021, 5 commits
  4. 04 November 2020, 2 commits
  5. 15 October 2020, 2 commits
    • mm, page_alloc: fix core hung in free_pcppages_bulk() · bd946649
      Authored by Charan Teja Reddy
      stable inclusion
      from linux-4.19.142
      commit c666936d8d8b0ace4f3260d71a4eedefd53011d9
      
      --------------------------------
      
      commit 88e8ac11 upstream.
      
      The following race is observed with repeated online and offline of
      memory blocks of the movable zone, with a delay between two successive
      onlines.
      
      P1						P2
      
      Online the first memory block in
      the movable zone. The pcp struct
      values are initialized to default
      values,i.e., pcp->high = 0 &
      pcp->batch = 1.
      
      					Allocate the pages from the
      					movable zone.
      
      Try to Online the second memory
      block in the movable zone thus it
      entered the online_pages() but yet
      to call zone_pcp_update().
      					This process is entered into
      					the exit path thus it tries
      					to release the order-0 pages
      					to pcp lists through
      					free_unref_page_commit().
					As pcp->high = 0 and
					pcp->count = 1, it proceeds
					to call the function
					free_pcppages_bulk().
      Update the pcp values thus the
      new pcp values are like, say,
      pcp->high = 378, pcp->batch = 63.
      					Read the pcp's batch value using
      					READ_ONCE() and pass the same to
      					free_pcppages_bulk(), pcp values
      					passed here are, batch = 63,
      					count = 1.
      
					Since the number of pages in
					the pcp lists is less than
					->batch, it will get stuck in
					the while (list_empty(list))
					loop with interrupts disabled,
					thus hanging the core.
      
      Avoid this by ensuring that free_pcppages_bulk() is called with the
      proper count of pcp list pages.
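The essence of the fix can be shown with a minimal user-space model: the caller must never ask free_pcppages_bulk() to drain more pages than the pcp lists actually hold. The function below stands in for that caller-side clamp (a sketch of the idea, not the kernel code):

```c
#include <assert.h>

/* Model of the fix: free_pcppages_bulk() drains pages in a loop until
 * the requested count is reached; if the request exceeds what the pcp
 * lists hold, the loop spins forever with interrupts disabled.  With
 * pcp->high = 0 on a freshly onlined zone, count can be smaller than
 * batch, so the request must be clamped to the real count. */
static int pages_to_free(int pcp_count, int pcp_batch)
{
    return pcp_batch < pcp_count ? pcp_batch : pcp_count;
}
```

In the race above, `pages_to_free(1, 63)` yields 1 instead of 63, so the drain loop terminates after the single page on the list is freed.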
      
      The mentioned race is somewhat easily reproducible without [1] because
      the pcp's are not updated for the first memory block online, and thus
      there is a large enough race window for P2 between alloc+free and the
      pcp struct values update through the onlining of the second memory
      block.
      
      With [1], the race still exists but it is very narrow as we update the pcp
      struct values for the first memory block online itself.
      
      This is not limited to the movable zone, it could also happen in cases
      with the normal zone (e.g., hotplug to a node that only has DMA memory, or
      no other memory yet).
      
      [1]: https://patchwork.kernel.org/patch/11696389/
      
      Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type")
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: <stable@vger.kernel.org> [2.6+]
      Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: include CMA pages in lowmem_reserve at boot · c2a701de
      Authored by Doug Berger
      stable inclusion
      from linux-4.19.142
      commit 84b8dc232afadf3aab425a104def45a1e7346a58
      
      --------------------------------
      
      commit e08d3fdf upstream.
      
      The lowmem_reserve arrays provide a means of applying pressure against
      allocations from lower zones that were targeted at higher zones.  Its
      values are a function of the number of pages managed by higher zones and
      are assigned by a call to the setup_per_zone_lowmem_reserve() function.
      
      The function is initially called at boot time by the function
      init_per_zone_wmark_min() and may be called later by accesses of the
      /proc/sys/vm/lowmem_reserve_ratio sysctl file.
      
      The function init_per_zone_wmark_min() was moved up from a module_init to
      a core_initcall to resolve a sequencing issue with khugepaged.
      Unfortunately this created a sequencing issue with CMA page accounting.
      
      The CMA pages are added to the managed page count of a zone when
      cma_init_reserved_areas() is called at boot also as a core_initcall.  This
      makes it uncertain whether the CMA pages will be added to the managed page
      counts of their zones before or after the call to
      init_per_zone_wmark_min() as it becomes dependent on link order.  With the
      current link order the pages are added to the managed count after the
      lowmem_reserve arrays are initialized at boot.
      
      This means the lowmem_reserve values at boot may be lower than the values
      used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
      ratio values are unchanged.
      
      In many cases the difference is not significant, but for example
      an ARM platform with 1GB of memory and the following memory layout
      
        cma: Reserved 256 MiB at 0x0000000030000000
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x000000002fffffff]
          Normal   empty
          HighMem  [mem 0x0000000030000000-0x000000003fffffff]
      
      would result in 0 lowmem_reserve for the DMA zone.  This would allow
      userspace to deplete the DMA zone easily.
      
      Funnily enough
      
        $ cat /proc/sys/vm/lowmem_reserve_ratio
      
      would fix up the situation because, as a side effect, it forces a call
      to setup_per_zone_lowmem_reserve().
      
      This commit breaks the link order dependency by invoking
      init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
      have the chance to be properly accounted in their zone(s) and allowing
      the lowmem_reserve arrays to receive consistent values.
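The ordering guarantee relies on initcall levels running strictly in sequence. A tiny sketch of that rule (the levels here are illustrative integers, not the kernel's initcall macros) shows why postcore_initcall removes the link-order dependency:

```c
#include <assert.h>

/* Sketch of initcall ordering: all core_initcalls run before any
 * postcore_initcall, regardless of link order within a level.  Moving
 * init_per_zone_wmark_min() to postcore therefore guarantees that
 * cma_init_reserved_areas() (a core_initcall) has already added CMA
 * pages to the zones' managed counts.  Levels are illustrative. */
enum { CORE_INITCALL = 1, POSTCORE_INITCALL = 2 };

static int runs_after(int level_a, int level_b)
{
    return level_a > level_b; /* higher-numbered levels run later */
}
```

With both functions at the same level, their relative order depended on link order; placing them at different levels makes the order deterministic.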
      
      Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  6. 22 September 2020, 6 commits
    • mm: initialize deferred pages with interrupts enabled · 46909518
      Authored by Pavel Tatashin
      stable inclusion
      from linux-4.19.129
      commit 88afa532c14135528b905015f1d9a5e740a95136
      
      --------------------------------
      
      commit 3d060856 upstream.
      
      Initializing struct pages is a long task and keeping interrupts disabled
      for the duration of this operation introduces a number of problems.
      
      1. jiffies are not updated for a long period of time, and thus an
         incorrect time is reported. See the proposed solution and discussion
         here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      2. It prevents further improvement of deferred page initialization by
         allowing intra-node multi-threading.
      
      We are keeping interrupts disabled to solve a rather theoretical problem
      that was never observed in the real world (see 3a2d7fa8).

      Let's keep interrupts enabled.  In case we ever encounter a scenario
      where an interrupt thread wants to allocate a large amount of memory
      this early in boot, we can deal with that by growing the zone (see
      deferred_grow_zone()) by the needed amount before starting the
      deferred_init_memmap() threads.
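The escape hatch described above can be modeled in a few lines: if an early allocation needs more pages than the zone has initialized so far, the initialized region is grown first, capped at the zone's total size. This is a user-space sketch of the idea (the kernel's deferred_grow_zone() plays this role; names and units here are illustrative):

```c
#include <assert.h>

/* Sketch: grow the zone's initialized region by the needed amount
 * before the deferred-init threads take over, never past the zone's
 * total span.  A stand-in for deferred_grow_zone(), not kernel code. */
static long grow_zone(long initialized_pages, long total_pages, long needed)
{
    long grown = initialized_pages + needed;
    return grown > total_pages ? total_pages : grown;
}
```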
      
      Before:
      [    1.232459] node 0 initialised, 12058412 pages in 1ms
      
      After:
      [    1.632580] node 0 initialised, 12051227 pages in 436ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() · 5cb82e40
      Authored by David Hildenbrand
      stable inclusion
      from linux-4.19.123
      commit dfe810bd92be7a50f491abd381f5a742d9844675
      
      --------------------------------
      
      commit e84fe99b upstream.
      
      Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
      e.g., while booting up.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
        Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
        RIP: __pageblock_pfn_to_page+0x134/0x1c0
        Call Trace:
         set_zone_contiguous+0x56/0x70
         page_alloc_init_late+0x166/0x176
         kernel_init_freeable+0xfa/0x255
         kernel_init+0xa/0x106
         ret_from_fork+0x35/0x40
      
      The issue becomes visible when having a lot of memory (e.g., 4TB)
      assigned to a single NUMA node - a system that can easily be created
      using QEMU.  Inside VMs on a hypervisor with quite some memory
      overcommit, this is fairly easy to trigger.
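The upstream fix breaks the long zone walk into pageblock-sized chunks with a reschedule point between them, so even a 4 TB zone cannot starve the watchdog. The sketch below models that chunked walk in user space (the chunk counter stands in for the cond_resched() opportunities; sizes are illustrative, not kernel values):

```c
#include <assert.h>

/* Model of the fix: iterate the zone pageblock by pageblock and yield
 * the CPU between chunks (the kernel calls cond_resched() there).
 * Returns the number of reschedule opportunities for a given range. */
static long scan_zone_chunks(long start_pfn, long end_pfn,
                             long pageblock_pages)
{
    long pfn, chunks = 0;
    for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_pages)
        chunks++;   /* one cond_resched() point per pageblock chunk */
    return chunks;
}
```

Without the chunking, the entire range is walked in one uninterrupted pass, which is what tripped the soft-lockup watchdog during boot.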
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: Use fixed constant in page_frag_alloc instead of size + 1 · 54d789ed
      Authored by Alexander Duyck
      stable inclusion
      from linux-4.19.116
      commit 695986163d66a9f55daf13aba5976d4b03a23cc9
      
      --------------------------------
      
      commit 86447726 upstream.
      
      This patch replaces the size + 1 value, introduced with the recent fix
      for 1-byte allocs, with a constant value.
      
      The idea here is to reduce code overhead as the previous logic would have
      to read size into a register, then increment it, and write it back to
      whatever field was being used. By using a constant we can avoid those
      memory reads and arithmetic operations in favor of just encoding the
      maximum value into the operation itself.
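The idea can be illustrated with a toy version of the refcount bias: instead of computing size + 1 at runtime (load, add, store), the bias is a compile-time constant that is always at least as large as any fragment count. The constant name below mirrors the kernel's PAGE_FRAG_CACHE_MAX_SIZE usage, but this is a sketch of the idea, not the kernel code:

```c
#include <assert.h>

/* Illustrative stand-in for PAGE_FRAG_CACHE_MAX_SIZE: the bias only
 * needs to exceed the maximum possible number of fragments per page,
 * so a fixed immediate operand suffices. */
#define FRAG_CACHE_MAX_BIAS 32768

static int initial_pagecnt_bias(void)
{
    /* no dependence on the runtime allocation size */
    return FRAG_CACHE_MAX_BIAS + 1;
}
```

Because the value is an immediate, the compiler emits it directly in the add instruction rather than reading the size field from memory first.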
      
      Fixes: 2c2ade81 ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · d0a3efd5
      Authored by David Hildenbrand
      stable inclusion
      from linux-4.19.103
      commit 0a69047d8235c60d88c6ca488d8dccc7c60d4d3c
      
      --------------------------------
      
      [ Upstream commit e822969c ]
      
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for an x86-64 guest, I discovered
      that some memmaps (the highest section if max_mem does not fall on the
      section boundary) are marked as being valid and online, but contain
      garbage.  We have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can e.g., be triggered on x86-64 under QEMU by specifying a
      memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW we can easily have this scenario
      (indeed, it is one of the main reasons sub-section hotadd of devmem was
      added).
      
      The issue is that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
      4160M" - (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
      after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zero-ing out that memmap (so e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
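The zeroed tail is simply the span between max_pfn and the next section boundary. A minimal sketch of that rounding, with an illustrative section size of 128 MiB of 4 KiB pages (the kernel derives this from SECTION_SIZE_BITS):

```c
#include <assert.h>

/* Sketch: when max_pfn is not section-aligned, struct pages exist for
 * the whole last section, so the tail [max_pfn, aligned_end) must at
 * least be zeroed.  SECTION_PAGES is illustrative, not the kernel
 * constant. */
#define SECTION_PAGES 32768UL   /* 128 MiB of 4 KiB pages */

static unsigned long section_align_up(unsigned long pfn)
{
    return (pfn + SECTION_PAGES - 1) / SECTION_PAGES * SECTION_PAGES;
}
```

For the 4097 MB example above (1048832 pages), the tail up to the next 128 MiB boundary is the region whose memmap previously held garbage.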
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: return zero_resv_unavail optimization · 809dc9f5
      Authored by Pavel Tatashin
      stable inclusion
      from linux-4.19.103
      commit f19a50c1e3ba9f58ca5a591a82ac4852da8bc4ee
      
      --------------------------------
      
      [ Upstream commit ec393a0f ]
      
      When checking for valid pfns in zero_resv_unavail(), it is not
      necessary to verify that pfns within pageblock_nr_pages ranges are
      valid; only the first one needs to be checked.  This is because memory
      for struct pages is allocated in contiguous chunks of
      pageblock_nr_pages struct pages.
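The cost reduction is easy to see in a small model: with the new scheme the range is checked once per pageblock rather than once per pfn. A user-space sketch of the counting (illustrative, not the kernel loop):

```c
#include <assert.h>

/* Sketch of the optimization: struct pages come in contiguous chunks
 * of pageblock_nr_pages, so pfn_valid() only needs to run once per
 * pageblock.  Returns how many checks a range costs under the new
 * stride-based walk. */
static unsigned long validity_checks(unsigned long start_pfn,
                                     unsigned long end_pfn,
                                     unsigned long pageblock_pages)
{
    unsigned long pfn, checks = 0;
    for (pfn = start_pfn; pfn < end_pfn; pfn += pageblock_pages)
        checks++;   /* one pfn_valid() per pageblock, not per page */
    return checks;
}
```

For a 1024-page range with 512-page pageblocks, this is 2 checks instead of 1024.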
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
      Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: zero remaining unavailable struct pages · f8d9d8ce
      Authored by Naoya Horiguchi
      stable inclusion
      from linux-4.19.103
      commit 9ac5917a1d28220981512c4f4c391c90a997e0c6
      
      --------------------------------
      
      [ Upstream commit 907ec5fc ]
      
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap range to
      reserved memblock.  So the node is marked as a Normal zone even if the
      SRAT has Hot Pluggable affinity.

      The first and second patches fix the original issue that the commit
      tried to address; the third reverts the commit.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags on
      the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by node/zone.  Consider
      that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
      default (no memmap= given) memblock layout is like below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This shrinks node 0's pfn range because it is calculated from the
      address range of memblock.memory.  So some of the struct pages in the
      gap range are left uninitialized.
      
      We have a function zero_resv_unavail() which zeroes the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable ranges, which fixes the
      reported issue.
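A toy model of the extended behavior: every pfn not covered by any memblock.memory range gets its struct page zeroed, not just the reserved-but-unavailable ones. The ranges and the byte-per-page array below are illustrative stand-ins for memblock and the memmap:

```c
#include <assert.h>
#include <string.h>

/* Toy model of the extended zero_resv_unavail(): zero the struct page
 * of every pfn that falls outside all memory ranges.  Ranges are
 * half-open [start, end); one byte stands in for one struct page. */
struct range { unsigned long start, end; };

static unsigned long zero_unavailable(unsigned char *page_flags,
                                      unsigned long nr_pages,
                                      const struct range *mem,
                                      int nr_ranges)
{
    unsigned long pfn, zeroed = 0;
    for (pfn = 0; pfn < nr_pages; pfn++) {
        int covered = 0;
        for (int i = 0; i < nr_ranges; i++)
            if (pfn >= mem[i].start && pfn < mem[i].end)
                covered = 1;
        if (!covered) {
            page_flags[pfn] = 0; /* zero the struct page */
            zeroed++;
        }
    }
    return zeroed;
}
```

In the memmap=1G!4G example above, the pfns of the removed [0x100000000-0x13fffffff] range are exactly such uncovered pfns, so they now get zeroed instead of staying as garbage.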
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  7. 06 August 2020, 1 commit
  8. 27 December 2019, 22 commits
    • mm/page_alloc.c: use a single function to free page · 19e1c828
      Authored by Aaron Lu
      [ Upstream commit 742aa7fb ]
      
      There are multiple places of freeing a page, they all do the same things
      so a common function can be used to reduce code duplicate.
      
      It also avoids bug fixed in one function but left in another.
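The consolidated helper reduces to a single dispatch point: order-0 pages go through the per-cpu lists, higher orders through the buddy allocator. A minimal sketch of that dispatch (the return values are illustrative tags; the kernel helper of this shape is named free_the_page(), but this is a model, not the kernel code):

```c
#include <assert.h>

/* Sketch of the consolidation: one helper chooses between the per-cpu
 * path for order-0 pages and the buddy path for higher orders, so any
 * future fix lands in exactly one place. */
enum free_path { FREE_PCP, FREE_BUDDY };

static enum free_path choose_free_path(unsigned int order)
{
    return order == 0 ? FREE_PCP : FREE_BUDDY;
}
```

All the former call sites now route through this one decision, which is what prevents a fix from being applied in one copy and missed in another.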
      
       Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com
       Signed-off-by: Aaron Lu <aaron.lu@intel.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Pankaj gupta <pagupta@redhat.com>
      Cc: Pawel Staszewski <pstaszewski@itcare.pl>
      Cc: Tariq Toukan <tariqt@mellanox.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Sasha Levin <sashal@kernel.org>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      19e1c828
    • A
      mm/page_alloc.c: free order-0 pages through PCP in page_frag_free() · 3a96a55b
       Authored by Aaron Lu
      [ Upstream commit 65895b67 ]
      
       page_frag_free() calls __free_pages_ok() to free the page back to
       Buddy.  This is OK for high order pages, but for order-0 pages it
       misses the optimization opportunity of using Per-Cpu-Pages and can
       cause zone lock contention when called frequently.
      
      Pawel Staszewski recently shared his result of 'how Linux kernel handles
      normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
      lock contention comes from page allocator:
      
        mlx5e_poll_tx_cq
        |
         --16.34%--napi_consume_skb
                   |
                   |--12.65%--__free_pages_ok
                   |          |
                   |           --11.86%--free_one_page
                   |                     |
                   |                     |--10.10%--queued_spin_lock_slowpath
                   |                     |
                   |                      --0.65%--_raw_spin_lock
                   |
                   |--1.55%--page_frag_free
                   |
                    --1.44%--skb_release_data
      
      Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
      not effective in this workload and pages have to go through the page
      allocator.  The lock contention happens during mlx5 DMA TX completion
      cycle.  And the page allocator cannot keep up at these speeds.[2]
      
       I thought that __free_pages_ok() was mostly freeing high order pages
       and that this was lock contention on high order pages, but Jesper
       explained in detail that __free_pages_ok() here is actually freeing
       order-0 pages, because mlx5 is using order-0 pages to satisfy its page
       pool allocation requests.[3]
      
      The free path as pointed out by Jesper is:
      skb_free_head()
        -> skb_free_frag()
          -> page_frag_free()
      And the pages being freed on this path are order-0 pages.
      
      Fix this by doing similar things as in __page_frag_cache_drain() - send
      the being freed page to PCP if it's an order-0 page, or directly to Buddy
      if it is a high order page.
      
      With this change, Paweł hasn't noticed lock contention yet in his
      workload and Jesper has noticed a 7% performance improvement using a micro
      benchmark and lock contention is gone.  Ilias' test on a 'low' speed 1Gbit
       interface on a Cortex-A53 shows ~11% performance boost testing with
      64byte packets and __free_pages_ok() disappeared from perf top.
      
      [1]: https://www.spinics.net/lists/netdev/msg531362.html
      [2]: https://www.spinics.net/lists/netdev/msg531421.html
      [3]: https://www.spinics.net/lists/netdev/msg531556.html
      
      [akpm@linux-foundation.org: add comment]
       Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.com
       Signed-off-by: Aaron Lu <aaron.lu@intel.com>
       Reported-by: Pawel Staszewski <pstaszewski@itcare.pl>
       Analysed-by: Jesper Dangaard Brouer <brouer@redhat.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: Mel Gorman <mgorman@techsingularity.net>
       Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
       Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
       Tested-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
       Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
       Acked-by: Tariq Toukan <tariqt@mellanox.com>
       Acked-by: Pankaj gupta <pagupta@redhat.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Sasha Levin <sashal@kernel.org>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3a96a55b
    • M
      mm, meminit: recalculate pcpu batch and high limits after init completes · 07d3be02
       Authored by Mel Gorman
      commit 3e8fc0075e24338b1117cdff6a79477427b8dbed upstream.
      
      Deferred memory initialisation updates zone->managed_pages during the
      initialisation phase but before that finishes, the per-cpu page
      allocator (pcpu) calculates the number of pages allocated/freed in
      batches as well as the maximum number of pages allowed on a per-cpu
      list.  As zone->managed_pages is not up to date yet, the pcpu
      initialisation calculates inappropriately low batch and high values.
      
      This increases zone lock contention quite severely in some cases with
      the degree of severity depending on how many CPUs share a local zone and
      the size of the zone.  A private report indicated that kernel build
      times were excessive with extremely high system CPU usage.  A perf
      profile indicated that a large chunk of time was lost on zone->lock
      contention.
      
      This patch recalculates the pcpu batch and high values after deferred
      initialisation completes for every populated zone in the system.  It was
      tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
      workload -- allmodconfig and all available CPUs.
      
      mmtests configuration: config-workload-kernbench-max Configuration was
      modified to build on a fresh XFS partition.
      
      kernbench
                                      5.4.0-rc3              5.4.0-rc3
                                        vanilla           resetpcpu-v2
      Amean     user-256    13249.50 (   0.00%)    16401.31 * -23.79%*
      Amean     syst-256    14760.30 (   0.00%)     4448.39 *  69.86%*
      Amean     elsp-256      162.42 (   0.00%)      119.13 *  26.65%*
      Stddev    user-256       42.97 (   0.00%)       19.15 (  55.43%)
      Stddev    syst-256      336.87 (   0.00%)        6.71 (  98.01%)
      Stddev    elsp-256        2.46 (   0.00%)        0.39 (  84.03%)
      
                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
      Duration User       39766.24     49221.79
      Duration System     44298.10     13361.67
      Duration Elapsed      519.11       388.87
      
      The patch reduces system CPU usage by 69.86% and total build time by
      26.65%.  The variance of system CPU usage is also much reduced.
      
       Before the patch, this was the breakdown of batch and high values over
       all zones:
      
          256               batch: 1
          256               batch: 63
          512               batch: 7
          256               high:  0
          256               high:  378
          512               high:  42
      
      512 pcpu pagesets had a batch limit of 7 and a high limit of 42.  After
      the patch:
      
          256               batch: 1
          768               batch: 63
          256               high:  0
          768               high:  378
      
      [mgorman@techsingularity.net: fix merge/linkage snafu]
         Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
       Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
       Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: David Hildenbrand <david@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Qian Cai <cai@lca.pw>
      Cc: <stable@vger.kernel.org>	[4.1+]
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      07d3be02
    • V
      mm, compaction: raise compaction priority after it withdrawns · 1ac57dca
       Authored by Vlastimil Babka
      mainline inclusion
      from mainline-5.4-rc1
      commit 494330855641269c8a49f1580f0d4e2ead693245
      category: bugfix
      bugzilla: 23291
      CVE: NA
      
      -------------------------------------------------
      
      Mike Kravetz reports that "hugetlb allocations could stall for minutes or
       hours when should_compact_retry() would return true more often than it
      should.  Specifically, this was in the case where compact_result was
      COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
      made."
      
      The problem is that the compaction_withdrawn() test in
      should_compact_retry() includes compaction outcomes that are only possible
      on low compaction priority, and results in a retry without increasing the
       priority.  This may result in further reclaim, and more incomplete
      compaction attempts.
      
      With this patch, compaction priority is raised when possible, or
      should_compact_retry() returns false.
      
      The COMPACT_SKIPPED result doesn't really fit together with the other
      outcomes in compaction_withdrawn(), as that's a result caused by
      insufficient order-0 pages, not due to low compaction priority.  With this
      patch, it is moved to a new compaction_needs_reclaim() function, and for
      that outcome we keep the current logic of retrying if it looks like
      reclaim will be able to help.
      
       Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
       Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
       Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@kernel.org>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Chenwandun <chenwandun@huawei.com>
       Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1ac57dca
    • S
      memcg: localize memcg_kmem_enabled() check · 0f878ed1
       Authored by Shakeel Butt
      mainline inclusion
      from mainline-5.1-rc1
      commit 60cd4bcd62384cfa1e5890cebacccf08b3161156
      category: bugfix
      bugzilla: 21077
      CVE: NA
      
      ------------------------------------------------
      
      Move the memcg_kmem_enabled() checks into memcg kmem charge/uncharge
      functions, so, the users don't have to explicitly check that condition.
      
       This is purely a code cleanup patch without any functional change.
       Only the order of checks in memcg_charge_slab() can potentially
       change, but functionally it will be the same.  This should not matter,
       as memcg_charge_slab() is not in the hot path.
      
       Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
       Signed-off-by: Shakeel Butt <shakeelb@google.com>
       Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: zhong jiang <zhongjiang@huawei.com>
       Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0f878ed1
    • D
      mm: parallelize deferred struct page initialization within each node · eb761d65
       Authored by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Deferred struct page initialization currently runs one thread per node,
      but this is a bottleneck during boot on big machines, so use ktask
      within each pgdatinit thread to parallelize the struct page
      initialization, allowing the system to take better advantage of its
      memory bandwidth.
      
      Because the system is not fully up yet and most CPUs are idle, use more
      than the default maximum number of ktask threads.  The kernel doesn't
      know the memory bandwidth of a given system to get the most efficient
      number of threads, so there's some guesswork involved.  In testing, a
      reasonable value turned out to be about a quarter of the CPUs on the
      node.
      
      __free_pages_core used to increase the zone's managed page count by the
      number of pages being freed.  To accommodate multiple threads, however,
      account the number of freed pages with an atomic shared across the ktask
      threads and bump the managed page count with it after ktask is finished.
      
      Test:    Boot the machine with deferred struct page init three times
      
      Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
               2 sockets
      
      kernel                   speedup   max time per   stdev
                                         node (ms)
      
      baseline (4.15-rc2)                        5860     8.6
      ktask                      9.56x            613    12.4
       Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
       Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
       Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
       Tested-by: Hongbo Yao <yaohongbo@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eb761d65
    • L
      mem-hotplug: fix node spanned pages when we have a node with only ZONE_MOVABLE · 461d03d3
       Authored by Linxu Fang
      [ Upstream commit 299c83dc ]
      
      342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option") and
      later patches rewrote the calculation of node spanned pages.
      
       e506b996 ("mem-hotplug: fix node spanned pages when we have a movable
       node") fixed part of this, but the current code still has problems:
       
       when we have a node with only ZONE_MOVABLE and the node id is not
       zero, the node's spanned pages are counted twice.
       
       That's because we have an empty normal zone, and zone_start_pfn or
       zone_end_pfn is not between arch_zone_lowest_possible_pfn and
       arch_zone_highest_possible_pfn, so we need to use clamp to constrain
       the range, just like commit 96e907d1 ("bootmem: Reimplement
       __absent_pages_in_range() using for_each_mem_pfn_range()").
      
      e.g.
      Zone ranges:
        DMA      [mem 0x0000000000001000-0x0000000000ffffff]
        DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
        Normal   [mem 0x0000000100000000-0x000000023fffffff]
      Movable zone start for each node
        Node 0: 0x0000000100000000
        Node 1: 0x0000000140000000
      Early memory node ranges
        node   0: [mem 0x0000000000001000-0x000000000009efff]
        node   0: [mem 0x0000000000100000-0x00000000bffdffff]
        node   0: [mem 0x0000000100000000-0x000000013fffffff]
        node   1: [mem 0x0000000140000000-0x000000023fffffff]
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0x100000    present:0x100000	absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 2097152
      node_spanned_pages:2097152
      Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
      cma-reserved)
      
      It shows that the current memory of node 1 is double added.
      After this patch, the problem is fixed.
      
      node 0 DMA	spanned:0xfff   present:0xf9e   absent:0x61
      node 0 DMA32	spanned:0xff000 present:0xbefe0	absent:0x40020
      node 0 Normal	spanned:0	present:0	absent:0
      node 0 Movable	spanned:0x40000 present:0x40000 absent:0
      On node 0 totalpages(node_present_pages): 1048446
      node_spanned_pages:1310719
      node 1 DMA	spanned:0	    present:0		absent:0
      node 1 DMA32	spanned:0	    present:0		absent:0
      node 1 Normal	spanned:0	    present:0		absent:0
      node 1 Movable	spanned:0x100000    present:0x100000	absent:0
      On node 1 totalpages(node_present_pages): 1048576
      node_spanned_pages:1048576
      memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
      4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
      cma-reserved)
      
       Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
       Signed-off-by: Linxu Fang <fanglinxu@huawei.com>
      Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Oscar Salvador <osalvador@suse.de>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Sasha Levin <sashal@kernel.org>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      461d03d3
    • J
      mm: page_alloc: remain memblock_next_valid_pfn() on arm/arm64 · 4f256a79
       Authored by Jia He
      hulk inclusion
      category: performance
      bugzilla: 11028
      CVE: NA
      
      -------------------------------------------------
      
       Commit b92df1de ("mm: page_alloc: skip over regions of invalid pfns
       where possible") optimized the loop in memmap_init_zone(), but it
       introduced a possible panic, so Daniel Vacek later reverted it.
       
       But as suggested by Daniel Vacek, it is fine to use memblock to skip
       gaps and find the next valid frame with CONFIG_HAVE_ARCH_PFN_VALID.
      Daniel said:
      "On arm and arm64, memblock is used by default. But generic version of
      pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
       pfn_valid() is based on mem sections and memblock_next_valid_pfn() does
       not always return the next valid one but skips more, resulting in some
       valid frames being skipped (as if they were invalid).  And that's why
       the kernel was eventually crashing on some !arm machines."
      
      About the performance consideration:
      As said by James in b92df1de,
      "I have tested this patch on a virtual model of a Samurai CPU
      with a sparse memory map.  The kernel boot time drops from 109 to
      62 seconds."
      
       Thus it would be better to retain memblock_next_valid_pfn() on arm/arm64.
       Suggested-by: Daniel Vacek <neelx@redhat.com>
       Signed-off-by: Jia He <jia.he@hxt-semitech.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
       Reviewed-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4f256a79
    • V
      mm, page_alloc: disallow __GFP_COMP in alloc_pages_exact() · 97e9bc78
       Authored by Vlastimil Babka
      mainline inclusion
      from mainline-v5.2-rc1
      commit 63931eb9
      category: bugfix
      bugzilla: 16020
      CVE: NA
      
      -------------------------------------------------
      
      alloc_pages_exact*() allocates a page of sufficient order and then splits
      it to return only the number of pages requested.  That makes it
      incompatible with __GFP_COMP, because compound pages cannot be split.
      
       As shown by [1], things may silently work until the requested size
       (possibly depending on the user) stops being a power of two.  Then for
       CONFIG_DEBUG_VM, a BUG_ON() triggers in split_page().  Without
       CONFIG_DEBUG_VM, the consequences are unclear.
      
      There are several options here, none of them great:
      
      1) Don't do the splitting when __GFP_COMP is passed, and return the
         whole compound page.  However if caller then returns it via
         free_pages_exact(), that will be unexpected and the freeing actions
         there will be wrong.
      
      2) Warn and remove __GFP_COMP from the flags.  But the caller may have
         really wanted it, so things may break later somewhere.
      
      3) Warn and return NULL.  However NULL may be unexpected, especially
         for small sizes.
      
      This patch picks option 2, because as Michal Hocko put it: "callers wanted
      it" is much less probable than "caller is simply confused and more gfp
      flags is surely better than fewer".
      
      [1] https://lore.kernel.org/lkml/20181126002805.GI18977@shao2-debian/T/#u
      
       Link: http://lkml.kernel.org/r/0c6393eb-b28d-4607-c386-862a71f09de6@suse.cz
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: Michal Hocko <mhocko@suse.com>
       Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
       Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: zhong jiang <zhongjiang@huawei.com>
       Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      97e9bc78
    • L
      mm: Be allowed to alloc CDM node memory for MPOL_BIND · 1f3b5458
       Authored by Lijun Fang
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -----------------
      
       CDM nodes should not be part of mems_allowed.  However, allocation
       from a CDM node must be allowed when mpol->mode is MPOL_BIND.
       Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
       Reviewed-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1f3b5458
    • A
      mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE · 148b50b7
       Authored by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
      __GFP_THISNODE specifically asks the memory to be allocated from the given
       node.  Not all the requests that end up in __alloc_pages_nodemask()
       originate from process context, where cpuset makes more sense.  The
       current condition enforces the cpuset limitation on every allocation,
       whether originated from process context or not, which prevents
       __GFP_THISNODE mandated allocations from coming from the specified
       node.  In the context of a coherent device memory node, which is
       isolated from all cpuset nodemasks in the system, this blocks the only
       way of allocating into it; this patch changes that.
       Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
       Signed-off-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
       Reviewed-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      148b50b7
    • A
      mm: Enable Buddy allocation isolation for CDM nodes · 8877e9e4
       Authored by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
       This implements allocation isolation for CDM nodes in the buddy
       allocator by discarding CDM memory zones all the time, except when the
       gfp flag has __GFP_THISNODE, or when the nodemask is non-NULL and
       contains CDM nodes (explicit allocation requests in the kernel, or
       user process MPOL_BIND policy based requests).
       Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
       Signed-off-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
       Reviewed-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8877e9e4
    • A
      mm: Change generic FALLBACK zonelist creation process · 023d1127
       Authored by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
       Kernel allocation to a CDM node has already been prevented by putting
       its entire memory in ZONE_MOVABLE.  But the CDM nodes must also be
       isolated from implicit allocations happening on the system.
      
       Any isolation seeking CDM node requires isolation from implicit
       memory allocations from user space, but at the same time there should
       also be an explicit way to do the memory allocation.
       
       Both of a platform node's zonelists are fundamental to where the
       memory comes from when there is an allocation request.  In order to
       achieve the two objectives stated above, the zonelist building process
       has to change, as both zonelists (i.e. FALLBACK and NOFALLBACK) give
       access to the node's memory zones during any kind of memory
       allocation.  The following changes are implemented in this regard.
      
      * CDM node's zones are not part of any other node's FALLBACK zonelist
      * CDM node's FALLBACK list contains it's own memory zones followed by
        all system RAM zones in regular order as before
      * CDM node's zones are part of it's own NOFALLBACK zonelist
      
      These above changes ensure the following which in turn isolates the CDM
      nodes as desired.
      
       * There won't be any implicit memory allocation ending up in the CDM node
       * Only __GFP_THISNODE marked allocations will come from the CDM node
       * CDM node memory can be allocated through the mbind(MPOL_BIND) interface
       * System RAM will be used as a fallback option, in regular order, in
         case the CDM memory is insufficient during a targeted allocation request
      
      Sample zonelist configuration:
      
      [NODE (0)]						RAM
              ZONELIST_FALLBACK (0xc00000000140da00)
                      (0) (node 0) (DMA     0xc00000000140c000)
                      (1) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc000000001411a10)
                      (0) (node 0) (DMA     0xc00000000140c000)
      [NODE (1)]						RAM
              ZONELIST_FALLBACK (0xc000000100001a00)
                      (0) (node 1) (DMA     0xc000000100000000)
                      (1) (node 0) (DMA     0xc00000000140c000)
              ZONELIST_NOFALLBACK (0xc000000100005a10)
                      (0) (node 1) (DMA     0xc000000100000000)
      [NODE (2)]						CDM
              ZONELIST_FALLBACK (0xc000000001427700)
                      (0) (node 2) (Movable 0xc000000001427080)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc00000000142b710)
                      (0) (node 2) (Movable 0xc000000001427080)
      [NODE (3)]						CDM
              ZONELIST_FALLBACK (0xc000000001431400)
                      (0) (node 3) (Movable 0xc000000001430d80)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc000000001435410)
                      (0) (node 3) (Movable 0xc000000001430d80)
      [NODE (4)]						CDM
              ZONELIST_FALLBACK (0xc00000000143b100)
                      (0) (node 4) (Movable 0xc00000000143aa80)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc00000000143f110)
                      (0) (node 4) (Movable 0xc00000000143aa80)
       Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
       Signed-off-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
       Reviewed-by: zhong jiang <zhongjiang@huawei.com>
       Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      023d1127
    • A
      mm: Define coherent device memory (CDM) node · 4886e905
       Authored by Anshuman Khandual
      euler inclusion
      category: feature
      bugzilla: 11082
      CVE: NA
      -------------------
      
       There are certain devices, like specialized accelerators, GPU cards,
       network cards, FPGA cards etc., which might contain onboard memory
       that is coherent with the existing system RAM when accessed either
       from the CPU or from the device.  They share some properties with
       normal system RAM but at the same time can also differ from it.
      
       User applications might be interested in using this kind of coherent
       device memory, explicitly or implicitly, alongside the system RAM,
       utilizing all possible core memory functions like anon mapping (LRU),
       file mapping (LRU), page cache (LRU), driver managed (non LRU), HW
       poisoning, NUMA migrations etc.  To achieve this kind of tight
       integration with the core memory subsystem, the device onboard
       coherent memory must be represented as a memory only NUMA node.  At
       the same time the arch must export some kind of function to identify
       this node as coherent device memory, not just another regular cpu-less
       memory-only NUMA node.
      
      After achieving the integration with core memory subsystem coherent device
      memory might still need some special consideration inside the kernel. There
      can be a variety of coherent memory nodes with different expectations from
      the core kernel memory. But right now only one kind of special treatment is
      considered which requires certain isolation.
      
      Now consider the case of a coherent device memory node type which requires
      isolation. This kind of coherent memory is onboard an external device
      attached to the system through a link where there is always a chance of a
       link failure taking down the entire memory node with it.  Moreover,
       the memory might also have a higher chance of ECC failure compared to
       the system RAM.  Hence allocation into this kind of coherent memory node should
      be regulated. Kernel allocations must not come here. Normal user space
      allocations too should not come here implicitly (without user application
      knowing about it). This summarizes isolation requirement of certain kind of
      coherent device memory node as an example. There can be different kinds of
      isolation requirement also.
      
       Some coherent memory devices might not require isolation at all.
       Other coherent memory devices might require some other special
       treatment after becoming part of the core memory representation.  For
       now, we will look into isolation-seeking coherent device memory nodes,
       not the other ones.
      
      To implement the integration as well as isolation, the coherent memory node
      must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
      the node_states[] array. During memory hotplug operations, the new nodemask
      N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
      memory nodes. This also creates the following new sysfs based interface to
      list all the coherent memory nodes of the system.
      
      	/sys/devices/system/node/is_cdm_node
      
      Architectures that enable CONFIG_COHERENT_DEVICE must export the function
      arch_check_node_cdm(), which identifies any coherent device memory node.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      [Backported to 4.19
      -remove set or clear node state for memory_hotplug
      -separate CONFIG_COHERENT and CPUSET]
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4886e905
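A minimal userspace sketch of the nodemask bookkeeping this commit describes. N_MEMORY, N_COHERENT_DEVICE, and arch_check_node_cdm() are names taken from the patch itself; the bitmask layout, the stub arch hook (which pretends node 1 is device memory), and online_node_memory() are illustrative assumptions, not kernel code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Model of the node_states[] array with the new N_COHERENT_DEVICE mask.
 * One 64-bit word per state is enough for this sketch (up to 64 nodes). */
enum node_states { N_MEMORY, N_COHERENT_DEVICE, NR_NODE_STATES };

static uint64_t node_states[NR_NODE_STATES];

/* Stand-in for the arch-exported hook: pretend node 1 is the onboard
 * coherent device memory.  A real arch implementation decides this from
 * firmware/device-tree information. */
static bool arch_check_node_cdm(int nid)
{
    return nid == 1;
}

static void node_set_state(int nid, enum node_states st)
{
    node_states[st] |= (uint64_t)1 << nid;
}

static bool node_state(int nid, enum node_states st)
{
    return (node_states[st] >> nid) & 1;
}

/* On memory online, N_COHERENT_DEVICE is updated along with N_MEMORY
 * for nodes the arch identifies as coherent device memory. */
static void online_node_memory(int nid)
{
    node_set_state(nid, N_MEMORY);
    if (arch_check_node_cdm(nid))
        node_set_state(nid, N_COHERENT_DEVICE);
}
```

After onlining, a CDM node is a member of both masks while a regular node is only in N_MEMORY, which is exactly what lets the sysfs file list CDM nodes separately.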
    • Q
      mm/hotplug: treat CMA pages as unmovable · 4700cf13
      Qian Cai 提交于
      mainline inclusion
      from mainline-5.1-rc6
      commit 1a9f2191
      category: bugfix
      bugzilla: 14055
      CVE: NA
      
      -------------------------------------------------
      
      has_unmovable_pages() is used when allocating CMA and gigantic pages as
      well as by memory hotplug.  The latter does not currently know how to
      offline a CMA pool properly, but if an unused (free) CMA page is
      encountered, has_unmovable_pages() happily considers it free memory and
      propagates this up the call chain.  The memory offlining code then frees
      the page without a proper CMA tear down, which leads to accounting issues.
      Moreover, if the same memory range is onlined again, the memory never
      gets back to the CMA pool.
      
      State after memory offline:
      
       # grep cma /proc/vmstat
       nr_free_cma 205824
      
       # cat /sys/kernel/debug/cma/cma-kvm_cma/count
       209920
      
      Also, kmemleak still thinks the memory addresses below are reserved even
      though they have already been handed to the buddy allocator after
      onlining.  This patch fixes the situation by treating CMA pageblocks as
      unmovable except when has_unmovable_pages() is called as part of a CMA
      allocation.
      
        Offlined Pages 4096
        kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
        Call Trace:
          dump_stack+0xb0/0xf4 (unreliable)
          create_object+0x344/0x380
          __kmalloc_node+0x3ec/0x860
          kvmalloc_node+0x58/0x110
          seq_read+0x41c/0x620
          __vfs_read+0x3c/0x70
          vfs_read+0xbc/0x1a0
          ksys_read+0x7c/0x140
          system_call+0x5c/0x70
        kmemleak: Kernel memory leak detector disabled
        kmemleak: Object 0xc000201cc8000000 (size 13757317120):
        kmemleak:   comm "swapper/0", pid 0, jiffies 4294937297
        kmemleak:   min_count = -1
        kmemleak:   count = 0
        kmemleak:   flags = 0x5
        kmemleak:   checksum = 0
        kmemleak:   backtrace:
             cma_declare_contiguous+0x2a4/0x3b0
             kvm_cma_reserve+0x11c/0x134
             setup_arch+0x300/0x3f8
             start_kernel+0x9c/0x6e8
             start_here_common+0x1c/0x4b0
        kmemleak: Automatic memory scanning thread ended
      
      [cai@lca.pw: use is_migrate_cma_page() and update commit log]
        Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4700cf13
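A toy model of the decision this commit changes: a CMA pageblock now counts as movable only when the caller itself is performing a CMA allocation; the memory offlining path must treat it as unmovable so free CMA pages are not silently leaked from the pool. The enum values, struct, and function body here are a simplified sketch, not the kernel's actual has_unmovable_pages():

```c
#include <assert.h>
#include <stdbool.h>

/* Subset of migrate types relevant to this sketch. */
enum migratetype { MIGRATE_MOVABLE, MIGRATE_CMA };

/* Simplified pageblock: we only track whether it belongs to a CMA area. */
struct pageblock {
    bool is_cma;
};

/* After the fix: a CMA pageblock is "unmovable" for every caller except
 * a CMA allocation itself (migratetype == MIGRATE_CMA).  All the other
 * checks the real function performs are elided here. */
static bool has_unmovable_pages(const struct pageblock *pb,
                                enum migratetype mt)
{
    if (pb->is_cma)
        return mt != MIGRATE_CMA;
    return false;
}
```

With this, memory offlining (which passes a non-CMA migratetype) refuses to treat free CMA pages as plain free memory, while CMA allocation keeps working on its own pageblocks.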
    • Q
      mm/hotplug: fix offline undo_isolate_page_range() · 47669159
      Qian Cai 提交于
      mainline inclusion
      from mainline-5.1-rc3
      commit 9b7ea46a82b31c74a37e6ff1c2a1df7d53e392ab
      category: bugfix
      bugzilla: 13472
      CVE: NA
      
      -------------------------------------------------
      
      Commit f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded
      memory to zones until online") introduced move_pfn_range_to_zone() which
      calls memmap_init_zone() during onlining a memory block.
      memmap_init_zone() will reset pagetype flags and makes migrate type to
      be MOVABLE.
      
      However, __offline_pages() also calls undo_isolate_page_range() after
      offline_isolated_pages() to do the same thing.  Because commit
      2ce13640 ("mm: __first_valid_page skip over offline pages") changed
      __first_valid_page() to skip offline pages, undo_isolate_page_range()
      here just wastes CPU cycles looping over the offlining PFN range while
      doing nothing, because __first_valid_page() will return NULL:
      offline_isolated_pages() has already marked all memory sections within
      the pfn range as offline via offline_mem_sections().
      
      Also, after calling the "useless" undo_isolate_page_range() here, it
      reaches the point of no return by notifying MEM_OFFLINE.  Those pages
      will be marked as MIGRATE_MOVABLE again once onlined.  The only thing
      left to do is to decrease the zone counter of isolated pageblocks,
      which, if left elevated, would keep some page allocation paths on the
      slower code that the above commit introduced.
      
      Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
      on ppc64, an "int" should still be enough to represent the number of
      pageblocks there.  Fix an incorrect comment along the way.
      
      [cai@lca.pw: v4]
        Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
      Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
      Fixes: 2ce13640 ("mm: __first_valid_page skip over offline pages")
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      47669159
    • M
      mm: only report isolation failures when offlining memory · 2a5141d5
      Michal Hocko 提交于
      mainline inclusion
      from mainline-v5.0-rc1
      commit d381c54760dcfad23743da40516e7e003d73952a
      category: bugfix
      bugzilla: 13472
      CVE: NA
      
      ------------------------------------------------
      
      Heiko has complained that his log is swamped by warnings from
      has_unmovable_pages
      
      [   20.536664] page dumped because: has_unmovable_pages
      [   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
      [   20.536794] flags: 0x3fffe0000010200(slab|head)
      [   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
      [   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
      [   20.536797] page dumped because: has_unmovable_pages
      [   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
      [   20.536815] flags: 0x7fffe0000000000()
      [   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
      [   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000
      
      which are not triggered by the memory hotplug but rather CMA allocator.
      The original idea behind dumping the page state for all call paths was
      that these messages will be helpful debugging failures.  From the above it
      seems that this is not the case for the CMA path because we are lacking
      much more context.  E.g the second reported page might be a CMA allocated
      page.  It is still interesting to see a slab page in the CMA area but it
      is hard to tell whether this is bug from the above output alone.
      
      Address this issue by dumping the page state only on request.  Both
      start_isolate_page_range and has_unmovable_pages already have an argument
      to ignore hwpoison pages so make this argument more generic and turn it
      into flags and allow callers to combine non-default modes into a mask.
      While we are at it, the has_unmovable_pages() call from
      is_pageblock_removable_nolock() (the sysfs "removable" file) is a
      questionable place to report the failure, so drop the reporting from
      there as well.
      
      Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2a5141d5
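A sketch of the flags interface this commit describes: the old boolean "skip hwpoison" argument becomes a bit mask so callers can also opt in to failure reporting. The SKIP_HWPOISON and REPORT_FAILURE names come from the commit's intent; their numeric values, the dump counter, and check_unmovable() are illustrative assumptions for this model:

```c
#include <assert.h>
#include <stdbool.h>

/* Caller-selectable modes, combinable into a mask. */
#define SKIP_HWPOISON  0x1
#define REPORT_FAILURE 0x2

/* Counts how many times a failing page would have been dumped;
 * stands in for dump_page() in this userspace model. */
static int dump_count;

/* Model of the reporting decision: the page state is dumped only when
 * the caller explicitly asked for it.  Memory offlining passes
 * REPORT_FAILURE; the CMA allocator does not, so its transient
 * failures no longer spam the log. */
static bool check_unmovable(bool unmovable, int flags)
{
    if (unmovable && (flags & REPORT_FAILURE))
        dump_count++;
    return unmovable;
}
```

The same failure is still signalled to both callers; only the noisy diagnostic output becomes opt-in.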
    • M
      mm, memory_hotplug: be more verbose for memory offline failures · 4a5f2575
      Michal Hocko 提交于
      mainline inclusion
      from mainline-5.0-rc1
      commit 2932c8b0
      category: bugfix
      bugzilla: 13472
      CVE: NA
      
      ------------------------------------------------
      
      There is only very limited information printed when the memory offlining
      fails:
      
      [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff
      
      This tells us that the failure was triggered by userspace intervention,
      but it doesn't tell us much about the underlying reason.  It might be
      that the page migration fails repeatedly and the userspace timeout
      expires and sends a signal, or it might be that one of the earlier steps
      (isolation, memory notifier) takes too long.
      
      If the migration fails, it would be really helpful to see which page
      failed and its state.  The same applies to the isolation phase.  If we
      fail to isolate a page from the allocator, then knowing the state of the
      page would be helpful as well.
      
      Dump the page state that fails to get isolated or migrated.  This will
      tell us more about the failure and what to focus on during debugging.
      
      [akpm@linux-foundation.org: add missing printk arg]
      [mhocko@suse.com: tweak dump_page() `reason' text]
        Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
      Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <OSalvador@suse.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4a5f2575
    • J
      mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs · 92834760
      Jann Horn 提交于
      [ Upstream commit 2c2ade81 ]
      
      The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
      number of references that we might need to create in the fastpath later,
      the bump-allocation fastpath only has to modify the non-atomic bias value
      that tracks the number of extra references we hold instead of the atomic
      refcount. The maximum number of allocations we can serve (under the
      assumption that no allocation is made with size 0) is nc->size, so that's
      the bias used.
      
      However, even when all memory in the allocation has been given away, a
      reference to the page is still held; and in the `offset < 0` slowpath, the
      page may be reused if everyone else has dropped their references.
      This means that the necessary number of references is actually
      `nc->size+1`.
      
      Luckily, from a quick grep, it looks like the only path that can call
      page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
      requires CAP_NET_ADMIN in the init namespace and is only intended to be
      used for kernel testing and fuzzing.
      
      To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
      `offset < 0` path, below the virt_to_page() call, and then repeatedly call
      writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
      with a vector consisting of 15 elements containing 1 byte each.
      Signed-off-by: Jann Horn <jannh@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      92834760
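The arithmetic behind this fix, as a small model: a page_frag cache of nc->size bytes can hand out at most nc->size one-byte fragments, yet the cache itself keeps one extra reference to the page, so the pre-allocated bias must cover size + 1 references, not size. The function name below is a hypothetical helper for illustration; only the size-plus-one relationship comes from the commit:

```c
#include <assert.h>

/* References the page_frag allocator must pre-allocate so that the
 * fastpath only has to adjust the non-atomic pagecnt_bias. */
static unsigned int refs_needed(unsigned int cache_size)
{
    /* Worst case: every allocation is the minimum 1 byte, so the
     * cache can serve at most cache_size fragments... */
    unsigned int max_fragments = cache_size;

    /* ...plus the one reference the cache itself holds on the page,
     * even after all memory has been given away.  Dropping this from
     * the bias is exactly the off-by-one the patch fixes. */
    unsigned int cache_own_ref = 1;

    return max_fragments + cache_own_ref;
}
```

With the old bias of exactly nc->size, the cache's own reference was unaccounted for, which is how the `offset < 0` slowpath could reuse a page whose refcount had already dropped to zero.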
    • Q
      page_poison: play nicely with KASAN · 15129a33
      Qian Cai 提交于
      mainline inclusion
      from mainline-5.0
      commit 4117992df66a
      category: bugfix
      bugzilla: 11620
      CVE: NA
      
      ------------------------------------------------
      
      KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
      It triggers false positives in the allocation path,
      
      BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
      Read of size 8 at addr ffff88881f800000 by task swapper/0
      CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
      Call Trace:
       dump_stack+0xe0/0x19a
       print_address_description.cold.2+0x9/0x28b
       kasan_report.cold.3+0x7a/0xb5
       __asan_report_load8_noabort+0x19/0x20
       memchr_inv+0x2ea/0x330
       kernel_poison_pages+0x103/0x3d5
       get_page_from_freelist+0x15e7/0x4d90
      
      because KASAN has not yet unpoisoned the shadow page for the allocation
      before memchr_inv() checks it, so the check only finds a stale poison
      pattern.
      
      Also, there are false positives in the free path,
      
      BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
      Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
      CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
      Call Trace:
       dump_stack+0xe0/0x19a
       print_address_description.cold.2+0x9/0x28b
       kasan_report.cold.3+0x7a/0xb5
       check_memory_region+0x22d/0x250
       memset+0x28/0x40
       kernel_poison_pages+0x29e/0x3d5
       __free_pages_ok+0x75f/0x13e0
      
      because KASAN adds poisoned redzones around slab objects, but the page
      poisoning needs to poison the whole page.
      
      Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      15129a33
    • Z
      pagecache: add Kconfig to enable/disable the feature · 862e2308
      zhongjiang 提交于
      euler inclusion
      category: bugfix
      CVE: NA
      Bugzilla: 9580
      
      ---------------------------
      
      Just add a Kconfig option for the feature.
      Signed-off-by: zhongjiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      862e2308
    • Z
      pagecache: add sysctl interface to limit pagecache · 6174ecb5
      zhong jiang 提交于
      euleros inclusion
      category: feature
      feature: pagecache limit
      
      Add a proc sysctl interface to set a pagecache limit for memory reclaim.
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6174ecb5