1. 01 March 2022, 3 commits
  2. 09 February 2022, 8 commits
    • mm/page_alloc: fix counting of free pages after take off from buddy · 3c002599
      Committed by Ding Hui
      mainline inclusion
      from linux-v5.13-rc5
      commit bac9c6fa
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22
      CVE: NA
      
      --------------------------------
      
      Recently we found that there is a lot of MemFree left in /proc/meminfo
      after doing a lot of page soft offlining; this is not quite correct.
      
      Before Oscar's rework of soft offline for free pages [1], if we soft
      offline free pages, these pages are left in the buddy allocator with
      the HWPoison flag set, and NR_FREE_PAGES is not updated immediately.
      So the difference between NR_FREE_PAGES and the real number of
      available free pages is big right from the beginning.
      
      However, with the workload running, whenever we subsequently catch a
      HWPoison page in any alloc function, we remove it from the buddy,
      update NR_FREE_PAGES, and try again, so NR_FREE_PAGES gets closer and
      closer to the real number of available free pages.
      (regardless of unpoison_memory())
      
      Now, for offline free pages, after a successful call to
      take_page_off_buddy(), the page no longer belongs to the buddy
      allocator and will not be used any more, but we missed accounting for
      it in NR_FREE_PAGES, and there is no chance for the counter to be
      corrected later.
      
      Do the update in take_page_off_buddy() like rmqueue() does, but avoid
      double counting if someone already called set_migratetype_isolate()
      on the page, as in the sketch below.
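
      A condensed sketch of the fix (mainline shape; helper names may
      differ slightly in this backported tree):

      	/* inside take_page_off_buddy(), once the page is unlinked: */
      	del_page_from_free_list(page_head, zone, page_order);
      	break_down_buddy_pages(zone, page_head, page, 0,
      			       page_order, migratetype);
      	/*
      	 * Account the removed page like rmqueue() does, but skip
      	 * isolated pageblocks: set_migratetype_isolate() already
      	 * subtracted their free pages from NR_FREE_PAGES.
      	 */
      	if (!is_migrate_isolate(migratetype))
      		__mod_zone_freepage_state(zone, -1, migratetype);
      	ret = true;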
      
      [1]: commit 06be6ff3 ("mm,hwpoison: rework soft offline for free pages")
      
      Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn
      Fixes: 06be6ff3 ("mm,hwpoison: rework soft offline for free pages")
      Signed-off-by: Ding Hui <dinghui@sangfor.com.cn>
      Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm,hwpoison: rework soft offline for in-use pages · 40b69d16
      Committed by Oscar Salvador
      mainline inclusion
      from linux-v5.10-rc1
      commit 79f5f8fa
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22
      CVE: NA
      
      --------------------------------
      
      Keep set_hwpoison_free_buddy_page exported to avoid a kAPI change.
      
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safe net
      and would skip that page.
      
      This has proved to be wrong, as we have some pfn walkers out there,
      like compaction, that only care about the page being in a buddy
      freelist.
      
      Although this might not be the only user, having poisoned pages in the
      buddy allocator seems a bad idea as we should only have free pages that
      are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalidated)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean and unmapped page cache pages, we invalidate
          them by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, we do not call put_page directly from those paths.
      Instead, we keep the page and send it to page_handle_poison to perform
      the right handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison pages in
      free_pages_prepare, which just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
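
      A condensed sketch of the pieces described above (mainline shape; the
      backport notes conflicts in mm/page_alloc.c, so details may differ):

      	/* In free_pages_prepare(): bail out for poisoned order-0
      	 * pages so they never reach a pcplist or freelist. */
      	if (unlikely(PageHWPoison(page)) && !order) {
      		/* Do not let hwpoison pages hit pcplists/buddy */
      		reset_page_owner(page, order);
      		return false;
      	}

      	/* In mm/memory-failure.c (condensed): flag the page and keep
      	 * the last reference ourselves. */
      	static bool page_handle_poison(struct page *page, bool release)
      	{
      		SetPageHWPoison(page);
      		if (release)
      			put_page(page);
      		page_ref_inc(page);
      		num_poisoned_pages_inc();
      		return true;
      	}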
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we want to take the page off the buddy, the
      page may have already been allocated, and we cannot soft offline it.
      But the user can always retry it.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has been successful, we call dissolve_free_huge_page,
      and we set HWPoison on the page if we succeed.
      Hugetlb has a slightly different handling though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook in free_huge_page and some other
      places).
      So I decided not to make the code overly complicated and to just fail
      normally if the page was allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we now handle in-use pages, we no longer
      need the put-as-isolation-migratetype dance that guarded against
      poisoned pages ending up in pcplists.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      
      Conflicts:
      	mm/page_alloc.c
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm,hwpoison: rework soft offline for free pages · 0d0fa7b8
      Committed by Oscar Salvador
      mainline inclusion
      from linux-v5.10-rc1
      commit 06be6ff3
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4LE22
      CVE: NA
      
      --------------------------------
      
      When trying to soft-offline a free page, we need to first take it off
      the buddy allocator.
      Once we know it is out of reach, we can safely flag it as poisoned.
      
      take_page_off_buddy will be used to take a page meant to be poisoned off
      the buddy allocator. take_page_off_buddy calls break_down_buddy_pages,
      which splits a higher-order page in case our page belongs to one.
      
      Once the page is under our control, we call page_handle_poison to set it
      as poisoned and grab a refcount on it.
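
      A condensed sketch of take_page_off_buddy() as introduced by the
      mainline commit (the backport notes conflicts in mm/page_alloc.c, so
      helper names may differ here):

      	bool take_page_off_buddy(struct page *page)
      	{
      		struct zone *zone = page_zone(page);
      		unsigned long pfn = page_to_pfn(page);
      		unsigned long flags;
      		unsigned int order;
      		bool ret = false;

      		spin_lock_irqsave(&zone->lock, flags);
      		for (order = 0; order < MAX_ORDER; order++) {
      			struct page *page_head = page - (pfn & ((1 << order) - 1));
      			int page_order = page_private(page_head);

      			if (PageBuddy(page_head) && page_order >= order) {
      				int migratetype = get_pfnblock_migratetype(page_head,
      						page_to_pfn(page_head));

      				/* Unlink the buddy, then split it down to our page. */
      				del_page_from_free_list(page_head, zone, page_order);
      				break_down_buddy_pages(zone, page_head, page, 0,
      						       page_order, migratetype);
      				ret = true;
      				break;
      			}
      			if (page_count(page_head) > 0)
      				break;
      		}
      		spin_unlock_irqrestore(&zone->lock, flags);
      		return ret;
      	}
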
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Conflicts:
      	mm/page_alloc.c
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/vmscan: fix unexpected shrinking page cache with vm_cache_reclaim_enable disable · 76569c77
      Committed by Chen Wandun
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
      CVE: NA
      
      -------------------------------------------------------
      
      In the functions cache_limit_ratio_sysctl_handler and
      cache_limit_mbytes_sysctl_handler, the page cache is shrunk even if
      vm_cache_reclaim_enable is false, which is unexpected (see the sketch
      below).
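
      A minimal sketch of the intended guard; the handler name and the
      vm_cache_reclaim_enable switch follow the commit message, while the
      reclaim helper is hypothetical and the exact hulk-tree bodies may
      differ:

      	int cache_limit_mbytes_sysctl_handler(struct ctl_table *table,
      			int write, void __user *buffer, size_t *lenp,
      			loff_t *ppos)
      	{
      		int ret = proc_doulongvec_minmax(table, write, buffer,
      						 lenp, ppos);

      		/* Only shrink the page cache when the switch is on. */
      		if (!ret && write && vm_cache_reclaim_enable)
      			shrink_page_cache(GFP_KERNEL);	/* hypothetical helper */

      		return ret;
      	}
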
      Signed-off-by: Chen Wandun <chenwandun@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: Introduce fallback mechanism for memory reliable · 3023a4b3
      Committed by Ma Wupeng
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
      CVE: NA
      
      --------------------------------
      
      Introduce a fallback mechanism for memory reliable. The following
      consumers will fall back to the non-mirrored region if their
      allocation from the mirrored region fails:
      
      - User tasks with reliable flag
      - thp collapse pages
      - init tasks
      - pagecache
      - tmpfs
      
      In order to achieve these goals, the buddy system will fall back to
      the non-mirrored region in the following situations:
      
      - if __GFP_THISNODE is set in gfp_mask and the destination nodes do
        not have any suitable zones available
      
      - high_zoneidx will be raised to ZONE_MOVABLE to allocate memory
        before going OOM
      
      This mechanism is enabled by default and can be disabled by adding
      "reliable_debug=F" to the kernel parameters. It relies on
      CONFIG_MEMORY_RELIABLE and needs "kernelcore=reliable" in the kernel
      parameters.
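
      For example (illustrative kernel command lines, per the parameters
      named above):

      	kernelcore=reliable                  # memory reliable, fallback enabled (default)
      	kernelcore=reliable reliable_debug=F # memory reliable, fallback disabled
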
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: Add reliable memory use limit for user tasks · 1845e7ad
      Committed by Peng Wu
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
      CVE: NA
      
      ----------------------------------------------
      
      There is an upper limit for special user tasks' memory allocation;
      a special user task means a user task with the reliable flag.
      
      Init tasks will allocate memory from the non-mirrored region if their
      allocations hit the limit.
      
      The limit can be set or read via /proc/sys/vm/task_reliable_limit.
      
      This limit's default value is ULONG_MAX.
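
      Usage example, assuming the value is in bytes (consistent with the
      ULONG_MAX default):

      	$ echo 4294967296 > /proc/sys/vm/task_reliable_limit
      	$ cat /proc/sys/vm/task_reliable_limit
      	4294967296
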
      Signed-off-by: Peng Wu <wupeng58@huawei.com>
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: Introduce reliable flag for user task · c7731567
      Committed by Peng Wu
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
      CVE: NA
      
      ------------------------------------------
      
      Add a reliable flag for user tasks. A user task with the reliable flag
      can only allocate memory from the mirrored region. PF_RELIABLE is
      added to represent the task's reliable flag.
      
      - The init task is regarded as a special task which allocates memory
        from the mirrored region.
      
      - For normal user tasks, the reliable flag can be set via the procfs
        interface shown below and can be inherited via fork().
      
      User can change a user task's reliable flag by
      
      	$ echo [0/1] > /proc/<pid>/reliable
      
      and check a user task's reliable flag by
      
      	$ cat /proc/<pid>/reliable
      
      Note, the global init task's reliable file can not be accessed.
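
      An illustrative sketch of how the flag could steer allocations toward
      the mirrored region, assuming the ___GFP_RELIABILITY flag from the
      framework patch (the helper and its hook points are hypothetical, not
      the exact openEuler code):

      	/* Illustrative helper, not the exact openEuler code. */
      	static inline gfp_t reliable_task_gfp(gfp_t gfp_mask)
      	{
      		if (is_global_init(current) ||
      		    (current->flags & PF_RELIABLE))
      			gfp_mask |= ___GFP_RELIABILITY;
      		return gfp_mask;
      	}
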
      Signed-off-by: Peng Wu <wupeng58@huawei.com>
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: Introduce memory reliable · 33d1f46a
      Committed by Ma Wupeng
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4SK3S
      CVE: NA
      
      --------------------------------
      
      Introduction
      ============
      
      Memory reliable feature is a memory tiering mechanism. It is based on
      the kernel mirror feature, which splits memory into two separate
      regions, the mirrored (reliable) region and the non-mirrored
      (non-reliable) region.
      
      For the kernel mirror feature:
      
      - allocate kernel memory from mirrored region by default
      - allocate user memory from non-mirrored region by default
      
      The non-mirrored region will be arranged into ZONE_MOVABLE.
      
      On top of that, the memory reliable feature has the additional
      behaviors below:
      
      - normal user tasks never allocate memory from the mirrored region
        with userspace APIs (malloc, mmap, etc.)
      - special user tasks will allocate memory from the mirrored region by
        default
      - tmpfs/pagecache allocate memory from the mirrored region by default
      - an upper limit of the mirrored region allocated for user tasks,
        tmpfs and pagecache
      
      A reliable fallback mechanism is supported which allows special user
      tasks, tmpfs and pagecache to fall back to allocating from the
      non-mirrored region; this is the default setting.
      
      In order to fulfil these goals (see the illustrative sketch below):
      
      - the ___GFP_RELIABILITY flag is added for allocating memory from the
        mirrored region.
      
      - the high_zoneidx for special user tasks/tmpfs/pagecache is set to
        ZONE_NORMAL.
      
      - normal user tasks can only allocate from ZONE_MOVABLE.
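
      An illustrative sketch of the zone selection described above (not the
      exact openEuler code; the helper name is hypothetical):

      	/* Illustrative only: how ___GFP_RELIABILITY could bias zones. */
      	static inline enum zone_type reliable_gfp_zone(gfp_t flags)
      	{
      		if (flags & ___GFP_RELIABILITY)
      			return ZONE_NORMAL;	/* mirrored region */
      		return gfp_zone(flags);		/* user default: ZONE_MOVABLE */
      	}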
      
      This patch is just the main framework; memory reliable support for
      special user tasks, pagecache and tmpfs has its own patches.
      
      To enable this function, mirrored(reliable) memory is needed and
      "kernelcore=reliable" should be added to kernel parameters.
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  3. 30 December 2021, 1 commit
  4. 20 October 2021, 6 commits
    • mm/page_alloc: place pages to tail in __free_pages_core() · 01702464
      Committed by David Hildenbrand
      mainline inclusion
      from mainline-v5.10-rc1
      commit 7fef431b
      category: feature
      bugzilla: 182882
      CVE: NA
      
      __free_pages_core() is used when exposing fresh memory to the buddy during
      system boot and when onlining memory in generic_online_page().
      
      generic_online_page() is used in two cases:
      
      1. Direct memory onlining in online_pages().
      2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
         balloon and virtio-mem), when parts of a section are kept
         fake-offline to be fake-onlined later on.
      
      In 1, we already place pages to the tail of the freelist.  Pages will be
      freed to MIGRATE_ISOLATE lists first and moved to the tail of the
      freelists via undo_isolate_page_range().
      
      In 2, we currently don't implement a proper rule.  In the case of
      virtio-mem, where we currently always online MAX_ORDER - 1 pages, the
      pages will be placed at the HEAD of the freelist - undesirable.  While
      the Hyper-V balloon calls generic_online_page() with single pages,
      usually it will call it on successive single pages in a larger block.
      
      The pages are fresh, so place them at the tail of the freelist and
      avoid the PCP.  In __free_pages_core(), remove the now superfluous
      call to set_page_refcounted() and add a comment regarding page
      initialization and the refcount, as sketched below.
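
      The resulting function, condensed from the mainline commit (the
      context was adjusted for this backport, so details may differ):

      	void __free_pages_core(struct page *page, unsigned int order)
      	{
      		unsigned int nr_pages = 1 << order;
      		struct page *p = page;
      		unsigned int loop;

      		/*
      		 * When initializing the memmap, __init_single_page() sets
      		 * the refcount of all pages to 1 ("allocated"/"not free").
      		 * We have to set the refcount of all involved pages to 0.
      		 */
      		prefetchw(p);
      		for (loop = 0; loop < (nr_pages - 1); loop++, p++) {
      			prefetchw(p + 1);
      			__ClearPageReserved(p);
      			set_page_count(p, 0);
      		}
      		__ClearPageReserved(p);
      		set_page_count(p, 0);

      		atomic_long_add(nr_pages, &page_zone(page)->managed_pages);

      		/*
      		 * Bypass the PCP and place fresh pages right at the tail
      		 * of the freelist.
      		 */
      		__free_pages_ok(page, order, FPI_TO_TAIL);
      	}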
      
      Note: In 2.  we currently don't shuffle.  If ever relevant (page shuffling
      is usually of limited use in virtualized environments), we might want to
      shuffle after a sequence of generic_online_page() calls in the relevant
      callers.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: adjust context]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: move pages to tail in move_to_free_list() · 5be3d693
      Committed by David Hildenbrand
      mainline inclusion
      from mainline-v5.10-rc1
      commit 293ffa5e
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      Whenever we move pages between freelists via move_to_free_list()/
      move_freepages_block(), we don't actually touch the pages:
      1. Page isolation doesn't actually touch the pages, it simply isolates
         pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
         When undoing isolation, we move the pages back to the target list.
      2. Page stealing (steal_suitable_fallback()) moves free pages directly
         between lists without touching them.
      3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
         free pages directly between freelists without touching them.
      
      We already place pages to the tail of the freelists when undoing isolation
      via __putback_isolated_page(), let's do it in any case (e.g., if order <=
      pageblock_order) and document the behavior. To simplify, let's move the
      pages to the tail for all move_to_free_list()/move_freepages_block() users.
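
      A condensed sketch of the core change (mainline shape; this backport
      notes conflicts in mm/page_alloc.c):

      	/*
      	 * Used for pages which are on another list. Move the pages to
      	 * the tail of the freelist: they are not actively used, so
      	 * keep likely-hotter pages at the head.
      	 */
      	static inline void move_to_free_list(struct page *page,
      			struct zone *zone, unsigned int order,
      			int migratetype)
      	{
      		struct free_area *area = &zone->free_area[order];

      		/* was list_move() before this change */
      		list_move_tail(&page->lru, &area->free_list[migratetype]);
      	}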
      
      In 2., the target list is empty, so there should be no change.  In 3., we
      might observe a change, however, highatomic is more concerned about
      allocations succeeding than cache hotness - if we ever realize this change
      degrades a workload, we can special-case this instance and add a proper
      comment.
      
      This change results in all pages getting onlined via online_pages() to be
      placed to the tail of the freelist.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from 293ffa5e]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: place pages to tail in __putback_isolated_page() · 69e15bb4
      Committed by David Hildenbrand
      mainline inclusion
      from mainline-v5.10-rc1
      commit 47b6a24a
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      __putback_isolated_page() already documents that pages will be placed to
      the tail of the freelist - this is, however, not the case for "order >=
      MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
      all existing users.
      
      This change affects two users:
      - free page reporting
      - page isolation, when undoing the isolation (including memory onlining).
      
      This behavior is desirable for pages that haven't really been touched
      lately, so exactly the two users that don't actually read/write page
      content, but rather move untouched pages.
      
      The new behavior is especially desirable for memory onlining, where we
      allow allocation of newly onlined pages via undo_isolate_page_range() in
      online_pages().  Right now, we always place them to the head of the
      freelist, resulting in undesirable behavior: Assume we add individual
      memory chunks via add_memory() and online them right away to the NORMAL
      zone.  We create a dependency chain of unmovable allocations e.g., via the
      memmap.  The memmap of the next chunk will be placed onto previous chunks
      - if the last block cannot get offlined+removed, all dependent ones cannot
      get offlined+removed.  While this can already be observed with individual
      DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
      DLPAR).
      
      Document that this should only be used for optimizations, and no code
      should rely on this behavior for correctness (if the order of the
      freelists ever changes).
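
      The resulting helper, condensed from the mainline commit:

      	void __putback_isolated_page(struct page *page, unsigned int order,
      				     int mt)
      	{
      		struct zone *zone = page_zone(page);

      		/* zone lock should be held when this function is called */
      		lockdep_assert_held(&zone->lock);

      		/* Return isolated page to tail of freelist. */
      		__free_one_page(page, page_to_pfn(page), zone, order, mt,
      				FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
      	}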
      
      We won't care about page shuffling: memory onlining already properly
      shuffles after onlining.  free page reporting doesn't care about
      physically contiguous ranges, and there are already cases where page
      isolation will simply move (physically close) free pages to (currently)
      the head of the freelists via move_freepages_block() instead of shuffling.
      If this ever becomes relevant, we should shuffle the whole zone when
      undoing isolation of larger ranges, and after free_contig_range().
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from 47b6a24a]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag · 22d4ccfa
      Committed by David Hildenbrand
      mainline inclusion
      from mainline-v5.10-rc1
      commit f04a5d5d
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.
      
      When adding separate memory blocks via add_memory*() and onlining them
      immediately, the metadata (especially the memmap) of the next block
      will be placed onto one of the just added+onlined blocks.  This
      creates a chain of unmovable allocations: if the last memory block
      cannot get offlined+removed, neither can all dependent ones.  We
      directly have unmovable allocations all over the place.
      
      This can be observed quite easily using virtio-mem, however, it can
      also be observed when using DIMMs.  The freshly onlined pages will
      usually be placed at the head of the freelists, meaning they will be
      allocated next, usually turning the just-added memory immediately
      un-removable.  The fresh pages are cold, so preferring to allocate
      others (that might be hot) also feels like the natural thing to do.
      
      It also applies to the Hyper-V balloon, the Xen balloon, and ppc64
      dlpar: when adding separate, successive memory blocks, each memory
      block will have unmovable allocations on it - for example, gigantic
      pages will fail to allocate.
      
      While the ZONE_NORMAL doesn't provide any guarantees that memory can get
      offlined+removed again (any kind of fragmentation with unmovable
      allocations is possible), there are many scenarios (hotplugging a lot of
      memory, running workload, hotunplug some memory/as much as possible) where
      we can offline+remove quite a lot with this patchset.
      
      a) To visualize the problem, a very simple example:
      
      Start a VM with 4GB and 8GB of virtio-mem memory:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE  BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
       0x0000000100000000-0x000000033fffffff   9G online       yes 32-103
      
       Memory block size:       128M
       Total online memory:      12G
       Total offline memory:      0B
      
      Then try to unplug as much as possible using virtio-mem. Observe which
      memory blocks are still around. Without this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                  SIZE  STATE REMOVABLE   BLOCK
       0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
       0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
       0x0000000148000000-0x000000014fffffff  128M online       yes      41
       0x0000000158000000-0x000000015fffffff  128M online       yes      43
       0x0000000168000000-0x000000016fffffff  128M online       yes      45
       0x0000000178000000-0x000000017fffffff  128M online       yes      47
       0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
       0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
       0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
       0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
       0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
       0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
       0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
       0x0000000200000000-0x0000000207ffffff  128M online       yes      64
       0x0000000210000000-0x0000000217ffffff  128M online       yes      66
       0x0000000220000000-0x0000000227ffffff  128M online       yes      68
       0x0000000230000000-0x0000000237ffffff  128M online       yes      70
       0x0000000240000000-0x0000000247ffffff  128M online       yes      72
       0x0000000250000000-0x0000000257ffffff  128M online       yes      74
       0x0000000260000000-0x0000000267ffffff  128M online       yes      76
       0x0000000270000000-0x0000000277ffffff  128M online       yes      78
       0x0000000280000000-0x0000000287ffffff  128M online       yes      80
       0x0000000290000000-0x0000000297ffffff  128M online       yes      82
       0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
       0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
       0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
       0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
       0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
       0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
       0x0000000300000000-0x0000000307ffffff  128M online       yes      96
       0x0000000310000000-0x0000000317ffffff  128M online       yes      98
       0x0000000320000000-0x0000000327ffffff  128M online       yes     100
       0x0000000330000000-0x000000033fffffff  256M online       yes 102-103
      
       Memory block size:       128M
       Total online memory:     8.1G
       Total offline memory:      0B
      
      With this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
       0x0000000100000000-0x000000013fffffff   1G online       yes 32-39
      
       Memory block size:       128M
       Total online memory:       4G
       Total offline memory:      0B
      
      All memory can get unplugged, all memory blocks can get removed.  Of
      course, no workload ran and the system was basically idle, but it
      highlights the issue - the fairly deterministic chain of unmovable
      allocations.  When a huge page for the 2MB memmap is needed, a
      just-onlined 4MB page will be split.  The remaining 2MB page will be used
      for the memmap of the next memory block.  So one memory block will hold
      the memmap of the two following memory blocks.  Finally the pages of the
      last-onlined memory block will get used for the next bigger allocations -
      if any allocation is unmovable, all dependent memory blocks cannot get
      unplugged and removed until that allocation is gone.
      
      Note that with bigger memory blocks (e.g., 256MB), *all* memory
      blocks are dependent and none can get unplugged again!
      
      b) Experiment with memory intensive workload
      
      I performed an experiment with an older version of this patch set
      (before we used undo_isolate_page_range() in online_pages()): hotplug
      56GB to a VM with an initial 4GB, onlining all memory to ZONE_NORMAL
      right from the kernel when adding it.  I then ran various memory
      intensive workloads that consume most system memory for a total of 45
      minutes.  Once finished, I tried to unplug as much memory as possible.
      
      With this change, I am able to remove via virtio-mem (adding individual
      128MB memory blocks) 413 out of 448 added memory blocks.  Via individual
      (256MB) DIMMs 380 out of 448 added memory blocks.  (I don't have any
      numbers without this patchset, but looking at the above example, it's at
      most half of the 448 memory blocks for virtio-mem, and most probably none
      for DIMMs).
      
      Again, there are workloads that might behave very differently due to the
      nature of ZONE_NORMAL.
      
      This change also affects (besides memory onlining):
      - Other users of undo_isolate_page_range(): Pages are always placed to the
        tail.
      -- When memory offlining fails
      -- When memory isolation fails after having isolated some pageblocks
      -- When alloc_contig_range() either succeeds or fails
      - Other users of __putback_isolated_page(): Pages are always placed to the
        tail.
      -- Free page reporting
      - Other users of __free_pages_core()
      -- AFAIKs, any memory that is getting exposed to the buddy during boot.
         IIUC we will now usually allocate memory from lower addresses within
         a zone first (especially during boot).
      - Other users of generic_online_page()
      -- Hyper-V balloon
      
      This patch (of 5):
      
      Let's prepare for additional flags and avoid long parameter lists of
      bools.  Follow-up patches will also make use of the flags in
      __free_pages_ok().
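
      Condensed from the mainline patch, the new flag type that replaces
      the bool:

      	/*
      	 * Free Page Internal flags: for internal, non-pcp variants of
      	 * free_pages().
      	 */
      	typedef int __bitwise fpi_t;

      	#define FPI_NONE		((__force fpi_t)0)

      	/*
      	 * Skip free page reporting notification for the (possibly
      	 * merged) page. (will never affect pages on pcplists)
      	 */
      	#define FPI_SKIP_REPORT_NOTIFY	((__force fpi_t)BIT(0))
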
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      [Peng Liu: cherry-pick from f04a5d5d]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm: add function __putback_isolated_page · 91bac231
      Committed by Alexander Duyck
      mainline inclusion
      from mainline-v5.7-rc1
      commit 624f58d8
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      There are cases where we would benefit from avoiding having to go through
      the allocation and free cycle to return an isolated page.
      
      Examples for this might include page poisoning in which we isolate a page
      and then put it back in the free list without ever having actually
      allocated it.
      
      This will enable us to also avoid notifiers for the future free page
      reporting which will need to avoid retriggering page reporting when
      returning pages that have been reported on.
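
      A condensed sketch of the new helper and its page-isolation call site
      (mainline shape, pre-FPI; this backport notes a conflict in
      mm/internal.h, so details may differ):

      	/* mm/page_alloc.c */
      	void __putback_isolated_page(struct page *page, unsigned int order,
      				     int mt)
      	{
      		struct zone *zone = page_zone(page);

      		/* zone lock should be held when this function is called */
      		lockdep_assert_held(&zone->lock);

      		/* Return isolated page to tail of freelist. */
      		__free_one_page(page, page_to_pfn(page), zone, order, mt);
      	}

      	/* unset_migratetype_isolate() can then skip the fake
      	 * allocation/free cycle: */
      	if (isolated_page)
      		__putback_isolated_page(page, order, migratetype);
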
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nitesh Narayan Lal <nitesh@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Wang <wei.w.wang@intel.com>
      Cc: Yang Zhang <yang.zhang.wz@gmail.com>
      Cc: wei qi <weiqi4@huawei.com>
      Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
              mm/internal.h
      [Peng Liu: cherry-pick from 624f58d8]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
    • mm/page_alloc.c: memory hotplug: free pages as higher order · c479b04b
      Committed by Arun KS
      mainline inclusion
      from mainline-v5.1-rc1
      commit a9cd410a
      category: feature
      bugzilla: 182882
      CVE: NA
      
      -----------------------------------------------
      
      When the freeing of pages is done at a higher order, the time spent
      coalescing pages in the buddy allocator can be reduced.  With a
      section size of 256MB, the hot-add latency of a single section
      improves from 50-60 ms to less than 1 ms, improving the hot-add
      latency by 60 times.  Modify external providers of the online callback
      to align with the change, as sketched below.
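
      A condensed sketch of the reworked online callback (mainline shape;
      this backport notes conflicts in mm/page_alloc.c and
      mm/memory_hotplug.c):

      	/* mm/memory_hotplug.c */
      	static int generic_online_page(struct page *page, unsigned int order)
      	{
      		/* Hand the whole high-order page to the buddy in one go. */
      		__free_pages_core(page, order);
      		totalram_pages_add(1UL << order);
      	#ifdef CONFIG_HIGHMEM
      		if (PageHighMem(page))
      			totalhigh_pages_add(1UL << order);
      	#endif
      		return 0;
      	}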
      
      [arunks@codeaurora.org: v11]
        Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
      [akpm@linux-foundation.org: remove unused local, per Arun]
      [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
      [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
      [arunks@codeaurora.org: v8]
        Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
      [arunks@codeaurora.org: v9]
        Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
      Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
      Signed-off-by: Arun KS <arunks@codeaurora.org>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: K. Y. Srinivasan <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Aaron Lu <aaron.lu@intel.com>
      Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
      	mm/page_alloc.c
      	mm/memory_hotplug.c
      [Peng Liu: adjust context]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  5. 03 September 2021, 1 commit
    • page_alloc: consider highatomic reserve in watermark fast · 283a76a5
      Committed by Jaewon Kim
      mainline inclusion
      from mainline-v5.9-rc1
      commit f27ce0e1
      category: bugfix
      bugzilla: 41086
      CVE: NA
      
      -----------------------------------------------
      
      zone_watermark_fast was introduced by commit 48ee5f36 ("mm,
      page_alloc: shortcut watermark checks for order-0 pages").  The commit
      simply checks whether the free pages are above the watermark, without
      additional calculations such as reducing the watermark.
      
      It considered free CMA pages, but it did not consider the highatomic
      reserve.  This may incur exhaustion of the free pages other than the
      high-order atomic free pages.
      
      Assume that the reserved_highatomic pageblock is bigger than the min
      watermark, and there are only a few free pages other than the
      high-order atomic free ones.  Because zone_watermark_fast passes the
      allocation without considering the high-order atomic free pages, a
      normal reclaimable allocation like GFP_HIGHUSER will consume all the
      free pages.  Then, finally, an order-0 atomic allocation may fail.
      
      This means the min watermark is not protected against non-atomic
      allocations.  An order-0 atomic allocation with ALLOC_HARDER can
      unexpectedly fail.  Additionally, a __GFP_MEMALLOC allocation with
      ALLOC_NO_WATERMARKS can also fail.
      
      To avoid the problem, zone_watermark_fast should consider the
      highatomic reserve.  If the actual size of the high-order atomic free
      pages were counted accurately like the CMA free pages, we could use
      that; in this patch, just use nr_reserved_highatomic.  Additionally,
      introduce __zone_watermark_unusable_free to factor out the parts
      common to zone_watermark_fast and __zone_watermark_ok, as sketched
      below.
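
      The new helper, condensed from the mainline commit:

      	static inline long __zone_watermark_unusable_free(struct zone *z,
      			unsigned int order, unsigned int alloc_flags)
      	{
      		const bool alloc_harder = (alloc_flags &
      					   (ALLOC_HARDER|ALLOC_OOM));
      		long unusable_free = (1 << order) - 1;

      		/*
      		 * If the caller does not have rights to ALLOC_HARDER,
      		 * subtract the high-atomic reserves. This over-estimates
      		 * the size of the atomic reserve but avoids a search.
      		 */
      		if (likely(!alloc_harder))
      			unusable_free += z->nr_reserved_highatomic;

      	#ifdef CONFIG_CMA
      		/* If allocation can't use CMA areas, don't use free CMA pages */
      		if (!(alloc_flags & ALLOC_CMA))
      			unusable_free += zone_page_state(z, NR_FREE_CMA_PAGES);
      	#endif

      		return unusable_free;
      	}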
      
      This is an example of ALLOC_HARDER allocation failure using v4.19 based
      kernel.
      
       Binder:9343_3: page allocation failure: order:0, mode:0x480020(GFP_ATOMIC), nodemask=(null)
       Call trace:
       [<ffffff8008f40f8c>] dump_stack+0xb8/0xf0
       [<ffffff8008223320>] warn_alloc+0xd8/0x12c
       [<ffffff80082245e4>] __alloc_pages_nodemask+0x120c/0x1250
       [<ffffff800827f6e8>] new_slab+0x128/0x604
       [<ffffff800827b0cc>] ___slab_alloc+0x508/0x670
       [<ffffff800827ba00>] __kmalloc+0x2f8/0x310
       [<ffffff80084ac3e0>] context_struct_to_string+0x104/0x1cc
       [<ffffff80084ad8fc>] security_sid_to_context_core+0x74/0x144
       [<ffffff80084ad880>] security_sid_to_context+0x10/0x18
       [<ffffff800849bd80>] selinux_secid_to_secctx+0x20/0x28
       [<ffffff800849109c>] security_secid_to_secctx+0x3c/0x70
       [<ffffff8008bfe118>] binder_transaction+0xe68/0x454c
       Mem-Info:
       active_anon:102061 inactive_anon:81551 isolated_anon:0
        active_file:59102 inactive_file:68924 isolated_file:64
        unevictable:611 dirty:63 writeback:0 unstable:0
        slab_reclaimable:13324 slab_unreclaimable:44354
        mapped:83015 shmem:4858 pagetables:26316 bounce:0
        free:2727 free_pcp:1035 free_cma:178
       Node 0 active_anon:408244kB inactive_anon:326204kB active_file:236408kB inactive_file:275696kB unevictable:2444kB isolated(anon):0kB isolated(file):256kB mapped:332060kB dirty:252kB writeback:0kB shmem:19432kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
       Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
       lowmem_reserve[]: 0 0
       Normal: 505*4kB (H) 357*8kB (H) 201*16kB (H) 65*32kB (H) 1*64kB (H) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 10236kB
       138826 total pagecache pages
       5460 pages in swap cache
       Swap cache stats: add 8273090, delete 8267506, find 1004381/4060142
      
      This is an example of ALLOC_NO_WATERMARKS allocation failure using v4.14
      based kernel.
      
       kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
       kswapd0 cpuset=/ mems_allowed=0
       CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
       Call trace:
       [<0000000000000000>] dump_backtrace+0x0/0x248
       [<0000000000000000>] show_stack+0x18/0x20
       [<0000000000000000>] __dump_stack+0x20/0x28
       [<0000000000000000>] dump_stack+0x68/0x90
       [<0000000000000000>] warn_alloc+0x104/0x198
       [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
       [<0000000000000000>] zs_malloc+0x148/0x3d0
       [<0000000000000000>] zram_bvec_rw+0x410/0x798
       [<0000000000000000>] zram_rw_page+0x88/0xdc
       [<0000000000000000>] bdev_write_page+0x70/0xbc
       [<0000000000000000>] __swap_writepage+0x58/0x37c
       [<0000000000000000>] swap_writepage+0x40/0x4c
       [<0000000000000000>] shrink_page_list+0xc30/0xf48
       [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
       [<0000000000000000>] shrink_node_memcg+0x23c/0x618
       [<0000000000000000>] shrink_node+0x1c8/0x304
       [<0000000000000000>] kswapd+0x680/0x7c4
       [<0000000000000000>] kthread+0x110/0x120
       [<0000000000000000>] ret_from_fork+0x10/0x18
       Mem-Info:
       active_anon:111826 inactive_anon:65557 isolated_anon:0
        active_file:44260 inactive_file:83422 isolated_file:0
        unevictable:4158 dirty:117 writeback:0 unstable:0
        slab_reclaimable:13943 slab_unreclaimable:43315
        mapped:102511 shmem:3299 pagetables:19566 bounce:0
        free:3510 free_pcp:553 free_cma:0
       Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB dirty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
       Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768kB unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB
       [ 4738.329607] lowmem_reserve[]: 0 0
       Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB
      
      This is trace log which shows GFP_HIGHUSER consumes free pages right
      before ALLOC_NO_WATERMARKS.
      
        <...>-22275 [006] ....   889.213383: mm_page_alloc: page=00000000d2be5665 pfn=970744 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213385: mm_page_alloc: page=000000004b2335c2 pfn=970745 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213387: mm_page_alloc: page=00000000017272e1 pfn=970278 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213389: mm_page_alloc: page=00000000c4be79fb pfn=970279 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213391: mm_page_alloc: page=00000000f8a51d4f pfn=970260 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213393: mm_page_alloc: page=000000006ba8f5ac pfn=970261 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213395: mm_page_alloc: page=00000000819f1cd3 pfn=970196 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
        <...>-22275 [006] ....   889.213396: mm_page_alloc: page=00000000f6b72a64 pfn=970197 order=0 migratetype=0 nr_free=3650 gfp_flags=GFP_HIGHUSER|__GFP_ZERO
      kswapd0-1207  [005] ...1   889.213398: mm_page_alloc: page= (null) pfn=0 order=0 migratetype=1 nr_free=3650 gfp_flags=GFP_NOWAIT|__GFP_HIGHMEM|__GFP_NOWARN|__GFP_MOVABLE
      
      [jaewon31.kim@samsung.com: remove redundant code for high-order]
        Link: http://lkml.kernel.org/r/20200623035242.27232-1-jaewon31.kim@samsung.com
      Reported-by: Yong-Taek Lee <ytk.lee@samsung.com>
      Suggested-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Jaewon Kim <jaewon31.kim@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Yong-Taek Lee <ytk.lee@samsung.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: http://lkml.kernel.org/r/20200619235958.11283-1-jaewon31.kim@samsung.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Conflicts:
          mm/page_alloc.c
      [Peng Liu: cherry-pick from f27ce0e1]
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Reviewed-by: tong tiangen <tongtiangen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  6. 29 July 2021, 1 commit
  7. 30 June 2021, 1 commit
    • mm, oom: reorganize the oom report in dump_header · 0201217c
      Committed by yuzhoujian
      mainline inclusion
      from mainline-5.0-rc1
      commit ef8444ea
      category: bugfix
      bugzilla: 51815, https://gitee.com/openeuler/kernel/issues/I3IJ9I
      CVE: NA
      
      -------------------------------------------------
      The OOM report contains several sections.  The first one is the
      allocation context that triggered the OOM.  Then we have the cpuset
      context, followed by the stack trace of the OOM path.  The third one
      is the OOM memory information, followed by the current memory state of
      all system tasks.  At last, we show the OOM-eligible tasks and the
      information about the chosen OOM victim.
      
      One thing that makes parsing more awkward than necessary is that we do not
      have a single and easily parsable line about the oom context.  This patch
      is reorganizing the oom report to
      
      1) who invoked oom and what was the allocation request
      
      [  515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0
      
      2) OOM stack trace
      
      [  515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
      [  515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
      [  515.906821] Call Trace:
      [  515.908062]  dump_stack+0x5a/0x73
      [  515.909311]  dump_header+0x55/0x28c
      [  515.914260]  oom_kill_process+0x2d8/0x300
      [  515.916708]  out_of_memory+0x145/0x4a0
      [  515.917932]  __alloc_pages_slowpath+0x7d2/0xa16
      [  515.919157]  __alloc_pages_nodemask+0x277/0x290
      [  515.920367]  filemap_fault+0x3d0/0x6c0
      [  515.921529]  ? filemap_map_pages+0x2b8/0x420
      [  515.922709]  ext4_filemap_fault+0x2c/0x40 [ext4]
      [  515.923884]  __do_fault+0x20/0x80
      [  515.925032]  __handle_mm_fault+0xbc0/0xe80
      [  515.926195]  handle_mm_fault+0xfa/0x210
      [  515.927357]  __do_page_fault+0x233/0x4c0
      [  515.928506]  do_page_fault+0x32/0x140
      [  515.929646]  ? page_fault+0x8/0x30
      [  515.930770]  page_fault+0x1e/0x30
      
      3) OOM memory information
      
      [  515.958093] Mem-Info:
      [  515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
       active_file:4402672 inactive_file:483963 isolated_file:1344
       unevictable:0 dirty:4886753 writeback:0 unstable:0
       slab_reclaimable:148442 slab_unreclaimable:18741
       mapped:1347 shmem:1347 pagetables:58669 bounce:0
       free:88663 free_pcp:0 free_cma:0
      ...
      
      4) current memory state of all system tasks
      
      [  516.079544] [    744]     0   744     9211     1345   114688       82             0 systemd-journal
      [  516.082034] [    787]     0   787    31764        0   143360       92             0 lvmetad
      [  516.084465] [    792]     0   792    10930        1   110592      208         -1000 systemd-udevd
      [  516.086865] [   1199]     0  1199    13866        0   131072      112         -1000 auditd
      [  516.089190] [   1222]     0  1222    31990        1   110592      157             0 smartd
      [  516.091477] [   1225]     0  1225     4864       85    81920       43             0 irqbalance
      [  516.093712] [   1226]     0  1226    52612        0   258048      426             0 abrtd
      [  516.112128] [   1280]     0  1280   109774       55   299008      400             0 NetworkManager
      [  516.113998] [   1295]     0  1295    28817       37    69632       24             0 ksmtuned
      [  516.144596] [  10718]     0 10718  2622484  1721372 15998976   267219             0 panic
      [  516.145792] [  10719]     0 10719  2622484  1164767  9818112    53576             0 panic
      [  516.146977] [  10720]     0 10720  2622484  1174361  9904128    53709             0 panic
      [  516.148163] [  10721]     0 10721  2622484  1209070 10194944    54824             0 panic
      [  516.149329] [  10722]     0 10722  2622484  1745799 14774272    91138             0 panic
      
      5) oom context (constraints and the chosen victim).
      
      oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0
      
      An admin can easily get the full oom context at a single line which
      makes parsing much easier.
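
      The one-line context is emitted by a new helper, condensed from the
      mainline commit:

      	static void dump_oom_summary(struct oom_control *oc,
      				     struct task_struct *victim)
      	{
      		/* one line summary of the oom killer context. */
      		pr_info("oom-kill:constraint=%s,nodemask=%*pbl",
      				oom_constraint_text[oc->constraint],
      				nodemask_pr_args(oc->nodemask));
      		cpuset_print_current_mems_allowed();
      		mem_cgroup_print_oom_context(oc->memcg, victim);
      		pr_cont(",task=%s,pid=%d,uid=%d", victim->comm, victim->pid,
      			from_kuid(&init_user_ns, task_uid(victim)));
      	}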
      
      Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
      Signed-off-by: yuzhoujian <yuzhoujian@didichuxing.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Yang Shi <yang.s@alibaba-inc.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      (cherry picked from commit ef8444ea)
      Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Liu Shixin <liushixin2@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      (cherry picked from commit 985eab72d54b5ac73189d609486526b5e30125ac)
      Signed-off-by: Lu Jialin <lujialin4@huawei.com>
      Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
  8. 17 May 2021, 1 commit
  9. 22 February 2021, 5 commits
  10. 04 November 2020, 2 commits
  11. 15 October 2020, 2 commits
    • mm, page_alloc: fix core hung in free_pcppages_bulk() · bd946649
      Committed by Charan Teja Reddy
      stable inclusion
      from linux-4.19.142
      commit c666936d8d8b0ace4f3260d71a4eedefd53011d9
      
      --------------------------------
      
      commit 88e8ac11 upstream.
      
      The following race is observed with repeated online, offline and a
      delay between two successive onlines of memory blocks of the movable
      zone.
      
      P1						P2
      
      Online the first memory block in
      the movable zone. The pcp struct
      values are initialized to default
      values, i.e., pcp->high = 0 &
      pcp->batch = 1.
      
      					Allocate the pages from the
      					movable zone.
      
      Try to Online the second memory
      block in the movable zone thus it
      entered the online_pages() but yet
      to call zone_pcp_update().
      					This process is entered into
      					the exit path thus it tries
      					to release the order-0 pages
      					to pcp lists through
      					free_unref_page_commit().
      					As pcp->high = 0, pcp->count = 1
      					proceed to call the function
      					free_pcppages_bulk().
      Update the pcp values thus the
      new pcp values are like, say,
      pcp->high = 378, pcp->batch = 63.
      					Read the pcp's batch value using
      					READ_ONCE() and pass the same to
      					free_pcppages_bulk(), pcp values
      					passed here are, batch = 63,
      					count = 1.
      
      					Since num of pages in the pcp
      					lists are less than ->batch,
      					then it will stuck in
      					while(list_empty(list)) loop
      					with interrupts disabled thus
      					a core hung.
      
      Avoid this by ensuring free_pcppages_bulk() is called with proper count of
      pcp list pages.
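
      The fix, condensed from the upstream commit:

      	/* In free_pcppages_bulk(), before walking the pcp lists: */

      	/*
      	 * Ensure a proper count is passed; otherwise the
      	 * while (list_empty(list)) loop below would get stuck with
      	 * interrupts disabled.
      	 */
      	count = min(pcp->count, count);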
      
      The mentioned race is somewhat easily reproducible without [1] because
      the pcp's are not updated for the first memory block online, and thus
      there is a big enough race window for P2 between the alloc+free and
      the pcp struct values update through the onlining of the second memory
      block.
      
      With [1], the race still exists but is very narrow, as we update the
      pcp struct values for the first memory block online itself.
      
      This is not limited to the movable zone; it could also happen in cases
      with the normal zone (e.g., hotplug to a node that only has DMA memory, or
      no other memory yet).
      
      [1]: https://patchwork.kernel.org/patch/11696389/
      
      Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type")
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: <stable@vger.kernel.org> [2.6+]
      Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bd946649
    • D
      mm: include CMA pages in lowmem_reserve at boot · c2a701de
      Authored by Doug Berger
      stable inclusion
      from linux-4.19.142
      commit 84b8dc232afadf3aab425a104def45a1e7346a58
      
      --------------------------------
      
      commit e08d3fdf upstream.
      
      The lowmem_reserve arrays provide a means of applying pressure against
      allocations from lower zones that were targeted at higher zones.  Its
      values are a function of the number of pages managed by higher zones and
      are assigned by a call to the setup_per_zone_lowmem_reserve() function.
      
      The function is initially called at boot time by
      init_per_zone_wmark_min() and may be called again later by accesses to
      the /proc/sys/vm/lowmem_reserve_ratio sysctl file.
      
      The function init_per_zone_wmark_min() was moved up from a module_init to
      a core_initcall to resolve a sequencing issue with khugepaged.
      Unfortunately this created a sequencing issue with CMA page accounting.
      
      The CMA pages are added to the managed page count of a zone when
      cma_init_reserved_areas() is called at boot also as a core_initcall.  This
      makes it uncertain whether the CMA pages will be added to the managed page
      counts of their zones before or after the call to
      init_per_zone_wmark_min() as it becomes dependent on link order.  With the
      current link order the pages are added to the managed count after the
      lowmem_reserve arrays are initialized at boot.
      
      This means the lowmem_reserve values at boot may be lower than the
      values used later if /proc/sys/vm/lowmem_reserve_ratio is accessed,
      even if the ratio values are unchanged.
      
      In many cases the difference is not significant, but for example
      an ARM platform with 1GB of memory and the following memory layout
      
        cma: Reserved 256 MiB at 0x0000000030000000
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x000000002fffffff]
          Normal   empty
          HighMem  [mem 0x0000000030000000-0x000000003fffffff]
      
      would result in 0 lowmem_reserve for the DMA zone.  This would allow
      userspace to deplete the DMA zone easily.
      
      Funnily enough,
      
        $ cat /proc/sys/vm/lowmem_reserve_ratio
      
      would fix up the situation, because as a side effect it forces a call
      to setup_per_zone_lowmem_reserve().
      
      This commit breaks the link-order dependency by invoking
      init_per_zone_wmark_min() as a postcore_initcall, so that the CMA pages
      have a chance to be properly accounted in their zone(s), allowing the
      lowmem_reserve arrays to receive consistent values.
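      
      The change itself is a one-liner in mm/page_alloc.c, moving the
      initcall one level later than the core_initcall level used by
      cma_init_reserved_areas():
      
        -core_initcall(init_per_zone_wmark_min)
        +postcore_initcall(init_per_zone_wmark_min)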
      
      Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c2a701de
  12. 22 Sep 2020, 6 commits
    • P
      mm: initialize deferred pages with interrupts enabled · 46909518
      Authored by Pavel Tatashin
      stable inclusion
      from linux-4.19.129
      commit 88afa532c14135528b905015f1d9a5e740a95136
      
      --------------------------------
      
      commit 3d060856 upstream.
      
      Initializing struct pages is a long task and keeping interrupts disabled
      for the duration of this operation introduces a number of problems.
      
      1. jiffies are not updated for a long period of time, and thus an
         incorrect time is reported. See the proposed solution and discussion
         here: lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      2. It prevents further improvements to deferred page initialization,
         such as intra-node multi-threading.
      
      We are keeping interrupts disabled to solve a rather theoretical
      problem that was never observed in the real world (see 3a2d7fa8).
      
      Let's keep interrupts enabled. In case we ever encounter a scenario
      where an interrupt thread wants to allocate a large amount of memory
      this early in boot, we can deal with that by growing the zone (see
      deferred_grow_zone()) by the needed amount before starting the
      deferred_init_memmap() threads.
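      
      Conceptually (a simplified sketch, not the literal patch), the change
      in deferred_init_memmap() is to hold the irq-disabling resize lock only
      around the brief pgdat bookkeeping instead of around the whole
      initialization loop:
      
        /* before: interrupts stay off for the entire initialization */
        pgdat_resize_lock(pgdat, &flags);       /* spin_lock_irqsave() */
        /* ... long loop initializing millions of struct pages ... */
        pgdat_resize_unlock(pgdat, &flags);
        
        /* after: the lock only protects the first_deferred_pfn handoff;
         * the init loop itself runs with interrupts enabled
         */
        pgdat_resize_lock(pgdat, &flags);
        first_init_pfn = pgdat->first_deferred_pfn;
        pgdat->first_deferred_pfn = ULONG_MAX;
        pgdat_resize_unlock(pgdat, &flags);
        /* ... long loop initializing millions of struct pages ... */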
      
      Before:
      [    1.232459] node 0 initialised, 12058412 pages in 1ms
      
      After:
      [    1.632580] node 0 initialised, 12051227 pages in 436ms
      
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Reported-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Signed-off-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Sasha Levin <sashal@kernel.org>
      Cc: Yiqian Wei <yiwei@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.17+]
      Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      46909518
    • D
      mm/page_alloc: fix watchdog soft lockups during set_zone_contiguous() · 5cb82e40
      Authored by David Hildenbrand
      stable inclusion
      from linux-4.19.123
      commit dfe810bd92be7a50f491abd381f5a742d9844675
      
      --------------------------------
      
      commit e84fe99b upstream.
      
      Without CONFIG_PREEMPT, soft lockups can be detected, e.g., while
      booting up.
      
        watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
        Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
        RIP: __pageblock_pfn_to_page+0x134/0x1c0
        Call Trace:
         set_zone_contiguous+0x56/0x70
         page_alloc_init_late+0x166/0x176
         kernel_init_freeable+0xfa/0x255
         kernel_init+0xa/0x106
         ret_from_fork+0x35/0x40
      
      The issue becomes visible when a lot of memory (e.g., 4TB) is assigned
      to a single NUMA node - a system that can easily be created using QEMU.
      Inside VMs on a hypervisor with quite some memory overcommit, this is
      fairly easy to trigger.
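      
      The upstream fix (e84fe99b) simply yields once per pageblock while
      walking the zone; a sketch of set_zone_contiguous() close to the
      upstream code:
      
        void set_zone_contiguous(struct zone *zone)
        {
                unsigned long block_start_pfn = zone->zone_start_pfn;
                unsigned long block_end_pfn;
        
                block_end_pfn = ALIGN(block_start_pfn + 1, pageblock_nr_pages);
                for (; block_start_pfn < zone_end_pfn(zone);
                                block_start_pfn = block_end_pfn,
                                block_end_pfn += pageblock_nr_pages) {
        
                        block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));
        
                        if (!__pageblock_pfn_to_page(block_start_pfn,
                                                     block_end_pfn, zone))
                                return;
                        /* the fix: give the scheduler/watchdog a chance */
                        cond_resched();
                }
        
                /* We confirm that memory does not have holes */
                zone->contiguous = true;
        }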
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Shile Zhang <shile.zhang@linux.alibaba.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Alexander Duyck <alexander.duyck@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      5cb82e40
    • A
      mm: Use fixed constant in page_frag_alloc instead of size + 1 · 54d789ed
      Authored by Alexander Duyck
      stable inclusion
      from linux-4.19.116
      commit 695986163d66a9f55daf13aba5976d4b03a23cc9
      
      --------------------------------
      
      commit 86447726 upstream.
      
      This patch replaces the size + 1 value, introduced with the recent fix
      for 1-byte allocs, with a constant value.
      
      The idea here is to reduce code overhead as the previous logic would have
      to read size into a register, then increment it, and write it back to
      whatever field was being used. By using a constant we can avoid those
      memory reads and arithmetic operations in favor of just encoding the
      maximum value into the operation itself.
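      
      The constant used upstream is PAGE_FRAG_CACHE_MAX_SIZE; a simplified
      sketch of the page-reuse path in page_frag_alloc() after the change:
      
        /* charge the refcount and the bias once for the maximum possible
         * number of frags instead of recomputing size + 1 per refill
         */
        set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
        nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;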
      
      Fixes: 2c2ade81 ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
      Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      54d789ed
    • D
      mm/page_alloc.c: fix uninitialized memmaps on a partially populated last section · d0a3efd5
      Authored by David Hildenbrand
      stable inclusion
      from linux-4.19.103
      commit 0a69047d8235c60d88c6ca488d8dccc7c60d4d3c
      
      --------------------------------
      
      [ Upstream commit e822969c ]
      
      Patch series "mm: fix max_pfn not falling on section boundary", v2.
      
      Playing with different memory sizes for an x86-64 guest, I discovered
      that some memmaps (the highest section, if max_mem does not fall on a
      section boundary) are marked as being valid and online, but contain
      garbage.  We have to properly initialize these memmaps.
      
      Looking at /proc/kpageflags and friends, I found some more issues,
      partially related to this.
      
      This patch (of 3):
      
      If max_pfn is not aligned to a section boundary, we can easily run into
      BUGs.  This can, e.g., be triggered on x86-64 under QEMU by specifying
      a memory size that is not a multiple of 128MB (e.g., 4097MB, but also
      4160MB).  I was told that on real HW we can easily have this scenario
      (it is, especially, one of the main reasons sub-section hotadd of
      devmem was added).
      
      The issue is that we have a valid memmap (pfn_valid()) for the whole
      section, and the whole section will be marked "online".
      pfn_to_online_page() will succeed, but the memmap contains garbage.
      
      E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with
      "-m 4160M" (see tools/vm/page-types.c):
      
      [  200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
      [  200.477500] #PF: supervisor read access in kernel mode
      [  200.478334] #PF: error_code(0x0000) - not-present page
      [  200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
      [  200.479557] Oops: 0000 [#4] SMP NOPTI
      [  200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G      D W         5.5.0-rc1-next-20191209 #93
      [  200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
      [  200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
      [  200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
      [  200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
      [  200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
      [  200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
      [  200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
      [  200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
      [  200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
      [  200.487130] FS:  00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
      [  200.487804] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
      [  200.488897] Call Trace:
      [  200.489115]  kpageflags_read+0xe9/0x140
      [  200.489447]  proc_reg_read+0x3c/0x60
      [  200.489755]  vfs_read+0xc2/0x170
      [  200.490037]  ksys_pread64+0x65/0xa0
      [  200.490352]  do_syscall_64+0x5c/0xa0
      [  200.490665]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      But it can be triggered much more easily via "cat /proc/kpageflags >
      /dev/null" after cold/hot plugging a DIMM to such a system:
      
      [root@localhost ~]# cat /proc/kpageflags > /dev/null
      [  111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
      [  111.517907] #PF: supervisor read access in kernel mode
      [  111.518333] #PF: error_code(0x0000) - not-present page
      [  111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0
      
      This patch fixes that by at least zeroing out that memmap (so, e.g.,
      page_to_pfn() will not crash).  Commit 907ec5fc ("mm: zero remaining
      unavailable struct pages") tried to fix a similar issue, but forgot to
      consider this special case.
      
      After this patch, there are still problems to solve.  E.g., not all of
      these pages falling into a memory hole will actually get initialized later
      and set PageReserved - they are only zeroed out - but at least the
      immediate crashes are gone.  A follow-up patch will take care of this.
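      
      Concretely (simplified from the upstream patch), the zeroing in
      zero_resv_unavail() is extended past max_pfn up to the end of the last
      section, reusing the zero_pfn_range() helper sketched under the older
      commits below:
      
        /*
         * Early sections have a memmap for the complete section; if
         * max_pfn lands mid-section, zero the tail of that memmap too
         * so pfn walkers see a well-defined state.
         */
        pgcnt += zero_pfn_range(PFN_DOWN(next),
                                round_up(max_pfn, PAGES_PER_SECTION));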
      
      Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Tested-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Bob Picco <bob.picco@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: <stable@vger.kernel.org>	[4.15+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      d0a3efd5
    • P
      mm: return zero_resv_unavail optimization · 809dc9f5
      Authored by Pavel Tatashin
      stable inclusion
      from linux-4.19.103
      commit f19a50c1e3ba9f58ca5a591a82ac4852da8bc4ee
      
      --------------------------------
      
      [ Upstream commit ec393a0f ]
      
      When checking for valid pfns in zero_resv_unavail(), it is not
      necessary to verify that every pfn within a pageblock_nr_pages range is
      valid; only the first one needs to be checked.  This is because memory
      for struct pages is allocated in contiguous chunks that contain
      pageblock_nr_pages struct pages.
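      
      A sketch of the resulting helper, close to the upstream code: an
      invalid pageblock is skipped in a single step instead of testing every
      pfn in it:
      
        static u64 __init zero_pfn_range(unsigned long spfn, unsigned long epfn)
        {
                unsigned long pfn;
                u64 pgcnt = 0;
        
                for (pfn = spfn; pfn < epfn; pfn++) {
                        /* one pfn_valid() check covers a whole pageblock */
                        if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
                                pfn = ALIGN_DOWN(pfn, pageblock_nr_pages)
                                        + pageblock_nr_pages - 1;
                                continue;
                        }
                        mm_zero_struct_page(pfn_to_page(pfn));
                        pgcnt++;
                }
        
                return pgcnt;
        }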
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
      Signed-off-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      809dc9f5
    • N
      mm: zero remaining unavailable struct pages · f8d9d8ce
      Authored by Naoya Horiguchi
      stable inclusion
      from linux-4.19.103
      commit 9ac5917a1d28220981512c4f4c391c90a997e0c6
      
      --------------------------------
      
      [ Upstream commit 907ec5fc ]
      
      Patch series "mm: Fix for movable_node boot option", v3.
      
      This patch series contains a fix for the movable_node boot option issue
      which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM
      regions into memblock.reserved").
      
      The commit breaks the option because it changed the memory gap ranges
      to reserved memblock regions.  As a result, the node is marked as a
      Normal zone even if the SRAT has hot-pluggable affinity for it.
      
      The first and second patches fix the original issue which the commit
      tried to address, and then the commit is reverted.
      
      This patch (of 3):
      
      There is a kernel panic that is triggered when reading /proc/kpageflags
      on a kernel booted with the kernel parameter 'memmap=nn[KMG]!ss[KMG]':
      
        BUG: unable to handle kernel paging request at fffffffffffffffe
        PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
        Oops: 0000 [#1] SMP PTI
        CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
        RIP: 0010:stable_page_flags+0x27/0x3c0
        Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
        RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
        RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
        RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
        RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
        R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
        R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
        FS:  00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
        Call Trace:
         kpageflags_read+0xc7/0x120
         proc_reg_read+0x3c/0x60
         __vfs_read+0x36/0x170
         vfs_read+0x89/0x130
         ksys_pread64+0x71/0x90
         do_syscall_64+0x5b/0x160
         entry_SYSCALL_64_after_hwframe+0x44/0xa9
        RIP: 0033:0x7efc42e75e23
        Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24
      
      According to kernel bisection, this problem became visible due to commit
      f7f99100 which changes how struct pages are initialized.
      
      Memblock layout affects the pfn ranges covered by a node/zone.
      Consider a VM with 2 NUMA nodes, each node having 4GB of memory, and
      the default (no memmap= given) memblock layout below:
      
        MEMBLOCK configuration:
         memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x4
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
         memory[0x3]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      If you give memmap=1G!4G (so it just covers memory[0x2]),
      the range [0x100000000-0x13fffffff] is gone:
      
        MEMBLOCK configuration:
         memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
         memory.cnt  = 0x3
         memory[0x0]     [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
         memory[0x1]     [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
         memory[0x2]     [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
         ...
      
      This shrinks node 0's pfn range, because that range is calculated from
      the address range of memblock.memory.  So some of the struct pages in
      the gap range are left uninitialized.
      
      We have a function zero_resv_unavail() which zeroes the struct pages
      outside memblock.memory, but currently it covers only the reserved
      unavailable range (i.e.  memblock.memory && !memblock.reserved).  This
      patch extends it to cover all unavailable ranges, which fixes the
      reported issue.
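      
      Roughly, instead of walking only the reserved ranges, the function now
      walks memblock.memory and zeroes every gap between (and after) its
      regions; a simplified sketch, with zero_pfn_range() as a small helper
      that zeroes the struct pages of a pfn range:
      
        phys_addr_t start, end, next = 0;
        u64 i, pgcnt = 0;
        
        for_each_mem_range(i, &memblock.memory, NULL, NUMA_NO_NODE,
                           MEMBLOCK_NONE, &start, &end, NULL) {
                if (next < start)
                        pgcnt += zero_pfn_range(PFN_DOWN(next),
                                                PFN_UP(start));
                next = end;
        }
        /* everything beyond the last memory region is unavailable too */
        pgcnt += zero_pfn_range(PFN_DOWN(next), max_pfn);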
      
      Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
      Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Tested-by: Oscar Salvador <osalvador@suse.de>
      Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      f8d9d8ce
  13. 06 Aug 2020, 1 commit
  14. 27 Dec 2019, 2 commits