1. 16 December 2020, 7 commits
  2. 19 November 2020, 1 commit
    • page_frag: Recover from memory pressure · d8c19014
      Authored by Dongli Zhang
      An ethernet driver may allocate an skb (and skb->data) via napi_alloc_skb().
      This ends up calling page_frag_alloc() to allocate skb->data from
      page_frag_cache->va.
      
      Under memory pressure, page_frag_cache->va may be allocated as a
      pfmemalloc page. As a result, skb->pfmemalloc is always true because
      skb->data comes from page_frag_cache->va. The skb will be dropped if the
      sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
      under memory pressure.
      
      However, once the kernel is no longer under memory pressure (suppose a
      large amount of memory pages has just been reclaimed), page_frag_alloc()
      may still re-use the prior pfmemalloc page_frag_cache->va to allocate
      skb->data. As a result, skb->pfmemalloc stays true until
      page_frag_cache->va is re-allocated, even though the kernel is no longer
      under memory pressure.
      
      Here is how the kernel runs into the issue.
      
      1. The kernel is under memory pressure and the allocation of
      PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() fails. Instead,
      a pfmemalloc page is allocated for page_frag_cache->va.
      
      2. All skb->data allocated from page_frag_cache->va (pfmemalloc) will have
      skb->pfmemalloc=true. The skb will always be dropped by a sock without
      SOCK_MEMALLOC. This is expected behaviour.
      
      3. Suppose a large amount of pages is reclaimed and the kernel is no longer
      under memory pressure. We expect the skb->pfmemalloc drops to stop.
      
      4. Unfortunately, page_frag_alloc() does not proactively re-allocate
      page_frag_cache->va and will always re-use the prior pfmemalloc page.
      skb->pfmemalloc remains true even though the kernel is no longer under
      memory pressure.
      
      Fix this by freeing and re-allocating the page instead of recycling it.
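
      For illustration, here is a hedged sketch of the shape of the fix, not the
      exact upstream hunk: page_frag_cache's ->pfmemalloc flag and the helpers
      page_is_pfmemalloc(), __page_frag_cache_refill() and free_the_page() are
      existing mm/page_alloc.c internals assumed here, and the fast-path
      bookkeeping is heavily simplified.
      
        /* Simplified sketch of page_frag_alloc() with the recovery logic. */
        void *page_frag_alloc_sketch(struct page_frag_cache *nc,
                                     unsigned int fragsz, gfp_t gfp_mask)
        {
                struct page *page;
        
                if (unlikely(!nc->va)) {
        refill:
                        page = __page_frag_cache_refill(nc, gfp_mask);
                        if (!page)
                                return NULL;
                        /* remember whether this page came from pfmemalloc reserves */
                        nc->pfmemalloc = page_is_pfmemalloc(page);
                        nc->offset = nc->size;  /* fresh page, start carving from the top */
                }
        
                if (unlikely(nc->offset < fragsz)) {
                        page = virt_to_page(nc->va);
                        /* ... existing refcount recycling checks elided ... */
                        if (unlikely(nc->pfmemalloc)) {
                                /* free the emergency page instead of recycling it */
                                free_the_page(page, compound_order(page));
                                goto refill;
                        }
                        nc->offset = nc->size;  /* reset and keep recycling the page */
                }
        
                nc->offset -= fragsz;
                return nc->va + nc->offset;
        }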
      
      References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
      References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
      Suggested-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Aruna Ramakrishna <aruna.ramakrishna@oracle.com>
      Cc: Bert Barbe <bert.barbe@oracle.com>
      Cc: Rama Nichanamatlu <rama.nichanamatlu@oracle.com>
      Cc: Venkat Venkatsubra <venkat.x.venkatsubra@oracle.com>
      Cc: Manjunath Patil <manjunath.b.patil@oracle.com>
      Cc: Joe Jin <joe.jin@oracle.com>
      Cc: SRINIVAS <srinivas.eeda@oracle.com>
      Fixes: 79930f58 ("net: do not deplete pfmemalloc reserve")
      Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
  3. 17 October 2020, 12 commits
    • mm: rename page_order() to buddy_order() · ab130f91
      Authored by Matthew Wilcox (Oracle)
      The current page_order() can only be called on pages in the buddy
      allocator.  For compound pages, you have to use compound_order().  This is
      confusing and led to a bug, so rename page_order() to buddy_order().
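
      The rename itself is mechanical; roughly (a sketch, assuming the usual
      page_private() storage of the order of a free page):
      
        /* Only valid for pages that are currently in the buddy allocator. */
        static inline unsigned int buddy_order(struct page *page)
        {
                /* the order is stashed in page->private while the page is free */
                return page_private(page);
        }
      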
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: place pages to tail in __free_pages_core() · 7fef431b
      Authored by David Hildenbrand
      __free_pages_core() is used when exposing fresh memory to the buddy during
      system boot and when onlining memory in generic_online_page().
      
      generic_online_page() is used in two cases:
      
      1. Direct memory onlining in online_pages().
      2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
         balloon and virtio-mem), when parts of a section are kept
         fake-offline to be fake-onlined later on.
      
      In 1, we already place pages to the tail of the freelist.  Pages will be
      freed to MIGRATE_ISOLATE lists first and moved to the tail of the
      freelists via undo_isolate_page_range().
      
      In 2, we currently don't implement a proper rule.  In case of virtio-mem,
      where we currently always online MAX_ORDER - 1 pages, the pages will be
      placed to the HEAD of the freelist - undesirable.  While the Hyper-V
      balloon calls generic_online_page() with single pages, usually it will
      call it on successive single pages in a larger block.
      
      The pages are fresh, so place them to the tail of the freelist and avoid
      the PCP.  In __free_pages_core(), remove the now superfluous call to
      set_page_refcounted() and add a comment regarding page initialization and
      the refcount.
      
      Note: In 2.  we currently don't shuffle.  If ever relevant (page shuffling
      is usually of limited use in virtualized environments), we might want to
      shuffle after a sequence of generic_online_page() calls in the relevant
      callers.
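
      A hedged sketch of the resulting shape of __free_pages_core() (the fpi_t /
      FPI_TO_TAIL plumbing comes from the same series; prefetching, reserved-flag
      clearing and other details are elided, so this is not the exact upstream
      code):
      
        void __free_pages_core_sketch(struct page *page, unsigned int order)
        {
                unsigned int nr_pages = 1 << order;
        
                /*
                 * The refcount of fresh pages is already initialized by the
                 * memmap init code, so no extra set_page_refcounted() here.
                 */
                atomic_long_add(nr_pages, &page_zone(page)->managed_pages);
        
                /* bypass the PCP and queue the pages at the freelist tail */
                __free_pages_ok(page, order, FPI_TO_TAIL);
        }
      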
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: move pages to tail in move_to_free_list() · 293ffa5e
      Authored by David Hildenbrand
      Whenever we move pages between freelists via move_to_free_list()/
      move_freepages_block(), we don't actually touch the pages:
      1. Page isolation doesn't actually touch the pages, it simply isolates
         pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
         When undoing isolation, we move the pages back to the target list.
      2. Page stealing (steal_suitable_fallback()) moves free pages directly
         between lists without touching them.
      3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
         free pages directly between freelists without touching them.
      
      We already place pages to the tail of the freelists when undoing isolation
      via __putback_isolated_page(); let's do it in any case (e.g., if order <=
      pageblock_order) and document the behavior.  To simplify, let's move the
      pages to the tail for all move_to_free_list()/move_freepages_block() users.
      
      In 2., the target list is empty, so there should be no change.  In 3., we
      might observe a change, however, highatomic is more concerned about
      allocations succeeding than cache hotness - if we ever realize this change
      degrades a workload, we can special-case this instance and add a proper
      comment.
      
      This change results in all pages getting onlined via online_pages() to be
      placed to the tail of the freelist.
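
      A hedged sketch of the core change (a simplified rendering of the
      mm/page_alloc.c helper; the exact upstream code may differ in detail):
      
        /*
         * Used for pages which are on another list: move them to the tail of
         * the target freelist, since they have not been touched recently.
         */
        static inline void move_to_free_list(struct page *page, struct zone *zone,
                                             unsigned int order, int migratetype)
        {
                struct free_area *area = &zone->free_area[order];
        
                /* list_move_tail() instead of list_move(): place at the tail */
                list_move_tail(&page->lru, &area->free_list[migratetype]);
        }
      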
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: place pages to tail in __putback_isolated_page() · 47b6a24a
      Authored by David Hildenbrand
      __putback_isolated_page() already documents that pages will be placed to
      the tail of the freelist - this is, however, not the case for "order >=
      MAX_ORDER - 2" (see buddy_merge_likely()), which happens to cover all
      existing users.
      
      This change affects two users:
      - free page reporting
      - page isolation, when undoing the isolation (including memory onlining).
      
      This behavior is desirable for pages that haven't really been touched
      lately, so exactly the two users that don't actually read/write page
      content, but rather move untouched pages.
      
      The new behavior is especially desirable for memory onlining, where we
      allow allocation of newly onlined pages via undo_isolate_page_range() in
      online_pages().  Right now, we always place them to the head of the
      freelist, resulting in undesirable behavior: Assume we add individual
      memory chunks via add_memory() and online them right away to the NORMAL
      zone.  We create a dependency chain of unmovable allocations e.g., via the
      memmap.  The memmap of the next chunk will be placed onto previous chunks
      - if the last block cannot get offlined+removed, all dependent ones cannot
      get offlined+removed.  While this can already be observed with individual
      DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
      DLPAR).
      
      Document that this should only be used for optimizations, and no code
      should rely on this behavior for correctness (if the order of the freelists
      ever changes).
      
      We won't care about page shuffling: memory onlining already properly
      shuffles after onlining.  free page reporting doesn't care about
      physically contiguous ranges, and there are already cases where page
      isolation will simply move (physically close) free pages to (currently)
      the head of the freelists via move_freepages_block() instead of shuffling.
      If this becomes ever relevant, we should shuffle the whole zone when
      undoing isolation of larger ranges, and after free_contig_range().
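
      A hedged sketch of how the putback path expresses this (FPI_TO_TAIL comes
      from this series; zone->lock handling and accounting are elided, so treat
      this as illustrative rather than the exact upstream code):
      
        /* Return an isolated page to the buddy, explicitly at the freelist tail. */
        void __putback_isolated_page_sketch(struct page *page, unsigned int order,
                                            int mt)
        {
                struct zone *zone = page_zone(page);
        
                /* zone->lock is assumed to be held by the caller */
                __free_one_page(page, page_to_pfn(page), zone, order, mt,
                                FPI_SKIP_REPORT_NOTIFY | FPI_TO_TAIL);
        }
      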
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: convert "report" flag of __free_one_page() to a proper flag · f04a5d5d
      Authored by David Hildenbrand
      Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.
      
      When adding separate memory blocks via add_memory*() and onlining them
      immediately, the metadata (especially the memmap) of the next block will
      be placed onto one of the just added+onlined blocks.  This creates a chain
      of unmovable allocations: if the last memory block cannot get
      offlined+removed, neither can any of the blocks it depends on.  We directly
      have unmovable allocations all over the place.
      
      This can be observed quite easily using virtio-mem, however, it can also
      be observed when using DIMMs.  The freshly onlined pages will usually be
      placed to the head of the freelists, meaning they will be allocated next,
      which usually makes the just-added memory immediately un-removable.  The
      fresh pages are cold, so preferring to allocate others (that might be hot)
      also feels like the natural thing to do.
      
      It also applies to the Hyper-V balloon, the Xen balloon, and ppc64 dlpar:
      when adding separate, successive memory blocks, each memory block will have
      unmovable allocations on it - for example, gigantic pages will fail to
      allocate.
      
      While ZONE_NORMAL doesn't provide any guarantees that memory can get
      offlined+removed again (any kind of fragmentation with unmovable
      allocations is possible), there are many scenarios (hotplugging a lot of
      memory, running a workload, hotunplugging some memory/as much as possible)
      where we can offline+remove quite a lot with this patch set.
      
      a) To visualize the problem, a very simple example:
      
      Start a VM with 4GB and 8GB of virtio-mem memory:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE  BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes   0-23
       0x0000000100000000-0x000000033fffffff   9G online       yes 32-103
      
       Memory block size:       128M
       Total online memory:      12G
       Total offline memory:      0B
      
      Then try to unplug as much as possible using virtio-mem. Observe which
      memory blocks are still around. Without this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                  SIZE  STATE REMOVABLE   BLOCK
       0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
       0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
       0x0000000148000000-0x000000014fffffff  128M online       yes      41
       0x0000000158000000-0x000000015fffffff  128M online       yes      43
       0x0000000168000000-0x000000016fffffff  128M online       yes      45
       0x0000000178000000-0x000000017fffffff  128M online       yes      47
       0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
       0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
       0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
       0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
       0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
       0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
       0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
       0x0000000200000000-0x0000000207ffffff  128M online       yes      64
       0x0000000210000000-0x0000000217ffffff  128M online       yes      66
       0x0000000220000000-0x0000000227ffffff  128M online       yes      68
       0x0000000230000000-0x0000000237ffffff  128M online       yes      70
       0x0000000240000000-0x0000000247ffffff  128M online       yes      72
       0x0000000250000000-0x0000000257ffffff  128M online       yes      74
       0x0000000260000000-0x0000000267ffffff  128M online       yes      76
       0x0000000270000000-0x0000000277ffffff  128M online       yes      78
       0x0000000280000000-0x0000000287ffffff  128M online       yes      80
       0x0000000290000000-0x0000000297ffffff  128M online       yes      82
       0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
       0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
       0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
       0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
       0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
       0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
       0x0000000300000000-0x0000000307ffffff  128M online       yes      96
       0x0000000310000000-0x0000000317ffffff  128M online       yes      98
       0x0000000320000000-0x0000000327ffffff  128M online       yes     100
       0x0000000330000000-0x000000033fffffff  256M online       yes 102-103
      
       Memory block size:       128M
       Total online memory:     8.1G
       Total offline memory:      0B
      
      With this patch set:
      
       [root@localhost ~]# lsmem
       RANGE                                 SIZE  STATE REMOVABLE BLOCK
       0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
       0x0000000100000000-0x000000013fffffff   1G online       yes 32-39
      
       Memory block size:       128M
       Total online memory:       4G
       Total offline memory:      0B
      
      All memory can get unplugged, and all memory blocks can get removed.  Of
      course, no workload ran and the system was basically idle, but it
      highlights the issue - the fairly deterministic chain of unmovable
      allocations.  When a huge page for the 2MB memmap is needed, a
      just-onlined 4MB page will be split.  The remaining 2MB page will be used
      for the memmap of the next memory block.  So one memory block will hold
      the memmap of the two following memory blocks.  Finally the pages of the
      last-onlined memory block will get used for the next bigger allocations -
      if any allocation is unmovable, all dependent memory blocks cannot get
      unplugged and removed until that allocation is gone.
      
      Note that with bigger memory blocks (e.g., 256MB), *all* memory
      blocks are dependent and none can get unplugged again!
      
      b) Experiment with memory intensive workload
      
      I performed an experiment with an older version of this patch set (before
      we used undo_isolate_page_range() in online_pages()): hotplug 56GB to a VM
      with an initial 4GB, onlining all memory to ZONE_NORMAL right from the
      kernel when adding it.  I then ran various memory-intensive workloads that
      consume most system memory for a total of 45 minutes.  Once finished, I
      tried to unplug as much memory as possible.
      
      With this change, I am able to remove, via virtio-mem (adding individual
      128MB memory blocks), 413 out of 448 added memory blocks, and via
      individual (256MB) DIMMs 380 out of 448 added memory blocks.  (I don't have
      any numbers without this patch set, but looking at the above example, it's
      at most half of the 448 memory blocks for virtio-mem, and most probably
      none for DIMMs.)
      
      Again, there are workloads that might behave very differently due to the
      nature of ZONE_NORMAL.
      
      This change also affects (besides memory onlining):
      - Other users of undo_isolate_page_range(): Pages are always placed to the
        tail.
      -- When memory offlining fails
      -- When memory isolation fails after having isolated some pageblocks
      -- When alloc_contig_range() either succeeds or fails
      - Other users of __putback_isolated_page(): Pages are always placed to the
        tail.
      -- Free page reporting
      - Other users of __free_pages_core()
      -- AFAIK, any memory that is getting exposed to the buddy during boot.
         IIUC we will now usually allocate memory from lower addresses within
         a zone first (especially during boot).
      - Other users of generic_online_page()
      -- Hyper-V balloon
      
      This patch (of 5):
      
      Let's prepare for additional flags and avoid long parameter lists of
      bools.  Follow-up patches will also make use of the flags in
      __free_pages_ok().
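
      A hedged sketch of the flag type this patch introduces (names as described
      in the series; comments abridged, so the upstream definitions may differ
      slightly):
      
        /* Free Page Internal flags: passed down the page-freeing call chain. */
        typedef int __bitwise fpi_t;
        
        #define FPI_NONE                ((__force fpi_t)0)
        
        /*
         * Skip the free-page-reporting notification, e.g., when the caller will
         * report the freed range itself (replaces the old bool "report" argument
         * of __free_one_page()).
         */
        #define FPI_SKIP_REPORT_NOTIFY  ((__force fpi_t)BIT(0))
      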
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
      Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Wei Liu <wei.liu@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: pass migratetype into memmap_init_zone() and move_pfn_range_to_zone() · d882c006
      Authored by David Hildenbrand
      On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
      un-isolate the pages after memory onlining is complete.  Let's allow
      passing in the migratetype.
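
      A hedged sketch of the changed signature and its use on the onlining path
      (argument lists abridged; online_pages_sketch() is an illustrative stand-in
      for the relevant part of online_pages()):
      
        /* The caller now chooses the initial migratetype of the new range. */
        void move_pfn_range_to_zone(struct zone *zone, unsigned long start_pfn,
                                    unsigned long nr_pages,
                                    struct vmem_altmap *altmap, int migratetype);
        
        static void online_pages_sketch(struct zone *zone, unsigned long pfn,
                                        unsigned long nr_pages)
        {
                /* start fully isolated ... */
                move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
                /* ... online the pages, then undo_isolate_page_range() at the end */
        }
      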
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: drop stale pageblock comment in memmap_init_zone*() · 4eb29bd9
      Authored by David Hildenbrand
      Commit ac5d2539 ("mm: meminit: reduce number of times pageblocks are
      set during struct page init") moved the actual zone range check, leaving
      only the alignment check for pageblocks.
      
      Let's drop the stale comment and make the pageblock check easier to read.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-9-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_isolation: simplify return value of start_isolate_page_range() · 3fa0c7c7
      Authored by David Hildenbrand
      Callers no longer need the number of isolated pageblocks.  Let's simplify.
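
      A hedged sketch of the simplified contract (prototype only; the parameter
      list mirrors the kernel of that time and may differ in later releases):
      
        /*
         * Before: returned the number of isolated pageblocks (or a negative
         * errno).  Now callers only care about success or failure.
         *
         * Returns 0 on success, -EBUSY if any pageblock in the range could not
         * be isolated.
         */
        int start_isolate_page_range(unsigned long start_pfn, unsigned long end_pfn,
                                     unsigned migratetype, int flags);
      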
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_alloc: simplify __offline_isolated_pages() · 257bea71
      Authored by David Hildenbrand
      offline_pages() is the only user.  __offline_isolated_pages() never gets
      called with ranges that contain memory holes and we no longer care about
      the return value.  Drop the return value handling and all pfn_valid()
      checks.
      
      Update the documentation.
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
      Cc: Charan Teja Reddy <charante@codeaurora.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Mike Rapoport <rppt@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200819175957.28465-5-david@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: rework soft offline for in-use pages · 79f5f8fa
      Authored by Oscar Salvador
      This patch changes the way we set and handle in-use poisoned pages.  Until
      now, poisoned pages were released to the buddy allocator, trusting that
      the checks that take place at allocation time would act as a safe net and
      would skip that page.
      
      This has proved to be wrong, as there are pfn walkers out there, like
      compaction, that only care about the page being in a buddy freelist.
      
      Although this might not be the only such user, having poisoned pages in the
      buddy allocator seems like a bad idea, as we should only have free pages
      that are ready and meant to be used as such.
      
      Before explaining the taken approach, let us break down the kind of pages
      we can soft offline.
      
      - Anonymous THP (after the split, they end up being 4K pages)
      - Hugetlb
      - Order-0 pages (that can be either migrated or invalidated)
      
      * Normal pages (order-0 and anon-THP)
      
        - If they are clean and unmapped page cache pages, we invalidate
          them by means of invalidate_inode_page().
        - If they are mapped/dirty, we do the isolate-and-migrate dance.
      
      Either way, we do not call put_page() directly from those paths.  Instead,
      we keep the page and send it to page_handle_poison() to perform the right
      handling.
      
      page_handle_poison sets the HWPoison flag and does the last put_page.
      
      Down the chain, we placed a check for HWPoison page in
      free_pages_prepare, that just skips any poisoned page, so those pages
      do not end up in any pcplist/freelist.
      
      After that, we set the refcount on the page to 1 and we increment
      the poisoned pages counter.
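
      A hedged sketch of these two hooks (simplified; the upstream
      page_handle_poison() also deals with the hugetlb and free-page variants,
      and the free_pages_prepare() check is inlined there rather than split out
      into a helper as below):
      
        /* Mark the page poisoned once it will no longer go back to the buddy. */
        static void page_handle_poison_sketch(struct page *page)
        {
                SetPageHWPoison(page);
                put_page(page);         /* last put; free_pages_prepare() skips it */
                page_ref_inc(page);     /* keep the poisoned page pinned at refcount 1 */
                num_poisoned_pages_inc();
        }
        
        /* In free_pages_prepare(): never let a poisoned page reach a freelist. */
        static bool poisoned_page_should_be_skipped(struct page *page)
        {
                /* silently drop it instead of freeing it to a pcplist/freelist */
                return unlikely(PageHWPoison(page));
        }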
      
      If we see that the check in free_pages_prepare creates trouble, we can
      always do what we do for free pages:
      
        - wait until the page hits buddy's freelists
        - take it off, and flag it
      
      The downside of the above approach is that we could race with an
      allocation, so by the time we want to take the page off the buddy, the
      page has already been allocated, and we cannot soft offline it.
      But the user can always retry.
      
      * Hugetlb pages
      
        - We isolate-and-migrate them
      
      After the migration has succeeded, we call dissolve_free_huge_page, and
      we set HWPoison on the page if that succeeds.
      Hugetlb has a slightly different handling, though.
      
      While for non-hugetlb pages we cared about closing the race with an
      allocation, doing so for hugetlb pages requires quite some additional
      and intrusive code (we would need to hook into free_huge_page and some
      other places).
      So I decided to not make the code overly complicated and to just fail
      normally if the page was allocated in the meantime.
      
      We can always build on top of this.
      
      As a bonus, because of the way we now handle in-use pages, we no longer
      need the put-as-isolation-migratetype dance that guarded against poisoned
      pages ending up in pcplists.
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm,hwpoison: rework soft offline for free pages · 06be6ff3
      Authored by Oscar Salvador
      When trying to soft-offline a free page, we need to first take it off the
      buddy allocator.  Once we know it is out of reach, we can safely flag it as
      poisoned.
      
      take_page_off_buddy will be used to take a page meant to be poisoned off
      the buddy allocator.  take_page_off_buddy calls break_down_buddy_pages,
      which splits a higher-order page in case our page belongs to one.
      
      Once the page is under our control, we call page_handle_poison to set it
      as poisoned and grab a refcount on it.
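
      A hedged sketch of the flow (the real splitting work happens in
      break_down_buddy_pages(), which is elided; locking and the migratetype
      lookup follow the usual mm/page_alloc.c conventions, but the exact upstream
      code differs in detail):
      
        /* Take 'page' out of the buddy so it can be flagged as poisoned. */
        static bool take_page_off_buddy_sketch(struct page *page)
        {
                struct zone *zone = page_zone(page);
                unsigned long pfn = page_to_pfn(page);
                unsigned int order;
                bool ret = false;
        
                spin_lock_irq(&zone->lock);
                for (order = 0; order < MAX_ORDER; order++) {
                        /* the (possibly higher-order) buddy that contains 'page' */
                        struct page *head = pfn_to_page(pfn & ~((1UL << order) - 1));
        
                        if (PageBuddy(head) && buddy_order(head) >= order) {
                                int mt = get_pfnblock_migratetype(head,
                                                                  page_to_pfn(head));
        
                                /* split 'head' until 'page' is a standalone order-0 page */
                                break_down_buddy_pages(zone, head, page, 0,
                                                       buddy_order(head), mt);
                                ret = true;
                                break;
                        }
                }
                spin_unlock_irq(&zone->lock);
                return ret;
        }
      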
      Signed-off-by: Oscar Salvador <osalvador@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Aristeu Rozanski <aris@ruivo.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Dmitry Yakunin <zeil@yandex-team.ru>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Oscar Salvador <osalvador@suse.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Tony Luck <tony.luck@intel.com>
      Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/page_owner: change split_page_owner to take a count · 8fb156c9
      Authored by Matthew Wilcox (Oracle)
      The implementation of split_page_owner() prefers a count rather than the
      old order of the page.  When we support a variable size THP, we won't
      have the order at this point, but we will have the number of pages.
      So change the interface to what the caller and callee would prefer.
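
      A hedged sketch of the interface change (split_page_sketch() is an
      illustrative stand-in for callers such as split_page() and
      __split_huge_page()):
      
        /* Old interface: took the order of the original page.
         *   void split_page_owner(struct page *page, unsigned int order);
         * New interface: takes the number of resulting base pages. */
        void split_page_owner(struct page *page, unsigned int nr);
        
        static void split_page_sketch(struct page *page, unsigned int order)
        {
                /* callers convert once; a variable-size THP would pass 'nr' directly */
                split_page_owner(page, 1 << order);
        }
      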
      Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: SeongJae Park <sjpark@amazon.de>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  4. 14 October 2020, 11 commits
  5. 12 October 2020, 1 commit
  6. 04 October 2020, 1 commit
  7. 27 September 2020, 1 commit
    • mm: replace memmap_context by meminit_context · c1d0da83
      Authored by Laurent Dufour
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
      This is wrong, but it doesn't prevent the system from running.  However,
      when one of these memory blocks is later hot-unplugged and then
      hot-plugged again, the system detects an inconsistency in the sysfs
      layout and a BUG_ON() is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when a node's memory is registered,
      the range used can overlap another node's range, and thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a), register_mem_sect_under_node() should not rely on the
      system state to detect whether the link operation is triggered by a
      hotplug operation or not.  This is addressed by patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it generic to the hotplug operation and rename it to
      meminit_context.
      
      There is no functional change introduced by this patch.
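
      A hedged sketch of the rename (values as named in the series; the comments
      are mine):
      
        /*
         * Was: enum memmap_context { MEMMAP_EARLY, MEMMAP_HOTPLUG };
         * The context now describes memory init in general, not just the memmap.
         */
        enum meminit_context {
                MEMINIT_EARLY,          /* memory present at boot */
                MEMINIT_HOTPLUG,        /* memory added via hotplug */
        };
      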
      Suggested-by: David Hildenbrand <david@redhat.com>
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
      Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 29 August 2020, 1 commit
  9. 22 August 2020, 2 commits
    • mm, page_alloc: fix core hung in free_pcppages_bulk() · 88e8ac11
      Authored by Charan Teja Reddy
      The following race is observed with repeated online and offline cycles of
      memory blocks in the movable zone, with a delay between two successive
      onlines.
      
      P1						P2
      
      Online the first memory block in
      the movable zone. The pcp struct
      values are initialized to default
      values,i.e., pcp->high = 0 &
      pcp->batch = 1.
      
      					Allocate the pages from the
      					movable zone.
      
      Try to Online the second memory
      block in the movable zone thus it
      entered the online_pages() but yet
      to call zone_pcp_update().
      					This process is entered into
      					the exit path thus it tries
      					to release the order-0 pages
      					to pcp lists through
      					free_unref_page_commit().
      					As pcp->high = 0, pcp->count = 1
      					proceed to call the function
      					free_pcppages_bulk().
      Update the pcp values thus the
      new pcp values are like, say,
      pcp->high = 378, pcp->batch = 63.
      					Read the pcp's batch value using
      					READ_ONCE() and pass the same to
      					free_pcppages_bulk(), pcp values
      					passed here are, batch = 63,
      					count = 1.
      
      					Since num of pages in the pcp
      					lists are less than ->batch,
      					then it will stuck in
      					while(list_empty(list)) loop
      					with interrupts disabled thus
      					a core hung.
      
      Avoid this by ensuring free_pcppages_bulk() is called with a proper count
      of pcp list pages.
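
      A hedged sketch of the guard (the surrounding bulk-free loop is elided; the
      essence of the fix is a single clamp at the top of free_pcppages_bulk()):
      
        static void free_pcppages_bulk(struct zone *zone, int count,
                                       struct per_cpu_pages *pcp)
        {
                /*
                 * Never ask for more pages than the pcp lists actually hold,
                 * otherwise the while (list_empty(list)) scan below could spin
                 * forever with interrupts disabled.
                 */
                count = min(pcp->count, count);
        
                /* ... existing bulk-free loop over the migratetype lists ... */
        }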
      
      The mentioned race is somewhat easily reproducible without [1] because the
      pcp's are not updated for the first memory block online, and thus there is
      a big enough race window for P2 between alloc+free and the pcp struct
      values update through the onlining of the second memory block.
      
      With [1], the race still exists but it is very narrow as we update the pcp
      struct values for the first memory block online itself.
      
      This is not limited to the movable zone, it could also happen in cases
      with the normal zone (e.g., hotplug to a node that only has DMA memory, or
      no other memory yet).
      
      [1]: https://patchwork.kernel.org/patch/11696389/
      
      Fixes: 5f8dcc21 ("page-allocator: split per-cpu list into one-list-per-migrate-type")
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: <stable@vger.kernel.org> [2.6+]
      Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: include CMA pages in lowmem_reserve at boot · e08d3fdf
      Authored by Doug Berger
      The lowmem_reserve arrays provide a means of applying pressure against
      allocations from lower zones that were targeted at higher zones.  Its
      values are a function of the number of pages managed by higher zones and
      are assigned by a call to the setup_per_zone_lowmem_reserve() function.
      
      The function is initially called at boot time by the function
      init_per_zone_wmark_min() and may be called later by accesses of the
      /proc/sys/vm/lowmem_reserve_ratio sysctl file.
      
      The function init_per_zone_wmark_min() was moved up from a module_init to
      a core_initcall to resolve a sequencing issue with khugepaged.
      Unfortunately this created a sequencing issue with CMA page accounting.
      
      The CMA pages are added to the managed page count of a zone when
      cma_init_reserved_areas() is called at boot also as a core_initcall.  This
      makes it uncertain whether the CMA pages will be added to the managed page
      counts of their zones before or after the call to
      init_per_zone_wmark_min() as it becomes dependent on link order.  With the
      current link order the pages are added to the managed count after the
      lowmem_reserve arrays are initialized at boot.
      
      This means the lowmem_reserve values at boot may be lower than the values
      used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
      ratio values are unchanged.
      
      In many cases the difference is not significant, but for example
      an ARM platform with 1GB of memory and the following memory layout
      
        cma: Reserved 256 MiB at 0x0000000030000000
        Zone ranges:
          DMA      [mem 0x0000000000000000-0x000000002fffffff]
          Normal   empty
          HighMem  [mem 0x0000000030000000-0x000000003fffffff]
      
      would result in 0 lowmem_reserve for the DMA zone.  This would allow
      userspace to deplete the DMA zone easily.
      
      Funnily enough
      
        $ cat /proc/sys/vm/lowmem_reserve_ratio
      
      would fix up the situation because as a side effect it forces
      setup_per_zone_lowmem_reserve.
      
      This commit breaks the link order dependency by invoking
      init_per_zone_wmark_min() as a postcore_initcall so that the CMA pages
      have the chance to be properly accounted in their zone(s) and allowing
      the lowmem_reserve arrays to receive consistent values.
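
      A hedged sketch of the ordering change (initcall levels run in the order
      core_initcall -> postcore_initcall -> ...; exact placement in
      mm/page_alloc.c is abridged):
      
        /*
         * Was: core_initcall(init_per_zone_wmark_min);
         * CMA adds its pages to the zones' managed counts from a core_initcall,
         * so run the watermark/lowmem_reserve setup one level later.
         */
        postcore_initcall(init_per_zone_wmark_min);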
      
      Fixes: bc22af74 ("mm: update min_free_kbytes from khugepaged after core initialization")
      Signed-off-by: Doug Berger <opendmb@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Jason Baron <jbaron@akamai.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  10. 15 August 2020, 1 commit
  11. 13 August 2020, 2 commits