1. 16 Aug 2023, 1 commit
  2. 17 Feb 2023, 2 commits
  3. 19 Jan 2023, 1 commit
  4. 22 Nov 2022, 1 commit
  5. 21 Nov 2022, 1 commit
  6. 15 Nov 2022, 3 commits
  7. 11 Nov 2022, 5 commits
  8. 10 Nov 2022, 1 commit
  9. 17 Aug 2022, 2 commits
  10. 26 Jul 2022, 1 commit
  11. 19 Jul 2022, 1 commit
  12. 06 Jul 2022, 1 commit
    • mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node · ea2a0e2a
      Alistair Popple authored
      stable inclusion
      from stable-v5.10.110
      commit 7188e7c96f39ae40b8f8d6a807d3f338fb1927ac
      bugzilla: https://gitee.com/openeuler/kernel/issues/I574AL
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7188e7c96f39ae40b8f8d6a807d3f338fb1927ac
      
      --------------------------------
      
      commit ddbc84f3 upstream.
      
      ZONE_MOVABLE uses the remaining memory in each node.  Its starting pfn
      is also aligned to MAX_ORDER_NR_PAGES.  It is possible for the remaining
      memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
      not enough room for ZONE_MOVABLE on that node.
      
      Unfortunately this condition is not checked for.  This leads to
      zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
      node.
      
      calculate_node_totalpages() then sets zone->present_pages to be greater
      than zone->spanned_pages which is invalid, as spanned_pages represents
      the maximum number of pages in a zone assuming no holes.
      
      Subsequently it is possible free_area_init_core() will observe a zone of
      size zero with present pages.  In this case it will skip setting up the
      zone, including the initialisation of free_lists[].
      
      However populated_zone() checks zone->present_pages to see if a zone has
      memory available.  This is used by iterators such as
      walk_zones_in_node().  pagetypeinfo_showfree() uses this to walk the
      free_list of each zone in each node, which are assumed to be initialised
      due to the zone not being empty.
      
      As free_area_init_core() never initialised the free_lists[] this results
      in the following kernel crash when trying to read /proc/pagetypeinfo:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
        CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
        RIP: 0010:pagetypeinfo_show+0x163/0x460
        Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
        RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
        RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
        RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
        R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
        R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
        FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
        Call Trace:
         seq_read_iter+0x128/0x460
         proc_reg_read_iter+0x51/0x80
         new_sync_read+0x113/0x1a0
         vfs_read+0x136/0x1d0
         ksys_read+0x70/0xf0
         __x64_sys_read+0x1a/0x20
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix this by checking that the aligned zone_movable_pfn[] does not exceed
      the end of the node, and if it does skip creating a movable zone on this
      node.
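
      The check itself is small.  A sketch of the idea, in the style of
      find_zone_movable_pfns_for_nodes() in mm/page_alloc.c (treat it as an
      illustration of the fix described above rather than the exact upstream
      diff):

        /*
         * Align the start of ZONE_MOVABLE on each node to MAX_ORDER_NR_PAGES,
         * then make sure it still lies inside the node: if the aligned pfn is
         * at or past the node's last pfn there is no room left, so do not
         * create a movable zone on this node.
         */
        zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
                                        MAX_ORDER_NR_PAGES);
        get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
        if (zone_movable_pfn[nid] >= end_pfn)
                zone_movable_pfn[nid] = 0;      /* no ZONE_MOVABLE on this node */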
      
      Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
      Fixes: 2a1e274a ("Create the ZONE_MOVABLE zone")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      ea2a0e2a
  13. 27 Apr 2022, 2 commits
    • mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages · 56692753
      Baoquan He authored
      stable inclusion
      from stable-v5.10.94
      commit 6c6f86bb618b73007dc2bc8d4b4003f80ba1efeb
      bugzilla: https://gitee.com/openeuler/kernel/issues/I531X9
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6c6f86bb618b73007dc2bc8d4b4003f80ba1efeb
      
      --------------------------------
      
      commit c4dc63f0 upstream.
      
      In kdump kernel of x86_64, page allocation failure is observed:
      
       kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
       Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
       Workqueue: events_unbound async_run_entry_fn
       Call Trace:
        <TASK>
        dump_stack_lvl+0x48/0x5e
        warn_alloc.cold+0x72/0xd6
        __alloc_pages_slowpath.constprop.0+0xc69/0xcd0
        __alloc_pages+0x1df/0x210
        new_slab+0x389/0x4d0
        ___slab_alloc+0x58f/0x770
        __slab_alloc.constprop.0+0x4a/0x80
        kmem_cache_alloc_trace+0x24b/0x2c0
        sr_probe+0x1db/0x620
        ......
        device_add+0x405/0x920
        ......
        __scsi_add_device+0xe5/0x100
        ata_scsi_scan_host+0x97/0x1d0
        async_run_entry_fn+0x30/0x130
        process_one_work+0x1e8/0x3c0
        worker_thread+0x50/0x3b0
        ? rescuer_thread+0x350/0x350
        kthread+0x16b/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x22/0x30
        </TASK>
       Mem-Info:
       ......
      
      The above failure happened when kmalloc() was called to allocate a
      buffer with GFP_DMA.  The request tries to allocate a slab page from
      the DMA zone, which has no managed pages at all.
      
       sr_probe()
       --> get_capabilities()
           --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
      
      In the current kernel, the dma-kmalloc caches are created whenever
      CONFIG_ZONE_DMA is enabled.  However, the x86_64 kdump kernel has no
      managed pages in the DMA zone since commit 6f599d84 ("x86/kdump: Always
      reserve the low 1M when the crashkernel option is specified"), so the
      failure is always reproducible.
      
      For now, mute the allocation-failure warning when pages are requested
      from a DMA zone that has no managed pages.
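
      A sketch of what the muting amounts to in warn_alloc() in
      mm/page_alloc.c; the early-return condition simply gains one more
      clause (the rest of the function is omitted here):

        /*
         * Stay silent for __GFP_NOWARN, for rate-limited callers, and for
         * GFP_DMA requests when the DMA zone has no managed pages at all:
         * such a request can never succeed, so warning about it is noise.
         */
        if ((gfp_mask & __GFP_NOWARN) ||
            !__ratelimit(&nopage_rs) ||
            ((gfp_mask & __GFP_DMA) && !has_managed_dma()))
                return;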
      
      [akpm@linux-foundation.org: fix warning]
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      56692753
    • mm_zone: add function to check if managed dma zone exists · d4add921
      Baoquan He authored
      stable inclusion
      from stable-v5.10.94
      commit d2e572411738a5aad67901caef8e083fb9df29fd
      bugzilla: https://gitee.com/openeuler/kernel/issues/I531X9
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=d2e572411738a5aad67901caef8e083fb9df29fd
      
      --------------------------------
      
      commit 62b31070 upstream.
      
      Patch series "Handle warning of allocation failure on DMA zone w/o
      managed pages", v4.
      
      **Problem observed:
      On x86_64, when a crash is triggered and the system enters the kdump
      kernel, a page allocation failure can always be seen.
      
       ---------------------------------
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        ......
        __alloc_pages+0x24d/0x2c0
        ......
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        ? rcu_read_lock_sched_held+0x3f/0x80
        kernel_init_freeable+0x290/0x2dc
        ? rest_init+0x24f/0x24f
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ------------------------------------
      
      **Root cause:
      The current kernel assumes that the DMA zone must have managed pages
      and tries to request pages from it whenever CONFIG_ZONE_DMA is enabled,
      but this is not always true.  E.g. in the x86_64 kdump kernel, only the
      low 1M is present and it is locked down at a very early stage of boot,
      so it is never added to the buddy allocator and the DMA zone ends up
      with no managed pages.  Any page allocation from the DMA zone then
      fails.
      
      **Investigation:
      This failure has happened since the commits below were merged into
      Linus's tree.
        1a6a9044 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
        23721c8e x86/crash: Remove crash_reserve_low_1M()
        f1d4d47c x86/setup: Always reserve the first 1M of RAM
        7c321eb2 x86/kdump: Remove the backup region handling
        6f599d84 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
      
      Before those commits, on x86_64 the low 640K area was reused by the
      kdump kernel: its contents were copied into a backup region for dumping
      before jumping into kdump, and then, apart from the firmware-reserved
      regions in [0, 640K], the remaining area was added to the buddy
      allocator and became managed pages of the DMA zone.
      
      However, after the commits above, the low 1M in the x86_64 kdump kernel
      is reserved by memblock and never released to the buddy allocator, so
      any later page allocation from the DMA zone fails.
      
      Initially, the low 1M had to be locked down whenever a crashkernel was
      reserved, because AMD SME encrypts memory and makes the old
      backup-region mechanism impossible when switching into the kdump
      kernel.
      
      Later, it was also observed that some BIOSes corrupt memory under 1M.
      To solve this, commit f1d4d47c always reserves the entire low 1M region
      after the real mode trampoline is allocated.
      
      Besides, Intel engineers recently mentioned that TDX (Trusted Domain
      Extensions), which is under development in the kernel, also needs to
      lock down the low 1M.  So we cannot simply revert the commits above to
      fix the page allocation failure from the DMA zone, as was suggested.
      
      **Solution:
      Currently, only the DMA atomic pool and dma-kmalloc initialize and
      request page allocations with GFP_DMA during boot.
      
      So only initialize the DMA atomic pool when the DMA zone has managed
      pages available, otherwise skip the initialization.
      
      For dma-kmalloc, for the time being, mute the allocation-failure
      warning when pages are requested from a DMA zone that has no managed
      pages.  Meanwhile, change callers to use the dma_alloc_xx/dma_map_xx
      APIs instead of kmalloc(GFP_DMA), or drop GFP_DMA from kmalloc() calls
      where it is not necessary.  Christoph is posting patches to fix the
      callers under drivers/scsi/.  Eventually, the need for dma-kmalloc can
      be removed entirely, as people have suggested.
      
      This patch (of 3):
      
      Some places in the current kernel assume that the DMA zone must have
      managed pages if CONFIG_ZONE_DMA is enabled, but this is not always
      true.  E.g. in the x86_64 kdump kernel, only the low 1M is present and
      it is locked down at a very early stage of boot, so there are no
      managed pages at all in the DMA zone.  Any page allocation from the DMA
      zone then fails.
      
      Add the function has_managed_dma() and the relevant helpers to check
      whether there is a DMA zone with managed pages.  It will be used in
      later patches.
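
      Roughly, the helper walks every online node and reports whether any
      DMA zone has managed pages.  A sketch of that shape (an illustration of
      the idea rather than the exact upstream code):

        #ifdef CONFIG_ZONE_DMA
        bool has_managed_dma(void)
        {
                struct pglist_data *pgdat;

                for_each_online_pgdat(pgdat) {
                        struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                        /* managed_zone(): zone has pages in the buddy allocator */
                        if (managed_zone(zone))
                                return true;
                }
                return false;
        }
        #endif /* CONFIG_ZONE_DMA; a !CONFIG_ZONE_DMA stub would just return false */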
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
       Conflicts:
      	mm/page_alloc.c
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      d4add921
  14. 20 Mar 2022, 1 commit
  15. 23 Feb 2022, 3 commits
    • mm: Introduce reliable flag for user task · 8ee6e050
      Peng Wu authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      ------------------------------------------
      
      Add a reliable flag for user tasks.  A user task with the reliable flag
      set can only allocate memory from the mirrored region.  PF_RELIABLE is
      added to represent the task's reliable flag.
      
      - The init task is regarded as a special task and allocates memory from
        the mirrored region.
      
      - For normal user tasks, the reliable flag can be set via the procfs
        interface shown below and is inherited across fork().
      
      User can change a user task's reliable flag by
      
      	$ echo [0/1] > /proc/<pid>/reliable
      
      and check a user task's reliable flag by
      
      	$ cat /proc/<pid>/reliable
      
      Note that the global init task's reliable file cannot be accessed.
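
      A hypothetical sketch of the write side of /proc/<pid>/reliable; the
      handler name, locking and permission checks in the real openEuler
      implementation may differ, this only illustrates flipping PF_RELIABLE:

        /* "1" sets PF_RELIABLE on the target task, "0" clears it. */
        static ssize_t reliable_write(struct file *file, const char __user *buf,
                                      size_t count, loff_t *ppos)
        {
                struct task_struct *task = get_proc_task(file_inode(file));
                char val;

                if (!task)
                        return -ESRCH;
                if (count >= 1 && !get_user(val, buf)) {
                        if (val == '1')
                                task->flags |= PF_RELIABLE;
                        else if (val == '0')
                                task->flags &= ~PF_RELIABLE;
                }
                put_task_struct(task);
                return count;
        }
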
      Signed-off-by: Peng Wu <wupeng58@huawei.com>
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8ee6e050
    • mm: Introduce memory reliable · 6c59ddf2
      Ma Wupeng authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      --------------------------------
      
      Introduction
      ============
      
      The memory reliable feature is a memory tiering mechanism.  It is based
      on the kernel mirror feature, which splits memory into two separate
      regions: a mirrored (reliable) region and a non-mirrored (non-reliable)
      region.
      
      For the kernel mirror feature:
      
      - kernel memory is allocated from the mirrored region by default
      - user memory is allocated from the non-mirrored region by default
      
      The non-mirrored region is arranged into ZONE_MOVABLE.
      
      The memory reliable feature adds the following on top:
      
      - normal user tasks never allocate memory from the mirrored region via
        userspace APIs (malloc, mmap, etc.)
      - special user tasks allocate memory from the mirrored region by default
      - tmpfs/pagecache allocate memory from the mirrored region by default
      - an upper limit on the mirrored region allocated for user tasks, tmpfs
        and pagecache
      
      A reliable fallback mechanism is supported, which allows special user
      tasks, tmpfs and pagecache to fall back to allocating from the
      non-mirrored region; this is the default setting.
      
      In order to fulfil this goal:
      
      - a ___GFP_RELIABLE flag is added for allocating memory from the
        mirrored region.
      
      - the high_zoneidx for special user tasks/tmpfs/pagecache is set to
        ZONE_NORMAL.
      
      - normal user tasks can only allocate from ZONE_MOVABLE.
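
      A hedged sketch of the zone steering listed above; the helper name and
      its call sites are assumptions, not the actual openEuler code:

        /*
         * Requests tagged ___GFP_RELIABLE (special user tasks, tmpfs,
         * pagecache) are capped at ZONE_NORMAL, i.e. the mirrored region;
         * ordinary user allocations keep ZONE_MOVABLE (the non-mirrored
         * region) as their highest zone.
         */
        static inline enum zone_type reliable_high_zoneidx(gfp_t gfp_mask)
        {
                if (gfp_mask & ___GFP_RELIABLE)
                        return ZONE_NORMAL;
                return gfp_zone(gfp_mask);      /* ZONE_MOVABLE for normal user memory */
        }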
      
      This patch is just the main framework; memory reliable support for
      special user tasks, pagecache and tmpfs has its own patches.
      
      To enable this feature, mirrored (reliable) memory is needed and
      "kernelcore=reliable" must be added to the kernel parameters.
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      6c59ddf2
    • efi: Disable mirror feature if kernelcore is not specified · 856090e5
      Ma Wupeng authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      --------------------------------
      
      With this patch, the kernel checks mirrored_kernelcore before calling
      efi_find_mirror(), which enables the basic mirror feature (see the
      sketch below).
      
      If the system has some mirrored memory but the mirror feature is not
      requested on the boot command line, enabling the basic mirror feature
      anyway would lead to the following situations:
      
      - memblock memory allocation prefers the mirrored region.  This may
        have an unexpected influence on NUMA affinity.
      
      - contiguous memory is split into several parts if parts of it are
        mirrored memory, via memblock_mark_mirror().
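
      A sketch of the gating described above; the exact call site in the
      openEuler tree may differ, but the idea is a one-line guard:

        /*
         * Only probe the EFI memory map for mirrored ranges when the user
         * asked for the mirror feature via kernelcore= on the command line.
         * Otherwise, leave the mirror attribute alone so memblock does not
         * start preferring (and splitting around) mirrored regions.
         */
        if (mirrored_kernelcore)
                efi_find_mirror();
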
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      856090e5
  16. 11 Feb 2022, 2 commits
    • mm/page_alloc: use accumulated load when building node fallback list · f532b284
      Krupa Ramakrishnan authored
      mainline inclusion
      from mainline-v5.16-rc1
      commit 54d032ce
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4T0ML
      CVE: NA
      
      -----------------------------------------------
      
      In build_zonelists(), when the fallback list is built for the nodes,
      the node load gets reinitialized during each iteration.  This results
      in nodes with the same distance occupying the same slot in different
      node fallback lists rather than appearing in the intended round-robin
      manner, so one node gets picked for allocation more often than other
      nodes at the same distance.
      
      As an example, consider a 4 node system with the following distance
      matrix.
      
        Node 0  1  2  3
        ----------------
        0    10 12 32 32
        1    12 10 32 32
        2    32 32 10 12
        3    32 32 12 10
      
      For this case, the node fallback list gets built like this:
      
        Node  Fallback list
        ---------------------
        0     0 1 2 3
        1     1 0 3 2
        2     2 3 0 1
        3     3 2 0 1 <-- Unexpected fallback order
      
      In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
      same order which results in more allocations getting satisfied from node
      0 compared to node 1.
      
      The effect of this on remote memory bandwidth as seen by stream
      benchmark is shown below:
      
        Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
      	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
        Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
      	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	57479.6		110791.8
           SCALE	55372.9		105685.9
             ADD	50460.6		96734.2
          TRIADD	50397.6		97119.1
        ----------------------------------------
      
      The bandwidth drop in Case 1 occurs because most of the allocations get
      satisfied by node 0 as it appears first in the fallback order for both
      nodes 2 and 3.
      
      This can be fixed by accumulating the node load in build_zonelists()
      rather than reinitializing it during each iteration.  With this the
      nodes with the same distance rightly get assigned in the round robin
      manner.
      
      In fact this was how it was originally until commit f0c0b2b8
      ("change zonelist order: zonelist order selection logic") dropped the
      load accumulation and resorted to initializing the load during each
      iteration.
      
      While zonelist ordering was removed by commit c9bff3ee ("mm,
      page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
      accumulation in build_zonelists() remained.  So essentially this patch
      reverts back to the accumulated node load logic.
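
      A sketch of the relevant loop in build_zonelists() (5.10-era shape,
      slightly abbreviated); the only functional change is the accumulation:

        while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
                /*
                 * Penalize the first node in each distance group so that
                 * nodes at the same distance rotate through the fallback
                 * lists (round-robin) instead of always taking the same slot.
                 */
                if (node_distance(local_node, node) !=
                    node_distance(local_node, prev_node))
                        node_load[node] += load;        /* was: node_load[node] = load; */

                node_order[nr_nodes++] = node;
                prev_node = node;
                load--;
        }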
      
      After this fix, the fallback order gets built like this:
      
        Node Fallback list
        ------------------
        0    0 1 2 3
        1    1 0 3 2
        2    2 3 0 1
        3    3 2 1 0 <-- Note the change here
      
      The bandwidth in Case 1 improves and matches Case 2 as shown below.
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	110438.9	110107.2
           SCALE	105930.5	105817.5
             ADD	97005.1		96159.8
          TRIADD	97441.5		96757.1
        ----------------------------------------
      
      The correctness of the fallback list generation has been verified for
      the above node configuration, where node 3 starts as a memory-less node
      and comes up online only during memory hotplug.
      
      [bharata@amd.com: Added changelog, review, test validation]
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
      Fixes: f0c0b2b8 ("change zonelist order: zonelist order selection logic")
      Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
      Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: Bharata B Rao <bharata@amd.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      f532b284
    • mm/page_alloc: print node fallback order · 262a7219
      Bharata B Rao authored
      mainline inclusion
      from mainline-v5.16-rc1
      commit 6cf25392
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4T0ML
      CVE: NA
      
      -------------------------------------------------
      
      Patch series "Fix NUMA nodes fallback list ordering".
      
      For a NUMA system that has multiple nodes at the same distance from
      other nodes, the fallback list generation prefers the same node order
      for them instead of round-robin, thereby penalizing one node over the
      others.  This series fixes it.
      
      More description of the problem and the fix is present in the patch
      description.
      
      This patch (of 2):
      
      Print an informational message about the allocation fallback order for
      each NUMA node during boot.
      
      No functional changes here.  This makes it easier to illustrate the
      problem in the node fallback list generation, which the next patch
      fixes.
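
      The message is printed from build_zonelists() once the fallback order
      for a node has been computed; roughly:

        /* One line per node, e.g. "Fallback order for Node 0: 0 1 2 3" */
        pr_info("Fallback order for Node %d: ", local_node);
        for (node = 0; node < nr_nodes; node++)
                pr_cont("%d ", node_order[node]);
        pr_cont("\n");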
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
      Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
      Signed-off-by: Bharata B Rao <bharata@amd.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
      Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      262a7219
  17. 10 Feb 2022, 1 commit
  18. 19 Jan 2022, 3 commits
  19. 14 Jan 2022, 1 commit
    • hugetlb: address ref count racing in prep_compound_gigantic_page · df906dae
      Mike Kravetz authored
      mainline inclusion
      from mainline-v5.14-rc1
      commit 7118fc29
      category: bugfix
      bugzilla: 171843
      
      -----------------------------------------------
      
      In [1], Jann Horn points out a possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative.  The root
      cause of the possible race is prep_compound_gigantic_page unconditionally
      setting the ref count of pages to zero.  It does this because
      prep_compound_gigantic_page is handed a 'group' of pages from an allocator
      and needs to convert that group of pages to a compound page.  The ref
      count of each page in this 'group' is one as set by the allocator.
      However, the ref count of compound page tail pages must be zero.
      
      The potential race comes about when ref counted pages are returned from
      the allocator.  When this happens, other mm code could also take a
      reference on the page.  __page_cache_add_speculative is one such example.
      Therefore, prep_compound_gigantic_page can not just set the ref count of
      pages to zero as it does today.  Doing so would lose the reference taken
      by any other code.  This would lead to BUGs in code checking ref counts
      and could possibly even lead to memory corruption.
      
      There are two possible ways to address this issue.
      
      1) Make all allocators of gigantic groups of pages be able to return a
         properly constructed compound page.
      
      2) Make prep_compound_gigantic_page be more careful when constructing a
         compound page.
      
      This patch takes approach 2.
      
      In prep_compound_gigantic_page, use cmpxchg to only set the ref count to
      zero if it is one.  If the cmpxchg fails, call synchronize_rcu() in the
      hope that the extra ref count will be dropped during an RCU grace
      period.  This is not a performance-critical code path and the wait
      should be acceptable.  If the ref count is still inflated after the
      grace period, then undo any modifications made and return an error.
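
      A condensed sketch of the tail-page loop described above (based on this
      description; the real function does more, e.g. clearing PageReserved,
      and unwinds already-prepared tail pages on error):

        for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
                /* Atomically move the ref count 1 -> 0; respect extra refs. */
                if (!page_ref_freeze(p, 1)) {
                        /* Give a speculative reference one grace period to go away. */
                        synchronize_rcu();
                        if (!page_ref_freeze(p, 1))
                                goto out_error;         /* still inflated: undo and fail */
                }
                set_compound_head(p, page);
        }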
      
      Currently prep_compound_gigantic_page is type void and does not return
      errors.  Modify the two callers to check for and handle error returns.  On
      error, the caller must free the 'group' of pages as they can not be used
      to form a gigantic page.  After freeing pages, the runtime caller
      (alloc_fresh_huge_page) will retry the allocation once.  Boot time
      allocations can not be retried.
      
      The routine prep_compound_page also unconditionally sets the ref count of
      compound page tail pages to zero.  However, in this case the buddy
      allocator is constructing a compound page from freshly allocated pages.
      The ref count on those freshly allocated pages is already zero, so the
      set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
      remove it.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
      Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Chen Huang <chenhuang5@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      df906dae
  20. 08 Jan 2022, 1 commit
  21. 29 Dec 2021, 6 commits