1. 24 July 2023, 2 commits
    • mm: fix alloc CDM node memory for MPOL_BIND · 61e43f49
      Authored by Zhou Guanghui
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I612UG
      CVE: NA
      
      ------------------------------------------------------
      
      Memory can be allocated from a specified CDM node only when allocation
      from that CDM node is allowed. Otherwise, memory would be allocated
      from other non-CDM nodes that are not permitted by the cpuset.
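
      A minimal sketch of the intended rule, using the generic nodemask
      helpers; is_cdm_node() is the helper from the CDM patch set and the
      function below is illustrative, not the literal openEuler hunk:

        /*
         * A CDM node may serve the request only when the MPOL_BIND
         * nodemask explicitly names it; non-CDM nodes follow the usual
         * cpuset rules.
         */
        static bool cdm_alloc_allowed(int nid, const nodemask_t *bind_mask)
        {
        	return !is_cdm_node(nid) || node_isset(nid, *bind_mask);
        }
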
      Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
      61e43f49
    • mm: fix ignore cpuset enforcement · 00a465fb
      Authored by Zhou Guanghui
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I612UG
      CVE: NA
      
      -----------------------------------------------------
      
      The current condition ignores cpuset enforcement whenever
      __GFP_THISNODE is added to the gfp_mask. As a result, allocations that
      specify __GFP_THISNODE but target non-CDM nodes are not subject to
      cpuset restrictions.
      
      For example, procA pid 1000:
      node 0 cpus: 0 1 2 3
      node 0 free: 57199MB
      node 1 cpus: 4 5 6 7
      node 1 free: 55930MB
      
      cpuset/test/cpuset.mems  1
      cpuset/test/tasks        1000
      cpuset/test/cpuset.cpus  0-3
      
      No CDM node exists. When procA mallocs 100MB of memory, the result is:
      node 0 cpus: 0 1 2 3
      node 0 free: 57099MB
      node 1 cpus: 4 5 6 7
      node 1 free: 55930MB
      This is not what we expected; in fact, the 100 MB of memory should be
      allocated from node 1. The reason for this problem is that THP specifies
      __GFP_THISNODE to attempt to allocate from the local node.
      
      Therefore, the cpuset enforcement should be ignored only when explicitly
      allocating memory from the CDM node using __GFP_THISNODE.
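
      A minimal sketch of the corrected condition, assuming the is_cdm_node()
      helper from the CDM patches; this paraphrases the rule rather than
      quoting the actual hunk:

        /*
         * Bypass cpuset enforcement only when __GFP_THISNODE explicitly
         * targets a CDM node; a plain __GFP_THISNODE request (e.g. from
         * THP) stays subject to the cpuset.
         */
        static inline bool cdm_bypass_cpuset(gfp_t gfp_mask, int preferred_nid)
        {
        	return (gfp_mask & __GFP_THISNODE) && is_cdm_node(preferred_nid);
        }
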
      Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
      00a465fb
  2. 18 January 2023, 1 commit
  3. 07 December 2022, 1 commit
  4. 10 November 2022, 1 commit
  5. 26 July 2022, 1 commit
  6. 19 July 2022, 1 commit
  7. 05 July 2022, 1 commit
    • mm/pages_alloc.c: don't create ZONE_MOVABLE beyond the end of a node · cc554362
      Authored by Alistair Popple
      stable inclusion
      from stable-v5.10.110
      commit 7188e7c96f39ae40b8f8d6a807d3f338fb1927ac
      bugzilla: https://gitee.com/openeuler/kernel/issues/I574AL
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7188e7c96f39ae40b8f8d6a807d3f338fb1927ac
      
      --------------------------------
      
      commit ddbc84f3 upstream.
      
      ZONE_MOVABLE uses the remaining memory in each node.  Its starting pfn
      is also aligned to MAX_ORDER_NR_PAGES.  It is possible for the remaining
      memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
      not enough room for ZONE_MOVABLE on that node.
      
      Unfortunately this condition is not checked for.  This leads to
      zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
      node.
      
      calculate_node_totalpages() then sets zone->present_pages to be greater
      than zone->spanned_pages which is invalid, as spanned_pages represents
      the maximum number of pages in a zone assuming no holes.
      
      Subsequently it is possible free_area_init_core() will observe a zone of
      size zero with present pages.  In this case it will skip setting up the
      zone, including the initialisation of free_lists[].
      
      However populated_zone() checks zone->present_pages to see if a zone has
      memory available.  This is used by iterators such as
      walk_zones_in_node().  pagetypeinfo_showfree() uses this to walk the
      free_list of each zone in each node, which are assumed to be initialised
      due to the zone not being empty.
      
      As free_area_init_core() never initialised the free_lists[] this results
      in the following kernel crash when trying to read /proc/pagetypeinfo:
      
        BUG: kernel NULL pointer dereference, address: 0000000000000000
        #PF: supervisor read access in kernel mode
        #PF: error_code(0x0000) - not-present page
        PGD 0 P4D 0
        Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
        CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
        Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
        RIP: 0010:pagetypeinfo_show+0x163/0x460
        Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 <48> 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
        RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
        RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
        RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
        RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
        R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
        R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
        FS:  00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
        Call Trace:
         seq_read_iter+0x128/0x460
         proc_reg_read_iter+0x51/0x80
         new_sync_read+0x113/0x1a0
         vfs_read+0x136/0x1d0
         ksys_read+0x70/0xf0
         __x64_sys_read+0x1a/0x20
         do_syscall_64+0x3b/0xc0
         entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Fix this by checking that the aligned zone_movable_pfn[] does not exceed
      the end of the node, and if it does skip creating a movable zone on this
      node.
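
      The shape of that check, sketched around the existing
      MAX_ORDER_NR_PAGES alignment in find_zone_movable_pfns_for_nodes()
      (mm/page_alloc.c); this is a paraphrase, not the literal upstream hunk:

        for_each_online_node(nid) {
        	unsigned long start_pfn, end_pfn;

        	/* Align the start of ZONE_MOVABLE as before. */
        	zone_movable_pfn[nid] = roundup(zone_movable_pfn[nid],
        					MAX_ORDER_NR_PAGES);

        	/* If the aligned start falls past the node, skip ZONE_MOVABLE
        	 * on this node entirely. */
        	get_pfn_range_for_nid(nid, &start_pfn, &end_pfn);
        	if (zone_movable_pfn[nid] >= end_pfn)
        		zone_movable_pfn[nid] = 0;
        }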
      
      Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
      Fixes: 2a1e274a ("Create the ZONE_MOVABLE zone")
      Signed-off-by: Alistair Popple <apopple@nvidia.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Yu Liao <liaoyu15@huawei.com>
      Reviewed-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      cc554362
  8. 27 April 2022, 2 commits
    • mm/page_alloc.c: do not warn allocation failure on zone DMA if no managed pages · 9e757375
      Authored by Baoquan He
      stable inclusion
      from stable-v5.10.94
      commit 6c6f86bb618b73007dc2bc8d4b4003f80ba1efeb
      bugzilla: https://gitee.com/openeuler/kernel/issues/I531X9
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6c6f86bb618b73007dc2bc8d4b4003f80ba1efeb
      
      --------------------------------
      
      commit c4dc63f0 upstream.
      
      In kdump kernel of x86_64, page allocation failure is observed:
      
       kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
       Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
       Workqueue: events_unbound async_run_entry_fn
       Call Trace:
        <TASK>
        dump_stack_lvl+0x48/0x5e
        warn_alloc.cold+0x72/0xd6
        __alloc_pages_slowpath.constprop.0+0xc69/0xcd0
        __alloc_pages+0x1df/0x210
        new_slab+0x389/0x4d0
        ___slab_alloc+0x58f/0x770
        __slab_alloc.constprop.0+0x4a/0x80
        kmem_cache_alloc_trace+0x24b/0x2c0
        sr_probe+0x1db/0x620
        ......
        device_add+0x405/0x920
        ......
        __scsi_add_device+0xe5/0x100
        ata_scsi_scan_host+0x97/0x1d0
        async_run_entry_fn+0x30/0x130
        process_one_work+0x1e8/0x3c0
        worker_thread+0x50/0x3b0
        ? rescuer_thread+0x350/0x350
        kthread+0x16b/0x190
        ? set_kthread_struct+0x40/0x40
        ret_from_fork+0x22/0x30
        </TASK>
       Mem-Info:
       ......
      
      The above failure happened when calling kmalloc() to allocate a buffer
      with GFP_DMA.  It requests a slab page from the DMA zone while there are
      no managed pages at all in there.
      
       sr_probe()
       --> get_capabilities()
           --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);
      
      Because in the current kernel, dma-kmalloc will be created as long as
      CONFIG_ZONE_DMA is enabled.  However, kdump kernel of x86_64 doesn't have
      managed pages on DMA zone since commit 6f599d84 ("x86/kdump: Always
      reserve the low 1M when the crashkernel option is specified").  The
      failure can be always reproduced.
      
      For now, let's mute the allocation failure warning when pages are
      requested from the DMA zone while it has no managed pages.
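
      A sketch of the muting, paraphrased from the description above:
      warn_alloc() gains one more reason to stay quiet, relying on the
      has_managed_dma() helper added by the next patch in this series.

        /* Inside warn_alloc() (mm/page_alloc.c), sketched: */
        if ((gfp_mask & __GFP_NOWARN) ||
            !__ratelimit(&nopage_rs) ||
            ((gfp_mask & __GFP_DMA) && !has_managed_dma()))
        	return;	/* GFP_DMA failure is expected without managed DMA pages */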
      
      [akpm@linux-foundation.org: fix warning]
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      9e757375
    • mm_zone: add function to check if managed dma zone exists · f572e27a
      Authored by Baoquan He
      stable inclusion
      from stable-v5.10.94
      commit d2e572411738a5aad67901caef8e083fb9df29fd
      bugzilla: https://gitee.com/openeuler/kernel/issues/I531X9
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=d2e572411738a5aad67901caef8e083fb9df29fd
      
      --------------------------------
      
      commit 62b31070 upstream.
      
      Patch series "Handle warning of allocation failure on DMA zone w/o
      managed pages", v4.
      
      **Problem observed:
      On x86_64, when crash is triggered and entering into kdump kernel, page
      allocation failure can always be seen.
      
       ---------------------------------
       DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
       swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
       CPU: 0 PID: 1 Comm: swapper/0
       Call Trace:
        dump_stack+0x7f/0xa1
        warn_alloc.cold+0x72/0xd6
        ......
        __alloc_pages+0x24d/0x2c0
        ......
        dma_atomic_pool_init+0xdb/0x176
        do_one_initcall+0x67/0x320
        ? rcu_read_lock_sched_held+0x3f/0x80
        kernel_init_freeable+0x290/0x2dc
        ? rest_init+0x24f/0x24f
        kernel_init+0xa/0x111
        ret_from_fork+0x22/0x30
       Mem-Info:
       ------------------------------------
      
      ***Root cause:
      In the current kernel, it is assumed that the DMA zone must have managed
      pages, and pages are requested from it if CONFIG_ZONE_DMA is enabled.
      However, this is not always true. E.g. in the kdump kernel of x86_64,
      only the low 1M is present and locked down at a very early stage of boot,
      so this low 1M won't be added into the buddy allocator to become managed
      pages of the DMA zone. This exception will always cause a page allocation
      failure if a page is requested from the DMA zone.
      
      ***Investigation:
      This failure happens since below commit merged into linus's tree.
        1a6a9044 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
        23721c8e x86/crash: Remove crash_reserve_low_1M()
        f1d4d47c x86/setup: Always reserve the first 1M of RAM
        7c321eb2 x86/kdump: Remove the backup region handling
        6f599d84 x86/kdump: Always reserve the low 1M when the crashkernel option is specified
      
      Before them, on x86_64, the low 640K area would be reused by the kdump
      kernel. So in the kdump kernel, the content of the low 640K area is
      copied into a backup region for dumping before jumping into kdump. Then,
      except for the firmware-reserved regions in [0, 640K], the remaining area
      is added into the buddy allocator to become available managed pages of
      the DMA zone.

      However, after the above commits are applied, in the kdump kernel of
      x86_64 the low 1M is reserved by memblock but not released to the buddy
      allocator. So any later page allocation requested from the DMA zone will
      fail.
      
      At the beginning, if crashkernel is reserved, the low 1M needs to be
      locked down because AMD SME encrypts memory, making the old backup region
      mechanism impossible when switching into the kdump kernel.
      
      Later, it was also observed that there are BIOSes corrupting memory
      under 1M. To solve this, in commit f1d4d47c, the entire region of
      low 1M is always reserved after the real mode trampoline is allocated.
      
      Besides, an Intel engineer recently mentioned that their TDX (Trusted
      Domain Extensions), which is under development in the kernel, also needs
      to lock down the low 1M. So we can't simply revert the above commits to
      fix the page allocation failure from the DMA zone, as someone suggested.
      
      ***Solution:
      Currently, only DMA atomic pool and dma-kmalloc will initialize and
      request page allocation with GFP_DMA during bootup.
      
      So only initialize the DMA atomic pool when the DMA zone has available
      managed pages; otherwise just skip the initialization.
      
      For dma-kmalloc(), for the time being, let's mute the warning of
      allocation failure if requesting pages from the DMA zone while it has no
      managed pages.  Meanwhile, change code to use the dma_alloc_xx/dma_map_xx
      APIs to
      replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling kmalloc() if
      not necessary.  Christoph is posting patches to fix those under
      drivers/scsi/.  Finally, we can remove the need of dma-kmalloc() as people
      suggested.
      
      This patch (of 3):
      
      In some places in the current kernel, it is assumed that the DMA zone
      must have managed pages if CONFIG_ZONE_DMA is enabled.  However, this is
      not always true.  E.g. in the kdump kernel of x86_64, only the low 1M is
      present and locked down at a very early stage of boot, so there are no
      managed pages at all in the DMA zone.  This exception will always cause a
      page allocation failure if a page is requested from the DMA zone.
      
      Here, add the function has_managed_dma() and the relevant helpers to
      check whether there is a DMA zone with managed pages.  It will be used in
      later patches.
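
      A sketch of the helper, close to what the description above implies
      (iterate the online nodes and test each DMA zone for managed pages);
      treat it as a paraphrase rather than the exact upstream body:

        #ifdef CONFIG_ZONE_DMA
        bool has_managed_dma(void)
        {
        	struct pglist_data *pgdat;

        	for_each_online_pgdat(pgdat) {
        		struct zone *zone = &pgdat->node_zones[ZONE_DMA];

        		if (managed_zone(zone))
        			return true;
        	}
        	return false;
        }
        #endif /* CONFIG_ZONE_DMA */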
      
      Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
      Fixes: 6f599d84 ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: John Donnelly <john.p.donnelly@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
       Conflicts:
      	mm/page_alloc.c
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
      f572e27a
  9. 20 March 2022, 1 commit
  10. 23 February 2022, 3 commits
    • mm: Introduce reliable flag for user task · 8ee6e050
      Authored by Peng Wu
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      ------------------------------------------
      
      Add a reliable flag for user tasks. A user task with the reliable flag
      set can only allocate memory from the mirrored region. PF_RELIABLE is
      added to represent the task's reliable flag.

      - The init task is regarded as a special task and allocates memory from
        the mirrored region.

      - For normal user tasks, the reliable flag can be set via the procfs
        interface shown below and is inherited across fork().
      
      User can change a user task's reliable flag by
      
      	$ echo [0/1] > /proc/<pid>/reliable
      
      and check a user task's reliable flag by
      
      	$ cat /proc/<pid>/reliable
      
      Note, global init task's reliable file can not be accessed.
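
      A sketch only of how the flag can steer allocations: PF_RELIABLE is the
      flag added here, ___GFP_RELIABLE comes from the companion "mm: Introduce
      memory reliable" patch, and the exact wiring in the real series may
      differ.

        static inline gfp_t apply_task_reliable_flag(gfp_t gfp)
        {
        	/* Reliable tasks request memory from the mirrored region. */
        	if (current->flags & PF_RELIABLE)
        		gfp |= ___GFP_RELIABLE;
        	return gfp;
        }
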
      Signed-off-by: Peng Wu <wupeng58@huawei.com>
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      8ee6e050
    • mm: Introduce memory reliable · 6c59ddf2
      Authored by Ma Wupeng
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      --------------------------------
      
      Introduction
      ============
      
      The memory reliable feature is a memory tiering mechanism. It is based
      on the kernel mirror feature, which splits memory into two separate
      regions: the mirrored (reliable) region and the non-mirrored
      (non-reliable) region.

      For the kernel mirror feature:

      - kernel memory is allocated from the mirrored region by default
      - user memory is allocated from the non-mirrored region by default

      The non-mirrored region will be arranged into ZONE_MOVABLE.

      The memory reliable feature adds the behaviors below:

      - normal user tasks never allocate memory from the mirrored region via
        userspace APIs (malloc, mmap, etc.)
      - special user tasks will allocate memory from the mirrored region by
        default
      - tmpfs/pagecache allocate memory from the mirrored region by default
      - there is an upper limit on the mirrored region allocated for user
        tasks, tmpfs and pagecache
      
      A reliable fallback mechanism is supported, which allows special user
      tasks, tmpfs and pagecache to fall back to allocating from the
      non-mirrored region; this is the default setting.

      In order to fulfil these goals:

      - the ___GFP_RELIABLE flag is added for allocating memory from the
        mirrored region.

      - the high_zoneidx for special user tasks/tmpfs/pagecache is set to
        ZONE_NORMAL.

      - normal user tasks can only allocate from ZONE_MOVABLE.
      
      This patch is just the main framework; memory reliable support for
      special user tasks, pagecache and tmpfs has its own patches.
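
      A minimal sketch of the zone selection implied by the points above;
      mem_reliable_is_enabled() is an assumed helper name, not necessarily the
      one used by the series:

        static inline enum zone_type reliable_gfp_zone(gfp_t gfp)
        {
        	/* ___GFP_RELIABLE requests are capped at ZONE_NORMAL (mirrored);
        	 * everything else keeps the normal gfp_zone() mapping, so plain
        	 * user allocations still land in ZONE_MOVABLE (non-mirrored). */
        	if (mem_reliable_is_enabled() && (gfp & ___GFP_RELIABLE))
        		return ZONE_NORMAL;
        	return gfp_zone(gfp);
        }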
      
      To enable this function, mirrored (reliable) memory is needed and
      "kernelcore=reliable" should be added to the kernel parameters.
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      6c59ddf2
    • efi: Disable mirror feature if kernelcore is not specified · 856090e5
      Authored by Ma Wupeng
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      --------------------------------
      
      With this patch, the kernel will check mirrored_kernelcore before
      calling efi_find_mirror(), which enables the basic mirror feature.

      If the system has some mirrored memory and the mirror feature is not
      specified in the boot parameters, the basic mirror feature would be
      enabled anyway, which leads to the following situations:

      - memblock memory allocation prefers the mirrored region. This may have
        some unexpected influence on NUMA affinity.

      - contiguous memory will be split into several parts if parts of it are
        mirrored memory, via memblock_mark_mirror().
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      856090e5
  11. 11 February 2022, 2 commits
    • mm/page_alloc: use accumulated load when building node fallback list · f532b284
      Authored by Krupa Ramakrishnan
      mainline inclusion
      from mainline-v5.16-rc1
      commit 54d032ce
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4T0ML
      CVE: NA
      
      -----------------------------------------------
      
      In build_zonelists(), when the fallback list is built for the nodes, the
      node load gets reinitialized during each iteration.  This results in
      nodes with the same distance occupying the same slot in different node
      fallback lists rather than appearing in the intended round-robin manner.
      This results in one node getting picked for allocation more often than
      other nodes with the same distance.
      
      As an example, consider a 4 node system with the following distance
      matrix.
      
        Node 0  1  2  3
        ----------------
        0    10 12 32 32
        1    12 10 32 32
        2    32 32 10 12
        3    32 32 12 10
      
      For this case, the node fallback list gets built like this:
      
        Node  Fallback list
        ---------------------
        0     0 1 2 3
        1     1 0 3 2
        2     2 3 0 1
        3     3 2 0 1 <-- Unexpected fallback order
      
      In the fallback list for nodes 2 and 3, the nodes 0 and 1 appear in the
      same order which results in more allocations getting satisfied from node
      0 compared to node 1.
      
      The effect of this on remote memory bandwidth as seen by stream
      benchmark is shown below:
      
        Case 1: Bandwidth from cores on nodes 2 & 3 to memory on nodes 0 & 1
      	(numactl -m 0,1 ./stream_lowOverhead ... --cores <from 2, 3>)
        Case 2: Bandwidth from cores on nodes 0 & 1 to memory on nodes 2 & 3
      	(numactl -m 2,3 ./stream_lowOverhead ... --cores <from 0, 1>)
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	57479.6		110791.8
           SCALE	55372.9		105685.9
             ADD	50460.6		96734.2
          TRIADD	50397.6		97119.1
        ----------------------------------------
      
      The bandwidth drop in Case 1 occurs because most of the allocations get
      satisfied by node 0 as it appears first in the fallback order for both
      nodes 2 and 3.
      
      This can be fixed by accumulating the node load in build_zonelists()
      rather than reinitializing it during each iteration.  With this, the
      nodes with the same distance are rightly assigned in round-robin manner.
      
      In fact this was how it was originally until commit f0c0b2b8
      ("change zonelist order: zonelist order selection logic") dropped the
      load accumulation and resorted to initializing the load during each
      iteration.
      
      While zonelist ordering was removed by commit c9bff3ee ("mm,
      page_alloc: rip out ZONELIST_ORDER_ZONE"), the change to the node load
      accumulation in build_zonelists() remained.  So essentially this patch
      reverts back to the accumulated node load logic.
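
      The essence of the change in build_zonelists() (mm/page_alloc.c), shown
      as a paraphrased sketch rather than the literal diff:

        while ((node = find_next_best_node(local_node, &used_mask)) >= 0) {
        	/* Penalize the first node of each distance group so that nodes
        	 * at the same distance rotate round-robin across the per-node
        	 * fallback lists. */
        	if (node_distance(local_node, node) !=
        	    node_distance(local_node, prev_node))
        		node_load[node] += load;  /* was: node_load[node] = load; */

        	node_order[nr_nodes++] = node;
        	prev_node = node;
        	load--;
        }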
      
      After this fix, the fallback order gets built like this:
      
        Node Fallback list
        ------------------
        0    0 1 2 3
        1    1 0 3 2
        2    2 3 0 1
        3    3 2 1 0 <-- Note the change here
      
      The bandwidth in Case 1 improves and matches Case 2 as shown below.
      
        ----------------------------------------
      		BANDWIDTH (MB/s)
            TEST	Case 1		Case 2
        ----------------------------------------
            COPY	110438.9	110107.2
           SCALE	105930.5	105817.5
             ADD	97005.1		96159.8
          TRIADD	97441.5		96757.1
        ----------------------------------------
      
      The correctness of the fallback list generation has been verified for
      the above node configuration where the node 3 starts as memory-less node
      and comes up online only during memory hotplug.
      
      [bharata@amd.com: Added changelog, review, test validation]
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-3-bharata@amd.com
      Fixes: f0c0b2b8 ("change zonelist order: zonelist order selection logic")
      Signed-off-by: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
      Co-developed-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: Bharata B Rao <bharata@amd.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      f532b284
    • mm/page_alloc: print node fallback order · 262a7219
      Authored by Bharata B Rao
      mainline inclusion
      from mainline-v5.16-rc1
      commit 6cf25392
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4T0ML
      CVE: NA
      
      -------------------------------------------------
      
      Patch series "Fix NUMA nodes fallback list ordering".
      
      For a NUMA system that has multiple nodes at same distance from other
      nodes, the fallback list generation prefers same node order for them
      instead of round-robin thereby penalizing one node over others.  This
      series fixes it.
      
      More description of the problem and the fix is present in the patch
      description.
      
      This patch (of 2):
      
      Print information message about the allocation fallback order for each
      NUMA node during boot.
      
      No functional changes here.  This makes it easier to illustrate the
      problem in the node fallback list generation, which the next patch
      fixes.
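
      Roughly what this looks like, as a sketch of a pr_info() added at the
      end of build_zonelists(); the exact format string is paraphrased:

        pr_info("Fallback order for Node %d: ", local_node);
        for (i = 0; i < nr_nodes; i++)
        	pr_cont("%d ", node_order[i]);
        pr_cont("\n");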
      
      Link: https://lkml.kernel.org/r/20210830121603.1081-1-bharata@amd.com
      Link: https://lkml.kernel.org/r/20210830121603.1081-2-bharata@amd.com
      Signed-off-by: Bharata B Rao <bharata@amd.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Krupa Ramakrishnan <krupa.ramakrishnan@amd.com>
      Cc: Sadagopan Srinivasan <Sadagopan.Srinivasan@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Peng Liu <liupeng256@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      262a7219
  12. 10 February 2022, 1 commit
  13. 19 January 2022, 3 commits
  14. 14 January 2022, 1 commit
    • hugetlb: address ref count racing in prep_compound_gigantic_page · df906dae
      Authored by Mike Kravetz
      mainline inclusion
      from mainline-v5.14-rc1
      commit 7118fc29
      category: bugfix
      bugzilla: 171843
      
      -----------------------------------------------
      
      In [1], Jann Horn points out a possible race between
      prep_compound_gigantic_page and __page_cache_add_speculative.  The root
      cause of the possible race is prep_compound_gigantic_page unconditionally
      setting the ref count of pages to zero.  It does this because
      prep_compound_gigantic_page is handed a 'group' of pages from an allocator
      and needs to convert that group of pages to a compound page.  The ref
      count of each page in this 'group' is one as set by the allocator.
      However, the ref count of compound page tail pages must be zero.
      
      The potential race comes about when ref counted pages are returned from
      the allocator.  When this happens, other mm code could also take a
      reference on the page.  __page_cache_add_speculative is one such example.
      Therefore, prep_compound_gigantic_page can not just set the ref count of
      pages to zero as it does today.  Doing so would lose the reference taken
      by any other code.  This would lead to BUGs in code checking ref counts
      and could possibly even lead to memory corruption.
      
      There are two possible ways to address this issue.
      
      1) Make all allocators of gigantic groups of pages be able to return a
         properly constructed compound page.
      
      2) Make prep_compound_gigantic_page be more careful when constructing a
         compound page.
      
      This patch takes approach 2.
      
      In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
      if it is one.  If the cmpxchg fails, call synchronize_rcu() in the hope
      that the extra ref count will be dropped during an RCU grace period.
      This is not a performance critical code path and the wait should be
      acceptable.  If the ref count is still inflated after the grace period,
      then undo any modifications made and return an error.
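
      A sketch of the per-tail-page handling just described, built on
      page_ref_freeze(); it paraphrases the approach in this description
      rather than quoting the final patch:

        /* Demote a would-be tail page's ref count from 1 to 0, respecting
         * any speculative reference that raced with us. */
        static bool freeze_tail_page(struct page *p)
        {
        	if (page_ref_freeze(p, 1))
        		return true;
        	/* Someone (e.g. __page_cache_add_speculative) holds an extra
        	 * ref: wait a grace period, retry once, then give up. */
        	synchronize_rcu();
        	return page_ref_freeze(p, 1);
        }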
      
      Currently prep_compound_gigantic_page is type void and does not return
      errors.  Modify the two callers to check for and handle error returns.  On
      error, the caller must free the 'group' of pages as they can not be used
      to form a gigantic page.  After freeing pages, the runtime caller
      (alloc_fresh_huge_page) will retry the allocation once.  Boot time
      allocations can not be retried.
      
      The routine prep_compound_page also unconditionally sets the ref count of
      compound page tail pages to zero.  However, in this case the buddy
      allocator is constructing a compound page from freshly allocated pages.
      The ref count on those freshly allocated pages is already zero, so the
      set_page_count(p, 0) is unnecessary and could lead to confusion.  Just
      remove it.
      
      [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/
      
      Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
      Fixes: 58a84aa9 ("thp: set compound tail page _count to zero")
      Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
      Reported-by: Jann Horn <jannh@google.com>
      Cc: Youquan Song <youquan.song@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Muchun Song <songmuchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Chen Huang <chenhuang5@huawei.com>
      Reviewed-by: Chen Wandun <chenwandun@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      df906dae
  15. 08 January 2022, 1 commit
  16. 29 December 2021, 7 commits
  17. 29 November 2021, 5 commits
    • mm: Be allowed to alloc CDM node memory for MPOL_BIND · c1434668
      Authored by Lijun Fang
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
      CVE: NA
      -----------------
      
      CDM nodes should not be part of mems_allowed. However, allocating from a
      CDM node must be allowed when mpol->mode is MPOL_BIND.
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      c1434668
    • mm: Ignore cpuset enforcement when allocation flag has __GFP_THISNODE · b49907cc
      Authored by Anshuman Khandual
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
      CVE: NA
      -------------------
      
      __GFP_THISNODE specifically asks for the memory to be allocated from the
      given node. Not all requests that end up in __alloc_pages_nodemask()
      originate from process context, where cpuset makes more sense. The
      current condition enforces the cpuset limitation on every allocation,
      whether it originates from process context or not, which prevents
      __GFP_THISNODE-mandated allocations from coming from the specified node.
      In the context of a coherent device memory node, which is isolated from
      every cpuset nodemask in the system, this blocks the only way of
      allocating into it; that behavior is changed with this patch.
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      b49907cc
    • mm: Enable Buddy allocation isolation for CDM nodes · a8cc75dd
      Authored by Anshuman Khandual
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
      CVE: NA
      -------------------
      
      This implements allocation isolation for CDM nodes in the buddy
      allocator by always discarding CDM memory zones, except when the gfp
      flag has __GFP_THISNODE or when the nodemask, where it is non-NULL,
      contains CDM nodes (explicit allocation requests in the kernel, or user
      process MPOL_BIND policy based requests).
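
      A sketch of the zone filter this describes; is_cdm_node() and
      nodemask_has_cdm() are helpers assumed from this series, and the exact
      placement inside the zonelist iteration is not shown:

        static inline bool cdm_zone_allowed(struct zone *zone, gfp_t gfp,
        				    nodemask_t *nodemask)
        {
        	if (!is_cdm_node(zone_to_nid(zone)))
        		return true;		/* non-CDM zones: always eligible */
        	if (gfp & __GFP_THISNODE)
        		return true;		/* explicit node-targeted request */
        	/* Otherwise only when the caller's nodemask names a CDM node,
        	 * e.g. an MPOL_BIND policy covering this node. */
        	return nodemask && nodemask_has_cdm(*nodemask);
        }
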
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      a8cc75dd
    • mm: Change generic FALLBACK zonelist creation process · 5f387449
      Authored by Anshuman Khandual
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
      CVE: NA
      -------------------
      
      Kernel allocation to a CDM node has already been prevented by putting its
      entire memory in ZONE_MOVABLE. But the CDM nodes must also be isolated
      from implicit allocations happening on the system.

      An isolation-seeking CDM node requires isolation from implicit memory
      allocations from user space, but at the same time there should also be
      an explicit way to do the memory allocation.

      Both of a platform node's zonelists are fundamental to where memory comes
      from when there is an allocation request. In order to achieve the two
      objectives stated above, the zonelist building process has to change, as
      both zonelists (i.e. FALLBACK and NOFALLBACK) give access to the node's
      memory zones during any kind of memory allocation. The following changes
      are implemented in this regard.
      
      * CDM node's zones are not part of any other node's FALLBACK zonelist
      * CDM node's FALLBACK list contains its own memory zones followed by
        all system RAM zones in regular order as before
      * CDM node's zones are part of its own NOFALLBACK zonelist
      
      These above changes ensure the following which in turn isolates the CDM
      nodes as desired.
      
      * There won't be any implicit memory allocation ending up in the CDM node
      * Only __GFP_THISNODE marked allocations will come from the CDM node
      * CDM node memory can be allocated through mbind(MPOL_BIND) interface
      * System RAM will be used as a fallback option in regular order in case
        the CDM memory is insufficient during a targeted allocation request
      
      Sample zonelist configuration:
      
      [NODE (0)]						RAM
              ZONELIST_FALLBACK (0xc00000000140da00)
                      (0) (node 0) (DMA     0xc00000000140c000)
                      (1) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc000000001411a10)
                      (0) (node 0) (DMA     0xc00000000140c000)
      [NODE (1)]						RAM
              ZONELIST_FALLBACK (0xc000000100001a00)
                      (0) (node 1) (DMA     0xc000000100000000)
                      (1) (node 0) (DMA     0xc00000000140c000)
              ZONELIST_NOFALLBACK (0xc000000100005a10)
                      (0) (node 1) (DMA     0xc000000100000000)
      [NODE (2)]						CDM
              ZONELIST_FALLBACK (0xc000000001427700)
                      (0) (node 2) (Movable 0xc000000001427080)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc00000000142b710)
                      (0) (node 2) (Movable 0xc000000001427080)
      [NODE (3)]						CDM
              ZONELIST_FALLBACK (0xc000000001431400)
                      (0) (node 3) (Movable 0xc000000001430d80)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc000000001435410)
                      (0) (node 3) (Movable 0xc000000001430d80)
      [NODE (4)]						CDM
              ZONELIST_FALLBACK (0xc00000000143b100)
                      (0) (node 4) (Movable 0xc00000000143aa80)
                      (1) (node 0) (DMA     0xc00000000140c000)
                      (2) (node 1) (DMA     0xc000000100000000)
              ZONELIST_NOFALLBACK (0xc00000000143f110)
                      (0) (node 4) (Movable 0xc00000000143aa80)
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      5f387449
    • mm: Define coherent device memory (CDM) node · 4cda25d3
      Authored by Anshuman Khandual
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
      CVE: NA
      -------------------
      
      There are certain devices, such as specialized accelerators, GPU cards,
      network cards, and FPGA cards, which might contain onboard memory that is
      coherent with the existing system RAM when accessed either from the CPU
      or from the device. This memory shares some properties with normal system
      RAM but can also differ from it in other respects.
      
      User applications might be interested in using this kind of coherent device
      memory explicitly or implicitly alongside the system RAM, utilizing all
      possible core memory functions like anon mapping (LRU), file mapping (LRU),
      page cache (LRU), driver managed (non LRU), HW poisoning, NUMA migrations
      etc. To achieve this kind of tight integration with core memory subsystem,
      the device onboard coherent memory must be represented as a memory only
      NUMA node. At the same time, the arch must export some kind of function
      to identify this node as coherent device memory rather than any other
      regular CPU-less, memory-only NUMA node.
      
      After achieving the integration with core memory subsystem coherent device
      memory might still need some special consideration inside the kernel. There
      can be a variety of coherent memory nodes with different expectations from
      the core kernel memory. But right now only one kind of special treatment is
      considered which requires certain isolation.
      
      Now consider the case of a coherent device memory node type which requires
      isolation. This kind of coherent memory is onboard an external device
      attached to the system through a link where there is always a chance of a
      link failure taking down the entire memory node with it. Moreover, the
      memory might also have a higher chance of ECC failure compared to the
      system RAM. Hence allocation into this kind of coherent memory node should
      be regulated. Kernel allocations must not come here. Normal user space
      allocations too should not come here implicitly (without user application
      knowing about it). This summarizes isolation requirement of certain kind of
      coherent device memory node as an example. There can be different kinds of
      isolation requirement also.
      
      Some coherent memory devices might not require isolation at all. Other
      coherent memory devices might require some other special treatment after
      becoming part of the core memory representation. For now, we will look
      into the isolation-seeking coherent device memory node, not the other
      ones.
      
      To implement the integration as well as isolation, the coherent memory node
      must be present in N_MEMORY and a new N_COHERENT_DEVICE node mask inside
      the node_states[] array. During memory hotplug operations, the new nodemask
      N_COHERENT_DEVICE is updated along with N_MEMORY for these coherent device
      memory nodes. This also creates the following new sysfs based interface to
      list down all the coherent memory nodes of the system.
      
      	/sys/devices/system/node/is_cdm_node
      
      Architectures must export function arch_check_node_cdm() which identifies
      any coherent device memory node in case they enable CONFIG_COHERENT_DEVICE.
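
      A sketch of the arch hook's shape; the stub for architectures without
      CONFIG_COHERENT_DEVICE is an assumption about the interface, not a quote
      of the patch:

        #ifdef CONFIG_COHERENT_DEVICE
        /* Return non-zero if @nid is backed by coherent device memory. */
        int arch_check_node_cdm(int nid);
        #else
        static inline int arch_check_node_cdm(int nid) { return 0; }
        #endif
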
      Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      4cda25d3
  18. 30 October 2021, 3 commits
  19. 15 October 2021, 1 commit
  20. 12 October 2021, 1 commit
  21. 26 September 2021, 1 commit