1. 06 Mar, 2023 1 commit
  2. 22 Nov, 2022 1 commit
  3. 11 Nov, 2022 1 commit
  4. 20 Sep, 2022 1 commit
  5. 01 Sep, 2022 1 commit
  6. 28 Jul, 2022 2 commits
  7. 17 Jul, 2022 1 commit
  8. 08 Jul, 2022 1 commit
  9. 10 May, 2022 1 commit
    • mm: export collect_procs() · df0fbb2a
      Zhang Jian authored
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I53VVE
      CVE: NA
      
      -------------------------------------------------
      
      Collect the processes that have the page mapped via collect_procs().
      
      @page: if the page is part of a hugepage/compound page, the caller
      must use compound_head() to find its head page (to prevent a kernel
      panic) and must hold the page lock.
      
      @to_kill: the function returns a linked list; once the caller is done
      with the list, every entry must be freed with kfree().
      
      @force_early: set it to true to find all processes; if it is false,
      the function only returns processes marked with PF_MCE_PROCESS or
      PF_MCE_EARLY.
      
      Limits: if force_early is true, sysctl_memory_failure_early_kill has
      no effect. If force_early is false, no process has the PF_MCE_PROCESS
      or PF_MCE_EARLY flag, and sysctl_memory_failure_early_kill is enabled,
      the function returns all tasks regardless of whether they have the
      PF_MCE_PROCESS or PF_MCE_EARLY flag.
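      
      As an illustration only (not part of this patch), a caller of the
      exported helper might look like the sketch below; the exact exported
      prototype and the to_kill entry layout are assumptions here:
      
          struct page *head = compound_head(page); /* compound pages: use the head page */
          LIST_HEAD(tokill);                       /* filled in by collect_procs() */
          struct to_kill *tk, *next;
      
          lock_page(head);                         /* the page must be locked */
          collect_procs(head, &tokill, true);      /* force_early: ignore PF_MCE_* marks */
          unlock_page(head);
      
          list_for_each_entry_safe(tk, next, &tokill, nd) {
                  pr_info("pid %d maps the page\n", tk->tsk->pid);
                  kfree(tk);                       /* the caller must kfree every entry */
          }
      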
      Signed-off-by: Zhang Jian <zhangjian210@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  10. 23 Feb, 2022 2 commits
    • mm: Introduce memory reliable · 6c59ddf2
      Ma Wupeng authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      --------------------------------
      
      Introduction
      ============
      
      The memory reliable feature is a memory tiering mechanism. It is based
      on the kernel mirror feature, which splits memory into two separate
      regions: a mirrored (reliable) region and a non-mirrored (non-reliable)
      region.
      
      For the kernel mirror feature:
      
      - kernel memory is allocated from the mirrored region by default
      - user memory is allocated from the non-mirrored region by default
      
      The non-mirrored region is arranged into ZONE_MOVABLE.
      
      The memory reliable feature adds the following on top:
      
      - normal user tasks never allocate memory from the mirrored region via
        userspace APIs (malloc, mmap, etc.)
      - special user tasks allocate memory from the mirrored region by default
      - tmpfs/pagecache allocate memory from the mirrored region by default
      - an upper limit on mirrored memory allocated for user tasks, tmpfs and
        the pagecache
      
      A reliable fallback mechanism allows special user tasks, tmpfs and the
      pagecache to fall back to the non-mirrored region; this is the default
      setting.
      
      To achieve this:
      
      - a ___GFP_RELIABLE flag is added to allocate memory from the mirrored
        region.
      
      - the high_zoneidx for special user tasks/tmpfs/pagecache is set to
        ZONE_NORMAL.
      
      - normal user tasks can only allocate from ZONE_MOVABLE.
      
      This patch is just the main framework; memory reliable support for
      special user tasks, pagecache and tmpfs comes in its own patches.
      
      To enable this feature, mirrored (reliable) memory is needed and
      "kernelcore=reliable" must be added to the kernel parameters.
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • efi: Disable mirror feature if kernelcore is not specified · 856090e5
      Ma Wupeng authored
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4PM01
      CVE: NA
      
      --------------------------------
      
      With this patch, the kernel checks mirrored_kernelcore before calling
      efi_find_mirror(), which enables the basic mirror feature.
      
      If the system has some mirrored memory and the mirror feature is not
      requested on the kernel command line, the basic mirror feature would
      otherwise be enabled, which leads to the following situations:
      
      - memblock memory allocation prefers the mirrored region. This may
        have some unexpected influence on NUMA affinity.
      
      - contiguous memory will be split into several parts if parts of it
        are mirrored memory, via memblock_mark_mirror().
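      
      A rough sketch of the gating described above (the exact call site and
      surrounding code are assumptions, not the literal hunk):
      
          /*
           * Only scan EFI descriptors for mirrored ranges when the user
           * explicitly asked for a mirrored kernelcore; otherwise leave the
           * basic mirror feature disabled.
           */
          if (mirrored_kernelcore)
                  efi_find_mirror();   /* marks ranges via memblock_mark_mirror() */
      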
      Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
      Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  11. 22 Feb, 2022 1 commit
  12. 31 Dec, 2021 2 commits
  13. 30 Dec, 2021 3 commits
  14. 27 Dec, 2021 1 commit
    • mm: Add kvrealloc() · 97316767
      Dave Chinner authored
      mainline-inclusion
      from mainline-v5.14-rc4
      commit de2860f4
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4KIAO
      CVE: NA
      
      Reference:
      https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=de2860f4636256836450c6543be744a50118fc66
      
      -------------------------------------------------
      
      During log recovery of an XFS filesystem with 64kB directory
      buffers, rebuilding a buffer split across two log records results
      in a memory allocation warning from krealloc like this:
      
      xfs filesystem being mounted at /mnt/scratch supports timestamps until 2038 (0x7fffffff)
      XFS (dm-0): Unmounting Filesystem
      XFS (dm-0): Mounting V5 Filesystem
      XFS (dm-0): Starting recovery (logdev: internal)
      ------------[ cut here ]------------
      WARNING: CPU: 5 PID: 3435170 at mm/page_alloc.c:3539 get_page_from_freelist+0xdee/0xe40
      .....
      RIP: 0010:get_page_from_freelist+0xdee/0xe40
      Call Trace:
       ? complete+0x3f/0x50
       __alloc_pages+0x16f/0x300
       alloc_pages+0x87/0x110
       kmalloc_order+0x2c/0x90
       kmalloc_order_trace+0x1d/0x90
       __kmalloc_track_caller+0x215/0x270
       ? xlog_recover_add_to_cont_trans+0x63/0x1f0
       krealloc+0x54/0xb0
       xlog_recover_add_to_cont_trans+0x63/0x1f0
       xlog_recovery_process_trans+0xc1/0xd0
       xlog_recover_process_ophdr+0x86/0x130
       xlog_recover_process_data+0x9f/0x160
       xlog_recover_process+0xa2/0x120
       xlog_do_recovery_pass+0x40b/0x7d0
       ? __irq_work_queue_local+0x4f/0x60
       ? irq_work_queue+0x3a/0x50
       xlog_do_log_recovery+0x70/0x150
       xlog_do_recover+0x38/0x1d0
       xlog_recover+0xd8/0x170
       xfs_log_mount+0x181/0x300
       xfs_mountfs+0x4a1/0x9b0
       xfs_fs_fill_super+0x3c0/0x7b0
       get_tree_bdev+0x171/0x270
       ? suffix_kstrtoint.constprop.0+0xf0/0xf0
       xfs_fs_get_tree+0x15/0x20
       vfs_get_tree+0x24/0xc0
       path_mount+0x2f5/0xaf0
       __x64_sys_mount+0x108/0x140
       do_syscall_64+0x3a/0x70
       entry_SYSCALL_64_after_hwframe+0x44/0xae
      
      Essentially, we are taking a multi-order allocation from kmem_alloc()
      (which has an open coded no fail, no warn loop) and then
      reallocating it out to 64kB using krealloc(__GFP_NOFAIL) and that is
      then triggering the above warning.
      
      This is a regression caused by converting this code from an open
      coded no fail/no warn reallocation loop to using __GFP_NOFAIL.
      
      What we actually need here is kvrealloc(), so that if contiguous
      page allocation fails we fall back to vmalloc() and we don't
      get nasty warnings happening in XFS.
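      
      The helper added here is roughly the following (a sketch based on the
      mainline commit referenced above, not a verbatim copy):
      
          void *kvrealloc(const void *p, size_t oldsize, size_t newsize, gfp_t flags)
          {
                  void *newp;
      
                  if (oldsize >= newsize)
                          return (void *)p;
                  newp = kvmalloc(newsize, flags);   /* may fall back to vmalloc() */
                  if (!newp)
                          return NULL;
                  memcpy(newp, p, oldsize);
                  kvfree(p);
                  return newp;
          }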
      
      Fixes: 771915c4 ("xfs: remove kmem_realloc()")
      Signed-off-by: Dave Chinner <dchinner@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Reviewed-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Darrick J. Wong <djwong@kernel.org>
      Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
      Reviewed-by: Lihong Kou <koulihong@huawei.com>
      Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  15. 10 Dec, 2021 1 commit
  16. 03 Dec, 2021 1 commit
  17. 29 Nov, 2021 2 commits
  18. 11 Nov, 2021 1 commit
  19. 30 Oct, 2021 1 commit
  20. 12 Oct, 2021 1 commit
    • mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 1bfa3cc3
      Hugh Dickins authored
      stable inclusion
      from stable-5.10.47
      commit 0010275ca243e6260893207d41843bb8dc3846e4
      bugzilla: 172973 https://gitee.com/openeuler/kernel/issues/I4DAKB
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0010275ca243e6260893207d41843bb8dc3846e4
      
      --------------------------------
      
      [ Upstream commit 22061a1f ]
      
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but there apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
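      
      A minimal sketch of the truncate-side change (simplified; see the
      upstream commit for the real hunk and for the helper's body):
      
          /* in truncate_cleanup_page(), simplified */
          if (page_mapped(page))
                  unmap_mapping_page(page);   /* page is locked; unmap only this page */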
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up call to truncate_cleanup_page()
      in truncate_inode_pages_range().
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      
      Conflict:
      	mm/truncate.c
      [Backport from mainline 22061a1f]
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  21. 19 Jul, 2021 1 commit
  22. 16 Jul, 2021 1 commit
  23. 14 Jul, 2021 4 commits
  24. 03 Jun, 2021 1 commit
  25. 19 Apr, 2021 1 commit
    • kasan: fix per-page tags for non-page_alloc pages · f2756b96
      Andrey Konovalov authored
      stable inclusion
      from stable-5.10.27
      commit 6e63cc1fe2532d1aa851a540677e29ba802bf071
      bugzilla: 51493
      
      --------------------------------
      
      commit cf10bd4c upstream.
      
      To allow performing tag checks on page_alloc addresses obtained via
      page_address(), tag-based KASAN modes store tags for page_alloc
      allocations in page->flags.
      
      Currently, the default tag value stored in page->flags is 0x00.
      Therefore, page_address() returns a 0x00ffff...  address for pages that
      were not allocated via page_alloc.
      
      This might cause problems.  A particular case we encountered is a
      conflict with KFENCE.  If a KFENCE-allocated slab object is being freed
      via kfree(page_address(page) + offset), the address passed to kfree()
      will get tagged with 0x00 (as slab pages keep the default per-page
      tags).  This leads to is_kfence_address() check failing, and a KFENCE
      object ending up in normal slab freelist, which causes memory
      corruptions.
      
      This patch changes the way KASAN stores tag in page-flags: they are now
      stored xor'ed with 0xff.  This way, KASAN doesn't need to initialize
      per-page flags for every created page, which might be slow.
      
      With this change, page_address() returns natively-tagged (with 0xff)
      pointers for pages that didn't have tags set explicitly.
      
      This patch fixes the encountered conflict with KFENCE and prevents more
      similar issues that can occur in the future.
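      
      A sketch of the accessor change (simplified from the upstream patch;
      the exact guards in this tree may differ):
      
          /* Tags are stored inverted, so an untouched 0x00 in page->flags
           * reads back as the native 0xff tag. */
          static inline u8 page_kasan_tag(const struct page *page)
          {
                  return ((page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK) ^ 0xff;
          }
      
          static inline void page_kasan_tag_set(struct page *page, u8 tag)
          {
                  tag ^= 0xff;
                  page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
                  page->flags |= (tag & KASAN_TAG_MASK) << KASAN_TAG_PGSHIFT;
          }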
      
      Link: https://lkml.kernel.org/r/1a41abb11c51b264511d9e71c303bb16d5cb367b.1615475452.git.andreyknvl@google.com
      Fixes: 2813b9c0 ("kasan, mm, arm64: tag non slab memory allocated via pagealloc")
      Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
      Reviewed-by: Marco Elver <elver@google.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Peter Collingbourne <pcc@google.com>
      Cc: Evgenii Stepanov <eugenis@google.com>
      Cc: Branislav Rankov <Branislav.Rankov@arm.com>
      Cc: Kevin Brodsky <kevin.brodsky@arm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  26. 09 Apr, 2021 2 commits
  27. 18 Jan, 2021 1 commit
    • mm: memmap defer init doesn't work as expected · ad6a5557
      Baoquan He authored
      stable inclusion
      from stable-5.10.5
      commit 98b57685c26d8f41040ecf71e190250fb2eb2a0c
      bugzilla: 46931
      
      --------------------------------
      
      commit dc2da7b4 upstream.
      
      VMware observed a performance regression during memmap init on their
      platform, and bisected to commit 73a6e474 ("mm: memmap_init:
      iterate over memblock regions rather that check each PFN") causing it.
      
      Before the commit:
      
        [0.033176] Normal zone: 1445888 pages used for memmap
        [0.033176] Normal zone: 89391104 pages, LIFO batch:63
        [0.035851] ACPI: PM-Timer IO Port: 0x448
      
      With commit
      
        [0.026874] Normal zone: 1445888 pages used for memmap
        [0.026875] Normal zone: 89391104 pages, LIFO batch:63
        [2.028450] ACPI: PM-Timer IO Port: 0x448
      
      The root cause is the current memmap defer init doesn't work as expected.
      
      Before, memmap_init_zone() was used to do memmap init of one whole zone,
      to initialize all low zones of one numa node, but defer memmap init of
      the last zone in that numa node.  However, since commit 73a6e474,
      function memmap_init() is adapted to iterate over memblock regions
      inside one zone, then call memmap_init_zone() to do memmap init for each
      region.
      
      E.g., on VMware's system, the memory layout is as below; there are two
      memory regions in node 2.  The current code will mistakenly initialize
      the whole 1st region [mem 0xab00000000-0xfcffffffff] eagerly, then apply
      memmap deferral only to the 2nd region [mem 0x10000000000-0x1033fffffff],
      initializing just one memory section there.  In fact, we expect to see
      only one memory section's memmap initialized in total.  That's why so
      much more time is spent here.
      
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
      [    0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
      [    0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
      [    0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
      [    0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
      [    0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]
      
      Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
      down the real zone end pfn so that defer_init() can use it to judge
      whether deferral should be applied zone-wide.
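      
      A sketch of the resulting per-region call shape (simplified; the
      surrounding loop and exact parameter list follow the mainline change
      and may differ in this tree):
      
          /* memmap_init() iterates memblock regions inside the zone, but now
           * also hands down where the whole zone ends so that defer_init()
           * can judge deferral zone-wide rather than per region. */
          memmap_init_zone(size, nid, zone, region_start_pfn, zone_end_pfn,
                           MEMINIT_EARLY, NULL, MIGRATE_MOVABLE);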
      
      Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
      Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
      Fixes: commit 73a6e474 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
      Signed-off-by: Baoquan He <bhe@redhat.com>
      Reported-by: Rahul Gopakumar <gopakumarr@vmware.com>
      Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
  28. 03 Nov, 2020 1 commit
    • mm: always have io_remap_pfn_range() set pgprot_decrypted() · f8f6ae5d
      Jason Gunthorpe authored
      The purpose of io_remap_pfn_range() is to map IO memory, such as a
      memory mapped IO exposed through a PCI BAR.  IO devices do not
      understand encryption, so this memory must always be decrypted.
      Automatically call pgprot_decrypted() as part of the generic
      implementation.
      
      This fixes a bug where enabling AMD SME causes subsystems, such as RDMA,
      using io_remap_pfn_range() to expose BAR pages to user space to fail.
      The CPU will encrypt access to those BAR pages instead of passing
      unencrypted IO directly to the device.
      
      Places not mapping IO should use remap_pfn_range().
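      
      The generic implementation then amounts to the following (a sketch;
      architectures that provide their own io_remap_pfn_range() are
      unaffected):
      
          static inline int io_remap_pfn_range(struct vm_area_struct *vma,
                                               unsigned long addr, unsigned long pfn,
                                               unsigned long size, pgprot_t prot)
          {
                  /* IO memory must never be mapped encrypted. */
                  return remap_pfn_range(vma, addr, pfn, size,
                                         pgprot_decrypted(prot));
          }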
      
      Fixes: aca20d54 ("x86/mm: Add support to make use of Secure Memory Encryption")
      Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Brijesh Singh <brijesh.singh@amd.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: "Dave Young" <dyoung@redhat.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Larry Woodman <lwoodman@redhat.com>
      Cc: Matt Fleming <matt@codeblueprint.co.uk>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Toshimitsu Kani <toshi.kani@hpe.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/0-v1-025d64bdf6c4+e-amd_sme_fix_jgg@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  29. 19 Oct, 2020 1 commit
    • mm/madvise: pass mm to do_madvise · 0726b01e
      Minchan Kim authored
      Patch series "introduce memory hinting API for external process", v9.
      
      Now we have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs.  With
      them, an application can give the kernel hints about which memory
      ranges are preferred to be reclaimed.  However, on some platforms
      (e.g., Android), the information required to make the hinting decision
      is not known to the app.  Instead, it is known to a centralized
      userspace daemon (e.g., ActivityManagerService), and that daemon must
      be able to initiate reclaim on its own without any app involvement.
      
      To address this, this patchset introduces a new syscall -
      process_madvise(2).  Basically, it is the same as the madvise(2)
      syscall, with some differences.
      
      1. It needs pidfd of target process to provide the hint
      
      2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
         moment.  Other madvise hints will be opened up when there are
         explicit requests from the community, to prevent unexpected bugs we
         couldn't support.
      
      3. Only privileged processes can operate on another process's address
         space.
      
      For more detail of the new API, please see "mm: introduce external memory
      hinting API" description in this patchset.
      
      This patch (of 3):
      
      In upcoming patches, do_madvise will be called from external process
      context, so we shouldn't assume "current" is always the hinted
      process's task_struct.
      
      Furthermore, we must not access mm_struct via task->mm, but obtain it via
      access_mm() once (in the following patch) and only use that pointer [1],
      so pass it to do_madvise() as well.  Note the vma->vm_mm pointers are
      safe, so we can use them further down the call stack.
      
      And let's pass current->mm as the argument of do_madvise so it doesn't
      change existing behavior but prepares the next patch and makes review
      easy.
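      
      A sketch of what this preparation looks like at the madvise(2) entry
      point (simplified):
      
          SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
          {
                  /* Behavior is unchanged: the caller's own mm is passed in,
                   * but do_madvise() no longer assumes "current". */
                  return do_madvise(current->mm, start, len_in, behavior);
          }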
      
      [vbabka@suse.cz: changelog tweak]
      [minchan@kernel.org: use current->mm for io_uring]
        Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
      [akpm@linux-foundation.org: fix it for upstream changes]
      [akpm@linux-foundation.org: whoops]
      [rdunlap@infradead.org: add missing includes]
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Suren Baghdasaryan <surenb@google.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jann Horn <jannh@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Sandeep Patil <sspatil@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Brian Geffon <bgeffon@google.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: John Dias <joaodias@google.com>
      Cc: Joel Fernandes <joel@joelfernandes.org>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: SeongJae Park <sj38.park@gmail.com>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: SeongJae Park <sjpark@amazon.de>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fw@deneb.enyo.de>
      Cc: <linux-man@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
      Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
      Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  30. 17 Oct, 2020 1 commit