1. 26 January 2022, 1 commit
2. 31 December 2021, 1 commit
• mm: export collect_procs() · bb784b81
  Zhang Jian committed
      ascend inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4OXH9
      CVE: NA
      
      -------------------------------------------------
      
Collect the processes that have the page mapped via collect_procs().

@page: if the page is part of a hugepage/compound page, we must use
compound_head() to find its head page to prevent a kernel panic, and the
page must be locked.

@to_kill: the function returns a linked list; once we are done with this
list, we must kfree() it.

@force_early: if we want to find all processes, it must be true; if it is
false, the function only returns the processes that have the PF_MCE_PROCESS
or PF_MCE_EARLY flag set.

Limits: if force_early is true, sysctl_memory_failure_early_kill has no
effect. If force_early is false, no process has the PF_MCE_PROCESS or
PF_MCE_EARLY flag, and sysctl_memory_failure_early_kill is enabled, the
function returns all tasks regardless of whether they have the
PF_MCE_PROCESS or PF_MCE_EARLY flag.
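
For illustration, a minimal usage sketch following the rules above (the
exported prototype, the to_kill entry layout, and freeing each entry with
kfree() are assumptions based on this description and on mm/memory-failure.c,
not on the patch diff itself):

    #include <linux/mm.h>
    #include <linux/pagemap.h>
    #include <linux/printk.h>
    #include <linux/sched.h>
    #include <linux/slab.h>

    /* Assumed layout of a to_kill entry (as in mm/memory-failure.c). */
    struct to_kill {
            struct list_head nd;
            struct task_struct *tsk;
            unsigned long addr;
            short size_shift;
    };

    /* Prototype assumed to be made available by this patch. */
    void collect_procs(struct page *page, struct list_head *tokill, int force_early);

    static void report_mappers(struct page *page)
    {
            struct page *head = compound_head(page);  /* handle compound pages */
            struct to_kill *tk, *next;
            LIST_HEAD(to_kill);

            lock_page(head);                          /* the page must be locked */
            collect_procs(head, &to_kill, 1);         /* force_early: find all tasks */
            unlock_page(head);

            list_for_each_entry_safe(tk, next, &to_kill, nd) {
                    pr_info("pid %d has the page mapped\n", tk->tsk->pid);
                    list_del(&tk->nd);
                    kfree(tk);                        /* "kfree the list" as noted above */
            }
    }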
Signed-off-by: Zhang Jian <zhangjian210@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bb784b81
3. 30 October 2021, 3 commits
4. 29 October 2021, 2 commits
5. 12 August 2021, 1 commit
6. 06 August 2021, 1 commit
7. 02 August 2021, 1 commit
• mm/thp: unmap_mapping_page() to fix THP truncate_cleanup_page() · 2d1b1cc6
  Hugh Dickins committed
      stable inclusion
      from linux-4.19.197
      commit d5cd96a7880322692d64fbe75d321ccd39392537
      
      --------------------------------
      
      [ Upstream commit 22061a1f ]
      
      There is a race between THP unmapping and truncation, when truncate sees
      pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
      it, but before its page_remove_rmap() gets to decrement
      compound_mapcount: generating false "BUG: Bad page cache" reports that
      the page is still mapped when deleted.  This commit fixes that, but not
      in the way I hoped.
      
      The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
      instead of unmap_mapping_range() in truncate_cleanup_page(): it has
      often been an annoyance that we usually call unmap_mapping_range() with
      no pages locked, but there apply it to a single locked page.
      try_to_unmap() looks more suitable for a single locked page.
      
      However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
      it is used to insert THP migration entries, but not used to unmap THPs.
      Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
      needs are different, I'm too ignorant of the DAX cases, and couldn't
      decide how far to go for anon+swap.  Set that aside.
      
      The second attempt took a different tack: make no change in truncate.c,
      but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
      clearing it initially, then pmd_clear() between page_remove_rmap() and
      unlocking at the end.  Nice.  But powerpc blows that approach out of the
      water, with its serialize_against_pte_lookup(), and interesting pgtable
      usage.  It would need serious help to get working on powerpc (with a
      minor optimization issue on s390 too).  Set that aside.
      
      Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
      delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
      that's likely to reduce or eliminate the number of incidents, it would
      give less assurance of whether we had identified the problem correctly.
      
      This successful iteration introduces "unmap_mapping_page(page)" instead
      of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
      with an addition to details.  Then zap_pmd_range() watches for this
      case, and does spin_unlock(pmd_lock) if so - just like
      page_vma_mapped_walk() now does in the PVMW_SYNC case.  Not pretty, but
      safe.
      
      Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
      assert its interface; but currently that's only used to make sure that
      page->mapping is stable, and zap_pmd_range() doesn't care if the page is
      locked or not.  Along these lines, in invalidate_inode_pages2_range()
      move the initial unmap_mapping_range() out from under page lock, before
      then calling unmap_mapping_page() under page lock if still mapped.
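
A minimal sketch of the calling convention described here (not the actual
diff; the declaration of unmap_mapping_page() in linux/mm.h is assumed):

    #include <linux/mm.h>
    #include <linux/pagemap.h>

    /* The page is locked so page->mapping stays stable; unmap only this
     * one page, and only if it is still mapped. */
    static void truncate_unmap_sketch(struct page *page)
    {
            VM_BUG_ON_PAGE(!PageLocked(page), page);

            if (page_mapped(page))
                    unmap_mapping_page(page);
    }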
      
      Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
      Fixes: fc127da0 ("truncate: handle file thp")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Alistair Popple <apopple@nvidia.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jue Wang <juew@google.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Miaohe Lin <linmiaohe@huawei.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Ralph Campbell <rcampbell@nvidia.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wang Yugui <wangyugui@e16-tech.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      Note on stable backport: fixed up call to truncate_cleanup_page()
      in truncate_inode_pages_range().  Use hpage_nr_pages() in
      unmap_mapping_page().
Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2d1b1cc6
8. 29 July 2021, 1 commit
9. 15 April 2021, 1 commit
10. 14 April 2021, 2 commits
• userswap: add a new flag 'MAP_REPLACE' for mmap() · e3452806
  Guo Fan committed
      hulk inclusion
      category: feature
      bugzilla: 47439
      CVE: NA
      
      -------------------------------------------------
      
To make sure no other userspace threads can access the memory region we
are swapping out, we need to unmap the memory region, map it to a new
address, and use the new address to perform the swapout. We add a new
flag 'MAP_REPLACE' for mmap() to unmap the pages of the input parameter
'VA' and remap them to a new tmpVA.
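
A hypothetical userspace sketch of that flow (the MAP_REPLACE value and the
exact calling convention, passing the old VA in and getting the temporary VA
back as the return value, are assumptions based on the description, not taken
from the patch):

    #include <stdio.h>
    #include <sys/mman.h>

    /* Placeholder value for illustration only; the real flag value is
     * whatever this patch defines in the uapi headers. */
    #ifndef MAP_REPLACE
    #define MAP_REPLACE 0x1000000
    #endif

    int main(void)
    {
            size_t len = 16 * 4096;

            /* Region whose pages userspace wants to swap out. */
            void *va = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (va == MAP_FAILED)
                    return 1;

            /* Assumed usage: unmap the pages backing 'va' and remap them at
             * a new temporary address, so no other thread can reach them
             * through the old mapping while the swapout runs. */
            void *tmp_va = mmap(va, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_REPLACE, -1, 0);
            if (tmp_va == MAP_FAILED)
                    return 1;

            printf("pages moved from %p to %p\n", va, tmp_va);
            return 0;
    }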
Signed-off-by: Guo Fan <guofan5@huawei.com>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      e3452806
• mm: allow VM_FAULT_RETRY for multiple times · 9745f703
  Peter Xu committed
      mainline inclusion
      from mainline-5.6
      commit 4064b982
      category: bugfix
      bugzilla: 47439
      CVE: NA
      ---------------------------
      
      The idea comes from a discussion between Linus and Andrea [1].
      
      Before this patch we only allow a page fault to retry once.  We achieved
      this by clearing the FAULT_FLAG_ALLOW_RETRY flag when doing
      handle_mm_fault() the second time.  This was majorly used to avoid
      unexpected starvation of the system by looping over forever to handle the
      page fault on a single page.  However that should hardly happen, and after
      all for each code path to return a VM_FAULT_RETRY we'll first wait for a
      condition (during which time we should possibly yield the cpu) to happen
      before VM_FAULT_RETRY is really returned.
      
      This patch removes the restriction by keeping the FAULT_FLAG_ALLOW_RETRY
      flag when we receive VM_FAULT_RETRY.  It means that the page fault handler
      now can retry the page fault for multiple times if necessary without the
      need to generate another page fault event.  Meanwhile we still keep the
      FAULT_FLAG_TRIED flag so page fault handler can still identify whether a
      page fault is the first attempt or not.
      
      Then we'll have these combinations of fault flags (only considering
      ALLOW_RETRY flag and TRIED flag):
      
        - ALLOW_RETRY and !TRIED:  this means the page fault allows to
                                   retry, and this is the first try
      
        - ALLOW_RETRY and TRIED:   this means the page fault allows to
                                   retry, and this is not the first try
      
        - !ALLOW_RETRY and !TRIED: this means the page fault does not allow
                                   to retry at all
      
        - !ALLOW_RETRY and TRIED:  this is forbidden and should never be used
      
      In existing code we have multiple places that has taken special care of
      the first condition above by checking against (fault_flags &
      FAULT_FLAG_ALLOW_RETRY).  This patch introduces a simple helper to detect
      the first retry of a page fault by checking against both (fault_flags &
      FAULT_FLAG_ALLOW_RETRY) and !(fault_flag & FAULT_FLAG_TRIED) because now
      even the 2nd try will have the ALLOW_RETRY set, then use that helper in
      all existing special paths.  One example is in __lock_page_or_retry(), now
      we'll drop the mmap_sem only in the first attempt of page fault and we'll
      keep it in follow up retries, so old locking behavior will be retained.
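
The helper itself boils down to the following (mainline names it
fault_flag_allow_retry_first(); shown here as a sketch of the patched
include/linux/mm.h):

    /* True only on the first attempt: retry is still allowed and we have
     * not already been through a retry. */
    static inline bool fault_flag_allow_retry_first(unsigned int flags)
    {
            return (flags & FAULT_FLAG_ALLOW_RETRY) &&
                   !(flags & FAULT_FLAG_TRIED);
    }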
      
      This will be a nice enhancement for current code [2] at the same time a
      supporting material for the future userfaultfd-writeprotect work, since in
      that work there will always be an explicit userfault writeprotect retry
      for protected pages, and if that cannot resolve the page fault (e.g., when
      userfaultfd-writeprotect is used in conjunction with swapped pages) then
      we'll possibly need a 3rd retry of the page fault.  It might also benefit
      other potential users who will have similar requirement like userfault
      write-protection.
      
      GUP code is not touched yet and will be covered in follow up patch.
      
      Please read the thread below for more information.
      
      [1] https://lore.kernel.org/lkml/20171102193644.GB22686@redhat.com/
[2] https://lore.kernel.org/lkml/20181230154648.GB9832@redhat.com/
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Suggested-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Brian Geffon <bgeffon@google.com>
      Cc: Bobby Powers <bobbypowers@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Denis Plotnikov <dplotnikov@virtuozzo.com>
      Cc: "Dr . David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
      Cc: Martin Cracauer <cracauer@cons.org>
      Cc: Marty McFadden <mcfadden8@llnl.gov>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Maya Gokhale <gokhale2@llnl.gov>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Pavel Emelyanov <xemul@openvz.org>
Link: http://lkml.kernel.org/r/20200220160246.9790-1-peterx@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
       Conflicts:
      	arch/arc/mm/fault.c
      	arch/arm64/mm/fault.c
      	arch/x86/mm/fault.c
      	drivers/gpu/drm/ttm/ttm_bo_vm.c
      	include/linux/mm.h
      	mm/internal.h
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
      9745f703
11. 22 February 2021, 1 commit
12. 04 November 2020, 1 commit
• mm: replace memmap_context by meminit_context · fb9e4c0b
  Laurent Dufour committed
      stable inclusion
      from linux-4.19.150
      commit 25eaea1b33f2569f69a82dfddb3fb05384143bd0
      
      --------------------------------
      
      commit c1d0da83 upstream.
      
      Patch series "mm: fix memory to node bad links in sysfs", v3.
      
      Sometimes, firmware may expose interleaved memory layout like this:
      
       Early memory node ranges
         node   1: [mem 0x0000000000000000-0x000000011fffffff]
         node   2: [mem 0x0000000120000000-0x000000014fffffff]
         node   1: [mem 0x0000000150000000-0x00000001ffffffff]
         node   0: [mem 0x0000000200000000-0x000000048fffffff]
         node   2: [mem 0x0000000490000000-0x00000007ffffffff]
      
      In that case, we can see memory blocks assigned to multiple nodes in
      sysfs:
      
        $ ls -l /sys/devices/system/memory/memory21
        total 0
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node1 -> ../../node/node1
        lrwxrwxrwx 1 root root     0 Aug 24 05:27 node2 -> ../../node/node2
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
        -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
        drwxr-xr-x 2 root root     0 Aug 24 05:27 power
        -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
        -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
        lrwxrwxrwx 1 root root     0 Aug 24 05:25 subsystem -> ../../../../bus/memory
        -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
        -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones
      
      The same applies in the node's directory with a memory21 link in both
      the node1 and node2's directory.
      
This is wrong but doesn't prevent the system from running.  However, when
one of these memory blocks is later hot-unplugged and then hot-plugged,
the system detects an inconsistency in the sysfs layout and a BUG_ON()
is raised:
      
        kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
        LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
        Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
        CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
        Call Trace:
          add_memory_resource+0x23c/0x340 (unreliable)
          __add_memory+0x5c/0xf0
          dlpar_add_lmb+0x1b4/0x500
          dlpar_memory+0x1f8/0xb80
          handle_dlpar_errorlog+0xc0/0x190
          dlpar_store+0x198/0x4a0
          kobj_attr_store+0x30/0x50
          sysfs_kf_write+0x64/0x90
          kernfs_fop_write+0x1b0/0x290
          vfs_write+0xe8/0x290
          ksys_write+0xdc/0x130
          system_call_exception+0x160/0x270
          system_call_common+0xf0/0x27c
      
      This has been seen on PowerPC LPAR.
      
      The root cause of this issue is that when node's memory is registered,
      the range used can overlap another node's range, thus the memory block
      is registered to multiple nodes in sysfs.
      
      There are two issues here:
      
       (a) The sysfs memory and node's layouts are broken due to these
           multiple links
      
       (b) The link errors in link_mem_sections() should not lead to a system
           panic.
      
      To address (a) register_mem_sect_under_node should not rely on the
      system state to detect whether the link operation is triggered by a hot
      plug operation or not.  This is addressed by the patches 1 and 2 of this
      series.
      
      Issue (b) will be addressed separately.
      
      This patch (of 2):
      
      The memmap_context enum is used to detect whether a memory operation is
      due to a hot-add operation or happening at boot time.
      
      Make it general to the hotplug operation and rename it as
      meminit_context.
      
      There is no functional change introduced by this patch
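
In code, the rename amounts to roughly the following (enumerator names follow
the mainline change; the exact spelling in this backport is assumed):

    /*
     * Before this patch the context was expressed in memmap terms:
     *
     *      enum memmap_context { MEMMAP_EARLY, MEMMAP_HOTPLUG };
     *
     * After the rename it is generic to memory initialisation:
     */
    enum meminit_context {
            MEMINIT_EARLY,          /* initialised at boot time */
            MEMINIT_HOTPLUG,        /* initialised by a hotplug operation */
    };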
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Rafael J . Wysocki" <rafael@kernel.org>
      Cc: Nathan Lynch <nathanl@linux.ibm.com>
      Cc: Scott Cheloha <cheloha@linux.ibm.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: <stable@vger.kernel.org>
      Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fb9e4c0b
13. 22 September 2020, 2 commits
14. 31 August 2020, 3 commits
15. 17 March 2020, 1 commit
• pagecache: support percpu refcount to improve performance · 8b9ea901
  Yunfeng Ye committed
      euleros inclusion
      category: feature
      feature: pagecache percpu refcount
      bugzilla: 31398
      CVE: NA
      
      -------------------------------------------------
      
The pagecache manages a file's physical pages, and the life cycle of a
page is managed by atomic reference counting. With the increasing number
of CPU cores, the cost of this atomic counting becomes very high when
reading file pagecaches with high concurrency.

For example, when running an nginx http application, the biggest hotspot
is found in the atomic operation in find_get_entry():
      
       11.94% [kernel] [k] find_get_entry
        7.45% [kernel] [k] do_tcp_sendpages
        6.12% [kernel] [k] generic_file_buffered_read
      
So we use the percpu refcount mechanism to fix this problem, and the
test results show that the read performance of nginx http can be improved
by 100%:
      
worker   original(requests/sec)   percpu(requests/sec)   improve
        64       759656.87                1627088.95             114.2%
      
Notes: we use page->lru to save the percpu information, so pages with the
percpu attribute will not be recycled by the memory reclaim process; we
should avoid growing the file size without limit.
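
For context, here is a sketch of the generic percpu_ref mechanism the idea
builds on (illustration only; the patch itself keeps its state in page->lru,
which is not reproduced here):

    #include <linux/percpu-refcount.h>
    #include <linux/slab.h>

    /* Gets and puts touch a per-CPU counter instead of one shared atomic,
     * so heavy concurrent readers stop bouncing a single cache line. */
    struct cached_object {
            struct percpu_ref ref;
    };

    static void cached_object_release(struct percpu_ref *ref)
    {
            kfree(container_of(ref, struct cached_object, ref));
    }

    static struct cached_object *cached_object_create(void)
    {
            struct cached_object *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

            if (!obj)
                    return NULL;
            if (percpu_ref_init(&obj->ref, cached_object_release, 0, GFP_KERNEL)) {
                    kfree(obj);
                    return NULL;
            }
            return obj;
    }

    static void cached_object_use(struct cached_object *obj)
    {
            percpu_ref_get(&obj->ref);      /* cheap per-CPU increment */
            /* ... read the cached data ... */
            percpu_ref_put(&obj->ref);      /* cheap per-CPU decrement */
    }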
Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8b9ea901
16. 27 December 2019, 8 commits
17. 29 December 2018, 1 commit
• mm: add mm_pxd_folded checks to pgtable_bytes accounting functions · 28a3b553
  Martin Schwidefsky committed
      [ Upstream commit 6d212db11947ae5464e4717536ed9faf61c01e86 ]
      
      The common mm code calls mm_dec_nr_pmds() and mm_dec_nr_puds()
      in free_pgtables() if the address range spans a full pud or pmd.
      If mm_dec_nr_puds/mm_dec_nr_pmds are non-empty due to configuration
      settings they blindly subtract the size of the pmd or pud table from
      pgtable_bytes even if the pud or pmd page table layer is folded.
      
      Add explicit mm_[pmd|pud]_folded checks to the four pgtable_bytes
      accounting functions mm_inc_nr_puds, mm_inc_nr_pmds, mm_dec_nr_puds
      and mm_dec_nr_pmds. As the check for folded page tables can be
      overwritten by the architecture, this allows to keep a correct
      pgtable_bytes value for platforms that use a dynamic number of
      page table levels.
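
For the pmd pair, the patched helpers in include/linux/mm.h end up looking
roughly like this (sketched from the description; the pud variants follow
the same pattern):

    static inline void mm_inc_nr_pmds(struct mm_struct *mm)
    {
            if (mm_pmd_folded(mm))
                    return;         /* folded level: no table to account */
            atomic_long_add(PTRS_PER_PMD * sizeof(pmd_t), &mm->pgtables_bytes);
    }

    static inline void mm_dec_nr_pmds(struct mm_struct *mm)
    {
            if (mm_pmd_folded(mm))
                    return;
            atomic_long_sub(PTRS_PER_PMD * sizeof(pmd_t), &mm->pgtables_bytes);
    }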
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
      28a3b553
18. 06 October 2018, 1 commit
• mm: migration: fix migration of huge PMD shared pages · 017b1660
  Mike Kravetz committed
      The page migration code employs try_to_unmap() to try and unmap the source
      page.  This is accomplished by using rmap_walk to find all vmas where the
      page is mapped.  This search stops when page mapcount is zero.  For shared
      PMD huge pages, the page map count is always 1 no matter the number of
      mappings.  Shared mappings are tracked via the reference count of the PMD
      page.  Therefore, try_to_unmap stops prematurely and does not completely
      unmap all mappings of the source page.
      
This problem can result in data corruption, as writes to the original
source page can happen after the contents of the page have been copied to
the target page.  Hence, data is lost.
      
      This problem was originally seen as DB corruption of shared global areas
      after a huge page was soft offlined due to ECC memory errors.  DB
      developers noticed they could reproduce the issue by (hotplug) offlining
      memory used to back huge pages.  A simple testcase can reproduce the
      problem by creating a shared PMD mapping (note that this must be at least
      PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
      migrate_pages() to migrate process pages between nodes while continually
      writing to the huge pages being migrated.
      
      To fix, have the try_to_unmap_one routine check for huge PMD sharing by
      calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
      mapping it will be 'unshared' which removes the page table entry and drops
      the reference on the PMD page.  After this, flush caches and TLB.
      
      mmu notifiers are called before locking page tables, but we can not be
      sure of PMD sharing until page tables are locked.  Therefore, check for
      the possibility of PMD sharing before locking so that notifiers can
      prepare for the worst possible case.
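
A simplified model of the unshare step (not the actual try_to_unmap_one()
diff, which also handles mmu notifiers and terminating the rmap walk; the
huge_pmd_unshare() signature shown is the 4.19-era one):

    #include <linux/hugetlb.h>
    #include <linux/mm.h>
    #include <asm/cacheflush.h>
    #include <asm/tlbflush.h>

    static bool unshare_shared_huge_pmd(struct vm_area_struct *vma,
                                        unsigned long address, pte_t *ptep)
    {
            unsigned long start = address & PUD_MASK;
            unsigned long end = start + PUD_SIZE;

            /* Drops the extra reference on the shared PMD page and clears
             * the entry, unmapping the whole PUD-sized range at once. */
            if (!huge_pmd_unshare(vma->vm_mm, &address, ptep))
                    return false;

            /* The mapping for the entire range is gone: flush it all. */
            flush_cache_range(vma, start, end);
            flush_tlb_range(vma, start, end);
            return true;
    }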
      
      Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
      [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
        Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
      Fixes: 39dde65c ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      017b1660
19. 24 August 2018, 1 commit
20. 23 August 2018, 3 commits
21. 18 August 2018, 4 commits
• mm/sparse: delete old sparse_init and enable new one · 2a3cb8ba
  Pavel Tatashin committed
Rename new_sparse_init() to sparse_init(), which enables it.  Delete the old
sparse_init() and all the code that became obsolete with it.
      
      [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
        Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2a3cb8ba
• mm/sparse: move buffer init/fini to the common place · afda57bc
  Pavel Tatashin committed
      Now that both variants of sparse memory use the same buffers to populate
      memory map, we can move sparse_buffer_init()/sparse_buffer_fini() to the
      common place.
      
Link: http://lkml.kernel.org/r/20180712203730.8703-4-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Tested-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      afda57bc
• mm/sparse: abstract sparse buffer allocations · 35fd1eb1
  Pavel Tatashin committed
      Patch series "sparse_init rewrite", v6.
      
      In sparse_init() we allocate two large buffers to temporary hold usemap
      and memmap for the whole machine.  However, we can avoid doing that if
      we changed sparse_init() to operated on per-node bases instead of doing
      it on the whole machine beforehand.
      
      As shown by Baoquan
        http://lkml.kernel.org/r/20180628062857.29658-1-bhe@redhat.com
      
The buffers are large enough to prevent the machine from booting on
small-memory systems.
      
      Another benefit of these changes is that they also obsolete
      CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.
      
      This patch (of 5):
      
When struct pages are allocated for the sparse-vmemmap VA layout, we first
try to allocate one large buffer, and then, if that fails, allocate struct
pages for each section as we go.

The code that allocates the buffer uses global variables and is spread
across several call sites.
      
      Cleanup the code by introducing three functions to handle the global
      buffer:
      
      sparse_buffer_init()	initialize the buffer
      sparse_buffer_fini()	free the remaining part of the buffer
      sparse_buffer_alloc()	alloc from the buffer, and if buffer is empty
      return NULL
      
      Define these functions in sparse.c instead of sparse-vmemmap.c because
      later we will use them for non-vmemmap sparse allocations as well.
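
A sketch of the intended per-node calling pattern (illustrative caller only,
not the actual sparse code; the signatures follow the description above):

    #include <linux/mm.h>

    static void init_node_memmap_sketch(int nid, unsigned long map_count,
                                        unsigned long section_map_size)
    {
            unsigned long i;

            /* Grab one large per-node buffer up front. */
            sparse_buffer_init(map_count * section_map_size, nid);

            for (i = 0; i < map_count; i++) {
                    /* Carve one section's memmap out of the buffer... */
                    void *map = sparse_buffer_alloc(section_map_size);

                    if (!map) {
                            /* ...or fall back to a per-section allocation
                             * (omitted in this sketch). */
                            continue;
                    }
                    /* hook 'map' up to the section here */
            }

            /* Hand back whatever part of the buffer was not used. */
            sparse_buffer_fini();
    }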
      
      [akpm@linux-foundation.org: use PTR_ALIGN()]
      [akpm@linux-foundation.org: s/BUG_ON/WARN_ON/]
Link: http://lkml.kernel.org/r/20180712203730.8703-2-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Tested-by: Michael Ellerman <mpe@ellerman.id.au>	[powerpc]
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Tested-by: Oscar Salvador <osalvador@suse.de>
      Cc: Pasha Tatashin <Pavel.Tatashin@microsoft.com>
      Cc: Steven Sistare <steven.sistare@oracle.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Souptick Joarder <jrdr.linux@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Abdul Haleem <abdhalee@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35fd1eb1
• mm, huge page: copy target sub-page last when copy huge page · c9f4cd71
  Huang Ying committed
      Huge page helps to reduce TLB miss rate, but it has higher cache
      footprint, sometimes this may cause some issue.  For example, when
      copying huge page on x86_64 platform, the cache footprint is 4M.  But on
      a Xeon E5 v3 2699 CPU, there are 18 cores, 36 threads, and only 45M LLC
      (last level cache).  That is, in average, there are 2.5M LLC for each
      core and 1.25M LLC for each thread.
      
If the cache contention is heavy when copying the huge page, and we copy
the huge page from the beginning to the end, it is possible that the
beginning of the huge page is evicted from the cache by the time we finish
copying the end of the huge page.  And it is possible for the application
to access the beginning of the huge page right after the copy.
      
      In c79b57e4 ("mm: hugetlb: clear target sub-page last when clearing
      huge page"), to keep the cache lines of the target subpage hot, the
      order to clear the subpages in the huge page in clear_huge_page() is
      changed to clearing the subpage which is furthest from the target
      subpage firstly, and the target subpage last.  The similar order
      changing helps huge page copying too.  That is implemented in this
      patch.  Because we have put the order algorithm into a separate
      function, the implementation is quite simple.
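
A userspace model of that ordering (illustrating the idea only, not the
kernel's copy_user_huge_page() implementation):

    #include <stddef.h>
    #include <string.h>

    #define SUBPAGE_SIZE 4096UL

    /* Copy sub-pages in order of decreasing distance from the target
     * sub-page, so the cache lines the application touches next are the
     * ones brought into the cache most recently. */
    void copy_huge_region_target_last(void *dst, const void *src,
                                      size_t nr_subpages, size_t target)
    {
            size_t left = 0, right = nr_subpages - 1;

            while (left < target || right > target) {
                    size_t i;

                    /* Pick whichever remaining end is farther from the target. */
                    if (target - left >= right - target)
                            i = left++;
                    else
                            i = right--;

                    memcpy((char *)dst + i * SUBPAGE_SIZE,
                           (const char *)src + i * SUBPAGE_SIZE, SUBPAGE_SIZE);
            }

            /* Copy the target sub-page last, keeping its lines hottest. */
            memcpy((char *)dst + target * SUBPAGE_SIZE,
                   (const char *)src + target * SUBPAGE_SIZE, SUBPAGE_SIZE);
    }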
      
      The patch is a generic optimization which should benefit quite some
      workloads, not for a specific use case.  To demonstrate the performance
      benefit of the patch, we tested it with vm-scalability run on
      transparent huge page.
      
      With this patch, the throughput increases ~16.6% in vm-scalability
      anon-cow-seq test case with 36 processes on a 2 socket Xeon E5 v3 2699
      system (36 cores, 72 threads).  The test case set
      /sys/kernel/mm/transparent_hugepage/enabled to be always, mmap() a big
      anonymous memory area and populate it, then forked 36 child processes,
      each writes to the anonymous memory area from the begin to the end, so
      cause copy on write.  For each child process, other child processes
      could be seen as other workloads which generate heavy cache pressure.
      At the same time, the IPC (instruction per cycle) increased from 0.63 to
      0.78, and the time spent in user space is reduced ~7.2%.
      
Link: http://lkml.kernel.org/r/20180524005851.4079-3-ying.huang@intel.com
Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andi Kleen <andi.kleen@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Shaohua Li <shli@fb.com>
      Cc: Christopher Lameter <cl@linux.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c9f4cd71