1. 01 2月, 2018 1 次提交
    • M
      mm, hugetlb: do not rely on overcommit limit during migration · ab5ac90a
      Michal Hocko 提交于
      hugepage migration relies on __alloc_buddy_huge_page to get a new page.
      This has 2 main disadvantages.
      
      1) it doesn't allow to migrate any huge page if the pool is used
         completely which is not an exceptional case as the pool is static and
         unused memory is just wasted.
      
      2) it leads to a weird semantic when migration between two numa nodes
         might increase the pool size of the destination NUMA node while the
         page is in use.  The issue is caused by per NUMA node surplus pages
         tracking (see free_huge_page).
      
      Address both issues by changing the way how we allocate and account
      pages allocated for migration.  Those should temporal by definition.  So
      we mark them that way (we will abuse page flags in the 3rd page) and
      update free_huge_page to free such pages to the page allocator.  Page
      migration path then just transfers the temporal status from the new page
      to the old one which will be freed on the last reference.  The global
      surplus count will never change during this path but we still have to be
      careful when migrating a per-node suprlus page.  This is now handled in
      move_hugetlb_state which is called from the migration path and it copies
      the hugetlb specific page state and fixes up the accounting when needed
      
      Rename __alloc_buddy_huge_page to __alloc_surplus_huge_page to better
      reflect its purpose.  The new allocation routine for the migration path
      is __alloc_migrate_huge_page.
      
      The user visible effect of this patch is that migrated pages are really
      temporal and they travel between NUMA nodes as per the migration
      request:
      
      Before migration
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      After
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/nr_hugepages:0
        /sys/devices/system/node/node0/hugepages/hugepages-2048kB/surplus_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages:0
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages:1
        /sys/devices/system/node/node1/hugepages/hugepages-2048kB/surplus_hugepages:0
      
      with the previous implementation, both nodes would have nr_hugepages:1
      until the page is freed.
      
      Link: http://lkml.kernel.org/r/20180103093213.26329-4-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Andrea Reale <ar@linux.vnet.ibm.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ab5ac90a
  2. 30 11月, 2017 1 次提交
  3. 28 11月, 2017 1 次提交
    • K
      mm, thp: Do not make pmd/pud dirty without a reason · 152e93af
      Kirill A. Shutemov 提交于
      Currently we make page table entries dirty all the time regardless of
      access type and don't even consider if the mapping is write-protected.
      The reasoning is that we don't really need dirty tracking on THP and
      making the entry dirty upfront may save some time on first write to the
      page.
      
      Unfortunately, such approach may result in false-positive
      can_follow_write_pmd() for huge zero page or read-only shmem file.
      
      Let's only make page dirty only if we about to write to the page anyway
      (as we do for small pages).
      
      I've restructured the code to make entry dirty inside
      maybe_p[mu]d_mkwrite(). It also takes into account if the vma is
      write-protected.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      152e93af
  4. 16 11月, 2017 1 次提交
    • J
      mm/mmu_notifier: avoid call to invalidate_range() in range_end() · 4645b9fe
      Jérôme Glisse 提交于
      This is an optimization patch that only affect mmu_notifier users which
      rely on the invalidate_range() callback.  This patch avoids calling that
      callback twice in a row from inside __mmu_notifier_invalidate_range_end
      
      Existing pattern (before this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_end()
              mmu_notifier_invalidate_range()
      
      New pattern (after this patch):
          mmu_notifier_invalidate_range_start()
              pte/pmd/pud_clear_flush_notify()
                  mmu_notifier_invalidate_range()
          mmu_notifier_invalidate_range_only_end()
      
      We call the invalidate_range callback after clearing the page table
      under the page table lock and we skip the call to invalidate_range
      inside the __mmu_notifier_invalidate_range_end() function.
      
      Idea from Andrea Arcangeli
      
      Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Joerg Roedel <jroedel@suse.de>
      Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Alistair Popple <alistair@popple.id.au>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Andrew Donnellan <andrew.donnellan@au1.ibm.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4645b9fe
  5. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  6. 14 10月, 2017 1 次提交
  7. 09 9月, 2017 9 次提交
    • J
      mm/hmm: avoid bloating arch that do not make use of HMM · 6b368cd4
      Jérôme Glisse 提交于
      This moves all new code including new page migration helper behind kernel
      Kconfig option so that there is no codee bloat for arch or user that do
      not want to use HMM or any of its associated features.
      
      arm allyesconfig (without all the patchset, then with and this patch):
         text	   data	    bss	    dec	    hex	filename
      83721896	46511131	27582964	157815991	96814b7	../without/vmlinux
      83722364	46511131	27582964	157816459	968168b	vmlinux
      
      [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
        Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
      [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
        Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
      Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6b368cd4
    • J
      mm/device-public-memory: device memory cache coherent with CPU · df6ad698
      Jérôme Glisse 提交于
      Platform with advance system bus (like CAPI or CCIX) allow device memory
      to be accessible from CPU in a cache coherent fashion.  Add a new type of
      ZONE_DEVICE to represent such memory.  The use case are the same as for
      the un-addressable device memory but without all the corners cases.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      df6ad698
    • J
      mm/migrate: allow migrate_vma() to alloc new page on empty entry · 8315ada7
      Jérôme Glisse 提交于
      This allows callers of migrate_vma() to allocate new page for empty CPU
      page table entry (pte_none or back by zero page).  This is only for
      anonymous memory and it won't allow new page to be instanced if the
      userfaultfd is armed.
      
      This is useful to device driver that want to migrate a range of virtual
      address and would rather allocate new memory than having to fault later
      on.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-18-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8315ada7
    • J
      mm/migrate: support un-addressable ZONE_DEVICE page in migration · a5430dda
      Jérôme Glisse 提交于
      Allow to unmap and restore special swap entry of un-addressable
      ZONE_DEVICE memory.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-17-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a5430dda
    • J
      mm/migrate: migrate_vma() unmap page from vma while collecting pages · 8c3328f1
      Jérôme Glisse 提交于
      Common case for migration of virtual address range is page are map only
      once inside the vma in which migration is taking place.  Because we
      already walk the CPU page table for that range we can directly do the
      unmap there and setup special migration swap entry.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-16-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: NSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8c3328f1
    • J
      mm/migrate: new memory migration helper for use with device memory · 8763cb45
      Jérôme Glisse 提交于
      This patch add a new memory migration helpers, which migrate memory
      backing a range of virtual address of a process to different memory (which
      can be allocated through special allocator).  It differs from numa
      migration by working on a range of virtual address and thus by doing
      migration in chunk that can be large enough to use DMA engine or special
      copy offloading engine.
      
      Expected users are any one with heterogeneous memory where different
      memory have different characteristics (latency, bandwidth, ...).  As an
      example IBM platform with CAPI bus can make use of this feature to migrate
      between regular memory and CAPI device memory.  New CPU architecture with
      a pool of high performance memory not manage as cache but presented as
      regular memory (while being faster and with lower latency than DDR) will
      also be prime user of this patch.
      
      Migration to private device memory will be useful for device that have
      large pool of such like GPU, NVidia plans to use HMM for that.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-15-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Signed-off-by: NEvgeny Baskakov <ebaskakov@nvidia.com>
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Signed-off-by: NMark Hairgrove <mhairgrove@nvidia.com>
      Signed-off-by: NSherry Cheung <SCheung@nvidia.com>
      Signed-off-by: NSubhash Gutti <sgutti@nvidia.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8763cb45
    • J
      mm/migrate: new migrate mode MIGRATE_SYNC_NO_COPY · 2916ecc0
      Jérôme Glisse 提交于
      Introduce a new migration mode that allow to offload the copy to a device
      DMA engine.  This changes the workflow of migration and not all
      address_space migratepage callback can support this.
      
      This is intended to be use by migrate_vma() which itself is use for thing
      like HMM (see include/linux/hmm.h).
      
      No additional per-filesystem migratepage testing is needed.  I disables
      MIGRATE_SYNC_NO_COPY in all problematic migratepage() callback and i
      added comment in those to explain why (part of this patch).  The commit
      message is unclear it should say that any callback that wish to support
      this new mode need to be aware of the difference in the migration flow
      from other mode.
      
      Some of these callbacks do extra locking while copying (aio, zsmalloc,
      balloon, ...) and for DMA to be effective you want to copy multiple
      pages in one DMA operations.  But in the problematic case you can not
      easily hold the extra lock accross multiple call to this callback.
      
      Usual flow is:
      
      For each page {
       1 - lock page
       2 - call migratepage() callback
       3 - (extra locking in some migratepage() callback)
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
       5 - copy page
       6 - (unlock any extra lock of migratepage() callback)
       7 - return from migratepage() callback
       8 - unlock page
      }
      
      The new mode MIGRATE_SYNC_NO_COPY:
       1 - lock multiple pages
      For each page {
       2 - call migratepage() callback
       3 - abort in all problematic migratepage() callback
       4 - migrate page state (freeze refcount, update page cache, buffer
           head, ...)
      } // finished all calls to migratepage() callback
       5 - DMA copy multiple pages
       6 - unlock all the pages
      
      To support MIGRATE_SYNC_NO_COPY in the problematic case we would need a
      new callback migratepages() (for instance) that deals with multiple
      pages in one transaction.
      
      Because the problematic cases are not important for current usage I did
      not wanted to complexify this patchset even more for no good reason.
      
      Link: http://lkml.kernel.org/r/20170817000548.32038-14-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Evgeny Baskakov <ebaskakov@nvidia.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mark Hairgrove <mhairgrove@nvidia.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ross Zwisler <ross.zwisler@linux.intel.com>
      Cc: Sherry Cheung <SCheung@nvidia.com>
      Cc: Subhash Gutti <sgutti@nvidia.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2916ecc0
    • N
      mm: migrate: move_pages() supports thp migration · e8db67eb
      Naoya Horiguchi 提交于
      This patch enables thp migration for move_pages(2).
      
      Link: http://lkml.kernel.org/r/20170717193955.20207-10-zi.yan@sent.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e8db67eb
    • Z
      mm: thp: enable thp migration in generic path · 616b8371
      Zi Yan 提交于
      Add thp migration's core code, including conversions between a PMD entry
      and a swap entry, setting PMD migration entry, removing PMD migration
      entry, and waiting on PMD migration entries.
      
      This patch makes it possible to support thp migration.  If you fail to
      allocate a destination page as a thp, you just split the source thp as
      we do now, and then enter the normal page migration.  If you succeed to
      allocate destination thp, you enter thp migration.  Subsequent patches
      actually enable thp migration for each caller of page migration by
      allowing its get_new_page() callback to allocate thps.
      
      [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
        Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
      [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
      Signed-off-by: NZi Yan <zi.yan@cs.rutgers.edu>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Anshuman Khandual <khandual@linux.vnet.ibm.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: David Nellans <dnellans@nvidia.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      616b8371
  8. 21 8月, 2017 1 次提交
    • L
      Sanitize 'move_pages()' permission checks · 197e7e52
      Linus Torvalds 提交于
      The 'move_paghes()' system call was introduced long long ago with the
      same permission checks as for sending a signal (except using
      CAP_SYS_NICE instead of CAP_SYS_KILL for the overriding capability).
      
      That turns out to not be a great choice - while the system call really
      only moves physical page allocations around (and you need other
      capabilities to do a lot of it), you can check the return value to map
      out some the virtual address choices and defeat ASLR of a binary that
      still shares your uid.
      
      So change the access checks to the more common 'ptrace_may_access()'
      model instead.
      
      This tightens the access checks for the uid, and also effectively
      changes the CAP_SYS_NICE check to CAP_SYS_PTRACE, but it's unlikely that
      anybody really _uses_ this legacy system call any more (we hav ebetter
      NUMA placement models these days), so I expect nobody to notice.
      
      Famous last words.
      Reported-by: NOtto Ebeling <otto.ebeling@iki.fi>
      Acked-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: stable@kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      197e7e52
  9. 11 8月, 2017 1 次提交
  10. 10 8月, 2017 1 次提交
    • P
      mm, locking: Rework {set,clear,mm}_tlb_flush_pending() · 8b1b436d
      Peter Zijlstra 提交于
      Commit:
      
        af2c1401 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")
      
      added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
      can solve the same problem without this barrier.
      
      If instead we mandate that mm_tlb_flush_pending() is used while
      holding the PTL we're guaranteed to observe prior
      set_tlb_flush_pending() instances.
      
      For this to work we need to rework migrate_misplaced_transhuge_page()
      a little and move the test up into do_huge_pmd_numa_page().
      
      NOTE: this relies on flush_tlb_range() to guarantee:
      
         (1) it ensures that prior page table updates are visible to the
             page table walker and
         (2) it ensures that subsequent memory accesses are only made
             visible after the invalidation has completed
      
      This is required for architectures that implement TRANSPARENT_HUGEPAGE
      (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
      mm_tlb_flush_pending() in their page-table operations (arm, arm64,
      x86).
      
      This appears true for:
      
       - arm (DSB ISB before and after),
       - arm64 (DSB ISHST before, and DSB ISH after),
       - powerpc (PTESYNC before and after),
       - s390 and x86 TLB invalidate are serializing instructions
      
      But I failed to understand the situation for:
      
       - arc, mips, sparc
      
      Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
      and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
      inside the PTL. It still needs to guarantee the PTL unlock happens
      _after_ the invalidate completes.
      
      Vineet, Ralf and Dave could you guys please have a look?
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vineet Gupta <vgupta@synopsys.com>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      8b1b436d
  11. 11 7月, 2017 2 次提交
  12. 07 7月, 2017 1 次提交
  13. 04 5月, 2017 3 次提交
  14. 21 4月, 2017 1 次提交
    • R
      mm: prevent NR_ISOLATE_* stats from going negative · fc280fe8
      Rabin Vincent 提交于
      Commit 6afcf8ef ("mm, compaction: fix NR_ISOLATED_* stats for pfn
      based migration") moved the dec_node_page_state() call (along with the
      page_is_file_cache() call) to after putback_lru_page().
      
      But page_is_file_cache() can change after putback_lru_page() is called,
      so it should be called before putback_lru_page(), as it was before that
      patch, to prevent NR_ISOLATE_* stats from going negative.
      
      Without this fix, non-CONFIG_SMP kernels end up hanging in the
      while(too_many_isolated()) { congestion_wait() } loop in
      shrink_active_list() due to the negative stats.
      
       Mem-Info:
        active_anon:32567 inactive_anon:121 isolated_anon:1
        active_file:6066 inactive_file:6639 isolated_file:4294967295
                                                          ^^^^^^^^^^
        unevictable:0 dirty:115 writeback:0 unstable:0
        slab_reclaimable:2086 slab_unreclaimable:3167
        mapped:3398 shmem:18366 pagetables:1145 bounce:0
        free:1798 free_pcp:13 free_cma:0
      
      Fixes: 6afcf8ef ("mm, compaction: fix NR_ISOLATED_* stats for pfn based migration")
      Link: http://lkml.kernel.org/r/1492683865-27549-1-git-send-email-rabin.vincent@axis.comSigned-off-by: NRabin Vincent <rabinv@axis.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Ming Ling <ming.ling@spreadtrum.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fc280fe8
  15. 01 4月, 2017 1 次提交
    • N
      mm: migrate: fix remove_migration_pte() for ksm pages · 4b0ece6f
      Naoya Horiguchi 提交于
      I found that calling page migration for ksm pages causes the following
      bug:
      
          page:ffffea0004d51180 count:2 mapcount:2 mapping:ffff88013c785141 index:0x913
          flags: 0x57ffffc0040068(uptodate|lru|active|swapbacked)
          raw: 0057ffffc0040068 ffff88013c785141 0000000000000913 0000000200000001
          raw: ffffea0004d5f9e0 ffffea0004d53f60 0000000000000000 ffff88007d81b800
          page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
          page->mem_cgroup:ffff88007d81b800
          ------------[ cut here ]------------
          kernel BUG at /src/linux-dev/mm/rmap.c:1086!
          invalid opcode: 0000 [#1] SMP
          Modules linked in: ppdev parport_pc virtio_balloon i2c_piix4 pcspkr parport i2c_core acpi_cpufreq ip_tables xfs libcrc32c ata_generic pata_acpi ata_piix 8139too libata virtio_blk 8139cp crc32c_intel mii virtio_pci virtio_ring serio_raw virtio floppy dm_mirror dm_region_hash dm_log dm_mod
          CPU: 0 PID: 3162 Comm: bash Not tainted 4.11.0-rc2-mm1+ #1
          Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
          RIP: 0010:do_page_add_anon_rmap+0x1ba/0x260
          RSP: 0018:ffffc90002473b30 EFLAGS: 00010282
          RAX: 0000000000000021 RBX: ffffea0004d51180 RCX: 0000000000000006
          RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff88007dc0dfe0
          RBP: ffffc90002473b58 R08: 00000000fffffffe R09: 00000000000001c1
          R10: 0000000000000005 R11: 00000000000001c0 R12: ffff880139ab3d80
          R13: 0000000000000000 R14: 0000700000000200 R15: 0000160000000000
          FS:  00007f5195f50740(0000) GS:ffff88007dc00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 00007fd450287000 CR3: 000000007a08e000 CR4: 00000000001406f0
          Call Trace:
           page_add_anon_rmap+0x18/0x20
           remove_migration_pte+0x220/0x2c0
           rmap_walk_ksm+0x143/0x220
           rmap_walk+0x55/0x60
           remove_migration_ptes+0x53/0x80
           migrate_pages+0x8ed/0xb60
           soft_offline_page+0x309/0x8d0
           store_soft_offline_page+0xaf/0xf0
           dev_attr_store+0x18/0x30
           sysfs_kf_write+0x3a/0x50
           kernfs_fop_write+0xff/0x180
           __vfs_write+0x37/0x160
           vfs_write+0xb2/0x1b0
           SyS_write+0x55/0xc0
           do_syscall_64+0x67/0x180
           entry_SYSCALL64_slow_path+0x25/0x25
          RIP: 0033:0x7f51956339e0
          RSP: 002b:00007ffcfa0dffc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
          RAX: ffffffffffffffda RBX: 000000000000000c RCX: 00007f51956339e0
          RDX: 000000000000000c RSI: 00007f5195f53000 RDI: 0000000000000001
          RBP: 00007f5195f53000 R08: 000000000000000a R09: 00007f5195f50740
          R10: 000000000000000b R11: 0000000000000246 R12: 00007f5195907400
          R13: 000000000000000c R14: 0000000000000001 R15: 0000000000000000
          Code: fe ff ff 48 81 c2 00 02 00 00 48 89 55 d8 e8 2e c3 fd ff 48 8b 55 d8 e9 42 ff ff ff 48 c7 c6 e0 52 a1 81 48 89 df e8 46 ad fe ff <0f> 0b 48 83 e8 01 e9 7f fe ff ff 48 83 e8 01 e9 96 fe ff ff 48
          RIP: do_page_add_anon_rmap+0x1ba/0x260 RSP: ffffc90002473b30
          ---[ end trace a679d00f4af2df48 ]---
          Kernel panic - not syncing: Fatal exception
          Kernel Offset: disabled
          ---[ end Kernel panic - not syncing: Fatal exception
      
      The problem is in the following lines:
      
          new = page - pvmw.page->index +
              linear_page_index(vma, pvmw.address);
      
      The 'new' is calculated with 'page' which is given by the caller as a
      destination page and some offset adjustment for thp.  But this doesn't
      properly work for ksm pages because pvmw.page->index doesn't change for
      each address but linear_page_index() changes, which means that 'new'
      points to different pages for each addresses backed by the ksm page.  As
      a result, we try to set totally unrelated pages as destination pages,
      and that causes kernel crash.
      
      This patch fixes the miscalculation and makes ksm page migration work
      fine.
      
      Fixes: 3fe87967 ("mm: convert remove_migration_pte() to use page_vma_mapped_walk()")
      Link: http://lkml.kernel.org/r/1489717683-29905-1-git-send-email-n-horiguchi@ah.jp.nec.comSigned-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b0ece6f
  16. 02 3月, 2017 1 次提交
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> · 6e84f315
      Ingo Molnar 提交于
      We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/mm.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      The APIs that are going to be moved first are:
      
         mm_alloc()
         __mmdrop()
         mmdrop()
         mmdrop_async_fn()
         mmdrop_async()
         mmget_not_zero()
         mmput()
         mmput_async()
         get_task_mm()
         mm_access()
         mm_release()
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6e84f315
  17. 25 2月, 2017 2 次提交
  18. 26 12月, 2016 1 次提交
  19. 13 12月, 2016 2 次提交
    • J
      lib: radix-tree: check accounting of existing slot replacement users · 6d75f366
      Johannes Weiner 提交于
      The bug in khugepaged fixed earlier in this series shows that radix tree
      slot replacement is fragile; and it will become more so when not only
      NULL<->!NULL transitions need to be caught but transitions from and to
      exceptional entries as well.  We need checks.
      
      Re-implement radix_tree_replace_slot() on top of the sanity-checked
      __radix_tree_replace().  This requires existing callers to also pass the
      radix tree root, but it'll warn us when somebody replaces slots with
      contents that need proper accounting (transitions between NULL entries,
      real entries, exceptional entries) and where a replacement through the
      slot pointer would corrupt the radix tree node counts.
      
      Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Suggested-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <mawilcox@linuxonhyperv.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6d75f366
    • M
      mm, compaction: fix NR_ISOLATED_* stats for pfn based migration · 6afcf8ef
      Ming Ling 提交于
      Since commit bda807d4 ("mm: migrate: support non-lru movable page
      migration") isolate_migratepages_block) can isolate !PageLRU pages which
      would acct_isolated account as NR_ISOLATED_*.  Accounting these non-lru
      pages NR_ISOLATED_{ANON,FILE} doesn't make any sense and it can misguide
      heuristics based on those counters such as pgdat_reclaimable_pages resp.
      too_many_isolated which would lead to unexpected stalls during the
      direct reclaim without any good reason.  Note that
      __alloc_contig_migrate_range can isolate a lot of pages at once.
      
      On mobile devices such as 512M ram android Phone, it may use a big zram
      swap.  In some cases zram(zsmalloc) uses too many non-lru but
      migratedable pages, such as:
      
            MemTotal: 468148 kB
            Normal free:5620kB
            Free swap:4736kB
            Total swap:409596kB
            ZRAM: 164616kB(zsmalloc non-lru pages)
            active_anon:60700kB
            inactive_anon:60744kB
            active_file:34420kB
            inactive_file:37532kB
      
      Fix this by only accounting lru pages to NR_ISOLATED_* in
      isolate_migratepages_block right after they were isolated and we still
      know they were on LRU.  Drop acct_isolated because it is called after
      the fact and we've lost that information.  Batching per-cpu counter
      doesn't make much improvement anyway.  Also make sure that we uncharge
      only LRU pages when putting them back on the LRU in
      putback_movable_pages resp.  when unmap_and_move migrates the page.
      
      [mhocko@suse.com: replace acct_isolated() with direct counting]
      Fixes: bda807d4 ("mm: migrate: support non-lru movable page migration")
      Link: http://lkml.kernel.org/r/20161019080240.9682-1-mhocko@kernel.orgSigned-off-by: NMing Ling <ming.ling@spreadtrum.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <js1304@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6afcf8ef
  20. 08 10月, 2016 1 次提交
  21. 29 7月, 2016 6 次提交
    • V
      mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations · 25160354
      Vlastimil Babka 提交于
      After the previous patch, we can distinguish costly allocations that
      should be really lightweight, such as THP page faults, with
      __GFP_NORETRY.  This means we don't need to recognize khugepaged
      allocations via PF_KTHREAD anymore.  We can also change THP page faults
      in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
      khugepaged, as the process has indicated that it benefits from THP's and
      is willing to pay some initial latency costs.
      
      We can also make the flags handling less cryptic by distinguishing
      GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
      GFP_TRANSHUGE (only direct reclaim, khugepaged default).  Adding
      __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
      
      The patch effectively changes the current GFP_TRANSHUGE users as
      follows:
      
      * get_huge_zero_page() - the zero page lifetime should be relatively
        long and it's shared by multiple users, so it's worth spending some
        effort on it.  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
        This also restores direct reclaim to this allocation, which was
        unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
        by default to madvise and add a stall-free defrag option")
      
      * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
        is not an issue.  So if khugepaged "defrag" is enabled (the default), do
        reclaim via GFP_TRANSHUGE without __GFP_NORETRY.  We can remove the
        PF_KTHREAD check from page alloc.
      
        As a side-effect, khugepaged will now no longer check if the initial
        compaction was deferred or contended.  This is OK, as khugepaged sleep
        times between collapsion attempts are long enough to prevent noticeable
        disruption, so we should allow it to spend some effort.
      
      * migrate_misplaced_transhuge_page() - already was masking out
        __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
        equivalent.
      
      * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
        are now allocating without __GFP_NORETRY.  Other vma's keep using
        __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
        it's allowed only for madvised vma's).  The rest is conversion to
        GFP_TRANSHUGE(_LIGHT).
      
      [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
      Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      25160354
    • M
      mm: remove reclaim and compaction retry approximations · 5a1c84b4
      Mel Gorman 提交于
      If per-zone LRU accounting is available then there is no point
      approximating whether reclaim and compaction should retry based on pgdat
      statistics.  This is effectively a revert of "mm, vmstat: remove zone
      and node double accounting by approximating retries" with the difference
      that inactive/active stats are still available.  This preserves the
      history of why the approximation was retried and why it had to be
      reverted to handle OOM kills on 32-bit systems.
      
      Link: http://lkml.kernel.org/r/1469110261-7365-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5a1c84b4
    • M
      mm, vmstat: remove zone and node double accounting by approximating retries · bca67592
      Mel Gorman 提交于
      The number of LRU pages, dirty pages and writeback pages must be
      accounted for on both zones and nodes because of the reclaim retry
      logic, compaction retry logic and highmem calculations all depending on
      per-zone stats.
      
      Many lowmem allocations are immune from OOM kill due to a check in
      __alloc_pages_may_oom for (ac->high_zoneidx < ZONE_NORMAL) since commit
      03668b3c ("oom: avoid oom killer for lowmem allocations").  The
      exception is costly high-order allocations or allocations that cannot
      fail.  If the __alloc_pages_may_oom avoids OOM-kill for low-order lowmem
      allocations then it would fall through to __alloc_pages_direct_compact.
      
      This patch will blindly retry reclaim for zone-constrained allocations
      in should_reclaim_retry up to MAX_RECLAIM_RETRIES.  This is not ideal
      but without per-zone stats there are not many alternatives.  The impact
      it that zone-constrained allocations may delay before considering the
      OOM killer.
      
      As there is no guarantee enough memory can ever be freed to satisfy
      compaction, this patch avoids retrying compaction for zone-contrained
      allocations.
      
      In combination, that means that the per-node stats can be used when
      deciding whether to continue reclaim using a rough approximation.  While
      it is possible this will make the wrong decision on occasion, it will
      not infinite loop as the number of reclaim attempts is capped by
      MAX_RECLAIM_RETRIES.
      
      The final step is calculating the number of dirtyable highmem pages.  As
      those calculations only care about the global count of file pages in
      highmem.  This patch uses a global counter used instead of per-zone
      stats as it is sufficient.
      
      In combination, this allows the per-zone LRU and dirty state counters to
      be removed.
      
      [mgorman@techsingularity.net: fix acct_highmem_file_pages()]
        Link: http://lkml.kernel.org/r/1468853426-12858-4-git-send-email-mgorman@techsingularity.netLink: http://lkml.kernel.org/r/1467970510-21195-35-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Suggested by: Michal Hocko <mhocko@kernel.org>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bca67592
    • M
      mm: move most file-based accounting to the node · 11fb9989
      Mel Gorman 提交于
      There are now a number of accounting oddities such as mapped file pages
      being accounted for on the node while the total number of file pages are
      accounted on the zone.  This can be coped with to some extent but it's
      confusing so this patch moves the relevant file-based accounted.  Due to
      throttling logic in the page allocator for reliable OOM detection, it is
      still necessary to track dirty and writeback pages on a per-zone basis.
      
      [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
        Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
      Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11fb9989
    • M
      mm: rename NR_ANON_PAGES to NR_ANON_MAPPED · 4b9d0fab
      Mel Gorman 提交于
      NR_FILE_PAGES  is the number of        file pages.
      NR_FILE_MAPPED is the number of mapped file pages.
      NR_ANON_PAGES  is the number of mapped anon pages.
      
      This is unhelpful naming as it's easy to confuse NR_FILE_MAPPED and
      NR_ANON_PAGES for mapped pages.  This patch renames NR_ANON_PAGES so we
      have
      
      NR_FILE_PAGES  is the number of        file pages.
      NR_FILE_MAPPED is the number of mapped file pages.
      NR_ANON_MAPPED is the number of mapped anon pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-19-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b9d0fab
    • M
      mm, vmscan: move LRU lists to node · 599d0c95
      Mel Gorman 提交于
      This moves the LRU lists from the zone to the node and related data such
      as counters, tracing, congestion tracking and writeback tracking.
      
      Unfortunately, due to reclaim and compaction retry logic, it is
      necessary to account for the number of LRU pages on both zone and node
      logic.  Most reclaim logic is based on the node counters but the retry
      logic uses the zone counters which do not distinguish inactive and
      active sizes.  It would be possible to leave the LRU counters on a
      per-zone basis but it's a heavier calculation across multiple cache
      lines that is much more frequent than the retry checks.
      
      Other than the LRU counters, this is mostly a mechanical patch but note
      that it introduces a number of anomalies.  For example, the scans are
      per-zone but using per-node counters.  We also mark a node as congested
      when a zone is congested.  This causes weird problems that are fixed
      later but is easier to review.
      
      In the event that there is excessive overhead on 32-bit systems due to
      the nodes being on LRU then there are two potential solutions
      
      1. Long-term isolation of highmem pages when reclaim is lowmem
      
         When pages are skipped, they are immediately added back onto the LRU
         list. If lowmem reclaim persisted for long periods of time, the same
         highmem pages get continually scanned. The idea would be that lowmem
         keeps those pages on a separate list until a reclaim for highmem pages
         arrives that splices the highmem pages back onto the LRU. It potentially
         could be implemented similar to the UNEVICTABLE list.
      
         That would reduce the skip rate with the potential corner case is that
         highmem pages have to be scanned and reclaimed to free lowmem slab pages.
      
      2. Linear scan lowmem pages if the initial LRU shrink fails
      
         This will break LRU ordering but may be preferable and faster during
         memory pressure than skipping LRU pages.
      
      Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      599d0c95
  22. 27 7月, 2016 1 次提交