1. 16 11月, 2017 1 次提交
  2. 02 11月, 2017 1 次提交
    • G
      License cleanup: add SPDX GPL-2.0 license identifier to files with no license · b2441318
      Greg Kroah-Hartman 提交于
      Many source files in the tree are missing licensing information, which
      makes it harder for compliance tools to determine the correct license.
      
      By default all files without license information are under the default
      license of the kernel, which is GPL version 2.
      
      Update the files which contain no license information with the 'GPL-2.0'
      SPDX license identifier.  The SPDX identifier is a legally binding
      shorthand, which can be used instead of the full boiler plate text.
      
      This patch is based on work done by Thomas Gleixner and Kate Stewart and
      Philippe Ombredanne.
      
      How this work was done:
      
      Patches were generated and checked against linux-4.14-rc6 for a subset of
      the use cases:
       - file had no licensing information it it.
       - file was a */uapi/* one with no licensing information in it,
       - file was a */uapi/* one with existing licensing information,
      
      Further patches will be generated in subsequent months to fix up cases
      where non-standard license headers were used, and references to license
      had to be inferred by heuristics based on keywords.
      
      The analysis to determine which SPDX License Identifier to be applied to
      a file was done in a spreadsheet of side by side results from of the
      output of two independent scanners (ScanCode & Windriver) producing SPDX
      tag:value files created by Philippe Ombredanne.  Philippe prepared the
      base worksheet, and did an initial spot review of a few 1000 files.
      
      The 4.13 kernel was the starting point of the analysis with 60,537 files
      assessed.  Kate Stewart did a file by file comparison of the scanner
      results in the spreadsheet to determine which SPDX license identifier(s)
      to be applied to the file. She confirmed any determination that was not
      immediately clear with lawyers working with the Linux Foundation.
      
      Criteria used to select files for SPDX license identifier tagging was:
       - Files considered eligible had to be source code files.
       - Make and config files were included as candidates if they contained >5
         lines of source
       - File already had some variant of a license header in it (even if <5
         lines).
      
      All documentation files were explicitly excluded.
      
      The following heuristics were used to determine which SPDX license
      identifiers to apply.
      
       - when both scanners couldn't find any license traces, file was
         considered to have no license information in it, and the top level
         COPYING file license applied.
      
         For non */uapi/* files that summary was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0                                              11139
      
         and resulted in the first patch in this series.
      
         If that file was a */uapi/* path one, it was "GPL-2.0 WITH
         Linux-syscall-note" otherwise it was "GPL-2.0".  Results of that was:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|-------
         GPL-2.0 WITH Linux-syscall-note                        930
      
         and resulted in the second patch in this series.
      
       - if a file had some form of licensing information in it, and was one
         of the */uapi/* ones, it was denoted with the Linux-syscall-note if
         any GPL family license was found in the file or had no licensing in
         it (per prior point).  Results summary:
      
         SPDX license identifier                            # files
         ---------------------------------------------------|------
         GPL-2.0 WITH Linux-syscall-note                       270
         GPL-2.0+ WITH Linux-syscall-note                      169
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-2-Clause)    21
         ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause)    17
         LGPL-2.1+ WITH Linux-syscall-note                      15
         GPL-1.0+ WITH Linux-syscall-note                       14
         ((GPL-2.0+ WITH Linux-syscall-note) OR BSD-3-Clause)    5
         LGPL-2.0+ WITH Linux-syscall-note                       4
         LGPL-2.1 WITH Linux-syscall-note                        3
         ((GPL-2.0 WITH Linux-syscall-note) OR MIT)              3
         ((GPL-2.0 WITH Linux-syscall-note) AND MIT)             1
      
         and that resulted in the third patch in this series.
      
       - when the two scanners agreed on the detected license(s), that became
         the concluded license(s).
      
       - when there was disagreement between the two scanners (one detected a
         license but the other didn't, or they both detected different
         licenses) a manual inspection of the file occurred.
      
       - In most cases a manual inspection of the information in the file
         resulted in a clear resolution of the license that should apply (and
         which scanner probably needed to revisit its heuristics).
      
       - When it was not immediately clear, the license identifier was
         confirmed with lawyers working with the Linux Foundation.
      
       - If there was any question as to the appropriate license identifier,
         the file was flagged for further research and to be revisited later
         in time.
      
      In total, over 70 hours of logged manual review was done on the
      spreadsheet to determine the SPDX license identifiers to apply to the
      source files by Kate, Philippe, Thomas and, in some cases, confirmation
      by lawyers working with the Linux Foundation.
      
      Kate also obtained a third independent scan of the 4.13 code base from
      FOSSology, and compared selected files where the other two scanners
      disagreed against that SPDX file, to see if there was new insights.  The
      Windriver scanner is based on an older version of FOSSology in part, so
      they are related.
      
      Thomas did random spot checks in about 500 files from the spreadsheets
      for the uapi headers and agreed with SPDX license identifier in the
      files he inspected. For the non-uapi files Thomas did random spot checks
      in about 15000 files.
      
      In initial set of patches against 4.14-rc6, 3 files were found to have
      copy/paste license identifier errors, and have been fixed to reflect the
      correct identifier.
      
      Additionally Philippe spent 10 hours this week doing a detailed manual
      inspection and review of the 12,461 patched files from the initial patch
      version early this week with:
       - a full scancode scan run, collecting the matched texts, detected
         license ids and scores
       - reviewing anything where there was a license detected (about 500+
         files) to ensure that the applied SPDX license was correct
       - reviewing anything where there was no detection but the patch license
         was not GPL-2.0 WITH Linux-syscall-note to ensure that the applied
         SPDX license was correct
      
      This produced a worksheet with 20 files needing minor correction.  This
      worksheet was then exported into 3 different .csv files for the
      different types of files to be modified.
      
      These .csv files were then reviewed by Greg.  Thomas wrote a script to
      parse the csv files and add the proper SPDX tag to the file, in the
      format that the file expected.  This script was further refined by Greg
      based on the output to detect more types of files automatically and to
      distinguish between header and source .c files (which need different
      comment types.)  Finally Greg ran the script using the .csv files to
      generate the patches.
      Reviewed-by: NKate Stewart <kstewart@linuxfoundation.org>
      Reviewed-by: NPhilippe Ombredanne <pombredanne@nexb.com>
      Reviewed-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b2441318
  3. 11 8月, 2017 2 次提交
    • M
      mm: make tlb_flush_pending global · 0a2dd266
      Minchan Kim 提交于
      Currently, tlb_flush_pending is used only for CONFIG_[NUMA_BALANCING|
      COMPACTION] but upcoming patches to solve subtle TLB flush batching
      problem will use it regardless of compaction/NUMA so this patch doesn't
      remove the dependency.
      
      [akpm@linux-foundation.org: remove more ifdefs from world's ugliest printk statement]
      Link: http://lkml.kernel.org/r/20170802000818.4760-6-namit@vmware.comSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0a2dd266
    • N
      mm: migrate: prevent racy access to tlb_flush_pending · 16af97dc
      Nadav Amit 提交于
      Patch series "fixes of TLB batching races", v6.
      
      It turns out that Linux TLB batching mechanism suffers from various
      races.  Races that are caused due to batching during reclamation were
      recently handled by Mel and this patch-set deals with others.  The more
      fundamental issue is that concurrent updates of the page-tables allow
      for TLB flushes to be batched on one core, while another core changes
      the page-tables.  This other core may assume a PTE change does not
      require a flush based on the updated PTE value, while it is unaware that
      TLB flushes are still pending.
      
      This behavior affects KSM (which may result in memory corruption) and
      MADV_FREE and MADV_DONTNEED (which may result in incorrect behavior).  A
      proof-of-concept can easily produce the wrong behavior of MADV_DONTNEED.
      Memory corruption in KSM is harder to produce in practice, but was
      observed by hacking the kernel and adding a delay before flushing and
      replacing the KSM page.
      
      Finally, there is also one memory barrier missing, which may affect
      architectures with weak memory model.
      
      This patch (of 7):
      
      Setting and clearing mm->tlb_flush_pending can be performed by multiple
      threads, since mmap_sem may only be acquired for read in
      task_numa_work().  If this happens, tlb_flush_pending might be cleared
      while one of the threads still changes PTEs and batches TLB flushes.
      
      This can lead to the same race between migration and
      change_protection_range() that led to the introduction of
      tlb_flush_pending.  The result of this race was data corruption, which
      means that this patch also addresses a theoretically possible data
      corruption.
      
      An actual data corruption was not observed, yet the race was was
      confirmed by adding assertion to check tlb_flush_pending is not set by
      two threads, adding artificial latency in change_protection_range() and
      using sysctl to reduce kernel.numa_balancing_scan_delay_ms.
      
      Link: http://lkml.kernel.org/r/20170802000818.4760-2-namit@vmware.com
      Fixes: 20841405 ("mm: fix TLB flush race between migration, and
      change_protection_range")
      Signed-off-by: NNadav Amit <namit@vmware.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Russell King <linux@armlinux.org.uk>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16af97dc
  4. 13 12月, 2016 1 次提交
  5. 08 10月, 2016 1 次提交
  6. 20 9月, 2016 1 次提交
  7. 18 3月, 2016 1 次提交
    • J
      mm: introduce page reference manipulation functions · fe896d18
      Joonsoo Kim 提交于
      The success of CMA allocation largely depends on the success of
      migration and key factor of it is page reference count.  Until now, page
      reference is manipulated by direct calling atomic functions so we cannot
      follow up who and where manipulate it.  Then, it is hard to find actual
      reason of CMA allocation failure.  CMA allocation should be guaranteed
      to succeed so finding offending place is really important.
      
      In this patch, call sites where page reference is manipulated are
      converted to introduced wrapper function.  This is preparation step to
      add tracepoint to each page reference manipulation function.  With this
      facility, we can easily find reason of CMA allocation failure.  There is
      no functional change in this patch.
      
      In addition, this patch also converts reference read sites.  It will
      help a second step that renames page._count to something else and
      prevents later attempt to direct access to it (Suggested by Andrew).
      Signed-off-by: NJoonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: NMichal Nazarewicz <mina86@mina86.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      fe896d18
  8. 16 3月, 2016 6 次提交
    • V
      mm, debug: move bad flags printing to bad_page() · ff8e8116
      Vlastimil Babka 提交于
      Since bad_page() is the only user of the badflags parameter of
      dump_page_badflags(), we can move the code to bad_page() and simplify a
      bit.
      
      The dump_page_badflags() function is renamed to __dump_page() and can
      still be called separately from dump_page() for temporary debug prints
      where page_owner info is not desired.
      
      The only user-visible change is that page->mem_cgroup is printed before
      the bad flags.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ff8e8116
    • V
      mm, page_owner: dump page owner info from dump_page() · 4e462112
      Vlastimil Babka 提交于
      The page_owner mechanism is useful for dealing with memory leaks.  By
      reading /sys/kernel/debug/page_owner one can determine the stack traces
      leading to allocations of all pages, and find e.g.  a buggy driver.
      
      This information might be also potentially useful for debugging, such as
      the VM_BUG_ON_PAGE() calls to dump_page().  So let's print the stored
      info from dump_page().
      
      Example output:
      
        page:ffffea000292f1c0 count:1 mapcount:0 mapping:ffff8800b2f6cc18 index:0x91d
        flags: 0x1fffff8001002c(referenced|uptodate|lru|mappedtodisk)
        page dumped because: VM_BUG_ON_PAGE(1)
        page->mem_cgroup:ffff8801392c5000
        page allocated via order 0, migratetype Movable, gfp_mask 0x24213ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD|__GFP_NOWARN|__GFP_NORETRY)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b40c8>] alloc_pages_current+0x88/0x120
         [<ffffffff8115e386>] __page_cache_alloc+0xe6/0x120
         [<ffffffff8116ba6c>] __do_page_cache_readahead+0xdc/0x240
         [<ffffffff8116bd05>] ondemand_readahead+0x135/0x260
         [<ffffffff8116be9c>] page_cache_async_readahead+0x6c/0x70
         [<ffffffff811604c2>] generic_file_read_iter+0x3f2/0x760
         [<ffffffff811e0dc7>] __vfs_read+0xa7/0xd0
        page has been migrated, last migrate reason: compaction
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e462112
    • V
      mm, page_owner: track and print last migrate reason · 7cd12b4a
      Vlastimil Babka 提交于
      During migration, page_owner info is now copied with the rest of the
      page, so the stacktrace leading to free page allocation during migration
      is overwritten.  For debugging purposes, it might be however useful to
      know that the page has been migrated since its initial allocation.  This
      might happen many times during the lifetime for different reasons and
      fully tracking this, especially with stacktraces would incur extra
      memory costs.  As a compromise, store and print the migrate_reason of
      the last migration that occurred to the page.  This is enough to
      distinguish compaction, numa balancing etc.
      
      Example page_owner entry after the patch:
      
        Page allocated via order 0, mask 0x24200ca(GFP_HIGHUSER_MOVABLE)
        PFN 628753 type Movable Block 1228 type Movable Flags 0x1fffff80040030(dirty|lru|swapbacked)
         [<ffffffff811682c4>] __alloc_pages_nodemask+0x134/0x230
         [<ffffffff811b6325>] alloc_pages_vma+0xb5/0x250
         [<ffffffff81177491>] shmem_alloc_page+0x61/0x90
         [<ffffffff8117a438>] shmem_getpage_gfp+0x678/0x960
         [<ffffffff8117c2b9>] shmem_fallocate+0x329/0x440
         [<ffffffff811de600>] vfs_fallocate+0x140/0x230
         [<ffffffff811df434>] SyS_fallocate+0x44/0x70
         [<ffffffff8158cc2e>] entry_SYSCALL_64_fastpath+0x12/0x71
        Page has been migrated, last migrate reason: compaction
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cd12b4a
    • V
      mm, debug: replace dump_flags() with the new printk formats · b8eceeb9
      Vlastimil Babka 提交于
      With the new printk format strings for flags, we can get rid of
      dump_flags() in mm/debug.c.
      
      This also fixes dump_vma() which used dump_flags() for printing vma
      flags.  However dump_flags() did a page-flags specific filtering of bits
      higher than NR_PAGEFLAGS in order to remove the zone id part.  For
      dump_vma() this resulted in removing several VM_* flags from the
      symbolic translation.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8eceeb9
    • V
      mm, printk: introduce new format string for flags · edf14cdb
      Vlastimil Babka 提交于
      In mm we use several kinds of flags bitfields that are sometimes printed
      for debugging purposes, or exported to userspace via sysfs.  To make
      them easier to interpret independently on kernel version and config, we
      want to dump also the symbolic flag names.  So far this has been done
      with repeated calls to pr_cont(), which is unreliable on SMP, and not
      usable for e.g.  sysfs export.
      
      To get a more reliable and universal solution, this patch extends
      printk() format string for pointers to handle the page flags (%pGp),
      gfp_flags (%pGg) and vma flags (%pGv).  Existing users of
      dump_flag_names() are converted and simplified.
      
      It would be possible to pass flags by value instead of pointer, but the
      %p format string for pointers already has extensions for various kernel
      structures, so it's a good fit, and the extra indirection in a
      non-critical path is negligible.
      
      [linux@rasmusvillemoes.dk: lots of good implementation suggestions]
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edf14cdb
    • V
      mm, tracing: unify mm flags handling in tracepoints and printk · 420adbe9
      Vlastimil Babka 提交于
      In tracepoints, it's possible to print gfp flags in a human-friendly
      format through a macro show_gfp_flags(), which defines a translation
      array and passes is to __print_flags().  Since the following patch will
      introduce support for gfp flags printing in printk(), it would be nice
      to reuse the array.  This is not straightforward, since __print_flags()
      can't simply reference an array defined in a .c file such as mm/debug.c
      - it has to be a macro to allow the macro magic to communicate the
      format to userspace tools such as trace-cmd.
      
      The solution is to create a macro __def_gfpflag_names which is used both
      in show_gfp_flags(), and to define the gfpflag_names[] array in
      mm/debug.c.
      
      On the other hand, mm/debug.c also defines translation tables for page
      flags and vma flags, and desire was expressed (but not implemented in
      this series) to use these also from tracepoints.  Thus, this patch also
      renames the events/gfpflags.h file to events/mmflags.h and moves the
      table definitions there, using the same macro approach as for gfpflags.
      This allows translating all three kinds of mm-specific flags both in
      tracepoints and printk.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NMichal Hocko <mhocko@suse.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      420adbe9
  9. 16 1月, 2016 2 次提交
    • K
      mm: rework mapcount accounting to enable 4k mapping of THPs · 53f9263b
      Kirill A. Shutemov 提交于
      We're going to allow mapping of individual 4k pages of THP compound.  It
      means we need to track mapcount on per small page basis.
      
      Straight-forward approach is to use ->_mapcount in all subpages to track
      how many time this subpage is mapped with PMDs or PTEs combined.  But
      this is rather expensive: mapping or unmapping of a THP page with PMD
      would require HPAGE_PMD_NR atomic operations instead of single we have
      now.
      
      The idea is to store separately how many times the page was mapped as
      whole -- compound_mapcount.  This frees up ->_mapcount in subpages to
      track PTE mapcount.
      
      We use the same approach as with compound page destructor and compound
      order to store compound_mapcount: use space in first tail page,
      ->mapping this time.
      
      Any time we map/unmap whole compound page (THP or hugetlb) -- we
      increment/decrement compound_mapcount.  When we map part of compound
      page with PTE we operate on ->_mapcount of the subpage.
      
      page_mapcount() counts both: PTE and PMD mappings of the page.
      
      Basically, we have mapcount for a subpage spread over two counters.  It
      makes tricky to detect when last mapcount for a page goes away.
      
      We introduced PageDoubleMap() for this.  When we split THP PMD for the
      first time and there's other PMD mapping left we offset up ->_mapcount
      in all subpages by one and set PG_double_map on the compound page.
      These additional references go away with last compound_mapcount.
      
      This approach provides a way to detect when last mapcount goes away on
      per small page basis without introducing new overhead for most common
      cases.
      
      [akpm@linux-foundation.org: fix typo in comment]
      [mhocko@suse.com: ignore partial THP when moving task]
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Jerome Marchand <jmarchan@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      53f9263b
    • K
      mm, thp: remove compound_lock() · 3ac808fd
      Kirill A. Shutemov 提交于
      We are going to use migration entries to stabilize page counts.  It
      means we don't need compound_lock() for that.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: NSasha Levin <sasha.levin@oracle.com>
      Tested-by: NAneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3ac808fd
  10. 15 1月, 2016 1 次提交
    • K
      mm: rework virtual memory accounting · 84638335
      Konstantin Khlebnikov 提交于
      When inspecting a vague code inside prctl(PR_SET_MM_MEM) call (which
      testing the RLIMIT_DATA value to figure out if we're allowed to assign
      new @start_brk, @brk, @start_data, @end_data from mm_struct) it's been
      commited that RLIMIT_DATA in a form it's implemented now doesn't do
      anything useful because most of user-space libraries use mmap() syscall
      for dynamic memory allocations.
      
      Linus suggested to convert RLIMIT_DATA rlimit into something suitable
      for anonymous memory accounting.  But in this patch we go further, and
      the changes are bundled together as:
      
       * keep vma counting if CONFIG_PROC_FS=n, will be used for limits
       * replace mm->shared_vm with better defined mm->data_vm
       * account anonymous executable areas as executable
       * account file-backed growsdown/up areas as stack
       * drop struct file* argument from vm_stat_account
       * enforce RLIMIT_DATA for size of data areas
      
      This way code looks cleaner: now code/stack/data classification depends
      only on vm_flags state:
      
       VM_EXEC & ~VM_WRITE            -> code  (VmExe + VmLib in proc)
       VM_GROWSUP | VM_GROWSDOWN      -> stack (VmStk)
       VM_WRITE & ~VM_SHARED & !stack -> data  (VmData)
      
      The rest (VmSize - VmData - VmStk - VmExe - VmLib) could be called
      "shared", but that might be strange beast like readonly-private or VM_IO
      area.
      
       - RLIMIT_AS            limits whole address space "VmSize"
       - RLIMIT_STACK         limits stack "VmStk" (but each vma individually)
       - RLIMIT_DATA          now limits "VmData"
      Signed-off-by: NKonstantin Khlebnikov <koct9i@gmail.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Quentin Casasnovas <quentin.casasnovas@oracle.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Willy Tarreau <w@1wt.eu>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Kees Cook <keescook@google.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: Pavel Emelyanov <xemul@virtuozzo.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      84638335
  11. 07 11月, 2015 1 次提交
    • K
      mm: make compound_head() robust · 1d798ca3
      Kirill A. Shutemov 提交于
      Hugh has pointed that compound_head() call can be unsafe in some
      context. There's one example:
      
      	CPU0					CPU1
      
      isolate_migratepages_block()
        page_count()
          compound_head()
            !!PageTail() == true
      					put_page()
      					  tail->first_page = NULL
            head = tail->first_page
      					alloc_pages(__GFP_COMP)
      					   prep_compound_page()
      					     tail->first_page = head
      					     __SetPageTail(p);
            !!PageTail() == true
          <head == NULL dereferencing>
      
      The race is pure theoretical. I don't it's possible to trigger it in
      practice. But who knows.
      
      We can fix the race by changing how encode PageTail() and compound_head()
      within struct page to be able to update them in one shot.
      
      The patch introduces page->compound_head into third double word block in
      front of compound_dtor and compound_order. Bit 0 encodes PageTail() and
      the rest bits are pointer to head page if bit zero is set.
      
      The patch moves page->pmd_huge_pte out of word, just in case if an
      architecture defines pgtable_t into something what can have the bit 0
      set.
      
      hugetlb_cgroup uses page->lru.next in the second tail page to store
      pointer struct hugetlb_cgroup. The patch switch it to use page->private
      in the second tail page instead. The space is free since ->first_page is
      removed from the union.
      
      The patch also opens possibility to remove HUGETLB_CGROUP_MIN_ORDER
      limitation, since there's now space in first tail page to store struct
      hugetlb_cgroup pointer. But that's out of scope of the patch.
      
      That means page->compound_head shares storage space with:
      
       - page->lru.next;
       - page->next;
       - page->rcu_head.next;
      
      That's too long list to be absolutely sure, but looks like nobody uses
      bit 0 of the word.
      
      page->rcu_head.next guaranteed[1] to have bit 0 clean as long as we use
      call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But future
      call_rcu_lazy() is not allowed as it makes use of the bit and we can
      get false positive PageTail().
      
      [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.comSigned-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d798ca3
  12. 06 11月, 2015 1 次提交
    • E
      mm: introduce VM_LOCKONFAULT · de60f5f1
      Eric B Munson 提交于
      The cost of faulting in all memory to be locked can be very high when
      working with large mappings.  If only portions of the mapping will be used
      this can incur a high penalty for locking.
      
      For the example of a large file, this is the usage pattern for a large
      statical language model (probably applies to other statical or graphical
      models as well).  For the security example, any application transacting in
      data that cannot be swapped out (credit card data, medical records, etc).
      
      This patch introduces the ability to request that pages are not
      pre-faulted, but are placed on the unevictable LRU when they are finally
      faulted in.  The VM_LOCKONFAULT flag will be used together with VM_LOCKED
      and has no effect when set without VM_LOCKED.  Setting the VM_LOCKONFAULT
      flag for a VMA will cause pages faulted into that VMA to be added to the
      unevictable LRU when they are faulted or if they are already present, but
      will not cause any missing pages to be faulted in.
      
      Exposing this new lock state means that we cannot overload the meaning of
      the FOLL_POPULATE flag any longer.  Prior to this patch it was used to
      mean that the VMA for a fault was locked.  This means we need the new
      FOLL_MLOCK flag to communicate the locked state of a VMA.  FOLL_POPULATE
      will now only control if the VMA should be populated and in the case of
      VM_LOCKONFAULT, it will not be set.
      Signed-off-by: NEric B Munson <emunson@akamai.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Guenter Roeck <linux@roeck-us.net>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Shuah Khan <shuahkh@osg.samsung.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      de60f5f1
  13. 11 9月, 2015 1 次提交
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  14. 14 5月, 2015 1 次提交
  15. 12 2月, 2015 1 次提交
    • K
      mm: account pmd page tables to the process · dc6c9a35
      Kirill A. Shutemov 提交于
      Dave noticed that unprivileged process can allocate significant amount of
      memory -- >500 MiB on x86_64 -- and stay unnoticed by oom-killer and
      memory cgroup.  The trick is to allocate a lot of PMD page tables.  Linux
      kernel doesn't account PMD tables to the process, only PTE.
      
      The use-cases below use few tricks to allocate a lot of PMD page tables
      while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		prctl(PR_SET_THP_DISABLE);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
      			if (addr == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by account PMD tables to the process the
      same way we account PTE.
      
      The main place where PMD tables is accounted is __pmd_alloc() and
      free_pmd_range(). But there're few corner cases:
      
       - HugeTLB can share PMD page tables. The patch handles by accounting
         the table to all processes who share it.
      
       - x86 PAE pre-allocates few PMD tables on fork.
      
       - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
         check on exit(2).
      
      Accounting only happens on configuration where PMD page table's level is
      present (PMD is not folded).  As with nr_ptes we use per-mm counter.  The
      counter value is used to calculate baseline for badness score by
      oom-killer.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: NSedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
  16. 11 2月, 2015 1 次提交
  17. 11 12月, 2014 1 次提交
  18. 10 10月, 2014 3 次提交