1. 11 September 2015 (15 commits)
    • mm, mpx: add "vm_flags_t vm_flags" arg to do_mmap_pgoff() · 1fcfd8db
      Committed by Oleg Nesterov
      Add the additional "vm_flags_t vm_flags" argument to do_mmap_pgoff(),
      rename it to do_mmap(), and re-introduce do_mmap_pgoff() as a simple
      wrapper on top of do_mmap().  Perhaps we should update the callers of
      do_mmap_pgoff() and kill it later.
      
      This way mpx_mmap() can simply call do_mmap(vm_flags => VM_MPX) without
      playing with vm internals.  A rough sketch of the resulting wrapper
      follows this entry.
      
      After this change mmap_region() has a single user outside of mmap.c,
      arch/tile/mm/elf.c:arch_setup_additional_pages().  It would be nice to
      change arch/tile/ and unexport mmap_region().
      
      [kirill@shutemov.name: fix build]
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Acked-by: NDave Hansen <dave.hansen@linux.intel.com>
      Tested-by: NDave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1fcfd8db
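      A rough sketch of the wrapper shape described above, assuming the
      mainline prototypes of the time (argument order illustrative, not
      verbatim):

          /* do_mmap() now takes vm_flags directly; do_mmap_pgoff() survives
           * as a thin wrapper that passes vm_flags == 0. */
          static inline unsigned long
          do_mmap_pgoff(struct file *file, unsigned long addr,
                        unsigned long len, unsigned long prot,
                        unsigned long flags, unsigned long pgoff,
                        unsigned long *populate)
          {
                  return do_mmap(file, addr, len, prot, flags, 0,
                                 pgoff, populate);
          }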
    • kexec: split kexec_load syscall from kexec core code · 2965faa5
      Committed by Dave Young
      There are two kexec load syscalls: kexec_load and kexec_file_load.
      kexec_file_load has already been split out into kernel/kexec_file.c.  In
      this patch I split the kexec_load syscall code out into kernel/kexec.c.
      
      Also add a new kconfig option, KEXEC_CORE, so we can disable kexec_load
      and use kexec_file_load only, or vice versa.
      
      The original requirement came from Ted Ts'o: he wants the kexec kernel
      signature to be checked when CONFIG_KEXEC_VERIFY_SIG is enabled.  But
      kexec-tools can bypass that check by using the kexec_load syscall.
      
      Vivek Goyal proposed to create a common kconfig option so user can compile
      in only one syscall for loading kexec kernel.  KEXEC/KEXEC_FILE selects
      KEXEC_CORE so that old config files still work.
      
      Because some generic code needs CONFIG_KEXEC_CORE, I updated all the
      architecture Kconfig files with the new KEXEC_CORE option and made KEXEC
      select KEXEC_CORE in the arch Kconfigs.  The generic kernel code was
      also updated accordingly for the kexec_load syscall.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2965faa5
    • kexec: split kexec_file syscall code to kexec_file.c · a43cac0d
      Committed by Dave Young
      Split kexec_file syscall related code to another file kernel/kexec_file.c
      so that the #ifdef CONFIG_KEXEC_FILE in kexec.c can be dropped.
      
      Shared variables and functions are moved to kernel/kexec_internal.h, per
      suggestions from Vivek and Petr.
      
      [akpm@linux-foundation.org: fix bisectability]
      [akpm@linux-foundation.org: declare the various arch_kexec functions]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NDave Young <dyoung@redhat.com>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Petr Tesarik <ptesarik@suse.cz>
      Cc: Theodore Ts'o <tytso@mit.edu>
      Cc: Josh Boyer <jwboyer@fedoraproject.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a43cac0d
    • seq_file: provide an analogue of print_hex_dump() · 37607102
      Committed by Andy Shevchenko
      This introduces a new helper and switches the current users over to it.
      All patches are compile tested; kmemleak is tested via its own test suite.
      
      This patch (of 6):
      
      The new seq_hex_dump() is a complete analogue of print_hex_dump().
      
      We already have a few users of this functionality; the helper lets them
      drop their open-coded versions.  A usage sketch follows this entry.
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Joe Perches <joe@perches.com>
      Cc: Tadeusz Struk <tadeusz.struk@intel.com>
      Cc: Helge Deller <deller@gmx.de>
      Cc: Ingo Tuchscherer <ingo.tuchscherer@de.ibm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Vladimir Kondratiev <qca_vkondrat@qca.qualcomm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      37607102
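      A minimal usage sketch, assuming the print_hex_dump()-style parameter
      order (prefix string, prefix type, row size, group size, buffer, length,
      ascii flag); foo_show(), data and data_len are placeholders:

          /* Inside a seq_file ->show() callback: dump a buffer in hex,
           * 16 bytes per row, with an ASCII column. */
          static int foo_show(struct seq_file *m, void *v)
          {
                  seq_hex_dump(m, "buf: ", DUMP_PREFIX_OFFSET, 16, 1,
                               data, data_len, true);
                  return 0;
          }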
    • kmod: use system_unbound_wq instead of khelper · 90f02303
      Committed by Frederic Weisbecker
      We need to launch the usermodehelper kernel threads with the widest
      affinity and this is partly why we use khelper.  This workqueue has
      unbound properties and thus a wide affinity inherited by all its children.
      
      Now khelper also has special properties that we aren't much interested
      in: it is ordered and singlethreaded.  There is really no need for
      ordering, as all we do is create kernel threads, which can be done
      concurrently.  And singlethreading is a useless limitation as well.
      
      The workqueue engine already proposes generic unbound workqueues that
      don't share these useless properties and handle well parallel jobs.
      
      The only worrisome specific is their affinity to the node of the current
      CPU.  That is fine for creating the usermodehelper kernel threads
      themselves, but those threads inherit this affinity for longer jobs such
      as requesting modules.

      This patch proposes to use these node-affine unbound workqueues, assuming
      that a single node is sufficient to handle several parallel
      usermodehelper requests.  The gist of the change is sketched after this
      entry.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Rik van Riel <riel@redhat.com>
      Reviewed-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      90f02303
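      The gist of the change, sketched on the assumption that
      call_usermodehelper_exec() simply queues its work item (illustrative,
      not the exact diff):

          /* Before: the dedicated ordered, singlethreaded khelper workqueue. */
          queue_work(khelper_wq, &sub_info->work);

          /* After: the generic node-affine unbound workqueue. */
          queue_work(system_unbound_wq, &sub_info->work);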
    • lib/string_helpers: rename "esc" arg to "only" · b40bdb7f
      Committed by Kees Cook
      To further clarify the purpose of the "esc" argument, rename it to "only"
      to reflect that it is a limit, not a list of additional characters to
      escape.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Suggested-by: NRasmus Villemoes <linux@rasmusvillemoes.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b40bdb7f
    • hexdump: do not print debug dumps for !CONFIG_DEBUG · cdf17449
      Committed by Linus Walleij
      print_hex_dump_debug() is likely supposed to be analogous to pr_debug()
      or dev_dbg() and friends.  Currently it will adhere to dynamic debug, but
      it will not stub out the prints if CONFIG_DEBUG is not set.  Let's make
      it do the right thing, because I am tired of having my dmesg buffer full
      of hex dumps on production systems.  A sketch of the intended behaviour
      follows this entry.
      Signed-off-by: NLinus Walleij <linus.walleij@linaro.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cdf17449
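      A sketch of the intended behaviour in <linux/printk.h> terms (simplified,
      assumed shape):

          #if defined(CONFIG_DYNAMIC_DEBUG)
          #define print_hex_dump_debug(prefix_str, prefix_type, rowsize,       \
                                       groupsize, buf, len, ascii)             \
                  dynamic_hex_dump(prefix_str, prefix_type, rowsize,           \
                                   groupsize, buf, len, ascii)
          #elif defined(DEBUG)
          #define print_hex_dump_debug(prefix_str, prefix_type, rowsize,       \
                                       groupsize, buf, len, ascii)             \
                  print_hex_dump(KERN_DEBUG, prefix_str, prefix_type, rowsize, \
                                 groupsize, buf, len, ascii)
          #else
          /* Neither dynamic debug nor DEBUG: compile the call away. */
          static inline void print_hex_dump_debug(const char *prefix_str,
                                                  int prefix_type, int rowsize,
                                                  int groupsize, const void *buf,
                                                  size_t len, bool ascii)
          {
          }
          #endif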
    • include/linux/printk.h: include pr_fmt in pr_debug_ratelimited · 515a9adc
      Committed by Jason A. Donenfeld
      The other two implementations of pr_debug_ratelimited include pr_fmt, as
      does every other pr_* function.  But the CONFIG_DYNAMIC_DEBUG
      implementation of pr_debug_ratelimited forgot to add it.

      This patch unifies the behavior; a simplified sketch follows this entry.
      Signed-off-by: NJason A. Donenfeld <Jason@zx2c4.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      515a9adc
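      A simplified sketch: the non-dynamic-debug variant already routes the
      format string through pr_fmt(), and the dynamic-debug variant is brought
      in line with it (this shows the simpler form, not the exact dynamic-debug
      macro):

          /* The format is wrapped in pr_fmt() so a per-file prefix defined via
           * "#define pr_fmt(fmt) ..." is applied before rate limiting. */
          #define pr_debug_ratelimited(fmt, ...) \
                  printk_ratelimited(KERN_DEBUG pr_fmt(fmt), ##__VA_ARGS__)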
    • include/linux/poison.h: remove not-used poison pointer macros · 8b839635
      Committed by Vasily Kulikov
      Signed-off-by: NVasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8b839635
    • include/linux/poison.h: fix LIST_POISON{1,2} offset · 8a5e5e02
      Committed by Vasily Kulikov
      Poison pointer values should be small enough to find a room in
      non-mmap'able/hardly-mmap'able space.  E.g.  on x86 "poison pointer space"
      is located starting from 0x0.  Given unprivileged users cannot mmap
      anything below mmap_min_addr, it should be safe to use poison pointers
      lower than mmap_min_addr.
      
      The current poison pointer values of LIST_POISON{1,2} might be too big
      for mmap_min_addr values equal to or less than 1 MB (a common case; e.g.
      Ubuntu uses only 0x10000).  There is little point in using such a big
      value given that the "poison pointer space" below 1 MB is not yet
      exhausted.  Changing it to a smaller value solves the problem for small
      mmap_min_addr setups; a sketch of the resulting definitions follows this
      entry.
      
      The values are suggested by Solar Designer:
      http://www.openwall.com/lists/oss-security/2015/05/02/6
      Signed-off-by: Vasily Kulikov <segoon@openwall.com>
      Cc: Solar Designer <solar@openwall.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8a5e5e02
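      A sketch of the resulting definitions in include/linux/poison.h (offsets
      as suggested in the referenced thread; treat the exact values as
      illustrative):

          /* Small enough to stay below typical mmap_min_addr settings,
           * optionally shifted by an architecture-specific delta. */
          #define LIST_POISON1  ((void *) 0x100 + POISON_POINTER_DELTA)
          #define LIST_POISON2  ((void *) 0x200 + POISON_POINTER_DELTA)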
    • mm: introduce idle page tracking · 33c3fc71
      Committed by Vladimir Davydov
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace, by setting the bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
    • mmu-notifier: add clear_young callback · 1d7715c6
      Committed by Vladimir Davydov
      In the scope of the idle memory tracking feature, which is introduced by
      the following patch, we need to clear the referenced/accessed bit not
      only in primary, but also in secondary ptes.  The latter is required in
      order to estimate the working set size of KVM VMs.  At the same time we
      want to avoid flushing the TLB, because it is quite expensive and it
      won't really affect the final result.
      
      Currently, there is no function for clearing the pte young bit that
      would meet our requirements, so this patch introduces one.  To achieve
      that we have to add a new mmu-notifier callback, clear_young, since
      there is no method for testing-and-clearing a secondary pte without
      flushing the TLB.  The new method is not mandatory and is currently only
      implemented by KVM.  A sketch of the callback follows this entry.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Acked-by: NPaolo Bonzini <pbonzini@redhat.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1d7715c6
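      A sketch of the new callback as it would sit in struct mmu_notifier_ops
      (signature assumed to mirror clear_flush_young, minus the TLB flush):

          struct mmu_notifier_ops {
                  /* ... existing callbacks ... */

                  /* Clear the accessed bit in secondary (e.g. KVM) mappings
                   * for the given range without flushing the TLB.  Optional;
                   * returns non-zero if any pte in the range was young. */
                  int (*clear_young)(struct mmu_notifier *mn,
                                     struct mm_struct *mm,
                                     unsigned long start,
                                     unsigned long end);
          };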
    • memcg: zap try_get_mem_cgroup_from_page · e993d905
      Committed by Vladimir Davydov
      It is only used in mem_cgroup_try_charge, so fold it in and zap it.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e993d905
    • memcg: add page_cgroup_ino helper · 2fc04524
      Committed by Vladimir Davydov
      This patchset introduces a new user API for tracking user memory pages
      that have not been used for a given period of time.  The purpose of this
      is to provide the userspace with the means of tracking a workload's
      working set, i.e.  the set of pages that are actively used by the
      workload.  Knowing the working set size can be useful for partitioning the
      system more efficiently, e.g.  by tuning memory cgroup limits
      appropriately, or for job placement within a compute cluster.
      
      ==== USE CASES ====
      
      The unified cgroup hierarchy has memory.low and memory.high knobs, which
      are defined as the low and high boundaries for the workload working set
      size.  However, the working set size of a workload may be unknown or
      change in time.  With this patch set, one can periodically estimate the
      amount of memory unused by each cgroup and tune their memory.low and
      memory.high parameters accordingly, therefore optimizing the overall
      memory utilization.
      
      Another use case is balancing workloads within a compute cluster.  Knowing
      how much memory is not really used by a workload unit may help take a more
      optimal decision when considering migrating the unit to another node
      within the cluster.
      
      Also, as noted by Minchan, this would be useful for per-process reclaim
      (https://lwn.net/Articles/545668/).  With idle tracking, a smart
      userspace memory manager could reclaim only the idle pages.
      
      ==== USER API ====
      
      The user API consists of two new files:
      
       * /sys/kernel/mm/page_idle/bitmap.  This file implements a bitmap where each
         bit corresponds to a page, indexed by PFN. When the bit is set, the
         corresponding page is idle. A page is considered idle if it has not been
         accessed since it was marked idle. To mark a page idle one should set the
         bit corresponding to the page by writing to the file. A value written to the
         file is OR-ed with the current bitmap value. Only user memory pages can be
         marked idle, for other page types input is silently ignored. Writing to this
         file beyond max PFN results in the ENXIO error. Only available when
         CONFIG_IDLE_PAGE_TRACKING is set.
      
         This file can be used to estimate the amount of pages that are not
         used by a particular workload as follows:
      
         1. mark all pages of interest idle by setting corresponding bits in the
            /sys/kernel/mm/page_idle/bitmap
         2. wait until the workload accesses its working set
         3. read /sys/kernel/mm/page_idle/bitmap and count the number of bits set
      
       * /proc/kpagecgroup.  This file contains a 64-bit inode number of the
         memory cgroup each page is charged to, indexed by PFN. Only available when
         CONFIG_MEMCG is set.
      
         This file can be used to find all pages (including unmapped file pages)
         accounted to a particular cgroup. Using /sys/kernel/mm/page_idle/bitmap, one
         can then estimate the cgroup working set size.
      
      For an example of using these files for estimating the amount of unused
      memory pages per each memory cgroup, please see the script attached
      below.
      
      ==== REASONING ====
      
      The reason to introduce the new user API instead of using
      /proc/PID/{clear_refs,smaps} is that the latter has two serious
      drawbacks:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      The new API attempts to overcome them both. For more details on how it
      is achieved, please see the comment to patch 6.
      
      ==== PATCHSET STRUCTURE ====
      
      The patch set is organized as follows:
      
       - patch 1 adds page_cgroup_ino() helper for the sake of
         /proc/kpagecgroup and patches 2-3 do related cleanup
       - patch 4 adds /proc/kpagecgroup, which reports cgroup ino each page is
         charged to
       - patch 5 introduces a new mmu notifier callback, clear_young, which is
         a lightweight version of clear_flush_young; it is used in patch 6
       - patch 6 implements the idle page tracking feature, including the
         userspace API, /sys/kernel/mm/page_idle/bitmap
       - patch 7 exports idle flag via /proc/kpageflags
      
      ==== SIMILAR WORKS ====
      
      Originally, the patch for tracking idle memory was proposed back in 2011
      by Michel Lespinasse (see http://lwn.net/Articles/459269/).  The main
      difference between Michel's patch and this one is that Michel implemented
      a kernel space daemon for estimating idle memory size per cgroup while
      this patch only provides the userspace with the minimal API for doing the
      job, leaving the rest up to the userspace.  However, they both share the
      same idea of Idle/Young page flags to avoid affecting the reclaimer logic.
      
      ==== PERFORMANCE EVALUATION ====
      
      SPECjvm2008 (https://www.spec.org/jvm2008/) was used to evaluate the
      performance impact introduced by this patch set.  Three runs were carried
      out:
      
       - base: kernel without the patch
       - patched: patched kernel, the feature is not used
       - patched-active: patched kernel, 1 minute-period daemon is used for
         tracking idle memory
      
      For tracking idle memory, idlememstat utility was used:
      https://github.com/locker/idlememstat
      
      testcase            base            patched        patched-active
      
      compiler       537.40 ( 0.00)%   532.26 (-0.96)%   538.31 ( 0.17)%
      compress       305.47 ( 0.00)%   301.08 (-1.44)%   300.71 (-1.56)%
      crypto         284.32 ( 0.00)%   282.21 (-0.74)%   284.87 ( 0.19)%
      derby          411.05 ( 0.00)%   413.44 ( 0.58)%   412.07 ( 0.25)%
      mpegaudio      189.96 ( 0.00)%   190.87 ( 0.48)%   189.42 (-0.28)%
      scimark.large   46.85 ( 0.00)%    46.41 (-0.94)%    47.83 ( 2.09)%
      scimark.small  412.91 ( 0.00)%   415.41 ( 0.61)%   421.17 ( 2.00)%
      serial         204.23 ( 0.00)%   213.46 ( 4.52)%   203.17 (-0.52)%
      startup         36.76 ( 0.00)%    35.49 (-3.45)%    35.64 (-3.05)%
      sunflow        115.34 ( 0.00)%   115.08 (-0.23)%   117.37 ( 1.76)%
      xml            620.55 ( 0.00)%   619.95 (-0.10)%   620.39 (-0.03)%
      
      composite      211.50 ( 0.00)%   211.15 (-0.17)%   211.67 ( 0.08)%
      
      time idlememstat:
      
      17.20user 65.16system 2:15:23elapsed 1%CPU (0avgtext+0avgdata 8476maxresident)k
      448inputs+40outputs (1major+36052minor)pagefaults 0swaps
      
      ==== SCRIPT FOR COUNTING IDLE PAGES PER CGROUP ====
      #! /usr/bin/python
      #
      
      import os
      import stat
      import errno
      import struct
      
      CGROUP_MOUNT = "/sys/fs/cgroup/memory"
      BUFSIZE = 8 * 1024  # must be multiple of 8
      
      def get_hugepage_size():
          with open("/proc/meminfo", "r") as f:
              for s in f:
                  k, v = s.split(":")
                  if k == "Hugepagesize":
                      return int(v.split()[0]) * 1024
      
      PAGE_SIZE = os.sysconf("SC_PAGE_SIZE")
      HUGEPAGE_SIZE = get_hugepage_size()
      
      def set_idle():
          f = open("/sys/kernel/mm/page_idle/bitmap", "wb", BUFSIZE)
          while True:
              try:
                  f.write(struct.pack("Q", pow(2, 64) - 1))
              except IOError as err:
                  if err.errno == errno.ENXIO:
                      break
                  raise
          f.close()
      
      def count_idle():
          f_flags = open("/proc/kpageflags", "rb", BUFSIZE)
          f_cgroup = open("/proc/kpagecgroup", "rb", BUFSIZE)
      
          with open("/sys/kernel/mm/page_idle/bitmap", "rb", BUFSIZE) as f:
              while f.read(BUFSIZE): pass  # update idle flag
      
          idlememsz = {}
          while True:
              s1, s2 = f_flags.read(8), f_cgroup.read(8)
              if not s1 or not s2:
                  break
      
              flags, = struct.unpack('Q', s1)
              cgino, = struct.unpack('Q', s2)
      
              unevictable = (flags >> 18) & 1
              huge = (flags >> 22) & 1
              idle = (flags >> 25) & 1
      
              if idle and not unevictable:
                  idlememsz[cgino] = idlememsz.get(cgino, 0) + \
                      (HUGEPAGE_SIZE if huge else PAGE_SIZE)
      
          f_flags.close()
          f_cgroup.close()
          return idlememsz
      
      if __name__ == "__main__":
          print "Setting the idle flag for each page..."
          set_idle()
      
          raw_input("Wait until the workload accesses its working set, "
                    "then press Enter")
      
          print "Counting idle pages..."
          idlememsz = count_idle()
      
          for dir, subdirs, files in os.walk(CGROUP_MOUNT):
              ino = os.stat(dir)[stat.ST_INO]
              print dir + ": " + str(idlememsz.get(ino, 0) / 1024) + " kB"
      ==== END SCRIPT ====
      
      This patch (of 8):
      
      Add page_cgroup_ino() helper to memcg.
      
      This function returns the inode number of the closest online ancestor of
      the memory cgroup a page is charged to.  It is required for exporting
      information about which page is charged to which cgroup to userspace,
      which will be introduced by a following patch.
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2fc04524
    • zpool: add zpool_has_pool() · 3f0e1312
      Committed by Dan Streetman
      This series makes creation of the zpool and compressor dynamic, so that
      they can be changed at runtime.  This makes using/configuring zswap
      easier, as before this zswap had to be configured at boot time, using boot
      params.
      
      This uses a single list to track both the zpool and compressor together,
      although Seth had mentioned an alternative which is to track the zpools
      and compressors using separate lists.  In the most common case, only a
      single zpool and single compressor, using one list is slightly simpler
      than using two lists, and for the uncommon case of multiple zpools and/or
      compressors, using one list is slightly less simple (and uses slightly
      more memory, probably) than using two lists.
      
      This patch (of 4):
      
      Add a zpool_has_pool() function, indicating whether the specified type of
      zpool is available (i.e.  zsmalloc or zbud).  This allows checking
      whether a pool type is available without actually trying to allocate it,
      similar to crypto_has_alg().  A usage sketch follows this entry.
      
      This is used by a following patch to zswap that enables the dynamic
      runtime creation of zswap zpools.
      Signed-off-by: NDan Streetman <ddstreet@ieee.org>
      Acked-by: NSeth Jennings <sjennings@variantweb.net>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f0e1312
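      A usage sketch, assuming the helper takes the zpool type name as a string
      and returns a bool; foo_select_pool() is a placeholder caller:

          static int foo_select_pool(char *type)
          {
                  /* Refuse to switch to a pool type that is not compiled in,
                   * without actually creating a pool. */
                  if (!zpool_has_pool(type)) {
                          pr_err("zpool type %s not available\n", type);
                          return -ENOENT;
                  }
                  return 0;
          }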
  2. 10 September 2015 (2 commits)
    • netlink, mmap: fix edge-case leakages in nf queue zero-copy · 6bb0fef4
      Committed by Daniel Borkmann
      When netlink mmap on receive side is the consumer of nf queue data,
      it can happen that in some edge cases, we write skb shared info into
      the user space mmap buffer:
      
      Assume a possible rx ring frame size of only 4096, and the network skb,
      which is being zero-copied into the netlink skb, contains page frags
      with an overall skb->len larger than the linear part of the netlink
      skb.
      
      skb_zerocopy(), which is generic and thus not aware of the fact that
      shared info cannot be accessed for such skbs then tries to write and
      fill frags, thus leaking kernel data/pointers and in some corner cases
      possibly writing out of bounds of the mmap area (when filling the
      last slot in the ring buffer this way).
      
      I.e. the ring buffer slot is then of status NL_MMAP_STATUS_VALID, has
      an advertised length larger than 4096, where the linear part is visible
      at the slot beginning, and the leaked sizeof(struct skb_shared_info)
      has been written to the beginning of the next slot (also corrupting
      the struct nl_mmap_hdr slot header incl. status etc), since skb->end
      points to skb->data + ring->frame_size - NL_MMAP_HDRLEN.
      
      The fix adds and lets __netlink_alloc_skb() take the actual needed
      linear room for the network skb + meta data into account. It's completely
      irrelevant for non-mmaped netlink sockets, but in case mmap sockets
      are used, it can be decided whether the available skb_tailroom() is
      really large enough for the buffer, or whether it needs to internally
      fallback to a normal alloc_skb().
      
      From nf queue side, the information whether the destination port is
      an mmap RX ring is not really available without extra port-to-socket
      lookup, thus it can only be determined in lower layers i.e. when
      __netlink_alloc_skb() is called that checks internally for this. I
      chose to add the extra ldiff parameter as mmap will then still work:
      We have data_len and hlen in nfqnl_build_packet_message(), data_len
      is the full length (capped at queue->copy_range) for skb_zerocopy()
      and hlen some possible part of data_len that needs to be copied; the
      rem_len variable indicates the needed remaining linear mmap space.
      
      The only other workaround in nf queue internally would be after
      allocation time by f.e. cap'ing the data_len to the skb_tailroom()
      iff we deal with an mmap skb, but that would 1) expose the fact that
      we use a mmap skb to upper layers, and 2) trim the skb where we
      otherwise could just have moved the full skb into the normal receive
      queue.
      
      After the patch, in my test case the ring slot doesn't fit and therefore
      shows NL_MMAP_STATUS_COPY, where a full skb carries all the data and
      thus needs to be picked up via recv().
      
      Fixes: 3ab1f683 ("nfnetlink: add support for memory mapped netlink")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6bb0fef4
    • add microchip LAN88xx phy driver · 792aec47
      Committed by Woojung.Huh@microchip.com
      Add Microchip LAN88XX phy driver for phylib.
      Signed-off-by: NWoojung Huh <woojung.huh@microchip.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      792aec47
  3. 09 September 2015 (23 commits)
    • mm: zbud: constify the zbud_ops · c83db4f4
      Committed by Krzysztof Kozlowski
      The structure zbud_ops is not modified so make the pointer to it a
      pointer to const.
      Signed-off-by: NKrzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: NDan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c83db4f4
    • mm: zpool: constify the zpool_ops · 78672779
      Committed by Krzysztof Kozlowski
      The structure zpool_ops is not modified so make the pointer to it a
      pointer to const.
      Signed-off-by: NKrzysztof Kozlowski <k.kozlowski@samsung.com>
      Acked-by: NDan Streetman <ddstreet@ieee.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      78672779
    • mm: swap: zswap: maybe_preload & refactoring · 5b999aad
      Committed by Dmitry Safonov
      zswap_get_swap_cache_page and read_swap_cache_async have pretty much the
      same code with only significant difference in return value and usage of
      swap_readpage.
      
      I added a helper, __read_swap_cache_async(), with the common code.
      Behavior change: zswap_get_swap_cache_page() will now use
      radix_tree_maybe_preload() instead of radix_tree_preload().  It looks
      like this was left unchanged before only because of the code duplication.
      Signed-off-by: NDmitry Safonov <0x7f454c46@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Herrmann <dh.herrmann@gmail.com>
      Cc: Seth Jennings <sjennings@variantweb.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b999aad
    • zsmalloc: account the number of compacted pages · 860c707d
      Committed by Sergey Senozhatsky
      Compaction returns back to zram the number of migrated objects, which is
      quite uninformative -- we have objects of different sizes so user space
      cannot obtain any valuable data from that number.  Change compaction to
      operate in terms of pages and return back to compaction issuer the
      number of pages that were freed during compaction.  So from now on we
      will export more meaningful value in zram<id>/mm_stat -- the number of
      freed (compacted) pages.
      
      This requires:
       (a) a rename of `num_migrated' to `pages_compacted'
       (b) an internal API change -- return first_page's fullness_group from
           putback_zspage(), so we know when putback_zspage() did
           free_zspage().  This helps us account compaction stats correctly.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      860c707d
    • zsmalloc/zram: introduce zs_pool_stats api · 7d3f3938
      Committed by Sergey Senozhatsky
      `zs_compact_control' accounts the number of migrated objects, but it has
      a limited lifespan -- we lose it as soon as zs_compaction() returns back
      to zram.  That worked fine because (a) zram had its own counter of
      migrated objects and (b) only zram could trigger compaction.  However,
      this does not work for automatic pool compaction (not issued by zram).
      To account objects migrated during auto-compaction (issued by the
      shrinker) we need to store this number in zs_pool.
      
      Define a new `struct zs_pool_stats' structure to keep zs_pool's stats
      there.  It provides only `num_migrated', as of this writing, but it
      surely can be extended.
      
      A new zsmalloc zs_pool_stats() symbol exports zs_pool's stats back to the
      caller; a sketch of the interface follows this entry.

      Use zs_pool_stats() in zram and remove `num_migrated' from zram_stats.
      Signed-off-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Suggested-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d3f3938
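      A sketch of the interface as described above (names per the commit text;
      exact layout assumed):

          struct zs_pool_stats {
                  /* How many objects have been migrated by compaction. */
                  unsigned long num_migrated;
          };

          /* Copy the pool's compaction stats into the caller's structure. */
          void zs_pool_stats(struct zs_pool *pool, struct zs_pool_stats *stats);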
    • mm: use numa_mem_id() in alloc_pages_node() · 82c1fc71
      Committed by Vlastimil Babka
      alloc_pages_node() might fail when called with NUMA_NO_NODE and
      __GFP_THISNODE on a CPU belonging to a memoryless node.  To make the
      local-node fallback more robust and prevent such situations, use
      numa_mem_id(), which was introduced for similar scenarios in the slab
      context.  The resulting fallback is sketched after this entry.
      Suggested-by: NChristoph Lameter <cl@linux.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82c1fc71
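      The resulting fallback, sketched together with the __alloc_pages_node()
      wrapper introduced by the adjacent entries (simplified):

          static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                                      unsigned int order)
          {
                  /* Fall back to the nearest node that actually has memory,
                   * so __GFP_THISNODE cannot land on a memoryless local node. */
                  if (nid == NUMA_NO_NODE)
                          nid = numa_mem_id();
                  return __alloc_pages_node(nid, gfp_mask, order);
          }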
    • mm: unify checks in alloc_pages_node() and __alloc_pages_node() · 0bc35a97
      Committed by Vlastimil Babka
      Perform the same debug checks in alloc_pages_node() as are done in
      __alloc_pages_node(), by making the former function a wrapper of the
      latter one.
      
      In addition to better diagnostics in DEBUG_VM builds for situations
      which have been already fatal (e.g.  out-of-bounds node id), there are
      two visible changes for potential existing buggy callers of
      alloc_pages_node():
      
      - calling alloc_pages_node() with any negative nid (e.g. due to arithmetic
        overflow) was treated as passing NUMA_NO_NODE and fallback to local node was
        applied. This will now be fatal.
      - calling alloc_pages_node() with an offline node will now be checked for
        DEBUG_VM builds. Since it's not fatal if the node has been previously online,
        and this patch may expose some existing buggy callers, change the VM_BUG_ON
        in __alloc_pages_node() to VM_WARN_ON.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0bc35a97
    • mm: rename alloc_pages_exact_node() to __alloc_pages_node() · 96db800f
      Committed by Vlastimil Babka
      alloc_pages_exact_node() was introduced in commit 6484eb3e ("page
      allocator: do not check NUMA node ID when the caller knows the node is
      valid") as an optimized variant of alloc_pages_node(), that doesn't
      fallback to current node for nid == NUMA_NO_NODE.  Unfortunately the
      name of the function can easily suggest that the allocation is
      restricted to the given node and fails otherwise.  In truth, the node is
      only preferred, unless __GFP_THISNODE is passed among the gfp flags.
      
      The misleading name has lead to mistakes in the past, see for example
      commits 5265047a ("mm, thp: really limit transparent hugepage
      allocation to local node") and b360edb4 ("mm, mempolicy:
      migrate_to_node should only migrate to node").
      
      Another issue with the name is that there's a family of
      alloc_pages_exact*() functions where 'exact' means exact size (instead
      of page order), which leads to more confusion.
      
      To prevent further mistakes, this patch effectively renames
      alloc_pages_exact_node() to __alloc_pages_node() to better convey that
      it's an optimized variant of alloc_pages_node() not intended for general
      usage.  Both functions get described in comments.
      
      It has been also considered to really provide a convenience function for
      allocations restricted to a node, but the major opinion seems to be that
      __GFP_THISNODE already provides that functionality and we shouldn't
      duplicate the API needlessly.  The number of users would be small
      anyway.
      
      Existing callers of alloc_pages_exact_node() are simply converted to
      call __alloc_pages_node(), with the exception of sba_alloc_coherent()
      which open-codes the check for NUMA_NO_NODE, so it is converted to use
      alloc_pages_node() instead.  This means it no longer performs some
      VM_BUG_ON checks, and since the current check for nid in
      alloc_pages_node() uses a 'nid < 0' comparison (which includes
      NUMA_NO_NODE), it may hide wrong values which would be previously
      exposed.
      
      Both differences will be rectified by the next patch.
      
      To sum up, this patch makes no functional changes, except temporarily
      hiding potentially buggy callers.  Restricting the checks in
      alloc_pages_node() is left for the next patch which can in turn expose
      more existing buggy callers.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRobin Holt <robinmholt@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristoph Lameter <cl@linux.com>
      Acked-by: NMichael Ellerman <mpe@ellerman.id.au>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Gleb Natapov <gleb@kernel.org>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Cliff Whickman <cpw@sgi.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      96db800f
    • mm/hwpoison: fix race between soft_offline_page and unpoison_memory · da1b13cc
      Committed by Wanpeng Li
      Wanpeng Li reported a race between soft_offline_page() and
      unpoison_memory(), which causes the following kernel panic:
      
         BUG: Bad page state in process bash  pfn:97000
         page:ffffea00025c0000 count:0 mapcount:1 mapping:          (null) index:0x7f4fdbe00
         flags: 0x1fffff80080048(uptodate|active|swapbacked)
         page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
         bad because of flags:
         flags: 0x40(active)
         Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 nfsv4 dns_resolver bnep rfcomm nfsd bluetooth auth_rpcgss nfs_acl nfs rfkill lockd grace sunrpc i2c_algo_bit drm_kms_helper snd_hda_codec_realtek snd_hda_codec_generic drm snd_hda_intel fscache snd_hda_codec x86_pkg_temp_thermal coretemp kvm_intel snd_hda_core snd_hwdep kvm snd_pcm snd_seq_dummy snd_seq_oss crct10dif_pclmul snd_seq_midi crc32_pclmul snd_seq_midi_event ghash_clmulni_intel snd_rawmidi aesni_intel lrw gf128mul snd_seq glue_helper ablk_helper snd_seq_device cryptd fuse snd_timer dcdbas serio_raw mei_me parport_pc snd mei ppdev i2c_core video lp soundcore parport lpc_ich shpchp mfd_core ext4 mbcache jbd2 sd_mod e1000e ahci ptp libahci crc32c_intel libata pps_core
         CPU: 3 PID: 2211 Comm: bash Not tainted 4.2.0-rc5-mm1+ #45
         Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015
         Call Trace:
           dump_stack+0x48/0x5c
           bad_page+0xe6/0x140
           free_pages_prepare+0x2f9/0x320
           ? uncharge_list+0xdd/0x100
           free_hot_cold_page+0x40/0x170
           __put_single_page+0x20/0x30
           put_page+0x25/0x40
           unmap_and_move+0x1a6/0x1f0
           migrate_pages+0x100/0x1d0
           ? kill_procs+0x100/0x100
           ? unlock_page+0x6f/0x90
           __soft_offline_page+0x127/0x2a0
           soft_offline_page+0xa6/0x200
      
      This race is explained like below:
      
        CPU0                    CPU1
      
        soft_offline_page
        __soft_offline_page
        TestSetPageHWPoison
                              unpoison_memory
                              PageHWPoison check (true)
                              TestClearPageHWPoison
                              put_page    -> release refcount held by get_hwpoison_page in unpoison_memory
                              put_page    -> release refcount held by isolate_lru_page in __soft_offline_page
        migrate_pages
      
      The second put_page() releases the refcount held by isolate_lru_page(),
      which leads to unmap_and_move() releasing the last refcount of the page
      while its mapcount is still 1, since try_to_unmap() is not called when
      there is only one user mapping the page.  Either way, the page refcount
      and mapcount will still be a mess if the page is mapped by multiple
      users.
      
      This race was introduced by commit 4491f712 ("mm/memory-failure: set
      PageHWPoison before migrate_pages()"), which focuses on preventing the
      reuse of successfully migrated page.  Before this commit we prevent the
      reuse by changing the migratetype to MIGRATE_ISOLATE during soft
      offlining, which has the following problems, so simply reverting the
      commit is not a best option:
      
        1) it doesn't eliminate the reuse completely, because
           set_migratetype_isolate() can fail to set MIGRATE_ISOLATE to the
           target page if the pageblock of the page contains one or more
           unmovable pages (i.e.  has_unmovable_pages() returns true).
      
        2) the original code changes migratetype to MIGRATE_ISOLATE
           forcibly, and sets it to MIGRATE_MOVABLE forcibly after soft offline,
           regardless of the original migratetype state, which could impact
           other subsystems like memory hotplug or compaction.
      
      This patch moves SetPageHWPoison just after put_page() in
      unmap_and_move(), which closes up the reported race window and minimizes
      another race window between SetPageHWPoison and reallocation (which
      causes the reuse of the soft-offlined page).  The latter race window
      still exists, but it's rare and effectively the same as the ordinary
      "containment failure" case even if it happens, so keeping the window
      open is acceptable.
      
      Fixes: 4491f712 ("mm/memory-failure: set PageHWPoison before migrate_pages()")
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      da1b13cc
    • mm/hwpoison: introduce num_poisoned_pages wrappers · 8e30456b
      Committed by Naoya Horiguchi
      num_poisoned_pages counter will be changed outside mm/memory-failure.c
      by a subsequent patch, so this patch prepares wrappers to manipulate it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Tested-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e30456b
    • mm/hwpoison: introduce put_hwpoison_page to put refcount for memory error handling · 94bf4ec8
      Committed by Wanpeng Li
      Introduce put_hwpoison_page to put refcount for memory error handling.
      Signed-off-by: NWanpeng Li <wanpeng.li@hotmail.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      94bf4ec8
    • pci: mm: add pci_pool_zalloc() call · 01a7fd33
      Committed by Sean O. Stalley
      Add a wrapper function for pci_pool_alloc() to get zeroed memory.
      Signed-off-by: NSean O. Stalley <sean.stalley@intel.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Gilles Muller <Gilles.Muller@lip6.fr>
      Cc: Nicolas Palix <nicolas.palix@imag.fr>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      01a7fd33
    • mm: add dma_pool_zalloc() call to DMA API · ad82362b
      Committed by Sean O. Stalley
      Add a wrapper function for dma_pool_alloc() to get zeroed memory; a
      sketch of the wrapper follows this entry.
      Signed-off-by: NSean O. Stalley <sean.stalley@intel.com>
      Cc: Vinod Koul <vinod.koul@intel.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Gilles Muller <Gilles.Muller@lip6.fr>
      Cc: Nicolas Palix <nicolas.palix@imag.fr>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad82362b
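      A sketch of both wrappers, assuming they simply OR in __GFP_ZERO as the
      commit text describes:

          static inline void *dma_pool_zalloc(struct dma_pool *pool,
                                              gfp_t mem_flags, dma_addr_t *handle)
          {
                  return dma_pool_alloc(pool, mem_flags | __GFP_ZERO, handle);
          }

          /* The PCI variant from the previous entry is just an alias. */
          #define pci_pool_zalloc(pool, flags, handle) \
                  dma_pool_zalloc(pool, flags, handle)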
    • mm: drop __nocast from vm_flags_t definition · 64b990d2
      Committed by Kirill A. Shutemov
      __nocast does no good for vm_flags_t. It only produces useless sparse
      warnings.
      
      Let's drop it.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64b990d2
    • mm, page_isolation: make set/unset_migratetype_isolate() file-local · c5b4e1b0
      Committed by Naoya Horiguchi
      Nowadays, set/unset_migratetype_isolate() are defined and used only in
      mm/page_isolation.c, so let's limit their scope to that file.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5b4e1b0
    • mem-hotplug: handle node hole when initializing numa_meminfo. · 95cf82ec
      Committed by Tang Chen
      When parsing SRAT, all memory ranges are added into numa_meminfo.  In
      numa_init(), before entering numa_cleanup_meminfo(), all possible memory
      ranges are in numa_meminfo.  And numa_cleanup_meminfo() removes all
      ranges over max_pfn or empty.
      
      But, this only works if the nodes are continuous.  Let's have a look at
      the following example:
      
      We have an SRAT like this:
      SRAT: Node 0 PXM 0 [mem 0x00000000-0x5fffffff]
      SRAT: Node 0 PXM 0 [mem 0x100000000-0x1ffffffffff]
      SRAT: Node 1 PXM 1 [mem 0x20000000000-0x3ffffffffff]
      SRAT: Node 4 PXM 2 [mem 0x40000000000-0x5ffffffffff] hotplug
      SRAT: Node 5 PXM 3 [mem 0x60000000000-0x7ffffffffff] hotplug
      SRAT: Node 2 PXM 4 [mem 0x80000000000-0x9ffffffffff] hotplug
      SRAT: Node 3 PXM 5 [mem 0xa0000000000-0xbffffffffff] hotplug
      SRAT: Node 6 PXM 6 [mem 0xc0000000000-0xdffffffffff] hotplug
      SRAT: Node 7 PXM 7 [mem 0xe0000000000-0xfffffffffff] hotplug
      
      On boot, only node 0,1,2,3 exist.
      
      And the numa_meminfo will look like this:
      numa_meminfo.nr_blks = 9
      1. on node 0: [0, 60000000]
      2. on node 0: [100000000, 20000000000]
      3. on node 1: [20000000000, 40000000000]
      4. on node 4: [40000000000, 60000000000]
      5. on node 5: [60000000000, 80000000000]
      6. on node 2: [80000000000, a0000000000]
      7. on node 3: [a0000000000, a0800000000]
      8. on node 6: [c0000000000, a0800000000]
      9. on node 7: [e0000000000, a0800000000]
      
      And numa_cleanup_meminfo() will merge 1 and 2, and remove 8 and 9
      because their end addresses are over max_pfn, which is a0800000000.  But
      4 and 5 are not removed, because their end addresses are less than
      max_pfn.  But in fact, nodes 4 and 5 don't exist.
      
      In short, numa_cleanup_meminfo() is not able to handle holes between nodes.
      
      Since memory ranges in node 4 and 5 are in numa_meminfo, in
      numa_register_memblks(), node 4 and 5 will be mistakenly set to online.
      
      If you run lscpu, it will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node2 CPU(s):
      NUMA node3 CPU(s):
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      
      In this patch, we use memblock_overlaps_region() to check if ranges in
      numa_meminfo overlap with ranges in memory_block.  Since memory_block
      contains all available memory at boot time, if they overlap, it means the
      ranges exist.  If not, then remove them from numa_meminfo.
      
      After this patch, lscpu will show:
      NUMA node0 CPU(s):     0-14,128-142
      NUMA node1 CPU(s):     15-29,143-157
      NUMA node4 CPU(s):     62-76,190-204
      NUMA node5 CPU(s):     78-92,206-220
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Reviewed-by: NYasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      95cf82ec
    • mm/memblock.c: make memblock_overlaps_region() return bool. · c5c5c9d1
      Committed by Tang Chen
      memblock_overlaps_region() checks if the given memblock region
      intersects a region in memblock.  If so, it returns the index of the
      intersected region.
      
      But its only caller is memblock_is_region_reserved(), which itself
      returns 0 for false and non-zero for true.

      Both of these should return bool; the resulting prototypes are sketched
      after this entry.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Vladimir Murzin <vladimir.murzin@arm.com>
      Cc: Fabian Frederick <fabf@skynet.be>
      Cc: Alexander Kuleshov <kuleshovmail@gmail.com>
      Cc: Baoquan He <bhe@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c5c5c9d1
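      The resulting prototypes, sketched (qualifiers omitted):

          /* True if [base, base + size) intersects any region of the given
           * memblock type. */
          bool memblock_overlaps_region(struct memblock_type *type,
                                        phys_addr_t base, phys_addr_t size);

          /* True if the range intersects a reserved memblock region. */
          bool memblock_is_region_reserved(phys_addr_t base, phys_addr_t size);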
    • hugetlbfs: add hugetlbfs_fallocate() · 70c3547e
      Committed by Mike Kravetz
      This is based on the shmem version, but it has diverged quite a bit.  We
      have no swap to worry about, nor the new file sealing.  Add
      synchronization via the fault mutex table to coordinate page faults,
      fallocate allocation and fallocate hole punch.
      
      What this allows us to do is move physical memory in and out of a
      hugetlbfs file without having it mapped.  This also gives us the ability
      to support MADV_REMOVE since it is currently implemented using
      fallocate().  MADV_REMOVE lets madvise() remove pages from the middle of
      a hugetlbfs file, which wasn't possible before.
      
      hugetlbfs fallocate only operates on whole huge pages.
      
      Based on code by Dave Hansen.
      Signed-off-by: NMike Kravetz <mike.kravetz@oracle.com>
      Reviewed-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: NHillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      70c3547e
    • M
      hugetlbfs: New huge_add_to_page_cache helper routine · ab76ad54
       Committed by Mike Kravetz
      Currently, there is only a single place where hugetlbfs pages are added
       to the page cache.  The new fallocate code will be adding a second one, so
      break the functionality out into its own helper.
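       A sketch of the resulting helper (the signature follows the description;
       the body approximates what mm/hugetlb.c ends up doing and is not the
       verbatim patch):

           int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
                                      pgoff_t idx)
           {
                   struct inode *inode = mapping->host;
                   struct hstate *h = hstate_inode(inode);
                   int err = add_to_page_cache(page, mapping, idx, GFP_KERNEL);

                   if (err)
                           return err;
                   ClearPagePrivate(page);

                   /* account the huge page against the inode's block count */
                   spin_lock(&inode->i_lock);
                   inode->i_blocks += blocks_per_huge_page(h);
                   spin_unlock(&inode->i_lock);

                   return 0;
           }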
       Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
       Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab76ad54
    • M
      hugetlbfs: truncate_hugepages() takes a range of pages · b5cec28d
       Committed by Mike Kravetz
      Modify truncate_hugepages() to take a range of pages (start, end)
      instead of simply start.  If an end value of LLONG_MAX is passed, the
      current "truncate" functionality is maintained.  Existing callers are
      modified to pass LLONG_MAX as end of range.  By keying off end ==
      LLONG_MAX, the routine behaves differently for truncate and hole punch.
      Page removal is now synchronized with page allocation via faults by
       using the fault mutex table.  The hole punch case can experience the
       rare region_del error and must handle it accordingly.
      
      Add the routine hugetlb_fix_reserve_counts to fix up reserve counts in
      the case where region_del returns an error.
      
      Since the routine handles more than just the truncate case, it is
      renamed to remove_inode_hugepages().  To be consistent, the routine
      truncate_huge_page() is renamed remove_huge_page().
      
       Downstream of remove_inode_hugepages(), the routine
       hugetlb_unreserve_pages() is also modified to take a range of pages,
       and to detect an error from region_del and pass it back to the caller.
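       Illustrative call sites after the rename (inode, start, hole_start and
       hole_end are placeholders, not names taken from the patch):

           /* truncate: drop everything from 'start' to the end of the file */
           remove_inode_hugepages(inode, start, LLONG_MAX);

           /* hole punch: drop only the punched range [hole_start, hole_end) */
           remove_inode_hugepages(inode, hole_start, hole_end);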
       Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b5cec28d
    • M
      mm/hugetlb: expose hugetlb fault mutex for use by fallocate · c672c7f2
       Committed by Mike Kravetz
      hugetlb page faults are currently synchronized by the table of mutexes
      (htlb_fault_mutex_table).  fallocate code will need to synchronize with
      the page fault code when it allocates or deletes pages.  Expose
      interfaces so that fallocate operations can be synchronized with page
      faults.  Minor name changes to be more consistent with other global
      hugetlb symbols.
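
       A sketch of how the fallocate paths are expected to serialize against
       page faults using the exposed interfaces (the argument list of
       hugetlb_fault_mutex_hash() is abbreviated from mm/hugetlb.c and may
       differ in detail):

           u32 hash;

           hash = hugetlb_fault_mutex_hash(h, mm, vma, mapping, idx, addr);
           mutex_lock(&hugetlb_fault_mutex_table[hash]);

           /* ... allocate or remove the huge page for 'idx' ... */

           mutex_unlock(&hugetlb_fault_mutex_table[hash]);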
       Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c672c7f2
    • M
      mm/hugetlb: add cache of descriptors to resv_map for region_add · 5e911373
       Committed by Mike Kravetz
      hugetlbfs is used today by applications that want a high degree of
      control over huge page usage.  Often, large hugetlbfs files are used to
       map a large number of huge pages into the application processes.  The
      applications know when page ranges within these large files will no
      longer be used, and ideally would like to release them back to the
      subpool or global pools for other uses.  The fallocate() system call
      provides an interface for preallocation and hole punching within files.
      This patch set adds fallocate functionality to hugetlbfs.
      
      fallocate hole punch will want to remove a specific range of pages.
      When pages are removed, their associated entries in the region/reserve
      map will also be removed.  This will break an assumption in the
      region_chg/region_add calling sequence.  If a new region descriptor must
      be allocated, it is done as part of the region_chg processing.  In this
       way, region_add cannot fail because it does not need to attempt an
      allocation.
      
      To prepare for fallocate hole punch, create a "cache" of descriptors
      that can be used by region_add if necessary.  region_chg will ensure
       there are sufficient entries in the cache.  It will be necessary to
       track the number of in-progress add operations to know that a sufficient
       number of descriptors reside in the cache.  A new routine region_abort
       is added to adjust this in-progress count when add operations are
       aborted.  vma_abort_reservation is also added for callers creating
      reservations with vma_needs_reservation/vma_commit_reservation.
      
      [akpm@linux-foundation.org: fix typo in comment, use more cols]
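
       A sketch of the resulting calling pattern (resv_map, from, to and the
       "later step failed" condition are placeholders; region_chg, region_add
       and region_abort are the internal mm/hugetlb.c routines named above):

           /* may allocate a descriptor and park it in the resv_map cache */
           chg = region_chg(resv_map, from, to);
           if (chg < 0)
                   return chg;

           if (later_step_failed)
                   /* drop the in-progress count; any cached descriptor stays for reuse */
                   region_abort(resv_map, from, to);
           else
                   /* cannot fail: consumes a cached descriptor if one is needed */
                   region_add(resv_map, from, to);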
       Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
       Acked-by: Hillf Danton <hillf.zj@alibaba-inc.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Davidlohr Bueso <dave@stgolabs.net>
      Cc: Aneesh Kumar <aneesh.kumar@linux.vnet.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Michal Hocko <mhocko@suse.cz>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5e911373
    • V
      mm: rename and move get/set_freepage_migratetype · bb14c2c7
       Committed by Vlastimil Babka
      The pair of get/set_freepage_migratetype() functions are used to cache
      pageblock migratetype for a page put on a pcplist, so that it does not
      have to be retrieved again when the page is put on a free list (e.g.
      when pcplists become full).  Historically it was also assumed that the
      value is accurate for pages on freelists (as the functions' names
      unfortunately suggest), but that cannot be guaranteed without affecting
      various allocator fast paths.  It is in fact not needed and all such
      uses have been removed.
      
       The last remaining (but pointless) usage related to pages on freelists
      is in move_freepages(), which this patch removes.
      
      To prevent further confusion, rename the functions to
      get/set_pcppage_migratetype() and expand their description.  Since all
      the users are now in mm/page_alloc.c, move the functions there from the
      shared header.
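
       A sketch of the renamed accessors as they would live in mm/page_alloc.c
       (caching the value in page->index mirrors the old freepage variant; the
       bodies are illustrative rather than verbatim):

           /*
            * Cache a page's pageblock migratetype while the page sits on a
            * pcplist, so the free path does not have to look it up again.  The
            * value may be stale once the page moves on to the free lists.
            */
           static inline int get_pcppage_migratetype(struct page *page)
           {
                   return page->index;
           }

           static inline void set_pcppage_migratetype(struct page *page, int migratetype)
           {
                   page->index = migratetype;
           }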
       Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
       Acked-by: David Rientjes <rientjes@google.com>
       Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Minchan Kim <minchan@kernel.org>
       Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: Laura Abbott <lauraa@codeaurora.org>
       Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Seungho Park <seungho1.park@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
       Acked-by: Mel Gorman <mgorman@techsingularity.net>
       Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb14c2c7