1. 15 1月, 2016 1 次提交
    • V
      mm, proc: account for shmem swap in /proc/pid/smaps · c261e7d9
      Vlastimil Babka 提交于
      Currently, /proc/pid/smaps will always show "Swap: 0 kB" for
      shmem-backed mappings, even if the mapped portion does contain pages
      that were swapped out.  This is because unlike private anonymous
      mappings, shmem does not change pte to swap entry, but pte_none when
      swapping the page out.  In the smaps page walk, such page thus looks
      like it was never faulted in.
      
      This patch changes smaps_pte_entry() to determine the swap status for
      such pte_none entries for shmem mappings, similarly to how
      mincore_page() does it.  Swapped out shmem pages are thus accounted for.
      For private mappings of tmpfs files that COWed some of the pages, swaped
      out status of the original shmem pages is naturally ignored.  If some of
      the private copies was also swapped out, they are accounted via their
      page table swap entries, so the resulting reported swap usage is then a
      sum of both swapped out private copies, and swapped out shmem pages that
      were not COWed.  No double accounting can thus happen.
      
      The accounting is arguably still not as precise as for private anonymous
      mappings, since now we will count also pages that the process in
      question never accessed, but another process populated them and then let
      them become swapped out.  I believe it is still less confusing and
      subtle than not showing any swap usage by shmem mappings at all.
      Swapped out counter might of interest of users who would like to prevent
      from future swapins during performance critical operation and pre-fault
      them at their convenience.  Especially for larger swapped out regions
      the cost of swapin is much higher than a fresh page allocation.  So a
      differentiation between pte_none vs.  swapped out is important for those
      usecases.
      
      One downside of this patch is that it makes /proc/pid/smaps more
      expensive for shmem mappings, as we consult the radix tree for each
      pte_none entry, so the overal complexity is O(n*log(n)).  I have
      measured this on a process that creates a 2GB mapping and dirties single
      pages with a stride of 2MB, and time how long does it take to cat
      /proc/pid/smaps of this process 100 times.
      
      Private anonymous mapping:
      
      real    0m0.949s
      user    0m0.116s
      sys     0m0.348s
      
      Mapping of a /dev/shm/file:
      
      real    0m3.831s
      user    0m0.180s
      sys     0m3.212s
      
      The difference is rather substantial, so the next patch will reduce the
      cost for shared or read-only mappings.
      
      In a less controlled experiment, I've gathered pids of processes on my
      desktop that have either '/dev/shm/*' or 'SYSV*' in smaps.  This
      included the Chrome browser and some KDE processes.  Again, I've run cat
      /proc/pid/smaps on each 100 times.
      
      Before this patch:
      
      real    0m9.050s
      user    0m0.518s
      sys     0m8.066s
      
      After this patch:
      
      real    0m9.221s
      user    0m0.541s
      sys     0m8.187s
      
      This suggests low impact on average systems.
      
      Note that this patch doesn't attempt to adjust the SwapPss field for
      shmem mappings, which would need extra work to determine who else could
      have the pages mapped.  Thus the value stays zero except for COWed
      swapped out pages in a shmem mapping, which are accounted as usual.
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Acked-by: NJerome Marchand <jmarchan@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c261e7d9
  2. 06 11月, 2015 4 次提交
  3. 14 10月, 2015 1 次提交
  4. 11 9月, 2015 1 次提交
    • V
      mm: introduce idle page tracking · 33c3fc71
      Vladimir Davydov 提交于
      Knowing the portion of memory that is not used by a certain application or
      memory cgroup (idle memory) can be useful for partitioning the system
      efficiently, e.g.  by setting memory cgroup limits appropriately.
      Currently, the only means to estimate the amount of idle memory provided
      by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
      access bit for all pages mapped to a particular process by writing 1 to
      clear_refs, wait for some time, and then count smaps:Referenced.  However,
      this method has two serious shortcomings:
      
       - it does not count unmapped file pages
       - it affects the reclaimer logic
      
      To overcome these drawbacks, this patch introduces two new page flags,
      Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
      A page's Idle flag can only be set from userspace by setting bit in
      /sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
      and it is cleared whenever the page is accessed either through page tables
      (it is cleared in page_referenced() in this case) or using the read(2)
      system call (mark_page_accessed()). Thus by setting the Idle flag for
      pages of a particular workload, which can be found e.g.  by reading
      /proc/PID/pagemap, waiting for some time to let the workload access its
      working set, and then reading the bitmap file, one can estimate the amount
      of pages that are not used by the workload.
      
      The Young page flag is used to avoid interference with the memory
      reclaimer.  A page's Young flag is set whenever the Access bit of a page
      table entry pointing to the page is cleared by writing to the bitmap file.
      If page_referenced() is called on a Young page, it will add 1 to its
      return value, therefore concealing the fact that the Access bit was
      cleared.
      
      Note, since there is no room for extra page flags on 32 bit, this feature
      uses extended page flags when compiled on 32 bit.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: kpageidle requires an MMU]
      [akpm@linux-foundation.org: decouple from page-flags rework]
      Signed-off-by: NVladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: NAndres Lagar-Cavilla <andreslc@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Raghavendra K T <raghavendra.kt@linux.vnet.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      33c3fc71
  5. 09 9月, 2015 6 次提交
  6. 05 9月, 2015 1 次提交
    • A
      userfaultfd: add VM_UFFD_MISSING and VM_UFFD_WP · 16ba6f81
      Andrea Arcangeli 提交于
      These two flags gets set in vma->vm_flags to tell the VM common code
      if the userfaultfd is armed and in which mode (only tracking missing
      faults, only tracking wrprotect faults or both). If neither flags is
      set it means the userfaultfd is not armed on the vma.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Sanidhya Kashyap <sanidhya.gatech@gmail.com>
      Cc: zhang.zhanghailiang@huawei.com
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
      Cc: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: "Huangpeng (Peter)" <peter.huangpeng@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      16ba6f81
  7. 24 6月, 2015 1 次提交
  8. 18 3月, 2015 1 次提交
  9. 13 2月, 2015 2 次提交
  10. 12 2月, 2015 9 次提交
  11. 11 2月, 2015 1 次提交
  12. 11 12月, 2014 1 次提交
  13. 18 11月, 2014 1 次提交
    • Q
      x86, mpx: Introduce VM_MPX to indicate that a VMA is MPX specific · 4aae7e43
      Qiaowei Ren 提交于
      MPX-enabled applications using large swaths of memory can
      potentially have large numbers of bounds tables in process
      address space to save bounds information. These tables can take
      up huge swaths of memory (as much as 80% of the memory on the
      system) even if we clean them up aggressively. In the worst-case
      scenario, the tables can be 4x the size of the data structure
      being tracked. IOW, a 1-page structure can require 4 bounds-table
      pages.
      
      Being this huge, our expectation is that folks using MPX are
      going to be keen on figuring out how much memory is being
      dedicated to it. So we need a way to track memory use for MPX.
      
      If we want to specifically track MPX VMAs we need to be able to
      distinguish them from normal VMAs, and keep them from getting
      merged with normal VMAs. A new VM_ flag set only on MPX VMAs does
      both of those things. With this flag, MPX bounds-table VMAs can
      be distinguished from other VMAs, and userspace can also walk
      /proc/$pid/smaps to get memory usage for MPX.
      
      In addition to this flag, we also introduce a special ->vm_ops
      specific to MPX VMAs (see the patch "add MPX specific mmap
      interface"), but currently different ->vm_ops do not by
      themselves prevent VMA merging, so we still need this flag.
      
      We understand that VM_ flags are scarce and are open to other
      options.
      Signed-off-by: NQiaowei Ren <qiaowei.ren@intel.com>
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Cc: linux-mm@kvack.org
      Cc: linux-mips@linux-mips.org
      Cc: Dave Hansen <dave@sr71.net>
      Link: http://lkml.kernel.org/r/20141114151825.565625B3@viggo.jf.intel.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      4aae7e43
  14. 14 10月, 2014 1 次提交
    • P
      mm: softdirty: enable write notifications on VMAs after VM_SOFTDIRTY cleared · 64e45507
      Peter Feiner 提交于
      For VMAs that don't want write notifications, PTEs created for read faults
      have their write bit set.  If the read fault happens after VM_SOFTDIRTY is
      cleared, then the PTE's softdirty bit will remain clear after subsequent
      writes.
      
      Here's a simple code snippet to demonstrate the bug:
      
        char* m = mmap(NULL, getpagesize(), PROT_READ | PROT_WRITE,
                       MAP_ANONYMOUS | MAP_SHARED, -1, 0);
        system("echo 4 > /proc/$PPID/clear_refs"); /* clear VM_SOFTDIRTY */
        assert(*m == '\0');     /* new PTE allows write access */
        assert(!soft_dirty(x));
        *m = 'x';               /* should dirty the page */
        assert(soft_dirty(x));  /* fails */
      
      With this patch, write notifications are enabled when VM_SOFTDIRTY is
      cleared.  Furthermore, to avoid unnecessary faults, write notifications
      are disabled when VM_SOFTDIRTY is set.
      
      As a side effect of enabling and disabling write notifications with
      care, this patch fixes a bug in mprotect where vm_page_prot bits set by
      drivers were zapped on mprotect.  An analogous bug was fixed in mmap by
      commit c9d0bf24 ("mm: uncached vma support with writenotify").
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Reported-by: NPeter Feiner <pfeiner@google.com>
      Suggested-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      64e45507
  15. 10 10月, 2014 9 次提交
    • P
      mm: softdirty: unmapped addresses between VMAs are clean · 81d0fa62
      Peter Feiner 提交于
      If a /proc/pid/pagemap read spans a [VMA, an unmapped region, then a
      VM_SOFTDIRTY VMA], the virtual pages in the unmapped region are reported
      as softdirty.  Here's a program to demonstrate the bug:
      
      int main() {
      	const uint64_t PAGEMAP_SOFTDIRTY = 1ul << 55;
      	uint64_t pme[3];
      	int fd = open("/proc/self/pagemap", O_RDONLY);;
      	char *m = mmap(NULL, 3 * getpagesize(), PROT_READ,
      	               MAP_ANONYMOUS | MAP_SHARED, -1, 0);
      	munmap(m + getpagesize(), getpagesize());
      	pread(fd, pme, 24, (unsigned long) m / getpagesize() * 8);
      	assert(pme[0] & PAGEMAP_SOFTDIRTY);    /* passes */
      	assert(!(pme[1] & PAGEMAP_SOFTDIRTY)); /* fails */
      	assert(pme[2] & PAGEMAP_SOFTDIRTY);    /* passes */
      	return 0;
      }
      
      (Note that all pages in new VMAs are softdirty until cleared).
      
      Tested:
      	Used the program given above. I'm going to include this code in
      	a selftest in the future.
      
      [n-horiguchi@ah.jp.nec.com: prevent pagemap_pte_range() from overrunning]
      Signed-off-by: NPeter Feiner <pfeiner@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Jamie Liu <jamieliu@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      81d0fa62
    • O
      mempolicy: fix show_numa_map() vs exec() + do_set_mempolicy() race · 498f2371
      Oleg Nesterov 提交于
      9e781440 "hold task->mempolicy while numa_maps scans." fixed the
      race with the exiting task but this is not enough.
      
      The current code assumes that get_vma_policy(task) should either see
      task->mempolicy == NULL or it should be equal to ->task_mempolicy saved
      by hold_task_mempolicy(), so we can never race with __mpol_put(). But
      this can only work if we can't race with do_set_mempolicy(), and thus
      we can't race with another do_set_mempolicy() or do_exit() after that.
      
      However, do_set_mempolicy()->down_write(mmap_sem) can not prevent this
      race. This task can exec, change it's ->mm, and call do_set_mempolicy()
      after that; in this case they take 2 different locks.
      
      Change hold_task_mempolicy() to use get_task_policy(), it never returns
      NULL, and change show_numa_map() to use __get_vma_policy() or fall back
      to proc_priv->task_mempolicy.
      
      Note: this is the minimal fix, we will cleanup this code later. I think
      hold_task_mempolicy() and release_task_mempolicy() should die, we can
      move this logic into show_numa_map(). Or we can move get_task_policy()
      outside of ->mmap_sem and !CONFIG_NUMA code at least.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      498f2371
    • O
      proc/maps: make vm_is_stack() logic namespace-friendly · 58cb6548
      Oleg Nesterov 提交于
      - Rename vm_is_stack() to task_of_stack() and change it to return
        "struct task_struct *" rather than the global (and thus wrong in
        general) pid_t.
      
      - Add the new pid_of_stack() helper which calls task_of_stack() and
        uses the right namespace to report the correct pid_t.
      
        Unfortunately we need to define this helper twice, in task_mmu.c
        and in task_nommu.c. perhaps it makes sense to add fs/proc/util.c
        and move at least pid_of_stack/task_of_stack there to avoid the
        code duplication.
      
      - Change show_map_vma() and show_numa_map() to use the new helper.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      58cb6548
    • O
      proc/maps: replace proc_maps_private->pid with "struct inode *inode" · 2c03376d
      Oleg Nesterov 提交于
      m_start() can use get_proc_task() instead, and "struct inode *"
      provides more potentially useful info, see the next changes.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c03376d
    • O
      fs/proc/task_mmu.c: update m->version in the main loop in m_start() · 557c2d8a
      Oleg Nesterov 提交于
      Change the main loop in m_start() to update m->version. Mostly for
      consistency, but this can help to avoid the same loop if the very
      1st ->show() fails due to seq_overflow().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      557c2d8a
    • O
      fs/proc/task_mmu.c: reintroduce m->version logic · b8c20a9b
      Oleg Nesterov 提交于
      Add the "last_addr" optimization back. Like before, every ->show()
      method checks !seq_overflow() and sets m->version = vma->vm_start.
      
      However, it also checks that m_next_vma(vma) != NULL, otherwise it
      sets m->version = -1 for the lockless "EOF" fast-path in m_start().
      
      m_start() can simply do find_vma() + m_next_vma() if last_addr is
      not zero, the code looks clear and simple and this case is clearly
      separated from "scan vmas" path.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b8c20a9b
    • O
      fs/proc/task_mmu.c: introduce m_next_vma() helper · ad2a00e4
      Oleg Nesterov 提交于
      Extract the tail_vma/vm_next calculation from m_next() into the new
      trivial helper, m_next_vma().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ad2a00e4
    • O
      fs/proc/task_mmu.c: simplify m_start() to make it readable · 0c255321
      Oleg Nesterov 提交于
      Now that m->version is gone we can cleanup m_start(). In particular,
      
        - Remove the "unsigned long" typecast, m->index can't be negative
          or exceed ->map_count. But lets use "unsigned int pos" to make
          it clear that "pos < map_count" is safe.
      
        - Remove the unnecessary "vma != NULL" check in the main loop. It
          can't be NULL unless we have a vm bug.
      
        - This also means that "pos < map_count" case can simply return the
          valid vma and avoid "goto" and subsequent checks.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0c255321
    • O
      fs/proc/task_mmu.c: kill the suboptimal and confusing m->version logic · ebb6cdde
      Oleg Nesterov 提交于
      m_start() carefully documents, checks, and sets "m->version = -1" if
      we are going to return NULL. The only problem is that we will be never
      called again if m_start() returns NULL, so this is simply pointless
      and misleading.
      
      Otoh, ->show() methods m->version = 0 if vma == tail_vma and this is
      just wrong, we want -1 in this case. And in fact we also want -1 if
      ->vm_next == NULL and ->tail_vma == NULL.
      
      And it is not used consistently, the "scan vmas" loop in m_start()
      should update last_addr too.
      
      Finally, imo the whole "last_addr" logic in m_start() looks horrible.
      find_vma(last_addr) is called unconditionally even if we are not going
      to use the result. But the main problem is that this code participates
      in tail_vma-or-NULL mess, and this looks simply unfixable.
      
      Remove this optimization. We will add it back after some cleanups.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ebb6cdde