1. 14 8月, 2013 3 次提交
  2. 04 7月, 2013 5 次提交
    • P
      pagemap: prepare to reuse constant bits with page-shift · 541c237c
      Pavel Emelyanov 提交于
      In order to reuse bits from pagemap entries gracefully, we leave the
      entries as is but on pagemap open emit a warning in dmesg, that bits
      55-60 are about to change in a couple of releases.  Next, if a user
      issues soft-dirty clear command via the clear_refs file (it was disabled
      before v3.9) we assume that he's aware of the new pagemap format, note
      that fact and report the bits in pagemap in the new manner.
      
      The "migration strategy" looks like this then:
      
      1. existing users are not affected -- they don't touch soft-dirty feature, thus
         see old bits in pagemap, but are warned and have time to fix themselves
      2. those who use soft-dirty know about new pagemap format
      3. some time soon we get rid of any signs of page-shift in pagemap as well as
         this trick with clear-soft-dirty affecting pagemap format.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      541c237c
    • P
      mm: soft-dirty bits for user memory changes tracking · 0f8975ec
      Pavel Emelyanov 提交于
      The soft-dirty is a bit on a PTE which helps to track which pages a task
      writes to.  In order to do this tracking one should
      
        1. Clear soft-dirty bits from PTEs ("echo 4 > /proc/PID/clear_refs)
        2. Wait some time.
        3. Read soft-dirty bits (55'th in /proc/PID/pagemap2 entries)
      
      To do this tracking, the writable bit is cleared from PTEs when the
      soft-dirty bit is.  Thus, after this, when the task tries to modify a
      page at some virtual address the #PF occurs and the kernel sets the
      soft-dirty bit on the respective PTE.
      
      Note, that although all the task's address space is marked as r/o after
      the soft-dirty bits clear, the #PF-s that occur after that are processed
      fast.  This is so, since the pages are still mapped to physical memory,
      and thus all the kernel does is finds this fact out and puts back
      writable, dirty and soft-dirty bits on the PTE.
      
      Another thing to note, is that when mremap moves PTEs they are marked
      with soft-dirty as well, since from the user perspective mremap modifies
      the virtual memory at mremap's new address.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0f8975ec
    • P
      pagemap: introduce pagemap_entry_t without pmshift bits · 2b0a9f01
      Pavel Emelyanov 提交于
      These bits are always constant (== PAGE_SHIFT) and just occupy space in
      the entry.  Moreover, in next patch we will need to report one more bit
      in the pagemap, but all bits are already busy on it.
      
      That said, describe the pagemap entry that has 6 more free zero bits.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2b0a9f01
    • P
      clear_refs: introduce private struct for mm_walk · af9de7eb
      Pavel Emelyanov 提交于
      In the next patch the clear-refs-type will be required in
      clear_refs_pte_range funciton, so prepare the walk->private to carry
      this info.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      af9de7eb
    • P
      clear_refs: sanitize accepted commands declaration · 040fa020
      Pavel Emelyanov 提交于
      This is the implementation of the soft-dirty bit concept that should
      help keep track of changes in user memory, which in turn is very-very
      required by the checkpoint-restore project (http://criu.org).
      
      To create a dump of an application(s) we save all the information about
      it to files, and the biggest part of such dump is the contents of tasks'
      memory.  However, there are usage scenarios where it's not required to
      get _all_ the task memory while creating a dump.  For example, when
      doing periodical dumps, it's only required to take full memory dump only
      at the first step and then take incremental changes of memory.  Another
      example is live migration.  We copy all the memory to the destination
      node without stopping all tasks, then stop them, check for what pages
      has changed, dump it and the rest of the state, then copy it to the
      destination node.  This decreases freeze time significantly.
      
      That said, some help from kernel to watch how processes modify the
      contents of their memory is required.
      
      The proposal is to track changes with the help of new soft-dirty bit
      this way:
      
      1. First do "echo 4 > /proc/$pid/clear_refs".
         At that point kernel clears the soft dirty _and_ the writable bits from all
         ptes of process $pid. From now on every write to any page will result in #pf
         and the subsequent call to pte_mkdirty/pmd_mkdirty, which in turn will set
         the soft dirty flag.
      
      2. Then read the /proc/$pid/pagemap2 and check the soft-dirty bit reported there
         (the 55'th one). If set, the respective pte was written to since last call
         to clear refs.
      
      The soft-dirty bit is the _PAGE_BIT_HIDDEN one.  Although it's used by
      kmemcheck, the latter one marks kernel pages with it, while the former
      bit is put on user pages so they do not conflict to each other.
      
      This patch:
      
      A new clear-refs type will be added in the next patch, so prepare
      code for that.
      
      [akpm@linux-foundation.org: don't assume that sizeof(enum clear_refs_types) == sizeof(int)]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      040fa020
  3. 23 2月, 2013 1 次提交
  4. 03 1月, 2013 1 次提交
  5. 18 12月, 2012 1 次提交
    • C
      procfs: add VmFlags field in smaps output · 834f82e2
      Cyrill Gorcunov 提交于
      During c/r sessions we've found that there is no way at the moment to
      fetch some VMA associated flags, such as mlock() and madvise().
      
      This leads us to a problem -- we don't know if we should call for mlock()
      and/or madvise() after restore on the vma area we're bringing back to
      life.
      
      This patch intorduces a new field into "smaps" output called VmFlags,
      where all set flags associated with the particular VMA is shown as two
      letter mnemonics.
      
      [ Strictly speaking for c/r we only need mlock/madvise bits but it has been
        said that providing just a few flags looks somehow inconsistent.  So all
        flags are here now. ]
      
      This feature is made available on CONFIG_CHECKPOINT_RESTORE=n kernels, as
      other applications may start to use these fields.
      
      The data is encoded in a somewhat awkward two letters mnemonic form, to
      encourage userspace to be prepared for fields being added or removed in
      the future.
      
      [a.p.zijlstra@chello.nl: props to use for_each_set_bit]
      [sfr@canb.auug.org.au: props to use array instead of struct]
      [akpm@linux-foundation.org: overall redesign and simplification]
      [akpm@linux-foundation.org: remove unneeded braces per sfr, avoid using bloaty for_each_set_bit()]
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      834f82e2
  6. 13 12月, 2012 2 次提交
  7. 20 10月, 2012 1 次提交
  8. 17 10月, 2012 1 次提交
    • D
      mm, mempolicy: fix printing stack contents in numa_maps · 32f8516a
      David Rientjes 提交于
      When reading /proc/pid/numa_maps, it's possible to return the contents of
      the stack where the mempolicy string should be printed if the policy gets
      freed from beneath us.
      
      This happens because mpol_to_str() may return an error the
      stack-allocated buffer is then printed without ever being stored.
      
      There are two possible error conditions in mpol_to_str():
      
       - if the buffer allocated is insufficient for the string to be stored,
         and
      
       - if the mempolicy has an invalid mode.
      
      The first error condition is not triggered in any of the callers to
      mpol_to_str(): at least 50 bytes is always allocated on the stack and this
      is sufficient for the string to be written.  A future patch should convert
      this into BUILD_BUG_ON() since we know the maximum strlen possible, but
      that's not -rc material.
      
      The second error condition is possible if a race occurs in dropping a
      reference to a task's mempolicy causing it to be freed during the read().
      The slab poison value is then used for the mode and mpol_to_str() returns
      -EINVAL.
      
      This race is only possible because get_vma_policy() believes that
      mm->mmap_sem protects task->mempolicy, which isn't true.  The exit path
      does not hold mm->mmap_sem when dropping the reference or setting
      task->mempolicy to NULL: it uses task_lock(task) instead.
      
      Thus, it's required for the caller of a task mempolicy to hold
      task_lock(task) while grabbing the mempolicy and reading it.  Callers with
      a vma policy store their mempolicy earlier and can simply increment the
      reference count so it's guaranteed not to be freed.
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      32f8516a
  9. 09 10月, 2012 1 次提交
    • K
      mm: kill vma flag VM_RESERVED and mm->reserved_vm counter · 314e51b9
      Konstantin Khlebnikov 提交于
      A long time ago, in v2.4, VM_RESERVED kept swapout process off VMA,
      currently it lost original meaning but still has some effects:
      
       | effect                 | alternative flags
      -+------------------------+---------------------------------------------
      1| account as reserved_vm | VM_IO
      2| skip in core dump      | VM_IO, VM_DONTDUMP
      3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
      
      This patch removes reserved_vm counter from mm_struct.  Seems like nobody
      cares about it, it does not exported into userspace directly, it only
      reduces total_vm showed in proc.
      
      Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.
      
      remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
      remap_vmalloc_range() set VM_DONTEXPAND | VM_DONTDUMP.
      
      [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      314e51b9
  10. 01 6月, 2012 4 次提交
  11. 30 5月, 2012 1 次提交
  12. 11 5月, 2012 1 次提交
  13. 26 4月, 2012 1 次提交
  14. 30 3月, 2012 1 次提交
  15. 29 3月, 2012 1 次提交
  16. 22 3月, 2012 5 次提交
    • S
      procfs: mark thread stack correctly in proc/<pid>/maps · b7643757
      Siddhesh Poyarekar 提交于
      Stack for a new thread is mapped by userspace code and passed via
      sys_clone.  This memory is currently seen as anonymous in
      /proc/<pid>/maps, which makes it difficult to ascertain which mappings
      are being used for thread stacks.  This patch uses the individual task
      stack pointers to determine which vmas are actually thread stacks.
      
      For a multithreaded program like the following:
      
      	#include <pthread.h>
      
      	void *thread_main(void *foo)
      	{
      		while(1);
      	}
      
      	int main()
      	{
      		pthread_t t;
      		pthread_create(&t, NULL, thread_main, NULL);
      		pthread_join(t, NULL);
      	}
      
      proc/PID/maps looks like the following:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
      the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but
      that is not always a reliable way to find out which vma is a thread
      stack.  Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same
      content.
      
      With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
      as the task would see it' and hence, only the vma that that task uses as
      stack is marked as [stack].  All other 'stack' vmas are marked as
      anonymous memory.  /proc/PID/maps acts as a thread group level view,
      where all thread stack vmas are marked as [stack:TID] where TID is the
      process ID of the task that uses that vma as stack, while the process
      stack is marked as [stack].
      
      So /proc/PID/maps will look like this:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      Thus marking all vmas that are used as stacks by the threads in the
      thread group along with the process stack.  The task level maps will
      however like this:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      where only the vma that is being used as a stack by *that* task is
      marked as [stack].
      
      Analogous changes have been made to /proc/PID/smaps,
      /proc/PID/numa_maps, /proc/PID/task/TID/smaps and
      /proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
      numa_maps:
      
          [siddhesh@localhost ~ ]$ pgrep a.out
          1441
          [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
          7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
          7fff6273a000 default stack anon=3 dirty=3 N0=3
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
          7f8a44492000 default stack anon=2 dirty=2 N0=2
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
          7fff6273a000 default stack anon=3 dirty=3 N0=3
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NSiddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jamie Lokier <jamie@shareable.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7643757
    • N
      pagemap: introduce data structure for pagemap entry · 092b50ba
      Naoya Horiguchi 提交于
      Currently a local variable of pagemap entry in pagemap_pte_range() is
      named pfn and typed with u64, but it's not correct (pfn should be unsigned
      long.)
      
      This patch introduces special type for pagemap entries and replaces code
      with it.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      092b50ba
    • N
      thp: optimize away unnecessary page table locking · 025c5b24
      Naoya Horiguchi 提交于
      Currently when we check if we can handle thp as it is or we need to split
      it into regular sized pages, we hold page table lock prior to check
      whether a given pmd is mapping thp or not.  Because of this, when it's not
      "huge pmd" we suffer from unnecessary lock/unlock overhead.  To remove it,
      this patch introduces a optimized check function and replace several
      similar logics with it.
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      025c5b24
    • N
      pagemap: avoid splitting thp when reading /proc/pid/pagemap · 5aaabe83
      Naoya Horiguchi 提交于
      Thp split is not necessary if we explicitly check whether pmds are mapping
      thps or not.  This patch introduces this check and adds code to generate
      pagemap entries for pmds mapping thps, which results in less performance
      impact of pagemap on thp.
      Signed-off-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NAndi Kleen <ak@linux.intel.com>
      Reviewed-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5aaabe83
    • A
      mm: thp: fix pmd_bad() triggering in code paths holding mmap_sem read mode · 1a5a9906
      Andrea Arcangeli 提交于
      In some cases it may happen that pmd_none_or_clear_bad() is called with
      the mmap_sem hold in read mode.  In those cases the huge page faults can
      allocate hugepmds under pmd_none_or_clear_bad() and that can trigger a
      false positive from pmd_bad() that will not like to see a pmd
      materializing as trans huge.
      
      It's not khugepaged causing the problem, khugepaged holds the mmap_sem
      in write mode (and all those sites must hold the mmap_sem in read mode
      to prevent pagetables to go away from under them, during code review it
      seems vm86 mode on 32bit kernels requires that too unless it's
      restricted to 1 thread per process or UP builds).  The race is only with
      the huge pagefaults that can convert a pmd_none() into a
      pmd_trans_huge().
      
      Effectively all these pmd_none_or_clear_bad() sites running with
      mmap_sem in read mode are somewhat speculative with the page faults, and
      the result is always undefined when they run simultaneously.  This is
      probably why it wasn't common to run into this.  For example if the
      madvise(MADV_DONTNEED) runs zap_page_range() shortly before the page
      fault, the hugepage will not be zapped, if the page fault runs first it
      will be zapped.
      
      Altering pmd_bad() not to error out if it finds hugepmds won't be enough
      to fix this, because zap_pmd_range would then proceed to call
      zap_pte_range (which would be incorrect if the pmd become a
      pmd_trans_huge()).
      
      The simplest way to fix this is to read the pmd in the local stack
      (regardless of what we read, no need of actual CPU barriers, only
      compiler barrier needed), and be sure it is not changing under the code
      that computes its value.  Even if the real pmd is changing under the
      value we hold on the stack, we don't care.  If we actually end up in
      zap_pte_range it means the pmd was not none already and it was not huge,
      and it can't become huge from under us (khugepaged locking explained
      above).
      
      All we need is to enforce that there is no way anymore that in a code
      path like below, pmd_trans_huge can be false, but pmd_none_or_clear_bad
      can run into a hugepmd.  The overhead of a barrier() is just a compiler
      tweak and should not be measurable (I only added it for THP builds).  I
      don't exclude different compiler versions may have prevented the race
      too by caching the value of *pmd on the stack (that hasn't been
      verified, but it wouldn't be impossible considering
      pmd_none_or_clear_bad, pmd_bad, pmd_trans_huge, pmd_none are all inlines
      and there's no external function called in between pmd_trans_huge and
      pmd_none_or_clear_bad).
      
      		if (pmd_trans_huge(*pmd)) {
      			if (next-addr != HPAGE_PMD_SIZE) {
      				VM_BUG_ON(!rwsem_is_locked(&tlb->mm->mmap_sem));
      				split_huge_page_pmd(vma->vm_mm, pmd);
      			} else if (zap_huge_pmd(tlb, vma, pmd, addr))
      				continue;
      			/* fall through */
      		}
      		if (pmd_none_or_clear_bad(pmd))
      
      Because this race condition could be exercised without special
      privileges this was reported in CVE-2012-1179.
      
      The race was identified and fully explained by Ulrich who debugged it.
      I'm quoting his accurate explanation below, for reference.
      
      ====== start quote =======
            mapcount 0 page_mapcount 1
            kernel BUG at mm/huge_memory.c:1384!
      
          At some point prior to the panic, a "bad pmd ..." message similar to the
          following is logged on the console:
      
            mm/memory.c:145: bad pmd ffff8800376e1f98(80000000314000e7).
      
          The "bad pmd ..." message is logged by pmd_clear_bad() before it clears
          the page's PMD table entry.
      
              143 void pmd_clear_bad(pmd_t *pmd)
              144 {
          ->  145         pmd_ERROR(*pmd);
              146         pmd_clear(pmd);
              147 }
      
          After the PMD table entry has been cleared, there is an inconsistency
          between the actual number of PMD table entries that are mapping the page
          and the page's map count (_mapcount field in struct page). When the page
          is subsequently reclaimed, __split_huge_page() detects this inconsistency.
      
             1381         if (mapcount != page_mapcount(page))
             1382                 printk(KERN_ERR "mapcount %d page_mapcount %d\n",
             1383                        mapcount, page_mapcount(page));
          -> 1384         BUG_ON(mapcount != page_mapcount(page));
      
          The root cause of the problem is a race of two threads in a multithreaded
          process. Thread B incurs a page fault on a virtual address that has never
          been accessed (PMD entry is zero) while Thread A is executing an madvise()
          system call on a virtual address within the same 2 MB (huge page) range.
      
                     virtual address space
                    .---------------------.
                    |                     |
                    |                     |
                  .-|---------------------|
                  | |                     |
                  | |                     |<-- B(fault)
                  | |                     |
            2 MB  | |/////////////////////|-.
            huge <  |/////////////////////|  > A(range)
            page  | |/////////////////////|-'
                  | |                     |
                  | |                     |
                  '-|---------------------|
                    |                     |
                    |                     |
                    '---------------------'
      
          - Thread A is executing an madvise(..., MADV_DONTNEED) system call
            on the virtual address range "A(range)" shown in the picture.
      
          sys_madvise
            // Acquire the semaphore in shared mode.
            down_read(&current->mm->mmap_sem)
            ...
            madvise_vma
              switch (behavior)
              case MADV_DONTNEED:
                   madvise_dontneed
                     zap_page_range
                       unmap_vmas
                         unmap_page_range
                           zap_pud_range
                             zap_pmd_range
                               //
                               // Assume that this huge page has never been accessed.
                               // I.e. content of the PMD entry is zero (not mapped).
                               //
                               if (pmd_trans_huge(*pmd)) {
                                   // We don't get here due to the above assumption.
                               }
                               //
                               // Assume that Thread B incurred a page fault and
                   .---------> // sneaks in here as shown below.
                   |           //
                   |           if (pmd_none_or_clear_bad(pmd))
                   |               {
                   |                 if (unlikely(pmd_bad(*pmd)))
                   |                     pmd_clear_bad
                   |                     {
                   |                       pmd_ERROR
                   |                         // Log "bad pmd ..." message here.
                   |                       pmd_clear
                   |                         // Clear the page's PMD entry.
                   |                         // Thread B incremented the map count
                   |                         // in page_add_new_anon_rmap(), but
                   |                         // now the page is no longer mapped
                   |                         // by a PMD entry (-> inconsistency).
                   |                     }
                   |               }
                   |
                   v
          - Thread B is handling a page fault on virtual address "B(fault)" shown
            in the picture.
      
          ...
          do_page_fault
            __do_page_fault
              // Acquire the semaphore in shared mode.
              down_read_trylock(&mm->mmap_sem)
              ...
              handle_mm_fault
                if (pmd_none(*pmd) && transparent_hugepage_enabled(vma))
                    // We get here due to the above assumption (PMD entry is zero).
                    do_huge_pmd_anonymous_page
                      alloc_hugepage_vma
                        // Allocate a new transparent huge page here.
                      ...
                      __do_huge_pmd_anonymous_page
                        ...
                        spin_lock(&mm->page_table_lock)
                        ...
                        page_add_new_anon_rmap
                          // Here we increment the page's map count (starts at -1).
                          atomic_set(&page->_mapcount, 0)
                        set_pmd_at
                          // Here we set the page's PMD entry which will be cleared
                          // when Thread A calls pmd_clear_bad().
                        ...
                        spin_unlock(&mm->page_table_lock)
      
          The mmap_sem does not prevent the race because both threads are acquiring
          it in shared mode (down_read).  Thread B holds the page_table_lock while
          the page's map count and PMD table entry are updated.  However, Thread A
          does not synchronize on that lock.
      
      ====== end quote =======
      
      [akpm@linux-foundation.org: checkpatch fixes]
      Reported-by: NUlrich Obergfell <uobergfe@redhat.com>
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Dave Jones <davej@redhat.com>
      Acked-by: NLarry Woodman <lwoodman@redhat.com>
      Acked-by: NRik van Riel <riel@redhat.com>
      Cc: <stable@vger.kernel.org>		[2.6.38+]
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1a5a9906
  17. 24 1月, 2012 1 次提交
    • W
      proc: clear_refs: do not clear reserved pages · 85e72aa5
      Will Deacon 提交于
      /proc/pid/clear_refs is used to clear the Referenced and YOUNG bits for
      pages and corresponding page table entries of the task with PID pid, which
      includes any special mappings inserted into the page tables in order to
      provide things like vDSOs and user helper functions.
      
      On ARM this causes a problem because the vectors page is mapped as a
      global mapping and since ec706dab ("ARM: add a vma entry for the user
      accessible vector page"), a VMA is also inserted into each task for this
      page to aid unwinding through signals and syscall restarts.  Since the
      vectors page is required for handling faults, clearing the YOUNG bit (and
      subsequently writing a faulting pte) means that we lose the vectors page
      *globally* and cannot fault it back in.  This results in a system deadlock
      on the next exception.
      
      To see this problem in action, just run:
      
      	$ echo 1 > /proc/self/clear_refs
      
      on an ARM platform (as any user) and watch your system hang.  I think this
      has been the case since 2.6.37
      
      This patch avoids clearing the aforementioned bits for reserved pages,
      therefore leaving the vectors page intact on ARM.  Since reserved pages
      are not candidates for swap, this change should not have any impact on the
      usefulness of clear_refs.
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Reported-by: NMoussa Ba <moussaba@micron.com>
      Acked-by: NHugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Acked-by: NNicolas Pitre <nico@linaro.org>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: <stable@vger.kernel.org>		[2.6.37+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      85e72aa5
  18. 01 11月, 2011 2 次提交
  19. 22 9月, 2011 3 次提交
  20. 27 5月, 2011 3 次提交
  21. 25 5月, 2011 1 次提交
    • S
      proc: allocate storage for numa_maps statistics once · 5b52fc89
      Stephen Wilson 提交于
      In show_numa_map() we collect statistics into a numa_maps structure.
      Since the number of NUMA nodes can be very large, this structure is not a
      candidate for stack allocation.
      
      Instead of going thru a kmalloc()+kfree() cycle each time show_numa_map()
      is invoked, perform the allocation just once when /proc/pid/numa_maps is
      opened.
      
      Performing the allocation when numa_maps is opened, and thus before a
      reference to the target tasks mm is taken, eliminates a potential
      stalemate condition in the oom-killer as originally described by Hugh
      Dickins:
      
        ... imagine what happens if the system is out of memory, and the mm
        we're looking at is selected for killing by the OOM killer: while
        we wait in __get_free_page for more memory, no memory is freed
        from the selected mm because it cannot reach exit_mmap while we hold
        that reference.
      Signed-off-by: NStephen Wilson <wilsons@start.ca>
      Reviewed-by: NKOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5b52fc89