1. 09 10月, 2012 5 次提交
    • K
      mm: kill vma flag VM_EXECUTABLE and mm->num_exe_file_vmas · e9714acf
      Konstantin Khlebnikov 提交于
      Currently the kernel sets mm->exe_file during sys_execve() and then tracks
      number of vmas with VM_EXECUTABLE flag in mm->num_exe_file_vmas, as soon
      as this counter drops to zero kernel resets mm->exe_file to NULL.  Plus it
      resets mm->exe_file at last mmput() when mm->mm_users drops to zero.
      
      VMA with VM_EXECUTABLE flag appears after mapping file with flag
      MAP_EXECUTABLE, such vmas can appears only at sys_execve() or after vma
      splitting, because sys_mmap ignores this flag.  Usually binfmt module sets
      mm->exe_file and mmaps executable vmas with this file, they hold
      mm->exe_file while task is running.
      
      comment from v2.6.25-6245-g925d1c40 ("procfs task exe symlink"),
      where all this stuff was introduced:
      
      > The kernel implements readlink of /proc/pid/exe by getting the file from
      > the first executable VMA.  Then the path to the file is reconstructed and
      > reported as the result.
      >
      > Because of the VMA walk the code is slightly different on nommu systems.
      > This patch avoids separate /proc/pid/exe code on nommu systems.  Instead of
      > walking the VMAs to find the first executable file-backed VMA we store a
      > reference to the exec'd file in the mm_struct.
      >
      > That reference would prevent the filesystem holding the executable file
      > from being unmounted even after unmapping the VMAs.  So we track the number
      > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
      > unmapped.  This avoids pinning the mounted filesystem.
      
      exe_file's vma accounting is hooked into every file mmap/unmmap and vma
      split/merge just to fix some hypothetical pinning fs from umounting by mm,
      which already unmapped all its executable files, but still alive.
      
      Seems like currently nobody depends on this behaviour.  We can try to
      remove this logic and keep mm->exe_file until final mmput().
      
      mm->exe_file is still protected with mm->mmap_sem, because we want to
      change it via new sys_prctl(PR_SET_MM_EXE_FILE).  Also via this syscall
      task can change its mm->exe_file and unpin mountpoint explicitly.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9714acf
    • K
      mm: kill vma flag VM_CAN_NONLINEAR · 0b173bc4
      Konstantin Khlebnikov 提交于
      Move actual pte filling for non-linear file mappings into the new special
      vma operation: ->remap_pages().
      
      Filesystems must implement this method to get non-linear mapping support,
      if it uses filemap_fault() then generic_file_remap_pages() can be used.
      
      Now device drivers can implement this method and obtain nonlinear vma support.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>	#arch/tile
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b173bc4
    • K
      mm: kill vma flag VM_INSERTPAGE · 4b6e1e37
      Konstantin Khlebnikov 提交于
      Merge VM_INSERTPAGE into VM_MIXEDMAP.  VM_MIXEDMAP VMA can mix pure-pfn
      ptes, special ptes and normal ptes.
      
      Now copy_page_range() always copies VM_MIXEDMAP VMA on fork like
      VM_PFNMAP.  If driver populates whole VMA at mmap() it probably not
      expects page-faults.
      
      This patch removes special check from vma_wants_writenotify() which
      disables pages write tracking for VMA populated via vm_instert_page().
      BDI below mapped file should not use dirty-accounting, moreover
      do_wp_page() can handle this.
      
      vm_insert_page() still marks vma after first usage.  Usually it is called
      from f_op->mmap() handler under mm->mmap_sem write-lock, so it able to
      change vma->vm_flags.  Caller must set VM_MIXEDMAP at mmap time if it
      wants to call this function from other places, for example from page-fault
      handler.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4b6e1e37
    • K
      mm: introduce arch-specific vma flag VM_ARCH_1 · cc2383ec
      Konstantin Khlebnikov 提交于
      Combine several arch-specific vma flags into one.
      
      before patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
      powerpc -               -               VM_SAO          -
      parisc  VM_GROWSUP      -               -               -
      ia64    VM_GROWSUP      -               -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               -               -               -
      
      after patch:
      
              0x00000200      0x01000000      0x20000000      0x40000000
      x86     -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
      powerpc -               VM_SAO          -               -
      parisc  -               VM_GROWSUP      -               -
      ia64    -               VM_GROWSUP      -               -
      nommu   -               VM_MAPPED_COPY  -               -
      others  -               VM_ARCH_1       -               -
      
      And voila! One completely free bit.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Suresh Siddha <suresh.b.siddha@intel.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      cc2383ec
    • K
      mm, x86, pat: rework linear pfn-mmap tracking · b3b9c293
      Konstantin Khlebnikov 提交于
      Replace the generic vma-flag VM_PFN_AT_MMAP with x86-only VM_PAT.
      
      We can toss mapping address from remap_pfn_range() into
      track_pfn_vma_new(), and collect all PAT-related logic together in
      arch/x86/.
      
      This patch also restores orignal frustration-free is_cow_mapping() check
      in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73a
      ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3")
      
      is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
      because it already handled by VM_PFNMAP in VM_NO_THP bit-mask.
      
      [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Signed-off-by: NSuresh Siddha <suresh.b.siddha@intel.com>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Nick Piggin <npiggin@kernel.dk>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Morris <james.l.morris@oracle.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Kentaro Takeda <takedakn@nttdata.co.jp>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Robert Richter <robert.richter@amd.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Venkatesh Pallipadi <venki@google.com>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3b9c293
  2. 01 8月, 2012 4 次提交
    • M
      mm: add get_kernel_page[s] for pinning of kernel addresses for I/O · 18022c5d
      Mel Gorman 提交于
      This patch adds two new APIs get_kernel_pages() and get_kernel_page() that
      may be used to pin a vector of kernel addresses for IO.  The initial user
      is expected to be NFS for allowing pages to be written to swap using
      aops->direct_IO().  Strictly speaking, swap-over-NFS only needs to pin one
      page for IO but it makes sense to express the API in terms of a vector and
      add a helper for pinning single pages.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Cc: Mark Salter <msalter@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      18022c5d
    • M
      mm: methods for teaching filesystems about PG_swapcache pages · f981c595
      Mel Gorman 提交于
      In order to teach filesystems to handle swap cache pages, three new page
      functions are introduced:
      
        pgoff_t page_file_index(struct page *);
        loff_t page_file_offset(struct page *);
        struct address_space *page_file_mapping(struct page *);
      
      page_file_index() - gives the offset of this page in the file in
      PAGE_CACHE_SIZE blocks.  Like page->index is for mapped pages, this
      function also gives the correct index for PG_swapcache pages.
      
      page_file_offset() - uses page_file_index(), so that it will give the
      expected result, even for PG_swapcache pages.
      
      page_file_mapping() - gives the mapping backing the actual page; that is
      for swap cache pages it will give swap_file->f_mapping.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric B Munson <emunson@mgebm.net>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: James Morris <jmorris@namei.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Mike Christie <michaelc@cs.wisc.edu>
      Cc: Neil Brown <neilb@suse.de>
      Cc: Sebastian Andrzej Siewior <sebastian@breakpoint.cc>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      Cc: Xiaotian Feng <dfeng@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f981c595
    • J
      mm/hotplug: free zone->pageset when a zone becomes empty · 340175b7
      Jiang Liu 提交于
      When a zone becomes empty after memory offlining, free zone->pageset.
      Otherwise it will cause memory leak when adding memory to the empty zone
      again because build_all_zonelists() will allocate zone->pageset for an
      empty zone.
      Signed-off-by: NJiang Liu <liuj97@gmail.com>
      Signed-off-by: NWei Wang <Bessel.Wang@huawei.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Keping Chen <chenkeping@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      340175b7
    • H
      mm: account the total_vm in the vm_stat_account() · 44de9d0c
      Huang Shijie 提交于
      vm_stat_account() accounts the shared_vm, stack_vm and reserved_vm now.
      But we can also account for total_vm in the vm_stat_account() which makes
      the code tidy.
      
      Even for mprotect_fixup(), we can get the right result in the end.
      Signed-off-by: NHuang Shijie <shijie8@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44de9d0c
  3. 31 7月, 2012 1 次提交
  4. 12 7月, 2012 1 次提交
    • T
      x86/mce: Fix siginfo_t->si_addr value for non-recoverable memory faults · 6751ed65
      Tony Luck 提交于
      In commit dad1743e ("x86/mce: Only restart instruction after machine
      check recovery if it is safe") we fixed mce_notify_process() to force a
      signal to the current process if it was not restartable (RIPV bit not
      set in MCG_STATUS). But doing it here means that the process doesn't
      get told the virtual address of the fault via siginfo_t->si_addr. This
      would prevent application level recovery from the fault.
      
      Make a new MF_MUST_KILL flag bit for memory_failure() et al. to use so
      that we will provide the right information with the signal.
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      Acked-by: NBorislav Petkov <borislav.petkov@amd.com>
      Cc: stable@kernel.org    # 3.4+
      6751ed65
  5. 01 6月, 2012 1 次提交
  6. 30 5月, 2012 2 次提交
  7. 07 5月, 2012 2 次提交
  8. 21 4月, 2012 4 次提交
  9. 29 3月, 2012 2 次提交
    • H
      mm for fs: add truncate_pagecache_range() · 623e3db9
      Hugh Dickins 提交于
      Holepunching filesystems ext4 and xfs are using truncate_inode_pages_range
      but forgetting to unmap pages first (ocfs2 remembers).  This is not really
      a bug, since races already require truncate_inode_page() to handle that
      case once the page is locked; but it can be very inefficient if the file
      being punched happens to be mapped into many vmas.
      
      Provide a drop-in replacement truncate_pagecache_range() which does the
      unmapping pass first, handling the awkward mismatch between arguments to
      truncate_inode_pages_range() and arguments to unmap_mapping_range().
      
      Note that holepunching does not unmap privately COWed pages in the range:
      POSIX requires that we do so when truncating, but it's hard to justify,
      difficult to implement without an i_size cutoff, and no filesystem is
      attempting to implement it.
      Signed-off-by: NHugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Andreas Dilger <adilger.kernel@dilger.ca>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Ben Myers <bpm@sgi.com>
      Cc: Alex Elder <elder@kernel.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      623e3db9
    • D
      Move all declarations of free_initmem() to linux/mm.h · 49a7f04a
      David Howells 提交于
      Move all declarations of free_initmem() to linux/mm.h so that there's only one
      and it's used by everything.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: linux-c6x-dev@linux-c6x.org
      cc: microblaze-uclinux@itee.uq.edu.au
      cc: linux-sh@vger.kernel.org
      cc: sparclinux@vger.kernel.org
      cc: x86@kernel.org
      cc: linux-mm@kvack.org
      49a7f04a
  10. 24 3月, 2012 2 次提交
    • J
      coredump: add VM_NODUMP, MADV_NODUMP, MADV_CLEAR_NODUMP · accb61fe
      Jason Baron 提交于
      Since we no longer need the VM_ALWAYSDUMP flag, let's use the freed bit
      for 'VM_NODUMP' flag.  The idea is is to add a new madvise() flag:
      MADV_DONTDUMP, which can be set by applications to specifically request
      memory regions which should not dump core.
      
      The specific application I have in mind is qemu: we can add a flag there
      that wouldn't dump all of guest memory when qemu dumps core.  This flag
      might also be useful for security sensitive apps that want to absolutely
      make sure that parts of memory are not dumped.  To clear the flag use:
      MADV_DODUMP.
      
      [akpm@linux-foundation.org: s/MADV_NODUMP/MADV_DONTDUMP/, s/MADV_CLEAR_NODUMP/MADV_DODUMP/, per Roland]
      [akpm@linux-foundation.org: fix up the architectures which broke]
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Acked-by: NRoland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
      Cc: Matt Turner <mattst88@gmail.com>
      Cc: "James E.J. Bottomley" <jejb@parisc-linux.org>
      Cc: Helge Deller <deller@gmx.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      accb61fe
    • J
      coredump: remove VM_ALWAYSDUMP flag · 909af768
      Jason Baron 提交于
      The motivation for this patchset was that I was looking at a way for a
      qemu-kvm process, to exclude the guest memory from its core dump, which
      can be quite large.  There are already a number of filter flags in
      /proc/<pid>/coredump_filter, however, these allow one to specify 'types'
      of kernel memory, not specific address ranges (which is needed in this
      case).
      
      Since there are no more vma flags available, the first patch eliminates
      the need for the 'VM_ALWAYSDUMP' flag.  The flag is used internally by
      the kernel to mark vdso and vsyscall pages.  However, it is simple
      enough to check if a vma covers a vdso or vsyscall page without the need
      for this flag.
      
      The second patch then replaces the 'VM_ALWAYSDUMP' flag with a new
      'VM_NODUMP' flag, which can be set by userspace using new madvise flags:
      'MADV_DONTDUMP', and unset via 'MADV_DODUMP'.  The core dump filters
      continue to work the same as before unless 'MADV_DONTDUMP' is set on the
      region.
      
      The qemu code which implements this features is at:
      
        http://people.redhat.com/~jbaron/qemu-dump/qemu-dump.patch
      
      In my testing the qemu core dump shrunk from 383MB -> 13MB with this
      patch.
      
      I also believe that the 'MADV_DONTDUMP' flag might be useful for
      security sensitive apps, which might want to select which areas are
      dumped.
      
      This patch:
      
      The VM_ALWAYSDUMP flag is currently used by the coredump code to
      indicate that a vma is part of a vsyscall or vdso section.  However, we
      can determine if a vma is in one these sections by checking it against
      the gate_vma and checking for a non-NULL return value from
      arch_vma_name().  Thus, freeing a valuable vma bit.
      Signed-off-by: NJason Baron <jbaron@redhat.com>
      Acked-by: NRoland McGrath <roland@hack.frob.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Avi Kivity <avi@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      909af768
  11. 22 3月, 2012 4 次提交
    • K
      page_alloc.c: remove add_from_early_node_map() · 8d13bddd
      Kautuk Consul 提交于
      add_from_early_node_map() is unused.
      Signed-off-by: NKautuk Consul <consul.kautuk@gmail.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8d13bddd
    • D
      mm, counters: remove task argument to sync_mm_rss() and __sync_task_rss_stat() · 05af2e10
      David Rientjes 提交于
      sync_mm_rss() can only be used for current to avoid race conditions in
      iterating and clearing its per-task counters.  Remove the task argument
      for it and its helper function, __sync_task_rss_stat(), to avoid thinking
      it can be used safely for anything other than current.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NKAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      05af2e10
    • S
      procfs: mark thread stack correctly in proc/<pid>/maps · b7643757
      Siddhesh Poyarekar 提交于
      Stack for a new thread is mapped by userspace code and passed via
      sys_clone.  This memory is currently seen as anonymous in
      /proc/<pid>/maps, which makes it difficult to ascertain which mappings
      are being used for thread stacks.  This patch uses the individual task
      stack pointers to determine which vmas are actually thread stacks.
      
      For a multithreaded program like the following:
      
      	#include <pthread.h>
      
      	void *thread_main(void *foo)
      	{
      		while(1);
      	}
      
      	int main()
      	{
      		pthread_t t;
      		pthread_create(&t, NULL, thread_main, NULL);
      		pthread_join(t, NULL);
      	}
      
      proc/PID/maps looks like the following:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      Here, one could guess that 7f8a44492000-7f8a44c92000 is a stack since
      the earlier vma that has no permissions (7f8a44e3d000-7f8a4503d000) but
      that is not always a reliable way to find out which vma is a thread
      stack.  Also, /proc/PID/maps and /proc/PID/task/TID/maps has the same
      content.
      
      With this patch in place, /proc/PID/task/TID/maps are treated as 'maps
      as the task would see it' and hence, only the vma that that task uses as
      stack is marked as [stack].  All other 'stack' vmas are marked as
      anonymous memory.  /proc/PID/maps acts as a thread group level view,
      where all thread stack vmas are marked as [stack:TID] where TID is the
      process ID of the task that uses that vma as stack, while the process
      stack is marked as [stack].
      
      So /proc/PID/maps will look like this:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      Thus marking all vmas that are used as stacks by the threads in the
      thread group along with the process stack.  The task level maps will
      however like this:
      
          00400000-00401000 r-xp 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          00600000-00601000 rw-p 00000000 fd:0a 3671804                            /home/siddhesh/a.out
          019ef000-01a10000 rw-p 00000000 00:00 0                                  [heap]
          7f8a44491000-7f8a44492000 ---p 00000000 00:00 0
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
          7f8a44c92000-7f8a44e3d000 r-xp 00000000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a44e3d000-7f8a4503d000 ---p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a4503d000-7f8a45041000 r--p 001ab000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45041000-7f8a45043000 rw-p 001af000 fd:00 2097482                    /lib64/libc-2.14.90.so
          7f8a45043000-7f8a45048000 rw-p 00000000 00:00 0
          7f8a45048000-7f8a4505f000 r-xp 00000000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4505f000-7f8a4525e000 ---p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525e000-7f8a4525f000 r--p 00016000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a4525f000-7f8a45260000 rw-p 00017000 fd:00 2099938                    /lib64/libpthread-2.14.90.so
          7f8a45260000-7f8a45264000 rw-p 00000000 00:00 0
          7f8a45264000-7f8a45286000 r-xp 00000000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45457000-7f8a4545a000 rw-p 00000000 00:00 0
          7f8a45484000-7f8a45485000 rw-p 00000000 00:00 0
          7f8a45485000-7f8a45486000 r--p 00021000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45486000-7f8a45487000 rw-p 00022000 fd:00 2097348                    /lib64/ld-2.14.90.so
          7f8a45487000-7f8a45488000 rw-p 00000000 00:00 0
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0
          7fff627ff000-7fff62800000 r-xp 00000000 00:00 0                          [vdso]
          ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
      
      where only the vma that is being used as a stack by *that* task is
      marked as [stack].
      
      Analogous changes have been made to /proc/PID/smaps,
      /proc/PID/numa_maps, /proc/PID/task/TID/smaps and
      /proc/PID/task/TID/numa_maps. Relevant snippets from smaps and
      numa_maps:
      
          [siddhesh@localhost ~ ]$ pgrep a.out
          1441
          [siddhesh@localhost ~ ]$ cat /proc/1441/smaps | grep "\[stack"
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack:1442]
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/smaps | grep "\[stack"
          7f8a44492000-7f8a44c92000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/smaps | grep "\[stack"
          7fff6273b000-7fff6275c000 rw-p 00000000 00:00 0                          [stack]
          [siddhesh@localhost ~ ]$ cat /proc/1441/numa_maps | grep "stack"
          7f8a44492000 default stack:1442 anon=2 dirty=2 N0=2
          7fff6273a000 default stack anon=3 dirty=3 N0=3
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1442/numa_maps | grep "stack"
          7f8a44492000 default stack anon=2 dirty=2 N0=2
          [siddhesh@localhost ~ ]$ cat /proc/1441/task/1441/numa_maps | grep "stack"
          7fff6273a000 default stack anon=3 dirty=3 N0=3
      
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix build]
      Signed-off-by: NSiddhesh Poyarekar <siddhesh.poyarekar@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Jamie Lokier <jamie@shareable.org>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Matt Mackall <mpm@selenic.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b7643757
    • K
      mm: make get_mm_counter static-inline · 69c97823
      Konstantin Khlebnikov 提交于
      Make get_mm_counter() always static inline, it is simple enough for that.
      And remove unused set_mm_counter()
      
      bloat-o-meter:
      
      add/remove: 0/1 grow/shrink: 4/12 up/down: 99/-341 (-242)
      function                                     old     new   delta
      try_to_unmap_one                             886     952     +66
      sys_remap_file_pages                        1214    1230     +16
      dup_mm                                      1684    1700     +16
      do_exit                                     2277    2278      +1
      zap_page_range                               208     205      -3
      unmap_region                                 304     296      -8
      static.oom_kill_process                      554     546      -8
      try_to_unmap_file                           1716    1700     -16
      getrusage                                    925     909     -16
      flush_old_exec                              1704    1688     -16
      static.dump_header                           416     390     -26
      acct_update_integrals                        218     187     -31
      do_task_stat                                2986    2954     -32
      get_mm_counter                                34       -     -34
      xacct_add_tsk                                371     334     -37
      task_statm                                   172     118     -54
      task_mem                                     383     323     -60
      
      try_to_unmap_one() grows because update_hiwater_rss() now completely inline.
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: NKirill A. Shutemov <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69c97823
  12. 21 3月, 2012 2 次提交
  13. 05 3月, 2012 1 次提交
    • P
      BUG: headers with BUG/BUG_ON etc. need linux/bug.h · 187f1882
      Paul Gortmaker 提交于
      If a header file is making use of BUG, BUG_ON, BUILD_BUG_ON, or any
      other BUG variant in a static inline (i.e. not in a #define) then
      that header really should be including <linux/bug.h> and not just
      expecting it to be implicitly present.
      
      We can make this change risk-free, since if the files using these
      headers didn't have exposure to linux/bug.h already, they would have
      been causing compile failures/warnings.
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      187f1882
  14. 11 1月, 2012 2 次提交
    • P
      procfs: introduce the /proc/<pid>/map_files/ directory · 640708a2
      Pavel Emelyanov 提交于
      This one behaves similarly to the /proc/<pid>/fd/ one - it contains
      symlinks one for each mapping with file, the name of a symlink is
      "vma->vm_start-vma->vm_end", the target is the file.  Opening a symlink
      results in a file that point exactly to the same inode as them vma's one.
      
      For example the ls -l of some arbitrary /proc/<pid>/map_files/
      
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80403000-7f8f80404000 -> /lib64/libc-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f8061e000-7f8f80620000 -> /lib64/libselinux.so.1
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80826000-7f8f80827000 -> /lib64/libacl.so.1.1.0
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a2f000-7f8f80a30000 -> /lib64/librt-2.5.so
       | lr-x------ 1 root root 64 Aug 26 06:40 7f8f80a30000-7f8f80a4c000 -> /lib64/ld-2.5.so
      
      This *helps* checkpointing process in three ways:
      
      1. When dumping a task mappings we do know exact file that is mapped
         by particular region.  We do this by opening
         /proc/$pid/map_files/$address symlink the way we do with file
         descriptors.
      
      2. This also helps in determining which anonymous shared mappings are
         shared with each other by comparing the inodes of them.
      
      3. When restoring a set of processes in case two of them has a mapping
         shared, we map the memory by the 1st one and then open its
         /proc/$pid/map_files/$address file and map it by the 2nd task.
      
      Using /proc/$pid/maps for this is quite inconvenient since it brings
      repeatable re-reading and reparsing for this text file which slows down
      restore procedure significantly.  Also as being pointed in (3) it is a way
      easier to use top level shared mapping in children as
      /proc/$pid/map_files/$address when needed.
      
      [akpm@linux-foundation.org: coding-style fixes]
      [gorcunov@openvz.org: make map_files depend on CHECKPOINT_RESTORE]
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NCyrill Gorcunov <gorcunov@openvz.org>
      Reviewed-by: NVasiliy Kulikov <segoon@openwall.com>
      Reviewed-by: N"Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      640708a2
    • S
      mm: more intensive memory corruption debugging · c0a32fc5
      Stanislaw Gruszka 提交于
      With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
      on access (read,write) to an unallocated page, which permits us to catch
      code which corrupts memory.  However the kernel is trying to maximise
      memory usage, hence there are usually few free pages in the system and
      buggy code usually corrupts some crucial data.
      
      This patch changes the buddy allocator to keep more free/protected pages
      and to interlace free/protected and allocated pages to increase the
      probability of catching corruption.
      
      When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
      debug_guardpage_minorder defines the minimum order used by the page
      allocator to grant a request.  The requested size will be returned with
      the remaining pages used as guard pages.
      
      The default value of debug_guardpage_minorder is zero: no change from
      current behaviour.
      
      [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c0a32fc5
  15. 04 1月, 2012 2 次提交
    • T
      HWPOISON: Add code to handle "action required" errors. · 7329bbeb
      Tony Luck 提交于
      Add new flag bit "MF_ACTION_REQUIRED" to be used by machine check
      code to force a signal with si_code = BUS_MCEERR_AR in the case
      where the error occurs in processor execution context. Pass the
      flags argument along call chain:
      	memory_failure()
      	  hwpoison_user_mappings()
      	    kill_procs()
      	      kill_proc()
      
      Drop the "_ao" suffix from kill_procs_ao() and kill_proc_ao() since
      they can now handle "action required" as well as "action optional" errors.
      Acked-by: NBorislav Petkov <bp@amd64.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      7329bbeb
    • T
      HWPOISON: Clean up memory_failure() vs. __memory_failure() · cd42f4a3
      Tony Luck 提交于
      There is only one caller of memory_failure(), all other users call
      __memory_failure() and pass in the flags argument explicitly. The
      lone user of memory_failure() will soon need to pass flags too.
      
      Add flags argument to the callsite in mce.c. Delete the old memory_failure()
      function, and then rename __memory_failure() without the leading "__".
      
      Provide clearer message when action optional memory errors are ignored.
      Acked-by: NBorislav Petkov <bp@amd64.org>
      Signed-off-by: NTony Luck <tony.luck@intel.com>
      cd42f4a3
  16. 09 12月, 2011 2 次提交
    • K
      vmscan: use atomic-long for shrinker batching · 83aeeada
      Konstantin Khlebnikov 提交于
      Use atomic-long operations instead of looping around cmpxchg().
      
      [akpm@linux-foundation.org: massage atomic.h inclusions]
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      83aeeada
    • T
      memblock: Kill early_node_map[] · 0ee332c1
      Tejun Heo 提交于
      Now all ARCH_POPULATES_NODE_MAP archs select HAVE_MEBLOCK_NODE_MAP -
      there's no user of early_node_map[] left.  Kill early_node_map[] and
      replace ARCH_POPULATES_NODE_MAP with HAVE_MEMBLOCK_NODE_MAP.  Also,
      relocate for_each_mem_pfn_range() and helper from mm.h to memblock.h
      as page_alloc.c would no longer host an alternative implementation.
      
      This change is ultimately one to one mapping and shouldn't cause any
      observable difference; however, after the recent changes, there are
      some functions which now would fit memblock.c better than page_alloc.c
      and dependency on HAVE_MEMBLOCK_NODE_MAP instead of HAVE_MEMBLOCK
      doesn't make much sense on some of them.  Further cleanups for
      functions inside HAVE_MEMBLOCK_NODE_MAP in mm.h would be nice.
      
      -v2: Fix compile bug introduced by mis-spelling
       CONFIG_HAVE_MEMBLOCK_NODE_MAP to CONFIG_MEMBLOCK_HAVE_NODE_MAP in
       mmzone.h.  Reported by Stephen Rothwell.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      0ee332c1
  17. 06 12月, 2011 1 次提交
  18. 03 11月, 2011 2 次提交
    • A
      thp: share get_huge_page_tail() · b35a35b5
      Andrea Arcangeli 提交于
      This avoids duplicating the function in every arch gup_fast.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: David Miller <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b35a35b5
    • A
      mm: thp: tail page refcounting fix · 70b50f94
      Andrea Arcangeli 提交于
      Michel while working on the working set estimation code, noticed that
      calling get_page_unless_zero() on a random pfn_to_page(random_pfn)
      wasn't safe, if the pfn ended up being a tail page of a transparent
      hugepage under splitting by __split_huge_page_refcount().
      
      He then found the problem could also theoretically materialize with
      page_cache_get_speculative() during the speculative radix tree lookups
      that uses get_page_unless_zero() in SMP if the radix tree page is freed
      and reallocated and get_user_pages is called on it before
      page_cache_get_speculative has a chance to call get_page_unless_zero().
      
      So the best way to fix the problem is to keep page_tail->_count zero at
      all times.  This will guarantee that get_page_unless_zero() can never
      succeed on any tail page.  page_tail->_mapcount is guaranteed zero and
      is unused for all tail pages of a compound page, so we can simply
      account the tail page references there and transfer them to
      tail_page->_count in __split_huge_page_refcount() (in addition to the
      head_page->_mapcount).
      
      While debugging this s/_count/_mapcount/ change I also noticed get_page is
      called by direct-io.c on pages returned by get_user_pages.  That wasn't
      entirely safe because the two atomic_inc in get_page weren't atomic.  As
      opposed to other get_user_page users like secondary-MMU page fault to
      establish the shadow pagetables would never call any superflous get_page
      after get_user_page returns.  It's safer to make get_page universally safe
      for tail pages and to use get_page_foll() within follow_page (inside
      get_user_pages()).  get_page_foll() is safe to do the refcounting for tail
      pages without taking any locks because it is run within PT lock protected
      critical sections (PT lock for pte and page_table_lock for
      pmd_trans_huge).
      
      The standard get_page() as invoked by direct-io instead will now take
      the compound_lock but still only for tail pages.  The direct-io paths
      are usually I/O bound and the compound_lock is per THP so very
      finegrined, so there's no risk of scalability issues with it.  A simple
      direct-io benchmarks with all lockdep prove locking and spinlock
      debugging infrastructure enabled shows identical performance and no
      overhead.  So it's worth it.  Ideally direct-io should stop calling
      get_page() on pages returned by get_user_pages().  The spinlock in
      get_page() is already optimized away for no-THP builds but doing
      get_page() on tail pages returned by GUP is generally a rare operation
      and usually only run in I/O paths.
      
      This new refcounting on page_tail->_mapcount in addition to avoiding new
      RCU critical sections will also allow the working set estimation code to
      work without any further complexity associated to the tail page
      refcounting with THP.
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NMichel Lespinasse <walken@google.com>
      Reviewed-by: NMinchan Kim <minchan.kim@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <jweiner@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: <stable@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70b50f94