1. 13 Feb 2015, 11 commits
2. 12 Feb 2015, 10 commits
• arch/powerpc/mm/subpage-prot.c: use walk->vma and walk_page_vma() · 1757bbd9
By Naoya Horiguchi
We don't have to use mm_walk->private to pass the vma to the callback
function any more, because mm_walk->vma now carries it.  And walk_page_vma()
is useful when we walk over a single vma.
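As a rough illustration of the resulting pattern (a simplified sketch
against the 2015-era mm_walk API; the real patch walks every vma in the
requested range, and the helper names follow the powerpc file):

	static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
					  unsigned long end, struct mm_walk *walk)
	{
		/* vma now comes from the walk itself, not walk->private */
		struct vm_area_struct *vma = walk->vma;

		split_huge_page_pmd(vma, addr, pmd);
		return 0;
	}

	/* Simplified caller: walk a single vma with walk_page_vma() */
	static void subpage_mark_vma_nohuge(struct mm_struct *mm,
					    unsigned long addr, unsigned long len)
	{
		struct mm_walk subpage_proto_walk = {
			.mm = mm,
			.pmd_entry = subpage_walk_pmd_entry,
		};
		struct vm_area_struct *vma = find_vma(mm, addr);

		if (vma)
			walk_page_vma(vma, &subpage_proto_walk);
	}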
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: gup: use get_user_pages_unlocked within get_user_pages_fast · a7b78075
By Andrea Arcangeli
      This allows the get_user_pages_fast slow path to release the mmap_sem
      before blocking.
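A condensed sketch of the resulting shape, using the 3.19-era signatures
(illustrative, not the verbatim diff):

	/* Sketch of the gup_fast slow-path fallback. */
	int get_user_pages_fast(unsigned long start, int nr_pages, int write,
				struct page **pages)
	{
		int nr, ret;

		/* Lockless fast path first. */
		nr = __get_user_pages_fast(start, nr_pages, write, pages);
		ret = nr;

		if (nr < nr_pages) {
			/* Fall back for the remainder.  The unlocked variant
			 * takes mmap_sem itself and releases it before
			 * blocking on a page fault, rather than holding it
			 * across the whole operation. */
			ret = get_user_pages_unlocked(current, current->mm,
					start + ((unsigned long)nr << PAGE_SHIFT),
					nr_pages - nr, write, 0, pages + nr);
			/* Careful with return values: keep the fast-path
			 * count if the fallback failed outright. */
			if (nr > 0)
				ret = (ret < 0) ? nr : ret + nr;
		}
		return ret;
	}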
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Andres Lagar-Cavilla <andreslc@google.com>
      Cc: Peter Feiner <pfeiner@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: fix false-positive warning on exit due mm_nr_pmds(mm) · b30fe6c7
By Kirill A. Shutemov
The problem is that we check nr_ptes/nr_pmds in exit_mmap(), which happens
*before* pgd_free().  If an arch allocates pte/pmd pages in pgd_alloc() and
frees them in pgd_free(), the counters are still offset at the time of the
checks.
      
We tried to work around this by offsetting the expected counter value
according to FIRST_USER_ADDRESS for both nr_pte and nr_pmd in exit_mmap().
But it doesn't work in some cases:
      
1. ARM with LPAE enabled also has a non-zero USER_PGTABLES_CEILING, but
   the upper addresses are occupied by huge pmd entries, so the trick of
   offsetting the expected counter value gets really ugly: we would have
   to apply it to nr_pmds, but not to nr_ptes.
      
2. Metag has a non-zero FIRST_USER_ADDRESS, but doesn't allocate pte/pmd
   page tables in pgd_alloc(); it just sets up a pgd entry which is
   allocated at boot and shared across all processes.
      
The proposal is to move the check to check_mm(), which happens *after*
pgd_free(), and to do proper accounting during pgd_alloc() and pgd_free(),
which brings the counters to zero if nothing leaked.
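In outline, the accounting and the relocated check look like this (a sketch
built on the per-mm pmd counter helpers; treat placement details as
approximate):

	/* in an arch's pgd_alloc()/pgd_free(), pairing the pre-allocation: */
	mm_inc_nr_pmds(mm);	/* after allocating the pmd in pgd_alloc() */
	mm_dec_nr_pmds(mm);	/* after freeing it in pgd_free() */

	/* in check_mm(), which runs after pgd_free(), so a clean exit
	 * leaves the counter at exactly zero: */
	if (mm_nr_pmds(mm))
		pr_alert("BUG: non-zero nr_pmds on freeing mm: %ld\n",
			 mm_nr_pmds(mm));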
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Tyler Baker <tyler.baker@linaro.org>
Tested-by: Tyler Baker <tyler.baker@linaro.org>
Tested-by: Nishanth Menon <nm@ti.com>
      Cc: Russell King <linux@arm.linux.org.uk>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: Guan Xuetao <gxt@mprc.pku.edu.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: account pmd page tables to the process · dc6c9a35
By Kirill A. Shutemov
Dave noticed that an unprivileged process can allocate a significant amount
of memory -- >500 MiB on x86_64 -- and stay unnoticed by the oom-killer and
the memory cgroup.  The trick is to allocate a lot of PMD page tables.  The
Linux kernel doesn't account PMD tables to the process, only PTE tables.
      
The test case below uses a few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  The oom_score for the process will be 0.
      
	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		/* Disable THP so the write below faults in a single 4k page
		 * (and hence a PMD page table) instead of a huge page. */
		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		for (i = 0; i < NR_PUD ; i++) {
			/* Land each mapping in its own PUD entry, forcing a
			 * fresh PMD page table per iteration. */
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';	/* fault: allocates PMD and PTE tables */
			/* Unmap and remap the first 2 MiB: the page and the
			 * PTE table are freed (VmRSS/VmPTE stay low), but the
			 * PMD table backing the range is kept. */
			munmap(addr, PMD_SIZE);
			addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
			if (addr == MAP_FAILED)
				perror("re-mmap"), exit(1);
		}
		/* One 4096-byte PMD table per iteration, printed in KiB. */
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}
      
The patch addresses the issue by accounting PMD tables to the process the
same way we account PTE tables.
      
The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range().  But there are a few corner cases:
      
 - HugeTLB can share PMD page tables.  The patch handles this by accounting
   the table to every process that shares it.
      
 - x86 PAE pre-allocates a few PMD tables on fork.
      
 - Architectures with FIRST_USER_ADDRESS > 0.  We need to adjust the sanity
   check on exit(2).
      
Accounting only happens on configurations where the PMD page table level is
present (PMD is not folded).  As with nr_ptes, we use a per-mm counter.  The
counter value is used to calculate the baseline for the badness score used
by the oom-killer.
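The per-mm counter interface is roughly the following sketch; configurations
with a folded PMD level compile to no-op stubs, so nothing is accounted
there:

	#ifdef __PAGETABLE_PMD_FOLDED
	static inline unsigned long mm_nr_pmds(struct mm_struct *mm)
	{
		return 0;
	}
	static inline void mm_inc_nr_pmds(struct mm_struct *mm) {}
	static inline void mm_dec_nr_pmds(struct mm_struct *mm) {}
	#else
	static inline unsigned long mm_nr_pmds(struct mm_struct *mm)
	{
		return atomic_long_read(&mm->nr_pmds);
	}
	static inline void mm_inc_nr_pmds(struct mm_struct *mm)
	{
		atomic_long_inc(&mm->nr_pmds);
	}
	static inline void mm_dec_nr_pmds(struct mm_struct *mm)
	{
		atomic_long_dec(&mm->nr_pmds);
	}
	#endif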
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• arm: define __PAGETABLE_PMD_FOLDED for !LPAE · 8aa76875
By Kirill A. Shutemov
ARM uses a custom implementation of PMD folding in the 2-level page table
case.  Generic code expects __PAGETABLE_PMD_FOLDED to be defined if PMD is
folded, but ARM doesn't do this.  Let's fix it.
      
Defining __PAGETABLE_PMD_FOLDED drops out the unused __pmd_alloc().  It
also fixes problems with the recently introduced pmd accounting on ARM
without LPAE.
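The fix itself is essentially a one-line define in the 2-level headers,
along these lines (location approximate):

	/* arch/arm/include/asm/pgtable-2level.h */
	#define __PAGETABLE_PMD_FOLDED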
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Nishanth Menon <nm@ti.com>
Reported-by: Simon Horman <horms@verge.net.au>
Tested-by: Simon Horman <horms+renesas@verge.net.au>
Tested-by: Fabio Estevam <festevam@gmail.com>
Tested-by: Felipe Balbi <balbi@ti.com>
Tested-by: Nishanth Menon <nm@ti.com>
Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
Tested-by: Krzysztof Kozlowski <k.kozlowski@samsung.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: make FIRST_USER_ADDRESS unsigned long on all archs · d016bf7e
By Kirill A. Shutemov
      LKP has triggered a compiler warning after my recent patch "mm: account
      pmd page tables to the process":
      
          mm/mmap.c: In function 'exit_mmap':
       >> mm/mmap.c:2857:2: warning: right shift count >= width of type [enabled by default]
      
      The code:
      
       > 2857                WARN_ON(mm_nr_pmds(mm) >
         2858                                round_up(FIRST_USER_ADDRESS, PUD_SIZE) >> PUD_SHIFT);
      
Here, on tile, FIRST_USER_ADDRESS is defined as 0, a plain int, so the
round_up() expression is an int as well, and it is right-shifted by
PUD_SHIFT, a count that is >= the width of int on this configuration --
hence the warning.
      
I think the best way to fix it is to define FIRST_USER_ADDRESS as unsigned
long, on every arch for consistency.
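Per arch, the change is a one-liner of this shape (illustrative; the
constant differs between architectures):

	/* before: an int constant, so the round_up()/shift expression is int */
	#define FIRST_USER_ADDRESS	0
	/* after: unsigned long, making the shift by PUD_SHIFT well-defined */
	#define FIRST_USER_ADDRESS	0UL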
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Wu Fengguang <fengguang.wu@intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• microblaze: define __PAGETABLE_PMD_FOLDED · 3ae3ad4e
By Kirill A. Shutemov
Microblaze uses a custom implementation of PMD folding, but doesn't define
__PAGETABLE_PMD_FOLDED, which generic code expects to see.  Let's fix it.
      
Defining __PAGETABLE_PMD_FOLDED drops out the unused __pmd_alloc().  It
also fixes problems with the recently introduced pmd accounting.
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• sparc32: fix broken set_pte() · 4ecf8860
By Kirill A. Shutemov
32-bit sparc uses the swap instruction to implement set_pte(), emitted via
GCC inline assembly.  But the asm misses the "memory" clobber that would
tell the compiler the pte value is updated in memory.
      
As a result, GCC doesn't know that it cannot postpone a pte pointer
dereference issued before set_pte() until after set_pte() has run.
      
This leads to real-world bugs [1].  In that situation we have the code:
      
      	ptent = ptep_modify_prot_start(mm, addr, pte);
      	ptent = pte_modify(ptent, newprot);
      	...
      	ptep_modify_prot_commit(mm, addr, pte, ptent);
      
ptep_modify_prot_start() in the sparc case is just a dereference of 'pte'
plus pte_clear().  pte_clear() calls the broken set_pte().  GCC thinks it
is valid to dereference 'pte' again in pte_modify() and gets the cleared
pte.  ptep_modify_prot_commit() then puts 'ptent' with pfn==0 back into the
page table, which eventually leads to the crash.
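The fix is to add the missing clobber to the swap-based helper, roughly as
in this sketch (the sparc32 helper is named srmmu_swap; details
approximate):

	static inline unsigned long srmmu_swap(unsigned long *addr,
					       unsigned long value)
	{
		/* "memory" tells GCC that *addr changes, so it cannot keep
		 * a pre-set_pte() load of the pte cached across this point. */
		asm volatile("swap [%2], %0"
			     : "=&r" (value)
			     : "0" (value), "r" (addr)
			     : "memory");
		return value;
	}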
      
[1] http://lkml.kernel.org/r/54C06B19.8060305@roeck-us.net
Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Miller <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/hugetlb: pmd_huge() returns true for non-present hugepage · cbef8478
By Naoya Horiguchi
Migrating hugepages and hwpoisoned hugepages are considered non-present
hugepages; they are referenced via migration entries and hwpoison entries
in their page table slots.
      
This behavior causes a race condition, because pmd_huge() cannot tell
non-huge pages apart from migrating/hwpoisoned hugepages.
follow_page_mask() is one example where the kernel would call
follow_page_pte() for such a hugepage, while that function is supposed to
handle only normal pages.
      
To avoid this, this patch makes pmd_huge() return true when the pmd entry
is non-none *and* pmd_present() is false.  We don't have to worry about
mixing up a non-present pmd entry with a normal pmd (pointing to a leaf
level pte page), because pmd_present() is true for a normal pmd.
      
The same race condition could happen in the (x86-specific) gup_pmd_range(),
where this patch simply adds a pmd_present() check instead of pmd_huge(),
because gup_pmd_range() is a fast path.  If we see a non-present hugepage
there, we go into gup_huge_pmd(), return 0 at the flag mask check, and
finally fall back to the slow path.
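On x86 the resulting predicate looks roughly like this sketch of the
described logic (not necessarily the verbatim diff):

	/* True for a present huge pmd, and for a non-none, non-present
	 * entry (migration/hwpoison); false for a normal pmd, where the
	 * present bit is set and PSE is clear. */
	int pmd_huge(pmd_t pmd)
	{
		return !pmd_none(pmd) &&
			(pmd_val(pmd) & (_PAGE_PRESENT|_PAGE_PSE)) != _PAGE_PRESENT;
	}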
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
By Naoya Horiguchi
Currently we have many duplicate definitions around follow_huge_addr(),
follow_huge_pmd(), and follow_huge_pud(), so this patch tries to remove
them.  The basic idea is to put the default implementation for these
functions in mm/hugetlb.c as weak symbols (regardless of
CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement arch-specific code only
when the arch needs it.
      
For follow_huge_addr(), only powerpc and ia64 have their own
implementation; in all other architectures this function just returns
ERR_PTR(-EINVAL).  So this patch makes returning ERR_PTR(-EINVAL) the
default.
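A sketch of the weak default in mm/hugetlb.c; an arch that needs different
behavior simply provides a strong definition that overrides it:

	struct page * __weak
	follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
	{
		return ERR_PTR(-EINVAL);
	}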
      
As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() always returns 0 in
your architecture (as in ia64 or sparc), it is never called (the callsite
is optimized away) no matter how it is implemented.  So such architectures
don't need an arch-specific implementation.
      
In some architectures (like mips, s390 and tile), the current
arch-specific follow_huge_(pmd|pud)() is effectively identical to the
common code, so this patch lets those architectures use the common code.
      
One exception is metag, where pmd_huge() can return non-zero but
follow_huge_pmd() is expected to always return NULL.  This means we need an
arch-specific implementation that returns NULL.  This behavior looks
strange to me (because a non-zero pmd_huge() implies that the architecture
supports PMD-based hugepages, so follow_huge_pmd() can/should return some
relevant value), but that's beyond this cleanup patch, so let's keep it.
      
Justification of non-trivial changes:
- in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE first, and this
  patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
  is true whenever follow_huge_pmd() can be called (note that pmd_huge() has
  the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
- in s390 and mips, HPAGE_MASK was used instead of the PMD_MASK used by the
  common code (see the sketch after this list). This patch forces these archs
  to use PMD_MASK, but that's OK because the two masks are identical on both
  archs.
  In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
  In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
  PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
  PTE_ORDER is always 0, so these are identical.
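For reference, a sketch of the common code these justifications compare
against, using PMD_MASK rather than HPAGE_MASK (approximate; the locking
around it was reworked by later patches):

	struct page * __weak
	follow_huge_pmd(struct mm_struct *mm, unsigned long address,
			pmd_t *pmd, int write)
	{
		struct page *page;

		page = pte_page(*(pte_t *)pmd);
		if (page)
			page += ((address & ~PMD_MASK) >> PAGE_SHIFT);
		return page;
	}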
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Acked-by: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
3. 11 Feb 2015, 19 commits