1. 12 Feb 2015, 16 commits
    • mm, page_alloc: reduce number of alloc_pages* functions' parameters · a9263751
      Vlastimil Babka committed
      Introduce struct alloc_context to accumulate the numerous parameters
      passed between the alloc_pages* family of functions and
      get_page_from_freelist().  This excludes gfp_flags and alloc_flags, which
      mutate too much along the way, and allocation order, which is conceptually
      different.
      
      The result is shorter function signatures, as well as overall code size and
      stack usage reductions.
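
      As a rough illustration (signatures approximated, not the exact kernel
      prototypes), the mostly-constant parameters collapse into one pointer to
      the struct alloc_context quoted in full in the next entry:

        /* before: every mostly-constant parameter passed individually */
        static struct page *
        get_page_from_freelist(gfp_t gfp_mask, nodemask_t *nodemask,
                        unsigned int order, struct zonelist *zonelist,
                        int high_zoneidx, int alloc_flags,
                        struct zone *preferred_zone, int classzone_idx,
                        int migratetype);

        /* after: gfp_mask, order and alloc_flags still mutate, so they stay;
         * everything else rides in *ac */
        static struct page *
        get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                        int alloc_flags, const struct alloc_context *ac);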
      
      bloat-o-meter:
      
      add/remove: 0/0 grow/shrink: 1/2 up/down: 127/-310 (-183)
      function                                     old     new   delta
      get_page_from_freelist                      2525    2652    +127
      __alloc_pages_direct_compact                 329     283     -46
      __alloc_pages_nodemask                      2564    2300    -264
      
      checkstack.pl:
      
      function                            old    new
      __alloc_pages_nodemask              248    200
      get_page_from_freelist              168    184
      __alloc_pages_direct_compact         40     24
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9263751
    • mm: set page->pfmemalloc in prep_new_page() · 75379191
      Vlastimil Babka committed
      The possibility of replacing the numerous parameters of alloc_pages*
      functions with a single structure was discussed when Minchan proposed to
      expand the x86 kernel stack [1].  This series implements the change, along
      with a few more cleanups/micro-optimizations.
      
      The series is based on next-20150108 and I used gcc 4.8.3 20140627 on
      openSUSE 13.2 for compiling.  The config includes NUMA and COMPACTION.
      
      The core change is the introduction of a new struct alloc_context, which looks
      like this:
      
      struct alloc_context {
              struct zonelist *zonelist;
              nodemask_t *nodemask;
              struct zone *preferred_zone;
              int classzone_idx;
              int migratetype;
              enum zone_type high_zoneidx;
      };
      
      The contents are mostly constant, except that __alloc_pages_slowpath()
      changes preferred_zone, classzone_idx and potentially zonelist.  But
      that's not a problem in case control returns to retry_cpuset: in
      __alloc_pages_nodemask(), those will be reset to initial values again
      (although it's a bit subtle).  On the other hand, gfp_flags and alloc_flags
      mutate so much that it doesn't make sense to put them into alloc_context.
      Still, the result is one parameter instead of up to 7.  This is all in
      Patch 2.
      
      Patch 3 is a step to expand alloc_context usage out of page_alloc.c
      itself.  The function try_to_compact_pages() can also benefit greatly from
      the parameter reduction, but it means the struct definition has to be
      moved to a shared header.
      
      Patch 1 should IMHO be included even if the rest is deemed not useful
      enough.  It improves maintainability and also has some code/stack
      reduction.  Patch 4 is OTOH a tiny optimization.
      
      Overall bloat-o-meter results:
      
      add/remove: 0/0 grow/shrink: 0/4 up/down: 0/-460 (-460)
      function                                     old     new   delta
      nr_free_zone_pages                           129     115     -14
      __alloc_pages_direct_compact                 329     256     -73
      get_page_from_freelist                      2670    2576     -94
      __alloc_pages_nodemask                      2564    2285    -279
      try_to_compact_pages                         582     579      -3
      
      Overall stack sizes per ./scripts/checkstack.pl:
      
                                old   new delta
      get_page_from_freelist:   184   184     0
      __alloc_pages_nodemask    248   200   -48
      __alloc_pages_direct_c     40     -   -40
      try_to_compact_pages       72    72     0
                                            -88
      
      [1] http://marc.info/?l=linux-mm&m=140142462528257&w=2
      
      This patch (of 4):
      
      prep_new_page() sets almost everything in the struct page of the page
      being allocated, except page->pfmemalloc.  This is not obvious and has at
      least once led to a bug where setting page->pfmemalloc correctly was
      forgotten; see commit 8fb74b9f ("mm: compaction: partially revert
      capture of suitable high-order page").
      
      This patch moves the pfmemalloc setting to prep_new_page(), which means it
      needs to gain an alloc_flags parameter.  The call to prep_new_page() is
      moved from buffered_rmqueue() to get_page_from_freelist(), which also
      leads to simpler code.  An obsolete comment for buffered_rmqueue() is
      replaced.
      
      In addition to better maintainability there is a small reduction of code
      and stack usage for get_page_from_freelist(), which inlines the other
      functions involved.
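
      A minimal sketch of the idea (not the verbatim kernel hunk), assuming
      ALLOC_NO_WATERMARKS is what marks an allocation made from reserves:

        static int prep_new_page(struct page *page, unsigned int order,
                                 gfp_t gfp_flags, int alloc_flags)
        {
                /* existing checks and initialization elided */

                /*
                 * page->pfmemalloc is now derived here: the page came from
                 * memory reserves iff watermarks were ignored.
                 */
                page->pfmemalloc = !!(alloc_flags & ALLOC_NO_WATERMARKS);

                return 0;
        }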
      
      add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-145 (-145)
      function                                     old     new   delta
      get_page_from_freelist                      2670    2525    -145
      
      Stack usage is reduced from 184 to 168 bytes.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      75379191
    • mm/hugetlb: add migration entry check in __unmap_hugepage_range · 9fbc1f63
      Naoya Horiguchi committed
      If __unmap_hugepage_range() tries to unmap an address range over which
      hugepage migration is in progress, we get the wrong page because pte_page()
      doesn't work for migration entries.  This patch simply clears the pte for
      migration entries, as we already do for hwpoison entries.
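
      A sketch of the added check inside the unmap loop (locking and the
      surrounding loop are elided; helper names follow mm/hugetlb.c usage):

        pte = huge_ptep_get(ptep);
        if (huge_pte_none(pte))
                continue;

        /*
         * Migration and hwpoison entries are non-present, so pte_page()
         * must not be used on them; just clear the slot and move on.
         */
        if (unlikely(!pte_present(pte))) {
                huge_pte_clear(mm, address, ptep);
                continue;
        }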
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9fbc1f63
    • mm/hugetlb: add migration/hwpoisoned entry check in hugetlb_change_protection · a8bda28d
      Naoya Horiguchi committed
      There is a race condition between hugepage migration and
      change_protection(), where hugetlb_change_protection() doesn't care about
      migration entries and wrongly overwrites them.  That causes unexpected
      results such as a kernel crash.  HWPoison entries can also cause the same
      problem.
      
      This patch adds is_hugetlb_entry_(migration|hwpoisoned) checks to this
      function so that it takes the proper action for such entries.
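
      A sketch of the checks added to the protection-change loop (close to,
      but not exactly, the upstream hunk; locking elided):

        pte = huge_ptep_get(ptep);
        if (unlikely(is_hugetlb_entry_hwpoisoned(pte)))
                continue;       /* leave hwpoison entries alone */

        if (unlikely(is_hugetlb_entry_migration(pte))) {
                swp_entry_t entry = pte_to_swp_entry(pte);

                /* write-protect the migration entry instead of
                 * overwriting it with a bogus present pte */
                if (is_write_migration_entry(entry)) {
                        make_migration_entry_read(&entry);
                        set_huge_pte_at(mm, address, ptep,
                                        swp_entry_to_pte(entry));
                }
                continue;
        }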
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a8bda28d
    • mm/hugetlb: fix getting refcount 0 page in hugetlb_fault() · 0f792cf9
      Naoya Horiguchi committed
      When running the test that triggers the race described in the previous
      patch, we can hit the BUG "get_page() on refcount 0 page" in hugetlb_fault().
      
      This race happens when the pte turns into a migration entry just after the
      first is_hugetlb_entry_migration() check in hugetlb_fault() has returned
      false.  To fix this, we need to check pte_present() again after
      huge_ptep_get().
      
      This patch also reorders taking the ptl and calling pte_page(), because
      pte_page() should be done under the ptl.  Due to this reordering, we need
      to use trylock_page() in the page != pagecache_page case to respect the
      locking order.
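
      A sketch of the re-check described above (the bail-out label is a
      placeholder, not the exact upstream code):

        entry = huge_ptep_get(ptep);
        /*
         * The pte may have turned into a migration or hwpoison entry after
         * the earlier is_hugetlb_entry_migration() check returned false,
         * so re-check before pte_page()/get_page() ever see it.
         */
        if (unlikely(!pte_present(entry)))
                goto out_mutex;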
      
      Fixes: 66aebce7 ("hugetlb: fix race condition in hugetlb_fault()")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.2+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0f792cf9
    • mm/hugetlb: take page table lock in follow_huge_pmd() · e66f17ff
      Naoya Horiguchi committed
      We have a race condition between move_pages() and freeing hugepages, where
      move_pages() calls follow_page(FOLL_GET) for hugepages internally and
      tries to get their refcount without preventing concurrent freeing.  This
      race crashes the kernel, so this patch fixes it by moving the FOLL_GET code
      for hugepages into follow_huge_pmd() and taking the page table lock there.
      
      This patch intentionally removes the page==NULL check after pte_page().
      This is justified because pte_page() never returns NULL on any
      architecture or configuration.
      
      This patch changes the behavior of follow_huge_pmd() for tail pages, so
      that tail pages can now be pinned and returned.  The caller must therefore
      be changed to handle returned tail pages properly.
      
      We could add similar locking to follow_huge_(addr|pud) for consistency,
      but it's not necessary because these functions currently don't support the
      FOLL_GET flag, so let's leave that for future development.
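
      A sketch of the locked lookup (simplified; the hugepage re-check and
      waiting on migration entries are elided):

        struct page *follow_huge_pmd(struct mm_struct *mm, unsigned long address,
                                     pmd_t *pmd, int flags)
        {
                struct page *page = NULL;
                spinlock_t *ptl = pmd_lockptr(mm, pmd);

                spin_lock(ptl);
                /* the hugepage cannot be freed while ptl is held */
                if (pmd_present(*pmd)) {
                        page = pmd_page(*pmd) +
                               ((address & ~PMD_MASK) >> PAGE_SHIFT);
                        if (flags & FOLL_GET)
                                get_page(page); /* may be a tail page now */
                }
                spin_unlock(ptl);
                return page;
        }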
      
      Here is the reproducer:
      
        $ cat movepages.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <err.h>
        #include <numa.h>       /* numa_move_pages(); build with -lnuma */
        #include <numaif.h>     /* MPOL_MF_MOVE_ALL */
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
        #define PS              0x1000
      
        int main(int argc, char *argv[]) {
                int i;
                int nr_hp = strtol(argv[1], NULL, 0);
                int nr_p  = nr_hp * HPS / PS;
                int ret;
                void **addrs;
                int *status;
                int *nodes;
                pid_t pid;
      
                pid = strtol(argv[2], NULL, 0);
                addrs  = malloc(sizeof(void *) * nr_p);
                status = malloc(sizeof(int) * nr_p);
                nodes  = malloc(sizeof(int) * nr_p);
      
                while (1) {
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 1;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err(1, "move_pages");
      
                        for (i = 0; i < nr_p; i++) {
                                addrs[i] = (void *)ADDR_INPUT + i * PS;
                                nodes[i] = 0;
                                status[i] = 0;
                        }
                        ret = numa_move_pages(pid, nr_p, addrs, nodes, status,
                                              MPOL_MF_MOVE_ALL);
                        if (ret == -1)
                                err(1, "move_pages");
                }
                return 0;
        }
      
        $ cat hugepage.c
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mman.h>
      
        #define ADDR_INPUT      0x700000000000UL
        #define HPS             0x200000
      
        int main(int argc, char *argv[]) {
                int nr_hp = strtol(argv[1], NULL, 0);
                char *p;
      
                while (1) {
                        p = mmap((void *)ADDR_INPUT, nr_hp * HPS, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
                        if (p != (void *)ADDR_INPUT) {
                                perror("mmap");
                                break;
                        }
                        memset(p, 0, nr_hp * HPS);
                        munmap(p, nr_hp * HPS);
                }
        }
      
        $ sysctl vm.nr_hugepages=40
        $ ./hugepage 10 &
        $ ./movepages 10 $(pgrep -f hugepage)
      
      Fixes: e632a938 ("mm: migrate: add hugepage migration code to move_pages()")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reported-by: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e66f17ff
    • mm/hugetlb: pmd_huge() returns true for non-present hugepage · cbef8478
      Naoya Horiguchi committed
      Migrating hugepages and hwpoisoned hugepages are considered as non-present
      hugepages, and they are referenced via migration entries and hwpoison
      entries in their page table slots.
      
      This behavior causes a race condition, because pmd_huge() doesn't
      distinguish non-huge pages from migrating/hwpoisoned hugepages.
      follow_page_mask() is one example, where the kernel would call
      follow_page_pte() for such a hugepage while that function is supposed to
      handle only normal pages.
      
      To avoid this, this patch makes pmd_huge() return true when the pmd is not
      none *and* pmd_present() is false.  We don't have to worry about mixing up
      a non-present pmd entry with a normal pmd (one pointing to a leaf-level
      pte page), because pmd_present() is true for a normal pmd.
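
      For reference, an x86-style sketch of the new predicate (assuming the
      usual _PAGE_PRESENT/_PAGE_PSE encoding; other architectures differ):

        int pmd_huge(pmd_t pmd)
        {
                /*
                 * A pmd that is not none but lacks _PAGE_PRESENT is a
                 * non-present hugepage entry (migration or hwpoison),
                 * so report it as huge as well.
                 */
                return !pmd_none(pmd) &&
                       (pmd_val(pmd) & (_PAGE_PRESENT | _PAGE_PSE)) !=
                                _PAGE_PRESENT;
        }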
      
      The same race condition could happen in (x86-specific) gup_pmd_range(),
      where this patch simply adds a pmd_present() check instead of using
      pmd_huge(), because gup_pmd_range() is a fast path.  If we hit a
      non-present hugepage there, we go into gup_huge_pmd(), return 0 at the
      flag mask check, and finally fall back to the slow path.
      
      Fixes: 290408d4 ("hugetlb: hugepage migration core")
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Cc: <stable@vger.kernel.org>	[2.6.36+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cbef8478
    • mm/hugetlb: reduce arch dependent code around follow_huge_* · 61f77eda
      Naoya Horiguchi committed
      Currently we have many duplicates in definitions around
      follow_huge_addr(), follow_huge_pmd(), and follow_huge_pud(), so this
      patch tries to remove them.  The basic idea is to put the default
      implementation for these functions in mm/hugetlb.c as weak symbols
      (regardless of CONFIG_ARCH_WANT_GENERAL_HUGETLB), and to implement
      arch-specific code only when the arch needs it.
      
      For follow_huge_addr(), only powerpc and ia64 have their own
      implementation, and in all other architectures this function just returns
      ERR_PTR(-EINVAL).  So this patch makes returning ERR_PTR(-EINVAL) the
      default.
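
      A sketch of the weak default that lands in mm/hugetlb.c; powerpc and
      ia64 keep their own strong definitions:

        struct page * __weak
        follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
        {
                /* no arch-specific huge-page-by-address lookup by default */
                return ERR_PTR(-EINVAL);
        }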
      
      As for follow_huge_(pmd|pud)(), if (pmd|pud)_huge() is implemented to
      always return 0 in your architecture (like in ia64 or sparc), it's never
      called (the callsite is optimized away) no matter how it is implemented.
      So such architectures don't need an arch-specific implementation.
      
      In some architectures (like mips, s390 and tile), the current
      arch-specific follow_huge_(pmd|pud)() implementations are effectively
      identical to the common code, so this patch lets these architectures use
      the common code.
      
      One exception is metag, where pmd_huge() could return non-zero but it
      expects follow_huge_pmd() to always return NULL.  This means that we need
      an arch-specific implementation which returns NULL.  This behavior looks
      strange to me (because non-zero pmd_huge() implies that the architecture
      supports PMD-based hugepages, so follow_huge_pmd() can/should return some
      relevant value), but that's beyond this cleanup patch, so let's keep it.
      
      Justification of non-trivial changes:
      - in s390, follow_huge_pmd() checks !MACHINE_HAS_HPAGE at first, and this
        patch removes the check. This is OK because we can assume MACHINE_HAS_HPAGE
        is true when follow_huge_pmd() can be called (note that pmd_huge() has
        the same check and always returns 0 for !MACHINE_HAS_HPAGE.)
      - in s390 and mips, we use HPAGE_MASK instead of PMD_MASK as done in common
        code. This patch forces these archs to use PMD_MASK, but that's OK because
        the two masks are identical on both archs.
        In s390, both HPAGE_SHIFT and PMD_SHIFT are 20.
        In mips, HPAGE_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT - 3) and
        PMD_SHIFT is defined as (PAGE_SHIFT + PAGE_SHIFT + PTE_ORDER - 3), but
        PTE_ORDER is always 0, so these are identical.
      Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: James Hogan <james.hogan@imgtec.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Luiz Capitulino <lcapitulino@redhat.com>
      Cc: Nishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
      Cc: Steve Capper <steve.capper@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61f77eda
    • mm, vmscan: wake up all pfmemalloc-throttled processes at once · cfc51155
      Vlastimil Babka committed
      Kswapd in balance_pgdat() currently uses wake_up() on processes waiting
      in throttle_direct_reclaim(), which only wakes up a single process.  This
      might leave processes waiting for longer than necessary, until the check
      is reached in the next loop iteration.  Processes might also be left
      waiting if the zone was fully balanced in a single iteration.  Note that
      the comment in balance_pgdat() also says "Wake them", so waking up a
      single process does not seem intentional.
      
      Thus, replace wake_up() with wake_up_all().
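
      The change itself is a one-liner; a sketch of the spot in balance_pgdat()
      (the surrounding condition is approximated from the description above):

        /* throttled direct reclaimers can all make progress now,
         * so wake every waiter instead of a single one */
        if (waitqueue_active(&pgdat->pfmemalloc_wait) &&
            pfmemalloc_watermark_ok(pgdat))
                wake_up_all(&pgdat->pfmemalloc_wait);   /* was wake_up() */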
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cfc51155
    • kmemcheck: move hook into __alloc_pages_nodemask() for the page allocator · 23f086f9
      Xishi Qiu committed
      Now kmemcheck_pagealloc_alloc() is only called by __alloc_pages_slowpath().
      __alloc_pages_nodemask()
      	__alloc_pages_slowpath()
      		kmemcheck_pagealloc_alloc()
      
      And the page will not be tracked by kmemcheck in the following path.
      __alloc_pages_nodemask()
      	get_page_from_freelist()
      
      So move kmemcheck_pagealloc_alloc() into __alloc_pages_nodemask(),
      like this:
      __alloc_pages_nodemask()
      	...
      	get_page_from_freelist()
      	if (!page)
      		__alloc_pages_slowpath()
      	kmemcheck_pagealloc_alloc()
      	...
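
      In C, the new placement looks roughly like this (names such as alloc_mask
      and ac follow the struct alloc_context form from the entries above; only
      the hook's position matters here):

        page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
        if (unlikely(!page))
                page = __alloc_pages_slowpath(alloc_mask, order, &ac);

        /* now covers pages from both the fastpath and the slowpath */
        if (kmemcheck_enabled && page)
                kmemcheck_pagealloc_alloc(page, order, gfp_mask);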
      Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vegard Nossum <vegard.nossum@oracle.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Li Zefan <lizefan@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      23f086f9
    • mm/page_alloc.c:__alloc_pages_nodemask(): don't alter arg gfp_mask · 91fbdc0f
      Andrew Morton committed
      __alloc_pages_nodemask() strips __GFP_IO when retrying the page
      allocation.  But it does this by altering the function-wide variable
      gfp_mask.  This will cause subsequent allocation attempts to inadvertently
      use the modified gfp_mask.
      
      Also, pass the correct mask (the mask we actually used) into
      trace_mm_page_alloc().
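
      A sketch of the fix (using the struct alloc_context form from the entries
      above; the exact kernel argument lists differ):

        gfp_t alloc_mask = gfp_mask;    /* the mask actually used below */

        page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
        if (unlikely(!page)) {
                /* may strip __GFP_IO for this retry only; the caller's
                 * gfp_mask is left untouched */
                alloc_mask = memalloc_noio_flags(gfp_mask);
                page = __alloc_pages_slowpath(alloc_mask, order, &ac);
        }

        /* report the mask that was actually used */
        trace_mm_page_alloc(page, order, alloc_mask, ac.migratetype);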
      
      Cc: Ming Lei <ming.lei@canonical.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      91fbdc0f
    • mm: memcontrol: track move_lock state internally · 6de22619
      Johannes Weiner committed
      The complexity of memcg page stat synchronization is currently leaking
      into the callsites, forcing them to keep track of the move_lock state and
      the IRQ flags.  Simplify the API by tracking it in the memcg.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6de22619
    • swap: remove unused mem_cgroup_uncharge_swapcache declaration · 93aa7d95
      Vladimir Davydov committed
      The body of this function was removed by commit 0a31bc97 ("mm:
      memcontrol: rewrite uncharge API").
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93aa7d95
    • oom: make sure that TIF_MEMDIE is set under task_lock · 83363b91
      Michal Hocko committed
      The OOM killer tries to exclude tasks which do not have an mm_struct
      associated, because killing such a task wouldn't help much.  The OOM
      victim gets TIF_MEMDIE set to disable the OOM killer while it releases
      its memory, and then re-enables the OOM killer by dropping the flag.
      
      oom_kill_process is currently prone to a race condition when the OOM
      victim is already exiting and TIF_MEMDIE is set after the task releases
      its address space.  This might theoretically lead to an OOM livelock if
      the OOM victim later blocks on an allocation while exiting, because the
      OOM killer wouldn't kill any other process and the exiting one won't be
      able to exit.  The situation is highly unlikely, because the OOM victim is
      expected to release some memory, which should help to sort out the OOM
      situation.
      
      Fix this by checking task->mm and setting the TIF_MEMDIE flag under
      task_lock, which serializes the OOM killer against exit_mm(), which sets
      task->mm to NULL.  Setting the flag for current is not necessary, because
      there the check and set is not racy.
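
      A sketch of the ordering fix in oom_kill_process() (victim handling
      simplified):

        task_lock(victim);
        if (victim->mm)                 /* still has an address space */
                set_tsk_thread_flag(victim, TIF_MEMDIE);
        task_unlock(victim);            /* exit_mm() clears ->mm under the
                                         * same task_lock, so no race */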
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83363b91
    • oom: don't count on mm-less current process · d7a94e7e
      Tetsuo Handa committed
      out_of_memory() doesn't trigger the OOM killer if the current task is
      already exiting or it has fatal signals pending, and gives the task
      access to memory reserves instead.  However, doing so is wrong if
      out_of_memory() is called by an allocation (e.g. from exit_task_work())
      after the current task has already released its memory and cleared
      TIF_MEMDIE at exit_mm().  If we set TIF_MEMDIE again on a current task
      that is already past exit_mm(), the OOM killer will be blocked by that
      task sitting in the final schedule() waiting for its parent to reap it.
      This will trigger an OOM livelock if the parent is unable to reap it
      because it is itself doing an allocation and waiting for the OOM killer to
      kill it.
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7a94e7e
    • mm: add KPF_ZERO_PAGE flag for /proc/kpageflags · 56873f43
      Wang, Yalin committed
      Add KPF_ZERO_PAGE flag for zero_page, so that userspace processes can
      detect zero_page in /proc/kpageflags, and then do memory analysis more
      accurately.
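
      A minimal userspace sketch of consuming the new flag; the bit number (24)
      is an assumption taken from kernel-page-flags.h of this era:

        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>

        #define KPF_ZERO_PAGE 24        /* assumed bit index */

        int main(int argc, char *argv[])
        {
                uint64_t flags, pfn;
                FILE *f;

                if (argc < 2)
                        return 1;
                pfn = strtoull(argv[1], NULL, 0);

                /* /proc/kpageflags holds one 64-bit flag word per PFN */
                f = fopen("/proc/kpageflags", "rb");
                if (!f || fseek(f, pfn * sizeof(flags), SEEK_SET) != 0 ||
                    fread(&flags, sizeof(flags), 1, f) != 1) {
                        perror("kpageflags");
                        return 1;
                }
                printf("pfn %llu is %sthe zero page\n",
                       (unsigned long long)pfn,
                       (flags >> KPF_ZERO_PAGE) & 1 ? "" : "not ");
                fclose(f);
                return 0;
        }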
      Signed-off-by: Yalin Wang <yalin.wang@sonymobile.com>
      Acked-by: Kirill A. Shutemov <kirill@shutemov.name>
      Cc: Konstantin Khlebnikov <koct9i@gmail.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      56873f43
  2. 11 Feb 2015, 18 commits
  3. 06 Feb 2015, 3 commits
  4. 30 Jan 2015, 2 commits
    • vm: make stack guard page errors return VM_FAULT_SIGSEGV rather than SIGBUS · 9c145c56
      Linus Torvalds committed
      The stack guard page error case has long incorrectly caused a SIGBUS
      rather than a SIGSEGV, but nobody actually noticed until commit
      fee7e49d ("mm: propagate error from stack expansion even for guard
      page") because that error case was never actually triggered in any
      normal situations.
      
      Now that we actually report the error, people noticed the wrong signal
      that resulted.  So far, only the test suite of libsigsegv seems to have
      actually cared, but there are real applications that use libsigsegv, so
      let's not wait for any of those to break.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9c145c56
    • vm: add VM_FAULT_SIGSEGV handling support · 33692f27
      Linus Torvalds committed
      The core VM already knows about VM_FAULT_SIGBUS, but cannot return a
      "you should SIGSEGV" error, because the SIGSEGV case was generally
      handled by the caller - usually the architecture fault handler.
      
      That results in lots of duplication - all the architecture fault
      handlers end up doing very similar "look up vma, check permissions, do
      retries etc" - but it generally works.  However, there are cases where
      the VM actually wants to SIGSEGV, and applications _expect_ SIGSEGV.
      
      In particular, when accessing the stack guard page, libsigsegv expects a
      SIGSEGV.  And it usually got one, because the stack growth is handled by
      that duplicated architecture fault handler.
      
      However, when the generic VM layer started propagating the error return
      from the stack expansion in commit fee7e49d ("mm: propagate error
      from stack expansion even for guard page"), that now exposed the
      existing VM_FAULT_SIGBUS result to user space.  And user space really
      expected SIGSEGV, not SIGBUS.
      
      To fix that case, we need to add a VM_FAULT_SIGSEGV, and teach all those
      duplicate architecture fault handlers about it.  They all already have
      the code to handle SIGSEGV, so it's just a matter of tying the new return
      value to the existing code, but it's all a bit annoying.
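
      A generic sketch of the per-architecture dispatch being added (label names
      vary between the real fault handlers):

        fault = handle_mm_fault(mm, vma, address, flags);
        if (unlikely(fault & VM_FAULT_ERROR)) {
                if (fault & VM_FAULT_OOM)
                        goto out_of_memory;
                else if (fault & VM_FAULT_SIGSEGV)
                        goto bad_area;          /* new: deliver SIGSEGV */
                else if (fault & VM_FAULT_SIGBUS)
                        goto do_sigbus;
                BUG();
        }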
      
      This is the mindless minimal patch to do this.  A more extensive patch
      would be to try to gather up the mostly shared fault handling logic into
      one generic helper routine, and long-term we really should do that
      cleanup.
      
      Just from this patch, you can generally see that most architectures just
      copied (directly or indirectly) the old x86 way of doing things, but in
      the meantime that original x86 model has been improved to hold the VM
      semaphore for shorter times etc and to handle VM_FAULT_RETRY and other
      "newer" things, so it would be a good idea to bring all those
      improvements to the generic case and teach other architectures about
      them too.
      Reported-and-tested-by: Takashi Iwai <tiwai@suse.de>
      Tested-by: Jan Engelhardt <jengelh@inai.de>
      Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> # "s390 still compiles and boots"
      Cc: linux-arch@vger.kernel.org
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      33692f27
  5. 28 Jan 2015, 1 commit
    • mm: provide a find_special_page vma operation · 667a0a06
      David Vrabel committed
      The optional find_special_page VMA operation is used to look up the
      pages backing a VMA.  This is useful in cases where the normal
      mechanisms for finding the page don't work.  This is only called if
      the PTE is special.
      
      One use case is a Xen PV guest mapping foreign pages into userspace.
      
      In a Xen PV guest, the PTEs contain MFNs so get_user_pages() (for
      example) must do an MFN to PFN (M2P) lookup before it can get the
      page.  For foreign pages (those owned by another guest) the M2P lookup
      returns the PFN as seen by the foreign guest (which would be
      completely the wrong page for the local guest).
      
      This cannot be fixed by improving the M2P lookup, since one MFN may be
      mapped onto two or more pages, so getting the right page is impossible
      given just the MFN.
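
      A sketch of the hook's shape with a hypothetical driver-side
      implementation (the Xen user is more involved); it is assumed the hook
      receives the VMA plus the faulting address and returns the backing
      struct page:

        /* hypothetical driver that stashes its page array in vm_private_data */
        static struct page *my_find_special_page(struct vm_area_struct *vma,
                                                 unsigned long addr)
        {
                struct page **pages = vma->vm_private_data;

                return pages[(addr - vma->vm_start) >> PAGE_SHIFT];
        }

        static const struct vm_operations_struct my_vm_ops = {
                .find_special_page = my_find_special_page,
        };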
      Signed-off-by: David Vrabel <david.vrabel@citrix.com>
      Acked-by: Andrew Morton <akpm@linux-foundation.org>
      667a0a06