1. 07 8月, 2014 2 次提交
  2. 31 7月, 2014 1 次提交
    • D
      mm, thp: do not allow thp faults to avoid cpuset restrictions · b104a35d
      David Rientjes 提交于
      The page allocator relies on __GFP_WAIT to determine if ALLOC_CPUSET
      should be set in allocflags.  ALLOC_CPUSET controls if a page allocation
      should be restricted only to the set of allowed cpuset mems.
      
      Transparent hugepages clears __GFP_WAIT when defrag is disabled to prevent
      the fault path from using memory compaction or direct reclaim.  Thus, it
      is unfairly able to allocate outside of its cpuset mems restriction as a
      side-effect.
      
      This patch ensures that ALLOC_CPUSET is only cleared when the gfp mask is
      truly GFP_ATOMIC by verifying it is also not a thp allocation.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NAlex Thorlton <athorlton@sgi.com>
      Tested-by: NAlex Thorlton <athorlton@sgi.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hedi Berriche <hedi@sgi.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b104a35d
  3. 30 7月, 2014 1 次提交
    • R
      mm: fix page_alloc.c kernel-doc warnings · 1aab4d77
      Randy Dunlap 提交于
      Fix kernel-doc warnings and function name in mm/page_alloc.c:
      
        Warning(..//mm/page_alloc.c:6074): No description found for parameter 'pfn'
        Warning(..//mm/page_alloc.c:6074): No description found for parameter 'mask'
        Warning(..//mm/page_alloc.c:6074): Excess function parameter 'start_bitidx' description in 'get_pfnblock_flags_mask'
        Warning(..//mm/page_alloc.c:6102): No description found for parameter 'pfn'
        Warning(..//mm/page_alloc.c:6102): No description found for parameter 'mask'
        Warning(..//mm/page_alloc.c:6102): Excess function parameter 'start_bitidx' description in 'set_pfnblock_flags_mask'
      Signed-off-by: NRandy Dunlap <rdunlap@infradead.org>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1aab4d77
  4. 04 7月, 2014 1 次提交
    • M
      mm: page_alloc: fix CMA area initialisation when pageblock > MAX_ORDER · dc78327c
      Michal Nazarewicz 提交于
      With a kernel configured with ARM64_64K_PAGES && !TRANSPARENT_HUGEPAGE,
      the following is triggered at early boot:
      
        SMP: Total of 8 processors activated.
        devtmpfs: initialized
        Unable to handle kernel NULL pointer dereference at virtual address 00000008
        pgd = fffffe0000050000
        [00000008] *pgd=00000043fba00003, *pmd=00000043fba00003, *pte=00e0000078010407
        Internal error: Oops: 96000006 [#1] SMP
        Modules linked in:
        CPU: 0 PID: 1 Comm: swapper/0 Not tainted 3.15.0-rc864k+ #44
        task: fffffe03bc040000 ti: fffffe03bc080000 task.ti: fffffe03bc080000
        PC is at __list_add+0x10/0xd4
        LR is at free_one_page+0x270/0x638
        ...
        Call trace:
          __list_add+0x10/0xd4
          free_one_page+0x26c/0x638
          __free_pages_ok.part.52+0x84/0xbc
          __free_pages+0x74/0xbc
          init_cma_reserved_pageblock+0xe8/0x104
          cma_init_reserved_areas+0x190/0x1e4
          do_one_initcall+0xc4/0x154
          kernel_init_freeable+0x204/0x2a8
          kernel_init+0xc/0xd4
      
      This happens because init_cma_reserved_pageblock() calls
      __free_one_page() with pageblock_order as page order but it is bigger
      than MAX_ORDER.  This in turn causes accesses past zone->free_list[].
      
      Fix the problem by changing init_cma_reserved_pageblock() such that it
      splits pageblock into individual MAX_ORDER pages if pageblock is bigger
      than a MAX_ORDER page.
      
      In cases where !CONFIG_HUGETLB_PAGE_SIZE_VARIABLE, which is all
      architectures expect for ia64, powerpc and tile at the moment, the
      “pageblock_order > MAX_ORDER” condition will be optimised out since both
      sides of the operator are constants.  In cases where pageblock size is
      variable, the performance degradation should not be significant anyway
      since init_cma_reserved_pageblock() is called only at boot time at most
      MAX_CMA_AREAS times which by default is eight.
      Signed-off-by: NMichal Nazarewicz <mina86@mina86.com>
      Reported-by: NMark Salter <msalter@redhat.com>
      Tested-by: NMark Salter <msalter@redhat.com>
      Tested-by: NChristopher Covington <cov@codeaurora.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: <stable@vger.kernel.org>	[3.5+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      dc78327c
  5. 24 6月, 2014 1 次提交
    • D
      mm, pcp: allow restoring percpu_pagelist_fraction default · 7cd2b0a3
      David Rientjes 提交于
      Oleg reports a division by zero error on zero-length write() to the
      percpu_pagelist_fraction sysctl:
      
          divide error: 0000 [#1] SMP DEBUG_PAGEALLOC
          CPU: 1 PID: 9142 Comm: badarea_io Not tainted 3.15.0-rc2-vm-nfs+ #19
          Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
          task: ffff8800d5aeb6e0 ti: ffff8800d87a2000 task.ti: ffff8800d87a2000
          RIP: 0010: percpu_pagelist_fraction_sysctl_handler+0x84/0x120
          RSP: 0018:ffff8800d87a3e78  EFLAGS: 00010246
          RAX: 0000000000000f89 RBX: ffff88011f7fd000 RCX: 0000000000000000
          RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000010
          RBP: ffff8800d87a3e98 R08: ffffffff81d002c8 R09: ffff8800d87a3f50
          R10: 000000000000000b R11: 0000000000000246 R12: 0000000000000060
          R13: ffffffff81c3c3e0 R14: ffffffff81cfddf8 R15: ffff8801193b0800
          FS:  00007f614f1e9740(0000) GS:ffff88011f440000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
          CR2: 00007f614f1fa000 CR3: 00000000d9291000 CR4: 00000000000006e0
          Call Trace:
            proc_sys_call_handler+0xb3/0xc0
            proc_sys_write+0x14/0x20
            vfs_write+0xba/0x1e0
            SyS_write+0x46/0xb0
            tracesys+0xe1/0xe6
      
      However, if the percpu_pagelist_fraction sysctl is set by the user, it
      is also impossible to restore it to the kernel default since the user
      cannot write 0 to the sysctl.
      
      This patch allows the user to write 0 to restore the default behavior.
      It still requires a fraction equal to or larger than 8, however, as
      stated by the documentation for sanity.  If a value in the range [1, 7]
      is written, the sysctl will return EINVAL.
      
      This successfully solves the divide by zero issue at the same time.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Reported-by: NOleg Drokin <green@linuxhacker.ru>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7cd2b0a3
  6. 07 6月, 2014 1 次提交
  7. 05 6月, 2014 21 次提交
  8. 08 4月, 2014 5 次提交
    • J
      mm/page_alloc.c: change mm debug routines back to EXPORT_SYMBOL · ed12d845
      John Hubbard 提交于
      A new dump_page() routine was recently added, and marked
      EXPORT_SYMBOL_GPL.  dump_page() was also added to the VM_BUG_ON_PAGE()
      macro, and so the end result is that non-GPL code can no longer call
      get_page() and a few other routines.
      
      This only happens if the kernel was compiled with CONFIG_DEBUG_VM.
      
      Change dump_page() to be EXPORT_SYMBOL.
      
      Longer explanation:
      
      Prior to commit 309381fe ("mm: dump page when hitting a VM_BUG_ON
      using VM_BUG_ON_PAGE") , it was possible to build MIT-licensed (non-GPL)
      drivers on Fedora.  Fedora is semi-unique, in that it sets
      CONFIG_VM_DEBUG.
      
      Because Fedora sets CONFIG_VM_DEBUG, they end up pulling in dump_page(),
      via VM_BUG_ON_PAGE, via get_page().  As one of the authors of NVIDIA's
      new, open source, "UVM-Lite" kernel module, I originally choose to use
      the kernel's get_page() routine from within nvidia_uvm_page_cache.c,
      because get_page() has always seemed to be very clearly intended for use
      by non-GPL, driver code.
      
      So I'm hoping that making get_page() widely accessible again will not be
      too controversial.  We did check with Fedora first, and they responded
      (https://bugzilla.redhat.com/show_bug.cgi?id=1074710#c3) that we should
      try to get upstream changed, before asking Fedora to change.  Their
      reasoning seems beneficial to Linux: leaving CONFIG_DEBUG_VM set allows
      Fedora to help catch mm bugs.
      Signed-off-by: NJohn Hubbard <jhubbard@nvidia.com>
      Cc: Sasha Levin <sasha.levin@oracle.com>
      Cc: Josh Boyer <jwboyer@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed12d845
    • E
      memblock: use for_each_memblock() · 136199f0
      Emil Medve 提交于
      This is a small cleanup.
      Signed-off-by: NEmil Medve <Emilian.Medve@Freescale.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      136199f0
    • J
      mm: page_alloc: spill to remote nodes before waking kswapd · 3a025760
      Johannes Weiner 提交于
      On NUMA systems, a node may start thrashing cache or even swap anonymous
      pages while there are still free pages on remote nodes.
      
      This is a result of commits 81c0a2bb ("mm: page_alloc: fair zone
      allocator policy") and fff4068c ("mm: page_alloc: revert NUMA aspect
      of fair allocation policy").
      
      Before those changes, the allocator would first try all allowed zones,
      including those on remote nodes, before waking any kswapds.  But now,
      the allocator fastpath doubles as the fairness pass, which in turn can
      only consider the local node to prevent remote spilling based on
      exhausted fairness batches alone.  Remote nodes are only considered in
      the slowpath, after the kswapds are woken up.  But if remote nodes still
      have free memory, kswapd should not be woken to rebalance the local node
      or it may thrash cash or swap prematurely.
      
      Fix this by adding one more unfair pass over the zonelist that is
      allowed to spill to remote nodes after the local fairness pass fails but
      before entering the slowpath and waking the kswapds.
      
      This also gets rid of the GFP_THISNODE exemption from the fairness
      protocol because the unfair pass is no longer tied to kswapd, which
      GFP_THISNODE is not allowed to wake up.
      
      However, because remote spills can be more frequent now - we prefer them
      over local kswapd reclaim - the allocation batches on remote nodes could
      underflow more heavily.  When resetting the batches, use
      atomic_long_read() directly instead of zone_page_state() to calculate the
      delta as the latter filters negative counter values.
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3a025760
    • K
      mm: use 'const char *' insted of 'char *' for reason in dump_page() · d230dec1
      Kirill A. Shutemov 提交于
      I tried to use 'dump_page(page, __func__)' for debugging, but it triggers
      warning:
      
        warning: passing argument 2 of `dump_page' discards `const' qualifier from pointer target type [enabled by default]
      
      Let's convert 'reason' to 'const char *' in dump_page() and friends: we
      shouldn't modify it anyway.
      Signed-off-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d230dec1
    • M
      mm: exclude memoryless nodes from zone_reclaim · 70ef57e6
      Michal Hocko 提交于
      We had a report about strange OOM killer strikes on a PPC machine
      although there was a lot of swap free and a tons of anonymous memory
      which could be swapped out.  In the end it turned out that the OOM was a
      side effect of zone reclaim which wasn't unmapping and swapping out and
      so the system was pushed to the OOM.  Although this sounds like a bug
      somewhere in the kswapd vs.  zone reclaim vs.  direct reclaim
      interaction numactl on the said hardware suggests that the zone reclaim
      should not have been set in the first place:
      
        node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
        node 0 size: 0 MB
        node 0 free: 0 MB
        node 2 cpus:
        node 2 size: 7168 MB
        node 2 free: 6019 MB
        node distances:
        node   0   2
        0:  10  40
        2:  40  10
      
      So all the CPUs are associated with Node0 which doesn't have any memory
      while Node2 contains all the available memory.  Node distances cause an
      automatic zone_reclaim_mode enabling.
      
      Zone reclaim is intended to keep the allocations local but this doesn't
      make any sense on the memoryless nodes.  So let's exclude such nodes for
      init_zone_allows_reclaim which evaluates zone reclaim behavior and
      suitable reclaim_nodes.
      Signed-off-by: NMichal Hocko <mhocko@suse.cz>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Tested-by: NNishanth Aravamudan <nacc@linux.vnet.ibm.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70ef57e6
  9. 04 4月, 2014 1 次提交
  10. 04 3月, 2014 2 次提交
    • J
      mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness · 27329369
      Johannes Weiner 提交于
      Jan Stancek reports manual page migration encountering allocation
      failures after some pages when there is still plenty of memory free, and
      bisected the problem down to commit 81c0a2bb ("mm: page_alloc: fair
      zone allocator policy").
      
      The problem is that GFP_THISNODE obeys the zone fairness allocation
      batches on one hand, but doesn't reset them and wake kswapd on the other
      hand.  After a few of those allocations, the batches are exhausted and
      the allocations fail.
      
      Fixing this means either having GFP_THISNODE wake up kswapd, or
      GFP_THISNODE not participating in zone fairness at all.  The latter
      seems safer as an acute bugfix, we can clean up later.
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NRik van Riel <riel@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      27329369
    • D
      mm: close PageTail race · 668f9abb
      David Rientjes 提交于
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting, PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
      Signed-off-by: NDavid Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      668f9abb
  11. 24 1月, 2014 4 次提交