1. 04 Apr 2014, 21 commits
  2. 02 Apr 2014, 1 commit
  3. 29 Mar 2014, 1 commit
  4. 21 Mar 2014, 1 commit
    • mm: fix swapops.h:131 bug if remap_file_pages raced migration · 7e09e738
      Committed by Hugh Dickins
      Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
      little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
      indicating that remove_migration_ptes() failed to find one of the
      migration entries that was temporarily inserted.
      
      The problem comes from remap_file_pages()'s switch from vma_interval_tree
      (good for inserting the migration entry) to i_mmap_nonlinear list (no good
      for locating it again); but can only be a problem if the remap_file_pages()
      range does not cover the whole of the vma (zap_pte() clears the range).
      
      remove_migration_ptes() needs a file_nonlinear method to go down the
      i_mmap_nonlinear list, applying linear location to look for migration
      entries in those vmas too, just in case there was this race.
      
      The file_nonlinear method does need rmap_walk_control.arg to do this;
      but it never needed vma passed in - vma comes from its own iteration.
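      
      As a rough sketch (based on the description above, not a verbatim hunk
      from the patch, and assuming the 3.14-era rmap_walk() interface), the
      migration-entry removal ends up passing the old page through
      rmap_walk_control.arg and wiring up the new nonlinear callback:
      
              static void remove_migration_ptes(struct page *old, struct page *new)
              {
                      struct rmap_walk_control rwc = {
                              .rmap_one       = remove_migration_pte,
                              .arg            = old,
                              /* also walk i_mmap_nonlinear vmas, in case of the race above */
                              .file_nonlinear = remove_linear_migration_ptes_from_nonlinear,
                      };
              
                      rmap_walk(new, &rwc);
              }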
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  5. 20 Mar 2014, 1 commit
    • mm: fix bad rss-counter if remap_file_pages raced migration · 88784396
      Committed by Hugh Dickins
      Fix some "Bad rss-counter state" reports on exit, arising from the
      interaction between page migration and remap_file_pages(): zap_pte()
      must count a migration entry when zapping it.
      
      And yes, it is possible (though very unusual) to find an anon page or
      swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
      get_user_pages(write, force) case which COWs even in a shared mapping.
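      
      A minimal sketch of the idea (simplified, using the usual 3.14-era swap
      entry helpers rather than the literal zap_pte() diff): when the pte is a
      migration entry instead of a present page, the page it refers to still
      has to come off this mm's rss counters:
      
              if (!pte_file(pte)) {
                      swp_entry_t entry = pte_to_swp_entry(pte);
              
                      if (non_swap_entry(entry) && is_migration_entry(entry)) {
                              struct page *page = migration_entry_to_page(entry);
              
                              /* the migrating page is still charged to this mm */
                              if (PageAnon(page))
                                      dec_mm_counter(mm, MM_ANONPAGES);
                              else
                                      dec_mm_counter(mm, MM_FILEPAGES);
                      }
                      free_swap_and_cache(entry);
              }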
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 19 Mar 2014, 1 commit
  7. 18 Mar 2014, 1 commit
    • percpu: allocation size should be even · 2f69fa82
      Committed by Al Viro
      Commit 723ad1d9 ("percpu: store offsets instead of lengths in ->map[]")
      updated the percpu area allocator to use the lowest bit, instead of the
      sign, to signify whether an area is occupied, and forced the minimum
      alignment to 2; unfortunately, it forgot to force the allocation size to
      be even, causing malfunctions for the very rare odd-sized allocations.
      
      Always force the allocations to be even sized.
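      
      In practice that is a one-liner; the placement inside pcpu_alloc() is
      paraphrased here, but the rounding itself is the whole fix:
      
              /* keep offsets even so the low bit stays free as the in-use flag */
              size = ALIGN(size, 2);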
      
      tj: Wrote patch description.
      Original-patch-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  8. 11 Mar 2014, 3 commits
    • mm/Kconfig: fix URL for zsmalloc benchmark · 2216ee85
      Committed by Ben Hutchings
      The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL.  While
      we're at it, remove the unnecessary footnote notation.
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block · 2af120bc
      Committed by Laura Abbott
      We received several reports of bad page state when freeing CMA pages
      previously allocated with alloc_contig_range:
      
          BUG: Bad page state in process Binder_A  pfn:63202
          page:d21130b0 count:0 mapcount:1 mapping:  (null) index:0x7dfbf
          page flags: 0x40080068(uptodate|lru|active|swapbacked)
      
      Based on the page state, it looks like the page was still in use.  The
      page flags do not make sense for the use case though.  Further debugging
      showed that despite alloc_contig_range returning success, at least one
      page in the range still remained in the buddy allocator.
      
      There is an issue with isolate_freepages_block.  In strict mode (which
      CMA uses), if any pages in the range cannot be isolated,
      isolate_freepages_block should fail and return 0.  The current check
      keeps track of the total number of isolated pages and compares against
      the size of the range:
      
              if (strict && nr_strict_required > total_isolated)
                      total_isolated = 0;
      
      After taking the zone lock, if one of the pages in the range is not in
      the buddy allocator, we continue through the loop and do not increment
      total_isolated.  If in the last iteration of the loop we isolate more
      than one page (e.g.  last page needed is a higher order page), the check
      for total_isolated may pass and we fail to detect that a page was
      skipped.  The fix is to bail out of the loop immediately if we are in
      strict mode.  There's no benefit to continuing anyway, since we need all
      pages to be isolated.  Additionally, drop the error checking based on
      nr_strict_required and just check the pfn ranges.  This matches what
      isolate_freepages_range does.
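      
      A simplified sketch of the resulting control flow (not the literal diff;
      the actual isolation of each buddy page is elided):
      
              for (; blockpfn < end_pfn; blockpfn++, cursor++) {
                      struct page *page = cursor;
              
                      if (!PageBuddy(page)) {
                              /* strict (CMA) callers need every page; give up now */
                              if (strict)
                                      break;
                              continue;
                      }
                      /*
                       * ... split the buddy page onto freelist, bump
                       * total_isolated, and advance blockpfn/cursor past
                       * the whole buddy order ...
                       */
              }
              
              /* success only if the scan actually covered the whole range */
              if (strict && blockpfn < end_pfn)
                      total_isolated = 0;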
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: fix GFP_THISNODE callers and clarify · e97ca8e5
      Committed by Johannes Weiner
      GFP_THISNODE is for callers that implement their own clever fallback to
      remote nodes.  It restricts the allocation to the specified node and
      does not invoke reclaim, assuming that the caller will take care of it
      when the fallback fails, e.g.  through a subsequent allocation request
      without GFP_THISNODE set.
      
      However, many current GFP_THISNODE users only want the node exclusive
      aspect of the flag, without actually implementing their own fallback or
      triggering reclaim if necessary.  This results in things like page
      migration failing prematurely even when there is easily reclaimable
      memory available, unless kswapd happens to be running already or a
      concurrent allocation attempt triggers the necessary reclaim.
      
      Convert all callsites that don't implement their own fallback strategy
      to __GFP_THISNODE.  This restricts the allocation to a single node too, but
      at the same time allows the allocator to enter the slowpath, wake
      kswapd, and invoke direct reclaim if necessary, to make the allocation
      happen when memory is full.
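      
      The shape of the conversion, shown on a hypothetical caller (this is not
      a specific hunk from the patch; the flag names are the 3.14-era ones):
      
              /*
               * Before: GFP_THISNODE is __GFP_THISNODE | __GFP_NORETRY |
               * __GFP_NOWARN, so the allocation never wakes kswapd or reclaims.
               */
              page = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | GFP_THISNODE, 0);
              
              /* After: keep only the node restriction; slowpath and reclaim allowed. */
              page = alloc_pages_node(nid, GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);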
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 07 Mar 2014, 3 commits
    • percpu: speed alloc_pcpu_area() up · 3d331ad7
      Committed by Al Viro
      If we know that first N areas are all in use, we can obviously skip
      them when searching for a free one.  And that kind of hint is very
      easy to maintain.
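      
      Roughly (illustrative only, assuming a chunk->first_free field that
      records how many leading areas are known to be busy):
      
              int i;
              
              /* skip the leading run of areas known to be in use */
              for (i = chunk->first_free; i < chunk->map_used; i++) {
                      if (!(chunk->map[i] & 1)) {     /* low bit clear => free */
                              /* ... try to carve the allocation out of map[i] ... */
                              break;
                      }
              }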
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • percpu: store offsets instead of lengths in ->map[] · 723ad1d9
      Committed by Al Viro
      Current code keeps +-length for each area in chunk->map[].  It has
      several unpleasant consequences:
      	* even if we know that the first 50 areas are all in use, allocation
      still needs to go through all those areas just to sum their sizes, just
      to get the offset of a free one.
      	* freeing needs to find the array entry referring to the area
      in question; again, we need to sum the sizes until we reach the offset
      we are interested in.  Note that offsets are monotonic, so a simple
      binary search would do here.
      
      	New data representation: array of <offset, in-use flag> pairs.
      Each pair is represented by one int - we use offset|1 for <offset, in use>
      and offset for <offset, free> (we make sure that all offsets are even).
      In the end we put a sentinel entry - <total size, in use>.  The first
      entry is <0, flag>; it would be possible to store together the flag
      for the Nth area and the offset for the N+1st, but that leads to much
      hairier code.
      
      In other words, where the old variant would have
      	4, -8, -4, 4, -12, 100
      (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
      	<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
      i.e.
      	0, 5, 13, 16, 21, 32, 133
      
      This commit switches to new data representation and takes care of a couple
      of low-hanging fruits in free_pcpu_area() - one is the switch to binary
      search, another is not doing two memmove() when one would do.  Speeding
      the alloc side up (by keeping track of how many areas in the beginning are
      known to be all in use) also becomes possible - that'll be done in the next
      commit.
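      
      A self-contained sketch of the encoding and the binary search it enables
      (plain C, not the kernel's exact helpers):
      
              /* entry = offset | in_use; offsets are kept even, so the low bit is free */
              static inline int area_offset(int entry) { return entry & ~1; }
              static inline int area_in_use(int entry) { return entry & 1; }
              
              /* lower-bound search: offsets are strictly increasing */
              static int find_area(const int *map, int nr, int off)
              {
                      int lo = 0, hi = nr;    /* nr real areas, excluding the sentinel */
              
                      while (lo < hi) {
                              int mid = lo + (hi - lo) / 2;
              
                              if (area_offset(map[mid]) < off)
                                      lo = mid + 1;
                              else
                                      hi = mid;
                      }
                      return lo;              /* index of the entry starting at 'off' */
              }
      
      With the example map above, find_area(map, 6, 16) returns 3, i.e. the
      free 4-byte area at offset 16, without summing any lengths.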
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
    • perpcu: fold pcpu_split_block() into the only caller · 706c16f2
      Committed by Al Viro
      ... and simplify the results a bit.  Makes the next step easier
      to deal with - we will be changing the data representation for
      chunk->map[] and it's easier to do if the code in question is
      not split between pcpu_alloc_area() and pcpu_split_block().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
  10. 06 Mar 2014, 2 commits
  11. 04 Mar 2014, 5 commits
    • mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness · 27329369
      Committed by Johannes Weiner
      Jan Stancek reports manual page migration encountering allocation
      failures after some pages when there is still plenty of memory free, and
      bisected the problem down to commit 81c0a2bb ("mm: page_alloc: fair
      zone allocator policy").
      
      The problem is that GFP_THISNODE obeys the zone fairness allocation
      batches on one hand, but doesn't reset them and wake kswapd on the other
      hand.  After a few of those allocations, the batches are exhausted and
      the allocations fail.
      
      Fixing this means either having GFP_THISNODE wake up kswapd, or
      GFP_THISNODE not participating in zone fairness at all.  The latter
      seems safer as an acute bugfix; we can clean up later.
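      
      A hedged sketch of the exemption (the flag names exist in the page
      allocator; the exact placement in __alloc_pages_nodemask() is
      paraphrased):
      
              int alloc_flags = ALLOC_WMARK_LOW | ALLOC_CPUSET;
              
              /*
               * GFP_THISNODE callers never wake kswapd, so keep them out of
               * the fair per-zone batches instead of letting them exhaust them.
               */
              if ((gfp_mask & GFP_THISNODE) != GFP_THISNODE)
                      alloc_flags |= ALLOC_FAIR;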
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking · 9050d7eb
      Committed by Vlastimil Babka
      Daniel Borkmann reported a VM_BUG_ON assertion failing:
      
        ------------[ cut here ]------------
        kernel BUG at mm/mlock.c:528!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ccm arc4 iwldvm [...]
         video
        CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
        Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
        task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
        RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
        Call Trace:
          do_munmap+0x18f/0x3b0
          vm_munmap+0x41/0x60
          SyS_munmap+0x22/0x30
          system_call_fastpath+0x1a/0x1f
        RIP   munlock_vma_pages_range+0x2e0/0x2f0
        ---[ end trace a0088dcf07ae10f2 ]---
      
      because munlock_vma_pages_range() thinks it's unexpectedly in the middle
      of a THP page.  This can be reproduced with default config since 3.11
      kernels.  A reproducer can be found in the kernel's selftest directory
      for networking by running ./psock_tpacket.
      
      The problem is that an order=2 compound page (allocated by
      alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
      by packet_mmap()) and is mistaken for a THP page and assumed to be order=9.
      
      The checks for THP in munlock came with commit ff6a6da6 ("mm:
      accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
      not trigger a bug.  It just makes munlock_vma_pages_range() skip such
      compound pages until the next 512-pages-aligned page, when it encounters
      a head page.  This is however not a problem for vma's where mlocking has
      no effect anyway, but it can distort the accounting.
      
      Since commit 7225522b ("mm: munlock: batch non-THP page isolation
      and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
      PageTransHuge() check.
      
      This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
      list of flags that make vma's non-mlockable and non-mergeable.  The
      reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
      already on the VM_SPECIAL list, and both are intended for non-LRU pages
      where mlocking makes no sense anyway.  Related Lkml discussion can be
      found in [2].
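      
      The fix itself is essentially a one-line change to the VM_SPECIAL
      definition in include/linux/mm.h (shown as described above; the
      surrounding comment is omitted):
      
              -#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP)
              +#define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)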
      
       [1] tools/testing/selftests/net/psock_tpacket
       [2] https://lkml.org/lkml/2014/1/10/427
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Reported-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Tested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org> [3.11.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: reparent charges of children before processing parent · 4fb1a86f
      Committed by Filipe Brandenburger
      Sometimes the cleanup after memcg hierarchy testing gets stuck in
      mem_cgroup_reparent_charges(), unable to bring non-kmem usage down to 0.
      
      There may turn out to be several causes, but a major cause is this: the
      workitem to offline parent can get run before workitem to offline child;
      parent's mem_cgroup_reparent_charges() circles around waiting for the
      child's pages to be reparented to its lrus, but it's holding
      cgroup_mutex which prevents the child from reaching its
      mem_cgroup_reparent_charges().
      
      Further testing showed that an ordered workqueue for cgroup_destroy_wq
      is not always good enough: percpu_ref_kill_and_confirm's call_rcu_sched
      stage on the way can mess up the order before reaching the workqueue.
      
      Instead, when offlining a memcg, call mem_cgroup_reparent_charges() on
      all its children (and grandchildren, in the correct order) to have their
      charges reparented first.
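      
      A sketch of the offline path after the change (other teardown elided;
      css_for_each_descendant_post() visits every descendant before its parent
      and the root css last):
      
              static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
              {
                      struct cgroup_subsys_state *iter;
              
                      /* ... other offline teardown elided ... */
              
                      /* children (and grandchildren) are reparented before @css itself */
                      css_for_each_descendant_post(iter, css)
                              mem_cgroup_reparent_charges(mem_cgroup_from_css(iter));
              }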
      
      Fixes: e5fca243 ("cgroup: use a dedicated workqueue for cgroup destruction")
      Signed-off-by: Filipe Brandenburger <filbranden@google.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>	[v3.10+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: fix endless loop in __mem_cgroup_iter_next() · ce48225f
      Committed by Hugh Dickins
      Commit 0eef6156 ("memcg: fix css reference leak and endless loop in
      mem_cgroup_iter") got the interaction with the commit a few before it,
      d8ad3055 ("mm/memcg: iteration skip memcgs not yet fully
      initialized"), slightly wrong, and we didn't notice at the time.
      
      It's elusive, and harder to get than the original, but for a couple of
      days before rc1, I several times saw an endless loop similar to that
      supposedly being fixed.
      
      This time it was a tighter loop in __mem_cgroup_iter_next(): because we
      can get here when our root has already been offlined, and the ordering
      of conditions was such that we then just cycled around forever.
      
      Fixes: 0eef6156 ("memcg: fix css reference leak and endless loop in mem_cgroup_iter").
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: <stable@vger.kernel.org>	[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: close PageTail race · 668f9abb
      Committed by David Rientjes
      Commit bf6bddf1 ("mm: introduce compaction and migration for
      ballooned pages") introduces page_count(page) into memory compaction
      which dereferences page->first_page if PageTail(page).
      
      This results in a very rare NULL pointer dereference on the
      aforementioned page_count(page).  Indeed, anything that does
      compound_head(), including page_count() is susceptible to racing with
      prep_compound_page() and seeing a NULL or dangling page->first_page
      pointer.
      
      This patch uses Andrea's implementation of compound_trans_head() that
      deals with such a race and makes it the default compound_head()
      implementation.  This includes a read memory barrier that ensures that
      if PageTail(head) is true that we return a head page that is neither
      NULL nor dangling.  The patch then adds a store memory barrier to
      prep_compound_page() to ensure page->first_page is set.
      
      This is the safest way to ensure we see the head page that we are
      expecting; PageTail(page) is already in the unlikely() path and the
      memory barriers are unfortunately required.
      
      Hugetlbfs is the exception, we don't enforce a store memory barrier
      during init since no race is possible.
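      
      The resulting compound_head() looks roughly like this (modelled on the
      compound_trans_head() pattern the patch generalizes; treat it as a
      sketch rather than the verbatim hunk):
      
              static inline struct page *compound_head(struct page *page)
              {
                      if (unlikely(PageTail(page))) {
                              struct page *head = page->first_page;
              
                              /*
                               * head may be NULL or dangling if the page stopped
                               * being a tail page meanwhile; pairs with the
                               * smp_wmb() in prep_compound_page() after
                               * first_page is set.
                               */
                              smp_rmb();
                              if (likely(PageTail(page)))
                                      return head;
                      }
                      return page;
              }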
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Holger Kiehl <Holger.Kiehl@dwd.de>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rafael Aquini <aquini@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>