1. 04 Apr, 2014 24 commits
    • mm: keep page cache radix tree nodes in check · 449dd698
      Johannes Weiner committed
      Previously, page cache radix tree nodes were freed after reclaim emptied
      out their page pointers.  But now reclaim stores shadow entries in their
      place, which are only reclaimed when the inodes themselves are
      reclaimed.  This is problematic for bigger files that are still in use
      after they have a significant amount of their cache reclaimed, without
      any of those pages actually refaulting.  The shadow entries will just
      sit there and waste memory.  In the worst case, the shadow entries will
      accumulate until the machine runs out of memory.
      
      To get this under control, the VM will track radix tree nodes
      exclusively containing shadow entries on a per-NUMA node list.  Per-NUMA
      rather than global because we expect the radix tree nodes themselves to
      be allocated node-locally and we want to reduce cross-node references of
      otherwise independent cache workloads.  A simple shrinker will then
      reclaim these nodes on memory pressure.
      
      A few things need to be stored in the radix tree node to implement the
      shadow node LRU and allow tree deletions coming from the list:
      
      1. There is no index available that would describe the reverse path
         from the node up to the tree root, which is needed to perform a
         deletion.  To solve this, encode in each node its offset inside the
         parent.  This can be stored in the unused upper bits of the same
         member that stores the node's height at no extra space cost.
      
      2. The number of shadow entries needs to be counted in addition to the
         regular entries, to quickly detect when the node is ready to go to
         the shadow node LRU list.  The current entry count is an unsigned
         int but the maximum number of entries is 64, so a shadow counter
         can easily be stored in the unused upper bits.
      
      3. Tree modification needs tree lock and tree root, which are located
         in the address space, so store an address_space backpointer in the
         node.  The parent pointer of the node is in a union with the 2-word
         rcu_head, so the backpointer comes at no extra cost as well.
      
      4. The node needs to be linked to an LRU list, which requires a list
         head inside the node.  This does increase the size of the node, but
         it does not change the number of objects that fit into a slab page.
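
      The bit packing described in points 1 and 2 can be sketched as below.
      This is a minimal userspace illustration, not the kernel's actual node
      layout: the field widths and every sketch_* name are assumptions made
      up for the example.

```c
#include <stdbool.h>

/* Hypothetical widths; the real kernel constants may differ. */
#define SKETCH_HEIGHT_BITS 4u
#define SKETCH_HEIGHT_MASK ((1u << SKETCH_HEIGHT_BITS) - 1u)
#define SKETCH_COUNT_BITS  7u   /* 7 bits are enough to hold 0..64 */
#define SKETCH_COUNT_MASK  ((1u << SKETCH_COUNT_BITS) - 1u)

/* Height in the low bits, offset inside the parent in the upper bits. */
static inline unsigned int sketch_pack_path(unsigned int height,
                                            unsigned int offset)
{
    return (offset << SKETCH_HEIGHT_BITS) | (height & SKETCH_HEIGHT_MASK);
}

static inline unsigned int sketch_path_height(unsigned int path)
{
    return path & SKETCH_HEIGHT_MASK;
}

static inline unsigned int sketch_path_offset(unsigned int path)
{
    return path >> SKETCH_HEIGHT_BITS;
}

/* Regular entries in the low bits, shadow entries in the upper bits. */
static inline unsigned int sketch_pack_count(unsigned int regular,
                                             unsigned int shadows)
{
    return (shadows << SKETCH_COUNT_BITS) | (regular & SKETCH_COUNT_MASK);
}

static inline unsigned int sketch_count_regular(unsigned int count)
{
    return count & SKETCH_COUNT_MASK;
}

static inline unsigned int sketch_count_shadows(unsigned int count)
{
    return count >> SKETCH_COUNT_BITS;
}

/* A node is a shadow LRU candidate when it holds shadows and nothing else. */
static inline bool sketch_only_shadows(unsigned int count)
{
    return sketch_count_regular(count) == 0 &&
           sketch_count_shadows(count) != 0;
}
```

      Both values share one unsigned int, so neither the parent offset nor
      the shadow counter costs extra space in the node.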
      
      [akpm@linux-foundation.org: export the right function]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: thrash detection-based file cache sizing · a528910e
      Johannes Weiner committed
      The VM maintains cached filesystem pages on two types of lists.  One
      list holds the pages recently faulted into the cache, the other list
      holds pages that have been referenced repeatedly on that first list.
      The idea is to prefer reclaiming young pages over those that have shown
      to benefit from caching in the past.  We call the recently used [...] but
      ultimately was not significantly better than a FIFO policy and still
      thrashed cache based on eviction speed, rather than actual demand for
      cache.
      
      This patch solves one half of the problem by decoupling the ability to
      detect working set changes from the inactive list size.  By maintaining
      a history of recently evicted file pages it can detect frequently used
      pages with an arbitrarily small inactive list size, and subsequently
      apply pressure on the active list based on actual demand for cache, not
      just overall eviction speed.
      
      Every zone maintains a counter that tracks inactive list aging speed.
      When a page is evicted, a snapshot of this counter is stored in the
      now-empty page cache radix tree slot.  On refault, the minimum access
      distance of the page can be assessed, to evaluate whether the page
      should be part of the active list or not.
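
      The snapshot-and-compare step can be sketched as follows.  This is a
      simplified single-threaded illustration, not the kernel code: the
      sketch_* names are hypothetical, and comparing the distance against
      the active list size is an assumption of this sketch about how the
      "should it be active" decision could be made.

```c
#include <stdbool.h>

/* Hypothetical stand-in for a zone's inactive list aging counter. */
struct sketch_zone {
    unsigned long inactive_age;
};

/* On eviction: advance the counter and return the snapshot to store
 * in the now-empty page cache slot. */
static unsigned long sketch_snapshot_on_eviction(struct sketch_zone *z)
{
    z->inactive_age++;
    return z->inactive_age;
}

/* On refault: how many evictions happened since this page was evicted.
 * Unsigned subtraction keeps the result correct across wraparound. */
static unsigned long sketch_refault_distance(const struct sketch_zone *z,
                                             unsigned long snapshot)
{
    return z->inactive_age - snapshot;
}

/* Assumed policy: activate if the page would have stayed resident had
 * the active list's memory been available to the inactive list. */
static bool sketch_should_activate(const struct sketch_zone *z,
                                   unsigned long snapshot,
                                   unsigned long active_pages)
{
    return sketch_refault_distance(z, snapshot) <= active_pages;
}
```

      The distance is a lower bound on the real access distance, because
      only evictions are counted, not accesses to pages that stayed
      resident.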
      
      This fixes the VM's blindness towards working set changes in excess of
      the inactive list.  And it's the foundation to further improve the
      protection ability and reduce the minimum inactive list size of 50%.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Bob Liu <bob.liu@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm + fs: store shadow entries in page cache · 91b0abe3
      Johannes Weiner committed
      Reclaim will be leaving shadow entries in the page cache radix tree upon
      evicting the real page.  As those pages are found from the LRU, an
      iput() can lead to the inode being freed concurrently.  At this point,
      reclaim must no longer install shadow pages because the inode freeing
      code needs to ensure the page tree is really empty.
      
      Add an address_space flag, AS_EXITING, that the inode freeing code sets
      under the tree lock before doing the final truncate.  Reclaim will check
      for this flag before installing shadow pages.
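
      The flag check can be sketched like this.  It is a toy model with a
      single slot and no real locking; the sketch_* names and the flag
      value are hypothetical, and both functions are assumed to run with
      the tree lock held.

```c
#include <stddef.h>

#define SKETCH_AS_EXITING (1ul << 0)   /* hypothetical bit position */

/* Toy address space: one flags word and one page cache slot. */
struct sketch_mapping {
    unsigned long flags;
    void *slot;
};

/* Inode freeing path: set the flag before the final truncate.
 * Assumed to be called with the tree lock held. */
static void sketch_begin_teardown(struct sketch_mapping *m)
{
    m->flags |= SKETCH_AS_EXITING;
}

/* Reclaim path: replace the evicted page with a shadow entry unless
 * the inode is going away, in which case the slot is left empty.
 * Assumed to be called with the tree lock held. */
static void sketch_evict(struct sketch_mapping *m, void *shadow)
{
    if (m->flags & SKETCH_AS_EXITING)
        m->slot = NULL;
    else
        m->slot = shadow;
}
```

      Because both paths run under the same lock, reclaim either sees the
      flag and leaves the slot empty, or installs its shadow entry before
      the final truncate runs and removes it.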
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm + fs: prepare for non-page entries in page cache radix trees · 0cd6144a
      Johannes Weiner committed
      shmem mappings already contain exceptional entries where swap slot
      information is remembered.
      
      To be able to store eviction information for regular page cache, prepare
      every site dealing with the radix trees directly to handle entries other
      than pages.
      
      The common lookup functions will filter out non-page entries and return
      NULL for page cache holes, just as before.  But provide a raw version of
      the API which returns non-page entries as well, and switch shmem over to
      use it.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: filemap: move radix tree hole searching here · e7b563bb
      Johannes Weiner committed
      The radix tree hole searching code is only used for page cache, for
      example the readahead code trying to get a picture of the area
      surrounding a fault.

      Until now, it sufficed to rely on the radix tree definition of holes,
      which is "empty tree slot".  But this is about to change, as shadow page
      descriptors will be stored in the page cache after the actual pages get
      evicted from memory.
      
      Move the functions over to mm/filemap.c and make them native page cache
      operations, where they can later be adapted to handle the new definition
      of "page cache hole".
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: shmem: save one radix tree lookup when truncating swapped pages · 6dbaf22c
      Johannes Weiner committed
      Page cache radix tree slots are usually stabilized by the page lock, but
      shmem's swap cookies have no such thing.  Because the overall truncation
      loop is lockless, the swap entry is currently confirmed by a tree lookup
      and then deleted by another tree lookup under the same tree lock region.
      
      Use radix_tree_delete_item() instead, which does the verification and
      deletion with only one lookup.  This also allows removing the
      delete-only special case from shmem_radix_tree_replace().
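
      The "verify and delete in one lookup" idea can be sketched on a flat
      array.  This is not the kernel's radix_tree_delete_item(); the
      sketch_* names, the fixed slot count, and the convention that a NULL
      expected item matches any entry are assumptions of the example.

```c
#include <stddef.h>

#define SKETCH_SLOTS 64

/* Toy stand-in for a tree: a flat array of slots. */
struct sketch_tree {
    void *slot[SKETCH_SLOTS];
};

/* Delete the entry at index only if it is still the expected item.
 * Returns the removed entry, or NULL if the slot was empty or held
 * something else.  One slot access does both the check and the delete. */
static void *sketch_delete_item(struct sketch_tree *t, size_t index,
                                void *expected)
{
    void *entry;

    if (index >= SKETCH_SLOTS)
        return NULL;

    entry = t->slot[index];
    if (entry == NULL)
        return NULL;
    if (expected != NULL && entry != expected)
        return NULL;

    t->slot[index] = NULL;
    return entry;
}
```

      A caller that found a swap entry during a lockless scan can pass it
      as the expected item; a NULL return then means the slot changed in
      the meantime and nothing was removed.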
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Minchan Kim <minchan@kernel.org>
      Reviewed-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Bob Liu <bob.liu@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Luigi Semenzato <semenzato@google.com>
      Cc: Metin Doslu <metin@citusdata.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Ozgun Erdogan <ozgun@citusdata.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Roman Gushchin <klamm@yandex-team.ru>
      Cc: Ryan Mallon <rmallon@gmail.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: shrink_slab: rename max_pass -> freeable · d5bc5fd3
      Vladimir Davydov committed
      The name `max_pass' is misleading, because this variable actually holds
      the estimated number of freeable objects, not the maximal number of
      objects we can scan in this pass, which can be twice that.  Rename it to
      reflect its actual meaning.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: improve page-fault scalability · 8382d914
      Davidlohr Bueso committed
      The kernel can currently only handle a single hugetlb page fault at a
      time.  This is due to a single mutex that serializes the entire path.
      This lock protects from spurious OOM errors under conditions of low
      availability of free hugepages.  This problem is specific to hugepages,
      because it is normal to want to use every single hugepage in the system
      - with normal pages we simply assume there will always be a few spare
      pages which can be used temporarily until the race is resolved.
      
      Address this problem by using a table of mutexes, allowing a better
      chance of parallelization, where each hugepage is individually
      serialized.  The hash key is selected depending on the mapping type.
      For shared ones it consists of the address space and file offset being
      faulted; while for private ones the mm and virtual address are used.
      The size of the table is selected based on a compromise of collisions
      and memory footprint of a series of database workloads.
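
      The key selection and table lookup can be sketched as below.  This is
      an illustration only: the table size, the sketch_* names, and the
      mixing function (a splitmix64-style finalizer swapped in for
      simplicity) are assumptions, not what the patch itself uses.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FAULT_TABLE_SIZE 64u   /* hypothetical; power of two */

struct sketch_fault_key {
    uint64_t a;
    uint64_t b;
};

/* Shared mappings hash on (address space, file offset);
 * private mappings hash on (mm, virtual address). */
static struct sketch_fault_key sketch_fault_key_for(bool shared,
        uint64_t mapping, uint64_t file_offset,
        uint64_t mm, uint64_t vaddr)
{
    struct sketch_fault_key k;

    if (shared) {
        k.a = mapping;
        k.b = file_offset;
    } else {
        k.a = mm;
        k.b = vaddr;
    }
    return k;
}

/* splitmix64-style bit mixer, used here only as a simple stand-in hash. */
static uint64_t sketch_mix64(uint64_t x)
{
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Map a key to the index of the mutex that serializes its faults. */
static unsigned int sketch_fault_mutex_index(struct sketch_fault_key k)
{
    uint64_t h = sketch_mix64(k.a) ^
                 sketch_mix64(k.b + 0x9e3779b97f4a7c15ULL);

    return (unsigned int)(h & (SKETCH_FAULT_TABLE_SIZE - 1u));
}
```

      Two tasks faulting on the same offset of a shared file get the same
      index regardless of their mm, so they still serialize against each
      other, while faults on unrelated pages usually land on different
      mutexes.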
      
      Large database workloads that make heavy use of hugepages can be
      particularly exposed to this issue, causing start-up times to be
      painfully slow.  This patch reduces the startup time of a 10 Gb Oracle
      DB (with ~5000 faults) from 37.5 secs to 25.7 secs.  Larger workloads
      will naturally benefit even more.
      
      NOTE:
      The only downside to this patch, detected by Joonsoo Kim, is that a
      small race is possible in private mappings: A child process (with its
      own mm, after cow) can instantiate a page that is already being handled
      by the parent in a cow fault.  When low on pages, this can trigger
      spurious OOMs.  I have not been able to think of an efficient way of
      handling this...  but do we really care about such a tiny window? We
      already maintain another theoretical race with normal pages.  If not, one
      possible way is to maintain the single hash for private mappings --
      any workloads that *really* suffer from this scaling problem should
      already use shared mappings.
      
      [akpm@linux-foundation.org: remove stray + characters, go BUG if hugetlb_init() kmalloc fails]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: use vma_resv_map() map types · 4e35f483
      Joonsoo Kim committed
      Until now, we get a resv_map in two ways, depending on the mapping type.
      This makes the code dirty and unreadable.  Unify it.
      
      [davidlohr@hp.com: code cleanups]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: remove resv_map_put · f031dd27
      Joonsoo Kim committed
      This is a preparation patch to unify the use of vma_resv_map()
      regardless of the map type.  This patch prepares it by removing
      resv_map_put(), which only works for HPAGE_RESV_OWNER's resv_map, not
      for all resv_maps.
      
      [davidlohr@hp.com: update changelog]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: fix race in region tracking · 7b24d861
      Davidlohr Bueso committed
      There is a race condition if we map the same file in different processes.
      Region tracking is protected by mmap_sem and hugetlb_instantiation_mutex.
      When we do mmap, we don't grab the hugetlb_instantiation_mutex, but only
      mmap_sem (exclusively).  This doesn't prevent other tasks from modifying
      the region structure, so it can be modified by two processes
      concurrently.

      To solve this, introduce a spinlock to resv_map and make the region
      manipulation functions grab it before they do actual work.
      
      [davidlohr@hp.com: updated changelog]
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: David Gibson <david@gibson.dropbear.id.au>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: improve, cleanup resv_map parameters · 1406ec9b
      Joonsoo Kim committed
      To change the protection method for region tracking to a fine-grained
      one, we pass the resv_map, instead of list_head, to the region
      manipulation functions.

      This doesn't introduce any functional change; it just prepares for the
      next step.
      
      [davidlohr@hp.com: update changelog]
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, hugetlb: unify region structure handling · 9119a41e
      Joonsoo Kim committed
      Currently, to track reserved and allocated regions, we use two different
      ways, depending on the mapping.  For MAP_SHARED, we use
      address_mapping's private_list, while for MAP_PRIVATE, we use a
      resv_map.

      Now, we are preparing to change the coarse-grained lock which protects
      the region structure to a fine-grained lock, and this difference hinders
      it.  So, before changing it, unify region structure handling,
      consistently using a resv_map regardless of the kind of mapping.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Davidlohr Bueso <davidlohr@hp.com>
      Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
      Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: David Gibson <david@gibson.dropbear.id.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: optimize put_mems_allowed() usage · d26914d1
      Mel Gorman committed
      Since put_mems_allowed() is strictly optional (it's a seqcount retry), we
      don't need to evaluate the function if the allocation was in fact
      successful, saving an smp_rmb, some loads and comparisons on some
      relatively fast paths.

      Since the naming, get/put_mems_allowed(), does suggest a mandatory
      pairing, rename the interface, as suggested by Mel, to resemble the
      seqcount interface.
      
      This gives us: read_mems_allowed_begin() and read_mems_allowed_retry(),
      where it is important to note that the return value of the latter call
      is inverted from its previous incarnation.
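
      The begin/retry pattern and the skipped check on success can be
      sketched as follows.  This is a single-threaded model without memory
      barriers, not the kernel implementation; all sketch_* names and the
      bitmask representation of the allowed nodes are assumptions.

```c
#include <stdbool.h>

/* Toy model of a task's allowed-nodes mask guarded by a sequence count. */
struct sketch_mems {
    unsigned int seq;        /* bumped on every update of the mask */
    unsigned long allowed;   /* bit n set: node n may be used */
};

/* Start of a read section: remember the current sequence number. */
static unsigned int sketch_read_begin(const struct sketch_mems *m)
{
    return m->seq;
}

/* End of a read section: true means the mask changed, so retry. */
static bool sketch_read_retry(const struct sketch_mems *m, unsigned int cookie)
{
    return m->seq != cookie;
}

/* Writer side: install a new mask and bump the sequence number. */
static void sketch_update(struct sketch_mems *m, unsigned long allowed)
{
    m->allowed = allowed;
    m->seq++;
}

/* Hypothetical allocation attempt: succeeds only if node 0 is allowed. */
static bool sketch_try_alloc_node0(const struct sketch_mems *m)
{
    return (m->allowed & 1ul) != 0;
}

/* Allocation loop: the retry check is evaluated only after a failure. */
static bool sketch_alloc(const struct sketch_mems *m)
{
    unsigned int cookie;
    bool ok;

    do {
        cookie = sketch_read_begin(m);
        ok = sketch_try_alloc_node0(m);
    } while (!ok && sketch_read_retry(m, cookie));

    return ok;
}
```

      The short-circuit in the loop condition is the optimization: a
      successful allocation never pays for the retry check.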
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm, compaction: ignore pageblock skip when manually invoking compaction · 91ca9186
      David Rientjes committed
      The cached pageblock hint should be ignored when triggering compaction
      through /proc/sys/vm/compact_memory so all eligible memory is isolated.
      Manually invoking compaction is known to be expensive and is mainly used
      for debugging, so there's no need to skip pageblocks based on heuristics.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: remove shrink_control arg from do_try_to_free_pages() · 3115cd91
      Vladimir Davydov committed
      There is no need to pass on a shrink_control struct from
      try_to_free_pages() and friends to do_try_to_free_pages() and then to
      shrink_zones(), because it is only used in shrink_zones() and the only
      field initialized on the top level is gfp_mask, which is always equal to
      scan_control.gfp_mask.  So let's move shrink_control initialization to
      shrink_zones().
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: move call to shrink_slab() to shrink_zones() · 65ec02cb
      Vladimir Davydov committed
      This reduces the indentation level of do_try_to_free_pages() and removes
      the extra loop over all eligible zones counting the number of on-LRU pages.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Reviewed-by: Glauber Costa <glommer@gmail.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: vmscan: respect NUMA policy mask when shrinking slab on direct reclaim · 99120b77
      Vladimir Davydov committed
      When direct reclaim is executed by a process bound to a set of NUMA
      nodes, we should scan only those nodes when possible, but currently we
      will scan kmem from all online nodes even if the kmem shrinker is NUMA
      aware.  In other words, binding a process to a particular NUMA node won't
      prevent it from shrinking inode/dentry caches from other nodes, which is
      not good.  Fix this.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Dave Chinner <dchinner@redhat.com>
      Cc: Glauber Costa <glommer@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: change some global variables to int · 8910ae89
      Li Zefan committed
      They don't have to be atomic_t, because they are simple boolean toggles.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: remove redundant code · 5f3bf19a
      Li Zefan committed
      Remove kmemleak_padding() and kmemleak_release().
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: allow freeing internal objects after kmemleak was disabled · c89da70c
      Li Zefan committed
      Currently if kmemleak is disabled, the kmemleak objects can never be
      freed, no matter if it's disabled by a user or due to fatal errors.
      
      Those objects can be a big waste of memory.
      
          OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
        1200264 1197433  99%    0.30K  46164       26    369312K kmemleak_object
      
      With this patch, after kmemleak was disabled you can reclaim memory
      with:
      
      	# echo clear > /sys/kernel/debug/kmemleak
      
      Also inform users about this with a printk.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kmemleak: free internal objects only if there're no leaks to be reported · dc9b3f42
      Li Zefan committed
      Currently if you stop kmemleak thread before disabling kmemleak,
      kmemleak objects will be freed and so you won't be able to check
      previously reported leaks.
      
      With this patch, kmemleak objects won't be freed if there're leaks that
      can be reported.
      Signed-off-by: Li Zefan <lizefan@huawei.com>
      Acked-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • bdi: avoid oops on device removal · 5acda9d1
      Jan Kara committed
      After commit 839a8e86 ("writeback: replace custom worker pool
      implementation with unbound workqueue"), when a device is removed while
      we are writing to it, we crash in bdi_writeback_workfn() ->
      set_worker_desc() because bdi->dev is NULL.
      
      This can happen because even though bdi_unregister() cancels all pending
      flushing work, nothing really prevents new ones from being queued from
      balance_dirty_pages() or other places.
      
      Fix the problem by clearing the BDI_registered bit in bdi_unregister()
      and checking it before scheduling any flushing work.
      
      Fixes: 839a8e86
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • backing_dev: fix hung task on sync · 6ca738d6
      Derek Basehore committed
      bdi_wakeup_thread_delayed() used the mod_delayed_work() function to
      schedule work to writeback dirty inodes.  The problem with this is that
      it can delay work that is scheduled for immediate execution, such as the
      work from sync_inodes_sb().  This can happen since mod_delayed_work()
      can now steal work from a work_queue.  This fixes the problem by using
      queue_delayed_work() instead.  This is a regression caused by commit
      839a8e86 ("writeback: replace custom worker pool implementation with
      unbound workqueue").
      
      The reason that this causes a problem is that laptop-mode will change
      the delay, dirty_writeback_centisecs, to 60000 (10 minutes) by default.
      In the case that bdi_wakeup_thread_delayed() races with
      sync_inodes_sb(), sync will be stopped for 10 minutes and trigger a hung
      task.  Even if dirty_writeback_centisecs is not long enough to cause a
      hung task, we still don't want to delay sync for that long.
      
      We fix the problem by using queue_delayed_work() when we want to
      schedule writeback sometime in future.  This function doesn't change the
      timer if it is already armed.
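
      The difference between the two scheduling calls can be modelled with
      a toy timer.  The sketch_* functions below only imitate the behaviour
      described above (leave an armed timer alone versus always re-arm it);
      they are hypothetical and are not the workqueue API.

```c
#include <stdbool.h>

/* Toy delayed work item: armed or not, and when it should run. */
struct sketch_dwork {
    bool pending;
    unsigned long expires;
};

/* Models queue_delayed_work(): a no-op if the work is already pending.
 * Returns true if the work was newly queued. */
static bool sketch_queue_delayed(struct sketch_dwork *w,
                                 unsigned long now, unsigned long delay)
{
    if (w->pending)
        return false;
    w->pending = true;
    w->expires = now + delay;
    return true;
}

/* Models mod_delayed_work(): always (re)arms the timer.
 * Returns true if an already pending work had its timer changed. */
static bool sketch_mod_delayed(struct sketch_dwork *w,
                               unsigned long now, unsigned long delay)
{
    bool was_pending = w->pending;

    w->pending = true;
    w->expires = now + delay;
    return was_pending;
}
```

      With sync work queued for immediate execution, a later periodic
      wakeup using the queue variant leaves the expiry untouched, whereas
      the mod variant would push it out by the full writeback interval.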
      
      For the same reason, we also change bdi_writeback_workfn() to
      immediately queue the work again in the case that the work_list is not
      empty.  The same problem can happen if the sync work is run on the
      rescue worker.
      
      [jack@suse.cz: update changelog, add comment, use bdi_wakeup_thread_delayed()]
      Signed-off-by: Derek Basehore <dbasehore@chromium.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zento.linux.org.uk>
      Reviewed-by: Tejun Heo <tj@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "Darrick J. Wong" <darrick.wong@oracle.com>
      Cc: Derek Basehore <dbasehore@chromium.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Benson Leung <bleung@chromium.org>
      Cc: Sonny Rao <sonnyrao@chromium.org>
      Cc: Luigi Semenzato <semenzato@chromium.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 02 Apr, 2014 1 commit
  3. 29 Mar, 2014 1 commit
  4. 21 Mar, 2014 1 commit
    • H
      mm: fix swapops.h:131 bug if remap_file_pages raced migration · 7e09e738
      Hugh Dickins committed
      Add remove_linear_migration_ptes_from_nonlinear(), to fix an interesting
      little include/linux/swapops.h:131 BUG_ON(!PageLocked) found by trinity:
      indicating that remove_migration_ptes() failed to find one of the
      migration entries that was temporarily inserted.
      
      The problem comes from remap_file_pages()'s switch from vma_interval_tree
      (good for inserting the migration entry) to i_mmap_nonlinear list (no good
      for locating it again); but can only be a problem if the remap_file_pages()
      range does not cover the whole of the vma (zap_pte() clears the range).
      
      remove_migration_ptes() needs a file_nonlinear method to go down the
      i_mmap_nonlinear list, applying linear location to look for migration
      entries in those vmas too, just in case there was this race.
      
      The file_nonlinear method does need rmap_walk_control.arg to do this;
      but it never needed vma passed in - vma comes from its own iteration.
      Reported-and-tested-by: Dave Jones <davej@redhat.com>
      Reported-and-tested-by: Sasha Levin <sasha.levin@oracle.com>
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7e09e738
  5. 20 Mar, 2014 1 commit
    • H
      mm: fix bad rss-counter if remap_file_pages raced migration · 88784396
      Hugh Dickins committed
      Fix some "Bad rss-counter state" reports on exit, arising from the
      interaction between page migration and remap_file_pages(): zap_pte()
      must count a migration entry when zapping it.
      
      And yes, it is possible (though very unusual) to find an anon page or
      swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
      get_user_pages(write, force) case which COWs even in a shared mapping.
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Tested-by: Sasha Levin <sasha.levin@oracle.com>
      Tested-by: Dave Jones <davej@redhat.com>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88784396
  6. 19 Mar, 2014 1 commit
  7. 18 Mar, 2014 1 commit
    • V
      percpu: allocation size should be even · 2f69fa82
      Viro committed
      723ad1d9 ("percpu: store offsets instead of lengths in ->map[]")
      updated percpu area allocator to use the lowest bit, instead of sign,
      to signify whether the area is occupied and forced min align to 2;
      unfortunately, it forgot to force the allocation size to be even
      causing malfunctions for the very rare odd-sized allocations.
      
      Always force the allocations to be even sized.
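
      Why an odd size breaks the low-bit encoding can be sketched as follows.
      This is an illustration only, assuming the offset|flag layout described
      in 723ad1d9; toy_even_size(), toy_encode() and toy_offset() are
      hypothetical helper names, not functions from mm/percpu.c.

```c
#include <assert.h>
#include <stddef.h>

/* Round an allocation size up to the next even number. */
static size_t toy_even_size(size_t size)
{
    return size + (size & 1);
}

/* Encode <offset, in-use> in one int: the low bit carries the flag. */
static int toy_encode(int offset, int in_use)
{
    assert(offset >= 0);
    return offset | (in_use ? 1 : 0);
}

/* Recover the offset by masking off the flag bit. */
static int toy_offset(int entry)
{
    return entry & ~1;
}
```

      If an area of size 7 were allocated at offset 0, the following free
      area would start at offset 7; toy_encode(7, 0) has its low bit set, so
      the entry reads back as in use at offset 6. Rounding the size to 8
      keeps every offset even.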
      
      tj: Wrote patch description.
      Original-patch-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      2f69fa82
  8. 11 Mar, 2014 3 commits
    • B
      mm/Kconfig: fix URL for zsmalloc benchmark · 2216ee85
      Ben Hutchings committed
      The help text for CONFIG_PGTABLE_MAPPING has an incorrect URL.  While
      we're at it, remove the unnecessary footnote notation.
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2216ee85
    • L
      mm/compaction: break out of loop on !PageBuddy in isolate_freepages_block · 2af120bc
      Laura Abbott committed
      We received several reports of bad page state when freeing CMA pages
      previously allocated with alloc_contig_range:
      
          BUG: Bad page state in process Binder_A  pfn:63202
          page:d21130b0 count:0 mapcount:1 mapping:  (null) index:0x7dfbf
          page flags: 0x40080068(uptodate|lru|active|swapbacked)
      
      Based on the page state, it looks like the page was still in use.  The
      page flags do not make sense for the use case though.  Further debugging
      showed that despite alloc_contig_range returning success, at least one
      page in the range still remained in the buddy allocator.
      
      There is an issue with isolate_freepages_block.  In strict mode (which
      CMA uses), if any pages in the range cannot be isolated,
      isolate_freepages_block should return failure (0).  The current check
      keeps track of the total number of isolated pages and compares against
      the size of the range:
      
              if (strict && nr_strict_required > total_isolated)
                      total_isolated = 0;
      
      After taking the zone lock, if one of the pages in the range is not in
      the buddy allocator, we continue through the loop and do not increment
      total_isolated.  If in the last iteration of the loop we isolate more
      than one page (e.g.  last page needed is a higher order page), the check
      for total_isolated may pass and we fail to detect that a page was
      skipped.  The fix is to bail out of the loop immediately if we are in
      strict mode.  There's no benefit to continuing anyway since we need all
      pages to be isolated.  Additionally, drop the error checking based on
      nr_strict_required and just check the pfn ranges.  This matches with
      what isolate_freepages_range does.
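
      The miscount can be reproduced with a small user-space model.  This is
      a sketch under simplifying assumptions, not the code from
      mm/compaction.c: pages are an array of orders, TOY_NOT_BUDDY marks a
      page that is not in the buddy allocator, and toy_isolate_old() and
      toy_isolate_fixed() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NOT_BUDDY (-1)

/*
 * order[pfn] is the order of the free buddy page whose head is at pfn,
 * or TOY_NOT_BUDDY. Returns the number of pages isolated; 0 means
 * failure in strict mode. This variant keeps the count-based check.
 */
static unsigned long toy_isolate_old(const int *order, unsigned long start,
                                     unsigned long end, bool strict)
{
    unsigned long nr_strict_required = end - start;
    unsigned long total_isolated = 0;
    unsigned long pfn = start;

    assert(start <= end);
    while (pfn < end) {
        if (order[pfn] == TOY_NOT_BUDDY) {
            pfn++;              /* skip the page and keep scanning */
            continue;
        }
        total_isolated += 1UL << order[pfn];
        pfn += 1UL << order[pfn];
    }
    if (strict && nr_strict_required > total_isolated)
        total_isolated = 0;
    return total_isolated;
}

/* Same scan, but strict mode stops at the first non-buddy page. */
static unsigned long toy_isolate_fixed(const int *order, unsigned long start,
                                       unsigned long end, bool strict)
{
    unsigned long total_isolated = 0;
    unsigned long pfn = start;

    assert(start <= end);
    while (pfn < end) {
        if (order[pfn] == TOY_NOT_BUDDY) {
            if (strict)
                break;          /* all pages are needed, so give up now */
            pfn++;
            continue;
        }
        total_isolated += 1UL << order[pfn];
        pfn += 1UL << order[pfn];
    }
    /* In strict mode the scan must have covered the whole pfn range. */
    if (strict && pfn < end)
        total_isolated = 0;
    return total_isolated;
}
```

      With a six-page range in which the second page is in use and the fifth
      page heads an order-2 block, the count-based check sees seven isolated
      pages against six required and reports success; the pfn-based check
      reports failure.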
      Signed-off-by: Laura Abbott <lauraa@codeaurora.org>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com>
      Acked-by: Michal Nazarewicz <mina86@mina86.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      2af120bc
    • J
      mm: fix GFP_THISNODE callers and clarify · e97ca8e5
      Johannes Weiner committed
      GFP_THISNODE is for callers that implement their own clever fallback to
      remote nodes.  It restricts the allocation to the specified node and
      does not invoke reclaim, assuming that the caller will take care of it
      when the fallback fails, e.g.  through a subsequent allocation request
      without GFP_THISNODE set.
      
      However, many current GFP_THISNODE users only want the node exclusive
      aspect of the flag, without actually implementing their own fallback or
      triggering reclaim if necessary.  This results in things like page
      migration failing prematurely even when there is easily reclaimable
      memory available, unless kswapd happens to be running already or a
      concurrent allocation attempt triggers the necessary reclaim.
      
      Convert all callsites that don't implement their own fallback strategy
      to __GFP_THISNODE.  This restricts the allocation to a single node too, but
      at the same time allows the allocator to enter the slowpath, wake
      kswapd, and invoke direct reclaim if necessary, to make the allocation
      happen when memory is full.
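
      The distinction between the two flags can be sketched like this.  It
      assumes that GFP_THISNODE is a composite of __GFP_THISNODE plus
      no-retry and no-warn bits, and that the allocator skips the slowpath
      only when the whole composite is set.  The TOY_ bit values and
      toy_may_enter_slowpath() are hypothetical; the real definitions live
      in include/linux/gfp.h and mm/page_alloc.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only, chosen for this sketch. */
#define TOY___GFP_NORETRY  0x1u
#define TOY___GFP_NOWARN   0x2u
#define TOY___GFP_THISNODE 0x4u

/* Composite flag for callers that do their own fallback (assumed layout). */
#define TOY_GFP_THISNODE \
    (TOY___GFP_THISNODE | TOY___GFP_NOWARN | TOY___GFP_NORETRY)

static_assert((TOY_GFP_THISNODE & TOY___GFP_THISNODE) != 0,
              "composite must include the node-exclusive bit");

/*
 * The slowpath (waking kswapd, direct reclaim) is skipped only when every
 * bit of the composite is present in the request.
 */
static bool toy_may_enter_slowpath(unsigned int gfp_mask)
{
    return (gfp_mask & TOY_GFP_THISNODE) != TOY_GFP_THISNODE;
}
```

      A request carrying only the node-exclusive bit stays on one node but
      is still allowed to reclaim, which is the behaviour the converted
      callsites want.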
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Jan Stancek <jstancek@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e97ca8e5
  9. 07 Mar, 2014 3 commits
    • A
      percpu: speed alloc_pcpu_area() up · 3d331ad7
      Al Viro committed
      If we know that first N areas are all in use, we can obviously skip
      them when searching for a free one.  And that kind of hint is very
      easy to maintain.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      3d331ad7
    • A
      percpu: store offsets instead of lengths in ->map[] · 723ad1d9
      Al Viro committed
      Current code keeps +-length for each area in chunk->map[].  It has
      several unpleasant consequences:
      	* even if we know that first 50 areas are all in use, allocation
      still needs to go through all those areas just to sum their sizes, just
      to get the offset of free one.
      	* freeing needs to find the array entry referring to the area
      in question; again, we need to sum the sizes until we reach the offset
      we are interested in.  Note that offsets are monotonic, so a simple
      binary search would do here.
      
      	New data representation: array of <offset,in-use flag> pairs.
      Each pair is represented by one int - we use offset|1 for <offset, in use>
      and offset for <offset, free> (we make sure that all offsets are even).
      In the end we put a sentry entry - <total size, in use>.  The first
      entry is <0, flag>; it would be possible to store together the flag
      for Nth area and offset for N+1st, but that leads to much hairier code.
      
      In other words, where the old variant would have
      	4, -8, -4, 4, -12, 100
      (4 bytes free, 8 in use, 4 in use, 4 free, 12 in use, 100 free) we store
      	<0,0>, <4,1>, <12,1>, <16,0>, <20,1>, <32,0>, <132,1>
      i.e.
      	0, 5, 13, 16, 21, 32, 133
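
      The mapping from the old array to the new one can be written out as a
      short user-space sketch.  toy_lengths_to_offsets() is a hypothetical
      helper for illustration; the commit itself changes the representation
      in place rather than converting arrays.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Convert n signed lengths (positive = free, negative = in use) into
 * n + 1 entries of offset|flag. The last entry is the
 * <total size, in use> sentry. out must have room for n + 1 ints.
 */
static void toy_lengths_to_offsets(const int *old, int n, int *out)
{
    int off = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        out[i] = off | (old[i] < 0);   /* low bit set means in use */
        off += abs(old[i]);
    }
    out[n] = off | 1;
}
```

      Feeding it the six lengths above yields the seven entries shown,
      ending in the sentry 133.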
      
      This commit switches to new data representation and takes care of a couple
      of low-hanging fruits in free_pcpu_area() - one is the switch to binary
      search, another is not doing two memmove() when one would do.  Speeding
      the alloc side up (by keeping track of how many areas in the beginning are
      known to be all in use) also becomes possible - that'll be done in the next
      commit.
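
      The binary search that the new representation allows on the free side
      might look like the sketch below.  toy_find_area() is a hypothetical
      name and the function only locates the entry; it does not perform the
      merging that free_pcpu_area() does.

```c
#include <assert.h>

/*
 * Find the index of the entry whose offset equals target in a map of
 * offset|flag ints sorted by offset. Returns -1 if no entry matches.
 */
static int toy_find_area(const int *map, int n_entries, int target)
{
    int lo = 0;
    int hi = n_entries - 1;

    assert(map);
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        int off = map[mid] & ~1;       /* strip the in-use flag */

        if (off == target)
            return mid;
        if (off < target)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return -1;
}
```

      On the example map 0, 5, 13, 16, 21, 32, 133, looking up offset 12
      finds index 2 without summing any sizes.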
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      723ad1d9
    • A
      perpcu: fold pcpu_split_block() into the only caller · 706c16f2
      Al Viro committed
      ... and simplify the results a bit.  Makes the next step easier
      to deal with - we will be changing the data representation for
      chunk->map[] and it's easier to do if the code in question is
      not split between pcpu_alloc_area() and pcpu_split_block().
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      706c16f2
  10. 06 Mar, 2014 2 commits
  11. 04 Mar, 2014 2 commits
    • J
      mm: page_alloc: exempt GFP_THISNODE allocations from zone fairness · 27329369
      Johannes Weiner committed
      Jan Stancek reports manual page migration encountering allocation
      failures after some pages when there is still plenty of memory free, and
      bisected the problem down to commit 81c0a2bb ("mm: page_alloc: fair
      zone allocator policy").
      
      The problem is that GFP_THISNODE obeys the zone fairness allocation
      batches on one hand, but doesn't reset them and wake kswapd on the other
      hand.  After a few of those allocations, the batches are exhausted and
      the allocations fail.
      
      Fixing this means either having GFP_THISNODE wake up kswapd, or
      GFP_THISNODE not participating in zone fairness at all.  The latter
      seems safer as an acute bugfix; we can clean up later.
      Reported-by: Jan Stancek <jstancek@redhat.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Rik van Riel <riel@redhat.com>
      Acked-by: Mel Gorman <mgorman@suse.de>
      Cc: <stable@kernel.org>		[3.12+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27329369
    • V
      mm: include VM_MIXEDMAP flag in the VM_SPECIAL list to avoid m(un)locking · 9050d7eb
      Vlastimil Babka committed
      Daniel Borkmann reported a VM_BUG_ON assertion failing:
      
        ------------[ cut here ]------------
        kernel BUG at mm/mlock.c:528!
        invalid opcode: 0000 [#1] SMP
        Modules linked in: ccm arc4 iwldvm [...]
         video
        CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
        Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
        task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
        RIP: 0010:[<ffffffff81171ad0>]  [<ffffffff81171ad0>] munlock_vma_pages_range+0x2e0/0x2f0
        Call Trace:
          do_munmap+0x18f/0x3b0
          vm_munmap+0x41/0x60
          SyS_munmap+0x22/0x30
          system_call_fastpath+0x1a/0x1f
        RIP   munlock_vma_pages_range+0x2e0/0x2f0
        ---[ end trace a0088dcf07ae10f2 ]---
      
      because munlock_vma_pages_range() thinks it's unexpectedly in the middle
      of a THP page.  This can be reproduced with default config since 3.11
      kernels.  A reproducer can be found in the kernel's selftest directory
      for networking by running ./psock_tpacket.
      
      The problem is that an order=2 compound page (allocated by
      alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
      by packet_mmap()) and mistaken for a THP page and assumed to be order=9.
      
      The checks for THP in munlock came with commit ff6a6da6 ("mm:
      accelerate munlock() treatment of THP pages"), i.e.  since 3.9, but did
      not trigger a bug.  It just makes munlock_vma_pages_range() skip such
      compound pages until the next 512-pages-aligned page, when it encounters
      a head page.  This is however not a problem for vma's where mlocking has
      no effect anyway, but it can distort the accounting.
      
      Since commit 7225522b ("mm: munlock: batch non-THP page isolation
      and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
      PageTransHuge() check.
      
      This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
      list of flags that make vma's non-mlockable and non-mergeable.  The
      reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
      already on the VM_SPECIAL list, and both are intended for non-LRU pages
      where mlocking makes no sense anyway.  Related LKML discussion can be
      found in [2].
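
      The effect of widening the mask can be sketched as follows.  The TOY_
      flag values are illustrative and toy_vma_is_mlockable() is a
      hypothetical helper; the sketch assumes that, before this patch,
      VM_SPECIAL consisted of VM_IO, VM_DONTEXPAND and VM_PFNMAP.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag bits for this sketch only. */
#define TOY_VM_PFNMAP     0x00000400UL
#define TOY_VM_IO         0x00004000UL
#define TOY_VM_DONTEXPAND 0x00040000UL
#define TOY_VM_MIXEDMAP   0x10000000UL

/* Assumed mask before the patch, and the mask with VM_MIXEDMAP added. */
#define TOY_VM_SPECIAL_OLD (TOY_VM_IO | TOY_VM_DONTEXPAND | TOY_VM_PFNMAP)
#define TOY_VM_SPECIAL_NEW (TOY_VM_SPECIAL_OLD | TOY_VM_MIXEDMAP)

static_assert((TOY_VM_SPECIAL_NEW & TOY_VM_SPECIAL_OLD) == TOY_VM_SPECIAL_OLD,
              "the new mask must be a superset of the old one");

/* A vma is treated as mlockable only if none of the special bits is set. */
static bool toy_vma_is_mlockable(unsigned long vm_flags, unsigned long special)
{
    return (vm_flags & special) == 0;
}
```

      With the old mask a VM_MIXEDMAP vma passes the check and goes through
      mlock and munlock handling; with the new mask it is rejected up front,
      just like a VM_PFNMAP vma.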
      
       [1] tools/testing/selftests/net/psock_tpacket
       [2] https://lkml.org/lkml/2014/1/10/427

      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Reported-by: Daniel Borkmann <dborkman@redhat.com>
      Tested-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Thomas Hellstrom <thellstrom@vmware.com>
      Cc: John David Anglin <dave.anglin@bell.net>
      Cc: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
      Cc: Konstantin Khlebnikov <khlebnikov@openvz.org>
      Cc: Carsten Otte <cotte@de.ibm.com>
      Cc: Jared Hulbert <jaredeh@gmail.com>
      Tested-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org> [3.11.x+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9050d7eb