1. 11 Dec, 2014 40 commits
    • V
      mm, compaction: more focused lru and pcplists draining · fdaf7f5c
      Vlastimil Babka committed
      The goal of memory compaction is to create high-order freepages through
      page migration.  Page migration however puts pages on the per-cpu lru_add
      cache, which is later flushed to per-cpu pcplists, and only after the
      pcplists are drained can the pages actually merge.  This can happen due to
      the per-cpu caches becoming full through further freeing, or explicitly.
      
      During direct compaction, it is useful to do the draining explicitly so
      that pages merge as soon as possible and compaction can detect success
      immediately and keep the latency impact to a minimum.  However, the
      current implementation is far from ideal.  Draining is done only in
      __alloc_pages_direct_compact(), after all zones were already compacted,
      and the decisions to continue or stop compaction in individual zones
      were made without the last batch of migrations being merged.  It is also
      missing the draining of lru_add cache before the pcplists.
      
      This patch moves the draining for direct compaction into compact_zone().
      It adds the missing lru_cache draining and uses the newly introduced
      single zone pcplists draining to reduce overhead and avoid impact on
      unrelated zones.  Draining is only performed when it can actually lead to
      merging of a page of desired order (passed by cc->order).  This means it
      is only done when migration occurred in the previously scanned cc->order
      aligned block(s) and the migration scanner is now pointing to the next
      cc->order aligned block.
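      The drain condition described above can be sketched in plain C.  This
      is a minimal user-space model, not the patch itself: should_drain() and
      its parameters are hypothetical names, and last_migrated_pfn == 0 is
      assumed to mean that nothing was migrated since the last drain.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: decide whether draining could let a page of the
 * requested order merge.  Not a kernel function.
 */
bool should_drain(unsigned long last_migrated_pfn,
                  unsigned long migrate_pfn, unsigned int order)
{
    unsigned long block_start;

    /* Order-0 requests need no merging; 0 means "no migration pending". */
    if (order == 0 || last_migrated_pfn == 0)
        return false;

    /* Start of the order-aligned block the migration scanner points to. */
    block_start = migrate_pfn & ~((1UL << order) - 1);

    /* Drain only once the scanner has left the block where pages moved. */
    return last_migrated_pfn < block_start;
}
```

      For example, with order 9 (512-page blocks), a migration at pfn 100
      triggers no drain while the scanner is still at pfn 300, but does once
      the scanner reaches pfn 512.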
      
      The patch has been tested with stress-highalloc benchmark from mmtests.
      Although overall allocation success rates of the benchmark were not
      affected, the number of detected compaction successes has doubled.  This
      suggests that allocations were previously successful due to implicit
      merging caused by background activity, making a later allocation attempt
      succeed immediately, but not attributing the success to compaction.  Since
      stress-highalloc always tries to allocate almost the whole memory, it
      cannot show the improvement in its reported success rate metric.  However
      after this patch, compaction should detect success and terminate earlier,
      reducing the direct compaction latencies in a real scenario.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fdaf7f5c
    • V
      mm, compaction: always update cached scanner positions · 6bace090
      Vlastimil Babka committed
      Compaction caches the migration and free scanner positions between
      compaction invocations, so that the whole zone gets eventually scanned and
      there is no bias towards the initial scanner positions at the
      beginning/end of the zone.
      
      The cached positions are continuously updated as scanners progress and the
      updating stops as soon as a page is successfully isolated.  The reasoning
      behind this is that a pageblock where isolation succeeded is likely to
      succeed again in near future and it should be worth revisiting it.
      
      However, the downside is that potentially many pages are rescanned without
      successful isolation.  At worst, there might be a page where isolation
      from LRU succeeds but migration fails (potentially always).  So upon
      encountering this page, cached position would always stop being updated
      for no good reason.  It might have been useful to let such a page be
      rescanned with sync compaction after async one failed, but this is now
      handled by caching scanner position for async and sync mode separately
      since commit 35979ef3 ("mm, compaction: add per-zone migration pfn
      cache for async compaction").
      
      After this patch, cached positions are updated unconditionally.  In the
      stress-highalloc benchmark, this has decreased the number of scanned
      pages by a few percent, without affecting allocation success rates.
      
      To prevent free scanner from leaving free pages behind after they are
      returned due to page migration failure, the cached scanner pfn is changed
      to point to the pageblock of the returned free page with the highest pfn,
      before leaving compact_zone().
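      The adjustment of the cached free scanner position can be sketched as
      follows.  This is a simplified user-space model: the function name,
      the PAGEBLOCK_NR_PAGES value (order-9 pageblocks) and the convention
      that 0 means "no free page was returned" are assumptions made for
      illustration.

```c
/* Assumed pageblock size in pages (order 9); real systems may differ. */
#define PAGEBLOCK_NR_PAGES 512UL

/*
 * Hypothetical helper: return the new cached free scanner pfn, given the
 * highest pfn among free pages returned after a migration failure.
 */
unsigned long update_cached_free_pfn(unsigned long cached_free_pfn,
                                     unsigned long highest_returned_pfn)
{
    unsigned long block_pfn;

    if (highest_returned_pfn == 0)
        return cached_free_pfn; /* nothing was returned */

    /* The cached pfn always points to the first page of a pageblock. */
    block_pfn = highest_returned_pfn & ~(PAGEBLOCK_NR_PAGES - 1);

    /* The free scanner moves downwards, so only ever move the cache up. */
    return block_pfn > cached_free_pfn ? block_pfn : cached_free_pfn;
}
```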
      
      [akpm@linux-foundation.org: coding-style fixes]
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6bace090
    • V
      mm, compaction: defer only on COMPACT_COMPLETE · f8669795
      Vlastimil Babka committed
      Deferred compaction is employed to avoid compacting a zone where sync
      direct compaction has recently failed.  As such, it makes sense to only
      defer when a full zone was scanned, which is when compact_zone() returns
      COMPACT_COMPLETE.  It's less useful to defer when compact_zone() returns
      with apparent success (COMPACT_PARTIAL), followed by a watermark check
      failure, which can happen due to parallel allocation activity.  It also
      does not make much sense to defer compaction which was completely skipped
      (COMPACT_SKIPPED) for being unsuitable in the first place.
      
      This patch therefore makes deferred compaction trigger only when
      COMPACT_COMPLETE is returned from compact_zone().  Results of the
      stress-highalloc benchmark show the difference is within measurement
      error, so the issue is rather cosmetic.
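      The resulting rule can be sketched as a small predicate.  The enums
      below are local stand-ins that only borrow the kernel's names (their
      values are not the kernel's), should_defer() is a hypothetical helper,
      and the restriction to non-async mode is an assumption drawn from the
      "sync direct compaction" wording above.

```c
#include <stdbool.h>

/* Local stand-ins for the status and mode constants named in the text. */
enum compact_status { COMPACT_SKIPPED, COMPACT_PARTIAL, COMPACT_COMPLETE };
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT };

/* Defer only when sync compaction scanned the whole zone without success. */
bool should_defer(enum compact_status status, enum migrate_mode mode)
{
    return status == COMPACT_COMPLETE && mode != MIGRATE_ASYNC;
}
```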
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f8669795
    • V
      mm, compaction: simplify deferred compaction · 97d47a65
      Vlastimil Babka committed
      Since commit 53853e2d ("mm, compaction: defer each zone individually
      instead of preferred zone"), compaction is deferred for each zone where
      sync direct compaction fails, and reset where it succeeds.  However, it
      was observed that for the DMA zone, compaction often appeared to succeed
      while the subsequent allocation attempt would not, due to a different
      outcome of the watermark check.
      
      In order to properly defer compaction in this zone, the candidate zone
      has to be passed back to __alloc_pages_direct_compact() and compaction
      deferred in the zone after the allocation attempt fails.
      
      The main source of mismatch between the watermark checks in compaction
      and allocation was the lack of alloc_flags and classzone_idx values in
      compaction, which has been fixed in the previous patch.  So with this
      problem fixed, we can simplify the code by removing the candidate_zone
      parameter and deferring in __alloc_pages_direct_compact().
      
      After this patch, the compaction activity during stress-highalloc
      benchmark is still somewhat increased, but it's negligible compared to the
      increase that occurred without the better watermark checking.  This
      suggests that it is still possible to apparently succeed in compaction but
      fail to allocate, possibly due to parallel allocation activity.
      
      [akpm@linux-foundation.org: fix build]
      Suggested-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97d47a65
    • V
      mm, compaction: pass classzone_idx and alloc_flags to watermark checking · ebff3980
      Vlastimil Babka committed
      Compaction relies on zone watermark checks for decisions such as whether
      it is worth starting compaction in compaction_suitable() or whether
      compaction should stop in compact_finished().  The watermark checks take
      classzone_idx and alloc_flags parameters, which are related to the memory
      allocation request.  But from the context of compaction they are currently
      passed as 0, including in direct compaction, which is invoked to satisfy
      the allocation request and could therefore know the proper values.
      
      The lack of proper values can lead to mismatch between decisions taken
      during compaction and decisions related to the allocation request.  Lack
      of proper classzone_idx value means that lowmem_reserve is not taken into
      account.  This has manifested (during recent changes to deferred
      compaction) when DMA zone was used as fallback for preferred Normal zone.
      compaction_suitable() without proper classzone_idx would think that the
      watermarks are already satisfied, but watermark check in
      get_page_from_freelist() would fail.  Because of this problem, deferring
      compaction has extra complexity that can be removed in the following
      patch.
      
      The issue (not confirmed in practice) with missing alloc_flags is opposite
      in nature.  For allocations that include ALLOC_HIGH, ALLOC_HIGHER or
      ALLOC_CMA in alloc_flags (the last includes all MOVABLE allocations on
      CMA-enabled systems) the watermark checking in compaction with 0 passed
      will be stricter than in get_page_from_freelist().  In these cases
      compaction might be running for a longer time than is really needed.
      
      Another issue in compaction_suitable() is that the check for "does the
      zone need compaction at all?" comes only after the check "does the zone
      have enough free pages to succeed compaction".  The latter considers extra
      pages for migration and can therefore in some situations fail and return
      COMPACT_SKIPPED, although the high-order allocation would succeed and we
      should return COMPACT_PARTIAL.
      
      This patch fixes these problems by adding alloc_flags and classzone_idx to
      struct compact_control and related functions involved in direct compaction
      and watermark checking.  Where possible, all other callers of
      compaction_suitable() pass proper values where those are known.  This is
      currently limited to classzone_idx, which is sometimes known in kswapd
      context.  However, the direct reclaim callers should_continue_reclaim()
      and compaction_ready() do not currently know the proper values, so the
      coordination between reclaim and compaction may still not be as accurate
      as it could.  This can be fixed later, if it's shown to be an issue.
      
      Additionally, the checks in compaction_suitable() are reordered to address
      the check-ordering issue described above.
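      The reordered checks can be illustrated with a toy model.  This is a
      sketch under stated assumptions, not the kernel code: struct zone_model
      and suitable() are hypothetical, the watermark test is reduced to
      "free pages above the low watermark", the migration headroom is taken
      as 2 << order pages, and the fragmentation index check is omitted.

```c
#include <stdbool.h>

/* Local stand-ins for the status constants named in the text. */
enum compact_status { COMPACT_SKIPPED, COMPACT_PARTIAL, COMPACT_CONTINUE };

/* Toy zone: just enough state to show the order of the two checks. */
struct zone_model {
    unsigned long free_pages;
    unsigned long low_wmark;
    bool high_order_page_free; /* a free page of the wanted order exists */
};

enum compact_status suitable(const struct zone_model *z, unsigned int order)
{
    /* 1. Does the zone need compaction at all? */
    if (z->free_pages > z->low_wmark && z->high_order_page_free)
        return COMPACT_PARTIAL;

    /* 2. Are there enough free pages, including migration headroom? */
    if (z->free_pages <= z->low_wmark + (2UL << order))
        return COMPACT_SKIPPED;

    return COMPACT_CONTINUE;
}
```

      With 1100 free pages, a low watermark of 1000 and order 9, the headroom
      check alone would fail, yet the zone already holds a suitable page.
      Running the "needed at all?" check first reports COMPACT_PARTIAL in
      that case instead of COMPACT_SKIPPED.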
      
      The effect of this patch should be slightly better high-order allocation
      success rates and/or less compaction overhead, depending on the type of
      allocations and presence of CMA.  It allows simplifying deferred
      compaction code in a followup patch.
      
      When testing with stress-highalloc, there was some slight improvement
      (which might be just due to variance) in success rates of non-THP-like
      allocations.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Christoph Lameter <cl@linux.com>
      Acked-by: Rik van Riel <riel@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ebff3980
    • J
      mm: vmscan: count only dirty pages as congested · 1da58ee2
      Jamie Liu committed
      shrink_page_list() counts all pages with a mapping, including clean pages,
      toward nr_congested if they're on a write-congested BDI.
      shrink_inactive_list() then sets ZONE_CONGESTED if nr_dirty ==
      nr_congested.  Fix this apples-to-oranges comparison by only counting
      pages for nr_congested if they count for nr_dirty.
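      The corrected accounting can be sketched in user-space C.  The types
      and function names below are hypothetical, a page is reduced to two
      flags, and the non-zero guard in zone_congested() is an assumption.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page: only the two properties the counters care about. */
struct page_model {
    bool dirty;
    bool on_congested_bdi;
};

struct scan_stats {
    unsigned long nr_dirty;
    unsigned long nr_congested;
};

/* Count a page toward nr_congested only if it also counts as dirty. */
struct scan_stats count_pages(const struct page_model *pages, size_t n)
{
    struct scan_stats s = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (!pages[i].dirty)
            continue; /* clean pages count toward neither total */
        s.nr_dirty++;
        if (pages[i].on_congested_bdi)
            s.nr_congested++;
    }
    return s;
}

/* Both counters now describe the same set of pages, so they compare. */
bool zone_congested(struct scan_stats s)
{
    return s.nr_dirty != 0 && s.nr_dirty == s.nr_congested;
}
```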
      Signed-off-by: Jamie Liu <jamieliu@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Greg Thelen <gthelen@google.com>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1da58ee2
    • Y
      mm: verify compound order when freeing a page · ab1f306f
      Yu Zhao committed
      This allows us to catch the bug fixed in the previous patch (mm: free
      compound page with correct order).
      
      Here we also verify whether a page is tail page or not -- tail pages are
      supposed to be freed along with their head, not by themselves.
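      The two checks can be modelled as a predicate.  This is an illustrative
      sketch only: struct page_model and free_is_valid() are hypothetical,
      and the kernel would raise a debugging assertion rather than return a
      boolean.

```c
#include <stdbool.h>

/* Toy page descriptor with only the fields relevant to the checks. */
struct page_model {
    bool head;                   /* head page of a compound page */
    bool tail;                   /* tail page of a compound page */
    unsigned int compound_order; /* meaningful only for head pages */
};

/* Return true if freeing this page with the given order looks sane. */
bool free_is_valid(const struct page_model *p, unsigned int order)
{
    if (p->tail)
        return false; /* tail pages are freed along with their head */
    if (p->head && p->compound_order != order)
        return false; /* compound page freed with the wrong order */
    return true;
}
```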
      Signed-off-by: Yu Zhao <yuzhao@google.com>
      Reviewed-by: "Kirill A. Shutemov" <kirill@shutemov.name>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Bob Liu <lliubbo@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ab1f306f
    • A
      cma: make default CMA area size zero for x86 · d7be003a
      Akinobu Mita committed
      This makes the default CMA area size zero on x86 (other architectures
      are unchanged).  If the default CMA size is zero, DMA_CMA is disabled.
      It can be enabled by passing cma= to the kernel.
      
      This reduces the impact on x86: no mainline driver requires CMA on x86,
      and Peter Hurley reported a performance regression, as this is trying
      to drive _all_ dma mapping allocations through a _very_ small window.
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Reported-by: Peter Hurley <peter@hurleysoftware.com>
      Cc: Peter Hurley <peter@hurleysoftware.com>
      Cc: Chuck Ebbert <cebbert.lkml@gmail.com>
      Cc: Jean Delvare <jdelvare@suse.de>
      Acked-by: Marek Szyprowski <m.szyprowski@samsung.com>
      Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Don Dutile <ddutile@redhat.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Yinghai Lu <yinghai@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7be003a
    • V
      mm, memory_hotplug/failure: drain single zone pcplists · c0554329
      Vlastimil Babka committed
      Memory hotplug and failure mechanisms have several places where pcplists
      are drained so that pages are returned to the buddy allocator and can be
      e.g. prepared for offlining.  Since this is always done in the context
      of a single zone, we can reduce the pcplists drain to the single zone,
      which is now possible.
      
      The change should make memory offlining due to hotremove or failure
      faster and stop it from disturbing unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c0554329
    • V
      mm, cma: drain single zone pcplists · 510f5507
      Vlastimil Babka committed
      CMA allocation drains pcplists so that pages can merge back to the buddy
      allocator.  Since it operates on a single zone, we can reduce the
      pcplists drain to the single zone, which is now possible.
      
      The change should make CMA allocations faster and stop them from
      disturbing unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      510f5507
    • V
      mm, page_isolation: drain single zone pcplists · ec25af84
      Vlastimil Babka committed
      When setting MIGRATE_ISOLATE on a pageblock, pcplists are drained to
      have a better chance that all pages will be successfully isolated and
      not left in the per-cpu caches.  Since isolation is always concerned
      with a single zone, we can reduce the pcplists drain to the single zone,
      which is now possible.
      
      The change should make memory isolation faster and stop it from
      disturbing unrelated pcplists.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec25af84
    • V
      mm: introduce single zone pcplists drain · 93481ff0
      Vlastimil Babka committed
      The functions for draining per-cpu pages back to buddy allocators
      currently always operate on all zones.  There are however several cases
      where the drain is only needed in the context of a single zone, and
      spilling other pcplists is a waste of time both due to the extra
      spilling and later refilling.
      
      This patch introduces a new zone pointer parameter to drain_all_pages()
      and changes the dummy parameter of drain_local_pages() to also be a zone
      pointer.  When NULL is passed, the functions operate on all zones as
      usual.  Passing a specific zone pointer reduces the work to the single
      zone.
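      The NULL-means-all-zones convention can be sketched with a toy model.
      The code below is a user-space illustration under stated assumptions:
      struct zone_model and drain_pages_model() are hypothetical, and per-cpu
      lists are collapsed into a single counter per zone.

```c
#include <stddef.h>

/* Toy zone: pages parked on pcplists versus pages in the buddy lists. */
struct zone_model {
    unsigned long pcp_count;
    unsigned long buddy_free;
};

/* Move all pages parked on one zone's pcplists back to the buddy lists. */
void drain_zone_model(struct zone_model *z)
{
    z->buddy_free += z->pcp_count;
    z->pcp_count = 0;
}

/*
 * Drain a single zone when one is given, or every zone when zone is NULL,
 * mirroring the parameter convention described above.
 */
void drain_pages_model(struct zone_model *zones, size_t nr_zones,
                       struct zone_model *zone)
{
    if (zone) {
        drain_zone_model(zone);
        return;
    }
    for (size_t i = 0; i < nr_zones; i++)
        drain_zone_model(&zones[i]);
}
```

      Passing a specific zone leaves the other zones' counters untouched,
      which is the saving the later single-zone conversions rely on.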
      
      All callers are updated to pass the NULL pointer in this patch.
      Conversion to single zone (where appropriate) is done in further
      patches.
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: Zhang Yanfei <zhangyanfei@cn.fujitsu.com>
      Cc: Xishi Qiu <qiuxishi@huawei.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Michal Nazarewicz <mina86@mina86.com>
      Cc: Marek Szyprowski <m.szyprowski@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93481ff0
    • P
      mm/vmscan.c: replace printk with pr_err · 8612c663
      Pintu Kumar committed
      This patch replaces the printk(KERN_ERR ...) call found under
      shrink_slab with pr_err().  The shorter call also saves one line.
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8612c663
    • P
      mm/vmalloc.c: replace printk with pr_warn · 0cbc8533
      Pintu Kumar committed
      This patch replaces printk(KERN_WARNING ...) with pr_warn().
      The shorter call also saves one line.
      Signed-off-by: Pintu Kumar <pintu.k@samsung.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0cbc8533
    • J
      mm: memcontrol: remove synchronous stock draining code · 6d3d6aa2
      Johannes Weiner committed
      With charge reparenting, the last synchronous stock drainer left.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6d3d6aa2
    • J
      mm: memcontrol: continue cache reclaim from offlined groups · b2052564
      Johannes Weiner committed
      On cgroup deletion, outstanding page cache charges are moved to the parent
      group so that they're not lost and can be reclaimed during pressure
      on/inside said parent.  But this reparenting is fairly tricky and its
      synchronous nature has led to several lock-ups in the past.
      
      Since c2931b70 ("cgroup: iterate cgroup_subsys_states directly") css
      iterators now also include offlined css, so memcg iterators can be changed
      to include offlined children during reclaim of a group, and leftover cache
      can just stay put.
      
      There is a slight change of behavior in that charges of deleted groups no
      longer show up as local charges in the parent.  But they are still
      included in the parent's hierarchical statistics.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b2052564
    • J
      mm: memcontrol: remove obsolete kmemcg pinning tricks · 64f21993
      Johannes Weiner committed
      As charges now pin the css explicitly, there is no more need for kmemcg
      to acquire a proxy reference for outstanding pages during offlining, or
      maintain state to identify such "dead" groups.
      
      This was the last user of the uncharge functions' return values, so remove
      them as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      64f21993
    • J
      mm: memcontrol: take a css reference for each charged page · e8ea14cc
      Johannes Weiner committed
      Charges currently pin the css indirectly by playing tricks during
      css_offline(): user pages stall the offlining process until all of them
      have been reparented, whereas kmemcg acquires a keep-alive reference if
      outstanding kernel pages are detected at that point.
      
      In preparation for removing all this complexity, make the pinning explicit
      and acquire a css reference for every charged page.
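      The idea of explicit pinning can be illustrated with a toy reference
      count.  Everything below is a hypothetical user-space model: the
      struct, the function names and the rule that the group may be released
      once it is offlined and holds no references are simplifications, not
      the cgroup core's actual lifetime rules.

```c
#include <stdbool.h>

/* Toy group: one reference is held for every charged page. */
struct group_model {
    long refs;
    bool offlined; /* set when the user-visible cgroup is deleted */
};

/* Charging nr_pages pins the group nr_pages times. */
void charge_pages(struct group_model *g, long nr_pages)
{
    g->refs += nr_pages;
}

/* Uncharging drops the references taken at charge time. */
void uncharge_pages(struct group_model *g, long nr_pages)
{
    g->refs -= nr_pages;
}

/* The group can go away only after offlining and the last uncharge. */
bool can_release(const struct group_model *g)
{
    return g->offlined && g->refs == 0;
}
```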
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e8ea14cc
    • J
      mm: memcontrol: convert reclaim iterator to simple css refcounting · 5ac8fb31
      Johannes Weiner committed
      The memcg reclaim iterators use a complicated weak reference scheme to
      prevent pinning cgroups indefinitely in the absence of memory pressure.
      
      However, during the ongoing cgroup core rework, css lifetime has been
      decoupled such that a pinned css no longer interferes with removal of
      the user-visible cgroup, and all this complexity is now unnecessary.
      
      [mhocko@suse.cz: ensure that the cached reference is always released]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5ac8fb31
    • J
      kernel: res_counter: remove the unused API · 5b1efc02
      Johannes Weiner committed
      All memory accounting and limiting has been switched over to the
      lockless page counters.  Bye, res_counter!
      
      [akpm@linux-foundation.org: update Documentation/cgroups/memory.txt]
      [mhocko@suse.cz: ditch the last remainings of res_counter]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Paul Bolle <pebolle@tiscali.nl>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5b1efc02
    • J
      mm: hugetlb_cgroup: convert to lockless page counters · 71f87bee
      Johannes Weiner committed
      Abandon the spinlock-protected byte counters in favor of the unlocked
      page counters in the hugetlb controller as well.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      71f87bee
    • J
      mm: memcontrol: lockless page counters · 3e32cb2e
      Johannes Weiner committed
      Memory is internally accounted in bytes, using spinlock-protected 64-bit
      counters, even though the smallest accounting delta is a page.  The
      counter interface is also convoluted and does too many things.
      
      Introduce a new lockless word-sized page counter API, then change all
      memory accounting over to it.  The translation from and to bytes then only
      happens when interfacing with userspace.
      
      The removed locking overhead is noticeable when scaling beyond the per-cpu
      charge caches - on a 4-socket machine with 144 threads, the following test
      shows the performance differences of 288 memcgs concurrently running a
      page fault benchmark:
      
      vanilla:
      
         18631648.500498      task-clock (msec)         #  140.643 CPUs utilized            ( +-  0.33% )
               1,380,638      context-switches          #    0.074 K/sec                    ( +-  0.75% )
                  24,390      cpu-migrations            #    0.001 K/sec                    ( +-  8.44% )
           1,843,305,768      page-faults               #    0.099 M/sec                    ( +-  0.00% )
      50,134,994,088,218      cycles                    #    2.691 GHz                      ( +-  0.33% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       8,049,712,224,651      instructions              #    0.16  insns per cycle          ( +-  0.04% )
       1,586,970,584,979      branches                  #   85.176 M/sec                    ( +-  0.05% )
           1,724,989,949      branch-misses             #    0.11% of all branches          ( +-  0.48% )
      
           132.474343877 seconds time elapsed                                          ( +-  0.21% )
      
      lockless:
      
         12195979.037525      task-clock (msec)         #  133.480 CPUs utilized            ( +-  0.18% )
                 832,850      context-switches          #    0.068 K/sec                    ( +-  0.54% )
                  15,624      cpu-migrations            #    0.001 K/sec                    ( +- 10.17% )
           1,843,304,774      page-faults               #    0.151 M/sec                    ( +-  0.00% )
      32,811,216,801,141      cycles                    #    2.690 GHz                      ( +-  0.18% )
         <not supported>      stalled-cycles-frontend
         <not supported>      stalled-cycles-backend
       9,999,265,091,727      instructions              #    0.30  insns per cycle          ( +-  0.10% )
       2,076,759,325,203      branches                  #  170.282 M/sec                    ( +-  0.12% )
           1,656,917,214      branch-misses             #    0.08% of all branches          ( +-  0.55% )
      
            91.369330729 seconds time elapsed                                          ( +-  0.45% )
      
      On top of improved scalability, this also gets rid of the icky long long
      types in the very heart of memcg, which is great for 32 bit and also makes
      the code a lot more readable.
      
      Notable differences between the old and new API:
      
      - res_counter_charge() and res_counter_charge_nofail() become
        page_counter_try_charge() and page_counter_charge() resp. to match
        the more common kernel naming scheme of try_do()/do()
      
      - res_counter_uncharge_until() is only ever used to cancel a local
        counter and never to uncharge bigger segments of a hierarchy, so
        it's replaced by the simpler page_counter_cancel()
      
      - res_counter_set_limit() is replaced by page_counter_limit(), which
        expects its callers to serialize against themselves
      
      - res_counter_memparse_write_strategy() is replaced by
        page_counter_limit(), which rounds down to the nearest page size -
        rather than up.  This is more reasonable for explicitly requested
        hard upper limits.
      
      - to keep charging light-weight, page_counter_try_charge() charges
        speculatively, only to roll back if the result exceeds the limit.
        Because of this, a failing bigger charge can temporarily lock out
        smaller charges that would otherwise succeed.  The error is bounded
        to the difference between the smallest and the biggest possible
        charge size, so for memcg, this means that a failing THP charge can
        send base page charges into reclaim up to 2MB (4MB) before the limit
        would have been reached.  This should be acceptable.
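
The speculative charging described in the last bullet can be sketched in userspace C11. This is a minimal illustration, not the kernel implementation: `demo_counter`, `demo_init()`, `demo_try_charge()` and `demo_read()` are hypothetical names, the counter has no parent hierarchy, and C11 atomics stand in for the kernel's atomic helpers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical single-level counter (assumption: no parent hierarchy). */
struct demo_counter {
    atomic_long count;  /* pages currently charged */
    long limit;         /* hard limit in pages */
};

static void demo_init(struct demo_counter *c, long limit)
{
    atomic_init(&c->count, 0);
    c->limit = limit;
}

/* Charge first, check afterwards, and roll back if the limit was exceeded. */
static bool demo_try_charge(struct demo_counter *c, long nr_pages)
{
    long new_count;

    assert(nr_pages > 0);
    new_count = atomic_fetch_add(&c->count, nr_pages) + nr_pages;
    if (new_count > c->limit) {
        /* Until this rollback lands, the count is inflated and a
         * concurrent smaller charge can fail although it would fit. */
        atomic_fetch_sub(&c->count, nr_pages);
        return false;
    }
    return true;
}

static long demo_read(struct demo_counter *c)
{
    return atomic_load(&c->count);
}
```

With a limit of 512 pages and 500 already charged, a 512-page charge briefly raises the count to 1012 before it is rolled back. A concurrent 12-page charge arriving in that window would fail even though it fits under the limit, which is the bounded error the bullet describes.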
      
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE and memparse]
      [akpm@linux-foundation.org: add includes for WARN_ON_ONCE, memparse, strncmp, and PAGE_SIZE]
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3e32cb2e
    • P
      slab: replace smp_read_barrier_depends() with lockless_dereference() · 8df0c2dc
      Pranith Kumar committed
      Recently lockless_dereference() was added, which can be used in place of
      hard-coding smp_read_barrier_depends().  This patch makes the
      change.
      Signed-off-by: Pranith Kumar <bobby.prani@gmail.com>
      Cc: "Paul E. McKenney" <paulmck@us.ibm.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8df0c2dc
    • A
      slab: improve checking for invalid gfp_flags · c871ac4e
      Andrew Morton committed
      The code goes BUG, but doesn't tell us which bits were unexpectedly set.
      Print that out.
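
One way to report the offending bits before aborting can be sketched in userspace C. The mask value and the names `DEMO_ALLOWED_MASK` and `demo_report_bad_flags()` are hypothetical; the kernel's gfp masks and BUG() are not reproduced here.

```c
#include <assert.h>
#include <stdio.h>

#define DEMO_ALLOWED_MASK 0x0fu  /* hypothetical: bits a caller may pass */

/* Format the unexpected bits into buf and return them; 0 means valid flags. */
static unsigned int demo_report_bad_flags(unsigned int flags,
                                          char *buf, size_t len)
{
    unsigned int bad = flags & ~DEMO_ALLOWED_MASK;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    if (bad)
        snprintf(buf, len, "unexpected flags: 0x%x", bad);
    return bad;
}
```

A caller would log the formatted message and only then abort, so the report names the exact bits instead of just the failure.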
      
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c871ac4e
    • A
      mm: slub: fix format mismatches in slab_err() callers · f6edde9c
      Andrey Ryabinin committed
      Adding __printf(3, 4) to slab_err exposed the following:
      
        mm/slub.c: In function `check_slab':
        mm/slub.c:852:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
            s->name, page->objects, maxobj);
            ^
        mm/slub.c:852:4: warning: too many arguments for format [-Wformat-extra-args]
        mm/slub.c:857:4: warning: format `%u' expects argument of type `unsigned int', but argument 4 has type `const char *' [-Wformat=]
            s->name, page->inuse, page->objects);
            ^
        mm/slub.c:857:4: warning: too many arguments for format [-Wformat-extra-args]
      
        mm/slub.c: In function `on_freelist':
        mm/slub.c:905:4: warning: format `%d' expects argument of type `int', but argument 5 has type `long unsigned int' [-Wformat=]
            "should be %d", page->objects, max_objects);
      
      Fix the first two warnings by removing the redundant s->name.
      Fix the last by changing the type of max_objects from unsigned long to int.
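
How a printf-style format annotation lets the compiler catch such mismatches can be sketched as follows. This assumes GCC or Clang, whose `format` attribute is what a `__printf(3, 4)` annotation is taken to stand for here; `demo_err()` is a hypothetical function, not slab_err().

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical reporter: argument 3 is the format string and the
 * variadic arguments start at position 4, hence format(printf, 3, 4). */
__attribute__((format(printf, 3, 4)))
static int demo_err(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    assert(buf != NULL && len > 0);
    va_start(args, fmt);
    n = vsnprintf(buf, len, fmt, args);
    va_end(args);
    return n;
}
```

With the attribute in place, a call such as `demo_err(buf, sizeof(buf), "%u", "text")` draws a -Wformat warning at compile time, while a call whose arguments match the format compiles cleanly.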
      Signed-off-by: Andrey Ryabinin <a.ryabinin@samsung.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f6edde9c
    • J
      mm/slab: reverse iteration on find_mergeable() · 54362057
      Joonsoo Kim committed
      Unlike in SLUB, objects in SLAB sometimes do not start at the beginning
      of the slab.  This causes an unalignment problem once slab merging is
      supported by commit 12220dea ("mm/slab: support slab merge").  An
      alignment mismatch check was introduced ("mm/slab: fix unalignment problem
      on Malta with EVA due to slab merge") to prevent merging in this case.
      
      This has the undesirable result that merging happens between infrequently
      used kmem_caches.  For example, kmem_caches whose size is 256 bytes
      are merged into pool_workqueue rather than kmalloc-256, because the
      kmem_caches for kmalloc are at the tail of the list.
      
      To prevent this situation, this patch reverses the iteration order in
      find_mergeable() so that frequently used kmem_caches are found first.
      This change helps to merge a kmem_cache into frequently used kmem_caches,
      such as the kmalloc kmem_caches.
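
The effect of the iteration order can be illustrated with a simplified lookup. This is only a sketch: `demo_cache` and `demo_find()` are hypothetical, an array stands in for the kernel's cache list, and size is the only merge criterion, whereas the real find_mergeable() checks more than that.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced cache descriptor. */
struct demo_cache {
    const char *name;
    size_t size;
};

/* Return the first cache of the given size, scanning from the head
 * (reverse == 0) or from the tail (reverse != 0) of the list. */
static const struct demo_cache *demo_find(const struct demo_cache *list,
                                          size_t n, size_t size, int reverse)
{
    assert(list != NULL);
    for (size_t i = 0; i < n; i++) {
        const struct demo_cache *c = &list[reverse ? n - 1 - i : i];

        if (c->size == size)
            return c;
    }
    return NULL;
}
```

With pool_workqueue near the head and kmalloc-256 at the tail, a forward scan for a 256-byte cache returns pool_workqueue, while a reverse scan returns kmalloc-256.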
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      54362057
    • V
      slab: print slabinfo header in seq show · 1df3b26f
      Vladimir Davydov committed
      Currently we print the slabinfo header in the seq start method, which
      makes it unusable for showing leaks, so we have leaks_show, which does
      practically the same as s_show except it doesn't show the header.
      
      However, we can print the header in the seq show method - we only need
      to check if the current element is the first on the list.  This will
      allow us to use the same set of seq iterators for both leaks and
      slabinfo reporting, which is nice.
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1df3b26f
    • L
      mm: slab/slub: coding style: whitespaces and tabs mixture · b455def2
      LQYMGT committed
      Some code in mm/slab.c and mm/slub.c uses spaces for indentation.
      Clean it up.
      Signed-off-by: LQYMGT <lqymgt@gmail.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b455def2
    • J
      fs/char_dev.c: remove pointless assignment from __register_chrdev_region() · e2ab879e
      Jan Kara committed
      In one place we assign the major number we found to ret.  That assignment is
      then never used and actually doesn't make any sense given how the code is
      currently structured (the assignment comes from pre-git times).  Just
      remove it.
      
      Coverity id: 1226852.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2ab879e
    • D
      ocfs2: remove unneeded NULL check · b3e3e5af
      Dan Carpenter committed
      In commit 1faf2894 ("ocfs2_dlm: disallow a domain join if node maps
      mismatch") we introduced a new earlier NULL check so this one is not
      needed.  Also static checkers complain because we dereference it first
      and then check for NULL.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b3e3e5af
    • D
      ocfs2: remove bogus NULL check in ocfs2_move_extents() · 88d69b92
      Dan Carpenter committed
      "inode" isn't NULL here, and also we dereference it on the previous line
      so static checkers get annoyed.
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      88d69b92
    • J
      ocfs2: do not set filesystem readonly if link down · 61fb9ea4
      jiangyiwen committed
      Do not set the filesystem readonly if the storage link is down.  In this
      case, metadata is not corrupted and only -EIO is returned.  If the
      metadata is indeed corrupted, ocfs2_error() has already been called in
      ocfs2_validate_inode_block().
      Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      61fb9ea4
    • X
      ocfs2: do not set OCFS2_LOCK_UPCONVERT_FINISHING if nonblocking lock can not be granted at once · d1e78238
      Xue jiufei committed
      ocfs2_readpages() uses the nonblocking flag to avoid page lock inversion.
      It will trigger a cluster hang because the OCFS2_LOCK_UPCONVERT_FINISHING
      flag is not cleared if the nonblocking lock cannot be granted at once.
      The flag prevents the dc thread from downconverting, so other nodes can
      never acquire this lockres.
      
      So we should not set OCFS2_LOCK_UPCONVERT_FINISHING when receiving the
      ast if the nonblocking lock request has already returned.
      Signed-off-by: joyce.xue <xuejiufei@huawei.com>
      Reviewed-by: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d1e78238
    • J
      ocfs2: fix error handling when creating debugfs root in ocfs2_init() · dc171580
      Jan Kara committed
      Error handling is broken when creation of the debugfs root in
      ocfs2_init() fails.  Although the error code is set, we fail to exit
      ocfs2_init() with an error and thus initialization ends with success.
      Later, when mounting a filesystem, ocfs2 debugfs entries end up being
      created in the root of the debugfs filesystem, which is confusing.
      
      Fix the error handling to bail out.
      
      Coverity id: 1227009.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Joseph Qi <joseph.qi@huawei.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc171580
    • G
      ocfs2: remove filesize checks for sync I/O journal commit · 86b9c6f3
      Goldwyn Rodrigues committed
      Filesize is not a good indication that the file needs to be synced.
      An example where this breaks is:
       1. Open the file in O_SYNC|O_RDWR
       2. Read a small portion of the file (say 64 bytes)
       3. Lseek to the start of the file
       4. Write 64 bytes

      If the node crashes, the data is not written out to disk because it was
      not committed in the journal, and the other node which reads the file
      after recovery reads stale data (even if the write on the other node was
      successful).
      Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.de>
      Reviewed-by: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      86b9c6f3
    • J
      ocfs2: o2net: fix connect expired · 196fe71d
      Junxiao Bi committed
      Setting nn_persistent_error to -ENOTCONN will stop reconnecting, since
      the "stop" condition in o2net_start_connect() will be true.
      
          stop = (nn->nn_sc ||
                      (nn->nn_persistent_error &&
                      (nn->nn_persistent_error != -ENOTCONN || timeout == 0)));
      
      As a result, the connection will never be established if the first
      connection request is lost.
      
      Set nn_persistent_error to 0 when the connect expires to fix this.  With
      this change, dlm will not be woken up when the connect expires.  This is
      OK: dlm depends on the network, so it can do nothing in this case even if
      woken up.  Let it wait there for the network to recover and the
      connection to be built again before continuing.
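
The quoted "stop" expression can be evaluated in isolation to see how the value of nn_persistent_error changes the outcome. This is a userspace sketch: `demo_node` and `demo_should_stop()` are hypothetical stand-ins, and the type of `timeout` is an assumption; only the boolean expression is taken from the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the node fields that the condition reads. */
struct demo_node {
    void *nn_sc;              /* non-NULL when a socket container exists */
    int nn_persistent_error;  /* 0 or a negative errno value */
};

/* Mirrors the quoted expression; timeout's type is an assumption. */
static bool demo_should_stop(const struct demo_node *nn, unsigned int timeout)
{
    assert(nn != NULL);
    return nn->nn_sc ||
           (nn->nn_persistent_error &&
            (nn->nn_persistent_error != -ENOTCONN || timeout == 0));
}
```

With no socket container, an error of -ENOTCONN yields stop only when `timeout` is 0, any other non-zero error always yields stop, and an error of 0 never does, so clearing the error leaves reconnection possible.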
      Signed-off-by: Junxiao Bi <junxiao.bi@oracle.com>
      Reviewed-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      196fe71d
    • S
      ocfs2: o2dlm: fix a race between purge and master query · cb79662b
      Srinivas Eeda committed
      Node A sends a master query request to node B, which is the master.  At
      this time the lockres happens to be on the purgelist.
      dlm_master_request_handler gets the dlm spinlock, finds the resource and
      releases the dlm spinlock.  Right at this point, dlm_thread on this node
      could purge the lockres.  dlm_master_request_handler can then acquire the
      lockres spinlock and reply to node A that node B is the master even
      though the lockres on node B is purged.
      
      The above scenario makes node A falsely think node B is the master,
      which is inconsistent.  Further, if another node C tries to master the
      same resource, every node will respond that it is not the master.  Node C
      then masters the resource and sends an assert master to all nodes.  This
      makes node A crash with the following message:
      
      dlm_assert_master_handler:1831 ERROR: DIE! Mastery assert from 9, but current
      owner is 10!
      Signed-off-by: Srinivas Eeda <srinivas.eeda@oracle.com>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Reviewed-by: Wengang Wang <wen.gang.wang@oracle.com>
      Tested-by: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cb79662b
    • J
      ocfs2: report error from o2hb_do_disk_heartbeat() to user · f5425fce
      Jan Kara committed
      Report the return value of o2hb_do_disk_heartbeat() as part of the
      ML_HEARTBEAT message so that we know whether a heartbeat actually
      happened or not.  This also puts the assigned but otherwise unused 'ret'
      variable to use.
      
      Coverity id: 1227053.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f5425fce
    • J
      ocfs2: remove bogus test from ocfs2_read_locked_inode() · 4a635a11
      Jan Kara committed
      'args' are always set for ocfs2_read_locked_inode() and brelse() checks
      whether bh is NULL.  So the test (args && bh) is unnecessary (plus the
      args part is really confusing anyway).  Remove it.
      
      Coverity id: 1128856.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Mark Fasheh <mfasheh@suse.com>
      Cc: Joel Becker <jlbec@evilplan.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4a635a11