08 Oct 2016: 40 commits
• mm/page_ext: support extra space allocation by page_ext user · 980ac167
  Joonsoo Kim committed
Until now, if a page_ext user wanted its own field in page_ext, the
field had to be hard-coded into struct page_ext.  This wastes memory in
the following situation.
      
        struct page_ext {
         #ifdef CONFIG_A
        	int a;
         #endif
         #ifdef CONFIG_B
        	int b;
         #endif
        };
      
Assume that the kernel is built with both CONFIG_A and CONFIG_B.  Even
if we enable feature A but not feature B at runtime, each entry of
struct page_ext takes two ints rather than one.  This is an undesirable
result, so this patch tries to fix it.
      
To solve the above problem, this patch implements support for extra
space allocation at runtime.  When a user's need() callback returns
true, its extra memory requirement is added to the entry size of
page_ext, and the offset of that user's extra space within each entry
is returned.  With this offset, the user can access its extra space
without hard-coding a field into page_ext.
      
This patch only implements the infrastructure.  A following patch will
use it for page_owner, currently the only user with its own fields on
page_ext.
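The registration scheme described above can be sketched in plain C. This is a hypothetical, simplified mirror of the kernel's page_ext_operations; the struct, field names and helper below are illustrative, not the actual kernel API:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the mechanism: each page_ext user declares how
 * many extra bytes it needs and a need() callback that decides at boot
 * whether the feature is active. */
struct page_ext_ops {
    size_t size;            /* extra bytes requested by this user */
    bool (*need)(void);     /* true if the feature is enabled at runtime */
    size_t offset;          /* filled in: where the user's space lives */
};

static bool need_yes(void) { return true;  }
static bool need_no(void)  { return false; }

/* Sum the requirements of all enabled users into the per-page entry
 * size, handing each enabled user its offset within an entry. */
static size_t page_ext_entry_size(struct page_ext_ops **ops, int n,
                                  size_t base_size)
{
    size_t total = base_size;
    for (int i = 0; i < n; i++) {
        if (ops[i]->need()) {
            ops[i]->offset = total;
            total += ops[i]->size;
        }
    }
    return total;
}
```

A disabled user (need() returning false) contributes nothing to the entry size, which is exactly the memory saving the patch is after.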
      
Link: http://lkml.kernel.org/r/1471315879-32294-6-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/page_ext: rename offset to index · 0b06bb3f
  Joonsoo Kim committed
Here, 'offset' means the entry index in the page_ext array.  A
following patch will use 'offset' for the field offset within each
entry, so rename the current 'offset' to 'index' to prevent confusion.
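With a runtime-sized entry, the lookup by the renamed 'index' is plain byte arithmetic. A minimal sketch (the helper name mirrors the kernel's get_entry() but the signature here is illustrative):

```c
#include <stddef.h>

/* 'index' selects an entry in the page_ext array; a later patch uses
 * 'offset' for a field inside one entry.  With a runtime entry size,
 * lookup is byte arithmetic on a char pointer. */
static void *get_entry(void *base, size_t entry_size, unsigned long index)
{
    return (char *)base + entry_size * index;
}
```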
      
Link: http://lkml.kernel.org/r/1471315879-32294-5-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/page_owner: move page_owner specific function to page_owner.c · e2f612e6
  Joonsoo Kim committed
There is no reason for page_owner-specific functions to reside in
vmstat.c.
      
Link: http://lkml.kernel.org/r/1471315879-32294-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/debug_pagealloc.c: don't allocate page_ext if we don't use guard page · f1c1e9f7
  Joonsoo Kim committed
What debug_pagealloc does is just map and unmap page table entries; it
doesn't need additional memory to remember anything.  With the guard
page feature, however, it requires additional memory to record whether
a page is a guard page.  Guard pages are only used when
debug_guardpage_minorder is non-zero, so this patch skips the
additional memory allocation (page_ext) when debug_guardpage_minorder
is zero.

This saves memory when only debug_pagealloc is used, without guard
pages.
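The idea reduces to a conditional need() callback. A hypothetical sketch (the variable and function names mirror the kernel's, but the bodies here are simplified stand-ins):

```c
#include <stdbool.h>

/* debug_pagealloc itself only maps/unmaps page tables, so page_ext
 * space is requested only when the guard-page feature is in use,
 * i.e. when debug_guardpage_minorder is non-zero. */
static unsigned int debug_guardpage_minorder;

static bool need_debug_guardpage(void)
{
    /* No guard pages requested on the command line: skip the
     * page_ext allocation entirely and save the memory. */
    return debug_guardpage_minorder != 0;
}
```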
      
Link: http://lkml.kernel.org/r/1471315879-32294-3-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Reviewed-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm/debug_pagealloc.c: clean-up guard page handling code · acbc15a4
  Joonsoo Kim committed
      Patch series "Reduce memory waste by page extension user".
      
This patchset tries to reduce memory wasted by page extension users.

The first case is architecture-supported debug_pagealloc.  It doesn't
require additional memory if guard pages aren't used; 8 bytes per page
are saved in this case.

The second case is related to the page owner feature.  Until now, if a
page_ext user wanted its own fields on page_ext, the fields had to be
hard-coded into struct page_ext.  This has the following problem.
      
        struct page_ext {
         #ifdef CONFIG_A
        	int a;
         #endif
         #ifdef CONFIG_B
  	int b;
         #endif
        };
      
Assume that the kernel is built with both CONFIG_A and CONFIG_B.  Even
if we enable feature A but not feature B at runtime, each entry of
struct page_ext takes two ints rather than one.  This is undesirable
waste, so this patchset tries to reduce it.  With this patchset, we can
save 20 bytes per page dedicated to the page owner feature in some
configurations.
      
      This patch (of 6):
      
We can clean up the code by moving the decision condition for
set_page_guard() into set_page_guard() itself.  This helps code
readability.  There is no functional change.
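The shape of the cleanup can be sketched as follows. This is a hypothetical, heavily simplified model: the enable flag, parameters and condition are illustrative, not the kernel's actual set_page_guard() signature:

```c
#include <stdbool.h>

/* Before: callers tested the guard conditions themselves.  After the
 * cleanup, the condition lives inside set_page_guard() and callers
 * just act on the return value. */
static bool guard_pages_enabled = true;

static bool set_page_guard(unsigned int order, unsigned int max_order)
{
    /* Decision condition moved inside the function (simplified). */
    if (!guard_pages_enabled || order >= max_order)
        return false;   /* caller falls back to the normal path */
    /* ... mark the page as a guard page ... */
    return true;
}
```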
      
Link: http://lkml.kernel.org/r/1471315879-32294-2-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, vmscan: get rid of throttle_vm_writeout · bf484383
  Michal Hocko committed
throttle_vm_writeout() was introduced back in 2005 to fix OOMs caused
by excessive pageout activity during reclaim.  Too many pages could be
put under writeback, so the LRUs would be full of unreclaimable pages
until the IO completed, and in turn the OOM killer could be invoked.
      
There have been some important changes in the reclaim path since then,
though.  Writers are throttled by balance_dirty_pages when initiating
buffered IO, and later, under memory pressure, direct reclaim is
throttled by wait_iff_congested if the node is considered congested by
dirty pages on the LRUs and the underlying bdi is congested by queued
IO.  kswapd is throttled as well if it encounters pages marked for
immediate reclaim or under writeback, which signals that there are too
many pages under writeback already.  Finally, should_reclaim_retry does
congestion_wait if the reclaim cannot make any progress and there are
too many dirty/writeback pages.
      
Another important aspect is that we do not issue any IO from the direct
reclaim context anymore.  Under a heavy parallel load this could queue
a lot of IO which would be very scattered and thus inefficient, and
would just make the problem worse.
      
These three mechanisms should throttle and keep the amount of IO in a
steady state even under heavy IO and memory pressure, so yet another
throttling point doesn't really seem helpful.  Quite the contrary:
Mikulas Patocka has reported that swap backed by dm-crypt doesn't work
properly, because the swapout IO cannot make sufficient progress as the
writeout path depends on the dm_crypt worker, which has to allocate
memory to perform the encryption.  In order to guarantee forward
progress it relies on the mempool allocator.  mempool_alloc(), however,
prefers to use the underlying (usually page) allocator before it grabs
objects from the pool.  Such an allocation can dive into memory reclaim
and consequently into throttle_vm_writeout.  If there are too many
dirty pages or pages under writeback, it will get throttled even though
it is in fact a flusher meant to clear pending pages.
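The mempool behaviour described above can be modelled with a toy allocator. This is purely illustrative; struct fields and names are invented, and real mempool_alloc() sleeps and retries rather than returning NULL:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy model: mempool_alloc() first tries the underlying allocator
 * (which may dive into reclaim and, with throttle_vm_writeout in
 * place, get the flusher stuck), and only falls back to the reserved
 * pool when that fails. */
struct toy_mempool {
    void *reserve[4];
    int nr_free;
    bool underlying_alloc_fails;   /* models an allocator under pressure */
};

static void *toy_mempool_alloc(struct toy_mempool *pool)
{
    if (!pool->underlying_alloc_fails)
        return malloc(16);         /* preferred path: page allocator */
    if (pool->nr_free > 0)
        return pool->reserve[--pool->nr_free];  /* guaranteed reserve */
    return NULL;                   /* the kernel would sleep and retry */
}
```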
      
        kworker/u4:0    D ffff88003df7f438 10488     6      2	0x00000000
        Workqueue: kcryptd kcryptd_crypt [dm_crypt]
        Call Trace:
          schedule+0x3c/0x90
          schedule_timeout+0x1d8/0x360
          io_schedule_timeout+0xa4/0x110
          congestion_wait+0x86/0x1f0
          throttle_vm_writeout+0x44/0xd0
          shrink_zone_memcg+0x613/0x720
          shrink_zone+0xe0/0x300
          do_try_to_free_pages+0x1ad/0x450
          try_to_free_pages+0xef/0x300
          __alloc_pages_nodemask+0x879/0x1210
          alloc_pages_current+0xa1/0x1f0
          new_slab+0x2d7/0x6a0
          ___slab_alloc+0x3fb/0x5c0
          __slab_alloc+0x51/0x90
          kmem_cache_alloc+0x27b/0x310
          mempool_alloc_slab+0x1d/0x30
          mempool_alloc+0x91/0x230
          bio_alloc_bioset+0xbd/0x260
          kcryptd_crypt+0x114/0x3b0 [dm_crypt]
      
Let's just drop throttle_vm_writeout altogether.  It is not very
helpful anymore.
      
I have tried to test a potential writeback IO runaway similar to the
one described in the original patch which introduced
throttle_vm_writeout [1]: a small virtual machine (512MB RAM, 4 CPUs,
2G of swap space, and a disk image on a rather slow NFS in sync mode on
the host) with 8 parallel writers, each writing 1G worth of data.  As
soon as the pagecache fills up and direct reclaim kicks in, I start an
anon memory consumer in a loop (allocating 300M and exiting after
populating it) in the background, to make the memory pressure even
stronger and to disrupt the steady state of the IO.  Direct reclaim is
throttled because of the congestion, and kswapd hits congestion_wait
due to nr_immediate, but throttle_vm_writeout never triggers the sleep
throughout the test.  Dirty+writeback stay close to nr_dirty_threshold,
with some fluctuations caused by the anon consumer.
      
      [1] https://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc1/2.6.9-rc1-mm3/broken-out/vm-pageout-throttling.patch
Link: http://lkml.kernel.org/r/1471171473-21418-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Mikulas Patocka <mpatocka@redhat.com>
Cc: Marcelo Tosatti <mtosatti@redhat.com>
Cc: NeilBrown <neilb@suse.com>
Cc: Ondrej Kozina <okozina@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm: fix set pageblock migratetype in deferred struct page init · e780149b
  Xishi Qiu committed
On x86_64, MAX_ORDER_NR_PAGES is usually 4M and a pageblock is usually
2M, so we set only one pageblock's migratetype in deferred_free_range()
if the pfn is aligned to MAX_ORDER_NR_PAGES.  This leaves blocks with
an uninitialized migratetype: as "cat /proc/pagetypeinfo" shows, almost
half the blocks are Unmovable.

We also missed freeing the last block in deferred_init_memmap(), which
causes a memory leak.
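The fix amounts to setting the migratetype once per pageblock rather than only at MAX_ORDER boundaries. A simplified sketch; the constant and the counting helper below are illustrative stand-ins for set_pageblock_migratetype():

```c
/* One pageblock = 2M of 4K pages on x86_64. */
#define PAGEBLOCK_NR_PAGES 512UL

/* Walk a pfn range and perform one "set migratetype" per pageblock
 * boundary; here we just count how many pageblocks get initialized. */
static unsigned long set_range_migratetype(unsigned long start_pfn,
                                           unsigned long nr_pages)
{
    unsigned long pfn, nr_set = 0;

    for (pfn = start_pfn; pfn < start_pfn + nr_pages; pfn++) {
        /* one set_pageblock_migratetype() per pageblock boundary */
        if ((pfn & (PAGEBLOCK_NR_PAGES - 1)) == 0)
            nr_set++;
    }
    return nr_set;
}
```

With the old MAX_ORDER-aligned check, a 4M range would initialize only one of its two pageblocks; the per-pageblock walk covers both.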
      
      Fixes: ac5d2539 ("mm: meminit: reduce number of times pageblocks are set during struct page init")
Link: http://lkml.kernel.org/r/57A3260F.4050709@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mem-hotplug: fix node spanned pages when we have a movable node · e506b996
  Xishi Qiu committed
Commit 342332e6 ("mm/page_alloc.c: introduce kernelcore=mirror option")
rewrote the calculation of node spanned pages.  But when we have a
movable node, the node's spanned pages are counted twice.  That's
because we end up with an empty Normal zone: its present pages are
zero, but its spanned pages are not.
      
      e.g.
          Zone ranges:
            DMA      [mem 0x0000000000001000-0x0000000000ffffff]
            DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
            Normal   [mem 0x0000000100000000-0x0000007c7fffffff]
          Movable zone start for each node
            Node 1: 0x0000001080000000
            Node 2: 0x0000002080000000
            Node 3: 0x0000003080000000
            Node 4: 0x0000003c80000000
            Node 5: 0x0000004c80000000
            Node 6: 0x0000005c80000000
          Early memory node ranges
            node   0: [mem 0x0000000000001000-0x000000000009ffff]
            node   0: [mem 0x0000000000100000-0x000000007552afff]
            node   0: [mem 0x000000007bd46000-0x000000007bd46fff]
            node   0: [mem 0x000000007bdcd000-0x000000007bffffff]
            node   0: [mem 0x0000000100000000-0x000000107fffffff]
            node   1: [mem 0x0000001080000000-0x000000207fffffff]
            node   2: [mem 0x0000002080000000-0x000000307fffffff]
            node   3: [mem 0x0000003080000000-0x0000003c7fffffff]
            node   4: [mem 0x0000003c80000000-0x0000004c7fffffff]
            node   5: [mem 0x0000004c80000000-0x0000005c7fffffff]
            node   6: [mem 0x0000005c80000000-0x0000006c7fffffff]
            node   7: [mem 0x0000006c80000000-0x0000007c7fffffff]
      
        node1:
          Normal, start=0x1080000, present=0x0, spanned=0x1000000
          Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
          pgdat, start=0x1080000, present=0x1000000, spanned=0x2000000
      
      After this patch, the problem is fixed.
      
        node1:
          Normal, start=0x0, present=0x0, spanned=0x0
          Movable, start=0x1080000, present=0x1000000, spanned=0x1000000
          pgdat, start=0x1080000, present=0x1000000, spanned=0x1000000
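The fix can be checked numerically with node1 from the log: a Normal zone fully consumed by ZONE_MOVABLE must report zero spanned pages, so the pgdat total no longer double-counts the range. The helper below is a hypothetical simplification (pfn values taken from the example above):

```c
/* Clamp a kernel zone so it ends where ZONE_MOVABLE begins; a zone
 * entirely swallowed by the movable range spans zero pages. */
static unsigned long zone_spanned(unsigned long zone_start,
                                  unsigned long zone_end,
                                  unsigned long movable_start)
{
    unsigned long end = zone_end < movable_start ? zone_end : movable_start;
    return end > zone_start ? end - zone_start : 0;
}
```

For node1, the Normal zone (0x1080000..0x2080000) with Movable starting at 0x1080000 spans 0 pages, while the Movable zone keeps its 0x1000000 pages, matching the "after" output.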
      
Link: http://lkml.kernel.org/r/57A325E8.6070100@huawei.com
Signed-off-by: Xishi Qiu <qiuxishi@huawei.com>
Cc: Taku Izumi <izumi.taku@jp.fujitsu.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: "Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, vmscan: make compaction_ready() more accurate and readable · fdd4c614
  Vlastimil Babka committed
compaction_ready() is used during direct reclaim for costly-order
allocations to skip reclaim for zones where compaction should be
attempted instead.  It combines the standard compaction_suitable()
check with its own watermark check based on the high watermark plus an
extra gap, and the result is confusing at best.
      
This patch attempts to better structure and document the checks
involved.  First, compaction_suitable() can determine that the
allocation should either succeed already, or that compaction doesn't
have enough free pages to proceed.  The third possibility is that
compaction has enough free pages, but we still decide to reclaim first,
unless we are already above the high watermark plus gap.  This does not
mean that reclaim will actually reach this watermark during a single
attempt; it is rather an over-reclaim protection.  So document the code
as such.  The check for compaction_deferred() is removed completely, as
it in fact had no proper role here.

The result after this patch is mainly less confusing code.  We also
skip some over-reclaim in cases where the allocation should already
succeed.
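The three-way decision described above can be sketched as a small pure function. This is a hedged model, not the kernel code: the two booleans stand in for compaction_suitable()'s outcomes, and all names are illustrative:

```c
#include <stdbool.h>

enum ready { RECLAIM_FIRST, COMPACT_INSTEAD };

/* suitable_success models compaction_suitable() saying the allocation
 * would already succeed; suitable_continue models "compaction has
 * enough free pages to proceed". */
static enum ready compaction_ready_sketch(bool suitable_success,
                                          bool suitable_continue,
                                          unsigned long free,
                                          unsigned long high_wmark_plus_gap)
{
    if (suitable_success)
        return COMPACT_INSTEAD;     /* no reclaim needed at all */
    if (!suitable_continue)
        return RECLAIM_FIRST;       /* not enough free pages to compact */
    /* Enough pages to compact: still reclaim a bit first, unless we are
     * already above the high watermark plus gap (over-reclaim guard). */
    return free > high_wmark_plus_gap ? COMPACT_INSTEAD : RECLAIM_FIRST;
}
```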
      
Link: http://lkml.kernel.org/r/20160810091226.6709-12-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: require only min watermarks for non-costly orders · 8348faf9
  Vlastimil Babka committed
The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction.  Then __isolate_free_page() uses a low watermark check to
decide if a particular free page can be isolated.  In the latter case,
using the low watermark is needlessly pessimistic, as the free page
isolations are only temporary.  For __compaction_suitable() the higher
watermark makes sense for high-order allocations, where more free pages
increase the chance of success and the allocation can typically fail
and fall back to order-0 when the system is struggling to reach that
watermark.  But for a low-order allocation, forming the page should not
be that hard, so using the low watermark here might just prevent
compaction from even trying, and eventually lead to the OOM killer even
if we are above the min watermarks.

So after this patch, we use the min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().
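The watermark choice after this patch reduces to a one-liner. A minimal sketch, assuming the mainline value of PAGE_ALLOC_COSTLY_ORDER (3); the helper name and parameters are illustrative:

```c
/* PAGE_ALLOC_COSTLY_ORDER on mainline kernels. */
#define COSTLY_ORDER 3

/* Non-costly orders are checked against the min watermark in
 * __compaction_suitable(); costly orders keep the low watermark. */
static unsigned long suitable_wmark(unsigned int order,
                                    unsigned long min_wmark,
                                    unsigned long low_wmark)
{
    return order <= COSTLY_ORDER ? min_wmark : low_wmark;
}
```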
      
      [vbabka@suse.cz: clarify __isolate_free_page() comment]
       Link: http://lkml.kernel.org/r/7ae4baec-4eca-e70b-2a69-94bea4fb19fa@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-11-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: use proper alloc_flags in __compaction_suitable() · 984fdba6
  Vlastimil Babka committed
      The __compaction_suitable() function checks the low watermark plus a
      compact_gap() gap to decide if there's enough free memory to perform
      compaction.  This check uses direct compactor's alloc_flags, but that's
      wrong, since these flags are not applicable for freepage isolation.
      
      For example, alloc_flags may indicate access to memory reserves, making
      compaction proceed, and then fail watermark check during the isolation.
      
A similar problem exists for ALLOC_CMA, which may be part of
alloc_flags but does not apply during freepage isolation.  In this
case, however, it makes sense to use ALLOC_CMA in both
__compaction_suitable() and __isolate_free_page(), since there's
actually nothing preventing the freepage scanner from isolating pages
from CMA pageblocks, on the assumption that a page that could be
migrated once by compaction can be migrated again later by a CMA
allocation.  Thus we should count pages in CMA pageblocks when
considering compaction suitability and when isolating freepages.
      
      To sum up, this patch should remove some false positives from
      __compaction_suitable(), and allow compaction to proceed when free pages
      required for compaction reside in the CMA pageblocks.
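The CMA accounting boils down to whether free pages in CMA pageblocks count toward the check. A hypothetical sketch (the flag value and helper are illustrative; the kernel's actual watermark code subtracts CMA reserves when ALLOC_CMA is absent):

```c
/* Illustrative flag value, not the kernel's. */
#define ALLOC_CMA 0x1u

/* With ALLOC_CMA set, free pages in CMA pageblocks count toward the
 * suitability check, since the free scanner may isolate from them;
 * without it, CMA reserves are excluded. */
static unsigned long usable_free(unsigned long free_pages,
                                 unsigned long free_cma,
                                 unsigned int alloc_flags)
{
    if (alloc_flags & ALLOC_CMA)
        return free_pages;          /* CMA pages already included */
    return free_pages - free_cma;   /* exclude CMA reserves */
}
```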
      
Link: http://lkml.kernel.org/r/20160810091226.6709-10-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      984fdba6
• mm, compaction: create compact_gap wrapper · 9861a62c
  Vlastimil Babka committed
      Compaction uses a watermark gap of (2UL << order) pages at various
      places and it's not immediately obvious why.  Abstract it through a
      compact_gap() wrapper to create a single place with a thorough
      explanation.
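The wrapper itself is tiny; the value (2UL << order) comes straight from the text above, while the comment wording here is a paraphrase:

```c
/* Single place for the watermark gap compaction uses: room for the
 * migration targets plus a worst case of freepages isolated by the
 * free scanner, i.e. twice the requested order. */
static unsigned long compact_gap(unsigned int order)
{
    return 2UL << order;
}
```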
      
      [vbabka@suse.cz: clarify the comment of compact_gap()]
       Link: http://lkml.kernel.org/r/7b6aed1f-fdf8-2063-9ff4-bbe4de712d37@suse.cz
Link: http://lkml.kernel.org/r/20160810091226.6709-9-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: use correct watermark when checking compaction success · f2b8228c
  Vlastimil Babka committed
The __compact_finished() function uses the low watermark in a check
that has to pass if direct compaction is to finish and the allocation
is to succeed.  This is too pessimistic, as the allocation will
typically use the min watermark.  It may happen that during compaction
we drop below the low watermark (due to parallel activity) but still
form the target high-order page.  By checking against the low
watermark, we might needlessly continue compaction.
      
      Similarly, __compaction_suitable() uses low watermark in a check whether
      allocation can succeed without compaction.  Again, this is unnecessarily
      pessimistic.
      
After this patch, these checks will use the direct compactor's
alloc_flags to determine the watermark, which is effectively the min
watermark.
      
Link: http://lkml.kernel.org/r/20160810091226.6709-8-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: add the ultimate direct compaction priority · a8e025e5
  Vlastimil Babka committed
During the reclaim/compaction loop, it's desirable to get a final
answer from an unsuccessful compaction so we can either fail the
allocation or invoke the OOM killer.  However, heuristics such as
deferred compaction or pageblock skip bits can cause compaction to skip
parts of zones or whole zones, leading to premature OOMs, failures, or
excessive reclaim/compaction retries.
      
      To remedy this, we introduce a new direct compaction priority called
      COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
      
       - ignore deferred compaction status for a zone
       - ignore pageblock skip hints
       - ignore cached scanner positions and scan the whole zone
      
The new priority should eventually get picked up by
should_compact_retry(), which should improve success rates for costly
allocations using __GFP_REPEAT, such as hugetlbfs allocations, and
reduce some corner-case OOMs for non-costly allocations.
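The three overrides listed above map cleanly onto per-run control flags. A hedged sketch; the enum layout, struct and mapping function are illustrative, not the kernel's compact_control:

```c
#include <stdbool.h>

/* Illustrative priority enum; only the FULL/others distinction matters
 * for this sketch. */
enum compact_priority {
    COMPACT_PRIO_SYNC_FULL,     /* the new, most aggressive priority */
    COMPACT_PRIO_SYNC_LIGHT,
    COMPACT_PRIO_ASYNC,
};

struct compact_flags_sketch {
    bool ignore_skip_hint;      /* ignore pageblock skip bits */
    bool ignore_deferred;       /* ignore deferred compaction status */
    bool whole_zone;            /* ignore cached scanner positions */
};

/* COMPACT_PRIO_SYNC_FULL turns off every heuristic at once. */
static struct compact_flags_sketch prio_to_flags(enum compact_priority prio)
{
    bool full = (prio == COMPACT_PRIO_SYNC_FULL);
    return (struct compact_flags_sketch){
        .ignore_skip_hint = full,
        .ignore_deferred  = full,
        .whole_zone       = full,
    };
}
```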
      
      Link: http://lkml.kernel.org/r/20160810091226.6709-6-vbabka@suse.cz
      [vbabka@suse.cz: use the MIN_COMPACT_PRIORITY alias]
  Link: http://lkml.kernel.org/r/d443b884-87e7-1c93-8684-3a3a35759fb1@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: don't recheck watermarks after COMPACT_SUCCESS · 7ceb009a
  Vlastimil Babka committed
Joonsoo has reminded me that in a later patch changing watermark checks
throughout compaction I forgot to update the checks in
try_to_compact_pages() and kcompactd_do_work().  Closer inspection,
however, shows that they are now redundant in the success case, because
compact_zone() now reliably reports success with COMPACT_SUCCESS, so
the checks effectively just repeat (a subset of) checks that have just
passed.  So instead of checking the watermarks again, just test the
return value.
      
Note it's also possible that compaction would declare failure, e.g.
because its find_suitable_fallback() is stricter than a simple
watermark check, while the watermark check we are removing would still
succeed.  After this patch that is no longer possible, which is
arguably better: for long-term fragmentation avoidance we should rather
try a different zone than allocate with an unsuitable fallback.  If
compaction of all zones fails and the allocation is important enough,
it will retry and succeed anyway.
      
      Also remove the stray "bool success" variable from kcompactd_do_work().
      
Link: http://lkml.kernel.org/r/20160810091226.6709-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Reported-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@kernel.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: rename COMPACT_PARTIAL to COMPACT_SUCCESS · cf378319
  Vlastimil Babka committed
COMPACT_PARTIAL has historically meant that compaction returned after
doing some work without fully compacting the zone.  It didn't, however,
distinguish whether compaction terminated because it succeeded in
creating the requested high-order page.  This has changed recently, and
now we only return COMPACT_PARTIAL when compaction thinks it succeeded,
or when the high-order watermark check in compaction_suitable() passes
and no compaction needs to be done.
      
      So at this point we can make the return value clearer by renaming it to
      COMPACT_SUCCESS.  The next patch will remove some redundant tests for
      success where compaction just returned COMPACT_SUCCESS.
      
Link: http://lkml.kernel.org/r/20160810091226.6709-4-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: cleanup unused functions · 791cae96
  Vlastimil Babka committed
      Since kswapd compaction moved to kcompactd, compact_pgdat() is not
      called anymore, so we remove it.  The only caller of __compact_pgdat()
      is compact_node(), so we merge them and remove code that was only
      reachable from kswapd.
      
Link: http://lkml.kernel.org/r/20160810091226.6709-3-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
• mm, compaction: make whole_zone flag ignore cached scanner positions · 06ed2998
  Vlastimil Babka committed
Patch series "make direct compaction more deterministic".
      
This is mostly a followup to Michal's OOM detection rework, which
highlighted the need for direct compaction to provide better feedback
in the reclaim/compaction loop, so that it can reliably recognize when
compaction cannot make further progress and the allocation should
invoke the OOM killer or fail.  We discussed this at LSF/MM [1], where
I proposed expanding the async/sync migration mode used in compaction
into more general "priorities".  This patchset adds one new priority
that just overrides all the heuristics and makes compaction fully scan
all zones.  I don't currently think we need more fine-grained
priorities, but we'll see.  Other than that, there are some smaller
fixes and cleanups, mainly related to the THP-specific hacks.
      
I've tested this with stress-highalloc in GFP_KERNEL order-4 and
THP-like order-9 scenarios.  There's some improvement in the compaction
stats for order-4, likely due to the better watermark handling.  In the
previous version I reported mostly noise wrt compaction stats and
decreased direct reclaim; now reclaim shows no difference.  I believe
this is due to the less aggressive compaction priority increase in
patch 6.

"before" is an mmotm tree prior to the 4.7 release, plus the first part
of the series that was sent and merged separately.
      
                                          before        after
      order-4:
      
      Compaction stalls                    27216       30759
      Compaction success                   19598       25475
      Compaction failures                   7617        5283
      Page migrate success                370510      464919
      Page migrate failure                 25712       27987
      Compaction pages isolated           849601     1041581
      Compaction migrate scanned       143146541   101084990
      Compaction free scanned          208355124   144863510
      Compaction cost                       1403        1210
      
      order-9:
      
      Compaction stalls                     7311        7401
      Compaction success                    1634        1683
      Compaction failures                   5677        5718
      Page migrate success                194657      183988
      Page migrate failure                  4753        4170
      Compaction pages isolated           498790      456130
      Compaction migrate scanned          565371      524174
      Compaction free scanned            4230296     4250744
      Compaction cost                        215         203
      
      [1] https://lwn.net/Articles/684611/
      
      This patch (of 11):
      
      A recent patch has added a whole_zone flag that compaction sets when
      scanning starts from the zone boundary, in order to report that the
      zone has been fully scanned in one attempt.  For allocations that want
      to try really hard or cannot fail, we will want to introduce a mode
      where scanning the whole zone is guaranteed regardless of the cached
      positions.
      
      This patch reuses the whole_zone flag so that if it is already passed
      as true to compaction, the cached scanner positions are ignored.
      Employing this flag during reclaim/compaction loop will be done in the
      next patch.  This patch however converts compaction invoked from
      userspace via procfs to use this flag.  Before this patch, the cached
      positions were first reset to zone boundaries and then read back from
      struct zone, so there was a window where a parallel compaction could
      replace the reset values, making the manual compaction less effective.
      Using the flag instead of performing reset is more robust.
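      A minimal standalone sketch of the flag's effect (all _model names are
      illustrative stand-ins, not the real mm/compaction.c structures, which
      are far richer):

```c
#include <stdbool.h>

/* Toy models of the relevant pieces of struct zone and compact_control. */
struct zone_model {
	unsigned long zone_start_pfn;
	unsigned long cached_migrate_pfn;	/* where the last scan stopped */
};

struct compact_control_model {
	bool whole_zone;			/* caller demands a full scan */
};

/*
 * With whole_zone set, the cached position is simply ignored rather than
 * being reset in struct zone first, so a parallel compaction run cannot
 * clobber the reset value before it is read back.
 */
static unsigned long migrate_scan_start(const struct zone_model *zone,
					const struct compact_control_model *cc)
{
	return cc->whole_zone ? zone->zone_start_pfn
			      : zone->cached_migrate_pfn;
}

static unsigned long demo_scan_start(bool whole_zone)
{
	struct zone_model z = { .zone_start_pfn = 0, .cached_migrate_pfn = 512 };
	struct compact_control_model cc = { .whole_zone = whole_zone };

	return migrate_scan_start(&z, &cc);
}
```

      With whole_zone set the scan starts at the zone boundary; otherwise it
      resumes from the cached position.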
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20160810091226.6709-2-vbabka@suse.cz
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Tested-by: Lorenzo Stoakes <lstoakes@gmail.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      06ed2998
    • M
      mm/oom_kill.c: fix task_will_free_mem() comment · 5870c2e1
      Michal Hocko 提交于
      Attempt to demystify the task_will_free_mem() loop.
      
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5870c2e1
    • Z
      mm/vmalloc.c: fix align value calculation error · 252e5c6e
      zijun_hu 提交于
      Using fls_long() causes a doubled alignment requirement in
      __get_vm_area_node() if the size parameter is a power of 2 and
      VM_IOREMAP is set in the flags parameter, for example:
      size=0x10000 -> fls_long(0x10000)=17 -> align=0x20000
      
      get_count_order_long() is implemented and can be used instead of
      fls_long() to fix the bug, for example:
      size=0x10000 -> get_count_order_long(0x10000)=16 -> align=0x10000
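      The arithmetic is easy to check outside the kernel.  A standalone
      sketch (the _model helpers are reimplementations for illustration; the
      real fls_long()/get_count_order_long() live in linux/bitops.h):

```c
/* 1-based position of the highest set bit, 0 for x == 0 */
static int fls_long_model(unsigned long x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* order of the next power of two; exact powers keep their own order */
static int get_count_order_long_model(unsigned long x)
{
	if (x == 0)
		return -1;
	return fls_long_model(x - 1);
}

/* align as computed in the commit's example for a VM_IOREMAP request */
static unsigned long ioremap_align(unsigned long size, int use_fls)
{
	int order = use_fls ? fls_long_model(size)
			    : get_count_order_long_model(size);

	return 1UL << order;
}
```

      For size=0x10000 the fls_long()-based order overshoots to 17 (align
      0x20000), while get_count_order_long() yields the exact order 16.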
      
      [akpm@linux-foundation.org: s/get_order_long()/get_count_order_long()/]
      [zijun_hu@zoho.com: fixes]
       Link: http://lkml.kernel.org/r/57AABC8B.1040409@zoho.com
      [akpm@linux-foundation.org: locate get_count_order_long() next to get_count_order()]
      [akpm@linux-foundation.org: move get_count_order[_long] definitions to pick up fls_long()]
      [zijun_hu@htc.com: move out get_count_order[_long]() from __KERNEL__ scope]
       Link: http://lkml.kernel.org/r/57B2C4CE.80303@zoho.com
      Link: http://lkml.kernel.org/r/fc045ecf-20fa-0722-b3ac-9a6140488fad@zoho.com
      Signed-off-by: zijun_hu <zijun_hu@htc.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      252e5c6e
    • V
      mm: oom: deduplicate victim selection code for memcg and global oom · 7c5f64f8
      Vladimir Davydov 提交于
      When selecting an oom victim, we use the same heuristic for both memory
      cgroup and global oom.  The only difference is the scope of tasks to
      select the victim from.  So we could just export an iterator over all
      memcg tasks and keep all oom related logic in oom_kill.c, but instead we
      duplicate pieces of it in memcontrol.c reusing some initially private
      functions of oom_kill.c in order to not duplicate all of it.  That looks
      ugly and error prone, because any modification of select_bad_process
      should also be propagated to mem_cgroup_out_of_memory.
      
      Let's rework this as follows: keep all oom heuristic related code private
      to oom_kill.c and make oom_kill.c use exported memcg functions when it's
      really necessary (like in case of iterating over memcg tasks).
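      A standalone toy of the resulting split (all names here are
      illustrative models, loosely mirroring the exported task-iterator hook
      this rework introduces; the real code iterates css_task_iter):

```c
#include <stddef.h>

struct task_model { int badness; };

typedef int (*scan_fn)(struct task_model *task, void *arg);

/* "memcg side": exported iterator over the group's tasks; a nonzero
 * return from the callback stops the walk early. */
static void group_scan_tasks(struct task_model *tasks, size_t n,
			     scan_fn fn, void *arg)
{
	for (size_t i = 0; i < n; i++)
		if (fn(&tasks[i], arg))
			break;
}

/* "oom_kill.c side": the selection heuristic stays private and is
 * passed in as a callback. */
static int note_worst(struct task_model *task, void *arg)
{
	struct task_model **worst = arg;

	if (!*worst || task->badness > (*worst)->badness)
		*worst = task;
	return 0;			/* keep iterating */
}

static int demo_pick_victim_badness(void)
{
	struct task_model tasks[] = { { 10 }, { 42 }, { 7 } };
	struct task_model *victim = NULL;

	group_scan_tasks(tasks, 3, note_worst, &victim);
	return victim ? victim->badness : -1;
}
```

      The heuristic never leaks into the iterator's home file, so changing
      select_bad_process no longer requires touching memcontrol.c.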
      
      Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c5f64f8
    • J
      ocfs2: fix undefined struct variable in inode.h · 48e509ec
      Joseph Qi 提交于
      The extern struct variable ocfs2_inode_cache is not defined.  It was
      presumably meant to be ocfs2_inode_cachep, which is defined in super.c.
      Fortunately it is not used anywhere now, so there is no actual impact.
      Clean it up to fix this mistake.
      
      Link: http://lkml.kernel.org/r/57E1E49D.8050503@huawei.com
      Signed-off-by: Joseph Qi <joseph.qi@huawei.com>
      Reviewed-by: Eric Ren <zren@suse.com>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      48e509ec
    • B
      fs/ocfs2/dlm: remove deprecated create_singlethread_workqueue() · 055fdcff
      Bhaktipriya Shridhar 提交于
      The workqueue "dlm_worker" queues a single work item &dlm->dispatched_work
      and thus it doesn't require execution ordering.  Hence, alloc_workqueue
      has been used to replace the deprecated create_singlethread_workqueue
      instance.
      
      The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
      memory pressure.
      
      Since there is a fixed number of work items, an explicit concurrency
      limit is unnecessary here.
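      A toy model of the call-site change (the stubbed constructor and flag
      below only mirror the shape of the real alloc_workqueue() and
      WQ_MEM_RECLAIM from linux/workqueue.h, for illustration):

```c
#include <string.h>

#define WQ_MEM_RECLAIM_MODEL 0x01u	/* stand-in for WQ_MEM_RECLAIM */

struct wq_model {
	char name[32];
	unsigned int flags;
	int max_active;		/* 0 means the default concurrency limit */
};

static struct wq_model *alloc_workqueue_model(struct wq_model *wq,
					      const char *name,
					      unsigned int flags,
					      int max_active)
{
	strncpy(wq->name, name, sizeof(wq->name) - 1);
	wq->name[sizeof(wq->name) - 1] = '\0';
	wq->flags = flags;
	wq->max_active = max_active;
	return wq;
}

/*
 * Old: dlm->dlm_worker = create_singlethread_workqueue("dlm_worker");
 * New: a single queued work item needs no ordering, so a plain
 * alloc_workqueue() with WQ_MEM_RECLAIM (for forward progress under
 * memory pressure) and the default concurrency limit is enough.
 */
static unsigned int demo_dlm_worker_flags(void)
{
	static struct wq_model wq;

	return alloc_workqueue_model(&wq, "dlm_worker",
				     WQ_MEM_RECLAIM_MODEL, 0)->flags;
}
```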
      
      Link: http://lkml.kernel.org/r/2b5ad8d6688effe1a9ddb2bc2082d26fbbe00302.1472590094.git.bhaktipriya96@gmail.com
      Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      055fdcff
    • B
      fs/ocfs2/super: remove deprecated create_singlethread_workqueue() · 44be9756
      Bhaktipriya Shridhar 提交于
      The workqueue "ocfs2_wq" queues multiple work items viz
      &osb->la_enable_wq, &journal->j_recovery_work, &os->os_orphan_scan_work,
      &osb->osb_truncate_log_wq which require strict execution ordering.  Hence,
      an ordered dedicated workqueue has been used.
      
      WQ_MEM_RECLAIM has been set to ensure forward progress under memory
      pressure because the workqueue is being used on a memory reclaim path.
      
      Link: http://lkml.kernel.org/r/66279de510a7f4cfc6e386d99b7e04b3f65fb11b.1472590094.git.bhaktipriya96@gmail.com
      Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44be9756
    • B
      fs/ocfs2/cluster: remove deprecated create_singlethread_workqueue() · bf940776
      Bhaktipriya Shridhar 提交于
      The workqueue "o2net_wq" queues multiple work items viz
      &old_sc->sc_shutdown_work, &sc->sc_rx_work, &sc->sc_connect_work which
      require strict execution ordering.  Hence, an ordered dedicated
      workqueue has been used.
      
      WQ_MEM_RECLAIM has been set to ensure forward progress under memory
      pressure.
      
      Link: http://lkml.kernel.org/r/ddc12e5766c79ba26f8a00d98049107f8a1d4866.1472590094.git.bhaktipriya96@gmail.com
      Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bf940776
    • B
      fs/ocfs2/dlmfs: remove deprecated create_singlethread_workqueue() · 0b41be07
      Bhaktipriya Shridhar 提交于
      The workqueue "user_dlm_worker" queues a single work item
      &lockres->l_work per user_lock_res instance and so it doesn't require
      execution ordering.  Hence, alloc_workqueue has been used to replace the
      deprecated create_singlethread_workqueue instance.
      
      The WQ_MEM_RECLAIM flag has been set to ensure forward progress under
      memory pressure.
      
      Since there is a fixed number of work items, an explicit concurrency
      limit is unnecessary here.
      
      Link: http://lkml.kernel.org/r/9748136d3a3b18138ad1d6ba708367aa1fe9f98c.1472590094.git.bhaktipriya96@gmail.com
      Signed-off-by: Bhaktipriya Shridhar <bhaktipriya96@gmail.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Mark Fasheh <mfasheh@suse.de>
      Cc: Joel Becker <jlbec@evilplan.org>
      Cc: Junxiao Bi <junxiao.bi@oracle.com>
      Cc: Joseph Qi <joseph.qi@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b41be07
    • J
      jiffies: add time comparison functions for 64 bit jiffies · 3740dcdf
      Jason A. Donenfeld 提交于
      Though the jiffies interface was nicely extended to support 64-bit
      jiffies so that it would be consistent, the time_before and time_after
      family of comparison functions was never given jiffies64 counterparts.
      This commit brings the interface to parity between jiffies and
      jiffies64, which is quite convenient.
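      A standalone model of the new helpers (the kernel versions in
      include/linux/jiffies.h additionally use typecheck() on __u64 operands;
      the signed-difference cast is what makes the comparison safe across
      counter wraparound):

```c
#include <stdint.h>

/* Illustrative stand-ins for the kernel's time_after64() family. */
#define time_after64(a, b)	((int64_t)((b) - (a)) < 0)
#define time_before64(a, b)	time_after64(b, a)
#define time_after_eq64(a, b)	((int64_t)((a) - (b)) >= 0)
#define time_before_eq64(a, b)	time_after_eq64(b, a)
```

      Even when the later timestamp has wrapped past zero, the unsigned
      subtraction reinterpreted as signed still orders the two correctly.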
      
      Link: http://lkml.kernel.org/r/20160929033319.12188-1-Jason@zx2c4.com
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: John Stultz <john.stultz@linaro.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3740dcdf
    • J
      fsnotify: clean up spinlock assertions · ed272640
      Jan Kara 提交于
      Use assert_spin_locked() macro instead of hand-made BUG_ON statements.
      
      Link: http://lkml.kernel.org/r/1474537439-18919-1-git-send-email-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Suggested-by: Heiner Kallweit <hkallweit1@gmail.com>
      Reviewed-by: Jeff Layton <jlayton@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ed272640
    • J
      fanotify: fix possible false warning when freeing events · 0b1b8652
      Jan Kara 提交于
      When freeing permission events by fsnotify_destroy_event(), the warning
      WARN_ON(!list_empty(&event->list)); may falsely hit.
      
      This is because although fanotify_get_response() saw event->response
      set, there is nothing to make sure the current CPU also sees the removal
      of the event from the list.  Add proper locking around the WARN_ON() to
      avoid the false warning.
      
      Link: http://lkml.kernel.org/r/1473797711-14111-7-git-send-email-jack@suse.cz
      Reported-by: Miklos Szeredi <mszeredi@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0b1b8652
    • J
      fanotify: use notification_lock instead of access_lock · 073f6552
      Jan Kara 提交于
      Fanotify code has its own lock (access_lock) to protect a list of events
      waiting for a response from userspace.
      
      However this is somewhat awkward as the same list_head in the event is
      protected by notification_lock if it is part of the notification queue
      and by access_lock if it is part of the fanotify private queue which
      makes it difficult for any reliable checks in the generic code.  So make
      fanotify use the same lock - notification_lock - for protecting its
      private event list.
      
      Link: http://lkml.kernel.org/r/1473797711-14111-6-git-send-email-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      073f6552
    • J
      fsnotify: convert notification_mutex to a spinlock · c21dbe20
      Jan Kara 提交于
      notification_mutex is used to protect the list of pending events.  As such
      there's no reason to use a sleeping lock for it.  Convert it to a
      spinlock.
      
      [jack@suse.cz: fixed version]
        Link: http://lkml.kernel.org/r/1474031567-1831-1-git-send-email-jack@suse.cz
      Link: http://lkml.kernel.org/r/1473797711-14111-5-git-send-email-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c21dbe20
    • J
      fsnotify: drop notification_mutex before destroying event · 1404ff3c
      Jan Kara 提交于
      fsnotify_flush_notify() and fanotify_release() destroy notification
      event while holding notification_mutex.
      
      The destruction of a fanotify event includes a path_put() call, which
      may end up calling into a filesystem to delete an inode if we happen to
      hold the last dentry reference, which in turn happens to hold the last
      inode reference.
      
      That in turn may violate lock ordering for some filesystems, since
      notification_mutex is also acquired e.g. during write when generating a
      fanotify event.
      
      Also this is the only thing that forces notification_mutex to be a
      sleeping lock.  So drop notification_mutex before destroying a
      notification event.
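      The resulting pattern, as a standalone toy (all names are illustrative
      models, not fsnotify symbols; the destructor stands in for the
      potentially sleeping path_put() chain):

```c
#include <stdlib.h>

struct lock_model { int held; };
struct event_model { struct event_model *next; };

static struct lock_model notification_lock;

/* Refuses to run under the lock -- in real life this could sleep or
 * re-enter code that takes the same lock. Returns 1 on success. */
static int destroy_event_model(struct event_model *event)
{
	if (notification_lock.held)
		return 0;		/* lock ordering would be violated */
	free(event);
	return 1;
}

/* Detach each event under the lock, destroy it with the lock dropped. */
static int flush_notify_model(struct event_model *head)
{
	int destroyed = 0;

	notification_lock.held = 1;
	while (head) {
		struct event_model *event = head;

		head = event->next;		/* detach under the lock */
		notification_lock.held = 0;	/* drop before destroying */
		destroyed += destroy_event_model(event);
		notification_lock.held = 1;	/* retake to pop the next one */
	}
	notification_lock.held = 0;
	return destroyed;
}

static int demo_flush_two_events(void)
{
	struct event_model *a = malloc(sizeof(*a));
	struct event_model *b = malloc(sizeof(*b));

	b->next = NULL;
	a->next = b;
	return flush_notify_model(a);
}
```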
      
      Link: http://lkml.kernel.org/r/1473797711-14111-4-git-send-email-jack@suse.cz
      Signed-off-by: Jan Kara <jack@suse.cz>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: Lino Sanfilippo <LinoSanfilippo@gmx.de>
      Cc: Eric Paris <eparis@redhat.com>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1404ff3c
    • L
      Merge branch 'i2c/for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux · 87840a2b
      Linus Torvalds 提交于
      Pull i2c updates from Wolfram Sang:
       "Here is the 4.9 pull request from I2C including:
      
         - centralized error messages when registering to the core
         - improved lockdep annotations to prevent false positives
         - DT support for muxes, gates, and arbitrators
         - bus speeds can now be obtained from ACPI
         - i2c-octeon got refactored and now supports ThunderX SoCs, too
         - i2c-tegra and i2c-designware got a bigger bunch of updates
         - a couple of standard driver fixes and improvements"
      
      * 'i2c/for-4.9' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux: (71 commits)
        i2c: axxia: disable clks in case of failure in probe
        i2c: octeon: thunderx: Limit register access retries
        i2c: uniphier-f: fix misdetection of incomplete STOP condition
        gpio: pca953x: variable 'id' was used twice
        i2c: i801: Add support for Kaby Lake PCH-H
        gpio: pca953x: fix an incorrect lockdep warning
        i2c: add a warning to i2c_adapter_depth()
        lockdep: make MAX_LOCKDEP_SUBCLASSES unconditionally visible
        i2c: export i2c_adapter_depth()
        i2c: rk3x: Fix variable 'min_total_ns' unused warning
        i2c: rk3x: Fix sparse warning
        i2c / ACPI: Do not touch an I2C device if it belongs to another adapter
        i2c: octeon: Fix high-level controller status check
        i2c: octeon: Avoid sending STOP during recovery
        i2c: octeon: Fix set SCL recovery function
        i2c: rcar: add support for r8a7796 (R-Car M3-W)
        i2c: imx: make bus recovery through pinctrl optional
        i2c: meson: add gxbb compatible string
        i2c: uniphier-f: set the adapter to master mode when probing
        i2c: uniphier-f: avoid WARN_ON() of clk_disable() in failure path
        ...
      87840a2b
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial · 2ab704a4
      Linus Torvalds 提交于
      Pull trivial updates from Jiri Kosina:
       "The usual rocket science from the trivial tree"
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
        tracing/syscalls: fix multiline in error message text
        lib/Kconfig.debug: fix DEBUG_SECTION_MISMATCH description
        doc: vfs: fix fadvise() sycall name
        x86/entry: spell EBX register correctly in documentation
        securityfs: fix securityfs_create_dir comment
        irq: Fix typo in tracepoint.xml
      2ab704a4
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching · ddc4e6d2
      Linus Torvalds 提交于
      Pull livepatching updates from Jiri Kosina:
      
       - fix for patching modules that contain .altinstructions or
         .parainstructions sections, from Jessica Yu
      
       - make TAINT_LIVEPATCH a per-module flag (so that it's immediately
         clear which module caused the taint), from Josh Poimboeuf
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/livepatching:
        livepatch/module: make TAINT_LIVEPATCH module-specific
        Documentation: livepatch: add section about arch-specific code
        livepatch/x86: apply alternatives and paravirt patches after relocations
        livepatch: use arch_klp_init_object_loaded() to finish arch-specific tasks
      ddc4e6d2
    • L
      Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid · bc75450c
      Linus Torvalds 提交于
      Pull HID updates from Jiri Kosina:
      
       - Integrated Sensor Hub support (Cherrytrail+) from Srinivas Pandruvada
      
       - Big cleanup of Wacom driver; namely it's now using devres, and the
         standardized LED API so that libinput doesn't need to have root
         access any more, with substantial amount of other cleanups
         piggy-backing on top. All this from Benjamin Tissoires
      
       - Report descriptor parsing would now ignore any out-of-range System
         controls in case of the application actually being System Control.
         This fixes quite some issues with several devices, and allows us to
         remove a few ->report_fixup callbacks. From Benjamin Tissoires
      
       - ... a lot of other assorted small fixes and device ID additions
      
      * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/hid: (76 commits)
        HID: add missing \n to end of dev_warn messages
        HID: alps: fix multitouch cursor issue
        HID: hid-logitech: Documentation updates/corrections
        HID: hid-logitech: Improve Wingman Formula Force GP support
        HID: hid-logitech: Rewrite of descriptor for all DF wheels
        HID: hid-logitech: Compute combined pedals value
        HID: hid-logitech: Add combined pedal support Logitech wheels
        HID: hid-logitech: Introduce control for combined pedals feature
        HID: sony: Update copyright and add Dualshock 4 rate control note
        HID: sony: Defer the initial USB Sixaxis output report
        HID: sony: Relax duplicate checking for USB-only devices
        Revert "HID: microsoft: fix invalid rdesc for 3k kbd"
        HID: alps: fix error return code in alps_input_configured()
        HID: alps: fix stick device not working after resume
        HID: support for keyboard - Corsair STRAFE
        HID: alps: Fix memory leak
        HID: uclogic: Add support for UC-Logic TWHA60 v3
        HID: uclogic: Override constant descriptors
        HID: uclogic: Support UGTizer GP0610 partially
        HID: uclogic: Add support for several more tablets
        ...
      bc75450c
    • L
      Merge tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci · e6e3d8f8
      Linus Torvalds 提交于
      Pull PCI updates from Bjorn Helgaas:
       "Summary of PCI changes for the v4.9 merge window:
      
        Enumeration:
         - microblaze: Add multidomain support for procfs (Bharat Kumar Gogada)
      
        Resource management:
         - Ignore requested alignment for PROBE_ONLY and fixed resources (Yongji Xie)
         - Ignore requested alignment for VF BARs (Yongji Xie)
      
        PCI device hotplug:
         - Make core explicitly non-modular (Paul Gortmaker)
      
        PCIe native device hotplug:
         - Rename pcie_isr() locals for clarity (Bjorn Helgaas)
         - Return IRQ_NONE when we can't read interrupt status (Bjorn Helgaas)
         - Remove unnecessary guard (Bjorn Helgaas)
         - Clean up dmesg "Slot(%s)" messages (Bjorn Helgaas)
         - Remove useless pciehp_get_latch_status() calls (Bjorn Helgaas)
         - Clear attention LED on device add (Keith Busch)
         - Allow exclusive userspace control of indicators (Keith Busch)
         - Process all hotplug events before looking for new ones (Mayurkumar Patel)
         - Don't re-read Slot Status when queuing hotplug event (Mayurkumar Patel)
         - Don't re-read Slot Status when handling surprise event (Mayurkumar Patel)
         - Make explicitly non-modular (Paul Gortmaker)
      
        Power management:
         - Afford direct-complete to devices with non-standard PM (Lukas Wunner)
         - Query platform firmware for device power state (Lukas Wunner)
         - Recognize D3cold in pci_update_current_state() (Lukas Wunner)
         - Avoid unnecessary resume after direct-complete (Lukas Wunner)
         - Make explicitly non-modular (Paul Gortmaker)
      
        Virtualization:
         - Mark Atheros AR9580 to avoid bus reset (Maik Broemme)
         - Check for pci_setup_device() failure in pci_iov_add_virtfn() (Po Liu)
      
        MSI:
         - Enable PCI_MSI_IRQ_DOMAIN support for ARC (Joao Pinto)
      
        AER:
         - Remove aerdriver.nosourceid kernel parameter (Bjorn Helgaas)
         - Remove aerdriver.forceload kernel parameter (Bjorn Helgaas)
         - Fix aer_probe() kernel-doc comment (Cao jin)
         - Add bus flag to skip source ID matching (Jon Derrick)
         - Avoid memory allocation in interrupt handling path (Jon Derrick)
         - Cache capability position (Keith Busch)
         - Make explicitly non-modular (Paul Gortmaker)
         - Remove duplicate AER severity translation (Tyler Baicar)
         - Send correct severity to calculate AER severity (Tyler Baicar)
      
        Precision Time Measurement:
         - Add Precision Time Measurement (PTM) support (Jonathan Yong)
         - Add PTM clock granularity information (Bjorn Helgaas)
         - Add pci_enable_ptm() for drivers to enable PTM on endpoints (Bjorn Helgaas)
      
        Generic host bridge driver:
         - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
         - Make explicitly non-modular (Paul Gortmaker)
      
        Altera host bridge driver:
         - Remove redundant platform_get_resource() return value check (Bjorn Helgaas)
         - Poll for link training status after retraining the link (Ley Foon Tan)
         - Rework config accessors for use without a struct pci_bus (Ley Foon Tan)
         - Move retrain from fixup to altera_pcie_host_init() (Ley Foon Tan)
         - Make MSI explicitly non-modular (Paul Gortmaker)
         - Make explicitly non-modular (Paul Gortmaker)
         - Relax device number checking to allow SR-IOV (Po Liu)
      
        ARM Versatile host bridge driver:
         - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
      
        Axis ARTPEC-6 host bridge driver:
         - Drop __init from artpec6_add_pcie_port() (Niklas Cassel)
      
        Freescale i.MX6 host bridge driver:
         - Make explicitly non-modular (Paul Gortmaker)
      
        Intel VMD host bridge driver:
         - Add quirk for AER to ignore source ID (Jon Derrick)
         - Allocate IRQ lists with correct MSI-X count (Jon Derrick)
         - Convert to use pci_alloc_irq_vectors() API (Jon Derrick)
         - Eliminate vmd_vector member from list type (Jon Derrick)
         - Eliminate index member from IRQ list (Jon Derrick)
         - Synchronize with RCU freeing MSI IRQ descs (Keith Busch)
         - Request userspace control of PCIe hotplug indicators (Keith Busch)
         - Move VMD driver to drivers/pci/host (Keith Busch)
      
        Marvell Aardvark host bridge driver:
         - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
         - Remove redundant dev_err call in advk_pcie_probe() (Wei Yongjun)
      
        Microsoft Hyper-V host bridge driver:
         - Use zero-length array in struct pci_packet (Dexuan Cui)
         - Use pci_function_description[0] in struct definitions (Dexuan Cui)
         - Remove the unused 'wrk' in struct hv_pcibus_device (Dexuan Cui)
         - Handle vmbus_sendpacket() failure in hv_compose_msi_msg() (Dexuan Cui)
         - Handle hv_pci_generic_compl() error case (Dexuan Cui)
         - Use list_move_tail() instead of list_del() + list_add_tail() (Wei Yongjun)
      
        NVIDIA Tegra host bridge driver:
         - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
         - Remove redundant _data suffix (Thierry Reding)
         - Use of_device_get_match_data() (Thierry Reding)
      
        Qualcomm host bridge driver:
         - Make explicitly non-modular (Paul Gortmaker)
      
        Renesas R-Car host bridge driver:
         - Consolidate register space lookup and ioremap (Bjorn Helgaas)
         - Don't disable/unprepare clocks on prepare/enable failure (Geert Uytterhoeven)
         - Add multi-MSI support (Grigory Kletsko)
         - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
         - Fix some checkpatch warnings (Sergei Shtylyov)
         - Try increasing PCIe link speed to 5 GT/s at boot (Sergei Shtylyov)
      
        Rockchip host bridge driver:
         - Add DT bindings for Rockchip PCIe controller (Shawn Lin)
         - Add Rockchip PCIe controller support (Shawn Lin)
         - Improve the deassert sequence of four reset pins (Shawn Lin)
         - Fix wrong transmitted FTS count (Shawn Lin)
         - Increase the Max Credit update interval (Rajat Jain)
      
        Samsung Exynos host bridge driver:
         - Make explicitly non-modular (Paul Gortmaker)
      
        ST Microelectronics SPEAr13xx host bridge driver:
         - Make explicitly non-modular (Paul Gortmaker)
      
        Synopsys DesignWare host bridge driver:
         - Return data directly from dw_pcie_readl_rc() (Bjorn Helgaas)
         - Exchange viewport of `MEMORYs' and `CFGs/IOs' (Dong Bo)
         - Check LTSSM training bit before deciding link is up (Jisheng Zhang)
         - Move link wait definitions to .c file (Joao Pinto)
         - Wait for iATU enable (Joao Pinto)
         - Add iATU Unroll feature (Joao Pinto)
         - Fix pci_remap_iospace() failure path (Lorenzo Pieralisi)
         - Make explicitly non-modular (Paul Gortmaker)
         - Relax device number checking to allow SR-IOV (Po Liu)
         - Keep viewport fixed for IO transaction if num_viewport > 2 (Pratyush Anand)
         - Remove redundant platform_get_resource() return value check (Wei Yongjun)
      
        TI DRA7xx host bridge driver:
         - Make explicitly non-modular (Paul Gortmaker)
      
        TI Keystone host bridge driver:
         - Propagate request_irq() failure (Wei Yongjun)
      
        Xilinx AXI host bridge driver:
         - Keep both legacy and MSI interrupt domain references (Bharat Kumar Gogada)
         - Clear interrupt register for invalid interrupt (Bharat Kumar Gogada)
         - Clear correct MSI set bit (Bharat Kumar Gogada)
         - Dispose of MSI virtual IRQ (Bharat Kumar Gogada)
         - Make explicitly non-modular (Paul Gortmaker)
         - Relax device number checking to allow SR-IOV (Po Liu)
      
        Xilinx NWL host bridge driver:
         - Expand error logging (Bharat Kumar Gogada)
         - Enable all MSI interrupts using MSI mask (Bharat Kumar Gogada)
         - Make explicitly non-modular (Paul Gortmaker)
      
        Miscellaneous:
         - Drop CONFIG_KEXEC_CORE ifdeffery (Lukas Wunner)
         - portdrv: Make explicitly non-modular (Paul Gortmaker)
         - Make DPC explicitly non-modular (Paul Gortmaker)"
      
      * tag 'pci-v4.9-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: (105 commits)
        x86/PCI: VMD: Move VMD driver to drivers/pci/host
        PCI: rockchip: Fix wrong transmitted FTS count
        PCI: rockchip: Improve the deassert sequence of four reset pins
        PCI: rockchip: Increase the Max Credit update interval
        PCI: rcar: Try increasing PCIe link speed to 5 GT/s at boot
        PCI/AER: Fix aer_probe() kernel-doc comment
        PCI: Ignore requested alignment for VF BARs
        PCI: Ignore requested alignment for PROBE_ONLY and fixed resources
        PCI: Avoid unnecessary resume after direct-complete
        PCI: Recognize D3cold in pci_update_current_state()
        PCI: Query platform firmware for device power state
        PCI: Afford direct-complete to devices with non-standard PM
        PCI/AER: Cache capability position
        PCI/AER: Avoid memory allocation in interrupt handling path
        x86/PCI: VMD: Request userspace control of PCIe hotplug indicators
        PCI: pciehp: Allow exclusive userspace control of indicators
        ACPI / APEI: Send correct severity to calculate AER severity
        PCI/AER: Remove duplicate AER severity translation
        x86/PCI: VMD: Synchronize with RCU freeing MSI IRQ descs
        x86/PCI: VMD: Eliminate index member from IRQ list
        ...
      e6e3d8f8
    • L
      Merge tag 'vfio-v4.9-rc1' of git://github.com/awilliam/linux-vfio · fbbea389
      Committed by Linus Torvalds
      Pull VFIO updates from Alex Williamson:
       - comment fixes (Wei Jiangang)
       - static symbols (Baoyou Xie)
       - FLR virtualization (Alex Williamson)
       - catching INTx enabling after MSI/X teardown (Alex Williamson)
       - update to pci_alloc_irq_vectors helpers (Christoph Hellwig)
      
      * tag 'vfio-v4.9-rc1' of git://github.com/awilliam/linux-vfio:
        vfio_pci: use pci_alloc_irq_vectors
        vfio-pci: Disable INTx after MSI/X teardown
        vfio-pci: Virtualize PCIe & AF FLR
        vfio: platform: mark symbols static where possible
        vfio/pci: Fix typos in comments
      fbbea389
    • L
      Merge tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md · c23112e0
      Committed by Linus Torvalds
      Pull MD updates from Shaohua Li:
       "This update includes:
      
         - new AVX512 instruction based raid6 gen/recovery algorithm
      
         - a couple of md-cluster related bug fixes
      
         - fix a potential deadlock
      
         - set nonrotational bit for raid array with SSD
      
   - set correct max_hw_sectors for raid5/6, which hopefully can improve
     performance a little bit
      
         - other minor fixes"
      
      * tag 'md/4.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shli/md:
        md: set rotational bit
        raid6/test/test.c: bug fix: Specify aligned(alignment) attributes to the char arrays
        raid5: handle register_shrinker failure
        raid5: fix to detect failure of register_shrinker
        md: fix a potential deadlock
        md/bitmap: fix wrong cleanup
        raid5: allow arbitrary max_hw_sectors
        lib/raid6: Add AVX512 optimized xor_syndrome functions
        lib/raid6/test/Makefile: Add avx512 gen_syndrome and recovery functions
        lib/raid6: Add AVX512 optimized recovery functions
        lib/raid6: Add AVX512 optimized gen_syndrome functions
        md-cluster: make resync lock also could be interruptted
        md-cluster: introduce dlm_lock_sync_interruptible to fix tasks hang
        md-cluster: convert the completion to wait queue
        md-cluster: protect md_find_rdev_nr_rcu with rcu lock
        md-cluster: clean related infos of cluster
        md: changes for MD_STILL_CLOSED flag
        md-cluster: remove some unnecessary dlm_unlock_sync
        md-cluster: use FORCEUNLOCK in lockres_free
        md-cluster: call md_kick_rdev_from_array once ack failed
      c23112e0