1. 02 9月, 2020 40 次提交
    • M
      mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model · d5971569
      Mel Gorman 提交于
      to #28825456
      
      commit 24512228b7a3f412b5a51f189df302616b021c33 upstream.
      
      Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small
      amounts of memory when an external fragmentation event occurs") "broke"
      memory management on parisc.
      
      The machine is not NUMA but the DISCONTIG model creates three pgdats
      even though it's a UMA machine for the following ranges
      
              0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
              1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
              2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
      
      Mikulas reported:
      
      	With the patch 1c30844d2, the kernel will incorrectly reclaim the
      	first zone when it fills up, ignoring the fact that there are two
      	completely free zones. Basiscally, it limits cache size to 1GiB.
      
      	For example, if I run:
      	# dd if=/dev/sda of=/dev/null bs=1M count=2048
      
      	- with the proper kernel, there should be "Buffers - 2GiB"
      	when this command finishes. With the patch 1c30844d2, buffers
      	will consume just 1GiB or slightly more, because the kernel was
      	incorrectly reclaiming them.
      
      The page allocator and reclaim makes assumptions that pgdats really
      represent NUMA nodes and zones represent ranges and makes decisions on
      that basis.  Watermark boosting for small pgdats leads to unexpected
      results even though this would have behaved reasonably on SPARSEMEM.
      
      DISCONTIG is essentially deprecated and even parisc plans to move to
      SPARSEMEM so there is no need to be fancy, this patch simply disables
      watermark boosting by default on DISCONTIGMEM.
      
      Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: NMikulas Patocka <mpatocka@redhat.com>
      Tested-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d5971569
    • M
      mm, page_alloc: fix a division by zero error when boosting watermarks v2 · 92bebf3c
      Mel Gorman 提交于
      to #28825456
      
      commit 94b3334cbebea34d56a7e6321c6fe9d89b309a49 upstream.
      
      Yury Norov reported that an arm64 KVM instance could not boot since
      after v5.0-rc1 and could addressed by reverting the patches
      
        1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external
        73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")
      
      The problem is that a division by zero error is possible if boosting
      occurs very early in boot if the system has very little memory.  This
      patch avoids the division by zero error.
      
      Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: NYury Norov <yury.norov@gmail.com>
      Tested-by: NYury Norov <yury.norov@gmail.com>
      Tested-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      92bebf3c
    • M
      mm, page_alloc: do not wake kswapd with zone lock held · ba16c9c8
      Mel Gorman 提交于
      to #28825456
      
      commit 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 upstream.
      
      syzbot reported the following regression in the latest merge window and
      it was confirmed by Qian Cai that a similar bug was visible from a
      different context.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.20.0+ #297 Not tainted
        ------------------------------------------------------
        syz-executor0/8529 is trying to acquire lock:
        000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
        __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
      
        but task is already holding lock:
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
        include/linux/spinlock.h:329 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
        mm/page_alloc.c:2548 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
        mm/page_alloc.c:3021 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
        mm/page_alloc.c:3050 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
        mm/page_alloc.c:3072 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
        get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491
      
      It appears to be a false positive in that the only way the lock ordering
      should be inverted is if kswapd is waking itself and the wakeup
      allocates debugging objects which should already be allocated if it's
      kswapd doing the waking.  Nevertheless, the possibility exists and so
      it's best to avoid the problem.
      
      This patch flags a zone as needing a kswapd using the, surprisingly,
      unused zone flag field.  The flag is read without the lock held to do
      the wakeup.  It's possible that the flag setting context is not the same
      as the flag clearing context or for small races to occur.  However, each
      race possibility is harmless and there is no visible degredation in
      fragmentation treatment.
      
      While zone->flag could have continued to be unused, there is potential
      for moving some existing fields into the flags field instead.
      Particularly read-mostly ones like zone->initialized and
      zone->contiguous.
      
      Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NQian Cai <cai@lca.pw>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	include/linux/mmzone.h
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      ba16c9c8
    • M
      mm: reclaim small amounts of memory when an external fragmentation event occurs · 9bcadc70
      Mel Gorman 提交于
      to #28825456
      
      commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.
      
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation causing
      events occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark and kswapd is woken if the
      calling context allows to reclaim an amount of memory relative to the size
      of the high watermark and the watermark_boost_factor until the boost is
      cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation causing events are massively reduced by
      this path whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      9bcadc70
    • M
      mm: use alloc_flags to record if kswapd can wake · fd98e14a
      Mel Gorman 提交于
      to #28825456
      
      commit 0a79cdad5eb213b3a629e624565b1b3bf9192b7c upstream.
      
      This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
      into alloc_flags.  This is a preparation patch only that avoids having to
      pass gfp_mask through a long callchain in a future patch.
      
      Note that the setting in the fast path happens in alloc_flags_nofragment()
      and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
      That's true in this patch but is not true later so it's done now for
      easier review to show where the flag needs to be recorded.
      
      No functional change.
      
      [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
        Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
      Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      fd98e14a
    • M
      mm, page_alloc: spread allocations across zones before introducing fragmentation · 039531d2
      Mel Gorman 提交于
      to #28825456
      
      commit 6bb154504f8b496780ec53ec81aba957a12981fa upstream.
      
      Patch series "Fragmentation avoidance improvements", v5.
      
      It has been noted before that fragmentation avoidance (aka
      anti-fragmentation) is not perfect. Given sufficient time or an adverse
      workload, memory gets fragmented and the long-term success of high-order
      allocations degrades. This series defines an adverse workload, a definition
      of external fragmentation events (including serious) ones and a series
      that reduces the level of those fragmentation events.
      
      The details of the workload and the consequences are described in more
      detail in the changelogs. However, from patch 1, this is a high-level
      summary of the adverse workload. The exact details are found in the
      mmtests implementation.
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch)
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameterr create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed
      3. Warm up a number of fio read-only threads accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll fault back in old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup
      
      Overall the series reduces external fragmentation causing events by over 94%
      on 1 and 2 socket machines, which in turn impacts high-order allocation
      success rates over the long term. There are differences in latencies and
      high-order allocation success rates. Latencies are a mixed bag as they
      are vulnerable to exact system state and whether allocations succeeded
      so they are treated as a secondary metric.
      
      Patch 1 uses lower zones if they are populated and have free memory
      	instead of fragmenting a higher zone. It's special cased to
      	handle a Normal->DMA32 fallback with the reasons explained
      	in the changelog.
      
      Patch 2-4 boosts watermarks temporarily when an external fragmentation
      	event occurs. kswapd wakes to reclaim a small amount of old memory
      	and then wakes kcompactd on completion to recover the system
      	slightly. This introduces some overhead in the slowpath. The level
      	of boosting can be tuned or disabled depending on the tolerance
      	for fragmentation vs allocation latency.
      
      Patch 5 stalls some movable allocation requests to let kswapd from patch 4
      	make some progress. The duration of the stalls is very low but it
      	is possible to tune the system to avoid fragmentation events if
      	larger stalls can be tolerated.
      
      The bulk of the improvement in fragmentation avoidance is from patches
      1-4 but patch 5 can deal with a rare corner case and provides the option
      of tuning a system for THP allocation success rates in exchange for
      some stalls to control fragmentation.
      
      This patch (of 5):
      
      The page allocator zone lists are iterated based on the watermarks of each
      zone which does not take anti-fragmentation into account.  On x86, node 0
      may have multiple zones while other nodes have one zone.  A consequence is
      that tasks running on node 0 may fragment ZONE_NORMAL even though
      ZONE_DMA32 has plenty of free memory.  This patch special cases the
      allocator fast path such that it'll try an allocation from a lower local
      zone before fragmenting a higher zone.  In this case, stealing of
      pageblocks or orders larger than a pageblock are still allowed in the fast
      path as they are uninteresting from a fragmentation point of view.
      
      This was evaluated using a benchmark designed to fragment memory before
      attempting THP allocations.  It's implemented in mmtests as the following
      configurations
      
      configs/config-global-dhp__workload_thpfioscale
      configs/config-global-dhp__workload_thpfioscale-defrag
      configs/config-global-dhp__workload_thpfioscale-madvhugepage
      
      e.g. from mmtests
      ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1
      
      The broad details of the workload are as follows;
      
      1. Create an XFS filesystem (not specified in the configuration but done
         as part of the testing for this patch).
      2. Start 4 fio threads that write a number of 64K files inefficiently.
         Inefficiently means that files are created on first access and not
         created in advance (fio parameter create_on_open=1) and fallocate
         is not used (fallocate=none). With multiple IO issuers this creates
         a mix of slab and page cache allocations over time. The total size
         of the files is 150% physical memory so that the slabs and page cache
         pages get mixed.
      3. Warm up a number of fio read-only processes accessing the same files
         created in step 2. This part runs for the same length of time it
         took to create the files. It'll refault old data and further
         interleave slab and page cache allocations. As it's now low on
         memory due to step 2, fragmentation occurs as pageblocks get
         stolen.
      4. While step 3 is still running, start a process that tries to allocate
         75% of memory as huge pages with a number of threads. The number of
         threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
         threads contending with fio, any other threads or forcing cross-NUMA
         scheduling. Note that the test has not been used on a machine with less
         than 8 cores. The benchmark records whether huge pages were allocated
         and what the fault latency was in microseconds.
      5. Measure the number of events potentially causing external fragmentation,
         the fault latency and the huge page allocation success rate.
      6. Cleanup the test files.
      
      Note that due to the use of IO and page cache that this benchmark is not
      suitable for running on large machines where the time to fragment memory
      may be excessive.  Also note that while this is one mix that generates
      fragmentation that it's not the only mix that generates fragmentation.
      Differences in workload that are more slab-intensive or whether SLUB is
      used with high-order pages may yield different results.
      
      When the page allocator fragments memory, it records the event using the
      mm_page_alloc_extfrag ftrace event.  If the fallback_order is smaller than
      a pageblock order (order-9 on 64-bit x86) then it's considered to be an
      "external fragmentation event" that may cause issues in the future.
      Hence, the primary metric here is the number of external fragmentation
      events that occur with order < 9.  The secondary metric is allocation
      latency and huge page allocation success rates but note that differences
      in latencies and what the success rate also can affect the number of
      external fragmentation event which is why it's a secondary metric.
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1      662.92 (   0.00%)      653.58 *   1.41%*
      Amean     fault-huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1        0.00 (   0.00%)        0.00 (   0.00%)
      
      Fault latencies are slightly reduced while allocation success rates remain
      at zero as this configuration does not make any special effort to allocate
      THP and fio is heavily active at the time and either filling memory or
      keeping pages resident.  However, a 49% reduction of serious fragmentation
      events reduces the changes of external fragmentation being a problem in
      the future.
      
      Vlastimil asked during review for a breakdown of the allocation types
      that are falling back.
      
      vanilla
         3816 MIGRATE_UNMOVABLE
       800845 MIGRATE_MOVABLE
           33 MIGRATE_UNRECLAIMABLE
      
      patch
          735 MIGRATE_UNMOVABLE
       408135 MIGRATE_MOVABLE
           42 MIGRATE_UNRECLAIMABLE
      
      The majority of the fallbacks are due to movable allocations and this is
      consistent for the workload throughout the series so will not be presented
      again as the primary source of fallbacks are movable allocations.
      
      Movable fallbacks are sometimes considered "ok" to fallback because they
      can be migrated.  The problem is that they can fill an
      unmovable/reclaimable pageblock causing those allocations to fallback
      later and polluting pageblocks with pages that cannot move.  If there is a
      movable fallback, it is pretty much guaranteed to affect an
      unmovable/reclaimable pageblock and while it might not be enough to
      actually cause a unmovable/reclaimable fallback in the future, we cannot
      know that in advance so the patch takes the only option available to it.
      Hence, it's important to control them.  This point is also consistent
      throughout the series and will not be repeated.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-1     1495.14 (   0.00%)     1467.55 (   1.85%)
      Amean     fault-huge-1     1098.48 (   0.00%)     1127.11 (  -2.61%)
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-1       78.57 (   0.00%)       77.64 (  -1.18%)
      
      Fragmentation events were reduced quite a bit although this is known
      to be a little variable. The latencies and allocation success rates
      are similar but they were already quite high.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     1350.05 (   0.00%)     1346.45 (   0.27%)
      Amean     fault-huge-5     4181.01 (   0.00%)     3418.60 (  18.24%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5        1.15 (   0.00%)        0.78 ( -31.88%)
      
      The reduction of external fragmentation events is slight and this is
      partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
      ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
      allocations can now spill over to remote nodes instead of fragmenting
      local memory.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                            vanilla           lowzone-v5r8
      Amean     fault-base-5     6138.97 (   0.00%)     6217.43 (  -1.28%)
      Amean     fault-huge-5     2294.28 (   0.00%)     3163.33 * -37.88%*
      
      thpfioscale Percentage Faults Huge
                                    4.20.0-rc3             4.20.0-rc3
                                       vanilla           lowzone-v5r8
      Percentage huge-5       96.82 (   0.00%)       95.14 (  -1.74%)
      
      There was a slight reduction in external fragmentation events although the
      latencies were higher.  The allocation success rate is high enough that
      the system is struggling and there is quite a lot of parallel reclaim and
      compaction activity.  There is also a certain degree of luck on whether
      processes start on node 0 or not for this patch but the relevance is
      reduced later in the series.
      
      Overall, the patch reduces the number of external fragmentation causing
      events so the success of THP over long periods of time would be improved
      for this adverse workload.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	mm/page_alloc.c
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      039531d2
    • J
      mm/filemap.c: don't bother dropping mmap_sem for zero size readahead · 0b74ae70
      Jan Kara 提交于
      to #28718400
      
      commit 5c72feee3e45b40a3c96c7145ec422899d0e8964 upstream.
      
      When handling a page fault, we drop mmap_sem to start async readahead so
      that we don't block on IO submission with mmap_sem held.  However there's
      no point to drop mmap_sem in case readahead is disabled.  Handle that case
      to avoid pointless dropping of mmap_sem and retrying the fault.  This was
      actually reported to block mlockall(MCL_CURRENT) indefinitely.
      
      Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
      Reported-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: NRobert Stupp <snazy@gmx.de>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NMinchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.czSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      0b74ae70
    • Y
      mm: mmu_gather: remove __tlb_reset_range() for force flush · 1d42b185
      Yang Shi 提交于
      to #28718400
      
      commit 7a30df49f63ad92318ddf1f7498d1129a77dd4bd upstream.
      
      A few new fields were added to mmu_gather to make TLB flush smarter for
      huge page by telling what level of page table is changed.
      
      __tlb_reset_range() is used to reset all these page table state to
      unchanged, which is called by TLB flush for parallel mapping changes for
      the same range under non-exclusive lock (i.e.  read mmap_sem).
      
      Before commit dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in
      munmap"), the syscalls (e.g.  MADV_DONTNEED, MADV_FREE) which may update
      PTEs in parallel don't remove page tables.  But, the forementioned
      commit may do munmap() under read mmap_sem and free page tables.  This
      may result in program hang on aarch64 reported by Jan Stancek.  The
      problem could be reproduced by his test program with slightly modified
      below.
      
      ---8<---
      
      static int map_size = 4096;
      static int num_iter = 500;
      static long threads_total;
      
      static void *distant_area;
      
      void *map_write_unmap(void *ptr)
      {
      	int *fd = ptr;
      	unsigned char *map_address;
      	int i, j = 0;
      
      	for (i = 0; i < num_iter; i++) {
      		map_address = mmap(distant_area, (size_t) map_size, PROT_WRITE | PROT_READ,
      			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
      		if (map_address == MAP_FAILED) {
      			perror("mmap");
      			exit(1);
      		}
      
      		for (j = 0; j < map_size; j++)
      			map_address[j] = 'b';
      
      		if (munmap(map_address, map_size) == -1) {
      			perror("munmap");
      			exit(1);
      		}
      	}
      
      	return NULL;
      }
      
      void *dummy(void *ptr)
      {
      	return NULL;
      }
      
      int main(void)
      {
      	pthread_t thid[2];
      
      	/* hint for mmap in map_write_unmap() */
      	distant_area = mmap(0, DISTANT_MMAP_SIZE, PROT_WRITE | PROT_READ,
      			MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
      	munmap(distant_area, (size_t)DISTANT_MMAP_SIZE);
      	distant_area += DISTANT_MMAP_SIZE / 2;
      
      	while (1) {
      		pthread_create(&thid[0], NULL, map_write_unmap, NULL);
      		pthread_create(&thid[1], NULL, dummy, NULL);
      
      		pthread_join(thid[0], NULL);
      		pthread_join(thid[1], NULL);
      	}
      }
      ---8<---
      
      The program may bring in parallel execution like below:
      
              t1                                        t2
      munmap(map_address)
        downgrade_write(&mm->mmap_sem);
        unmap_region()
        tlb_gather_mmu()
          inc_tlb_flush_pending(tlb->mm);
        free_pgtables()
          tlb->freed_tables = 1
          tlb->cleared_pmds = 1
      
                                              pthread_exit()
                                              madvise(thread_stack, 8M, MADV_DONTNEED)
                                                zap_page_range()
                                                  tlb_gather_mmu()
                                                    inc_tlb_flush_pending(tlb->mm);
      
        tlb_finish_mmu()
          if (mm_tlb_flush_nested(tlb->mm))
            __tlb_reset_range()
      
      __tlb_reset_range() would reset freed_tables and cleared_* bits, but this
      may cause inconsistency for munmap() which do free page tables.  Then it
      may result in some architectures, e.g.  aarch64, may not flush TLB
      completely as expected to have stale TLB entries remained.
      
      Use fullmm flush since it yields much better performance on aarch64 and
      non-fullmm doesn't yields significant difference on x86.
      
      The original proposed fix came from Jan Stancek who mainly debugged this
      issue, I just wrapped up everything together.
      
      Jan's testing results:
      
      v5.2-rc2-24-gbec7550cca10
      --------------------------
               mean     stddev
      real    37.382   2.780
      user     1.420   0.078
      sys     54.658   1.855
      
      v5.2-rc2-24-gbec7550cca10 + "mm: mmu_gather: remove __tlb_reset_range() for force flush"
      ---------------------------------------------------------------------------------------_
               mean     stddev
      real    37.119   2.105
      user     1.548   0.087
      sys     55.698   1.357
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/1558322252-113575-1-git-send-email-yang.shi@linux.alibaba.com
      Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NJan Stancek <jstancek@redhat.com>
      Reported-by: NJan Stancek <jstancek@redhat.com>
      Tested-by: NJan Stancek <jstancek@redhat.com>
      Suggested-by: NWill Deacon <will.deacon@arm.com>
      Tested-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NWill Deacon <will.deacon@arm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Nick Piggin <npiggin@gmail.com>
      Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
      Cc: Nadav Amit <namit@vmware.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: <stable@vger.kernel.org>	[4.20+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      [xuyu: backport from mm/mmu_gather.c to mm/memory.c]
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      1d42b185
    • J
      filemap: drop the mmap_sem for all blocking operations · 2841653a
      Josef Bacik 提交于
      to #28718400
      
      commit 6b4c9f4469819a0c1a38a0a4541337e0f9bf6c11 upstream.
      
      Currently we only drop the mmap_sem if there is contention on the page
      lock.  The idea is that we issue readahead and then go to lock the page
      while it is under IO and we want to not hold the mmap_sem during the IO.
      
      The problem with this is the assumption that the readahead does anything.
      In the case that the box is under extreme memory or IO pressure we may end
      up not reading anything at all for readahead, which means we will end up
      reading in the page under the mmap_sem.
      
      Even if the readahead does something, it could get throttled because of io
      pressure on the system and the process is in a lower priority cgroup.
      
      Holding the mmap_sem while doing IO is problematic because it can cause
      system-wide priority inversions.  Consider some large company that does a
      lot of web traffic.  This large company has load balancing logic in it's
      core web server, cause some engineer thought this was a brilliant plan.
      This load balancing logic gets statistics from /proc about the system,
      which trip over processes mmap_sem for various reasons.  Now the web
      server application is in a protected cgroup, but these other processes may
      not be, and if they are being throttled while their mmap_sem is held we'll
      stall, and cause this nice death spiral.
      
      Instead rework filemap fault path to drop the mmap sem at any point that
      we may do IO or block for an extended period of time.  This includes while
      issuing readahead, locking the page, or needing to call ->readpage because
      readahead did not occur.  Then once we have a fully uptodate page we can
      return with VM_FAULT_RETRY and come back again to find our nicely in-cache
      page that was gotten outside of the mmap_sem.
      
      This patch also adds a new helper for locking the page with the mmap_sem
      dropped.  This doesn't make sense currently as generally speaking if the
      page is already locked it'll have been read in (unless there was an error)
      before it was unlocked.  However a forthcoming patchset will change this
      with the ability to abort read-ahead bio's if necessary, making it more
      likely that we could contend for a page lock and still have a not uptodate
      page.  This allows us to deal with this case by grabbing the lock and
      issuing the IO without the mmap_sem held, and then returning
      VM_FAULT_RETRY to come back around.
      
      [josef@toxicpanda.com: v6]
        Link: http://lkml.kernel.org/r/20181212152757.10017-1-josef@toxicpanda.com
      [kirill@shutemov.name: fix race in filemap_fault()]
        Link: http://lkml.kernel.org/r/20181228235106.okk3oastsnpxusxs@kshutemo-mobl1
      [akpm@linux-foundation.org: coding style fixes]
      Link: http://lkml.kernel.org/r/20181211173801.29535-4-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Tested-by: syzbot+b437b5a429d680cf2217@syzkaller.appspotmail.com
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      2841653a
    • J
      filemap: kill page_cache_read usage in filemap_fault · d33d6167
      Josef Bacik 提交于
      to #28718400
      
      commit a75d4c33377277b6034dd1e2663bce444f952c14 upstream.
      
      Patch series "drop the mmap_sem when doing IO in the fault path", v6.
      
      Now that we have proper isolation in place with cgroups2 we have started
      going through and fixing the various priority inversions.  Most are all
      gone now, but this one is sort of weird since it's not necessarily a
      priority inversion that happens within the kernel, but rather because of
      something userspace does.
      
      We have giant applications that we want to protect, and parts of these
      giant applications do things like watch the system state to determine how
      healthy the box is for load balancing and such.  This involves running
      'ps' or other such utilities.  These utilities will often walk
      /proc/<pid>/whatever, and these files can sometimes need to
      down_read(&task->mmap_sem).  Not usually a big deal, but we noticed when
      we are stress testing that sometimes our protected application has latency
      spikes trying to get the mmap_sem for tasks that are in lower priority
      cgroups.
      
      This is because any down_write() on a semaphore essentially turns it into
      a mutex, so even if we currently have it held for reading, any new readers
      will not be allowed on to keep from starving the writer.  This is fine,
      except a lower priority task could be stuck doing IO because it has been
      throttled to the point that its IO is taking much longer than normal.  But
      because a higher priority group depends on this completing it is now stuck
      behind lower priority work.
      
      In order to avoid this particular priority inversion we want to use the
      existing retry mechanism to stop from holding the mmap_sem at all if we
      are going to do IO.  This already exists in the read case sort of, but
      needed to be extended for more than just grabbing the page lock.  With
      io.latency we throttle at submit_bio() time, so the readahead stuff can
      block and even page_cache_read can block, so all these paths need to have
      the mmap_sem dropped.
      
      The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
      because we have to reserve space for the dirty page, which can be a very
      expensive operation.  We use the same retry method as the read path, and
      simply cache the page and verify the page is still setup properly the next
      pass through ->page_mkwrite().
      
      I've tested these patches with xfstests and there are no regressions.
      
      This patch (of 3):
      
      If we do not have a page at filemap_fault time we'll do this weird forced
      page_cache_read thing to populate the page, and then drop it again and
      loop around and find it.  This makes for 2 ways we can read a page in
      filemap_fault, and it's not really needed.  Instead add a FGP_FOR_MMAP
      flag so that pagecache_get_page() will return a unlocked page that's in
      pagecache.  Then use the normal page locking and readpage logic already in
      filemap_fault.  This simplifies the no page in page cache case
      significantly.
      
      [akpm@linux-foundation.org: fix comment text]
      [josef@toxicpanda.com: don't unlock null page in FGP_FOR_MMAP case]
        Link: http://lkml.kernel.org/r/20190312201742.22935-1-josef@toxicpanda.com
      Link: http://lkml.kernel.org/r/20181211173801.29535-2-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	mm/filemap.c
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d33d6167
    • J
      filemap: pass vm_fault to the mmap ra helpers · 4023e1eb
      Josef Bacik 提交于
      to #28718400
      
      commit 2a1180f1bd389e9d47693e5eb384b95f482d8d19 upstream.
      
      All of the arguments to these functions come from the vmf.
      
      Cut down on the amount of arguments passed by simply passing in the vmf
      to these two helpers.
      
      Link: http://lkml.kernel.org/r/20181211173801.29535-3-josef@toxicpanda.comSigned-off-by: NJosef Bacik <josef@toxicpanda.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NJan Kara <jack@suse.cz>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      4023e1eb
    • Y
      mm: unmap VM_PFNMAP mappings with optimized path · 7124f1ac
      Yang Shi 提交于
      to #28718400
      
      commit cb4922496ae40a775a1b17025eaa1060e8991253 upstream.
      
      When unmapping VM_PFNMAP mappings, vm flags need to be updated.  Since the
      vmas have been detached, so it sounds safe to update vm flags with read
      mmap_sem.
      
      Link: http://lkml.kernel.org/r/1537376621-51150-4-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      7124f1ac
    • Y
      mm: unmap VM_HUGETLB mappings with optimized path · 435ce551
      Yang Shi 提交于
      to #28718400
      
      commit b4cefb36051244bcb5651026d862c332a6cac7df upstream.
      
      When unmapping VM_HUGETLB mappings, vm flags need to be updated.  Since
      the vmas have been detached, so it sounds safe to update vm flags with
      read mmap_sem.
      
      Link: http://lkml.kernel.org/r/1537376621-51150-3-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      435ce551
    • Y
      mm: mmap: zap pages with read mmap_sem in munmap · 7027c305
      Yang Shi 提交于
      to #28718400
      
      commit dd2283f2605e3b3e9c61bcae844b34f2afa4813f upstream.
      
      Patch series "mm: zap pages with read mmap_sem in munmap for large
      mapping", v11.
      
      Background:
      Recently, when we ran some vm scalability tests on machines with large memory,
      we ran into a couple of mmap_sem scalability issues when unmapping large memory
      space, please refer to https://lkml.org/lkml/2017/12/14/733 and
      https://lkml.org/lkml/2018/2/20/576.
      
      History:
      Then akpm suggested to unmap large mapping section by section and drop mmap_sem
      at a time to mitigate it (see https://lkml.org/lkml/2018/3/6/784).
      
      V1 patch series was submitted to the mailing list per Andrew's suggestion
      (see https://lkml.org/lkml/2018/3/20/786).  Then I received a lot great
      feedback and suggestions.
      
      Then this topic was discussed on LSFMM summit 2018.  In the summit, Michal
      Hocko suggested (also in the v1 patches review) to try "two phases"
      approach.  Zapping pages with read mmap_sem, then doing via cleanup with
      write mmap_sem (for discussion detail, see
      https://lwn.net/Articles/753269/)
      
      Approach:
      Zapping pages is the most time consuming part, according to the suggestion from
      Michal Hocko [1], zapping pages can be done with holding read mmap_sem, like
      what MADV_DONTNEED does. Then re-acquire write mmap_sem to cleanup vmas.
      
      But, we can't call MADV_DONTNEED directly, since there are two major drawbacks:
        * The unexpected state from PF if it wins the race in the middle of munmap.
          It may return zero page, instead of the content or SIGSEGV.
        * Can't handle VM_LOCKED | VM_HUGETLB | VM_PFNMAP and uprobe mappings, which
          is a showstopper from akpm
      
      But, some part may need write mmap_sem, for example, vma splitting. So,
      the design is as follows:
              acquire write mmap_sem
              lookup vmas (find and split vmas)
              deal with special mappings
              detach vmas
              downgrade_write
      
              zap pages
              free page tables
              release mmap_sem
      
      The vm events with read mmap_sem may come in during page zapping, but
      since vmas have been detached before, they, i.e.  page fault, gup, etc,
      will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
      expected.
      
      If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
      mappings.  They will be handled by falling back to regular do_munmap()
      with exclusive mmap_sem held in this patch since they may update vm flags.
      
      But, with the "detach vmas first" approach, the vmas have been detached
      when vm flags are updated, so it sounds safe to update vm flags with read
      mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
      handled by using the optimized path in the following separate patches for
      bisectable sake.
      
      Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
      However it is fine to have false-positive MMF_RECALC_UPROBES according to
      uprobes developer.  So, uprobe unmap will not be handled by the regular
      path.
      
      With the "detach vmas first" approach we don't have to re-acquire mmap_sem
      again to clean up vmas to avoid race window which might get the address
      space changed since downgrade_write() doesn't release the lock to lead
      regression, which simply downgrades to read lock.
      
      And, since the lock acquire/release cost is managed to the minimum and
      almost as same as before, the optimization could be extended to any size
      of mapping without incurring significant penalty to small mappings.
      
      For the time being, just do this in munmap syscall path.  Other
      vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
      intact due to some implementation difficulties since they acquire write
      mmap_sem from very beginning and hold it until the end, do_munmap() might
      be called in the middle.  But, the optimized do_munmap would like to be
      called without mmap_sem held so that we can do the optimization.  So, if
      we want to do the similar optimization for mmap/mremap path, I'm afraid we
      would have to redesign them.  mremap might be called on very large area
      depending on the usecases, the optimization to it will be considered in
      the future.
      
      This patch (of 3):
      
      When running some mmap/munmap scalability tests with large memory (i.e.
      > 300GB), the below hung task issue may happen occasionally.
      
      INFO: task ps:14018 blocked for more than 120 seconds.
             Tainted: G            E 4.9.79-009.ali3000.alios7.x86_64 #1
       "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
      message.
       ps              D    0 14018      1 0x00000004
        ffff885582f84000 ffff885e8682f000 ffff880972943000 ffff885ebf499bc0
        ffff8828ee120000 ffffc900349bfca8 ffffffff817154d0 0000000000000040
        00ffffff812f872a ffff885ebf499bc0 024000d000948300 ffff880972943000
       Call Trace:
        [<ffffffff817154d0>] ? __schedule+0x250/0x730
        [<ffffffff817159e6>] schedule+0x36/0x80
        [<ffffffff81718560>] rwsem_down_read_failed+0xf0/0x150
        [<ffffffff81390a28>] call_rwsem_down_read_failed+0x18/0x30
        [<ffffffff81717db0>] down_read+0x20/0x40
        [<ffffffff812b9439>] proc_pid_cmdline_read+0xd9/0x4e0
        [<ffffffff81253c95>] ? do_filp_open+0xa5/0x100
        [<ffffffff81241d87>] __vfs_read+0x37/0x150
        [<ffffffff812f824b>] ? security_file_permission+0x9b/0xc0
        [<ffffffff81242266>] vfs_read+0x96/0x130
        [<ffffffff812437b5>] SyS_read+0x55/0xc0
        [<ffffffff8171a6da>] entry_SYSCALL_64_fastpath+0x1a/0xc5
      
      It is because munmap holds mmap_sem exclusively from very beginning to all
      the way down to the end, and doesn't release it in the middle.  When
      unmapping large mapping, it may take long time (take ~18 seconds to unmap
      320GB mapping with every single page mapped on an idle machine).
      
      Zapping pages is the most time consuming part, according to the suggestion
      from Michal Hocko [1], zapping pages can be done with holding read
      mmap_sem, like what MADV_DONTNEED does.  Then re-acquire write mmap_sem to
      cleanup vmas.
      
      But, some part may need write mmap_sem, for example, vma splitting. So,
      the design is as follows:
              acquire write mmap_sem
              lookup vmas (find and split vmas)
              deal with special mappings
              detach vmas
              downgrade_write
      
              zap pages
              free page tables
              release mmap_sem
      
      The vm events with read mmap_sem may come in during page zapping, but
      since vmas have been detached before, they, i.e.  page fault, gup, etc,
      will not be able to find valid vma, then just return SIGSEGV or -EFAULT as
      expected.
      
      If the vma has VM_HUGETLB | VM_PFNMAP, they are considered as special
      mappings.  They will be handled by without downgrading mmap_sem in this
      patch since they may update vm flags.
      
      But, with the "detach vmas first" approach, the vmas have been detached
      when vm flags are updated, so it sounds safe to update vm flags with read
      mmap_sem for this specific case.  So, VM_HUGETLB and VM_PFNMAP will be
      handled by using the optimized path in the following separate patches for
      bisectable sake.
      
      Unmapping uprobe areas may need update mm flags (MMF_RECALC_UPROBES).
      However it is fine to have false-positive MMF_RECALC_UPROBES according to
      uprobes developer.
      
      With the "detach vmas first" approach we don't have to re-acquire mmap_sem
      again to clean up vmas to avoid race window which might get the address
      space changed since downgrade_write() doesn't release the lock to lead
      regression, which simply downgrades to read lock.
      
      And, since the lock acquire/release cost is managed to the minimum and
      almost as same as before, the optimization could be extended to any size
      of mapping without incurring significant penalty to small mappings.
      
      For the time being, just do this in munmap syscall path.  Other
      vm_munmap() or do_munmap() call sites (i.e mmap, mremap, etc) remain
      intact due to some implementation difficulties since they acquire write
      mmap_sem from very beginning and hold it until the end, do_munmap() might
      be called in the middle.  But, the optimized do_munmap would like to be
      called without mmap_sem held so that we can do the optimization.  So, if
      we want to do the similar optimization for mmap/mremap path, I'm afraid we
      would have to redesign them.  mremap might be called on very large area
      depending on the usecases, the optimization to it will be considered in
      the future.
      
      With the patches, exclusive mmap_sem hold time when munmap a 80GB address
      space on a machine with 32 cores of E5-2680 @ 2.70GHz dropped to us level
      from second.
      
      munmap_test-15002 [008]   594.380138: funcgraph_entry: |
      __vm_munmap() {
      munmap_test-15002 [008]   594.380146: funcgraph_entry:      !2485684 us
      |    unmap_region();
      munmap_test-15002 [008]   596.865836: funcgraph_exit:       !2485692 us
      |  }
      
      Here the execution time of unmap_region() is used to evaluate the time of
      holding read mmap_sem, then the remaining time is used with holding
      exclusive lock.
      
      [1] https://lwn.net/Articles/753269/
      
      Link: http://lkml.kernel.org/r/1537376621-51150-2-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NYang Shi &lt;yang.shi@linux.alibaba.com&gt;Suggested-by: Michal Hocko <mhocko@kernel.org>
      Suggested-by: NKirill A. Shutemov <kirill@shutemov.name>
      Suggested-by: NMatthew Wilcox <willy@infradead.org>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Laurent Dufour <ldufour@linux.vnet.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      7027c305
    • E
      ACPICA: ACPI 6.3: MADT: add support for statistical profiling in GICC · 6299dc1e
      Erik Schmauss 提交于
      fix #26734090
      
      commit 31b184052a986dc8d80c878edeca574d4ffa1cf5 ACPICA
      
      Link: https://github.com/acpica/acpica/commit/31b18405Signed-off-by: NErik Schmauss <erik.schmauss@intel.com>
      Signed-off-by: NBob Moore <robert.moore@intel.com>
      Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NXin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      6299dc1e
    • J
      perf: arm_spe: Enable ACPI/Platform automatic module loading · ebd1fa46
      Jeremy Linton 提交于
      fix #26734090
      
      commit d482e575fbf0f7ec9319bded951f21bbc84312bf upstream
      
      Lets add the MODULE_TABLE and platform id_table entries so that
      the SPE driver can attach to the ACPI platform device created by
      the core pmu code.
      Tested-by: NHanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: NSudeep Holla <sudeep.holla@arm.com>
      Signed-off-by: NJeremy Linton <jeremy.linton@arm.com>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NXin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      ebd1fa46
    • J
      arm_pmu: acpi: spe: Add initial MADT/SPE probing · f1c96d2b
      Jeremy Linton 提交于
      fix #26734090
      
      commit d24a0c7099b32b6981d7f126c45348e381718350 upstream
      
      ACPI 6.3 adds additional fields to the MADT GICC
      structure to describe SPE PPI's. We pick these out
      of the cached reference to the madt_gicc structure
      similarly to the core PMU code. We then create a platform
      device referring to the IRQ and let the user/module loader
      decide whether to load the SPE driver.
      Tested-by: NHanjun Guo <hanjun.guo@linaro.org>
      Reviewed-by: NSudeep Holla <sudeep.holla@arm.com>
      Reviewed-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
      Signed-off-by: NJeremy Linton <jeremy.linton@arm.com>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NXin Hao <xhao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      f1c96d2b
    • H
      blk-iolatency: only call ktime_get() if needed · d02cca18
      Hongnan Li 提交于
      to #29139300
      
      commit 6e2fa4dd683a22a7697e7ff51dad499406094d28 upstream
      
      ktime_to_ns(ktime_get()), which is expensive, does not need to be called
      if blk_iolatency_enabled() return false in blkcg_iolatency_done_bio().
      Postponing ktime_to_ns(ktime_get()) execution reduces the CPU usage when
      blk_iolatency is disabled.
      Signed-off-by: NHongnan Li <hongnan.li@linux.alibaba.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      d02cca18
    • S
      ICX: platform/x86: ISST: Fix wrong unregister type · 7cd07433
      Srinivas Pandruvada 提交于
      fix #29131496
      
      commit 6cc8f6598978b8f30e70bc12f28fbbc9e26227cc upstream
      
      The MMIO driver is not unregistering with the correct type with the ISST
      common core during module removal. This should be unregistered with
      ISST_IF_DEV_MMIO instead of ISST_IF_DEV_MBOX.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NYouquan Song <youquan.song@intel.com>
      Signed-off-by: NZelin Deng <zelin.deng@linux.alibaba.com>
      Acked-by: NArtie Ding <artie.ding@linux.alibaba.com>
      7cd07433
    • S
      ICX: platform/x86: ISST: Allow additional core-power mailbox commands · 00db41e9
      Srinivas Pandruvada 提交于
      fix #29131496
      
      commit 9749b376be181a98c75b6c2093e6fc30d92e38cc upstream
      
      To discover core-power capability, some new mailbox commands are added.
      Allow those commands to execute.
      Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
      Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
      Signed-off-by: NYouquan Song <youquan.song@intel.com>
      Signed-off-by: NZelin Deng <zelin.deng@linux.alibaba.com>
      Acked-by: NArtie Ding <artie.ding@linux.alibaba.com>
      00db41e9
    • R
      perf stat: Fix shadow stats for clock events · 9a1de619
      Ravi Bangoria 提交于
      fix #29008298
      
      commit 57ddf09173c1e7d0511ead8924675c7198e56545 upstream
      
      Commit 0aa802a7 ("perf stat: Get rid of extra clock display
      function") introduced scale and unit for clock events. Thus,
      perf_stat__update_shadow_stats() now saves scaled values of clock events
      in msecs, instead of original nsecs. But while calculating values of
      shadow stats we still consider clock event values in nsecs. This results
      in a wrong shadow stat values. Ex,
      
        # ./perf stat -e task-clock,cycles ls
          <SNIP>
                    2.60 msec task-clock:u    #    0.877 CPUs utilized
               2,430,564      cycles:u        # 1215282.000 GHz
      
      Fix this by saving original nsec values for clock events in
      perf_stat__update_shadow_stats(). After patch:
      
        # ./perf stat -e task-clock,cycles ls
          <SNIP>
                    3.14 msec task-clock:u    #    0.839 CPUs utilized
               3,094,528      cycles:u        #    0.985 GHz
      Suggested-by: NJiri Olsa <jolsa@redhat.com>
      Reported-by: NAnton Blanchard <anton@samba.org>
      Signed-off-by: NRavi Bangoria <ravi.bangoria@linux.ibm.com>
      Reviewed-by: NJiri Olsa <jolsa@kernel.org>
      Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Cc: Jin Yao <yao.jin@linux.intel.com>
      Cc: Namhyung Kim <namhyung@kernel.org>
      Cc: Thomas Richter <tmricht@linux.vnet.ibm.com>
      Cc: yuzhoujian@didichuxing.com
      Fixes: 0aa802a7 ("perf stat: Get rid of extra clock display function")
      Link: http://lkml.kernel.org/r/20181116042843.24067-1-ravi.bangoria@linux.ibm.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Acked-by: NShanpei Chen <shanpeic@linux.alibaba.com>
      9a1de619
    • W
      arm64: fix kernel stack overflow in kdump capture kernel · d5a3153a
      Wei Li 提交于
      task #25552995
      
      commit e1d22385ea6686ff3dcd7092d84465c193849829 upstream.
      
      When enabling ARM64_PSEUDO_NMI feature in kdump capture kernel, it will
      report a kernel stack overflow exception:
      
      [    0.000000] CPU features: detected: IRQ priority masking
      [    0.000000] alternatives: patching kernel code
      [    0.000000] Insufficient stack space to handle exception!
      [    0.000000] ESR: 0x96000044 -- DABT (current EL)
      [    0.000000] FAR: 0x0000000000000040
      [    0.000000] Task stack:     [0xffff0000097f0000..0xffff0000097f4000]
      [    0.000000] IRQ stack:      [0x0000000000000000..0x0000000000004000]
      [    0.000000] Overflow stack: [0xffff80002b7cf290..0xffff80002b7d0290]
      [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.34-lw+ #3
      [    0.000000] pstate: 400003c5 (nZcv DAIF -PAN -UAO)
      [    0.000000] pc : el1_sync+0x0/0xb8
      [    0.000000] lr : el1_irq+0xb8/0x140
      [    0.000000] sp : 0000000000000040
      [    0.000000] pmr_save: 00000070
      [    0.000000] x29: ffff0000097f3f60 x28: ffff000009806240
      [    0.000000] x27: 0000000080000000 x26: 0000000000004000
      [    0.000000] x25: 0000000000000000 x24: ffff000009329028
      [    0.000000] x23: 0000000040000005 x22: ffff000008095c6c
      [    0.000000] x21: ffff0000097f3f70 x20: 0000000000000070
      [    0.000000] x19: ffff0000097f3e30 x18: ffffffffffffffff
      [    0.000000] x17: 0000000000000000 x16: 0000000000000000
      [    0.000000] x15: ffff0000097f9708 x14: ffff000089a382ef
      [    0.000000] x13: ffff000009a382fd x12: ffff000009824000
      [    0.000000] x11: ffff0000097fb7b0 x10: ffff000008730028
      [    0.000000] x9 : ffff000009440018 x8 : 000000000000000d
      [    0.000000] x7 : 6b20676e69686374 x6 : 000000000000003b
      [    0.000000] x5 : 0000000000000000 x4 : ffff000008093600
      [    0.000000] x3 : 0000000400000008 x2 : 7db2e689fc2b8e00
      [    0.000000] x1 : 0000000000000000 x0 : ffff0000097f3e30
      [    0.000000] Kernel panic - not syncing: kernel stack overflow
      [    0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.34-lw+ #3
      [    0.000000] Call trace:
      [    0.000000]  dump_backtrace+0x0/0x1b8
      [    0.000000]  show_stack+0x24/0x30
      [    0.000000]  dump_stack+0xa8/0xcc
      [    0.000000]  panic+0x134/0x30c
      [    0.000000]  __stack_chk_fail+0x0/0x28
      [    0.000000]  handle_bad_stack+0xfc/0x108
      [    0.000000]  __bad_stack+0x90/0x94
      [    0.000000]  el1_sync+0x0/0xb8
      [    0.000000]  init_gic_priority_masking+0x4c/0x70
      [    0.000000]  smp_prepare_boot_cpu+0x60/0x68
      [    0.000000]  start_kernel+0x1e8/0x53c
      [    0.000000] ---[ end Kernel panic - not syncing: kernel stack overflow ]---
      
      The reason is init_gic_priority_masking() may unmask PSR.I while the
      irq stacks are not inited yet. Some "NMI" could be raised unfortunately
      and it will just go into this exception.
      
      In this patch, we just write the PMR in smp_prepare_boot_cpu(), and delay
      unmasking PSR.I after irq stacks inited in init_IRQ().
      
      Fixes: e79321883842 ("arm64: Switch to PMR masking when starting CPUs")
      Cc: Will Deacon <will.deacon@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NWei Li <liwei391@huawei.com>
      [JT: make init_gic_priority_masking() not modify daif, rebase on other
           priority masking fixes]
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      d5a3153a
    • M
      arm64: Relax ICC_PMR_EL1 accesses when ICC_CTLR_EL1.PMHE is clear · 625b8a72
      Marc Zyngier 提交于
      task #25552995
      
      commit f226650494c6aa87526d12135b7de8b8c074f3de upstream.
      
      The GICv3 architecture specification is incredibly misleading when it
      comes to PMR and the requirement for a DSB. It turns out that this DSB
      is only required if the CPU interface sends an Upstream Control
      message to the redistributor in order to update the RD's view of PMR.
      
      This message is only sent when ICC_CTLR_EL1.PMHE is set, which isn't
      the case in Linux. It can still be set from EL3, so some special care
      is required. But the upshot is that in the (hopefuly large) majority
      of the cases, we can drop the DSB altogether.
      
      This relies on a new static key being set if the boot CPU has PMHE
      set. The drawback is that this static key has to be exported to
      modules.
      
      Cc: Will Deacon <will@kernel.org>
      Cc: James Morse <james.morse@arm.com>
      Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: NMarc Zyngier <maz@kernel.org>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      625b8a72
    • J
      arm64: Lower priority mask for GIC_PRIO_IRQON · b534b226
      Julien Thierry 提交于
      task #25552995
      
      commit 677379bc9139ac24b310a281fcb21a2f04288353 upstream.
      
      On a system with two security states, if SCR_EL3.FIQ is cleared,
      non-secure IRQ priorities get shifted to fit the secure view but
      priority masks aren't.
      
      On such system, it turns out that GIC_PRIO_IRQON masks the priority of
      normal interrupts, which obviously ends up in a hang.
      
      Increase GIC_PRIO_IRQON value (i.e. lower priority) to make sure
      interrupts are not blocked by it.
      
      Cc: Oleg Nesterov <oleg@redhat.com>
      Fixes: bd82d4bd21880b7c ("arm64: Fix incorrect irqflag restore for priority masking")
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NJulien Thierry <julien.thierry.kdev@gmail.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      [will: fixed Fixes: tag]
      Signed-off-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      b534b226
    • J
      arm64: Fix incorrect irqflag restore for priority masking for compat · 89d26aa9
      James Morse 提交于
      task #25552995
      
      commit f46f27a576cc3b1e3d45ea50bc06287aa46b04b2 upstream.
      
      Commit bd82d4bd2188 ("arm64: Fix incorrect irqflag restore for priority
      masking") added a macro to the entry.S call paths that leave the
      PSTATE.I bit set. This tells the pPNMI masking logic that interrupts
      are masked by the CPU, not by the PMR. This value is read back by
      local_daif_save().
      
      Commit bd82d4bd2188 added this call to el0_svc, as el0_svc_handler
      is called with interrupts masked. el0_svc_compat was missed, but should
      be covered in the same way as both of these paths end up in
      el0_svc_common(), which expects to unmask interrupts.
      
      Fixes: bd82d4bd2188 ("arm64: Fix incorrect irqflag restore for priority masking")
      Signed-off-by: NJames Morse <james.morse@arm.com>
      Cc: Julien Thierry <julien.thierry.kdev@gmail.com>
      Signed-off-by: NWill Deacon <will@kernel.org>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      89d26aa9
    • J
      arm64: irqflags: Introduce explicit debugging for IRQ priorities · dce43ff7
      Julien Thierry 提交于
      Fix #25552995
      
      commit 48ce8f80f5901f1f031b00be66d659d39f33b0a1 upstream.
      
      Using IRQ priority masking to enable/disable interrupts is a bit
      sensitive as it requires to deal with both ICC_PMR_EL1 and PSR.I.
      
      Introduce some validity checks to both highlight the states in which
      functions dealing with IRQ enabling/disabling can (not) be called, and
      bark a warning when called in an unexpected state.
      
      Since these checks are done on hotpaths, introduce a build option to
      choose whether to do the checking.
      
      Cc: Will Deacon <will.deacon@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      dce43ff7
    • J
      arm64: Fix incorrect irqflag restore for priority masking · d53b2738
      Julien Thierry 提交于
      task #25552995
      
      commit bd82d4bd21880b7c4d5f5756be435095d6ae07b5 upstream.
      
      When using IRQ priority masking to disable interrupts, in order to deal
      with the PSR.I state, local_irq_save() would convert the I bit into a
      PMR value (GIC_PRIO_IRQOFF). This resulted in local_irq_restore()
      potentially modifying the value of PMR in undesired location due to the
      state of PSR.I upon flag saving [1].
      
      In an attempt to solve this issue in a less hackish manner, introduce
      a bit (GIC_PRIO_IGNORE_PMR) for the PMR values that can represent
      whether PSR.I is being used to disable interrupts, in which case it
      takes precedence of the status of interrupt masking via PMR.
      
      GIC_PRIO_PSR_I_SET is chosen such that (<pmr_value> |
      GIC_PRIO_PSR_I_SET) does not mask more interrupts than <pmr_value> as
      some sections (e.g. arch_cpu_idle(), interrupt acknowledge path)
      requires PMR not to mask interrupts that could be signaled to the
      CPU when using only PSR.I.
      
      [1] https://www.spinics.net/lists/arm-kernel/msg716956.html
      
      Fixes: 4a503217ce37 ("arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking")
      Cc: <stable@vger.kernel.org> # 5.1.x-
      Reported-by: NZenghui Yu <yuzenghui@huawei.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Wei Li <liwei391@huawei.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Christoffer Dall <christoffer.dall@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Suzuki K Pouloze <suzuki.poulose@arm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      d53b2738
    • J
      arm64: Fix interrupt tracing in the presence of NMIs · bf4c79db
      Julien Thierry 提交于
      task #25552995
      
      commit 17ce302f3117e9518395847a3120c8a108b587b8 upstream.
      
      In the presence of any form of instrumentation, nmi_enter() should be
      done before calling any traceable code and any instrumentation code.
      
      Currently, nmi_enter() is done in handle_domain_nmi(), which is much
      too late as instrumentation code might get called before. Move the
      nmi_enter/exit() calls to the arch IRQ vector handler.
      
      On arm64, it is not possible to know if the IRQ vector handler was
      called because of an NMI before acknowledging the interrupt. However, It
      is possible to know whether normal interrupts could be taken in the
      interrupted context (i.e. if taking an NMI in that context could
      introduce a potential race condition).
      
      When interrupting a context with IRQs disabled, call nmi_enter() as soon
      as possible. In contexts with IRQs enabled, defer this to the interrupt
      controller, which is in a better position to know if an interrupt taken
      is an NMI.
      
      Fixes: bc3c03ccb464 ("arm64: Enable the support of pseudo-NMIs")
      Cc: <stable@vger.kernel.org> # 5.1.x-
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      bf4c79db
    • J
      arm64: irqflags: Add condition flags to inline asm clobber list · ea17bf4a
      Julien Thierry 提交于
      task #25552995
      
      commit f57065782f245ca96f1472209a485073bbc11247 upstream.
      
      Some of the inline assembly instruction use the condition flags and need
      to include "cc" in the clobber list.
      
      Fixes: 4a503217ce37 ("arm64: irqflags: Use ICC_PMR_EL1 for interrupt masking")
      Cc: <stable@vger.kernel.org> # 5.1.x-
      Suggested-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      ea17bf4a
    • J
      arm64: irqflags: Pass flags as readonly operand to restore instruction · df76e7b7
      Julien Thierry 提交于
      task #25552995
      
      commit 19c36b185a1d13f79f3a382e08695a2633155e5a upstream.
      
      Flags are only read by the instructions doing the irqflags restore
      operation. Pass the operand as read only to the asm inline instead of
      read-write.
      
      Cc: Will Deacon <will.deacon@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Acked-by: NMark Rutland <mark.rutland@ar.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      df76e7b7
    • K
      arm64: sysreg: Make mrs_s and msr_s macros work with Clang and LTO · df594b47
      Kees Cook 提交于
      task #25552995
      
      commit be604c616ca71cbf5c860d0cfa4595128ab74189 upstream.
      
      Clang's integrated assembler does not allow assembly macros defined
      in one inline asm block using the .macro directive to be used across
      separate asm blocks. LLVM developers consider this a feature and not a
      bug, recommending code refactoring:
      
        https://bugs.llvm.org/show_bug.cgi?id=19749
      
      As binutils doesn't allow macros to be redefined, this change uses
      UNDEFINE_MRS_S and UNDEFINE_MSR_S to define corresponding macros
      in-place and workaround gcc and clang limitations on redefining macros
      across different assembler blocks.
      
      Specifically, the current state after preprocessing looks like this:
      
      asm volatile(".macro mXX_s ... .endm");
      void f()
      {
      	asm volatile("mXX_s a, b");
      }
      
      With GCC, it gives macro redefinition error because sysreg.h is included
      in multiple source files, and assembler code for all of them is later
      combined for LTO (I've seen an intermediate file with hundreds of
      identical definitions).
      
      With clang, it gives macro undefined error because clang doesn't allow
      sharing macros between inline asm statements.
      
      I also seem to remember catching another sort of undefined error with
      GCC due to reordering of macro definition asm statement and generated
      asm code for function that uses the macro.
      
      The solution with defining and undefining for each use, while certainly
      not elegant, satisfies both GCC and clang, LTO and non-LTO.
      Co-developed-by: NAlex Matveev <alxmtvv@gmail.com>
      Co-developed-by: NYury Norov <ynorov@caviumnetworks.com>
      Co-developed-by: NSami Tolvanen <samitolvanen@google.com>
      Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
      Reviewed-by: NMark Rutland <mark.rutland@arm.com>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      df594b47
    • J
      arm64: irqflags: Fix clang build warnings · 20c588c5
      Julien Thierry 提交于
      task #25552995
      
      commit a80554fc36ba41d96af8e72fb54cd5d490e06c54 upstream
      
      Clang complains when passing asm operands that are smaller than the
      registers they are mapped to:
      
      arch/arm64/include/asm/irqflags.h:50:10: warning: value size does not
      	match register size specified by the constraint and modifier
      	[-Wasm-operand-widths]
                      : "r" (GIC_PRIO_IRQON)
      
      Fix it by casting the affected input operands to a type of the correct
      size.
      Reported-by: NNathan Chancellor <natechancellor@gmail.com>
      Tested-by: NNathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      20c588c5
    • J
      arm64: Enable the support of pseudo-NMIs · beaa4f75
      Julien Thierry 提交于
      task #25552995
      
      commit bc3c03ccb4641fb940b27a0d369431876923a8fe upstream
      
      Add a build option and a command line parameter to build and enable the
      support of pseudo-NMIs.
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Suggested-by: NDaniel Thompson <daniel.thompson@linaro.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      beaa4f75
    • Z
      configs: aarch64: add PSEUDO_NMI configuration item · b20a91fe
      Zou Cao 提交于
      task #25552995
      
      add PSEUDO_NMI configuration item
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      b20a91fe
    • J
      arm64: Skip irqflags tracing for NMI in IRQs disabled context · b63f21bc
      Julien Thierry 提交于
      task #25552995
      
      commit c25349fd3c8024cfebcc9b01ee6cfb093fab9be0 upstream
      
      When an NMI is raised while interrupts where disabled, the IRQ tracing
      already is in the correct state (i.e. hardirqs_off) and should be left
      as such when returning to the interrupted context.
      
      Check whether PMR was masking interrupts when the NMI was raised and
      skip IRQ tracing if necessary.
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      b63f21bc
    • J
      arm64: Skip preemption when exiting an NMI · dce9c126
      Julien Thierry 提交于
      task #25552995
      
      commit 1234ad686fb1bde5a9c2447fc4c9df8430358763 upstream
      
      Handling of an NMI should not set any TIF flags. For NMIs received from
      EL0 the current exit path is safe to use.
      
      However, an NMI received at EL1 could have interrupted some task context
      that has set the TIF_NEED_RESCHED flag. Preempting a task should not
      happen as a result of an NMI.
      
      Skip preemption after handling an NMI from EL1.
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      dce9c126
    • J
      arm64: Handle serror in NMI context · 3f44beda
      Julien Thierry 提交于
      task #25552995
      
      commit 7d31464adf20fb8c075a3a3dfe2002a195566510 upstream
      
      Per definition of the daifflags, Serrors can occur during any interrupt
      context, that includes NMI contexts. Trying to nmi_enter in an nmi context
      will crash.
      
      Skip nmi_enter/nmi_exit when serror occurred during an NMI.
      Suggested-by: NJames Morse <james.morse@arm.com>
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Dave Martin <dave.martin@arm.com>
      Cc: James Morse <james.morse@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      3f44beda
    • J
      irqchip/gic-v3: Allow interrupts to be set as pseudo-NMI · b50b9a7b
      Julien Thierry 提交于
      task #25552995
      
      commit 101b35f7def1775bf589d86676983bc359843916 upstream
      
      Implement NMI callbacks for GICv3 irqchip. Install NMI safe handlers
      when setting up interrupt line as NMI.
      
      Only SPIs and PPIs are allowed to be set up as NMI.
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Reviewed-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      b50b9a7b
    • J
      irqchip/gic-v3: Handle pseudo-NMIs · 53fa25a9
      Julien Thierry 提交于
      task #25552995
      
      commit f32c926651dcd1683f4d896ee52609000a62a3dc upstream
      
      Provide a higher priority to be used for pseudo-NMIs. When such an
      interrupt is received, keep interrupts fully disabled at CPU level to
      prevent receiving other pseudo-NMIs while handling the current one.
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      53fa25a9
    • J
      irqchip/gic-v3: Detect if GIC can support pseudo-NMIs · 7bf70240
      Julien Thierry 提交于
      task #25552995
      
      commit d98d0a990ca1446d3c0ca8f0b9ac127a66e40cdf upstream
      
      The values non secure EL1 needs to use for PMR and RPR registers depends on
      the value of SCR_EL3.FIQ.
      
      The values non secure EL1 sees from the distributor and redistributor
      depend on whether security is enabled for the GIC or not.
      
      To avoid having to deal with two sets of values for PMR
      masking/unmasking, only enable pseudo-NMIs when GIC has non-secure view
      of priorities.
      
      Also, add firmware requirements related to SCR_EL3.
      Signed-off-by: NJulien Thierry <julien.thierry@arm.com>
      Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Jason Cooper <jason@lakedaemon.net>
      Cc: Marc Zyngier <marc.zyngier@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
      Reviewed-by: Nluanshi <zhangliguang@linux.alibaba.com>
      7bf70240