1. 30 Jun 2021 (1 commit)
  2. 07 May 2021 (1 commit)
  3. 06 May 2021 (7 commits)
  4. 25 Feb 2021 (7 commits)
  5. 06 Feb 2021 (1 commit)
    • mm, compaction: move high_pfn to the for loop scope · 74e21484
      Rokudo Yan authored
      In fast_isolate_freepages(), high_pfn is used if a preferred page (i.e.
      one with PFN >= low_pfn) is not found.
      
      But high_pfn is not reset before searching each free area, so when it is
      used as the freepage it may come from a different free area than the one
      searched last.  As a result, move_freelist_head(freelist, freepage) has
      unexpected behavior (e.g. it can corrupt the MOVABLE freelist):
      
        Unable to handle kernel paging request at virtual address dead000000000200
        Mem abort info:
          ESR = 0x96000044
          Exception class = DABT (current EL), IL = 32 bits
          SET = 0, FnV = 0
          EA = 0, S1PTW = 0
        Data abort info:
          ISV = 0, ISS = 0x00000044
          CM = 0, WnR = 1
        [dead000000000200] address between user and kernel address ranges
      
        -000|list_cut_before(inline)
        -000|move_freelist_head(inline)
        -000|fast_isolate_freepages(inline)
        -000|isolate_freepages(inline)
        -000|compaction_alloc(?, ?)
        -001|unmap_and_move(inline)
        -001|migrate_pages([NSD:0xFFFFFF80088CBBD0] from = 0xFFFFFF80088CBD88, [NSD:0xFFFFFF80088CBBC8] get_new_p
        -002|__read_once_size(inline)
        -002|static_key_count(inline)
        -002|static_key_false(inline)
        -002|trace_mm_compaction_migratepages(inline)
        -002|compact_zone(?, [NSD:0xFFFFFF80088CBCB0] capc = 0x0)
        -003|kcompactd_do_work(inline)
        -003|kcompactd([X19] p = 0xFFFFFF93227FBC40)
        -004|kthread([X20] _create = 0xFFFFFFE1AFB26380)
        -005|ret_from_fork(asm)
      
      The issue was reported on a smartphone product with 6GB of RAM and 3GB of
      zram as the swap device.
      
      This patch fixes the issue by resetting high_pfn before searching each
      free area, which ensures that freepage and freelist match when
      move_freelist_head() is called in fast_isolate_freepages().
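      
      In rough outline, the fix gives high_pfn loop scope so that every searched
      free area starts from a clean state. An illustrative sketch of the idea
      (not the exact diff against mm/compaction.c):
      
        /* Sketch only: simplified from fast_isolate_freepages() */
        for (order = cc->order - 1; order >= 0 && !page; order--) {
                struct free_area *area = &cc->zone->free_area[order];
                /*
                 * high_pfn is declared and reset inside the loop, so a fallback
                 * candidate recorded while scanning one free area can no longer
                 * leak into the search of the next one.
                 */
                unsigned long high_pfn = 0;
      
                /* ... scan 'area', possibly recording a candidate in high_pfn ... */
        }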
      
      Link: http://lkml.kernel.org/r/20190118175136.31341-12-mgorman@techsingularity.net
      Link: https://lkml.kernel.org/r/20210112094720.1238444-1-wu-yan@tcl.com
      Fixes: 5a811889 ("mm, compaction: use free lists to quickly locate a migration target")
      Signed-off-by: Rokudo Yan <wu-yan@tcl.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      74e21484
  6. 16 Dec 2020 (5 commits)
  7. 15 Nov 2020 (2 commits)
  8. 17 Oct 2020 (1 commit)
  9. 14 Oct 2020 (1 commit)
  10. 15 Aug 2020 (1 commit)
  11. 13 Aug 2020 (5 commits)
    • mm/compaction: correct the comments of compact_defer_shift · 860b3272
      Alex Shi authored
      There is no compact_defer_limit; the field actually in use is
      compact_defer_shift.  Also add an explanation of compact_order_failed.
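      
      For context, the way the two fields interact can be paraphrased roughly as
      follows (a simplified sketch, not a verbatim copy of mm/compaction.c):
      
        /* Simplified paraphrase of the compaction deferral check */
        static bool compaction_deferred_sketch(struct zone *zone, int order)
        {
                /* the limit is derived from compact_defer_shift on the fly */
                unsigned long defer_limit = 1UL << zone->compact_defer_shift;
      
                /* orders below the last order that failed are never deferred */
                if (order < zone->compact_order_failed)
                        return false;
      
                /* defer until enough attempts have been "considered" */
                return ++zone->compact_considered < defer_limit;
        }
      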
      Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Link: http://lkml.kernel.org/r/3bd60e1b-a74e-050d-ade4-6e8f54e00b92@linux.alibaba.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      860b3272
    • mm: use unsigned types for fragmentation score · d34c0a75
      Nitin Gupta authored
      Proactive compaction uses a per-node/zone "fragmentation score" which is
      always in the range [0, 100], so use an unsigned type for these scores as
      well as for the related constants.
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d34c0a75
    • mm: fix compile error due to COMPACTION_HPAGE_ORDER · 25788738
      Nitin Gupta authored
      Fix a compile error seen when COMPACTION_HPAGE_ORDER is assigned to
      HUGETLB_PAGE_ORDER.  The correct way to check whether this constant is
      defined is to check for CONFIG_HUGETLBFS.
      Reported-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200623064544.25766-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      25788738
    • mm: proactive compaction · facdaa91
      Nitin Gupta authored
      For some applications, we need to allocate almost all memory as hugepages.
      However, on a running system, higher-order allocations can fail if the
      memory is fragmented.  The Linux kernel currently does on-demand compaction
      as we request more hugepages, but this style of compaction incurs very high
      latency.  Experiments with one-time full memory compaction (followed by
      hugepage allocations) show that the kernel is able to restore a highly
      fragmented memory state to a fairly compacted memory state within <1 sec
      for a 32G system.  Such data suggests that more proactive compaction can
      help us allocate a large fraction of memory as hugepages while keeping
      allocation latencies low.
      
      For a more proactive compaction, the approach taken here is to define a
      new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
      external fragmentation which kcompactd tries to maintain.
      
      The tunable takes a value in range [0, 100], with a default of 20.
      
      Note that a previous version of this patch [1] was found to introduce too
      many tunables (per-order extfrag{low, high}), but this one reduces them to
      just one sysctl.  Also, the new tunable is an opaque value instead of
      asking for specific bounds of "external fragmentation", which would have
      been difficult to estimate.  The internal interpretation of this opaque
      value allows for future fine-tuning.
      
      Currently, we use a simple translation from this tunable to [low, high]
      "fragmentation score" thresholds (low = 100 - proactiveness, high = low + 10).
      The score for a node is defined as the weighted mean of the per-zone
      external fragmentation, where a zone's present_pages determines its weight.
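      
      In rough C terms, the scheme just described looks like the following (a
      simplified sketch; helper and variable names are illustrative, and
      zone_extfrag() stands in for the per-zone external fragmentation
      computation):
      
        /* Sketch of the node score: per-zone extfrag weighted by present_pages */
        static unsigned int node_fragmentation_score(pg_data_t *pgdat)
        {
                unsigned int score = 0;
                int zoneid;
      
                for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
                        struct zone *zone = &pgdat->node_zones[zoneid];
      
                        score += zone_extfrag(zone) * zone->present_pages /
                                 (pgdat->node_present_pages + 1);
                }
                return score;   /* stays in [0, 100] */
        }
      
        /* low = 100 - proactiveness, high = low + 10 */
        static unsigned int score_wmark_sketch(bool low)
        {
                unsigned int wmark_low = 100U - sysctl_compaction_proactiveness;
      
                return low ? wmark_low : min(wmark_low + 10U, 100U);
        }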
      
      To periodically check the per-node score, we reuse the per-node kcompactd
      threads, which are woken up every 500 milliseconds to do so.  If a node's
      score exceeds its high threshold (as derived from the user-provided
      proactiveness value), proactive compaction is started and continues until
      the score drops to its low threshold.  By default, proactiveness is set to
      20, which implies threshold values of low=80 and high=90.
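      
      The resulting control loop in kcompactd then looks roughly like this (a
      sketch reusing the helpers above; the real code also handles freezing and
      regular compaction wakeups):
      
        /* Sketch of the kcompactd proactive-compaction loop */
        while (!kthread_should_stop()) {
                /* sleep ~500 ms, or until a normal compaction request arrives */
                wait_event_timeout(pgdat->kcompactd_wait,
                                   kcompactd_work_requested(pgdat),
                                   msecs_to_jiffies(500));
      
                if (node_fragmentation_score(pgdat) >
                    score_wmark_sketch(false /* high watermark */))
                        /* compact until the score falls back to the low mark */
                        proactive_compact_node(pgdat);
        }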
      
      This patch is largely based on ideas from Michal Hocko [2].  See also the
      LWN article [3].
      
      Performance data
      ================
      
      System: x86_64, 1T RAM, 80 CPU threads.
      Kernel: 5.6.0-rc3 + this patch
      
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
      
      Before starting the driver, the system was fragmented from a userspace
      program that allocates all memory and then for each 2M aligned section,
      frees 3/4 of base pages using munmap.  The workload is mainly anonymous
      userspace pages, which are easy to move around.  I intentionally avoided
      unmovable pages in this test to see how much latency we incur when
      hugepage allocations hit direct compaction.
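      
      One way to implement such a fragmenter is sketched below (an illustrative
      userspace program, not the actual tool used for these measurements; it
      keeps every fourth 4K page of each 2M section mapped, and 2M alignment of
      the mapping as well as raising vm.max_map_count are glossed over):
      
        /* Illustrative fragmenter: unmap 3 of every 4 base pages per 2M chunk */
        #include <unistd.h>
        #include <sys/mman.h>
      
        #define CHUNK   (2UL << 20)     /* 2M sections */
        #define PAGE    4096UL
      
        int main(void)
        {
                /* size this to roughly all free RAM on the test machine */
                size_t len = 64UL << 30;
                char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
      
                if (buf == MAP_FAILED)
                        return 1;
      
                for (size_t off = 0; off < len; off += CHUNK)
                        for (size_t p = 0; p < CHUNK; p += 4 * PAGE)
                                /* free three pages, keep the fourth resident */
                                munmap(buf + off + p, 3 * PAGE);
      
                pause();        /* hold remaining pages so memory stays fragmented */
                return 0;
        }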
      
      1. Kernel hugepage allocation latencies
      
      With the system in such a fragmented state, a kernel driver then allocates
      as many hugepages as possible and measures allocation latency:
      
      (all latency values are in microseconds)
      
      - With vanilla 5.6.0-rc3
      
        percentile latency
        –––––––––– –––––––
      	   5    7894
      	  10    9496
      	  25   12561
      	  30   15295
      	  40   18244
      	  50   21229
      	  60   27556
      	  75   30147
      	  80   31047
      	  90   32859
      	  95   33799
      
      Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      sysctl -w vm.compaction_proactiveness=20
      
        percentile latency
        –––––––––– –––––––
      	   5       2
      	  10       2
      	  25       3
      	  30       3
      	  40       3
      	  50       4
      	  60       4
      	  75       4
      	  80       4
      	  90       5
      	  95     429
      
      Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      2. JAVA heap allocation
      
      In this test, we first fragment memory using the same method as for (1).
      
      Then, we start a Java process with a heap size set to 700G and request the
      heap to be allocated with THP hugepages.  We also set THP to madvise to
      allow hugepage backing of this heap.
      
      /usr/bin/time
       java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
      
      The above command allocates 700G of Java heap using hugepages.
      
      - With vanilla 5.6.0-rc3
      
      17.39user 1666.48system 27:37.89elapsed
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      8.35user 194.58system 3:19.62elapsed
      
      Elapsed time remains around 3:15, as proactiveness is further increased.
      
      Note that proactive compaction happens throughout the runtime of these
      workloads.  The situation of one-time compaction, sufficient to supply
      hugepages for following allocation stream, can probably happen for more
      extreme proactiveness values, like 80 or 90.
      
      In the above Java workload, proactiveness is set to 20.  The test starts
      with a node's score of 80 or higher, depending on the delay between the
      fragmentation step and starting the benchmark, which gives more or less
      time for the initial round of compaction.  As the benchmark consumes
      hugepages, the node's score quickly rises above the high threshold (90) and
      proactive compaction starts again, which brings the score back down to the
      low threshold level (80).  Repeat.
      
      bpftrace also confirms proactive compaction running 20+ times during the
      runtime of this Java benchmark.  A kcompactd thread consumes 100% of one
      CPU while it tries to bring its node's score within the thresholds.
      
      Backoff behavior
      ================
      
      The above workloads produce a memory state which is easy to compact.  However,
      if memory is filled with unmovable pages, proactive compaction should
      essentially back off.  To test this aspect:
      
      - Created a kernel driver that allocates almost all memory as hugepages
        and then frees the first 3/4 of each hugepage.
      - Set proactiveness=40.
      - Observed that proactive_compact_node() is deferred the maximum number of
        times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
        (=> ~30 seconds between retries).
      
      [1] https://patchwork.kernel.org/patch/11098289/
      [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
      [3] https://lwn.net/Articles/817905/
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nitin Gupta <ngupta@nitingupta.dev>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      facdaa91
  12. 26 Jun 2020 (1 commit)
    • mm, compaction: make capture control handling safe wrt interrupts · b9e20f0d
      Vlastimil Babka authored
      Hugh reports:
      
       "While stressing compaction, one run oopsed on NULL capc->cc in
        __free_one_page()'s task_capc(zone): compact_zone_order() had been
        interrupted, and a page was being freed in the return from interrupt.
      
        Though you would not expect it from the source, both gccs I was using
        (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
        ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
        callq compact_zone - long after the "current->capture_control =
        &capc". An interrupt in between those finds capc->cc NULL (zeroed by
        an earlier rep stos).
      
        This could presumably be fixed by a barrier() before setting
        current->capture_control in compact_zone_order(); but would also need
        more care on return from compact_zone(), in order not to risk leaking
        a page captured by interrupt just before capture_control is reset.
      
        Maybe that is the preferable fix, but I felt safer for task_capc() to
        exclude the rather surprising possibility of capture at interrupt
        time"
      
      I have checked that gcc10 also behaves the same.
      
      The advantage of the fix in compact_zone_order() is that we don't add
      another test to the page freeing hot path, and that it might prevent future
      problems by no longer exposing pointers to uninitialized structures in the
      current task.
      
      So this patch implements the suggestion for compact_zone_order() with
      barrier() (and WRITE_ONCE() to prevent store tearing) for setting
      current->capture_control, and prevents page leaking with
      WRITE_ONCE/READ_ONCE in the proper order.
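      
      In outline, the resulting ordering in compact_zone_order() is (a sketch of
      the pattern, not the verbatim diff):
      
        struct capture_control capc = {
                .cc = &cc,
                .page = NULL,
        };
      
        /*
         * Make sure capc is fully initialised before it becomes visible to an
         * interrupt through current->capture_control.
         */
        barrier();
        WRITE_ONCE(current->capture_control, &capc);
      
        ret = compact_zone(&cc, &capc);
      
        /*
         * Hide capture_control first, then read the captured page, so that an
         * interrupt arriving in between cannot capture a page we would leak.
         */
        WRITE_ONCE(current->capture_control, NULL);
        *capture = READ_ONCE(capc.page);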
      
      Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
      Fixes: 5e1f0f09 ("mm, compaction: capture a page under direct compaction")
      Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
      Reported-by: Hugh Dickins <hughd@google.com>
      Suggested-by: Hugh Dickins <hughd@google.com>
      Acked-by: Hugh Dickins <hughd@google.com>
      Cc: Alex Shi <alex.shi@linux.alibaba.com>
      Cc: Li Wang <liwang@redhat.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: <stable@vger.kernel.org>	[5.1+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b9e20f0d
  13. 05 Jun 2020 (1 commit)
  14. 04 Jun 2020 (3 commits)
  15. 28 May 2020 (1 commit)
    • mm/swap: Use local_lock for protection · b01b2141
      Ingo Molnar authored
      The various struct pagevec per CPU variables are protected by disabling
      either preemption or interrupts across the critical sections. Inside
      these sections spinlocks have to be acquired.
      
      These spinlocks are regular spinlock_t types which are converted to
      "sleeping" spinlocks on PREEMPT_RT enabled kernels. Obviously sleeping
      locks cannot be acquired in preemption or interrupt disabled sections.
      
      local locks provide a trivial way to substitute preempt and interrupt
      disable instances. On a non PREEMPT_RT enabled kernel local_lock() maps
      to preempt_disable() and local_lock_irq() to local_irq_disable().
      
      Create lru_rotate_pvecs containing the pagevec and the locallock.
      Create lru_pvecs containing the remaining pagevecs and the locallock.
      Add lru_add_drain_cpu_zone() which is used from compact_zone() to avoid
      exporting the pvec structure.
      
      Change the relevant call sites to acquire these locks instead of using
      preempt_disable() / get_cpu() / get_cpu_var() and local_irq_disable() /
      local_irq_save().
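      
      A rough sketch of the resulting pattern (modelled on the lru_add path;
      refcount and compound-page handling elided):
      
        /* Per-CPU pagevecs bundled with the local lock that protects them */
        struct lru_pvecs {
                local_lock_t lock;
                struct pagevec lru_add;
                /* ... remaining pagevecs ... */
        };
        static DEFINE_PER_CPU(struct lru_pvecs, lru_pvecs) = {
                .lock = INIT_LOCAL_LOCK(lock),
        };
      
        static void lru_add_sketch(struct page *page)
        {
                struct pagevec *pvec;
      
                /* preempt_disable() on !PREEMPT_RT, a per-CPU lock on PREEMPT_RT */
                local_lock(&lru_pvecs.lock);
                pvec = this_cpu_ptr(&lru_pvecs.lru_add);
                if (!pagevec_add(pvec, page))
                        __pagevec_lru_add(pvec);
                local_unlock(&lru_pvecs.lock);
        }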
      
      There is neither a functional change nor a change in the generated
      binary code for non PREEMPT_RT enabled non-debug kernels.
      
      When lockdep is enabled, local locks have lockdep maps embedded.  These
      allow lockdep to validate the protections, i.e. inappropriate usage of a
      preemption-only protected section would result in a lockdep warning, while
      the same problem would not be noticed with a plain preempt_disable() based
      protection.
      
      local locks also improve readability as they provide a named scope for the
      protections, while preempt/interrupt disable are opaque and scopeless.
      
      Finally local locks allow PREEMPT_RT to substitute them with real
      locking primitives to ensure the correctness of operation in a fully
      preemptible kernel.
      
      [ bigeasy: Adopted to use local_lock ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Link: https://lore.kernel.org/r/20200527201119.1692513-4-bigeasy@linutronix.de
      b01b2141
  16. 27 Apr 2020 (1 commit)
  17. 08 Apr 2020 (1 commit)