1. 15 Aug 2020, 1 commit
  2. 13 Aug 2020, 4 commits
    • mm/vmstat: add events for THP migration without split · 1a5bae25
      Authored by Anshuman Khandual
      Add the following new vmstat events, which will help in validating THP
      migration without split.  Statistics reported through these new VM events
      will help in performance debugging.
      
      1. THP_MIGRATION_SUCCESS
      2. THP_MIGRATION_FAILURE
      3. THP_MIGRATION_SPLIT
      
      In addition, these new events also update the normal page migration
      statistics via PGMIGRATE_SUCCESS and PGMIGRATE_FAILURE.  While here, the
      existing trace event 'mm_migrate_pages' is updated to accommodate the
      now-available THP statistics.
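
      Not part of the patch, but as a quick way to observe the new counters from
      userspace once this lands, the sketch below matches the /proc/vmstat keys
      by prefix rather than assuming their exact names (the lower-case names are
      an assumption here) and prints the THP and base-page migration statistics:

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          FILE *f = fopen("/proc/vmstat", "r");
          char line[256];

          if (!f)
              return 1;
          /* Print the THP and base-page migration counters described above. */
          while (fgets(line, sizeof(line), f)) {
              if (!strncmp(line, "thp_migration", 13) ||
                  !strncmp(line, "pgmigrate", 9))
                  fputs(line, stdout);
          }
          fclose(f);
          return 0;
      }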
      
      [akpm@linux-foundation.org: s/hpage_nr_pages/thp_nr_pages/]
      [ziy@nvidia.com: v2]
        Link: http://lkml.kernel.org/r/C5E3C65C-8253-4638-9D3C-71A61858BB8B@nvidia.com
      [anshuman.khandual@arm.com: s/thp_nr_pages/hpage_nr_pages/]
        Link: http://lkml.kernel.org/r/1594287583-16568-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Link: http://lkml.kernel.org/r/1594080415-27924-1-git-send-email-anshuman.khandual@arm.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: use unsigned types for fragmentation score · d34c0a75
      Authored by Nitin Gupta
      Proactive compaction uses a per-node/zone "fragmentation score" which is
      always in the range [0, 100], so use an unsigned type for these scores as
      well as for the related constants.
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Link: http://lkml.kernel.org/r/20200618010319.13159-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: proactive compaction · facdaa91
      Authored by Nitin Gupta
      For some applications, we need to allocate almost all memory as hugepages.
      However, on a running system, higher-order allocations can fail if the
      memory is fragmented.  The Linux kernel currently does on-demand compaction
      as we request more hugepages, but this style of compaction incurs very high
      latency.  Experiments with one-time full memory compaction (followed by
      hugepage allocations) show that the kernel is able to restore a highly
      fragmented memory state to a fairly compacted state within <1 sec on a
      32G system.  Such data suggests that a more proactive compaction can help
      us allocate a large fraction of memory as hugepages while keeping
      allocation latencies low.
      
      For a more proactive compaction, the approach taken here is to define a
      new sysctl called 'vm.compaction_proactiveness' which dictates bounds for
      external fragmentation which kcompactd tries to maintain.
      
      The tunable takes a value in range [0, 100], with a default of 20.
      
      Note that a previous version of this patch [1] was found to introduce too
      many tunables (per-order extfrag{low, high}), but this one reduces them to
      just one sysctl.  Also, the new tunable is an opaque value instead of
      asking for specific bounds of "external fragmentation", which would have
      been difficult to estimate.  The internal interpretation of this opaque
      value allows for future fine-tuning.
      
      Currently, we use a simple translation from this tunable to [low, high]
      "fragmentation score" thresholds (low=100-proactiveness, high=low+10%).
      The score for a node is defined as the weighted mean of the per-zone
      external fragmentation, where a zone's present_pages determines its weight.
      
      To periodically check the per-node score, we reuse the per-node kcompactd
      threads, which are woken up every 500 milliseconds for this check.  If a
      node's score exceeds its high threshold (as derived from the user-provided
      proactiveness value), proactive compaction is started until its score drops
      to the low threshold.  By default, proactiveness is set to 20, which
      implies threshold values of low=80 and high=90.
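
      For illustration only (this is not the kernel code; the sample per-zone
      fragmentation numbers and page counts below are made up), the translation
      from proactiveness to thresholds and the present_pages-weighted node score
      can be sketched in userspace C as:

      #include <stdio.h>

      int main(void)
      {
          unsigned int proactiveness = 20;          /* vm.compaction_proactiveness */
          unsigned int low  = 100 - proactiveness;  /* 80 */
          unsigned int high = low + 10;             /* 90 */

          /* Made-up per-zone external fragmentation [0..100] and present_pages. */
          unsigned int extfrag[] = { 95, 92, 60 };
          unsigned long present_pages[] = { 1UL << 18, 1UL << 20, 1UL << 16 };
          unsigned long pages = 0, weighted = 0;
          unsigned int score;

          for (int i = 0; i < 3; i++) {
              weighted += (unsigned long)extfrag[i] * present_pages[i];
              pages += present_pages[i];
          }
          score = weighted / pages;                 /* weighted mean, still in [0, 100] */

          printf("low=%u high=%u node score=%u\n", low, high, score);
          if (score > high)
              printf("kcompactd would compact proactively until score <= %u\n", low);
          return 0;
      }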
      
      This patch is largely based on ideas from Michal Hocko [2].  See also the
      LWN article [3].
      
      Performance data
      ================
      
      System: x86_64, 1T RAM, 80 CPU threads.
      Kernel: 5.6.0-rc3 + this patch
      
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
      
      Before starting the driver, the system was fragmented by a userspace
      program that allocates all memory and then, for each 2M-aligned section,
      frees 3/4 of the base pages using munmap.  The workload is mainly anonymous
      userspace pages, which are easy to move around.  I intentionally avoided
      unmovable pages in this test to see how much latency we incur when
      hugepage allocations hit direct compaction.
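
      The fragmentation tool itself is not included in this commit; a minimal
      userspace sketch along the lines described above (assuming 4K base pages
      and freeing the last 3/4 of each 2M-aligned section) could look like:

      #include <stdint.h>
      #include <stdio.h>
      #include <stdlib.h>
      #include <sys/mman.h>

      #define MB (1024UL * 1024UL)

      int main(int argc, char **argv)
      {
          /* Amount to fragment, in MB (the real test used nearly all of RAM). */
          unsigned long total_mb = argc > 1 ? strtoul(argv[1], NULL, 0) : 1024;
          unsigned long len = total_mb * MB;

          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
          if (p == MAP_FAILED) {
              perror("mmap");
              return 1;
          }

          /* First 2M-aligned address inside the mapping. */
          char *s = (char *)(((uintptr_t)p + 2 * MB - 1) & ~(2 * MB - 1));

          /* Leave 1/4 of each 2M section mapped, unmap the remaining 3/4. */
          for (; s + 2 * MB <= p + len; s += 2 * MB)
              munmap(s + 512 * 1024, 3 * 512 * 1024);

          puts("memory fragmented; press Enter to exit");
          getchar();
          return 0;
      }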
      
      1. Kernel hugepage allocation latencies
      
      With the system in such a fragmented state, a kernel driver then allocates
      as many hugepages as possible and measures allocation latency:
      
      (all latency values are in microseconds)
      
      - With vanilla 5.6.0-rc3
      
        percentile latency
        ---------- -------
                 5    7894
                10    9496
                25   12561
                30   15295
                40   18244
                50   21229
                60   27556
                75   30147
                80   31047
                90   32859
                95   33799
      
      Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      sysctl -w vm.compaction_proactiveness=20
      
        percentile latency
        ---------- -------
                 5       2
                10       2
                25       3
                30       3
                40       3
                50       4
                60       4
                75       4
                80       4
                90       5
                95     429
      
      Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      2. Java heap allocation
      
      In this test, we first fragment memory using the same method as for (1).
      
      Then, we start a Java process with a heap size set to 700G and request the
      heap to be allocated with THP hugepages.  We also set THP to madvise to
      allow hugepage backing of this heap.
      
      /usr/bin/time
       java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
      
      The above command allocates 700G of Java heap using hugepages.
      
      - With vanilla 5.6.0-rc3
      
      17.39user 1666.48system 27:37.89elapsed
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      8.35user 194.58system 3:19.62elapsed
      
      Elapsed time remains around 3:15 as proactiveness is increased further.
      
      Note that proactive compaction happens throughout the runtime of these
      workloads.  A one-time compaction, sufficient to supply hugepages for the
      following allocation stream, can probably happen at more extreme
      proactiveness values, like 80 or 90.
      
      In the above Java workload, proactiveness is set to 20.  The test starts
      with a node's score of 80 or higher, depending on the delay between the
      fragmentation step and starting the benchmark, which gives more or less
      time for the initial round of compaction.  As the benchmark consumes
      hugepages, the node's score quickly rises above the high threshold (90) and
      proactive compaction starts again, which brings the score down to the low
      threshold level (80).  Repeat.
      
      bpftrace also confirms proactive compaction running 20+ times during the
      runtime of this Java benchmark.  A kcompactd thread consumes 100% of one
      CPU while it tries to bring its node's score within the thresholds.
      
      Backoff behavior
      ================
      
      The above workloads produce a memory state which is easy to compact.
      However, if memory is filled with unmovable pages, proactive compaction
      should essentially back off.  To test this aspect:
      
      - Created a kernel driver that allocates almost all memory as hugepages,
        followed by freeing the first 3/4 of each hugepage.
      - Set proactiveness=40.
      - Observed that proactive_compact_node() is deferred the maximum number of
        times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each check
        (=> ~30 seconds between retries).
      
      [1] https://patchwork.kernel.org/patch/11098289/
      [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
      [3] https://lwn.net/Articles/817905/
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nitin Gupta <ngupta@nitingupta.dev>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm/workingset: prepare the workingset detection infrastructure for anon LRU · 170b04b7
      Authored by Joonsoo Kim
      To prepare the workingset detection for anon LRU, this patch splits
      workingset event counters for refault, activate and restore into anon and
      file variants, as well as the refaults counter in struct lruvec.
      Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Link: http://lkml.kernel.org/r/1595490560-15117-4-git-send-email-iamjoonsoo.kim@lge.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 08 Aug 2020, 2 commits
  4. 05 Jun 2020, 1 commit
  5. 04 Jun 2020, 2 commits
  6. 03 Jun 2020, 1 commit
    • mm/writeback: discard NR_UNSTABLE_NFS, use NR_WRITEBACK instead · 8d92890b
      Authored by NeilBrown
      After an NFS page has been written it is considered "unstable" until a
      COMMIT request succeeds.  If the COMMIT fails, the page will be
      re-written.
      
      These "unstable" pages are currently accounted as "reclaimable", either
      in WB_RECLAIMABLE, or in NR_UNSTABLE_NFS which is included in a
      'reclaimable' count.  This might have made sense when sending the COMMIT
      required a separate action by the VFS/MM (e.g.  releasepage() used to
      send a COMMIT).  However now that all writes generated by ->writepages()
      will automatically be followed by a COMMIT (since commit 919e3bd9
      ("NFS: Ensure we commit after writeback is complete")) it makes more
      sense to treat them as writeback pages.
      
      So this patch removes NR_UNSTABLE_NFS and accounts unstable pages in
      NR_WRITEBACK and WB_WRITEBACK.
      
      A particular effect of this change is that when
      wb_check_background_flush() calls wb_over_bg_threshold(), the latter
      will report 'true' a lot less often as the 'unstable' pages are no
      longer considered 'dirty' (as there is nothing that writeback can do
      about them anyway).
      
      Currently wb_check_background_flush() will trigger writeback to NFS even
      when there are relatively few dirty pages (if there are lots of unstable
      pages); this can result in small writes going to the server (tens of
      kilobytes rather than a megabyte), which hurts throughput.  With this
      patch, there are fewer writes and each is larger on average.
      
      Where the NR_UNSTABLE_NFS count was included in statistics virtual files,
      the entry is retained, but the value is hard-coded as zero.  Static trace
      points and warning printks which mentioned this counter no longer report
      it.
      
      [akpm@linux-foundation.org: re-layout comment]
      [akpm@linux-foundation.org: fix printk warning]
      Signed-off-by: NeilBrown <neilb@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Acked-by: Michal Hocko <mhocko@suse.com>	[mm]
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Chuck Lever <chuck.lever@oracle.com>
      Link: http://lkml.kernel.org/r/87d06j7gqa.fsf@notabene.neil.brown.name
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 15 May 2020, 1 commit
  8. 27 Apr 2020, 1 commit
  9. 08 Apr 2020, 2 commits
  10. 03 Apr 2020, 1 commit
    • mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting · 1970dc6f
      Authored by John Hubbard
      Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via
      unpin_user_pages*(), we need some visibility into whether all of this is
      working correctly.
      
      Add two new fields to /proc/vmstat:
      
          nr_foll_pin_acquired
          nr_foll_pin_released
      
      These are documented in Documentation/core-api/pin_user_pages.rst.  They
      represent the number of pages (since boot time) that have been pinned
      ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via
      pin_user_pages*() and unpin_user_pages*().
      
      In the absence of long-running DMA or RDMA operations that hold pages
      pinned, the above two fields will normally be equal to each other.
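
      As a rough way to observe this from userspace (a sketch, not part of the
      patch), one can read the two counters named above and take their difference
      as the approximate number of pages currently held pinned:

      #include <stdio.h>
      #include <string.h>

      int main(void)
      {
          char name[64];
          unsigned long long val, acquired = 0, released = 0;
          FILE *f = fopen("/proc/vmstat", "r");

          if (!f)
              return 1;
          while (fscanf(f, "%63s %llu", name, &val) == 2) {
              if (!strcmp(name, "nr_foll_pin_acquired"))
                  acquired = val;
              else if (!strcmp(name, "nr_foll_pin_released"))
                  released = val;
          }
          fclose(f);
          /* Absent long-running DMA/RDMA pins, the two should be equal. */
          printf("acquired=%llu released=%llu currently pinned=%llu\n",
                 acquired, released, acquired - released);
          return 0;
      }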
      
      Also: update Documentation/core-api/pin_user_pages.rst, to remove an
      earlier (now confirmed untrue) claim about a performance problem with
      /proc/vmstat.
      
      Also: update Documentation/core-api/pin_user_pages.rst to rename the new
      /proc/vmstat entries to the names listed here.
      Signed-off-by: John Hubbard <jhubbard@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Acked-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Chinner <david@fromorbit.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  11. 05 Dec 2019, 2 commits
  12. 07 Nov 2019, 2 commits
  13. 25 Sep 2019, 1 commit
  14. 21 May 2019, 1 commit
  15. 20 Apr 2019, 1 commit
  16. 06 Mar 2019, 1 commit
  17. 29 Dec 2018, 1 commit
  18. 19 Nov 2018, 1 commit
  19. 27 Oct 2018, 4 commits
  20. 06 Oct 2018, 2 commits
  21. 29 Jun 2018, 1 commit
    • Revert mm/vmstat.c: fix vmstat_update() preemption BUG · 28557cc1
      Authored by Sebastian Andrzej Siewior
      Revert commit c7f26ccf ("mm/vmstat.c: fix vmstat_update() preemption
      BUG").  Steven saw a "using smp_processor_id() in preemptible" message
      and added a preempt_disable() section around it to keep it quiet.  This
      is not the right thing to do; it does not fix the real problem.
      
      vmstat_update() is invoked by a kworker on a specific CPU.  This worker
      is bound to that CPU.  The name of the worker was "kworker/1:1", so it
      should have been a worker bound to CPU1.  A worker which can run on any
      CPU would have a `u' before the first digit.
      
      smp_processor_id() can be used in a preempt-enabled region as long as
      the task is bound to a single CPU, which is the case here.  If it could
      run on an arbitrary CPU, then that is the problem we have and should
      seek to resolve.
      
      It is not only this smp_processor_id() call that must not be migrated to
      another CPU: refresh_cpu_vm_stats() might also access the wrong per-CPU
      variables.  Not to mention that other code relies on the fact that such
      a worker runs on one specific CPU only.
      
      Therefore revert that commit, and instead look into what broke the
      affinity mask of the kworker.
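
      To make the point concrete, here is a minimal standalone module sketch
      (not the vmstat code itself; all names are made up) that queues work bound
      to one CPU and calls smp_processor_id() from it without preempt_disable():

      #include <linux/module.h>
      #include <linux/workqueue.h>
      #include <linux/jiffies.h>
      #include <linux/smp.h>

      static struct delayed_work demo_work;

      static void demo_work_fn(struct work_struct *w)
      {
          /* Queued with schedule_delayed_work_on(), so this runs bound to that
           * CPU; smp_processor_id() is safe here without preempt_disable(),
           * which is exactly the point of the revert above.
           */
          pr_info("demo work on CPU %d\n", smp_processor_id());
      }

      static int __init demo_init(void)
      {
          INIT_DELAYED_WORK(&demo_work, demo_work_fn);
          schedule_delayed_work_on(1, &demo_work, HZ);  /* CPU1, like kworker/1:1 */
          return 0;
      }

      static void __exit demo_exit(void)
      {
          cancel_delayed_work_sync(&demo_work);
      }

      module_init(demo_init);
      module_exit(demo_exit);
      MODULE_LICENSE("GPL");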
      
      Link: http://lkml.kernel.org/r/20180504104451.20278-1-bigeasy@linutronix.de
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: Steven J. Hill <steven.hill@cavium.com>
      Cc: Tejun Heo <htejun@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  22. 16 May 2018, 1 commit
  23. 12 May 2018, 1 commit
  24. 12 Apr 2018, 1 commit
  25. 29 Mar 2018, 1 commit
  26. 16 Nov 2017, 2 commits
  27. 09 Sep 2017, 1 commit