1. 13 Aug 2020 (3 commits)
    • coredump: add %f for executable filename · f38c85f1
      Authored by Lepton Wu
      The documentation says "%e" expands to the "executable filename", while
      in fact it expands to the thread name, which can be changed by things
      like prctl(PR_SET_NAME).  People who use "%e" in core_pattern are
      surprised when they get the thread name instead of the executable
      filename.
      
      This is either a documentation bug or a code bug.  Since the behavior
      of "%e" has been there for a long time, "fixing" the code could spring
      yet another surprise on users.
      
      So we just "fix" the documentation.  Furthermore, for users who really
      need the "executable filename" in core_pattern, we introduce a new "%f"
      specifier for the real executable filename.  The kernel already has
      "%E" for the executable path, so most of its code is reused for the
      newly added "%f" format.
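
      A minimal C sketch of the difference (not from the patch; the binary
      name "renamer" and thread name "worker" are made up for illustration;
      run with "ulimit -c unlimited" and a core_pattern such as "core.%e.%f"):

          #include <signal.h>
          #include <sys/prctl.h>

          int main(void)
          {
                  /* Rename the current thread; "%e" follows this name. */
                  prctl(PR_SET_NAME, "worker");

                  /*
                   * Force a core dump.  With core_pattern "core.%e.%f" the
                   * dump would be named "core.worker.renamer": "%e" gives
                   * the thread name, "%f" the executable filename.
                   */
                  raise(SIGSEGV);
                  return 0;
          }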
      Signed-off-by: Lepton Wu <ytht.net@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20200701031432.2978761-1-ytht.net@gmail.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: proactive compaction · facdaa91
      Authored by Nitin Gupta
      For some applications, we need to allocate almost all memory as
      hugepages.  However, on a running system, higher-order allocations can
      fail if the memory is fragmented.  The Linux kernel currently does
      on-demand compaction as we request more hugepages, but this style of
      compaction incurs very high latency.  Experiments with one-time full
      memory compaction (followed by hugepage allocations) show that the
      kernel is able to restore a highly fragmented memory state to a fairly
      compacted state in under 1 second on a 32G system.  Such data suggests
      that more proactive compaction can help us allocate a large fraction of
      memory as hugepages while keeping allocation latencies low.
      
      For more proactive compaction, the approach taken here is to define a
      new sysctl called 'vm.compaction_proactiveness', which dictates the
      bounds of external fragmentation that kcompactd tries to maintain.
      
      The tunable takes a value in range [0, 100], with a default of 20.
      
      Note that a previous version of this patch [1] was found to introduce too
      many tunables (per-order extfrag{low, high}), but this one reduces them to
      just one sysctl.  Also, the new tunable is an opaque value instead of
      asking for specific bounds of "external fragmentation", which would have
      been difficult to estimate.  The internal interpretation of this opaque
      value allows for future fine-tuning.
      
      Currently, we use a simple translation from this tunable to [low, high]
      "fragmentation score" thresholds (low = 100 - proactiveness,
      high = low + 10).  The score for a node is defined as the weighted mean
      of per-zone external fragmentation, where a zone's present_pages
      determines its weight.
      
      To periodically check per-node scores, we reuse the per-node kcompactd
      threads, which are woken up every 500 milliseconds to perform this
      check.  If a node's score exceeds its high threshold (as derived from
      the user-provided proactiveness value), proactive compaction is started
      until its score drops to the low threshold.  By default, proactiveness
      is set to 20, which implies threshold values of low=80 and high=90.
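
      As a rough sketch of this translation and scoring (the struct and
      helper names below are made up for illustration; the real
      implementation lives in mm/compaction.c):

          /* low = 100 - proactiveness, high = low + 10 (20 -> 80/90). */
          static void frag_score_thresholds(int proactiveness, int *low, int *high)
          {
                  *low = 100 - proactiveness;
                  *high = *low + 10;
          }

          struct zone_info {
                  unsigned long present_pages;
                  unsigned int extfrag;   /* external fragmentation, 0..100 */
          };

          /* Node score: mean of per-zone extfrag, weighted by present_pages. */
          static unsigned int node_frag_score(const struct zone_info *z, int n)
          {
                  unsigned long sum = 0, pages = 0;
                  int i;

                  for (i = 0; i < n; i++) {
                          sum += z[i].present_pages * z[i].extfrag;
                          pages += z[i].present_pages;
                  }
                  return pages ? sum / pages : 0;
          }

      kcompactd would then compact whenever node_frag_score() exceeds the
      high threshold and stop once it falls back to the low one.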
      
      This patch is largely based on ideas from Michal Hocko [2].  See also the
      LWN article [3].
      
      Performance data
      ================
      
      System: x86_64, 1T RAM, 80 CPU threads.
      Kernel: 5.6.0-rc3 + this patch
      
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
      echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
      
      Before starting the driver, the system was fragmented by a userspace
      program that allocates all memory and then, for each 2M-aligned
      section, frees 3/4 of its base pages using munmap (a sketch follows).
      The workload is mainly anonymous userspace pages, which are easy to
      move around.  I intentionally avoided unmovable pages in this test to
      see how much latency we incur when hugepage allocations hit direct
      compaction.
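
      A minimal user-space sketch of such a fragmenter, under stated
      assumptions (ALLOC_SIZE is a placeholder, the original sized itself to
      all of memory; freeing the first 3/4 of each section is one plausible
      reading of the pattern):

          #include <string.h>
          #include <sys/mman.h>
          #include <unistd.h>

          #define SECTION_SIZE  (2UL << 20)       /* 2M-aligned sections */
          #define ALLOC_SIZE    (8UL << 30)       /* placeholder: 8G */

          int main(void)
          {
                  char *mem = mmap(NULL, ALLOC_SIZE + SECTION_SIZE,
                                   PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                  char *base;
                  unsigned long off;

                  if (mem == MAP_FAILED)
                          return 1;
                  /* Round up to a 2M boundary so the sections are aligned. */
                  base = (char *)(((unsigned long)mem + SECTION_SIZE - 1)
                                  & ~(SECTION_SIZE - 1));
                  memset(base, 1, ALLOC_SIZE);    /* fault in every base page */

                  /* Unmap the first 3/4 of each 2M section; the surviving
                   * base pages keep it from becoming a free hugepage. */
                  for (off = 0; off < ALLOC_SIZE; off += SECTION_SIZE)
                          munmap(base + off, 3 * SECTION_SIZE / 4);

                  pause();        /* hold the fragmented state */
                  return 0;
          }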
      
      1. Kernel hugepage allocation latencies
      
      With the system in such a fragmented state, a kernel driver then allocates
      as many hugepages as possible and measures allocation latency:
      
      (all latency values are in microseconds)
      
      - With vanilla 5.6.0-rc3
      
        percentile latency
        –––––––––– –––––––
      	   5    7894
      	  10    9496
      	  25   12561
      	  30   15295
      	  40   18244
      	  50   21229
      	  60   27556
      	  75   30147
      	  80   31047
      	  90   32859
      	  95   33799
      
      Total 2M hugepages allocated = 383859 (749G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      sysctl -w vm.compaction_proactiveness=20
      
        percentile latency
        –––––––––– –––––––
      	   5       2
      	  10       2
      	  25       3
      	  30       3
      	  40       3
      	  50       4
      	  60       4
      	  75       4
      	  80       4
      	  90       5
      	  95     429
      
      Total 2M hugepages allocated = 384105 (750G worth of hugepages out of 762G
      total free => 98% of free memory could be allocated as hugepages)
      
      2. Java heap allocation
      
      In this test, we first fragment memory using the same method as for (1).
      
      Then, we start a Java process with a heap size set to 700G and request the
      heap to be allocated with THP hugepages.  We also set THP to madvise to
      allow hugepage backing of this heap.
      
      /usr/bin/time
       java -Xms700G -Xmx700G -XX:+UseTransparentHugePages -XX:+AlwaysPreTouch
      
      The above command allocates 700G of Java heap using hugepages.
      
      - With vanilla 5.6.0-rc3
      
      17.39user 1666.48system 27:37.89elapsed
      
      - With 5.6.0-rc3 + this patch, with proactiveness=20
      
      8.35user 194.58system 3:19.62elapsed
      
      Elapsed time remains around 3:15 as proactiveness is increased further.
      
      Note that proactive compaction happens throughout the runtime of these
      workloads.  A one-time compaction, sufficient to supply hugepages for
      the following allocation stream, can probably happen at more extreme
      proactiveness values, like 80 or 90.
      
      In the above Java workload, proactiveness is set to 20.  The test
      starts with a node score of 80 or higher, depending on the delay
      between the fragmentation step and starting the benchmark, which gives
      more or less time for the initial round of compaction.  As the
      benchmark consumes hugepages, the node's score quickly rises above the
      high threshold (90) and proactive compaction starts again, bringing the
      score down to the low threshold (80).  Repeat.
      
      bpftrace also confirms proactive compaction running 20+ times during
      the runtime of this Java benchmark.  A kcompactd thread consumes 100%
      of one CPU while it tries to bring its node's score within the
      thresholds.
      
      Backoff behavior
      ================
      
      The above workloads produce a memory state which is easy to compact.
      However, if memory is filled with unmovable pages, proactive compaction
      should essentially back off.  To test this aspect:
      
      - Created a kernel driver that allocates almost all memory as
        hugepages, then frees the first 3/4 of each hugepage.
      - Set proactiveness=40.
      - Observed that proactive_compact_node() is deferred the maximum number
        of times, with HPAGE_FRAG_CHECK_INTERVAL_MSEC of wait between each
        check (=> ~30 seconds between retries).
      
      [1] https://patchwork.kernel.org/patch/11098289/
      [2] https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
      [3] https://lwn.net/Articles/817905/
      Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Tested-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
      Reviewed-by: Khalid Aziz <khalid.aziz@oracle.com>
      Reviewed-by: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Khalid Aziz <khalid.aziz@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Nitin Gupta <ngupta@nitingupta.dev>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Link: http://lkml.kernel.org/r/20200616204527.19185-1-nigupta@nvidia.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg/percpu: per-memcg percpu memory statistics · 772616b0
      Authored by Roman Gushchin
      Percpu memory can represent a noticeable chunk of total memory
      consumption, especially on big machines with many CPUs.  Let's track
      percpu memory usage for each memcg and display it in memory.stat.

      A percpu allocation is usually scattered over multiple pages (and
      nodes), and can be significantly smaller than a page.  So let's add a
      byte-sized counter on the memcg level: MEMCG_PERCPU_B.  The byte-sized
      vmstat infrastructure created for slabs can be reused as-is for the
      percpu case.
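
      With this in place, per-memcg percpu usage appears as a byte-valued
      "percpu" entry in memory.stat (cgroup v2); the cgroup name and value
      below are illustrative only:

          $ grep percpu /sys/fs/cgroup/user.slice/memory.stat
          percpu 16236544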
      
      [guro@fb.com: v3]
        Link: http://lkml.kernel.org/r/20200623184515.4132564-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Dennis Zhou <dennis@kernel.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Tobin C. Harding <tobin@kernel.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Waiman Long <longman@redhat.com>
      Cc: Bixuan Cui <cuibixuan@huawei.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Link: http://lkml.kernel.org/r/20200608230819.832349-4-guro@fb.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  2. 12 Aug 2020 (1 commit)
  3. 08 Aug 2020 (1 commit)
  4. 04 Aug 2020 (1 commit)
  5. 03 Aug 2020 (1 commit)
  6. 29 Jul 2020 (3 commits)
  7. 28 Jul 2020 (3 commits)
  8. 24 Jul 2020 (7 commits)
  9. 23 Jul 2020 (1 commit)
    • debugfs: Add access restriction option · a24c6f7b
      Authored by Peter Enderborg
      Since debugfs includes sensitive information, it needs to be treated
      carefully.  But it also has many very useful debug functions for
      userspace.  With this option we can use the same configuration for
      systems that need debugfs and still have a way to turn it off.  This
      gives extra protection against exposure on systems where user-space
      services with system access are attacked.

      It is controlled by a configurable default value that can be overridden
      with a kernel command line parameter (debugfs=); see the example below.

      It can be on or off, but also internally on yet not visible from
      user-space.  This no-mount mode does not register debugfs as a
      filesystem, but clients can still register their parts in the internal
      structures.  That data can be read with a debugger or saved with a
      crashkernel.  When it is off, clients get an EPERM error when calling
      the functions for registering their components.
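
      For illustration, assuming the three modes map to the parameter values
      on, no-mount and off as described above, a boot command line might end
      with one of:

          ... debugfs=no-mount    (registered internally, never mountable)
          ... debugfs=off         (registration disabled, clients get EPERM)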
      Signed-off-by: Peter Enderborg <peter.enderborg@sony.com>
      Link: https://lore.kernel.org/r/20200716071511.26864-3-peter.enderborg@sony.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  10. 20 Jul 2020 (2 commits)
  11. 18 Jul 2020 (1 commit)
    • blk-cgroup: show global disk stats in root cgroup io.stat · ef45fe47
      Authored by Boris Burkov
      In order to improve consistency and usability in cgroup stat accounting,
      we would like to support the root cgroup's io.stat.
      
      Since the root cgroup has processes doing io even if the system has no
      explicitly created cgroups, we need to be careful to avoid overhead in
      that case.  For that reason, the rstat algorithms don't handle the root
      cgroup, so just turning the file on wouldn't give correct statistics.
      
      To get around this, we simulate flushing the iostat struct by filling it
      out directly from global disk stats. The result is a root cgroup io.stat
      file consistent with both /proc/diskstats and io.stat.
      
      Note that in order to collect the disk stats, we needed to iterate over
      devices. To facilitate that, we had to change the linkage of a disk_type
      to external so that it can be used from blk-cgroup.c to iterate over
      disks.
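
      With this change, reading io.stat at the cgroup v2 root is expected to
      mirror /proc/diskstats; an illustrative read (device numbers and
      counter values made up):

          $ cat /sys/fs/cgroup/io.stat
          8:0 rbytes=1459011584 wbytes=314773504 rios=83734 wios=1320 dbytes=0 dios=0
          8:16 rbytes=90430464 wbytes=299008000 rios=8950 wios=1252 dbytes=0 dios=0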
      Suggested-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Boris Burkov <boris@bur.io>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 15 Jul 2020 (1 commit)
  13. 13 Jul 2020 (4 commits)
  14. 09 Jul 2020 (4 commits)
  15. 06 Jul 2020 (7 commits)