1. 27 Jul 2023 (4 commits)
  2. 25 Jul 2023 (1 commit)
    • mm: vmscan: fix extreme overreclaim and swap floods · 73d4a8a7
      Johannes Weiner committed
      stable inclusion
      from stable-v5.10.157
      commit d925dd3e444cb7f0fab0208fed82673fd61f9765
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7MU59
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=d925dd3e444cb7f0fab0208fed82673fd61f9765
      
      --------------------------------
      
      commit f53af428 upstream.
      
      During proactive reclaim, we sometimes observe severe overreclaim, with
      several thousand times more pages reclaimed than requested.
      
      This trace was obtained from shrink_lruvec() during such an instance:
      
          prio:0 anon_cost:1141521 file_cost:7767
          nr_reclaimed:4387406 nr_to_reclaim:1047 (or_factor:4190)
          nr=[7161123 345 578 1111]
      
      While the reclaimer requested 4M, vmscan reclaimed close to 16G, most of it
      by swapping.  These requests take over a minute, during which the write()
      to memory.reclaim is unkillably stuck inside the kernel.
      
      Digging into the source, this is caused by the proportional reclaim
      bailout logic.  This code tries to resolve a fundamental conflict: to
      reclaim roughly what was requested, while also aging all LRUs fairly and
      in accordance to their size, swappiness, refault rates etc.  The way it
      attempts fairness is that once the reclaim goal has been reached, it stops
      scanning the LRUs with the smaller remaining scan targets, and adjusts the
      remainder of the bigger LRUs according to how much of the smaller LRUs was
      scanned.  It then finishes scanning that remainder regardless of the
      reclaim goal.
      
      This works fine if priority levels are low and the LRU lists are
      comparable in size.  However, in this instance, the cgroup that is
      targeted by proactive reclaim has almost no files left - they've already
      been squeezed out by proactive reclaim earlier - and the remaining anon
      pages are hot.  Anon rotations cause the priority level to drop to 0,
      which results in reclaim targeting all of anon (a lot) and all of file
      (almost nothing).  By the time reclaim decides to bail, it has scanned
      most or all of the file target, and therefore must also scan most or all of
      the enormous anon target.  This target is thousands of times larger than
      the reclaim goal, thus causing the overreclaim.
      
      The bailout code hasn't changed in years, why is this failing now?  The
      most likely explanations are two other recent changes in anon reclaim:
      
      1. Before the series starting with commit 5df74196 ("mm: fix LRU
         balancing effect of new transparent huge pages"), the VM was
         overall relatively reluctant to swap at all, even if swap was
         configured. This means the LRU balancing code didn't come into play
         as often as it does now, and mostly in high pressure situations
         where pronounced swap activity wouldn't be as surprising.
      
      2. For historic reasons, shrink_lruvec() loops on the scan targets of
         all LRU lists except the active anon one, meaning it would bail if
         the only remaining pages to scan were active anon - even if there
         were a lot of them.
      
         Before the series starting with commit ccc5dc67 ("mm/vmscan:
         make active/inactive ratio as 1:1 for anon lru"), most anon pages
         would live on the active LRU; the inactive one would contain only a
         handful of preselected reclaim candidates. After the series, anon
         gets aged similarly to file, and the inactive list is the default
         for new anon pages as well, making it often the much bigger list.
      
         As a result, the VM is now more likely to actually finish large
         anon targets than before.
      
      Change the code such that only one SWAP_CLUSTER_MAX-sized nudge toward the
      larger LRU lists is made before bailing out on a met reclaim goal.
      
      This fixes the extreme overreclaim problem.
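      
      To make the intended behavior concrete, here is a small userspace model of
      the loop described above.  It is only an illustrative sketch (two flat scan
      targets and an invented fifty-percent reclaim ratio), not the kernel's
      shrink_lruvec():
      
          #include <stdio.h>
          
          #define SWAP_CLUSTER_MAX 32UL
          
          static unsigned long min_ul(unsigned long a, unsigned long b)
          {
              return a < b ? a : b;
          }
          
          int main(void)
          {
              /* anon and file scan targets, mirroring the trace above */
              unsigned long nr[2] = { 7161123, 1689 };
              unsigned long nr_to_reclaim = 1047;
              unsigned long nr_reclaimed = 0, nr_scanned = 0;
          
              while (nr[0] || nr[1]) {
                  int small;
          
                  /* scan both lists in SWAP_CLUSTER_MAX batches */
                  for (int i = 0; i < 2; i++) {
                      unsigned long batch = min_ul(nr[i], SWAP_CLUSTER_MAX);
          
                      nr[i] -= batch;
                      nr_scanned += batch;
                      nr_reclaimed += batch / 2;  /* invented reclaim ratio */
                  }
          
                  if (nr_reclaimed < nr_to_reclaim)
                      continue;
          
                  /*
                   * Goal met: stop the smaller list and cap the larger one at a
                   * single SWAP_CLUSTER_MAX nudge, so the loop winds down instead
                   * of finishing the enormous remaining target.
                   */
                  small = nr[0] < nr[1] ? 0 : 1;
                  nr[small] = 0;
                  nr[1 - small] = min_ul(nr[1 - small], SWAP_CLUSTER_MAX);
              }
          
              printf("scanned %lu, reclaimed %lu (goal %lu)\n",
                     nr_scanned, nr_reclaimed, nr_to_reclaim);
              return 0;
          }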
      
      Fairness is more subtle and harder to evaluate.  No obvious misbehavior
      was observed on the test workload, in any case.  Conceptually, fairness
      should primarily be a cumulative effect from regular, lower priority
      scans.  Once the VM is in trouble and needs to escalate scan targets to
      make forward progress, fairness needs to take a backseat.  This is also
      acknowledged by the myriad exceptions in get_scan_count().  This patch
      makes fairness decrease gradually, as it keeps fairness work static over
      increasing priority levels with growing scan targets.  This should make
      more sense - although we may have to re-visit the exact values.
      
      Link: https://lkml.kernel.org/r/20220802162811.39216-1-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Rik van Riel <riel@surriel.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: sanglipeng <sanglipeng1@jd.com>
  3. 20 Jul 2023 (2 commits)
  4. 18 Jul 2023 (1 commit)
  5. 10 Jul 2023 (1 commit)
    • etmem: fix the div 0 problem in swapcache reclaim process · 905e7deb
      liubo committed
      euleros inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7JI6K
      CVE: NA
      
      ----------------------------------------------------
      
      During swapcache reclaim, the number of pages to be reclaimed on each
      node is computed as follows:
      
      nr_to_reclaim[nid_num] = (swapcache_to_reclaim / (swapcache_total_reclaimable / nr[nid_num]));
      
      However, nr[nid_num] is the per-node swapcache page count obtained by
      traversing each node.  On a system with multiple nodes, a node on which
      no swapping has occurred holds no swapcache pages, so nr[nid_num] may
      be 0.
      
      In that case the inner division triggers a divide-by-zero error.
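      
      For illustration, here is a minimal userspace sketch of the guarded
      per-node split.  The variable names follow the formula above; this is an
      assumption-laden model, not the etmem kernel code itself:
      
          #include <stdio.h>
          
          #define MAX_NODES 4
          
          int main(void)
          {
              /* hypothetical per-node swapcache counts; nodes 1 and 3 never swapped */
              unsigned long nr[MAX_NODES] = { 800, 0, 200, 0 };
              unsigned long nr_to_reclaim[MAX_NODES] = { 0 };
              unsigned long swapcache_total_reclaimable = 1000;
              unsigned long swapcache_to_reclaim = 300;
          
              for (int nid = 0; nid < MAX_NODES; nid++) {
                  /* skip nodes without swapcache pages so the divisor is never 0 */
                  if (!nr[nid])
                      continue;
                  nr_to_reclaim[nid] = swapcache_to_reclaim /
                                       (swapcache_total_reclaimable / nr[nid]);
                  printf("node %d: reclaim %lu pages\n", nid, nr_to_reclaim[nid]);
              }
              return 0;
          }
      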
      Signed-off-by: liubo <liubo254@huawei.com>
  6. 26 Jun 2023 (3 commits)
  7. 25 Jun 2023 (8 commits)
  8. 21 Jun 2023 (1 commit)
  9. 16 Jun 2023 (1 commit)
  10. 13 Jun 2023 (1 commit)
  11. 07 Jun 2023 (6 commits)
  12. 30 May 2023 (7 commits)
  13. 19 May 2023 (4 commits)
    • memcg: support ksm merge any mode per cgroup · 0f6fb357
      Nanyong Sun committed
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      ----------------------------------------------------------------------
      
      Add a control file "memory.ksm" to enable KSM per cgroup.
      Writing 1 switches all tasks currently in the cgroup to KSM merge-any
      mode, which enables KSM for all VMAs of each process.
      Writing 0 disables KSM for them and unmerges the already merged pages.
      Reading the file shows the above state and the KSM-related profits
      of this cgroup.
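      
      A minimal usage sketch from a C program follows; the cgroup path is
      hypothetical and depends on where the cgroup of interest lives:
      
          #include <fcntl.h>
          #include <stdio.h>
          #include <unistd.h>
          
          /* hypothetical cgroup path; adjust to the target cgroup */
          #define KSM_FILE "/sys/fs/cgroup/memory/test/memory.ksm"
          
          int main(void)
          {
              char buf[256];
              ssize_t n;
              int fd;
          
              /* enable KSM merge-any mode for every task in the cgroup */
              fd = open(KSM_FILE, O_WRONLY);
              if (fd < 0 || write(fd, "1", 1) != 1) {
                  perror("enable memory.ksm");
                  return 1;
              }
              close(fd);
          
              /* read back the state and the KSM profit reported for the cgroup */
              fd = open(KSM_FILE, O_RDONLY);
              if (fd < 0) {
                  perror("read memory.ksm");
                  return 1;
              }
              n = read(fd, buf, sizeof(buf) - 1);
              if (n > 0) {
                  buf[n] = '\0';
                  fputs(buf, stdout);
              }
              close(fd);
              return 0;
          }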
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
    • mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0 · 351ceedb
      David Hildenbrand committed
      mainline inclusion
      from mainline-v6.4-rc1
      commit 24139c07
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc
      
      ----------------------------------------------------------------------
      
      Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup
      disabling KSM", v2.
      
      (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE
      does, (2) add a selftest for it and (3) factor out disabling of KSM from
      s390/gmap code.
      
      This patch (of 3):
      
      Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear
      the VM_MERGEABLE flag from all VMAs -- just like KSM would.  Of course,
      only do that if we previously set PR_SET_MEMORY_MERGE=1.
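      
      As a usage sketch of the toggle described above (the fallback constant value
      below matches the upstream uapi header and should be checked against this
      tree):
      
          #include <stdio.h>
          #include <sys/prctl.h>
          
          /* fallback in case the installed headers predate this series */
          #ifndef PR_SET_MEMORY_MERGE
          #define PR_SET_MEMORY_MERGE 67
          #endif
          
          int main(void)
          {
              /* opt the whole process into KSM: sets MMF_VM_MERGE_ANY */
              if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0))
                  perror("PR_SET_MEMORY_MERGE=1");
          
              /* ... run for a while, KSM may merge eligible pages ... */
          
              /* opt out again: with this patch the kernel also unmerges the
               * already-merged pages and clears VM_MERGEABLE on all VMAs */
              if (prctl(PR_SET_MEMORY_MERGE, 0, 0, 0, 0))
                  perror("PR_SET_MEMORY_MERGE=0");
          
              return 0;
          }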
      
      Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Stefan Roesch <shr@devkernel.io>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	mm/ksm.c
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
    • mm: add new KSM process and sysfs knobs · a098d41e
      Stefan Roesch committed
      mainline inclusion
      from mainline-v6.4-rc1
      commit d21077fb
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f
      
      ----------------------------------------------------------------------
      
      This adds the general_profit KSM sysfs knob and the process profit metric
      knobs to ksm_stat.
      
      1) expose general_profit metric
      
         The documentation mentions a general profit metric; however, this
         metric is not calculated.  In addition the formula depends on the size
         of internal structures, which makes it more difficult for an
         administrator to make the calculation.  Adding the metric for a better
         user experience.
      
      2) document general_profit sysfs knob
      
      3) calculate ksm process profit metric
      
         The ksm documentation mentions the process profit metric and how to
         calculate it.  This adds the calculation of the metric.
      
      4) mm: expose ksm process profit metric in ksm_stat
      
         This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
         The documentation mentions the formula for the ksm process profit
         metric; however, it does not calculate it.  In addition, the formula
         depends on the size of internal structures.  So it makes sense to
         expose it.
      
      5) document new procfs ksm knobs
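      
      The knobs added above can be read like any other sysfs/procfs file.  The
      paths in this sketch follow the upstream series and should be treated as
      assumptions for this backport:
      
          #include <stdio.h>
          
          static void dump(const char *path)
          {
              char line[256];
              FILE *f = fopen(path, "r");
          
              if (!f) {
                  perror(path);
                  return;
              }
              printf("== %s ==\n", path);
              while (fgets(line, sizeof(line), f))
                  fputs(line, stdout);
              fclose(f);
          }
          
          int main(void)
          {
              dump("/sys/kernel/mm/ksm/general_profit");
              dump("/proc/self/ksm_stat");  /* includes the per-process profit */
              return 0;
          }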
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
    • mm: add new api to enable ksm per process · 2cd2cdfe
      Stefan Roesch committed
      mainline inclusion
      from mainline-v6.4-rc1
      commit d7597f59
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62
      
      ----------------------------------------------------------------------
      
      Patch series "mm: process/cgroup ksm support", v9.
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      Use case 1:
        The madvise call is not available in the programming language.  An
        example for this are programs with forked workloads using a garbage
        collected language without pointers.  In such a language madvise cannot
        be made available.
      
        In addition the addresses of objects get moved around as they are
        garbage collected.  KSM sharing needs to be enabled "from the outside"
        for these types of workloads.
      
      Use case 2:
        The same interpreter can also be used for workloads where KSM brings
        no benefit or even has overhead.  We'd like to be able to enable KSM on
        a workload by workload basis.
      
      Use case 3:
        With the madvise call sharing opportunities are only enabled for the
        current process: it is a workload-local decision.  A considerable number
        of sharing opportunities may exist across multiple workloads or jobs (if
        they are part of the same security domain).  Only a higher-level entity
        like a job scheduler or container can know for certain if it's running
        one or more instances of a job.  That job scheduler however doesn't have
        the necessary internal workload knowledge to make targeted madvise
        calls.
      
      Security concerns:
      
        In previous discussions security concerns have been brought up.  The
        problem is that an individual workload does not have the knowledge about
        what else is running on a machine.  Therefore it has to be very
        conservative in what memory areas can be shared or not.  However, if the
        system is dedicated to running multiple jobs within the same security
        domain, it's the job scheduler that has the knowledge that sharing can be
        safely enabled and is even desirable.
      
      Performance:
      
        Experiments with using UKSM have shown a capacity increase of around 20%.
      
        Here are the metrics from an instagram workload (taken from a machine
        with 64GB main memory):
      
         full_scans: 445
         general_profit: 20158298048
         max_page_sharing: 256
         merge_across_nodes: 1
         pages_shared: 129547
         pages_sharing: 5119146
         pages_to_scan: 4000
         pages_unshared: 1760924
         pages_volatile: 10761341
         run: 1
         sleep_millisecs: 20
         stable_node_chains: 167
         stable_node_chains_prune_millisecs: 2000
         stable_node_dups: 2751
         use_zero_pages: 0
         zero_pages_sharing: 0
      
      After the service has been running for 30 minutes to an hour, 4 to 5 million
      shared pages are common for this workload when using KSM.
      
      Detailed changes:
      
      1. New options for prctl system command
         This patch series adds two new options to the prctl system call.
         The first one allows enabling KSM at the process level and the second
         one queries the setting.
      
      The setting will be inherited by child processes.
      
      With the above setting, KSM can be enabled for the seed process of a cgroup
      and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMAs and enable KSM for the eligible VMAs.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
      3. Add general_profit metric
         The general_profit metric of KSM is specified in the documentation,
         but not calculated.  This adds the general profit metric to
         /sys/kernel/mm/ksm.
      
      4. Add more metrics to ksm_stat
         This adds the process profit metric to /proc/<pid>/ksm_stat.
      
      5. Add more tests to ksm_tests and ksm_functional_tests
         This adds an option to specify the merge type to the ksm_tests.
         This allows testing both madvise and prctl KSM.
      
         It also adds a two new tests to ksm_functional_tests: one to test
         the new prctl options and the other one is a fork test to verify that
         the KSM process setting is inherited by child processes.
      
      This patch (of 3):
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      1. New options for prctl system command
      
         This patch series adds two new options to the prctl system call.
         The first one allows enabling KSM at the process level and the second
         one queries the setting.
      
         The setting will be inherited by child processes.
      
         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
      
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMAs and enable KSM for the eligible VMAs.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
        1) Introduce new MMF_VM_MERGE_ANY flag
      
           This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
           is set, kernel samepage merging (ksm) gets enabled for all VMAs of a
           process.
      
        2) Setting VM_MERGEABLE on VMA creation
      
           When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
           VM_MERGEABLE flag will be set for this VMA.
      
        3) support disabling of ksm for a process
      
           This adds the ability to disable ksm for a process if ksm has been
           enabled for the process with prctl.
      
        4) add new prctl option to get and set ksm for a process
      
           This adds two new options to the prctl system call
           - enable ksm for all vmas of a process (if the vmas support it).
           - query if ksm has been enabled for a process.
      
      3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
      
         In the s390 architecture when storage keys are used, the
         MMF_VM_MERGE_ANY will be disabled.
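      
      A small sketch of the prctl interface and the fork inheritance described
      above; the fallback constant values match the upstream uapi header and
      should be verified against this tree:
      
          #include <stdio.h>
          #include <sys/prctl.h>
          #include <sys/wait.h>
          #include <unistd.h>
          
          /* fallbacks in case the installed headers predate this series */
          #ifndef PR_SET_MEMORY_MERGE
          #define PR_SET_MEMORY_MERGE 67
          #define PR_GET_MEMORY_MERGE 68
          #endif
          
          int main(void)
          {
              pid_t pid;
          
              /* enable KSM for every current and future VMA of this process */
              if (prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0)) {
                  perror("PR_SET_MEMORY_MERGE");
                  return 1;
              }
          
              pid = fork();
              if (pid == 0) {
                  /* the child should inherit the setting (expected: 1) */
                  printf("child PR_GET_MEMORY_MERGE = %d\n",
                         prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0));
                  _exit(0);
              }
              waitpid(pid, NULL, 0);
              return 0;
          }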
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	kernel/sys.c mm/ksm.c mm/mmap.c
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>