1. 24 January 2023: 3 commits
    • panic: Introduce warn_limit · f53b6dda
      Authored by Kees Cook
      commit 9fc9e278 upstream.
      
      Like oops_limit, add warn_limit for limiting the number of warnings when
      panic_on_warn is not set.
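
      As an illustration only, a minimal userspace sketch of how an
      administrator might set the new limit at runtime; the
      /proc/sys/kernel/warn_limit path is assumed here, mirroring the way
      oops_limit is exposed.

      /*
       * Hypothetical usage sketch: set kernel.warn_limit from userspace.
       * The path is an assumption based on the sysctl name; a value of 0
       * would leave the limit disabled.
       */
      #include <stdio.h>

      int main(void)
      {
              FILE *f = fopen("/proc/sys/kernel/warn_limit", "w");

              if (!f) {
                      perror("warn_limit");
                      return 1;
              }
              fprintf(f, "%d\n", 100);        /* panic after 100 warnings */
              fclose(f);
              return 0;
      }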
      
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: tangmeng <tangmeng@uniontech.com>
      Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
      Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-doc@vger.kernel.org
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • exit: Allow oops_limit to be disabled · e0738725
      Authored by Kees Cook
      commit de92f657 upstream.
      
      In preparation for keeping oops_limit logic in sync with warn_limit,
      have oops_limit == 0 disable checking the Oops counter.
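
      A self-contained userspace mock of the intended semantics (not the
      kernel code itself): the limit is tested only when it is non-zero, so
      oops_limit == 0 counts oopses but never panics.

      /*
       * Mock of the oops_limit semantics; the real counter lives in the
       * kernel.  A limit of 0 disables the check entirely.
       */
      #include <stdio.h>
      #include <stdlib.h>

      static unsigned int oops_limit;         /* 0 == disabled */
      static unsigned int oops_count;

      static void report_oops(void)
      {
              unsigned int limit = oops_limit;

              /* The "&& limit" term is what makes a zero limit a no-op. */
              if (++oops_count >= limit && limit) {
                      fprintf(stderr, "Oopsed too often (limit %u)\n", limit);
                      exit(1);                /* stands in for panic() */
              }
      }

      int main(void)
      {
              for (int i = 0; i < 100000; i++)
                      report_oops();
              printf("oops_limit == 0: survived %u oopses\n", oops_count);
              return 0;
      }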
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • exit: Put an upper limit on how often we can oops · 767997ef
      Authored by Jann Horn
      commit d4ccd54d upstream.
      
      Many Linux systems are configured not to panic on oops; but allowing an
      attacker to oops the system **really** often can make even bugs that
      look completely unexploitable (like NULL dereferences and such)
      exploitable, if each crash elevates a refcount by one or takes a lock in
      read mode and the counter eventually overflows.
      
      The most interesting counters for this are 32 bits wide (like open-coded
      refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
      platforms is just 16 bits, but probably nobody cares about 32-bit platforms
      that much nowadays.)
      
      So let's panic the system if the kernel is constantly oopsing.
      
      The speed of oopsing 2^32 times probably depends on several factors, like
      how long the stack trace is and which unwinder you're using; an empirically
      important one is whether your console is showing a graphical environment or
      a text console that oopses will be printed to.
      In a quick single-threaded benchmark, it looks like oopsing in a vfork()
      child with a very short stack trace only takes ~510 microseconds per run
      when a graphical console is active; but switching to a text console that
      oopses are printed to slows it down around 87x, to ~45 milliseconds per
      run.
      (Adding more threads makes this faster, but the actual oops printing
      happens under &die_lock on x86, so you can maybe speed this up by a factor
      of around 2 and then any further improvement gets eaten up by lock
      contention.)
      
      It looks like it would take around 8-12 days to overflow a 32-bit counter
      with repeated oopsing on a multi-core X86 system running a graphical
      environment; both me (in an X86 VM) and Seth (with a distro kernel on
      normal hardware in a standard configuration) got numbers in that ballpark.
      
      12 days aren't *that* short on a desktop system, and you'd likely need much
      longer on a typical server system (assuming that people don't run graphical
      desktop environments on their servers), and this is a *very* noisy and
      violent approach to exploiting the kernel; and it also seems to take orders
      of magnitude longer on some machines, probably because stuff like EFI
      pstore will slow it down a ton if that's active.
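
      As a rough sanity check of those figures (only the numbers quoted above
      are used; this is not a new measurement):

      /*
       * Back-of-the-envelope check of the "8-12 days" estimate, using the
       * ~510 us per oops quoted above and the ~2x speedup available before
       * die_lock contention dominates.
       */
      #include <stdio.h>

      int main(void)
      {
              double per_oops = 510e-6;               /* seconds per oops */
              double oopses = 4294967296.0;           /* 2^32 wraps a 32-bit counter */
              double single = oopses * per_oops;      /* single-threaded, in seconds */
              double parallel = single / 2.0;         /* ~2x from more threads */

              printf("single-threaded: ~%.0f days\n", single / 86400);   /* ~25 */
              printf("multi-threaded:  ~%.0f days\n", parallel / 86400); /* ~13 */
              return 0;
      }
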
      Signed-off-by: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  2. 31 December 2022: 1 commit
  3. 17 September 2022: 1 commit
  4. 12 September 2022: 3 commits
  5. 25 August 2022: 1 commit
  6. 24 August 2022: 1 commit
  7. 22 August 2022: 1 commit
  8. 09 August 2022: 1 commit
  9. 27 July 2022: 1 commit
    • powerpc/pseries/mobility: set NMI watchdog factor during an LPM · 118b1366
      Authored by Laurent Dufour
      During an LPM, while the memory transfer is in progress, accessing pages
      that have not yet been transferred on the arrival side generates extra
      latency. Thus, the NMI watchdog may be triggered too frequently, which
      increases the risk of hitting an NMI interrupt in a bad place in the
      kernel, leading to a kernel panic.
      
      Disabling the Hard Lockup Watchdog until the memory transfer completes
      could be too strong a workaround; some users would still want this
      timeout to trigger if the system hangs, even during an LPM.

      Introduce a new sysctl variable, nmi_watchdog_factor, which applies a
      factor to the NMI watchdog timeout during an LPM. Just before the CPUs
      are stopped for the switchover sequence, the NMI watchdog timer is set
      to watchdog_thresh + factor%.
      
      A value of 0 has no effect. The default value is 200, meaning that the
      NMI watchdog timeout is set to 30s during an LPM (based on a 10s
      watchdog_thresh value). Once the memory transfer is complete, the factor
      is reset to 0.
      
      Setting this value to a high number is like disabling the NMI watchdog
      during an LPM.
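
      For illustration, a small sketch of the timeout arithmetic described
      above; effective_timeout() is a hypothetical helper, and only the
      formula and the example values come from this message.

      #include <stdio.h>

      /* Effective timeout = watchdog_thresh plus factor percent of it. */
      static unsigned int effective_timeout(unsigned int watchdog_thresh,
                                            unsigned int factor)
      {
              return watchdog_thresh * (100 + factor) / 100;
      }

      int main(void)
      {
              /* Default factor of 200 on a 10s threshold gives 30s, as stated. */
              printf("factor 200: %us\n", effective_timeout(10, 200));
              /* Factor 0 leaves the 10s timeout untouched. */
              printf("factor 0:   %us\n", effective_timeout(10, 0));
              return 0;
      }
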
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220713154729.80789-5-ldufour@linux.ibm.com
  10. 09 July 2022: 1 commit
  11. 04 July 2022: 1 commit
  12. 28 June 2022: 1 commit
  13. 25 June 2022: 1 commit
  14. 20 May 2022: 1 commit
  15. 16 May 2022: 1 commit
  16. 14 May 2022: 2 commits
    • mm: hugetlb_vmemmap: add hugetlb_optimize_vmemmap sysctl · 78f39084
      Authored by Muchun Song
      We must add hugetlb_free_vmemmap=on (or "off") to the boot cmdline and
      reboot the server to enable or disable the feature of optimizing vmemmap
      pages associated with HugeTLB pages.  However, rebooting usually takes a
      long time.  So add a sysctl to enable or disable the feature at runtime
      without rebooting.  Why do we need this?  There are 3 use cases.
      
      1) The feature of minimizing overhead of struct page associated with
         each HugeTLB page is disabled by default without passing
         "hugetlb_free_vmemmap=on" to the boot cmdline.  When we (ByteDance)
         deliver the servers to users who want to enable this feature, they
         have to configure grub (change the boot cmdline) and reboot the
         servers, whereas rebooting usually takes a long time (we have
         thousands of servers).  It's a very bad experience for the users.
         So we need an approach to enable this feature without rebooting.
         This is a use case from our practical environment.
      
      2) In some use cases, HugeTLB pages are allocated 'on the fly' instead
         of being pulled from the HugeTLB pool; those workloads would be
         affected with this feature enabled.  Such workloads never explicitly
         allocate huge pages with 'nr_hugepages' but only set
         'nr_overcommit_hugepages' and then let the pages be allocated from
         the buddy allocator at fault time.  We can confirm it is a real use
         case from commit 099730d6.  For those workloads, the page fault time
         could be ~2x slower than before.  We suspect those users would want
         to disable this feature if the system has enabled it before and they
         don't think the memory savings benefit is enough to make up for the
         performance drop.
      
      3) A workload that wants vmemmap pages to be optimized and a workload
         that sets 'nr_overcommit_hugepages' and does not want the extra
         overhead at fault time (when the overcommitted pages are allocated
         from the buddy allocator) may be deployed on the same server.  The
         user could enable this feature, set 'nr_hugepages' and
         'nr_overcommit_hugepages', and then disable the feature.  In this
         case, the overcommitted HugeTLB pages will not encounter the extra
         overhead at fault time (see the sketch right after this list).
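
      A minimal userspace sketch of that enable/populate/disable flow; the
      hugetlb_optimize_vmemmap path is assumed from the sysctl name, and the
      page counts are placeholders.

      #include <stdio.h>

      /* Write a single value to a sysctl file; returns 0 on success. */
      static int write_sysctl(const char *path, const char *val)
      {
              FILE *f = fopen(path, "w");

              if (!f) {
                      perror(path);
                      return -1;
              }
              fprintf(f, "%s\n", val);
              fclose(f);
              return 0;
      }

      int main(void)
      {
              write_sysctl("/proc/sys/vm/hugetlb_optimize_vmemmap", "1");
              write_sysctl("/proc/sys/vm/nr_hugepages", "1024");
              write_sysctl("/proc/sys/vm/nr_overcommit_hugepages", "1024");
              write_sysctl("/proc/sys/vm/hugetlb_optimize_vmemmap", "0");
              return 0;
      }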
      
      Link: https://lkml.kernel.org/r/20220512041142.39501-5-songmuchun@bytedance.com
      Signed-off-by: Muchun Song <songmuchun@bytedance.com>
      Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Masahiro Yamada <masahiroy@kernel.org>
      Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    • random: fix sysctl documentation nits · 069c4ea6
      Authored by Jason A. Donenfeld
      A semicolon was missing, and the almost-alphabetical-but-not ordering
      was confusing, so regroup these by category instead.
      Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
  17. 02 May 2022: 1 commit
  18. 29 April 2022: 1 commit
  19. 24 March 2022: 2 commits
  20. 23 March 2022: 1 commit
    • NUMA balancing: optimize page placement for memory tiering system · c574bbe9
      Authored by Huang Ying
      With the advent of various new memory types, some machines will have
      multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
      memory subsystem of these machines can be called a memory tiering
      system, because the performance of the different types of memory usually
      differs.
      
      In such a system, because of changing memory access patterns etc., some
      pages in the slow memory may become hot globally.  So in this patch, the
      NUMA balancing mechanism is enhanced to dynamically optimize page
      placement among the different memory types according to hot/cold status.
      
      In a typical memory tiering system, there are CPUs, fast memory and slow
      memory in each physical NUMA node.  The CPUs and the fast memory will be
      put in one logical node (called fast memory node), while the slow memory
      will be put in another (faked) logical node (called slow memory node).
      That is, the fast memory is regarded as local while the slow memory is
      regarded as remote.  So it's possible for the recently accessed pages in
      the slow memory node to be promoted to the fast memory node via the
      existing NUMA balancing mechanism.
      
      The original NUMA balancing mechanism will stop migrating pages if the
      free memory of the target node falls below the high watermark.  This is
      a reasonable policy if there's only one memory type.  But it makes the
      original NUMA balancing mechanism almost useless for optimizing page
      placement among different memory types.  Details are as follows.
      
      It is common for the working-set size of the workload to be larger than
      the size of the fast memory nodes; otherwise, it's unnecessary to use
      the slow memory at all.  So there are almost never enough free pages in
      the fast memory nodes, and the globally hot pages in the slow memory
      node cannot be promoted to the fast memory node.  To solve the issue, we
      have 2 choices as follows:
      
      a. Ignore the free pages watermark checking when promoting hot pages
         from the slow memory node to the fast memory node.  This will
         create some memory pressure in the fast memory node, thus trigger
         the memory reclaiming.  So that, the cold pages in the fast memory
         node will be demoted to the slow memory node.
      
      b. Define a new watermark called wmark_promo which is higher than
         wmark_high, and have kswapd reclaim pages until free pages reach
         that watermark.  The scenario is as follows: when we want to promote
         hot pages from slow memory to fast memory, but the fast memory's free
         pages would go below the high watermark with such a promotion, we
         wake up kswapd with the wmark_promo watermark in order to demote cold
         pages and free up some space.  So, next time we want to promote hot
         pages we might have a chance of doing so.
      
      Choice "a" may create high memory pressure in the fast memory node.  If
      the memory pressure of the workload is high, it may become so high that
      the memory allocation latency of the workload is affected, e.g. direct
      reclaim may be triggered.

      Choice "b" works much better in this respect.  If the memory pressure of
      the workload is high, hot page promotion will stop earlier because its
      allocation watermark is higher than that of normal memory allocation.
      So in this patch, choice "b" is implemented: a new zone watermark
      (WMARK_PROMO) is added, which is larger than the high watermark and can
      be controlled via watermark_scale_factor (see the sketch below).
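
      To make the policy concrete, a purely conceptual, self-contained sketch
      of choice "b"; all types and names here are made up for illustration and
      are not kernel code.

      #include <stdbool.h>
      #include <stdio.h>

      struct node {
              long free_pages;
              long wmark_high;
              long wmark_promo;       /* higher than wmark_high */
      };

      /*
       * Skip a promotion that would drop the fast node below the high
       * watermark; instead ask kswapd to demote cold pages until the higher
       * promo watermark is reached, and retry the promotion later.
       */
      static bool try_promote(struct node *fast, long nr_pages)
      {
              if (fast->free_pages - nr_pages < fast->wmark_high) {
                      printf("wake kswapd: demote until %ld pages are free\n",
                             fast->wmark_promo);
                      return false;
              }
              fast->free_pages -= nr_pages;
              return true;
      }

      int main(void)
      {
              struct node fast = {
                      .free_pages = 820, .wmark_high = 800, .wmark_promo = 1000,
              };

              printf("promoted: %d\n", try_promote(&fast, 64));  /* 0: too full */
              fast.free_pages = 1100;                            /* after reclaim */
              printf("promoted: %d\n", try_promote(&fast, 64));  /* 1 */
              return 0;
      }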
      
      In addition to the original page placement optimization among sockets,
      the NUMA balancing mechanism is extended to optimize page placement
      according to hot/cold status among different memory types.  So the
      sysctl user space interface (numa_balancing) is extended in a backward
      compatible way as follows, so that users can enable/disable this
      functionality individually.

      The sysctl is converted from a Boolean value to a bit field.  The flags
      are defined as follows (a usage sketch follows the list):
      
      - 0: NUMA_BALANCING_DISABLED
      - 1: NUMA_BALANCING_NORMAL
      - 2: NUMA_BALANCING_MEMORY_TIERING
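
      For illustration, a minimal sketch of driving the extended interface
      from userspace; the flag values are the ones listed above, and
      kernel.numa_balancing is the existing sysctl being extended.

      #include <stdio.h>

      #define NUMA_BALANCING_DISABLED        0x0
      #define NUMA_BALANCING_NORMAL          0x1
      #define NUMA_BALANCING_MEMORY_TIERING  0x2

      int main(void)
      {
              /* Enable classic balancing and memory-tiering promotion. */
              int mode = NUMA_BALANCING_NORMAL | NUMA_BALANCING_MEMORY_TIERING;
              FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

              if (!f) {
                      perror("numa_balancing");
                      return 1;
              }
              fprintf(f, "%d\n", mode);       /* writes 3 */
              fclose(f);
              return 0;
      }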
      
      We have tested the patch with the pmbench memory access benchmark, using
      an 80:20 read/write ratio and a Gaussian access address distribution, on
      a 2-socket Intel server with Optane DC Persistent Memory.  The test
      results show that the pmbench score can improve by up to 95.9%.
      
      Thanks to Andrew Morton for helping fix the documentation format error.
      
      Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  21. 22 February 2022: 1 commit
  22. 21 February 2022: 1 commit
  23. 12 February 2022: 1 commit
  24. 31 January 2022: 1 commit
  25. 15 January 2022: 1 commit
  26. 14 December 2021: 1 commit
  27. 17 November 2021: 1 commit
  28. 04 September 2021: 1 commit
  29. 30 June 2021: 4 commits
  30. 18 June 2021: 1 commit
  31. 24 May 2021: 1 commit