1. 24 January 2023 (3 commits)
    • panic: Introduce warn_limit · f53b6dda
      Authored by Kees Cook
      commit 9fc9e278 upstream.
      
      Like oops_limit, add warn_limit for limiting the number of warnings when
      panic_on_warn is not set.
      
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Petr Mladek <pmladek@suse.com>
      Cc: tangmeng <tangmeng@uniontech.com>
      Cc: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
      Cc: Tiezhu Yang <yangtiezhu@loongson.cn>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Cc: linux-doc@vger.kernel.org
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-5-keescook@chromium.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
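The warn_limit semantics described above can be sketched in a few lines. This is a minimal illustration, not the kernel code: `check_warn` and its parameters are hypothetical names; the real counter lives in kernel/panic.c and the limit is read from the `kernel.warn_limit` sysctl.

```python
# Sketch of the warn_limit logic (assumed names; the real implementation
# panics via the kernel's panic() routine).
warn_count = 0

def check_warn(warn_limit, panic_on_warn=False):
    """Count one WARN(); report whether the kernel would panic.

    warn_limit only applies when panic_on_warn is not set, and
    warn_limit == 0 disables the counter check, mirroring oops_limit.
    """
    global warn_count
    warn_count += 1
    if panic_on_warn:
        return True  # panic immediately on any warning
    return warn_limit != 0 and warn_count >= warn_limit

# With warn_limit=3, the third warning crosses the limit.
results = [check_warn(3) for _ in range(3)]
```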
    • exit: Allow oops_limit to be disabled · e0738725
      Authored by Kees Cook
      commit de92f657 upstream.
      
      In preparation for keeping oops_limit logic in sync with warn_limit,
      have oops_limit == 0 disable checking the Oops counter.
      
      Cc: Jann Horn <jannh@google.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Cc: linux-doc@vger.kernel.org
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
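The disable semantics are simple enough to state as a predicate. A minimal sketch, with hypothetical names (the kernel keeps the counter internally and reads the limit from `kernel.oops_limit`):

```python
def oops_may_panic(oops_count, oops_limit):
    """oops_limit == 0 disables the Oops counter check entirely;
    otherwise the system panics once the count reaches the limit."""
    return oops_limit != 0 and oops_count >= oops_limit

checks = [
    oops_may_panic(10_000, 0),       # limit 0: never panics from the counter
    oops_may_panic(9_999, 10_000),   # under the limit
    oops_may_panic(10_000, 10_000),  # limit reached
]
```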
    • exit: Put an upper limit on how often we can oops · 767997ef
      Authored by Jann Horn
      commit d4ccd54d upstream.
      
      Many Linux systems are configured to not panic on oops; but allowing an
      attacker to oops the system **really** often can make even bugs that look
      completely unexploitable exploitable (like NULL dereferences and such) if
      each crash elevates a refcount by one or a lock is taken in read mode, and
      this causes a counter to eventually overflow.
      
      The most interesting counters for this are 32 bits wide (like open-coded
      refcounts that don't use refcount_t). (The ldsem reader count on 32-bit
      platforms is just 16 bits, but probably nobody cares about 32-bit platforms
      that much nowadays.)
      
      So let's panic the system if the kernel is constantly oopsing.
      
      The speed of oopsing 2^32 times probably depends on several factors, like
      how long the stack trace is and which unwinder you're using; an empirically
      important one is whether your console is showing a graphical environment or
      a text console that oopses will be printed to.
      In a quick single-threaded benchmark, it looks like oopsing in a vfork()
      child with a very short stack trace only takes ~510 microseconds per run
      when a graphical console is active; but switching to a text console that
      oopses are printed to slows it down around 87x, to ~45 milliseconds per
      run.
      (Adding more threads makes this faster, but the actual oops printing
      happens under &die_lock on x86, so you can maybe speed this up by a factor
      of around 2 and then any further improvement gets eaten up by lock
      contention.)
      
      It looks like it would take around 8-12 days to overflow a 32-bit counter
      with repeated oopsing on a multi-core X86 system running a graphical
      environment; both me (in an X86 VM) and Seth (with a distro kernel on
      normal hardware in a standard configuration) got numbers in that ballpark.
      
      12 days aren't *that* short on a desktop system, and you'd likely need much
      longer on a typical server system (assuming that people don't run graphical
      desktop environments on their servers), and this is a *very* noisy and
      violent approach to exploiting the kernel; and it also seems to take orders
      of magnitude longer on some machines, probably because stuff like EFI
      pstore will slow it down a ton if that's active.
      Signed-off-by: Jann Horn <jannh@google.com>
      Link: https://lore.kernel.org/r/20221107201317.324457-1-jannh@google.com
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Link: https://lore.kernel.org/r/20221117234328.594699-2-keescook@chromium.org
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
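The 8-12 day estimate above can be checked with back-of-the-envelope arithmetic from the numbers quoted in the message (~510 microseconds per oops on a graphical console, and roughly a 2x ceiling from die_lock contention with more threads):

```python
# Overflowing a 32-bit counter by repeated oopsing, using the figures
# quoted in the commit message.
SECONDS_PER_DAY = 86_400
per_oops_s = 510e-6  # ~510 us per oops, graphical console, single thread

single_threaded_days = (2**32 * per_oops_s) / SECONDS_PER_DAY  # ~25 days
multi_core_days = single_threaded_days / 2                     # ~12.7 days
```

which lands in the same ballpark as the measurements reported above; a text console (~87x slower per oops) pushes this far beyond practicality.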
  2. 31 December 2022 (1 commit)
  3. 12 September 2022 (2 commits)
    • kernel/utsname_sysctl.c: print kernel arch · bfca3dd3
      Authored by Petr Vorel
      Print the machine hardware name (UTS_MACHINE) in /proc/sys/kernel/arch.
      
      This helps people debugging the kernel from an initramfs with a minimal
      environment (i.e. without coreutils or even busybox), and allows opening
      the sysctl file instead of running 'uname -m' from high-level languages.
      
      Link: https://lkml.kernel.org/r/20220901194403.3819-1-pvorel@suse.cz
      Signed-off-by: Petr Vorel <pvorel@suse.cz>
      Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: David Sterba <dsterba@suse.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Rafael J. Wysocki <rafael@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
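A user-space consumer might read the new file with a fallback, since `/proc/sys/kernel/arch` only exists on kernels that include this patch. A small sketch (the fallback via `platform.machine()` is this example's choice, not part of the commit):

```python
import platform

def kernel_arch():
    """Return the machine hardware name (UTS_MACHINE).

    Prefers /proc/sys/kernel/arch where available; falls back to
    uname(2) via platform.machine() on older kernels or non-Linux
    systems.
    """
    try:
        with open("/proc/sys/kernel/arch") as f:
            return f.read().strip()
    except OSError:
        return platform.machine()

arch = kernel_arch()
```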
    • memory tiering: rate limit NUMA migration throughput · c6833e10
      Authored by Huang Ying
      In NUMA balancing memory tiering mode, if there are hot pages in the slow
      memory node and cold pages in the fast memory node, we need to
      promote/demote hot/cold pages between the fast and slow memory nodes.
      
      One choice is to promote/demote as fast as possible.  But the CPU cycles
      and memory bandwidth consumed by a high promoting/demoting throughput will
      hurt the latency of some workloads, because of access latency inflation
      and slow memory bandwidth contention.
      
      A way to resolve this issue is to restrict the max promoting/demoting
      throughput.  It will take longer to finish the promoting/demoting.  But
      the workload latency will be better.  This is implemented in this patch as
      the page promotion rate limit mechanism.
      
      The number of candidate pages to be promoted to the fast memory node via
      NUMA balancing is counted; if the count exceeds the limit specified by the
      user, NUMA balancing promotion is stopped until the next second.
      
      A new sysctl knob kernel.numa_balancing_promote_rate_limit_MBps is added
      for the users to specify the limit.
      
      Link: https://lkml.kernel.org/r/20220713083954.34196-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: osalvador <osalvador@suse.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Yang Shi <shy828301@gmail.com>
      Cc: Zhong Jiang <zhongjiang-ali@linux.alibaba.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
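The per-second windowing described above can be sketched as a tiny rate limiter. The class and method names are hypothetical (the real knob is `kernel.numa_balancing_promote_rate_limit_MBps` and the kernel counts in MB, not pages, per second):

```python
class PromoteRateLimit:
    """Per-second promotion rate limit, as described in the commit."""

    def __init__(self, limit_per_sec):
        self.limit = limit_per_sec
        self.window_start = 0  # current one-second window
        self.candidates = 0    # candidates counted in this window

    def try_promote(self, now_sec, npages):
        """Count candidate pages; refuse promotion once the count
        exceeds the limit, until the next second starts."""
        if now_sec != self.window_start:  # new second: reset the counter
            self.window_start = now_sec
            self.candidates = 0
        self.candidates += npages
        return self.candidates <= self.limit

rl = PromoteRateLimit(limit_per_sec=100)
first = rl.try_promote(0, 60)   # within the limit: allowed
second = rl.try_promote(0, 60)  # over the limit: throttled
third = rl.try_promote(1, 60)   # next second: allowed again
```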
  4. 27 July 2022 (1 commit)
    • powerpc/pseries/mobility: set NMI watchdog factor during an LPM · 118b1366
      Authored by Laurent Dufour
      During an LPM, while the memory transfer is in progress on the arrival
      side, some latencies are generated when accessing not yet transferred
      pages on the arrival side. Thus, the NMI watchdog may be triggered too
      frequently, which increases the risk of hitting an NMI interrupt in a bad
      place in the kernel, leading to a kernel panic.
      
      Disabling the hard lockup watchdog until the memory transfer completes
      could be too strong a workaround; some users would want this timeout to
      eventually trigger if the system hangs even during an LPM.
      
      Introduce a new sysctl variable, nmi_watchdog_factor. It allows applying a
      factor to the NMI watchdog timeout during an LPM. Just before the CPUs are
      stopped for the switchover sequence, the NMI watchdog timer is set to
      watchdog_thresh + factor%.
      
      A value of 0 has no effect. The default value is 200, meaning that the
      NMI watchdog is set to 30s during LPM (based on a 10s watchdog_thresh
      value). Once the memory transfer is achieved, the factor is reset to 0.
      
      Setting this value to a high number is like disabling the NMI watchdog
      during an LPM.
      Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
      Reviewed-by: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
      Link: https://lore.kernel.org/r/20220713154729.80789-5-ldufour@linux.ibm.com
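The timeout arithmetic above (watchdog_thresh + factor%) is easy to state as a one-liner. A sketch with a hypothetical function name, reproducing the worked example from the commit (default factor 200 turns a 10 s threshold into 30 s):

```python
def lpm_watchdog_timeout(watchdog_thresh_sec, factor_pct):
    """NMI watchdog timeout during an LPM: watchdog_thresh + factor%.

    factor_pct == 0 leaves the timeout unchanged; very large values
    effectively disable the NMI watchdog for the duration of the LPM.
    """
    return watchdog_thresh_sec * (100 + factor_pct) // 100

default_timeout = lpm_watchdog_timeout(10, 200)  # default factor: 30 s
unchanged = lpm_watchdog_timeout(10, 0)          # factor 0: no effect
```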
  5. 25 June 2022 (1 commit)
  6. 14 May 2022 (1 commit)
  7. 02 May 2022 (1 commit)
  8. 24 March 2022 (2 commits)
  9. 23 March 2022 (1 commit)
    • NUMA balancing: optimize page placement for memory tiering system · c574bbe9
      Authored by Huang Ying
      With the advent of various new memory types, some machines will have
      multiple types of memory, e.g.  DRAM and PMEM (persistent memory).  The
      memory subsystem of these machines can be called memory tiering system,
      because the performance of the different types of memory are usually
      different.
      
      In such a system, because the memory access pattern changes over time,
      some pages in the slow memory may become hot globally.  So in this patch,
      the NUMA balancing mechanism is enhanced to optimize page placement among
      the different memory types dynamically, according to hot/cold status.
      
      In a typical memory tiering system, there are CPUs, fast memory and slow
      memory in each physical NUMA node.  The CPUs and the fast memory will be
      put in one logical node (called fast memory node), while the slow memory
      will be put in another (faked) logical node (called slow memory node).
      That is, the fast memory is regarded as local while the slow memory is
      regarded as remote.  So it's possible for the recently accessed pages in
      the slow memory node to be promoted to the fast memory node via the
      existing NUMA balancing mechanism.
      
      The original NUMA balancing mechanism stops migrating pages once the free
      memory of the target node falls below the high watermark.  This is a
      reasonable policy when there is only one memory type, but it renders the
      original NUMA balancing mechanism almost useless for optimizing page
      placement among different memory types.  Details are as follows.
      
      It is common for the working-set size of the workload to be larger than
      the size of the fast memory nodes; otherwise it would be unnecessary to
      use the slow memory at all.  So there are almost never enough free pages
      in the fast memory nodes, and the globally hot pages in the slow memory
      node cannot be promoted to the fast memory node.  To solve this issue, we
      have 2 choices as follows,
      
      a. Ignore the free pages watermark checking when promoting hot pages
         from the slow memory node to the fast memory node.  This will
         create some memory pressure in the fast memory node, thus trigger
         the memory reclaiming.  So that, the cold pages in the fast memory
         node will be demoted to the slow memory node.
      
      b. Define a new watermark called wmark_promo which is higher than
         wmark_high, and have kswapd reclaiming pages until free pages reach
         such watermark.  The scenario is as follows: when we want to promote
         hot-pages from a slow memory to a fast memory, but fast memory's free
         pages would go lower than high watermark with such promotion, we wake
         up kswapd with wmark_promo watermark in order to demote cold pages and
         free us up some space.  So, next time we want to promote hot-pages we
         might have a chance of doing so.
      
      Choice "a" may create high memory pressure in the fast memory node.  If
      the memory pressure of the workload is high, the combined pressure may
      become so high that the workload's memory allocation latency is affected,
      e.g. direct reclaim may be triggered.
      
      Choice "b" works much better in this respect.  If the memory pressure of
      the workload is high, hot page promotion stops earlier, because its
      allocation watermark is higher than that of normal memory allocation.  So
      choice "b" is implemented in this patch: a new zone watermark
      (WMARK_PROMO) is added, which is larger than the high watermark and can
      be controlled via watermark_scale_factor.
      
      In addition to the original page placement optimization among sockets, the
      NUMA balancing mechanism is extended to optimize page placement according
      to hot/cold status among different memory types.  The sysctl user space
      interface (numa_balancing) is therefore extended in a backward compatible
      way as follows, so that users can enable/disable each piece of
      functionality individually.
      
      The sysctl is converted from a Boolean value to a bit field.  The
      definition of the flags is:
      
      - 0: NUMA_BALANCING_DISABLED
      - 1: NUMA_BALANCING_NORMAL
      - 2: NUMA_BALANCING_MEMORY_TIERING
      
      We have tested the patch with the pmbench memory accessing benchmark
      with the 80:20 read/write ratio and the Gauss access address
      distribution on a 2 socket Intel server with Optane DC Persistent
      Memory Model.  The test results show that the pmbench score can improve
      by up to 95.9%.
      
      Thanks to Andrew Morton for helping fix the documentation format error.
      
      Link: https://lkml.kernel.org/r/20220221084529.1052339-3-ying.huang@intel.com
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Reviewed-by: Oscar Salvador <osalvador@suse.de>
      Reviewed-by: Yang Shi <shy828301@gmail.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Zi Yan <ziy@nvidia.com>
      Cc: Wei Xu <weixugc@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Feng Tang <feng.tang@intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
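The bit-field encoding of the numa_balancing sysctl can be illustrated directly from the flag values listed above (the helper names below are this sketch's own):

```python
# Flag values from the commit text for /proc/sys/kernel/numa_balancing.
NUMA_BALANCING_DISABLED = 0x0
NUMA_BALANCING_NORMAL = 0x1
NUMA_BALANCING_MEMORY_TIERING = 0x2

def normal_enabled(sysctl_val):
    """Original cross-socket balancing enabled?"""
    return bool(sysctl_val & NUMA_BALANCING_NORMAL)

def tiering_enabled(sysctl_val):
    """Hot/cold placement optimization among memory types enabled?"""
    return bool(sysctl_val & NUMA_BALANCING_MEMORY_TIERING)

# Writing 1 keeps the old Boolean behaviour (backward compatible);
# 2 enables only the memory-tiering optimization; 3 enables both.
modes = [(normal_enabled(v), tiering_enabled(v)) for v in (0, 1, 2, 3)]
```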
  10. 22 February 2022 (1 commit)
  11. 21 February 2022 (1 commit)
  12. 12 February 2022 (1 commit)
  13. 14 December 2021 (1 commit)
  14. 17 November 2021 (1 commit)
  15. 30 June 2021 (1 commit)
  16. 18 June 2021 (1 commit)
  17. 18 May 2021 (1 commit)
  18. 15 May 2021 (1 commit)
  19. 14 May 2021 (1 commit)
  20. 12 May 2021 (1 commit)
  21. 09 December 2020 (3 commits)
  22. 13 August 2020 (1 commit)
  23. 29 July 2020 (1 commit)
  24. 06 July 2020 (1 commit)
  25. 27 June 2020 (1 commit)
  26. 20 June 2020 (1 commit)
  27. 09 June 2020 (3 commits)
    • panic: add sysctl to dump all CPUs backtraces on oops event · 60c958d8
      Authored by Guilherme G. Piccoli
      Usually when the kernel reaches an oops condition, it's a point of no
      return; in case not enough debug information is available in the kernel
      splat, one of the last resorts would be to collect a kernel crash dump
      and analyze it.  The problem with this approach is that in order to
      collect the dump, a panic is required (to kexec-load the crash kernel).
      When in an environment of multiple virtual machines, users may prefer to
      try living with the oops, at least until being able to properly shutdown
      their VMs / finish their important tasks.
      
      This patch implements a way to collect a bit more debug details when an
      oops event is reached, by printing all the CPUs backtraces through the
      usage of NMIs (on architectures that support that).  The sysctl added
      (and documented) here is called "oops_all_cpu_backtrace", and when set
      it will (as the name suggests) dump all CPUs' backtraces.
      
      Far from ideal, this may nevertheless be the last option for users that
      for some reason cannot panic on oops.  Most of the time oopses are clear
      enough to indicate the kernel portion that must be investigated, but in
      virtual environments it is possible to observe hypervisor/KVM issues that
      could lead to oopses shown on other guests' CPUs (like virtual APIC
      crashes).  This patch hence aims to help debug such complex issues
      without resorting to kdump.
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Matthew Wilcox <willy@infradead.org>
      Link: http://lkml.kernel.org/r/20200327224116.21030-1-gpiccoli@canonical.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • kernel/hung_task.c: introduce sysctl to print all traces when a hung task is detected · 0ec9dc9b
      Authored by Guilherme G. Piccoli
      Commit 401c636a ("kernel/hung_task.c: show all hung tasks before
      panic") introduced a change in that we started to show all CPUs
      backtraces when a hung task is detected _and_ the sysctl/kernel
      parameter "hung_task_panic" is set.  The idea is good, because usually
      when observing deadlocks (that may lead to hung tasks), the culprit is
      another task holding a lock and not necessarily the task detected as
      hung.
      
      The problem with this approach is that dumping backtraces is a slightly
      expensive task, especially printing them on the console (and especially
      on machines with many CPUs, like the servers commonly found nowadays).
      So, users that plan to collect a kdump to investigate the hung tasks and
      narrow down the deadlock definitely don't need the CPUs' backtraces on
      dmesg/console, which only delay the panic and pollute the log (the crash
      tool can easily grab all CPUs' traces with the 'bt -a' command).
      
      Also, there's the reciprocal scenario: some users may be interested in
      seeing the CPUs' backtraces but don't want the system to panic when a
      hung task is detected.  The current approach hence almost amounts to
      embedding a policy in the kernel, by forcing the CPUs' backtrace dump
      (only) on hung_task_panic.
      
      This patch decouples the panic event on hung task from the CPUs'
      backtrace dump, by creating (and documenting) a new sysctl called
      "hung_task_all_cpu_backtrace", analogous to the approach taken for
      soft/hard lockups, which have both a panic and an "all_cpu_backtrace"
      sysctl to allow individual control.  The new mechanism for dumping the
      CPUs' backtraces on hung task detection respects "hung_task_warnings" by
      not dumping the traces when there are no warnings left.
      Signed-off-by: Guilherme G. Piccoli <gpiccoli@canonical.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Kees Cook <keescook@chromium.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Link: http://lkml.kernel.org/r/20200327223646.20779-1-gpiccoli@canonical.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
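The decoupling described above — panic and backtrace dump as independent knobs, with the dump gated on remaining hung_task_warnings — can be sketched as follows. The function and parameter names are hypothetical, not the kernel's:

```python
def hung_task_actions(all_cpu_backtrace, hung_task_panic, warnings_left):
    """On hung-task detection, decide (dump_backtraces, panic).

    The two controls are independent; the dump additionally respects
    hung_task_warnings by staying silent once no warnings are left.
    """
    dump = all_cpu_backtrace and warnings_left > 0
    return dump, hung_task_panic

# Backtraces without panic, panic without backtraces, and a suppressed
# dump once hung_task_warnings is exhausted.
a = hung_task_actions(True, False, warnings_left=5)
b = hung_task_actions(False, True, warnings_left=5)
c = hung_task_actions(True, False, warnings_left=0)
```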
    • kernel: add panic_on_taint · db38d5c1
      Authored by Rafael Aquini
      Analogously to the introduction of panic_on_warn, this patch introduces
      a kernel option named panic_on_taint in order to provide a simple and
      generic way to stop execution and catch a coredump when the kernel gets
      tainted by any given flag.
      
      This is useful for debugging sessions as it avoids having to rebuild the
      kernel to explicitly add calls to panic() into the code sites that
      introduce the taint flags of interest.
      
      For instance, if one is interested in proceeding with a post-mortem
      analysis at the point a given code path is hitting a bad page (i.e.
      unaccount_page_cache_page(), or slab_bug()), a coredump can be collected
      by rebooting the kernel with 'panic_on_taint=0x20' amended to the
      command line.
      
      Another, perhaps less frequent, use for this option would be as a means
      for assuring a security policy case where only a subset of taints, or no
      single taint (in paranoid mode), is allowed for the running system.  The
      optional switch 'nousertaint' is handy in this particular scenario, as
      it will avoid userspace induced crashes by writes to sysctl interface
      /proc/sys/kernel/tainted causing false positive hits for such policies.
      
      [akpm@linux-foundation.org: tweak kernel-parameters.txt wording]
      Suggested-by: Qian Cai <cai@lca.pw>
      Signed-off-by: Rafael Aquini <aquini@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Dave Young <dyoung@redhat.com>
      Cc: Baoquan He <bhe@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Laura Abbott <labbott@redhat.com>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Jiri Kosina <jikos@kernel.org>
      Cc: Takashi Iwai <tiwai@suse.de>
      Link: http://lkml.kernel.org/r/20200515175502.146720-1-aquini@redhat.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
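The mask check behind panic_on_taint can be sketched directly from the 0x20 example above (0x20 is the mask bit of TAINT_BAD_PAGE, bit 5). The function name below is this sketch's own; the kernel does this inside its add_taint path:

```python
TAINT_BAD_PAGE = 5  # bit position; mask bit = 1 << 5 = 0x20

def add_taint(tainted_mask, taint_bit, panic_on_taint_mask):
    """Set a taint bit; report whether panic_on_taint would fire, i.e.
    whether the newly tainted mask intersects the configured mask."""
    tainted_mask |= 1 << taint_bit
    would_panic = bool(tainted_mask & panic_on_taint_mask)
    return tainted_mask, would_panic

# Booting with 'panic_on_taint=0x20': a bad-page taint panics, while an
# unrelated taint (e.g. bit 0, TAINT_PROPRIETARY_MODULE) does not.
_, fires = add_taint(0, TAINT_BAD_PAGE, 0x20)
_, quiet = add_taint(0, 0, 0x20)
```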
  28. 26 May 2020 (2 commits)
  29. 18 May 2020 (1 commit)
  30. 16 May 2020 (1 commit)
  31. 05 May 2020 (1 commit)