- 23 November 2022, 1 commit

Submitted by Perry Yuan
Add a new amd-pstate driver command line option to enable the driver's passive working mode, in which it requests desired performance on an abstract scale via the MSR or shared memory interface, and the power management firmware (SMU) converts the perf requests into actual hardware pstates. A `disable` parameter is also provided: adding `amd_pstate=disable` to the kernel command line prevents the driver from loading.

Acked-by: Huang Rui <ray.huang@amd.com>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: Wyes Karny <wyes.karny@amd.com>
Signed-off-by: Perry Yuan <Perry.Yuan@amd.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
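
A command-line sketch: `disable` is named in the commit, while the `passive` spelling for the new mode is an assumption based on the mode's name:

    # Run amd-pstate in the new passive working mode
    amd_pstate=passive
    # ...or keep the driver from loading at all
    amd_pstate=disable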

- 11 October 2022, 1 commit

Submitted by Juergen Gross
Instead of always doing the safe variants for reading and writing MSRs in Xen PV guests, make the behavior controllable via a Kconfig option and a boot parameter. The default is the current behavior, which is to always use the safe variants.

Signed-off-by: Juergen Gross <jgross@suse.com>
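
A sketch of the opt-out, assuming the boot parameter is spelled `xen_msr_safe` (the commit text does not name it, so treat the spelling as an assumption):

    # Hypothetical PV guest command line: use the non-safe (faulting) MSR accessors
    xen_msr_safe=0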

- 4 October 2022, 1 commit

Submitted by Johannes Weiner
The swapaccounting= commandline option already does very little today. To close a trivial containment failure case, the swap ownership tracking part of the swap controller has recently become mandatory (see commit 2d1c4980 ("mm: memcontrol: make swap tracking an integral part of memory control") for details), which makes up the majority of the work during swapout, swapin, and the swap slot map.

The only thing left under this flag is the page_counter operations and the visibility of the swap control files in the first place, which are rather meager savings. There also aren't many scenarios, if any, where controlling the memory of a cgroup while allowing it unlimited access to a global swap space is a workable resource isolation strategy.

On the other hand, there have been several bugs and confusion around the many possible swap controller states (cgroup1 vs cgroup2 behavior, memory accounting without swap accounting, memcg runtime disabled). This puts the maintenance overhead of retaining the toggle above its practical benefits. Deprecate it.

Link: https://lkml.kernel.org/r/20220926135704.400818-3-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Suggested-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

- 1 October 2022, 1 commit

Submitted by Joe Fradley
This patch adds the kunit.enable module parameter that will need to be set to true, in addition to KUNIT being enabled, for KUnit tests to run. The default value is true, giving backwards compatibility. However, for the production+testing use case, the new config option KUNIT_DEFAULT_ENABLED can be set to N, requiring the tester to opt in by passing kunit.enable=1 to the kernel.

Signed-off-by: Joe Fradley <joefradley@google.com>
Reviewed-by: David Gow <davidgow@google.com>
Signed-off-by: Shuah Khan <skhan@linuxfoundation.org>
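
For a kernel built with CONFIG_KUNIT_DEFAULT_ENABLED=n, the opt-in named in the commit looks like:

    # Append to the kernel command line to allow KUnit tests to run
    kunit.enable=1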

- 26 September 2022, 1 commit

Submitted by Christophe Leroy
CONFIG_PPC_FSL_BOOK3E is redundant with CONFIG_PPC_E500. Rename it so that CONFIG_PPC_FSL_BOOK3E can be removed later.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/d3d42b395c09e66b0705fda1e51779f33e13ac38.1663606876.git.christophe.leroy@csgroup.eu

- 16 September 2022, 1 commit

Submitted by Kefeng Wang
As of commit 559089e0 ("vmalloc: replace VM_NO_HUGE_VMAP with VM_ALLOW_HUGE_VMAP"), the use of hugepage mappings for vmalloc is an opt-in strategy, so it is safe to support huge vmalloc mappings on arm64. For now, it is used in kvmalloc() and alloc_large_system_hash().

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Link: https://lore.kernel.org/r/20220911044423.139229-1-wangkefeng.wang@huawei.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

- 12 September 2022, 1 commit

Submitted by Li Zhe
In commit 2f1ee091 ("Revert "mm: use early_pfn_to_nid in page_ext_init""), we call page_ext_init() after page_alloc_init_late() to avoid a panic. As a result, we cannot track early page allocations in the current kernel even if the page structures have been initialized early.

This patch introduces a new boot parameter 'early_page_ext' to resolve this problem. If we pass it to the kernel, page_ext_init() will be moved up and the 'deferred initialization of struct pages' feature will be disabled, so that the page allocator is initialized early and the panic above is prevented. This helps catch early page allocations, which is useful especially when the free memory value is not the same right after different boots of the same kernel.

[akpm@linux-foundation.org: fix section issue by removing __meminitdata]
Link: https://lkml.kernel.org/r/20220825102714.669-1-lizhe.67@bytedance.com
Signed-off-by: Li Zhe <lizhe.67@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Mark-PK Tsai <mark-pk.tsai@mediatek.com>
Cc: Masami Hiramatsu (Google) <mhiramat@kernel.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
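
The parameter is a bare flag, named in the commit:

    # Initialize page_ext early so early page allocations can be tracked
    # (disables deferred struct-page initialization, so boot may be slower)
    early_page_ext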

- 10 September 2022, 1 commit

Submitted by Liu Song
In our environment, it was found that the BHB mitigation has a great impact on benchmark performance. For example, in the lmbench test, the "process fork && exit" test performance drops by 20%. So it is necessary to have the ability to turn off the mitigation individually through the cmdline, thus avoiding having to recompile the kernel with an adjusted config.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Link: https://lore.kernel.org/r/1661514050-22263-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
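
A sketch, assuming the new option is spelled `nospectre_bhb` (the message itself does not name it, so treat the spelling as an assumption):

    # Hypothetical arm64 command line: disable only the Spectre-BHB mitigation
    nospectre_bhb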

- 7 September 2022, 1 commit

Submitted by Vasant Hegde
Enhance the amd_iommu command line option to allow specifying a v1 or v2 page table. By default the system will boot in v1 page table mode.

Co-developed-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Link: https://lore.kernel.org/r/20220825063939.8360-10-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
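
A sketch, assuming the option values are spelled `pgtbl_v1`/`pgtbl_v2` (the commit only says "v1 or v2", so the exact spelling is an assumption):

    # Hypothetical: request the v2 page table for the AMD IOMMU
    amd_iommu=pgtbl_v2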

- 5 September 2022, 1 commit

Submitted by Nicholas Piggin
CONFIG_VIRT_CPU_ACCOUNTING_GEN under pseries does not provide stolen time accounting unless CONFIG_PARAVIRT_TIME_ACCOUNTING is enabled. Implement this using the VPA accumulated wait counters.

Note this will not work on current KVM hosts because KVM does not implement the VPA dispatch counters (yet). It could be implemented with the dispatch trace log, as it is for VIRT_CPU_ACCOUNTING_NATIVE, but that is not necessary for the more limited accounting provided by PARAVIRT_TIME_ACCOUNTING, and it is more expensive, complex, and has downsides like potential log wrap.

From Shrikanth:

[...] it was tested on a Power10 [PowerVM] shared LPAR. The system has two LPARs; we will call the first one LPAR1 and the second one LPAR2. The test was carried out with SMT=1. A similar observation was seen with SMT=8 as well.

The LPAR config header from each LPAR is below. LPAR1 is twice as big as LPAR2. Since both share the same underlying hardware, work stealing happens when both LPARs contend for the same resource.

  LPAR1: type=Shared mode=Uncapped smt=Off lcpu=40 cpus=40 ent=20.00
  LPAR2: type=Shared mode=Uncapped smt=Off lcpu=20 cpus=40 ent=10.00

mpstat was used to check the utilization. stress-ng was used as the workload. A few cases were tested. When both LPARs are idle there is no steal time. When LPAR1 starts running at 100%, which consumes all of the physical resource, steal time starts to get accounted. With LPAR1 running at 100% and LPAR2 starting to run, steal time starts increasing, as expected. When the LPAR2 load is increased further, steal time increases further.

  Case 1: 0% LPAR1; 0% LPAR2
  %usr  %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
   0.00  0.00 0.05    0.00 0.00  0.00   0.00   0.00   0.00 99.95

  Case 2: 100% LPAR1; 0% LPAR2
  %usr  %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
  97.68  0.00 0.00    0.00 0.00  0.00   2.32   0.00   0.00  0.00

  Case 3: 100% LPAR1; 50% LPAR2
  %usr  %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
  86.34  0.00 0.10    0.00 0.00  0.03  13.54   0.00   0.00  0.00

  Case 4: 100% LPAR1; 100% LPAR2
  %usr  %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
  78.54  0.00 0.07    0.00 0.00  0.02  21.36   0.00   0.00  0.00

  Case 5: 50% LPAR1; 100% LPAR2
  %usr  %nice %sys %iowait %irq %soft %steal %guest %gnice %idle
  49.37  0.00 0.00    0.00 0.00  0.00   1.17   0.00   0.00 49.47

The patch accounts for the steal time and the basic tests hold good.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Tested-by: Shrikanth Hegde <sshegde@linux.ibm.com>
[mpe: Add SPDX tag to new paravirt_api_clock.h]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220902085316.2071519-3-npiggin@gmail.com

- 1 September 2022, 1 commit

Submitted by Daniel Sneddon
The APIC supports two modes: legacy APIC (or xAPIC) and Extended APIC (or x2APIC). x2APIC mode is mostly compatible with legacy APIC, but it disables the memory-mapped APIC interface in favor of one that uses MSRs. The APIC mode is controlled by the EXT bit in the APIC MSR.

The MMIO/xAPIC interface has some problems, most notably the APIC LEAK [1]. This bug allows an attacker to use the APIC MMIO interface to extract data from the SGX enclave.

Introduce support for a new feature that will allow the BIOS to lock the APIC in x2APIC mode. If the APIC is locked in x2APIC mode and the kernel tries to disable the APIC or revert to legacy APIC mode, a GP fault will occur. Introduce support for a new MSR (IA32_XAPIC_DISABLE_STATUS) and handle the new locked mode when the LEGACY_XAPIC_DISABLED bit is set, by preventing the kernel from trying to disable the x2APIC.

On platforms with the IA32_XAPIC_DISABLE_STATUS MSR, if SGX or TDX are enabled, LEGACY_XAPIC_DISABLED will be set by the BIOS. If legacy APIC is required, then SGX and TDX need to be disabled in the BIOS.

[1]: https://aepicleak.com/aepicleak.pdf

Signed-off-by: Daniel Sneddon <daniel.sneddon@linux.intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Tested-by: Neelima Krishnan <neelima.krishnan@intel.com>
Link: https://lkml.kernel.org/r/20220816231943.1152579-1-daniel.sneddon@linux.intel.com

- 23 August 2022, 1 commit

Submitted by Mark Rutland
On arm64, "rodata=full" has been supported (but not documented) since commit c55191e9 ("arm64: mm: apply r/o permissions of VM areas to its linear alias as well"). As it's necessary to determine the rodata configuration early during boot, arm64 has an early_param() handler for this, whereas init/main.c has a __setup() handler which is run later.

Unfortunately, this split meant that since commit f9a40b08 ("init/main.c: return 1 from handled __setup() functions"), passing "rodata=full" would result in a spurious warning from the __setup() handler (though RO permissions would be configured appropriately).

Further, "rodata=full" has been broken since commit 0d6ea3ac ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()"), which caused strtobool() to parse "full" as false (in addition to many other values not documented for the "rodata=" kernel parameter).

This patch fixes this breakage by:

* Moving the core parameter parser to an __early_param(), such that it is available early.
* Adding an (optional) arch hook which arm64 can use to parse "full".
* Updating the documentation to mention that "full" is valid for arm64.
* Having the core parameter parser handle "on" and "off" explicitly, such that any undocumented values (e.g. typos such as "ful") are reported as errors rather than being silently accepted.

Note that __setup() and early_param() have opposite conventions for their return values: __setup() uses 1 to indicate a parameter was handled, whereas early_param() uses 0.

Fixes: f9a40b08 ("init/main.c: return 1 from handled __setup() functions")
Fixes: 0d6ea3ac ("lib/kstrtox.c: add "false"/"true" support to kstrtobool()")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Ard Biesheuvel <ardb@kernel.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Jagdish Gediya <jvgediya@linux.ibm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Will Deacon <will@kernel.org>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Link: https://lore.kernel.org/r/20220817154022.3974645-1-mark.rutland@arm.com
Signed-off-by: Will Deacon <will@kernel.org>
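
The values accepted after this change, per the commit:

    rodata=on     # generic: mark rodata read-only
    rodata=off    # generic: leave rodata writable
    rodata=full   # arm64 only: also apply r/o permissions to the linear alias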

- 22 August 2022, 1 commit

Submitted by Stephen Hemminger
DECnet is an obsolete network protocol that receives more attention from kernel janitors than users. It belongs in a computer protocol history museum, not in the Linux kernel. It has been "Orphaned" in the kernel since 2010. The iproute2 support for DECnet was dropped in the 5.0 release, and the documentation link on SourceForge says it is abandoned there as well.

Leave the UAPI alone to keep userspace programs compiling. This means that there is still an empty neighbour table for AF_DECNET. The table of /proc/sys/net entries was updated to match current directories and reformatted to be alphabetical.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Acked-by: David Ahern <dsahern@kernel.org>
Acked-by: Nikolay Aleksandrov <razor@blackwall.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

- 9 August 2022, 2 commits

Submitted by Muchun Song
It is inconvenient to mention the feature of optimizing vmemmap pages associated with HugeTLB pages when communicating with others, since there has been no specific or abbreviated name for it since it was first introduced. Let us give it the name HVO (HugeTLB Vmemmap Optimization) from now on.

This commit also updates the documentation for "hugetlb_free_vmemmap" in the way discussed in thread [1].

Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1]
Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Will Deacon <will@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
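
The boot-time switch the renamed feature hangs off is the one named in the commit:

    # Enable HVO (HugeTLB Vmemmap Optimization) at boot
    hugetlb_free_vmemmap=on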

Submitted by Kim Phillips
AMD's "Technical Guidance for Mitigating Branch Type Confusion, Rev. 1.0 2022-07-12" whitepaper, under section 6.1.2 "IBPB On Privileged Mode Entry / SMT Safety" says: Similar to the Jmp2Ret mitigation, if the code on the sibling thread cannot be trusted, software should set STIBP to 1 or disable SMT to ensure SMT safety when using this mitigation. So, like already being done for retbleed=unret, and now also for retbleed=ibpb, force STIBP on machines that have it, and report its SMT vulnerability status accordingly. [ bp: Remove the "we" and remove "[AMD]" applicability parameter which doesn't work here. ] Fixes: 3ebc1700 ("x86/bugs: Add retbleed=ibpb") Signed-off-by: NKim Phillips <kim.phillips@amd.com> Signed-off-by: NBorislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org # 5.10, 5.15, 5.19 Link: https://bugzilla.kernel.org/show_bug.cgi?id=206537 Link: https://lore.kernel.org/r/20220804192201.439596-1-kim.phillips@amd.com

- 30 July 2022, 1 commit

Submitted by Eiichi Tsukata
Update the descriptions for "mitigations=off" and "mitigations=auto,nosmt" with the respective retbleed= settings.

Signed-off-by: Eiichi Tsukata <eiichi.tsukata@nutanix.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: corbet@lwn.net
Link: https://lore.kernel.org/r/20220728043907.165688-1-eiichi.tsukata@nutanix.com
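
The two umbrella settings whose descriptions are being updated:

    # Disable all optional CPU mitigations, which implies retbleed=off
    mitigations=off
    # Mitigate, and additionally disable SMT where a mitigation needs it
    mitigations=auto,nosmt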

- 22 July 2022, 1 commit

Submitted by Tianyu Lan
- Fix the used field of struct io_tlb_area, which wasn't initialized
- Set the area number to 0 if the input area number parameter is 0
- Use array_size() to calculate the io_tlb_area array size
- Make the parameters of swiotlb_do_find_slots() more reasonable

Fixes: 26ffb91fa5e0 ("swiotlb: split up the global swiotlb lock")
Signed-off-by: Tianyu Lan <tiala@microsoft.com>
Reviewed-by: Michael Kelley <mikelley@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>

- 20 July 2022, 2 commits

Submitted by Joel Fernandes
Systems built with CONFIG_RCU_NOCB_CPU=y but booted without either the rcu_nocbs= or rcu_nohz_full= kernel-boot parameters will not have callback offloading on any of the CPUs, nor can any of the CPUs be switched to enable callback offloading at runtime. Although this is intentional, it would be nice to have a way to offload all the CPUs without having to make random bootloaders specify either the rcu_nocbs= or the rcu_nohz_full= kernel-boot parameters.

This commit therefore provides a new CONFIG_RCU_NOCB_CPU_DEFAULT_ALL Kconfig option that switches the default so as to offload callback processing on all of the CPUs. This default can still be overridden using the rcu_nocbs= and rcu_nohz_full= kernel-boot parameters.

(In v4.1, fixed issues with CONFIG maze reported by kernel test robot.)

Reviewed-by: Kalesh Singh <kaleshsingh@google.com>
Reviewed-by: Uladzislau Rezki <urezki@gmail.com>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Joel Fernandes <joel@joelfernandes.org>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
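
Overriding the new all-CPUs default still works the usual way:

    # Offload RCU callbacks only for CPUs 0-3, even with
    # CONFIG_RCU_NOCB_CPU_DEFAULT_ALL=y
    rcu_nocbs=0-3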

Submitted by Neeraj Upadhyay
The purpose of commit 282d8998 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU") was to prevent a long series of never-blocking expedited SRCU grace periods from blocking kernel-live-patching (KLP) progress. Although it was successful, it also resulted in excessive boot times on certain embedded workloads running under qemu with the "-bios QEMU_EFI.fd" command line. Here "excessive" means increasing the boot time up into the three-to-four minute range. This increase in boot time was due to the more than 6000 back-to-back invocations of synchronize_rcu_expedited() within the KVM host OS, which in turn resulted from qemu's emulation of a long series of MMIO accesses. Commit 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods") did not significantly help this particular use case.

Zhangfei Gao and Shameerali Kolothum Thodi did experiments varying the value of SRCU_MAX_NODELAY_PHASE with HZ=250 and with various values of non-sleeping per-phase counts on a system with preemption enabled, and observed the following boot times:

  +──────────────────────────+────────────────+
  | SRCU_MAX_NODELAY_PHASE   | Boot time (s)  |
  +──────────────────────────+────────────────+
  | 100                      | 30.053         |
  | 150                      | 25.151         |
  | 200                      | 20.704         |
  | 250                      | 15.748         |
  | 500                      | 11.401         |
  | 1000                     | 11.443         |
  | 10000                    | 11.258         |
  | 1000000                  | 11.154         |
  +──────────────────────────+────────────────+

Analysis of the experiment results shows additional improvements with CPU-bound delays approaching one jiffy in duration. This improvement was also seen when the number of per-phase iterations was scaled to one jiffy. This commit therefore scales the per-grace-period-phase number of non-sleeping polls so that non-sleeping polls extend for about one jiffy.

In addition, the delay-calculation call to srcu_get_delay() in srcu_gp_end() is replaced with a simple check for an expedited grace period. This change schedules callback invocation immediately after expedited grace periods complete, which results in greatly improved boot times. Testing done by Marc and Zhangfei confirms that this change recovers most of the performance degradation in boot time; for the CONFIG_HZ_250 configuration, specifically, boot times improve from 3m50s to 41s on Marc's setup, and from 2m40s to ~9.7s on Zhangfei's setup.

In addition to the changes to the default per-phase delays, this change adds 3 new kernel parameters - srcutree.srcu_max_nodelay, srcutree.srcu_max_nodelay_phase, and srcutree.srcu_retry_check_delay. These allow users to configure the SRCU grace-period scanning delays in order to react more quickly to additional use cases.

Fixes: 640a7d37c3f4 ("srcu: Block less aggressively for expedited grace periods")
Fixes: 282d8998 ("srcu: Prevent expedited GPs and blocking readers from consuming CPU")
Reported-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Reported-by: yueluck <yueluck@163.com>
Signed-off-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
Tested-by: Marc Zyngier <maz@kernel.org>
Tested-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Link: https://lore.kernel.org/all/20615615-0013-5adc-584f-2b1d5c03ebfc@linaro.org/
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
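
The three new knobs, with an illustrative value mirroring the largest entry in the experiment table above (useful ranges depend on HZ and the workload):

    # Illustrative tuning of the SRCU grace-period scanning delays
    srcutree.srcu_max_nodelay_phase=1000000
    # Related knobs added by the same commit:
    #   srcutree.srcu_max_nodelay, srcutree.srcu_retry_check_delay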

- 18 July 2022, 2 commits

Submitted by Jason A. Donenfeld
The decision of whether or not to trust RDRAND is controlled by the "random.trust_cpu" boot time parameter or the CONFIG_RANDOM_TRUST_CPU compile time default. The "nordrand" flag was added during the early days of RDRAND, when there were worries that merely using its values could compromise the RNG. However, these days RDRAND values are not used directly but always go through the RNG's hash function, making "nordrand" no longer useful.

Rather, the correct switch is "random.trust_cpu", which not only handles the relevant trust issue directly, but also is general to multiple CPU types, not just x86.

However, x86 RDRAND does have a history of being occasionally problematic. Previously, when the kernel noticed something strange, it would warn in dmesg and suggest enabling "nordrand". We can improve on that by making the test a little bit better and then taking the step of automatically disabling RDRAND if we detect it's problematic. Also disable RDSEED if the RDRAND test fails.

Cc: x86@kernel.org
Cc: Theodore Ts'o <tytso@mit.edu>
Suggested-by: H. Peter Anvin <hpa@zytor.com>
Suggested-by: Borislav Petkov <bp@suse.de>
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
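
The surviving, architecture-neutral switch named in the commit:

    # Don't credit entropy from CPU RNG instructions (replaces x86's "nordrand")
    random.trust_cpu=0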

Submitted by Dan Moulding
The gethostname system call returns the hostname for the current machine. However, the kernel has no mechanism to initially set the current machine's name in such a way as to guarantee that the first userspace process to call gethostname will receive a meaningful result. It relies on some unspecified userspace process to first call sethostname before gethostname can produce a meaningful name.

Traditionally the machine's hostname is set from userspace by the init system. The init system, in turn, often relies on a configuration file (say, /etc/hostname) to provide the value that it will supply in the call to sethostname. Consequently, the file system containing /etc/hostname usually must be available before the hostname will be set. There may, however, be earlier userspace processes that could call gethostname before the file system containing /etc/hostname is mounted. Such a process will get some other, likely meaningless, name from gethostname (such as "(none)", "localhost", or "darkstar").

A real-world example where this can happen, and lead to undesirable results, is with mdadm. When assembling arrays, mdadm distinguishes between "local" arrays and "foreign" arrays. A local array is one that properly belongs to the current machine, and a foreign array is one that is (possibly temporarily) attached to the current machine, but properly belongs to some other machine. To determine if an array is local or foreign, mdadm may compare the "homehost" recorded on the array with the current hostname. If mdadm is run before the root file system is mounted, perhaps because the root file system itself resides on an md-raid array, then /etc/hostname isn't yet available and the init system will not yet have called sethostname, causing mdadm to incorrectly conclude that all of the local arrays are foreign.

Solving this problem *could* be delegated to the init system. It could be left up to the init system (including any init system that starts within an initramfs, if one is in use) to ensure that sethostname is called before any other userspace process could possibly call gethostname. However, it may not always be obvious which processes could call gethostname (for example, udev itself might not call gethostname, but it could via udev rules invoke processes that do). Additionally, the init system has to ensure that the hostname configuration value is stored in some place where it will be readily accessible during early boot. Unfortunately, every init system will attempt to (or has already attempted to) solve this problem in a different, possibly incorrect, way. This makes getting consistently working configurations harder for users.

I believe it is better for the kernel to provide the means by which the hostname may be set early, rather than making this a problem for the init system to solve. The option to set the hostname during early startup, via a kernel parameter, provides a simple, reliable way to solve this problem. It also could make system configuration easier for some embedded systems.

[dmoulding@me.com: v2]
Link: https://lkml.kernel.org/r/20220506060310.7495-2-dmoulding@me.com
Link: https://lkml.kernel.org/r/20220505180651.22849-2-dmoulding@me.com
Signed-off-by: Dan Moulding <dmoulding@me.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
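
A sketch, assuming the new parameter is spelled `hostname=` (the message describes it without spelling it out):

    # Hypothetical: name the machine before any userspace process can call
    # gethostname(), e.g. so mdadm in the initramfs sees the right homehost
    hostname=myhost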

- 13 July 2022, 1 commit

Submitted by Tianyu Lan
Traditionally swiotlb was not performance critical because it was only used for slow devices. But in some setups, like TDX/SEV confidential guests, all IO has to go through swiotlb. Currently swiotlb only has a single lock. Under high IO load with multiple CPUs this can lead to significant lock contention on the swiotlb lock.

This patch splits the swiotlb bounce buffer pool into individual areas which have their own lock. Each CPU tries to allocate in its own area first. Only if that fails does it search other areas. On freeing, the allocation is freed into the area where the memory was originally allocated from.

The area number can be set via the swiotlb kernel parameter and defaults to the number of possible CPUs. If the possible CPU count is not a power of 2, the area number is rounded up to the next power of 2.

This idea is from Andi Kleen's patch (https://github.com/intel/tdx/commit/4529b5784c141782c72ec9bd9a92df2b68cb7d45).

Based-on-idea-by: Andi Kleen <ak@linux.intel.com>
Signed-off-by: Tianyu Lan <Tianyu.Lan@microsoft.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
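
A sketch, assuming the extended syntax is `swiotlb=<slots>,<areas>` (the exact format is an assumption; the commit only says the area count is set via the swiotlb parameter):

    # Hypothetical: 65536 bounce-buffer slots split into 8 independently
    # locked areas
    swiotlb=65536,8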

- 12 July 2022, 1 commit

Submitted by Saravana Kannan
Add a module.async_probe kernel command line option that allows enabling async probing for all modules. When this command line option is used, there might still be some modules for which we want to explicitly force synchronous probing, so extend <modulename>.async_probe to take an optional bool input so that async probing can be disabled for a specific module.

Signed-off-by: Saravana Kannan <saravanak@google.com>
Reviewed-by: Aaron Tomlin <atomlin@redhat.com>
Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
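
Combining the global switch with a per-module opt-out ("mydriver" is a placeholder module name):

    # Probe all modules asynchronously, but keep one driver synchronous
    module.async_probe=true mydriver.async_probe=false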

- 8 July 2022, 1 commit

Submitted by Mauro Carvalho Chehab
Changeset daec8d40 ("Documentation: KVM: add separate directories for architecture-specific documentation") renamed Documentation/virt/kvm/amd-memory-encryption.rst to Documentation/virt/kvm/x86/amd-memory-encryption.rst. Update the cross-references accordingly.

Fixes: daec8d40 ("Documentation: KVM: add separate directories for architecture-specific documentation")
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Link: https://lore.kernel.org/r/fd80db889e34aae87a4ca88cad94f650723668f4.1656234456.git.mchehab@kernel.org
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

- 7 July 2022, 1 commit

Submitted by Suravee Suthikulpanit
By default, the PCI segment is zero and can be omitted. To support systems with a non-zero PCI segment ID, modify the parsing functions to allow a PCI segment ID.

Co-developed-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Vasant Hegde <vasant.hegde@amd.com>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Link: https://lore.kernel.org/r/20220706113825.25582-33-vasant.hegde@amd.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
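
A sketch of the extended ivrs_* syntax, assuming an `<id>@<segment>:<bus>:<dev>.<fn>` form (the exact format is an assumption; this message does not spell it out):

    # Hypothetical: map IOAPIC ID 10 to a device behind PCI segment 0001
    ivrs_ioapic=10@0001:00:14.0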

- 4 July 2022, 1 commit

Submitted by Muchun Song
For now, the feature of hugetlb_free_vmemmap is not compatible with the feature of memory_hotplug.memmap_on_memory, and hugetlb_free_vmemmap takes precedence over memory_hotplug.memmap_on_memory. However, someone wants to make memory_hotplug.memmap_on_memory take precedence over hugetlb_free_vmemmap, since memmap_on_memory makes it more likely to succeed memory hotplug in close-to-OOM situations. So the decision of making hugetlb_free_vmemmap take precedence is neither wise nor elegant.

The proper approach is to have hugetlb_vmemmap.c check whether the section which the HugeTLB pages belong to can be optimized. If the section's vmemmap pages are allocated from the added memory block itself, hugetlb_free_vmemmap should refuse to optimize the vmemmap; otherwise, do the optimization. Then both kernel parameters are compatible. So this patch introduces VmemmapSelfHosted to mask any non-optimizable vmemmap pages. The hugetlb_vmemmap code can use this flag to detect whether a vmemmap page can be optimized.

[songmuchun@bytedance.com: walk vmemmap page tables to avoid false-positive]
Link: https://lkml.kernel.org/r/20220620110616.12056-3-songmuchun@bytedance.com
Link: https://lkml.kernel.org/r/20220617135650.74901-3-songmuchun@bytedance.com
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Co-developed-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Xiongchun Duan <duanxiongchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
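
With this change the two parameters can be combined (the boolean spelling for memmap_on_memory is the usual module-param form, an assumption here):

    # Both optimizations enabled together
    hugetlb_free_vmemmap=on memory_hotplug.memmap_on_memory=1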

- 1 July 2022, 2 commits

Submitted by Marc Zyngier
In order to be able to completely disable SVE even if the HW seems to support it (most likely because the FW is broken), move the SVE setup into the EL2 finalisation block, and use a new idreg override to deal with it. Note that we also nuke id_aa64zfr0_el1 as a byproduct, and that SME also gets disabled, due to the dependency between the two features.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220630160500.1536744-9-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>

Submitted by Marc Zyngier
In order to be able to completely disable SME even if the HW seems to support it (most likely because the FW is broken), move the SME setup into the EL2 finalisation block, and use a new idreg override to deal with it. Note that we also nuke id_aa64smfr0_el1 as a byproduct.

Signed-off-by: Marc Zyngier <maz@kernel.org>
Reviewed-by: Mark Brown <broonie@kernel.org>
Link: https://lore.kernel.org/r/20220630160500.1536744-8-maz@kernel.org
Signed-off-by: Will Deacon <will@kernel.org>
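
The overrides from these two commits are exposed on the command line; assuming they are spelled `arm64.nosve` and `arm64.nosme` (the messages only say "a new idreg override", so the spellings are an assumption):

    # Hypothetical: force SVE and SME off even if the hardware advertises them
    arm64.nosve arm64.nosme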

- 29 June 2022, 2 commits

Submitted by Christophe Leroy
Mapping without large TLBs has no added value on the 8xx. Mapping without large TLBs is still necessary on 40x when selecting CONFIG_KFENCE, CONFIG_DEBUG_PAGEALLOC, or CONFIG_STRICT_KERNEL_RWX, but this is done automatically and doesn't require user selection. Remove the 'noltlbs' kernel parameter; the user has no reason to use it.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/80ca17bd39cf608a8ebd0764d7064a498e131199.1655202721.git.christophe.leroy@csgroup.eu

Submitted by Christophe Leroy
Mapping without BATs doesn't bring any added value to the user. Remove that option.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/6977314c823cfb728bc0273cea634b41807bfb64.1655202721.git.christophe.leroy@csgroup.eu

- 28 June 2022, 2 commits

Submitted by Liu Song
If KASLR is enabled, then kpti will be forced on even with "mitigations=off", so we need to adjust the description of this parameter.

Signed-off-by: Liu Song <liusong@linux.alibaba.com>
Link: https://lore.kernel.org/r/1656033648-84181-1-git-send-email-liusong@linux.alibaba.com
Signed-off-by: Will Deacon <will@kernel.org>
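
The parameter whose description is being adjusted, as I recall the arm64 option (treat the value spelling as an assumption):

    # kpti=0 force-disables kernel page-table isolation, kpti=1 forces it on;
    # with KASLR enabled, kpti stays on regardless
    kpti=0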

Submitted by Mike Rapoport
Rename Documentation/vm to Documentation/mm so it will be consistent with the mm code directory and with Documentation/admin-guide/mm, and won't be confused with virtual machines.

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Suggested-by: Matthew Wilcox <willy@infradead.org>
Tested-by: Ira Weiny <ira.weiny@intel.com>
Acked-by: Jonathan Corbet <corbet@lwn.net>
Acked-by: Wu XiangCheng <bobwxc@email.cn>

- 27 June 2022, 4 commits

Submitted by Peter Zijlstra
jmp2ret mitigates the easy-to-attack case at relatively low overhead. It mitigates the long speculation windows after a mispredicted RET, but it does not mitigate the short speculation window from arbitrary instruction boundaries. On Zen2, there is a chicken bit which needs setting, which mitigates "arbitrary instruction boundaries" down to just "basic block boundaries". But there is no fix for the short speculation window on basic block boundaries, other than to flush the entire BTB to evict all attacker predictions.

On the spectrum of "fast & blurry" -> "safe", there is (on top of STIBP or no-SMT):

  1) Nothing              System wide open
  2) jmp2ret              May stop a script kiddy
  3) jmp2ret+chickenbit   Raises the bar rather further
  4) IBPB                 Only thing which can count as "safe"

Tentative numbers put IBPB-on-entry at a 2.5x hit on Zen2, and a 10x hit on Zen1 according to lmbench.

[ bp: Fixup feature bit comments, document option, 32-bit build fix. ]

Suggested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>

Submitted by Pawan Gupta
Extend the spectre_v2= boot option with Kernel IBRS.

[jpoimboe: no STIBP with IBRS]

Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@kernel.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
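
The new value this adds (spelling inferred from the option name):

    # Select kernel-entry IBRS as the Spectre v2 mitigation
    spectre_v2=ibrs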

Submitted by Kim Phillips
For untrained return thunks to be fully effective, STIBP must be enabled or SMT disabled.

Co-developed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Kim Phillips <kim.phillips@amd.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Borislav Petkov <bp@suse.de>

Submitted by Alexandre Chartre
Add the "retbleed=<value>" boot parameter to select a mitigation for RETBleed. Possible values are "off", "auto" and "unret" (JMP2RET mitigation). The default value is "auto". Currently, "retbleed=auto" will select the unret mitigation on AMD and Hygon and no mitigation on Intel (JMP2RET is not effective on Intel). [peterz: rebase; add hygon] [jpoimboe: cleanups] Signed-off-by: NAlexandre Chartre <alexandre.chartre@oracle.com> Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NBorislav Petkov <bp@suse.de> Reviewed-by: NJosh Poimboeuf <jpoimboe@kernel.org> Signed-off-by: NBorislav Petkov <bp@suse.de>

- 24 June 2022, 1 commit

Submitted by David Matlack
Add support for Eager Page Splitting of pages that are mapped by nested MMUs. Walk through the rmap, first splitting all 1GiB pages to 2MiB pages, and then splitting all 2MiB pages to 4KiB pages.

Note, Eager Page Splitting is limited to nested MMUs as a policy rather than due to any technical reason (the sp->role.guest_mode check could just be deleted and Eager Page Splitting would work correctly for all shadow MMU pages). There is really no reason to support Eager Page Splitting for tdp_mmu=N, since such support will eventually be phased out, and there is no current use case supporting Eager Page Splitting on hosts where TDP is either disabled or unavailable in hardware. Furthermore, future improvements to nested MMU scalability may diverge the code from the legacy shadow paging implementation. These improvements will be simpler to make if Eager Page Splitting does not have to worry about legacy shadow paging.

Splitting huge pages mapped by nested MMUs requires dealing with some extra complexity beyond that of the TDP MMU:

(1) The shadow MMU has a limit on the number of shadow pages that are allowed to be allocated. So, as a policy, Eager Page Splitting refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer pages available.

(2) Splitting a huge page may end up re-using an existing lower level shadow page table. This is unlike the TDP MMU, which always allocates new shadow page tables when splitting.

(3) When installing the lower level SPTEs, they must be added to the rmap, which may require allocating additional pte_list_desc structs.

Case (2) is especially interesting since it may require a TLB flush, unlike the TDP MMU which can fully split huge pages without any TLB flushes. Specifically, an existing lower level page table may point to even lower level page tables that are not fully populated, effectively unmapping a portion of the huge page, which requires a flush. As of this commit, a flush is always done after dropping the huge page and before installing the lower level page table.

This TLB flush could instead be delayed until the MMU lock is about to be dropped, which would batch flushes for multiple splits. However, these flushes should be rare in practice (a huge page must be aliased in multiple SPTEs and have been split for NX Huge Pages in only some of them). Flushing immediately is simpler to plumb and also reduces the chances of tripping over a CPU bug (e.g. see iTLB multihit).

[ This commit is based off of the original implementation of Eager Page Splitting from Peter in Google's kernel from 2016. ]

Suggested-by: Peter Feiner <pfeiner@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
Message-Id: <20220516232138.1783324-23-dmatlack@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
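
A sketch, assuming the feature continues to hang off KVM's existing eager_page_split module parameter (not named in this message, so treat the knob as an assumption):

    # Hypothetical: turn eager page splitting off entirely
    kvm.eager_page_split=N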

- 22 June 2022, 2 commits

Submitted by Paul E. McKenney
This commit provides documentation for the kernel parameter controlling RCU's handling of callback floods on offloaded (rcu_nocbs) CPUs. This parameter might be obscure, but it is always there when you need it.

Reported-by: Frederic Weisbecker <frederic@kernel.org>
Reported-by: Uladzislau Rezki <urezki@gmail.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
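
A sketch, assuming the parameter in question is `rcutree.nocb_nobypass_lim_per_jiffy` (the message does not name it, so treat the spelling as an assumption):

    # Hypothetical: per-jiffy enqueue-rate limit above which callbacks from a
    # flooding CPU go straight to the bypass list
    rcutree.nocb_nobypass_lim_per_jiffy=16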

Submitted by Paul E. McKenney
This commit adds kernel-parameters.txt documentation for the rcutree.rcu_divisor kernel boot parameter, which controls the softirq callback-invocation batch limit.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Neeraj Upadhyay <quic_neeraju@quicinc.com>
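
A sketch, assuming rcu_divisor works as a power-of-two shift of the callback-queue length (an assumption; the message only says it controls the batch limit):

    # Hypothetical: invoke at most queue_len >> 6 callbacks per batch
    rcutree.rcu_divisor=6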

- 14 June 2022, 1 commit

Submitted by Randy Dunlap
Provide the full kernel boot option strings (with the ending '=' sign). They won't work without it, and that is how other boot options are listed. If used without an '=' sign (as previously listed), they cause an "Unknown parameters" message and are added to init's argument strings, polluting them:

  Unknown kernel command line parameters "enforcing checkreqprot
  BOOT_IMAGE=/boot/bzImage-517rc6", will be passed to user space.
  Run /sbin/init as init process
    with arguments: /sbin/init enforcing checkreqprot
    with environment: HOME=/ TERM=linux BOOT_IMAGE=/boot/bzImage-517rc6

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Stephen Smalley <stephen.smalley.work@gmail.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: selinux@vger.kernel.org
Cc: Jonathan Corbet <corbet@lwn.net>
[PM: removed bogus 'Fixes' line]
Signed-off-by: Paul Moore <paul@paul-moore.com>
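
The corrected forms, with the '=' sign (the example values are illustrative):

    # SELinux boot options take a value, e.g.:
    enforcing=0 checkreqprot=0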