Commits · 8c9a134cae6f8d66f35321b07baa202f75ef2199 · openanolis / cloud-kernel

23 Aug, 2018 1 commit

mm: clarify CONFIG_PAGE_POISONING and usage · 8c9a134c

Authored by Kees Cook on Aug 21, 2018

The Kconfig text for CONFIG_PAGE_POISONING doesn't mention that it has to
be enabled explicitly. This updates the documentation for that and adds a
note about CONFIG_PAGE_POISONING to the "page_poison" command line docs.
While here, change description of CONFIG_PAGE_POISONING_ZERO too, as it's
not "random" data, but rather the fixed debugging value that would be used
when not zeroing. Additionally removes a stray "bool" in the Kconfig.

Link: http://lkml.kernel.org/r/20180725223832.GA43733@beast
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

8c9a134c

18 Aug, 2018 2 commits

tools/vm/page-types.c: add support for idle page tracking · 59ae96ff

Authored by Christian Hansen on Aug 17, 2018

Add a flag which causes page-types to use the kernel's idle page
tracking to mark pages idle.  As the tool already prints the idle flag
if set, subsequent runs will show which pages have been accessed since
last run.
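
As an illustration of the idle page tracking interface this flag relies on: the kernel exposes a bitmap at /sys/kernel/mm/page_idle/bitmap in which each 8-byte word covers 64 consecutive PFNs, in native byte order. The sketch below only computes where a PFN's bit lives; the helper names are hypothetical and this is not the code in page-types.c.

```python
import struct

BITS_PER_WORD = 64   # the idle bitmap is an array of 64-bit words
WORD_BYTES = 8       # so each word occupies 8 bytes in the file


def idle_bit_location(pfn):
    """Return (byte offset of the word, bit mask inside the word) for a PFN."""
    word_index, bit = divmod(pfn, BITS_PER_WORD)
    return word_index * WORD_BYTES, 1 << bit


def encode_mark_idle(pfn):
    """Build the 8-byte word that would be written at that offset
    to mark a single PFN idle (native byte order, unsigned 64-bit)."""
    _, mask = idle_bit_location(pfn)
    return struct.pack("=Q", mask)
```

Writing the packed word at the returned offset (as root) would mark the page idle; reading the same word back later and testing the mask tells whether the page has been accessed in between.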

[akpm@linux-foundation.org: simplify mark_page_idle()]
[chansen3@cisco.com: reorganize mark_page_idle() logic, add docs]
  Link: http://lkml.kernel.org/r/20180706172237.21691-1-chansen3@cisco.com
Link: http://lkml.kernel.org/r/20180612153223.13174-1-chansen3@cisco.com
Signed-off-by: Christian Hansen <chansen3@cisco.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

59ae96ff

tools/vm/page-types.c: include shared map counts · 7f1d23e6

Authored by Christian Hansen on Aug 17, 2018

Add a new flag that will read kpagecount for each PFN and print out the
number of times the page is mapped along with the flags in the listing
view.

This information is useful in understanding and optimizing memory usage.
Identifying pages which are not shared allows us to focus on adjusting
the memory layout or access patterns for the sole owning process.
Knowing the number of processes that share a page tells us how many
other times we must make the same adjustments or how many processes to
potentially disable.

Truncated sample output:

  voffset map-cnt offset  len     flags
  561a3591e       1       15fe8   1       ___U_lA____Ma_b___________________________
  561a3591f       1       2b103   1       ___U_lA____Ma_b___________________________
  561a36ca4       1       2cc78   1       ___U_lA____Ma_b___________________________
  7f588bb4e       14      2273c   1       __RU_lA____M______________________________
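
A minimal sketch of how the listing above could be post-processed to separate shared from unshared pages. It assumes the five whitespace-separated columns shown in the sample, with voffset, offset and len in hexadecimal and map-cnt in decimal (the number bases for map-cnt and len are assumptions, since the sample does not disambiguate them). The names PageRange and parse_listing are hypothetical.

```python
from typing import NamedTuple

SAMPLE = """\
voffset map-cnt offset  len     flags
561a3591e       1       15fe8   1       ___U_lA____Ma_b___________________________
561a3591f       1       2b103   1       ___U_lA____Ma_b___________________________
561a36ca4       1       2cc78   1       ___U_lA____Ma_b___________________________
7f588bb4e       14      2273c   1       __RU_lA____M______________________________
"""


class PageRange(NamedTuple):
    voffset: int   # virtual page offset
    map_cnt: int   # number of times the page is mapped (from kpagecount)
    offset: int    # physical page offset
    length: int    # run length in pages
    flags: str     # flag string exactly as printed


def parse_listing(text):
    """Parse page-types listing rows, skipping the header and malformed lines."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] == "voffset":
            continue
        voffset, map_cnt, offset, length, flags = fields
        rows.append(PageRange(int(voffset, 16), int(map_cnt),
                              int(offset, 16), int(length, 16), flags))
    return rows


rows = parse_listing(SAMPLE)
shared = [r for r in rows if r.map_cnt > 1]
```

Here `shared` holds only the last row of the sample, the page mapped 14 times; the other three rows are candidates for per-process tuning.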

[akpm@linux-foundation.org: coding-style fixes]
[chansen3@cisco.com: add documentation, tweak whitespace]
  Link: http://lkml.kernel.org/r/20180705181204.5529-1-chansen3@cisco.com
Link: http://lkml.kernel.org/r/20180612153205.12879-1-chansen3@cisco.com
Signed-off-by: Christian Hansen <chansen3@cisco.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7f1d23e6

10 Aug, 2018 3 commits

PCI: Add "pci=disable_acs_redir=" parameter for peer-to-peer support · aaca43fd

Authored by Logan Gunthorpe on Jul 30, 2018

To support peer-to-peer traffic on a segment of the PCI hierarchy, we must
disable the ACS redirect bits for select PCI bridges.  The bridges must be
selected before the devices are discovered by the kernel and the IOMMU
groups created.  Therefore, add a kernel command line parameter to specify
devices which must have their ACS bits disabled.

The new parameter takes a list of devices separated by semicolons.  Each
device specified will have its ACS redirect bits disabled.  This is
similar to the existing 'resource_alignment' parameter.

The ACS P2P Request Redirect, P2P Completion Redirect and P2P
Egress Control bits are disabled, which is sufficient to always allow
passing P2P traffic uninterrupted.  The bits are set after the kernel
(optionally) enables the ACS bits itself.  It is also done regardless of
whether the kernel or platform firmware sets the bits.

If the user tries to disable the ACS redirect for a device without the ACS
capability, print a warning to dmesg.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: reorder to add the generic code first and move the
device-specific quirk to subsequent patches]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>

aaca43fd

PCI: Allow specifying devices using a base bus and path of devfns · 45db3370

Authored by Logan Gunthorpe on Jul 30, 2018

When specifying PCI devices on the kernel command line using a
bus/device/function address, bus numbers can change when adding or
replacing a device, changing motherboard firmware, or applying kernel
parameters like "pci=assign-buses".  When bus numbers change, it's likely
the command line tweak will be applied to the wrong device.

Therefore, it is useful to be able to specify devices with a base bus
number and the path of devfns needed to get to it, similar to the "device
scope" structure in the Intel VT-d spec, Section 8.3.1.

Thus, we add an option to specify devices in the following format:

  [<domain>:]<bus>:<device>.<func>[/<device>.<func>]*

The path can be any segment within the PCI hierarchy of any length and
determined through the use of 'lspci -t'.  When specified this way, it is
less likely that a renumbered bus will result in a valid device
specification and the tweak won't be applied to the wrong device.
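
To make the format concrete, here is a small sketch that splits such a specification into its parts. It is not the kernel's parser: the function name is hypothetical, and it assumes that all numbers are hexadecimal and that an omitted domain means domain 0.

```python
import re

# "[<domain>:]<bus>:<device>.<func>" for the first element of the path
_HEAD = re.compile(
    r"^(?:([0-9a-fA-F]+):)?([0-9a-fA-F]+):([0-9a-fA-F]+)\.([0-7])$")
# "<device>.<func>" for every following element
_HOP = re.compile(r"^([0-9a-fA-F]+)\.([0-7])$")


def parse_pci_path(spec):
    """Return (domain, bus, [(device, func), ...]) for a path-style spec."""
    head, *hops = spec.split("/")
    m = _HEAD.match(head)
    if m is None:
        raise ValueError("bad base address: %r" % head)
    domain = int(m.group(1), 16) if m.group(1) else 0  # assumed default
    bus = int(m.group(2), 16)
    path = [(int(m.group(3), 16), int(m.group(4), 16))]
    for hop in hops:
        h = _HOP.match(hop)
        if h is None:
            raise ValueError("bad path element: %r" % hop)
        path.append((int(h.group(1), 16), int(h.group(2), 16)))
    return domain, bus, path
```

For example, `parse_pci_path("00:1c.0/00.0")` describes the device at devfn 00.0 behind the bridge at 00:1c.0, regardless of which bus number that bridge's secondary bus ends up with.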
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: use "device" instead of "slot" in documentation since that's the
usual language in the PCI specs]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>

45db3370

PCI: Make specifying PCI devices in kernel parameters reusable · 07d8d7e5

Authored by Logan Gunthorpe on Jul 30, 2018

Separate out the code to match a PCI device with a string (typically
originating from a kernel parameter) from the
pci_specified_resource_alignment() function into its own helper function.

While we are at it, this change fixes the kernel style of the function
(fixing a number of long lines and extra parentheses).

Additionally, make the analogous change to the kernel parameter
documentation: Separate the description of how to specify a PCI device
into its own section at the head of the "pci=" parameter.

This patch should have no functional alterations.
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
[bhelgaas: use "device" instead of "slot" in documentation since that's the
usual language in the PCI specs]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Stephen Bates <sbates@raithlin.com>
Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Christian König <christian.koenig@amd.com>

07d8d7e5

07 Aug, 2018 1 commit

Documentation: Add nospectre_v1 parameter · 26cb1f36

Authored by Diana Craciun on Jul 28, 2018

Currently only supported on powerpc.
Signed-off-by: Diana Craciun <diana.craciun@nxp.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>

26cb1f36

05 Aug, 2018 2 commits

KVM: VMX: Tell the nested hypervisor to skip L1D flush on vmentry · 5b76a3cf

Authored by Paolo Bonzini on Aug 05, 2018

When nested virtualization is in use, VMENTER operations from the nested
hypervisor into the nested guest will always be processed by the bare metal
hypervisor, and KVM's "conditional cache flushes" mode in particular does a
flush on nested vmentry.  Therefore, include the "skip L1D flush on
vmentry" bit in KVM's suggested ARCH_CAPABILITIES setting.

Add the relevant Documentation.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

5b76a3cf

Documentation/l1tf: Remove Yonah processors from not vulnerable list · 58331136

Authored by Thomas Gleixner on Aug 05, 2018

Dave reported that it's not confirmed that Yonah processors are
unaffected. Remove them from the list.
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

58331136

02 Aug, 2018 1 commit

block: make iolatency avg_lat exponentially decay · c480bcf9

Authored by Dennis Zhou (Facebook) on Aug 01, 2018

Currently, avg_lat is calculated by accumulating the mean of every
window in a long running cumulative average. As time goes on, the metric
becomes less and less useful due to the accumulated history.

This patch reuses the same calculation done in load averages to make the
avg_lat metric more lively. Unlike load averages, the avg only advances
when a window elapses (due to an io). Idle periods extend the most
recent window. Bucketing is used to limit the history of avg_lat by
binding it to the window size. So, the window range for 1/exp (decay
rate) is [1 min, 2.5 min) when windows elapse immediately.

The current sample window size is exposed in the debug info to enable
calculation of the window range.
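
The idea of an exponentially decaying average that only advances once per elapsed window can be sketched as follows. This is an illustration in floating point, not the kernel's fixed-point load-average code; the function names and the 60-second period used in the usage note are assumptions.

```python
import math


def decay_factor(window_s, period_s):
    """Per-window multiplier chosen so that the weight of old history
    falls to 1/e after period_s seconds of back-to-back windows."""
    return math.exp(-window_s / period_s)


def update_avg(avg, window_mean, factor):
    """Advance the decaying average by one elapsed window."""
    return avg * factor + window_mean * (1.0 - factor)
```

With one-second windows and a 60-second period, feeding a constant window mean of 100 into an average that starts at 0 brings it to about 63 (that is, 100 * (1 - 1/e)) after 60 windows, so recent windows dominate instead of the whole accumulated history.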
Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

c480bcf9

21 Jul, 2018 1 commit

vt: add /dev/vcsu* to devices.txt · 13aa0a12

Authored by Nicolas Pitre on Jul 17, 2018

Also mention that the traditional devices provide glyph values whereas
/dev/vcsu* is unicode based.
Suggested-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Nicolas Pitre <nico@linaro.org>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

13aa0a12

20 Jul, 2018 2 commits

Documentation/l1tf: Fix typos · 1949f9f4

Authored by Tony Luck on Jul 19, 2018

Fix spelling and other typos.
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

1949f9f4

x86/tsc: Redefine notsc to behave as tsc=unstable · fe9af81e

Authored by Pavel Tatashin on Jul 19, 2018

Currently, the notsc kernel parameter disables the use of the TSC by
sched_clock(). However, this parameter does not prevent the kernel from
accessing the TSC in other places.

The only rationale to boot with notsc is to avoid timing discrepancies on
multi-socket systems where TSCs are not properly synchronized, and thus
exclude TSC from being used for time keeping. But that prevents using TSC
as sched_clock() as well, which is not necessary as the core sched_clock()
implementation can handle non synchronized TSC based sched clocks just
fine.

However, there is another method to solve the above problem: booting with
tsc=unstable parameter. This parameter allows sched_clock() to use TSC and
just excludes it from timekeeping.

So there is no real reason to keep notsc, but for compatibility reasons the
parameter has to stay. Make it behave like 'tsc=unstable' instead.

[ tglx: Massaged changelog ]
Signed-off-by: Pavel Tatashin <pasha.tatashin@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Dou Liyang <douly.fnst@cn.fujitsu.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: steven.sistare@oracle.com
Cc: daniel.m.jordan@oracle.com
Cc: linux@armlinux.org.uk
Cc: schwidefsky@de.ibm.com
Cc: heiko.carstens@de.ibm.com
Cc: john.stultz@linaro.org
Cc: sboyd@codeaurora.org
Cc: hpa@zytor.com
Cc: peterz@infradead.org
Cc: prarit@redhat.com
Cc: feng.tang@intel.com
Cc: pmladek@suse.com
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: linux-s390@vger.kernel.org
Cc: boris.ostrovsky@oracle.com
Cc: jgross@suse.com
Cc: pbonzini@redhat.com
Link: https://lkml.kernel.org/r/20180719205545.16512-12-pasha.tatashin@oracle.com

fe9af81e

18 Jul, 2018 2 commits

blkcg: Track DISCARD statistics and output them in cgroup io.stat · 636620b6

Authored by Tejun Heo on Jul 18, 2018

Add tracking of REQ_OP_DISCARD ios to the per-cgroup io.stat.  Two
fields, dbytes and dios, are added to count the total bytes and the
number of discards, respectively.
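
A minimal sketch of reading the two new fields from an io.stat line, assuming the usual cgroup v2 layout of a "MAJ:MIN" device key followed by key=value pairs. The function name and the sample values are hypothetical.

```python
def parse_io_stat_line(line):
    """Split one io.stat line into (device, {field: integer value})."""
    device, *pairs = line.split()
    stats = {}
    for pair in pairs:
        key, _, value = pair.partition("=")
        stats[key] = int(value)
    return device, stats


# Illustrative line; the numbers are made up for the example.
device, stats = parse_io_stat_line(
    "8:16 rbytes=1459200 wbytes=314773504 rios=192 wios=353 "
    "dbytes=4096 dios=1")
discard_bytes = stats.get("dbytes", 0)   # 0 on kernels without the field
discard_ios = stats.get("dios", 0)
```

Using `dict.get` with a default keeps the reader working on kernels that predate dbytes and dios.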
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Michael Callahan <michaelcallahan@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

636620b6

vsprintf: Add command line option debug_boot_weak_hash · 3672476e

Authored by Tobin C. Harding on Jun 22, 2018

Currently printing [hashed] pointers requires enough entropy to be
available.  Early in the boot sequence this may not be the case
resulting in a dummy string '(____ptrval____)' being printed.  This
makes debugging the early boot sequence difficult.  We can relax the
requirement to use cryptographically secure hashing during debugging.
This enables debugging while keeping development/production kernel
behaviour the same.

If the new command line option debug_boot_weak_hash is enabled, use
cryptographically insecure hashing and hash the pointer value immediately.
Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

3672476e

13 Jul, 2018 3 commits

Documentation: Add section about CPU vulnerabilities · 3ec8ce5d

Authored by Thomas Gleixner on Jul 13, 2018

Add documentation for the L1TF vulnerability and the mitigation mechanisms:

  - Explain the problem and risks
  - Document the mitigation mechanisms
  - Document the command line controls
  - Document the sysfs files
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20180713142323.287429944@linutronix.de

3ec8ce5d

x86/bugs, kvm: Introduce boot-time control of L1TF mitigations · d90a7a0e

Authored by Jiri Kosina on Jul 13, 2018

Introduce the 'l1tf=' kernel command line option to allow for boot-time
switching of mitigation that is used on processors affected by L1TF.

The possible values are:

  full
	Provides all available mitigations for the L1TF vulnerability. Disables
	SMT and enables all mitigations in the hypervisors. SMT control via
	/sys/devices/system/cpu/smt/control is still possible after boot.
	Hypervisors will issue a warning when the first VM is started in
	a potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  full,force
	Same as 'full', but disables SMT control. Implies the 'nosmt=force'
	command line option. sysfs control of SMT and the hypervisor flush
	control is disabled.

  flush
	Leaves SMT enabled and enables the conditional hypervisor mitigation.
	Hypervisors will issue a warning when the first VM is started in a
	potentially insecure configuration, i.e. SMT enabled or L1D flush
	disabled.

  flush,nosmt
	Disables SMT and enables the conditional hypervisor mitigation. SMT
	control via /sys/devices/system/cpu/smt/control is still possible
	after boot. If SMT is reenabled or flushing disabled at runtime
	hypervisors will issue a warning.

  flush,nowarn
	Same as 'flush', but hypervisors will not warn when
	a VM is started in a potentially insecure configuration.

  off
	Disables hypervisor mitigations and doesn't emit any warnings.

Default is 'flush'.
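
The option values listed above can be summarized in a small lookup. This is only a sketch of the documented semantics, not kernel code: the function names are hypothetical, and raising an error on an unknown value is a choice made for the example rather than a statement about how the kernel reacts.

```python
L1TF_VALUES = ("full", "full,force", "flush",
               "flush,nosmt", "flush,nowarn", "off")
L1TF_DEFAULT = "flush"

# Values whose description above says that SMT is disabled at boot.
_DISABLES_SMT = {"full", "full,force", "flush,nosmt"}


def l1tf_mode(cmdline):
    """Return the l1tf= value found on a command line, or the default."""
    mode = L1TF_DEFAULT
    for token in cmdline.split():
        if token.startswith("l1tf="):
            value = token[len("l1tf="):]
            if value not in L1TF_VALUES:
                raise ValueError("unknown l1tf value: %r" % value)
            mode = value
    return mode


def disables_smt(mode):
    """True if the given mode is documented as disabling SMT at boot."""
    return mode in _DISABLES_SMT
```

For instance, `l1tf_mode("ro quiet")` yields the default 'flush', which leaves SMT enabled, while 'flush,nosmt' and both 'full' variants disable it.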

Let KVM adhere to these semantics, which means:

  - 'l1tf=full,force'	: Perform L1D flushes. No runtime control
			  possible.

  - 'l1tf=full'
  - 'l1tf=flush'
  - 'l1tf=flush,nosmt'	: Perform L1D flushes and warn on VM start if
			  SMT has been runtime enabled or L1D flushing
			  has been runtime disabled

  - 'l1tf=flush,nowarn'	: Perform L1D flushes and no warnings are emitted.

  - 'l1tf=off'		: L1D flushes are not performed and no warnings
			  are emitted.

KVM can always override the L1D flushing behavior using its 'vmentry_l1d_flush'
module parameter except when l1tf=full,force is set.

This makes KVM's private 'nosmt' option redundant, and as it is a bit
non-systematic anyway (this is something to control globally, not on
hypervisor level), remove that option.

Add the missing Documentation entry for the l1tf vulnerability sysfs file
while at it.
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jiri Kosina <jkosina@suse.cz>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Link: https://lkml.kernel.org/r/20180713142323.202758176@linutronix.de

d90a7a0e

rcutorture: Change units of onoff_interval to jiffies · 028be12b

Authored by Paul E. McKenney on May 08, 2018

Some RCU bugs have been sensitive to the frequency of CPU-hotplug
operations, which have been gradually increased over time. But this
frequency is now at the one-second lower limit that can be specified using
the rcutorture.onoff_interval kernel parameter. This commit therefore
changes the units of rcutorture.onoff_interval from seconds to jiffies,
and also sets the value specified for this kernel parameter in the TREE03
rcutorture scenario to 200, which is 200 milliseconds for HZ=1000.
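
The unit conversion behind that statement is simple arithmetic: one jiffy lasts 1000/HZ milliseconds. A short sketch, with a hypothetical helper name:

```python
def jiffies_to_ms(jiffies, hz):
    """Convert a jiffies count to milliseconds for a given HZ setting."""
    return jiffies * 1000 / hz


tree03_ms = jiffies_to_ms(200, 1000)   # the TREE03 value at HZ=1000
slower_ms = jiffies_to_ms(200, 250)    # same parameter on a HZ=250 kernel
```

The same onoff_interval=200 therefore means 200 ms at HZ=1000 but 800 ms at HZ=250, which is worth keeping in mind when reusing a scenario file on a kernel built with a different HZ.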
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

028be12b

11 Jul, 2018 2 commits

Documentation: Add powerpc options for spec_store_bypass_disable · 6b4c1360

Authored by Michael Ellerman on Jul 10, 2018

Document the support for spec_store_bypass_disable that was added for
powerpc in commit a048a07d ("powerpc/64s: Add support for a store
forwarding barrier at kernel entry/exit").
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Kees Cook <keescook@chromium.org>
Acked-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

6b4c1360

docs: kernel-parameters.txt: document xhci-hcd.quirks parameter · 819d731f

Authored by Laurentiu Tudor on Jul 05, 2018

This parameter introduced several years ago in the XHCI host controller
driver was somehow left undocumented. Add a few lines in the kernel
parameters text.
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

819d731f

10 Jul, 2018 1 commit

driver core: allow stopping deferred probe after init · 25b4e70d

Authored by Rob Herring on Jul 09, 2018

Deferred probe will currently wait forever on dependent devices to probe,
but sometimes a driver will never exist. It's also not always critical for
a driver to exist. Platforms can rely on default configuration from the
bootloader or reset defaults for things such as pinctrl and power domains.
This is often the case with initial platform support until various drivers
get enabled. There are at least 2 scenarios where deferred probe can render
a platform broken. Both involve using a DT which has more devices and
dependencies than the kernel supports. The 1st case is a driver may be
disabled in the kernel config. The 2nd case is the kernel version may
simply not have the dependent driver. This can happen if using a newer DT
(provided by firmware perhaps) with a stable kernel version. Deferred
probe issues can be difficult to debug especially if the console has
dependencies or userspace fails to boot to a shell.

There are also cases like IOMMUs where only built-in drivers are
supported, so deferring probe after initcalls is not needed. The IOMMU
subsystem implemented its own mechanism to handle this using OF_DECLARE
linker sections.

This commit makes ending deferred probe conditional on initcalls
being completed or a debug timeout. Subsystems or drivers may opt-in by
calling driver_deferred_probe_check_init_done() instead of
unconditionally returning -EPROBE_DEFER. They may use additional
information from DT or kernel's config to decide whether to continue to
defer probe or not.

The timeout mechanism is intended for debug purposes and WARNs loudly.
The remaining deferred probe pending list will also be dumped after the
timeout. Note that this timeout won't work for the console which needs
to be enabled before userspace starts. However, if the console's
dependencies are resolved, then the kernel log will be printed (as
opposed to no output).

Cc: Alexander Graf <agraf@suse.de>
Signed-off-by: Rob Herring <robh@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

25b4e70d

09 Jul, 2018 1 commit

Documentation: add a doc for blk-iolatency · b351f0c7

Authored by Josef Bacik on Jul 03, 2018

Basic documentation describing the interface, statistics, and
behavior of io.latency.
Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b351f0c7

06 Jul, 2018 1 commit

docs: kernel-parameters.txt: document xhci-hcd.quirks parameter · c0addc9a

Authored by Laurentiu Tudor on Jul 05, 2018

This parameter introduced several years ago in the XHCI host controller
driver was somehow left undocumented. Add a few lines in the kernel
parameters text.
Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

c0addc9a

05 Jul, 2018 2 commits

x86/KVM/VMX: Add module argument for L1TF mitigation · a399477e

Authored by Konrad Rzeszutek Wilk on Jul 02, 2018

Add a mitigation mode parameter "vmentry_l1d_flush" for CVE-2018-3620, aka
L1 terminal fault. The valid arguments are:

 - "always" 	L1D cache flush on every VMENTER.
 - "cond"	Conditional L1D cache flush, explained below
 - "never"	Disable the L1D cache flush mitigation

"cond" is trying to avoid L1D cache flushes on VMENTER if the code executed
between VMEXIT and VMENTER is considered safe, i.e. is not bringing any
interesting information into L1D which might be exploited.

[ tglx: Split out from a larger patch ]
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

a399477e

x86/KVM: Warn user if KVM is loaded SMT and L1TF CPU bug being present · 26acfb66

Authored by Konrad Rzeszutek Wilk on Jun 20, 2018

If the L1TF CPU bug is present we allow the KVM module to be loaded as the
majority of users that use Linux and KVM have trusted guests and do not want a
broken setup.

Cloud vendors are the ones that are uncomfortable with CVE-2018-3620 and as
such they are the ones that should set nosmt to one.

Setting 'nosmt' means that the system administrator also needs to disable
SMT (Hyper-threading) in the BIOS, or via the 'nosmt' command line
parameter, or via the /sys/devices/system/cpu/smt/control. See commit
05736e4a ("cpu/hotplug: Provide knobs to control SMT").

Other mitigations are to use task affinity, cpu sets, interrupt binding,
etc - anything to make sure that _only_ the same guest's vCPUs are running
on sibling threads.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

26acfb66

04 Jul, 2018 1 commit

usercopy: Allow boot cmdline disabling of hardening · b5cb15d9

Authored by Chris von Recklinghausen on Jul 03, 2018

Enabling HARDENED_USERCOPY may cause measurable regressions in networking
performance: up to 8% under UDP flood.

I ran a small packet UDP flood using pktgen vs. a host b2b connected. On
the receiver side the UDP packets are processed by a simple user space
process that just reads and drops them:

https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

Not very useful from a functional PoV, but it helps to pin-point
bottlenecks in the networking stack.

When running a kernel with CONFIG_HARDENED_USERCOPY=y, I see a 5-8%
regression in the receive tput, compared to the same kernel without this
option enabled.

With CONFIG_HARDENED_USERCOPY=y, perf shows ~6% of CPU time spent
cumulatively in __check_object_size (~4%) and __virt_addr_valid (~2%).

The call-chain is:

__GI___libc_recvfrom
entry_SYSCALL_64_after_hwframe
do_syscall_64
__x64_sys_recvfrom
__sys_recvfrom
inet_recvmsg
udp_recvmsg
__check_object_size

udp_recvmsg() actually calls copy_to_iter() (inlined) and the latter
calls check_copy_size() (again, inlined).

A generic distro may want to enable HARDENED_USERCOPY in its default
kernel config, but at the same time, such a distro may want to be able to
avoid the performance penalties of the default configuration and
disable the stricter check on a per-boot basis.

This change adds a boot parameter that conditionally disables
HARDENED_USERCOPY via "hardened_usercopy=off".
Signed-off-by: Chris von Recklinghausen <crecklin@redhat.com>
Signed-off-by: Kees Cook <keescook@chromium.org>

b5cb15d9

02 Jul, 2018 1 commit

Revert "x86/apic: Ignore secondary threads if nosmt=force" · 506a66f3

Authored by Thomas Gleixner on Jun 29, 2018

Dave Hansen reported that it's outright dangerous to keep SMT siblings
disabled completely so they are stuck in the BIOS and wait for SIPI.

The reason is that Machine Check Exceptions are broadcast to siblings and
the soft disabled sibling has CR4.MCE = 0. If a MCE is delivered to a
logical core with CR4.MCE = 0, it asserts IERR#, which shuts down or
reboots the machine. The MCE chapter in the SDM contains the following
blurb:

    Because the logical processors within a physical package are tightly
    coupled with respect to shared hardware resources, both logical
    processors are notified of machine check errors that occur within a
    given physical processor. If machine-check exceptions are enabled when
    a fatal error is reported, all the logical processors within a physical
    package are dispatched to the machine-check exception handler. If
    machine-check exceptions are disabled, the logical processors enter the
    shutdown state and assert the IERR# signal. When enabling machine-check
    exceptions, the MCE flag in control register CR4 should be set for each
    logical processor.

Reverting the commit which ignores siblings at enumeration time solves only
half of the problem. The core cpuhotplug logic needs to be adjusted as
well.

This thoughtfully engineered mechanism also turns the boot process on all
Intel HT enabled systems into a MCE lottery. MCE is enabled on the boot CPU
before the secondary CPUs are brought up. Depending on the number of
physical cores the window in which this situation can happen is smaller or
larger. On a HSW-EX it's about 750ms:

MCE is enabled on the boot CPU:

[    0.244017] mce: CPU supports 22 MCE banks

The corresponding sibling #72 boots:

[    1.008005] .... node  #0, CPUs:    #72

That means if an MCE hits on physical core 0 (logical CPUs 0 and 72)
between these two points the machine is going to shut down. At least it's a
known safe state.
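
The size of that window can be checked directly from the two log lines quoted above. The sketch below parses the dmesg timestamps and subtracts them; the helper names are hypothetical.

```python
import re

_TS = re.compile(r"^\[\s*(\d+\.\d+)\]")


def dmesg_seconds(line):
    """Extract the leading "[ seconds.micros]" timestamp from a log line."""
    m = _TS.match(line)
    if m is None:
        raise ValueError("no timestamp in line: %r" % line)
    return float(m.group(1))


def window_ms(first_line, second_line):
    """Time between two log lines, in milliseconds."""
    return (dmesg_seconds(second_line) - dmesg_seconds(first_line)) * 1000.0


gap = window_ms("[    0.244017] mce: CPU supports 22 MCE banks",
                "[    1.008005] .... node  #0, CPUs:    #72")
```

For these two lines the gap comes out at roughly 764 ms, in line with the "about 750ms" figure quoted for the HSW-EX system.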

It's obvious that the early boot can be hit by an MCE as well and then runs
into the same situation because MCEs are not yet enabled on the boot CPU.
But after enabling them on the boot CPU, it does not make any sense to
prevent the kernel from recovering.

Adjust the nosmt kernel parameter documentation as well.

Reverts: 2207def7 ("x86/apic: Ignore secondary threads if nosmt=force")
Reported-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Tony Luck <tony.luck@intel.com>

506a66f3

30 Jun, 2018 1 commit

PCI: Make early dump functionality generic · 11eb0e0e

Authored by Sinan Kaya on Jun 04, 2018

Move early dump functionality into common code so that it is available for
all architectures.  No need to carry arch-specific reads around as the read
hooks are already initialized by the time pci_setup_device() is getting
called during scan.
Tested-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Signed-off-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>

11eb0e0e

29 Jun, 2018 1 commit

Documentation/admin-guide/README.rst: add a label for cross-referencing · 351f10a3

Authored by Michael Rodin on Jun 03, 2018

Add a label to the top of the file to allow cross-referencing.
Currently it's not possible to cross-reference this file from
Documentation/process/howto.rst because of the missing label.
Signed-off-by: Michael Rodin <michael-git@rodin.online>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

351f10a3

27 Jun, 2018 2 commits

Documentation: intel_pstate: Describe hwp_dynamic_boost sysfs knob · 649f53a3

Authored by Rafael J. Wysocki on Jun 26, 2018

Document the recently introduced hwp_dynamic_boost sysfs knob
allowing user space to tell intel_pstate to use iowait boosting
in the active mode with HWP enabled (to improve performance).
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Acked-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>

649f53a3

Documentation: admin-guide: intel_pstate: Fix sysfs path · 9e421b8f

Authored by Rafael J. Wysocki on Jun 26, 2018

Fix an incorrect sysfs path in the intel_pstate admin-guide
documentation.

Fixes: 33fc30b4 (cpufreq: intel_pstate: Document the current behavior and user interface)
Reported-by: Pawit Pornkitprasan <p.pawit@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

9e421b8f

21 Jun, 2018 2 commits

cpu/hotplug: Provide knobs to control SMT · 05736e4a

Authored by Thomas Gleixner on May 29, 2018

Provide a command line and a sysfs knob to control SMT.

The command line options are:

 'nosmt':	Enumerate secondary threads, but do not online them
 		
 'nosmt=force': Ignore secondary threads completely during enumeration
 		via MP table and ACPI/MADT.

The sysfs control file has the following states (read/write):

 'on':		 SMT is enabled. Secondary threads can be freely onlined
 'off':		 SMT is disabled. Secondary threads, even if enumerated
 		 cannot be onlined
 'forceoff':	 SMT is permanently disabled. Writes to the control
 		 file are rejected.
 'notsupported': SMT is not supported by the CPU

The command line option 'nosmt' sets the sysfs control to 'off'. This
can be changed to 'on' to reenable SMT during runtime.

The command line option 'nosmt=force' sets the sysfs control to
'forceoff'. This cannot be changed during runtime.

When SMT is 'on' and the control file is changed to 'off' then all online
secondary threads are offlined and attempts to online a secondary thread
later on are rejected.

When SMT is 'off' and the control file is changed to 'on' then secondary
threads can be onlined again. The 'off' -> 'on' transition does not
automatically online the secondary threads.

When the control file is set to 'forceoff', the behaviour is the same as
setting it to 'off', but the operation is irreversible and later writes to
the control file are rejected.

When the control status is 'notsupported' then writes to the control file
are rejected.
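
The state transitions described above can be modelled in a few lines. This is a toy model for illustration only, not the kernel implementation: the class and exception names are hypothetical, and the real control file reports rejected writes through an error code that is not modelled here.

```python
class SmtWriteRejected(Exception):
    """Raised by the model when a write to the control file is rejected."""


class SmtControl:
    """Toy model of the SMT control file states described above."""

    WRITABLE = ("on", "off", "forceoff")

    def __init__(self, boot_option=None, supported=True):
        if not supported:
            self.state = "notsupported"
        elif boot_option == "nosmt=force":
            self.state = "forceoff"
        elif boot_option == "nosmt":
            self.state = "off"
        else:
            self.state = "on"

    def write(self, value):
        # 'forceoff' is irreversible and 'notsupported' is read-only.
        if self.state in ("forceoff", "notsupported"):
            raise SmtWriteRejected(self.state)
        if value not in self.WRITABLE:
            raise ValueError("unknown state: %r" % value)
        self.state = value

    def can_online_secondary(self):
        """Secondary threads may only be onlined while SMT is 'on'."""
        return self.state == "on"
```

Booting with 'nosmt' starts the model in 'off' and a later `write("on")` re-enables onlining, whereas after 'nosmt=force' every write is rejected. As in the description, switching back to 'on' only permits onlining; it does not online the threads by itself.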
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Acked-by: Ingo Molnar <mingo@kernel.org>

05736e4a

Documentation: intel_pstate: Fix typo · 7a0f9d1e

Authored by Rafael J. Wysocki on Jun 20, 2018

Fix a typo in the intel_pstate admin-guide documentation.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

7a0f9d1e

16 Jun, 2018 2 commits

kernel-parameters.txt: fix pointers to sound parameters · 1ca2c806

Authored by Mauro Carvalho Chehab on Jun 14, 2018

The alsa parameters file was renamed to alsa-configuration.rst.

With regard to OSS, it was retired as a whole in changeset
727dede0 ("sound: Retire OSS"). So, it doesn't make sense
to keep mentioning it in kernel-parameters.txt.

Fixes: 727dede0 ("sound: Retire OSS")
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>

1ca2c806

docs: Fix some broken references · 5fb94e9c

Authored by Mauro Carvalho Chehab on May 08, 2018

As we move stuff around, some doc references are broken. Fix some of
them via this script:
	./scripts/documentation-file-ref-check --fix

Manually checked if the produced result is valid, removing a few
false-positives.
Acked-by: Takashi Iwai <tiwai@suse.de>
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Acked-by: Stephen Boyd <sboyd@kernel.org>
Acked-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
Acked-by: Mathieu Poirier <mathieu.poirier@linaro.org>
Reviewed-by: Coly Li <colyli@suse.de>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Jonathan Corbet <corbet@lwn.net>

5fb94e9c

08 Jun, 2018 · 4 commits

mm: memcg: allow lowering memory.swap.max below the current usage · be09102b

Committed by Tejun Heo on Jun 07, 2018

Currently an attempt to set swap.max to a value lower than the actual
swap usage fails, which causes configuration problems as there's no way
of lowering the configuration below the current usage short of turning
off swap entirely.  This makes swap.max difficult to use and allows
delegatees to lock the delegator out of reducing swap allocation.

This patch updates swap_max_write() so that the limit can be lowered
below the current usage.  It doesn't implement active reclaiming of swap
entries for the following reasons.

* mem_cgroup_swap_full() already tells the swap machinery to
  aggressively reclaim swap entries if the usage is above 50% of
  limit, so simply lowering the limit automatically triggers gradual
  reclaim.

* Forcing back swapped out pages is likely to heavily impact the
  workload and mess up the working set.  Given that swap usually is a
  lot less valuable and less scarce, letting the existing usage
  dissipate over time through the above gradual reclaim and as they're
  faulted back in is likely the better behavior.

Link: http://lkml.kernel.org/r/20180523185041.GR1718769@devbig577.frc2.facebook.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Rik van Riel <riel@surriel.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Shaohua Li <shli@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

be09102b
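
The interaction between a lowered limit and gradual reclaim, as described in the commit above, can be sketched with a toy model. This is only an illustration under stated assumptions: the class `SwapLimit` and its methods are hypothetical, the units are arbitrary pages, and the exact comparison in `swap_full()` is an assumed reading of "usage is above 50% of limit", not a copy of mem_cgroup_swap_full().

```python
class SwapLimit:
    """Toy model of a cgroup's swap usage and swap.max (units: pages)."""

    def __init__(self, usage, limit):
        self.usage = usage
        self.limit = limit

    def set_max(self, new_limit):
        # After the patch the write is accepted even when new_limit < usage;
        # no swap entries are reclaimed synchronously.
        self.limit = new_limit

    def swap_full(self):
        # Assumed form of the "above 50% of limit" check from the commit text.
        return self.usage * 2 > self.limit


cg = SwapLimit(usage=300, limit=1000)
before = cg.swap_full()     # 300 is below half of 1000
cg.set_max(200)             # lower the limit below the current usage
after = cg.swap_full()      # usage now exceeds half of the limit
```

In this reading, lowering the limit changes no usage by itself; it only flips the "full" signal, which is what lets the existing aggressive reclaim path bring usage down over time.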

memcg: introduce memory.min · bf8d5d52

Committed by Roman Gushchin on Jun 07, 2018

Memory controller implements the memory.low best-effort memory
protection mechanism, which works perfectly in many cases and allows
protecting working sets of important workloads from sudden reclaim.

But its semantics have a significant limitation: it works only as long as
there is a supply of reclaimable memory.  This makes it pretty useless
against any sort of slow memory leaks or memory usage increases.  This
is especially true for swapless systems.  If swap is enabled, memory
soft protection effectively postpones problems, allowing a leaking
application to fill all swap area, which makes no sense.  The only
effective way to guarantee the memory protection in this case is to
invoke the OOM killer.

It's possible to handle this case in userspace by reacting to MEMCG_LOW
events; but there is still a place for a fail-safe in-kernel mechanism
to provide stronger guarantees.

This patch introduces the memory.min interface for cgroup v2 memory
controller.  It works very similarly to memory.low (sharing the same
hierarchical behavior), except that it's not disabled if there is no
more reclaimable memory in the system.

If a cgroup is not populated, its memory.min is ignored, because otherwise
even the OOM killer wouldn't be able to reclaim the protected memory,
and the system can stall.

[guro@fb.com: s/low/min/ in docs]
Link: http://lkml.kernel.org/r/20180510130758.GA9129@castle.DHCP.thefacebook.com
Link: http://lkml.kernel.org/r/20180509180734.GA4856@castle.DHCP.thefacebook.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

bf8d5d52
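
The difference between memory.low and memory.min described in the commit above can be sketched as a single decision function. This is a deliberately simplified model, not the kernel's logic: the function `may_reclaim` and its parameters are hypothetical, and the hierarchical calculation of effective protection is left out entirely.

```python
def may_reclaim(usage, low, min_, others_reclaimable, populated=True):
    """Return True if reclaim may take memory from this cgroup.

    usage, low and min_ are in the same (arbitrary) unit.
    others_reclaimable says whether unprotected memory is still
    available elsewhere in the system.
    """
    # memory.min: hard protection, but ignored for unpopulated cgroups.
    if populated and usage <= min_:
        return False
    # memory.low: best effort, honoured only while other memory can be reclaimed.
    if usage <= low and others_reclaimable:
        return False
    return True


# Under memory.low only, protection ends once nothing else is reclaimable.
low_only = may_reclaim(100, low=200, min_=0, others_reclaimable=False)
# Under memory.min, the same situation still protects the cgroup.
with_min = may_reclaim(100, low=0, min_=200, others_reclaimable=False)
```

When `with_min` is false and nothing else can be reclaimed, the only remaining way to free memory is the OOM killer, which matches the trade-off the commit message describes.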

mm/docs: describe memory.low refinements · 7854207f

Committed by Roman Gushchin on Jun 07, 2018

Refine cgroup v2 docs after latest memory.low changes.

Link: http://lkml.kernel.org/r/20180405185921.4942-4-guro@fb.com
Signed-off-by: Roman Gushchin <guro@fb.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7854207f

mm, memcontrol: implement memory.swap.events · f3a53a3a

Committed by Tejun Heo on Jun 07, 2018

Add swap max and fail events so that userland can monitor and respond to
running out of swap.

I'm not too sure about the fail event.  Right now, it's a bit confusing
which stats / events are recursive and which aren't and also which ones
reflect events which originate from a given cgroup and which target the
cgroup.  No idea what the right long term solution is and it could just
be that growing them organically is actually the only right thing to do.

Link: http://lkml.kernel.org/r/20180416231151.GI1911913@devbig577.frc2.facebook.com
Signed-off-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

f3a53a3a
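
A userland monitor for the swap events in the commit above might look roughly like the sketch below. It assumes the events are exposed as one "key value" pair per line, with keys named after the "max" and "fail" events in the commit summary; that layout, the helper names `parse_flat_keyed` and `new_events`, and the sample counter values are all assumptions made for illustration.

```python
def parse_flat_keyed(text):
    """Parse 'key value' lines into a dict of integer counters."""
    events = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, value = line.split()
        events[key] = int(value)
    return events


def new_events(previous, current):
    """Return how much each counter grew between two snapshots."""
    return {key: current[key] - previous.get(key, 0) for key in current}


# Invented sample snapshots; a real monitor would read the cgroup's file.
earlier = parse_flat_keyed("max 3\nfail 0\n")
later = parse_flat_keyed("max 5\nfail 1\n")
delta = new_events(earlier, later)
```

Comparing successive snapshots rather than reading absolute values lets a monitor react only when a counter has actually moved since the last check.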

07 Jun, 2018 · 1 commit

apparmor: update git and wiki locations in AppArmor docs · b896c54e

Committed by Jordan Glover on May 05, 2018

The apparmor information in the apparmor.rst file is out of date.
Update it to the correct git reference for the master apparmor tree.
Update the wiki location to use apparmor.net which forwards to the
current wiki location on gitlab.com. Update user space tools address
to gitlab.com.
Signed-off-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: John Johansen <john.johansen@canonical.com>

b896c54e

openanolis / cloud-kernel · synced successfully about 1 year ago
