提交 · c57c80467f90e5504c8df9ad3555d2c78800bf94 · openeuler / Kernel

04 11月, 2019 1 次提交

kvm: mmu: ITLB_MULTIHIT mitigation · b8e8c830

由 Paolo Bonzini 提交于 11月 04, 2019

With some Intel processors, putting the same virtual address in the TLB
as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit
and cause the processor to issue a machine check resulting in a CPU lockup.

Unfortunately when EPT page tables use huge pages, it is possible for a
malicious guest to cause this situation.

Add a knob to mark huge pages as non-executable. When the nx_huge_pages
parameter is enabled (and we are using EPT), all huge pages are marked as
NX. If the guest attempts to execute in one of those pages, the page is
broken down into 4K pages, which are then marked executable.

This is not an issue for shadow paging (except nested EPT), because then
the host is in control of TLB flushes and the problematic situation cannot
happen.  With nested EPT, again the nested guest can cause problems shadow
and direct EPT is treated in the same way.

[ tglx: Fixup default to auto and massage wording a bit ]
Originally-by: NJunaid Shahid <junaids@google.com>
Signed-off-by: NPaolo Bonzini <pbonzini@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

b8e8c830

28 10月, 2019 3 次提交

x86/speculation/taa: Add documentation for TSX Async Abort · a7a248c5

由 Pawan Gupta 提交于 10月 23, 2019

Add the documenation for TSX Async Abort. Include the description of
the issue, how to check the mitigation state, control the mitigation,
guidance for system administrators.

 [ bp: Add proper SPDX tags, touch ups by Josh and me. ]
Co-developed-by: NAntonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: NAntonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NMark Gross <mgross@linux.intel.com>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>

a7a248c5

x86/tsx: Add "auto" option to the tsx= cmdline parameter · 7531a359

由 Pawan Gupta 提交于 10月 23, 2019

Platforms which are not affected by X86_BUG_TAA may want the TSX feature
enabled. Add "auto" option to the TSX cmdline parameter. When tsx=auto
disable TSX when X86_BUG_TAA is present, otherwise enable TSX.

More details on X86_BUG_TAA can be found here:
https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html

 [ bp: Extend the arg buffer to accommodate "auto\0". ]
Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NTony Luck <tony.luck@intel.com>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>

7531a359

x86/cpu: Add a "tsx=" cmdline option with TSX disabled by default · 95c5824f

由 Pawan Gupta 提交于 10月 23, 2019

Add a kernel cmdline parameter "tsx" to control the Transactional
Synchronization Extensions (TSX) feature. On CPUs that support TSX
control, use "tsx=on|off" to enable or disable TSX. Not specifying this
option is equivalent to "tsx=off". This is because on certain processors
TSX may be used as a part of a speculative side channel attack.

Carve out the TSX controlling functionality into a separate compilation
unit because TSX is a CPU feature while the TSX async abort control
machinery will go to cpu/bugs.c.

 [ bp: - Massage, shorten and clear the arg buffer.
       - Clarifications of the tsx= possible options - Josh.
       - Expand on TSX_CTRL availability - Pawan. ]
Signed-off-by: NPawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NJosh Poimboeuf <jpoimboe@redhat.com>

95c5824f

08 10月, 2019 2 次提交

mm, memcg: proportional memory.{low,min} reclaim · 9783aa99

由 Chris Down 提交于 10月 06, 2019

cgroup v2 introduces two memory protection thresholds: memory.low
(best-effort) and memory.min (hard protection).  While they generally do
what they say on the tin, there is a limitation in their implementation
that makes them difficult to use effectively: that cliff behaviour often
manifests when they become eligible for reclaim.  This patch implements
more intuitive and usable behaviour, where we gradually mount more
reclaim pressure as cgroups further and further exceed their protection
thresholds.

This cliff edge behaviour happens because we only choose whether or not
to reclaim based on whether the memcg is within its protection limits
(see the use of mem_cgroup_protected in shrink_node), but we don't vary
our reclaim behaviour based on this information.  Imagine the following
timeline, with the numbers the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be
   scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even
  without this patch because of scan control priority, over-reclaim
  protection, etc.  However, as shown by the tests at the end, these
  techniques don't sufficiently throttle such an extreme change in input,
  so cliff-like behaviour isn't really averted by their existence alone.

Here's an example of how this plays out in practice.  At Facebook, we are
trying to protect various workloads from "system" software, like
configuration management tools, metric collectors, etc (see this[0] case
study).  In order to find a suitable memory.low value, we start by
determining the expected memory range within which the workload will be
comfortable operating.  This isn't an exact science -- memory usage deemed
"comfortable" will vary over time due to user behaviour, differences in
composition of work, etc, etc.  As such we need to ballpark memory.low,
but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have
   *any* effect (see discussion above).  The group will receive the full
   weight of reclaim and won't have any priority while competing with the
   less important system software, as if we had no memory.low configured
   at all.

2. Because of this behaviour, we end up erring on the side of setting
   it too high, such that the comfort range is reliably covered.  However,
   protected memory is completely unavailable to the rest of the system,
   so we might cause undue memory and IO pressure there when we *know* we
   have some elasticity in the workload.

3. Even if we get the value totally right, smack in the middle of the
   comfort zone, we get extreme jumps between no pressure and full
   pressure that cause unpredictable pressure spikes in the workload due
   to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much
worry.  Any undesirable behaviour, such as too much or too little reclaim
pressure on the workload or system will be proportional to how far our
estimation is off.  This means we can set memory.low much more
conservatively and thus waste less resources *without* the risk of the
workload falling off a cliff if we overshoot.

As a more abstract technical description, this unintuitive behaviour
results in having to give high-priority workloads a large protection
buffer on top of their expected usage to function reliably, as otherwise
we have abrupt periods of dramatically increased memory pressure which
hamper performance.  Having to set these thresholds so high wastes
resources and generally works against the principle of work conservation.
In addition, having proportional memory reclaim behaviour has other
benefits.  Most notably, before this patch it's basically mandatory to set
memory.low to a higher than desirable value because otherwise as soon as
you exceed memory.low, all protection is lost, and all pages are eligible
to scan again.  By contrast, having a gradual ramp in reclaim pressure
means that you now still get some protection when thresholds are exceeded,
which means that one can now be more comfortable setting memory.low to
lower values without worrying that all protection will be lost.  This is
important because workingset size is really hard to know exactly,
especially with variable workloads, so at least getting *some* protection
if your workingset size grows larger than you expect increases user
confidence in setting memory.low without a huge buffer on top being
needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and
assistance in thinking about how to make this work better.

In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of
   binary.

   To test this, I experimented stepping further and further down
   memory.low protection on a workload that floats around 19G workingset
   when under memory.low protection, watching page scan rates for the
   workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   |        21G |               0 |                  0 | N/A          |
   |        17G |             867 |               3799 | 23%          |
   |        12G |            1203 |               3543 | 34%          |
   |         8G |            2534 |               3979 | 64%          |
   |         4G |            3980 |               4147 | 96%          |
   |          0 |            3799 |               3980 | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this
   patch) ramps up page scanning significantly more gradually than the
   control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in
   premature OOMs.

   To test this, I wrote a script that slowly increments the number of
   pages held by stress(1)'s --vm-keep mode until a production system
   entered severe overall memory contention.  This script runs in a highly
   protected slice taking up the majority of available system memory.
   Watching vmstat revealed that page scanning continued essentially
   nominally between test and control, without causing forward reclaim
   progress to become arrested.

[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project

[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
  Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Reviewed-by: NRoman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9783aa99

x86/xen: Return from panic notifier · c6875f3a

由 Boris Ostrovsky 提交于 9月 30, 2019

Currently execution of panic() continues until Xen's panic notifier
(xen_panic_event()) is called at which point we make a hypercall that
never returns.

This means that any notifier that is supposed to be called later as
well as significant part of panic() code (such as pstore writes from
kmsg_dump()) is never executed.

There is no reason for xen_panic_event() to be this last point in
execution since panic()'s emergency_restart() will call into
xen_emergency_restart() from where we can perform our hypercall.

Nevertheless, we will provide xen_legacy_crash boot option that will
preserve original behavior during crash. This option could be used,
for example, if running kernel dumper (which happens after panic
notifiers) is undesirable.
Reported-by: NJames Dingwall <james@dingwall.me.uk>
Signed-off-by: NBoris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: NJuergen Gross <jgross@suse.com>

c6875f3a

25 9月, 2019 2 次提交

memcg, kmem: deprecate kmem.limit_in_bytes · 0158115f

由 Michal Hocko 提交于 9月 23, 2019

Cgroup v1 memcg controller has exposed a dedicated kmem limit to users
which turned out to be really a bad idea because there are paths which
cannot shrink the kernel memory usage enough to get below the limit (e.g.
because the accounted memory is not reclaimable). There are cases when
the failure is even not allowed (e.g. __GFP_NOFAIL). This means that the
kmem limit is in excess to the hard limit without any way to shrink and
thus completely useless. OOM killer cannot be invoked to handle the
situation because that would lead to a premature oom killing.

As a result many places might see ENOMEM returning from kmalloc and result
in unexpected errors. E.g. a global OOM killer when there is a lot of
free memory because ENOMEM is translated into VM_FAULT_OOM in #PF path and
therefore pagefault_out_of_memory would result in OOM killer.

Please note that the kernel memory is still accounted to the overall limit
along with the user memory so removing the kmem specific limit should
still allow to contain kernel memory consumption. Unlike the kmem one,
though, it invokes memory reclaim and targeted memcg oom killing if
necessary.

Start the deprecation process by crying to the kernel log. Let's see
whether there are relevant usecases and simply return to EINVAL in the
second stage if nobody complains in few releases.

[akpm@linux-foundation.org: tweak documentation text]
Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

0158115f

mm, page_owner, debug_pagealloc: save and dump freeing stack trace · 8974558f

由 Vlastimil Babka 提交于 9月 23, 2019

The debug_pagealloc functionality is useful to catch buggy page allocator
users that cause e.g.  use after free or double free.  When page
inconsistency is detected, debugging is often simpler by knowing the call
stack of process that last allocated and freed the page.  When page_owner
is also enabled, we record the allocation stack trace, but not freeing.

This patch therefore adds recording of freeing process stack trace to page
owner info, if both page_owner and debug_pagealloc are configured and
enabled.  With only page_owner enabled, this info is not useful for the
memory leak debugging use case.  dump_page() is adjusted to print the
info.  An example result of calling __free_pages() twice may look like
this (note the page last free stack trace):

BUG: Bad page state in process bash  pfn:13d8f8
page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0x1affff800000000()
raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
page dumped because: nonzero _refcount
page_owner tracks the page as freed
page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
 prep_new_page+0x143/0x150
 get_page_from_freelist+0x289/0x380
 __alloc_pages_nodemask+0x13c/0x2d0
 khugepaged+0x6e/0xc10
 kthread+0xf9/0x130
 ret_from_fork+0x3a/0x50
page last free stack trace:
 free_pcp_prepare+0x134/0x1e0
 free_unref_page+0x18/0x90
 khugepaged+0x7b/0xc10
 kthread+0xf9/0x130
 ret_from_fork+0x3a/0x50
Modules linked in:
CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
Call Trace:
 dump_stack+0x85/0xc0
 bad_page.cold+0xba/0xbf
 rmqueue_pcplist.isra.0+0x6c5/0x6d0
 rmqueue+0x2d/0x810
 get_page_from_freelist+0x191/0x380
 __alloc_pages_nodemask+0x13c/0x2d0
 __get_free_pages+0xd/0x30
 __pud_alloc+0x2c/0x110
 copy_page_range+0x4f9/0x630
 dup_mmap+0x362/0x480
 dup_mm+0x68/0x110
 copy_process+0x19e1/0x1b40
 _do_fork+0x73/0x310
 __x64_sys_clone+0x75/0x80
 do_syscall_64+0x6e/0x1e0
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x7f10af854a10
...

Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.czSigned-off-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

8974558f

14 9月, 2019 2 次提交

Documentation: Add "earlycon=sbi" to the admin guide · 82f12ab3

由 Palmer Dabbelt 提交于 9月 13, 2019

This argument is supported on RISC-V systems and widely used, but was
not documented here.
Signed-off-by: NPalmer Dabbelt <palmer@sifive.com>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

82f12ab3

devices.txt: improve entry for comedi (char major 98) · d62e8055

由 Ian Abbott 提交于 9月 11, 2019

Describe how the comedi minor device numbers are split across comedi
devices and comedi subdevices.

Replace the current, long dead URL with an official URL for the Comedi
project.
Signed-off-by: NIan Abbott <abbotti@mev.co.uk>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

d62e8055

12 9月, 2019 1 次提交

dm: add clone target · 7431b783

由 Nikos Tsironis 提交于 9月 11, 2019

Add the dm-clone target, which allows cloning of arbitrary block
devices.

dm-clone produces a one-to-one copy of an existing, read-only source
device into a writable destination device: It presents a virtual block
device which makes all data appear immediately, and redirects reads and
writes accordingly.

The main use case of dm-clone is to clone a potentially remote,
high-latency, read-only, archival-type block device into a writable,
fast, primary-type device for fast, low-latency I/O. The cloned device
is visible/mountable immediately and the copy of the source device to
the destination device happens in the background, in parallel with user
I/O.

When the cloning completes, the dm-clone table can be removed altogether
and be replaced, e.g., by a linear table, mapping directly to the
destination device.

For further information and examples of how to use dm-clone, please read
Documentation/admin-guide/device-mapper/dm-clone.rst
Suggested-by: NVangelis Koukis <vkoukis@arrikto.com>
Co-developed-by: NIlias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: NIlias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: NNikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

7431b783

11 9月, 2019 1 次提交

iommu/vt-d: Check whether device requires bounce buffer · e5e04d05

由 Lu Baolu 提交于 9月 06, 2019

This adds a helper to check whether a device needs to
use bounce buffer. It also provides a boot time option
to disable the bounce buffer. Users can use this to
prevent the iommu driver from using the bounce buffer
for performance gain.

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: NLu Baolu <baolu.lu@linux.intel.com>
Tested-by: NXu Pengfei <pengfei.xu@intel.com>
Tested-by: NMika Westerberg <mika.westerberg@intel.com>
Signed-off-by: NJoerg Roedel <jroedel@suse.de>

e5e04d05

08 9月, 2019 1 次提交

platform/x86: thinkpad_acpi: Add ThinkPad PrivacyGuard · 110ea1d8

由 Alexander Schremmer 提交于 8月 22, 2019

This feature is found optionally in T480s, T490, T490s.

The feature is called lcdshadow and visible via
/proc/acpi/ibm/lcdshadow.

The ACPI methods \_SB.PCI0.LPCB.EC.HKEY.{GSSS,SSSS,TSSS,CSSS} are
available in these machines. They get, set, toggle or change the state
apparently.

The patch was tested on a 5.0 series kernel on a T480s.
Signed-off-by: NAlexander Schremmer <alex@alexanderweb.de>
Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>

110ea1d8

06 9月, 2019 1 次提交

Documentation: sysrq: don't recommend 'S' 'U' before 'B' · 209c3aa7

由 Adam Borowski 提交于 9月 03, 2019

This advice is obsolete and slightly harmful for filesystems from this
millenium: any modern filesystem can handle unexpected crashes without
requiring fsck -- and on the other hand, trying to write to the disk when
the kernel is in a bad state risks introducing corruption.

For ext2, any unsafe shutdown meant widespread breakage, but it's no longer
a reasonable filesystem for any non-special use.
Signed-off-by: NAdam Borowski <kilobyte@angband.pl>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

209c3aa7

05 9月, 2019 1 次提交

powerpc/64s/radix: introduce options to disable use of the tlbie instruction · 2275d7b5

由 Nicholas Piggin 提交于 9月 03, 2019

Introduce two options to control the use of the tlbie instruction. A
boot time option which completely disables the kernel using the
instruction, this is currently incompatible with HASH MMU, KVM, and
coherent accelerators.

And a debugfs option can be switched at runtime and avoids using tlbie
for invalidating CPU TLBs for normal process and kernel address
mappings. Coherent accelerators are still managed with tlbie, as will
KVM partition scope translations.

Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a
basic implementation which does not attempt to make any optimisation
beyond the tlbie implementation.

This is useful for performance testing among other things. For example
in certain situations on large systems, using IPIs may be faster than
tlbie as they can be directed rather than broadcast. Later we may also
take advantage of the IPIs to do more interesting things such as trim
the mm cpumask more aggressively.
Signed-off-by: NNicholas Piggin <npiggin@gmail.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com

2275d7b5

04 9月, 2019 1 次提交

tty: serial: Add linflexuart driver for S32V234 · 09864c1c

由 Stefan-gabriel Mirea 提交于 8月 09, 2019

Introduce support for LINFlex driver, based on:
- the version of Freescale LPUART driver after commit b3e3bf2e ("Merge
  4.0-rc7 into tty-next");
- commit abf1e0a9 ("tty: serial: fsl_lpuart: lock port on console
  write").
In this basic version, the driver can be tested using initramfs and relies
on the clocks and pin muxing set up by U-Boot.

Remarks concerning the earlycon support:

- LinFlexD does not allow character transmissions in the INIT mode (see
  section 47.4.2.1 in the reference manual[1]). Therefore, a mutual
  exclusion between the first linflex_setup_watermark/linflex_set_termios
  executions and linflex_earlycon_putchar was employed and the characters
  normally sent to earlycon during initialization are kept in a buffer and
  sent afterwards.

- Empirically, character transmission is also forbidden within the last 1-2
  ms before entering the INIT mode, so we use an explicit timeout
  (PREINIT_DELAY) between linflex_earlycon_putchar and the first call to
  linflex_setup_watermark.

- U-Boot currently uses the UART FIFO mode, while this driver makes the
  transition to the buffer mode. Therefore, the earlycon putchar function
  matches the U-Boot behavior before initializations and the Linux behavior
  after.

[1] https://www.nxp.com/webapp/Download?colCode=S32V234RMSigned-off-by: NStoica Cosmin-Stefan <cosmin.stoica@nxp.com>
Signed-off-by: NAdrian.Nitu <adrian.nitu@freescale.com>
Signed-off-by: NLarisa Grigore <Larisa.Grigore@nxp.com>
Signed-off-by: NAna Nedelcu <B56683@freescale.com>
Signed-off-by: NMihaela Martinas <Mihaela.Martinas@freescale.com>
Signed-off-by: NMatthew Nunez <matthew.nunez@nxp.com>
[stefan-gabriel.mirea@nxp.com: Reduced for upstreaming and implemented
                               earlycon support]
Signed-off-by: NStefan-Gabriel Mirea <stefan-gabriel.mirea@nxp.com>
Link: https://lore.kernel.org/r/20190809112853.15846-6-stefan-gabriel.mirea@nxp.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

09864c1c

03 9月, 2019 3 次提交

Documentation:kernel-per-CPU-kthreads.txt: Remove reference to elevator= · fa99165c

由 Marcos Paulo de Souza 提交于 9月 03, 2019

This argument was not being considered since blk-mq was set by default,
so removed this documentation to avoid confusion.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMarcos Paulo de Souza <marcos.souza.org@gmail.com>

.txt file is now .rst
Signed-off-by: NJens Axboe <axboe@kernel.dk>

fa99165c

block: elevator.c: Remove now unused elevator= argument · 85c0a037

由 Marcos Paulo de Souza 提交于 8月 27, 2019

Since the inclusion of blk-mq, elevator argument was not being
considered anymore, and it's utility died long with the legacy IO path,
now removed too.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NBob Liu <bob.liu@oracle.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NMarcos Paulo de Souza <marcos.souza.org@gmail.com>

Fold with doc removal patch.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

85c0a037

sched/uclamp: Extend CPU's cgroup controller · 2480c093

由 Patrick Bellasi 提交于 8月 22, 2019

The cgroup CPU bandwidth controller allows to assign a specified
(maximum) bandwidth to the tasks of a group. However this bandwidth is
defined and enforced only on a temporal base, without considering the
actual frequency a CPU is running on. Thus, the amount of computation
completed by a task within an allocated bandwidth can be very different
depending on the actual frequency the CPU is running that task.
The amount of computation can be affected also by the specific CPU a
task is running on, especially when running on asymmetric capacity
systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able
to drive frequency selections based on actual task utilization.
Moreover, the utilization clamping support provides a mechanism to
bias the frequency selection operated by schedutil depending on
constraints assigned to the tasks currently RUNNABLE on a CPU.

Giving the mechanisms described above, it is now possible to extend the
cpu controller to specify the minimum (or maximum) utilization which
should be considered for tasks RUNNABLE on a cpu.
This makes it possible to better defined the actual computational
power assigned to task groups, thus improving the cgroup CPU bandwidth
controller which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes uclamp.{min,max}
which allow to enforce utilization boosting and capping for all the
tasks in a group.

Specifically:

- uclamp.min: defines the minimum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run at least at a
	      minimum frequency which corresponds to the uclamp.min
	      utilization

- uclamp.max: defines the maximum utilization which should be considered
	      i.e. the RUNNABLE tasks of this group will run up to a
	      maximum frequency which corresponds to the uclamp.max
	      utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy
   hierarchies, while system wide clamps are defined by a generic
   interface which does not depends on cgroups. This system wide
   interface enforces constraints on tasks in the root node.

b) enforce effective constraints at each level of the hierarchy which
   are a restriction of the group requests considering its parent's
   effective constraints. Root group effective constraints are defined
   by the system wide interface.
   This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent

c) have higher priority than task-specific clamps, defined via
   sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested"
clamp values. Allow that at each non-root level of the hierarchy.
Keep it simple by not caring now about "effective" values computation
and propagation along the hierarchy.

Update sysctl_sched_uclamp_handler() to use the newly introduced
uclamp_mutex so that we serialize system default updates with cgroup
relate updates.
Signed-off-by: NPatrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NMichal Koutny <mkoutny@suse.com>
Acked-by: NTejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J . Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.comSigned-off-by: NIngo Molnar <mingo@kernel.org>

2480c093

30 8月, 2019 1 次提交

powerpc/prom_init: Add the ESM call to prom_init · 6a9c930b

由 Ram Pai 提交于 8月 19, 2019

Make the Enter-Secure-Mode (ESM) ultravisor call to switch the VM to secure
mode. Pass kernel base address and FDT address so that the Ultravisor is
able to verify the integrity of the VM using information from the ESM blob.

Add "svm=" command line option to turn on switching to secure mode.
Signed-off-by: NRam Pai <linuxram@us.ibm.com>
[ andmike: Generate an RTAS os-term hcall when the ESM ucall fails. ]
Signed-off-by: NMichael Anderson <andmike@linux.ibm.com>
[ bauerman: Cleaned up the code a bit. ]
Signed-off-by: NThiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-5-bauerman@linux.ibm.com

6a9c930b

29 8月, 2019 2 次提交

blkcg: add tools/cgroup/iocost_coef_gen.py · 8504dea7

由 Tejun Heo 提交于 8月 28, 2019

Add a script which can be used to generate device-specific iocost
linear model coefficients.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

8504dea7

blkcg: implement blk-iocost · 7caa4715

由 Tejun Heo 提交于 8月 28, 2019

This patchset implements IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such model.  The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across differents IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7caa4715

28 8月, 2019 2 次提交

docs/perf: Add documentation for the i.MX8 DDR PMU · 3724e186

由 Joakim Zhang 提交于 8月 28, 2019

Add some documentation describing the DDR PMU residing in the Freescale
i.MDX SoC and its perf driver implementation in Linux.
Signed-off-by: NJoakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: NWill Deacon <will@kernel.org>

3724e186

Revert "of/platform: Add functional dependency link from DT bindings" · d77b3f07

由 Greg Kroah-Hartman 提交于 8月 27, 2019

This reverts commit 690ff788.

Based on a lot of email and in-person discussions, this patch series is
being reworked to address a number of issues that were pointed out that
needed to be taken care of before it should be merged.  It will be
resubmitted with those changes hopefully soon.

Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Saravana Kannan <saravanak@google.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

d77b3f07

23 8月, 2019 2 次提交

dm verity: add root hash pkcs#7 signature verification · 88cd3e6c

由 Jaskaran Khurana 提交于 7月 17, 2019

The verification is to support cases where the root hash is not secured
by Trusted Boot, UEFI Secureboot or similar technologies.

One of the use cases for this is for dm-verity volumes mounted after
boot, the root hash provided during the creation of the dm-verity volume
has to be secure and thus in-kernel validation implemented here will be
used before we trust the root hash and allow the block device to be
created.

The signature being provided for verification must verify the root hash
and must be trusted by the builtin keyring for verification to succeed.

The hash is added as a key of type "user" and the description is passed
to the kernel so it can look it up and use it for verification.

Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root
hash verification is needed.

Kernel commandline dm_verity module parameter 'require_signatures' will
indicate whether to force root hash signature verification (for all dm
verity volumes).
Signed-off-by: NJaskaran Khurana <jaskarankhurana@linux.microsoft.com>
Tested-and-Reviewed-by: NMilan Broz <gmazyland@gmail.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>

88cd3e6c

J
Documentation: Update Documentation for iommu.passthrough · c8fb436b
由 Joerg Roedel 提交于 8月 19, 2019
```
This kernel parameter now takes also effect on X86.
Signed-off-by: NJoerg Roedel <jroedel@suse.de>
```
c8fb436b

22 8月, 2019 1 次提交

powerpc: Document xmon options · 6278f55b

由 Gustavo Romero 提交于 8月 14, 2019

Document all options currently supported by xmon debugger.
Signed-off-by: NGustavo Romero <gromero@linux.ibm.com>
Signed-off-by: NMichael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814205638.25322-1-gromero@linux.ibm.com

6278f55b

20 8月, 2019 2 次提交

security: Add a static lockdown policy LSM · 000d388e

由 Matthew Garrett 提交于 8月 19, 2019

While existing LSMs can be extended to handle lockdown policy,
distributions generally want to be able to apply a straightforward
static policy. This patch adds a simple LSM that can be configured to
reject either integrity or all lockdown queries, and can be configured
at runtime (through securityfs), boot time (via a kernel parameter) or
build time (via a kconfig option). Based on initial code by David
Howells.
Signed-off-by: NMatthew Garrett <mjg59@google.com>
Reviewed-by: NKees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: NJames Morris <jmorris@namei.org>

000d388e

x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · c49a0a80

由 Tom Lendacky 提交于 8月 19, 2019

There have been reports of RDRAND issues after resuming from suspend on
some AMD family 15h and family 16h systems. This issue stems from a BIOS
not performing the proper steps during resume to ensure RDRAND continues
to function properly.

RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
support using CPUID, including the kernel, will believe that RDRAND is
not supported.

Update the CPU initialization to clear the RDRAND CPUID bit for any family
15h and 16h processor that supports RDRAND. If it is known that the family
15h or family 16h system does not have an RDRAND resume issue or that the
system will not be placed in suspend, the "rdrand=force" kernel parameter
can be used to stop the clearing of the RDRAND CPUID bit.

Additionally, update the suspend and resume path to save and restore the
MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
place after resuming from suspend.

Note, that clearing the RDRAND CPUID bit does not prevent a processor
that normally supports the RDRAND instruction from executing it. So any
code that determined the support based on family and model won't #UD.
Signed-off-by: NTom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "x86@kernel.org" <x86@kernel.org>
Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com

c49a0a80

17 8月, 2019 1 次提交

ia64: remove the zx1 swiotlb machvec · df43acac

由 Christoph Hellwig 提交于 8月 13, 2019

The aim of this machvec is to support devices with < 32-bit dma
masks. But given that ia64 only has a ZONE_DMA32 and not a ZONE_DMA
that isn't supported by swiotlb either.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20190813072514.23299-21-hch@lst.deSigned-off-by: NTony Luck <tony.luck@intel.com>

df43acac

14 8月, 2019 1 次提交

rcu/nocb: Rename rcu_nocb_leader_stride kernel boot parameter · f7c612b0

由 Paul E. McKenney 提交于 4月 02, 2019

This commit changes the name of the rcu_nocb_leader_stride kernel
boot parameter to rcu_nocb_gp_stride in order to account for the new
distinction between callback and grace-period no-CBs kthreads.
Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>

f7c612b0

09 8月, 2019 2 次提交

docs: admin-guide: remove references to IPX and token-ring · 7e7c076e

由 Stephen Hemminger 提交于 8月 05, 2019

Both IPX and TR have not been supported for a while now.
Remove them from the /proc/sys/net documentation.
Signed-off-by: NStephen Hemminger <stephen@networkplumber.org>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

7e7c076e

PCI: Correct pci=resource_alignment parameter example · 3b1b1ce3

由 Alexey Kardashevskiy 提交于 6月 06, 2019

The "pci=resource_alignment" parameter is described as requiring an order
(not a size) and the code in pci_specified_resource_alignment() expects an
order.

But the example wrongly shows a size.  Convert the example to an order.

Fixes: 8b078c60 ("PCI: Update "pci=resource_alignment" documentation")
Link: https://lore.kernel.org/r/20190606032557.107542-1-aik@ozlabs.ruSigned-off-by: NAlexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: NBjorn Helgaas <bhelgaas@google.com>

3b1b1ce3

04 8月, 2019 1 次提交

Documentation: Add swapgs description to the Spectre v1 documentation · 4c920576

由 Josh Poimboeuf 提交于 8月 03, 2019

Add documentation to the Spectre document about the new swapgs variant of
Spectre v1.
Signed-off-by: NJosh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>

4c920576

02 8月, 2019 1 次提交

rcu: Add kernel parameter to dump trace after RCU CPU stall warning · cdc694b2

由 Paul E. McKenney 提交于 6月 13, 2019

This commit adds a rcu_cpu_stall_ftrace_dump kernel boot parameter, that,
when set, causes the trace buffer to be dumped after an RCU CPU stall
warning is printed. This kernel boot parameter is disabled by default,
maintaining compatibility with previous behavior.
Signed-off-by: NPaul E. McKenney <paulmck@linux.ibm.com>

cdc694b2

01 8月, 2019 5 次提交

of/platform: Add functional dependency link from DT bindings · 690ff788

由 Saravana Kannan 提交于 7月 31, 2019

Add device-links after the devices are created (but before they are
probed) by looking at common DT bindings like clocks and
interconnects.

Automatically adding device-links for functional dependencies at the
framework level provides the following benefits:

- Optimizes device probe order and avoids the useless work of
attempting probes of devices that will not probe successfully
(because their suppliers aren't present or haven't probed yet).

For example, in a commonly available mobile SoC, registering just
one consumer device's driver at an initcall level earlier than the
supplier device's driver causes 11 failed probe attempts before the
consumer device probes successfully. This was with a kernel with all
the drivers statically compiled in. This problem gets a lot worse if
all the drivers are loaded as modules without direct symbol
dependencies.

- Supplier devices like clock providers, interconnect providers, etc
need to keep the resources they provide active and at a particular
state(s) during boot up even if their current set of consumers don't
request the resource to be active. This is because the rest of the
consumers might not have probed yet and turning off the resource
before all the consumers have probed could lead to a hang or
undesired user experience.

Some frameworks (Eg: regulator) handle this today by turning off
"unused" resources at late_initcall_sync and hoping all the devices
have probed by then. This is not a valid assumption for systems with
loadable modules. Other frameworks (Eg: clock) just don't handle
this due to the lack of a clear signal for when they can turn off
resources. This leads to downstream hacks to handle cases like this
that can easily be solved in the upstream kernel.

By linking devices before they are probed, we give suppliers a clear
count of the number of dependent consumers. Once all of the
consumers are active, the suppliers can turn off the unused
resources without making assumptions about the number of consumers.

By default we just add device-links to track "driver presence" (probe
succeeded) of the supplier device. If any other functionality provided
by device-links are needed, it is left to the consumer/supplier
devices to change the link when they probe.

kbuild test robot reported clang error about missing const
Reported-by: Nkbuild test robot <lkp@intel.com>
Signed-off-by: NSaravana Kannan <saravanak@google.com>
Link: https://lore.kernel.org/r/20190731221721.187713-4-saravanak@google.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

690ff788

docs: fs: cifs: convert to ReST and add to admin-guide book · f139291c

由 Mauro Carvalho Chehab 提交于 7月 31, 2019

The filenames for cifs documentation is not using the same
convention as almost all Kernel documents is using. So,
rename them to a more appropriate name. Then, manually convert
the documentation files for CIFS to ReST.

By doing a manual conversion, we can preserve the original
author's style, while making it to look more like the other
Kernel documents.

Most of the conversion here is trivial. The most complex one was
the README file (which was renamed to usage.rst).
Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

f139291c

docs: wimax: convert to ReST and add to admin-guide · ff497db2

由 Mauro Carvalho Chehab 提交于 7月 26, 2019

Manually convert wimax documentation to ReST and add theit
to the Kernel doc body, inside the admin-guide.
Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

ff497db2

docs: admin-guide: add auxdisplay files to it after conversion to ReST · 76b5a6e8

由 Mauro Carvalho Chehab 提交于 7月 26, 2019

Those two files describe userspace-faced information. While part of
it might fit on uAPI, it sounds to me that the admin guide is the
best place for them.
Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: NMiguel Ojeda <miguel.ojeda.sandonis@gmail.com>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

76b5a6e8

Documentation: filesystems: Convert ufs.txt to reStructuredText format · 34d5f4f2

由 Shobhit Kukreti 提交于 7月 10, 2019

This converts the plain text documentation of ufs.txt to
reStructuredText format. Added to documentation build process
and verified with make htmldocs
Signed-off-by: NShobhit Kukreti <shobhitkukreti@gmail.com>
Reviewed-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: NJonathan Corbet <corbet@lwn.net>

34d5f4f2

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功