- 05 Nov 2019, 2 commits

By Gomez Iglesias, Antonio
Add the initial ITLB_MULTIHIT documentation.
[ tglx: Add it to the index so it gets actually built. ]
Signed-off-by: Antonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: Nelson D'Souza <nelson.dsouza@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
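
Once built, the new document pairs with the runtime reporting interface. A quick way to check the ITLB multihit state on a running system (a sketch; the sysfs file below is the standard hw-vuln reporting path):

    # Report the ITLB_MULTIHIT status and the active mitigation, if any
    cat /sys/devices/system/cpu/vulnerabilities/itlb_multihit
    # Example output: "KVM: Mitigation: Split huge pages"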

By Junaid Shahid
The page table pages corresponding to broken down large pages are zapped in FIFO order, so that the large page can potentially be recovered, if it is no longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
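
The recovery rate is tunable through a companion KVM module parameter. A sketch, assuming the nx_huge_pages_recovery_ratio parameter introduced by this series (verify the name against kernel-parameters.txt for your tree):

    # Fraction (1/N) of shattered huge pages considered per recovery pass
    cat /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio
    # Setting it to 0 disables periodic recovery altogether
    echo 0 > /sys/module/kvm/parameters/nx_huge_pages_recovery_ratio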

- 04 Nov 2019, 1 commit

By Paolo Bonzini
With some Intel processors, putting the same virtual address in the TLB as both a 4 KiB and 2 MiB page can confuse the instruction fetch unit and cause the processor to issue a machine check resulting in a CPU lockup. Unfortunately when EPT page tables use huge pages, it is possible for a malicious guest to cause this situation. Add a knob to mark huge pages as non-executable. When the nx_huge_pages parameter is enabled (and we are using EPT), all huge pages are marked as NX. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable. This is not an issue for shadow paging (except nested EPT), because then the host is in control of TLB flushes and the problematic situation cannot happen. With nested EPT, again the nested guest can cause problems, so shadow and direct EPT are treated in the same way.
[ tglx: Fixup default to auto and massage wording a bit ]
Originally-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
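
The knob is exposed as a KVM module parameter. A hedged sketch of how it is typically set from the kernel command line (the parameter and its three values come from this commit and the tglx note above):

    # Trust the CPU vulnerability detection (the default after this series)
    kvm.nx_huge_pages=auto
    # Force all EPT huge pages to be mapped non-executable regardless of errata
    kvm.nx_huge_pages=force
    # Opt out, e.g. on unaffected hardware where huge-page ITLB reach matters
    kvm.nx_huge_pages=off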

- 28 Oct 2019, 3 commits

By Pawan Gupta
Add the documentation for TSX Async Abort. Include the description of the issue, how to check the mitigation state, how to control the mitigation, and guidance for system administrators.
[ bp: Add proper SPDX tags, touch ups by Josh and me. ]
Co-developed-by: Antonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Antonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mark Gross <mgross@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
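
Per the document added here, the mitigation state is reported through sysfs and controlled on the kernel command line. A sketch (paths and values as described in tsx_async_abort.rst):

    # Check whether this CPU is affected and which mitigation is active
    cat /sys/devices/system/cpu/vulnerabilities/tsx_async_abort
    # Boot-time control, e.g. full mitigation plus disabling SMT
    tsx_async_abort=full,nosmt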

By Pawan Gupta
Platforms which are not affected by X86_BUG_TAA may want the TSX feature enabled. Add an "auto" option to the TSX cmdline parameter: with tsx=auto, TSX is disabled when X86_BUG_TAA is present, and enabled otherwise. More details on X86_BUG_TAA can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html
[ bp: Extend the arg buffer to accommodate "auto\0". ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

By Pawan Gupta
Add a kernel cmdline parameter "tsx" to control the Transactional Synchronization Extensions (TSX) feature. On CPUs that support TSX control, use "tsx=on|off" to enable or disable TSX. Not specifying this option is equivalent to "tsx=off". This is because on certain processors TSX may be used as a part of a speculative side channel attack. Carve out the TSX controlling functionality into a separate compilation unit because TSX is a CPU feature while the TSX async abort control machinery will go to cpu/bugs.c.
[ bp: - Massage, shorten and clear the arg buffer. - Clarifications of the tsx= possible options - Josh. - Expand on TSX_CTRL availability - Pawan. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
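
Together with the "auto" addition above, the parameter accepts three values. A sketch of the boot-time choices (as documented in kernel-parameters.txt):

    tsx=on    # always enable TSX on TSX_CTRL-capable CPUs
    tsx=off   # always disable TSX (also the default when the option is absent)
    tsx=auto  # disable TSX only if the CPU is affected by X86_BUG_TAA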

- 08 Oct 2019, 2 commits

By Chris Down
cgroup v2 introduces two memory protection thresholds: memory.low (best-effort) and memory.min (hard protection). While they generally do what they say on the tin, there is a limitation in their implementation that makes them difficult to use effectively: cliff behaviour often manifests when they become eligible for reclaim. This patch implements more intuitive and usable behaviour, where we gradually mount more reclaim pressure as cgroups further and further exceed their protection thresholds.

This cliff edge behaviour happens because we only choose whether or not to reclaim based on whether the memcg is within its protection limits (see the use of mem_cgroup_protected in shrink_node), but we don't vary our reclaim behaviour based on this information. Imagine the following timeline, with the numbers being the lruvec size in this zone:

1. memory.low=1000000, memory.current=999999. 0 pages may be scanned.
2. memory.low=1000000, memory.current=1000000. 0 pages may be scanned.
3. memory.low=1000000, memory.current=1000001. 1000001* pages may be scanned. (?!)

* Of course, we won't usually scan all available pages in the zone even without this patch because of scan control priority, over-reclaim protection, etc. However, as shown by the tests at the end, these techniques don't sufficiently throttle such an extreme change in input, so cliff-like behaviour isn't really averted by their existence alone.

Here's an example of how this plays out in practice. At Facebook, we are trying to protect various workloads from "system" software, like configuration management tools, metric collectors, etc (see this[0] case study). In order to find a suitable memory.low value, we start by determining the expected memory range within which the workload will be comfortable operating. This isn't an exact science -- memory usage deemed "comfortable" will vary over time due to user behaviour, differences in composition of work, etc, etc. As such we need to ballpark memory.low, but doing this is currently problematic:

1. If we end up setting it too low for the workload, it won't have *any* effect (see discussion above). The group will receive the full weight of reclaim and won't have any priority while competing with the less important system software, as if we had no memory.low configured at all.
2. Because of this behaviour, we end up erring on the side of setting it too high, such that the comfort range is reliably covered. However, protected memory is completely unavailable to the rest of the system, so we might cause undue memory and IO pressure there when we *know* we have some elasticity in the workload.
3. Even if we get the value totally right, smack in the middle of the comfort zone, we get extreme jumps between no pressure and full pressure that cause unpredictable pressure spikes in the workload due to the current binary reclaim behaviour.

With this patch, we can set it to our ballpark estimation without too much worry. Any undesirable behaviour, such as too much or too little reclaim pressure on the workload or system, will be proportional to how far our estimation is off. This means we can set memory.low much more conservatively and thus waste less resources *without* the risk of the workload falling off a cliff if we overshoot.
As a more abstract technical description, this unintuitive behaviour results in having to give high-priority workloads a large protection buffer on top of their expected usage to function reliably, as otherwise we have abrupt periods of dramatically increased memory pressure which hamper performance. Having to set these thresholds so high wastes resources and generally works against the principle of work conservation.

In addition, having proportional memory reclaim behaviour has other benefits. Most notably, before this patch it's basically mandatory to set memory.low to a higher than desirable value because otherwise as soon as you exceed memory.low, all protection is lost, and all pages are eligible to scan again. By contrast, having a gradual ramp in reclaim pressure means that you now still get some protection when thresholds are exceeded, which means that one can now be more comfortable setting memory.low to lower values without worrying that all protection will be lost. This is important because workingset size is really hard to know exactly, especially with variable workloads, so at least getting *some* protection if your workingset size grows larger than you expect increases user confidence in setting memory.low without a huge buffer on top being needed.

Thanks a lot to Johannes Weiner and Tejun Heo for their advice and assistance in thinking about how to make this work better. In testing these changes, I intended to verify that:

1. Changes in page scanning become gradual and proportional instead of binary. To test this, I experimented stepping further and further down memory.low protection on a workload that floats around 19G workingset when under memory.low protection, watching page scan rates for the workload cgroup:

   +------------+-----------------+--------------------+--------------+
   | memory.low | test (pgscan/s) | control (pgscan/s) | % of control |
   +------------+-----------------+--------------------+--------------+
   | 21G        | 0               | 0                  | N/A          |
   | 17G        | 867             | 3799               | 23%          |
   | 12G        | 1203            | 3543               | 34%          |
   | 8G         | 2534            | 3979               | 64%          |
   | 4G         | 3980            | 4147               | 96%          |
   | 0          | 3799            | 3980               | 95%          |
   +------------+-----------------+--------------------+--------------+

   As you can see, the test kernel (with a kernel containing this patch) ramps up page scanning significantly more gradually than the control kernel (without this patch).

2. More gradual ramp up in reclaim aggression doesn't result in premature OOMs. To test this, I wrote a script that slowly increments the number of pages held by stress(1)'s --vm-keep mode until a production system entered severe overall memory contention. This script runs in a highly protected slice taking up the majority of available system memory. Watching vmstat revealed that page scanning continued essentially nominally between test and control, without causing forward reclaim progress to become arrested.
[0]: https://facebookmicrosites.github.io/cgroup2/docs/overview.html#case-study-the-fbtax2-project
[akpm@linux-foundation.org: reflow block comments to fit in 80 cols]
[chris@chrisdown.name: handle cgroup_disable=memory when getting memcg protection]
Link: http://lkml.kernel.org/r/20190201045711.GA18302@chrisdown.name
Link: http://lkml.kernel.org/r/20190124014455.GA6396@chrisdown.name
Signed-off-by: Chris Down <chris@chrisdown.name>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Roman Gushchin <guro@fb.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
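
The knob itself is unchanged; what changes is how safe a ballpark value is. A minimal cgroup v2 sketch (paths assume a mounted v2 hierarchy; the 19G value echoes the workingset in the test above):

    # Give the workload best-effort protection for roughly its workingset
    mkdir -p /sys/fs/cgroup/workload
    echo 19G > /sys/fs/cgroup/workload/memory.low
    echo $$ > /sys/fs/cgroup/workload/cgroup.procs
    # With proportional reclaim, overshooting memory.low now ramps scan
    # pressure up gradually instead of exposing the full scan rate at once.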

By Boris Ostrovsky
Currently execution of panic() continues until Xen's panic notifier (xen_panic_event()) is called, at which point we make a hypercall that never returns. This means that any notifier that is supposed to be called later, as well as a significant part of panic() code (such as pstore writes from kmsg_dump()), is never executed. There is no reason for xen_panic_event() to be this last point in execution, since panic()'s emergency_restart() will call into xen_emergency_restart(), from where we can perform our hypercall. Nevertheless, we will provide a xen_legacy_crash boot option that will preserve the original behavior during crash. This option could be used, for example, if running a kernel dumper (which happens after panic notifiers) is undesirable.
Reported-by: James Dingwall <james@dingwall.me.uk>
Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Reviewed-by: Juergen Gross <jgross@suse.com>

- 25 Sep 2019, 2 commits

By Michal Hocko
The cgroup v1 memcg controller has exposed a dedicated kmem limit to users, which turned out to be a really bad idea because there are paths which cannot shrink kernel memory usage enough to get below the limit (e.g. because the accounted memory is not reclaimable). There are cases when failure is not even allowed (e.g. __GFP_NOFAIL). This means that the kmem limit can be in excess of the hard limit without any way to shrink, and is thus completely useless. The OOM killer cannot be invoked to handle the situation because that would lead to premature OOM killing. As a result many places might see ENOMEM returned from kmalloc and run into unexpected errors, e.g. a global OOM killer firing when there is a lot of free memory, because ENOMEM is translated into VM_FAULT_OOM in the #PF path and pagefault_out_of_memory therefore invokes the OOM killer. Please note that kernel memory is still accounted to the overall limit along with user memory, so removing the kmem-specific limit should still allow containing kernel memory consumption. Unlike the kmem limit, though, the overall limit invokes memory reclaim and targeted memcg OOM killing if necessary. Start the deprecation process by crying to the kernel log. Let's see whether there are relevant usecases, and simply return EINVAL in the second stage if nobody complains within a few releases.
[akpm@linux-foundation.org: tweak documentation text]
Link: http://lkml.kernel.org/r/20190911151612.GI4023@dhcp22.suse.cz
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
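
For reference, the knob being deprecated is the cgroup v1 file below; after this change writing to it still works but logs a warning (a sketch against a mounted v1 memcg hierarchy):

    # Legacy cgroup v1 interface, now on its way out
    echo 512M > /sys/fs/cgroup/memory/mygroup/memory.kmem.limit_in_bytes
    dmesg | tail -n 2    # expect the deprecation message from memcg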

By Vlastimil Babka
The debug_pagealloc functionality is useful to catch buggy page allocator users that cause e.g. use after free or double free. When page inconsistency is detected, debugging is often simpler when one knows the call stack of the process that last allocated and freed the page. When page_owner is also enabled, we record the allocation stack trace, but not the freeing one. This patch therefore adds recording of the freeing process stack trace to the page owner info, if both page_owner and debug_pagealloc are configured and enabled. With only page_owner enabled, this info is not useful for the memory leak debugging use case. dump_page() is adjusted to print the info. An example result of calling __free_pages() twice may look like this (note the page last free stack trace):

  BUG: Bad page state in process bash  pfn:13d8f8
  page:ffffc31984f63e00 refcount:-1 mapcount:0 mapping:0000000000000000 index:0x0
  flags: 0x1affff800000000()
  raw: 01affff800000000 dead000000000100 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 ffffffffffffffff 0000000000000000
  page dumped because: nonzero _refcount
  page_owner tracks the page as freed
  page last allocated via order 0, migratetype Unmovable, gfp_mask 0xcc0(GFP_KERNEL)
   prep_new_page+0x143/0x150
   get_page_from_freelist+0x289/0x380
   __alloc_pages_nodemask+0x13c/0x2d0
   khugepaged+0x6e/0xc10
   kthread+0xf9/0x130
   ret_from_fork+0x3a/0x50
  page last free stack trace:
   free_pcp_prepare+0x134/0x1e0
   free_unref_page+0x18/0x90
   khugepaged+0x7b/0xc10
   kthread+0xf9/0x130
   ret_from_fork+0x3a/0x50
  Modules linked in:
  CPU: 3 PID: 271 Comm: bash Not tainted 5.3.0-rc4-2.g07a1a73-default+ #57
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58-prebuilt.qemu.org 04/01/2014
  Call Trace:
   dump_stack+0x85/0xc0
   bad_page.cold+0xba/0xbf
   rmqueue_pcplist.isra.0+0x6c5/0x6d0
   rmqueue+0x2d/0x810
   get_page_from_freelist+0x191/0x380
   __alloc_pages_nodemask+0x13c/0x2d0
   __get_free_pages+0xd/0x30
   __pud_alloc+0x2c/0x110
   copy_page_range+0x4f9/0x630
   dup_mmap+0x362/0x480
   dup_mm+0x68/0x110
   copy_process+0x19e1/0x1b40
   _do_fork+0x73/0x310
   __x64_sys_clone+0x75/0x80
   do_syscall_64+0x6e/0x1e0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  RIP: 0033:0x7f10af854a10
  ...

Link: http://lkml.kernel.org/r/20190820131828.22684-5-vbabka@suse.cz
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
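
Both features are boot-time toggles on kernels built with CONFIG_PAGE_OWNER and CONFIG_DEBUG_PAGEALLOC. A sketch of enabling them together so that free stacks are recorded:

    # Kernel command line: record alloc/free stacks and poison freed pages
    page_owner=on debug_pagealloc=on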

- 14 Sep 2019, 2 commits

By Palmer Dabbelt
This argument is supported on RISC-V systems and widely used, but was not documented here.
Signed-off-by: Palmer Dabbelt <palmer@sifive.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

By Ian Abbott
Describe how the comedi minor device numbers are split across comedi devices and comedi subdevices. Replace the current, long dead URL with an official URL for the Comedi project.
Signed-off-by: Ian Abbott <abbotti@mev.co.uk>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

- 12 Sep 2019, 1 commit

By Nikos Tsironis
Add the dm-clone target, which allows cloning of arbitrary block devices. dm-clone produces a one-to-one copy of an existing, read-only source device into a writable destination device: it presents a virtual block device which makes all data appear immediately, and redirects reads and writes accordingly. The main use case of dm-clone is to clone a potentially remote, high-latency, read-only, archival-type block device into a writable, fast, primary-type device for fast, low-latency I/O. The cloned device is visible/mountable immediately and the copy of the source device to the destination device happens in the background, in parallel with user I/O. When the cloning completes, the dm-clone table can be removed altogether and be replaced, e.g., by a linear table, mapping directly to the destination device. For further information and examples of how to use dm-clone, please read Documentation/admin-guide/device-mapper/dm-clone.rst
Suggested-by: Vangelis Koukis <vkoukis@arrikto.com>
Co-developed-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Ilias Tsitsimpis <iliastsi@arrikto.com>
Signed-off-by: Nikos Tsironis <ntsironis@arrikto.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
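
A rough dmsetup sketch of the flow described above (the table line follows the constructor format in dm-clone.rst; the device names, 1048576000-sector size and 8-sector region size are made up for illustration):

    # 0 <size> clone <metadata dev> <destination dev> <source dev> <region size>
    dmsetup create clone --table \
        "0 1048576000 clone /dev/sdb1 /dev/sdb2 /dev/sdc 8"
    # ... background hydration completes; remap directly to the destination:
    dmsetup suspend clone
    dmsetup load clone --table "0 1048576000 linear /dev/sdb2 0"
    dmsetup resume clone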

- 11 Sep 2019, 1 commit

By Lu Baolu
This adds a helper to check whether a device needs to use the bounce buffer. It also provides a boot time option to disable the bounce buffer. Users can use this to prevent the iommu driver from using the bounce buffer for performance gain.
Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Lu Baolu <baolu.lu@linux.intel.com>
Tested-by: Xu Pengfei <pengfei.xu@intel.com>
Tested-by: Mika Westerberg <mika.westerberg@intel.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>

- 08 Sep 2019, 1 commit

By Alexander Schremmer
This feature is optionally found in the T480s, T490 and T490s. The feature is called lcdshadow and is visible via /proc/acpi/ibm/lcdshadow. The ACPI methods \_SB.PCI0.LPCB.EC.HKEY.{GSSS,SSSS,TSSS,CSSS} are available in these machines. They get, set, toggle or change the state, apparently. The patch was tested on a 5.0 series kernel on a T480s.
Signed-off-by: Alexander Schremmer <alex@alexanderweb.de>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
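
Following the usual thinkpad_acpi procfs conventions, usage should look roughly like this (a sketch; the accepted values are an assumption, so check the driver documentation):

    cat /proc/acpi/ibm/lcdshadow           # query the PrivacyGuard state
    echo 1 > /proc/acpi/ibm/lcdshadow      # enable the LCD shadow
    echo 0 > /proc/acpi/ibm/lcdshadow      # disable it again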

- 06 Sep 2019, 1 commit

By Adam Borowski
This advice is obsolete and slightly harmful for filesystems from this millennium: any modern filesystem can handle unexpected crashes without requiring fsck -- and on the other hand, trying to write to the disk when the kernel is in a bad state risks introducing corruption. For ext2, any unsafe shutdown meant widespread breakage, but it's no longer a reasonable filesystem for any non-special use.
Signed-off-by: Adam Borowski <kilobyte@angband.pl>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

- 05 Sep 2019, 1 commit

By Nicholas Piggin
Introduce two options to control the use of the tlbie instruction: a boot time option which completely disables the kernel's use of the instruction (currently incompatible with HASH MMU, KVM, and coherent accelerators), and a debugfs option which can be switched at runtime and avoids using tlbie for invalidating CPU TLBs for normal process and kernel address mappings. Coherent accelerators are still managed with tlbie, as will be KVM partition scope translations. Cross-CPU TLB flushing is implemented with IPIs and tlbiel. This is a basic implementation which does not attempt to make any optimisation beyond the tlbie implementation. This is useful for performance testing among other things. For example, in certain situations on large systems, using IPIs may be faster than tlbie as they can be directed rather than broadcast. Later we may also take advantage of the IPIs to do more interesting things such as trim the mm cpumask more aggressively.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190902152931.17840-7-npiggin@gmail.com

- 04 Sep 2019, 1 commit

By Stefan-Gabriel Mirea
Introduce support for the LINFlex driver, based on:
- the version of the Freescale LPUART driver after commit b3e3bf2e ("Merge 4.0-rc7 into tty-next");
- commit abf1e0a9 ("tty: serial: fsl_lpuart: lock port on console write").
In this basic version, the driver can be tested using initramfs and relies on the clocks and pin muxing set up by U-Boot. Remarks concerning the earlycon support:
- LinFlexD does not allow character transmissions in the INIT mode (see section 47.4.2.1 in the reference manual[1]). Therefore, a mutual exclusion between the first linflex_setup_watermark/linflex_set_termios executions and linflex_earlycon_putchar was employed, and the characters normally sent to earlycon during initialization are kept in a buffer and sent afterwards.
- Empirically, character transmission is also forbidden within the last 1-2 ms before entering the INIT mode, so we use an explicit timeout (PREINIT_DELAY) between linflex_earlycon_putchar and the first call to linflex_setup_watermark.
- U-Boot currently uses the UART FIFO mode, while this driver makes the transition to the buffer mode. Therefore, the earlycon putchar function matches the U-Boot behavior before initialization and the Linux behavior after.
[1] https://www.nxp.com/webapp/Download?colCode=S32V234RM
Signed-off-by: Stoica Cosmin-Stefan <cosmin.stoica@nxp.com>
Signed-off-by: Adrian.Nitu <adrian.nitu@freescale.com>
Signed-off-by: Larisa Grigore <Larisa.Grigore@nxp.com>
Signed-off-by: Ana Nedelcu <B56683@freescale.com>
Signed-off-by: Mihaela Martinas <Mihaela.Martinas@freescale.com>
Signed-off-by: Matthew Nunez <matthew.nunez@nxp.com>
[stefan-gabriel.mirea@nxp.com: Reduced for upstreaming and implemented earlycon support]
Signed-off-by: Stefan-Gabriel Mirea <stefan-gabriel.mirea@nxp.com>
Link: https://lore.kernel.org/r/20190809112853.15846-6-stefan-gabriel.mirea@nxp.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

- 03 Sep 2019, 3 commits

By Marcos Paulo de Souza
This argument has not been considered since blk-mq became the default, so remove this documentation to avoid confusion.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
.txt file is now .rst
Signed-off-by: Jens Axboe <axboe@kernel.dk>

By Marcos Paulo de Souza
Since the inclusion of blk-mq, the elevator argument was no longer being considered; its utility died along with the legacy IO path, now removed too.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bob Liu <bob.liu@oracle.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Marcos Paulo de Souza <marcos.souza.org@gmail.com>
Fold with doc removal patch.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

By Patrick Bellasi
The cgroup CPU bandwidth controller allows assigning a specified (maximum) bandwidth to the tasks of a group. However this bandwidth is defined and enforced only on a temporal base, without considering the actual frequency a CPU is running on. Thus, the amount of computation completed by a task within an allocated bandwidth can be very different depending on the actual frequency the CPU is running that task at. The amount of computation can also be affected by the specific CPU a task is running on, especially when running on asymmetric capacity systems like Arm's big.LITTLE.

With the availability of schedutil, the scheduler is now able to drive frequency selections based on actual task utilization. Moreover, the utilization clamping support provides a mechanism to bias the frequency selection operated by schedutil depending on constraints assigned to the tasks currently RUNNABLE on a CPU. Given the mechanisms described above, it is now possible to extend the cpu controller to specify the minimum (or maximum) utilization which should be considered for tasks RUNNABLE on a cpu. This makes it possible to better define the actual computational power assigned to task groups, thus improving the cgroup CPU bandwidth controller, which is currently based just on time constraints.

Extend the CPU controller with a couple of new attributes, uclamp.{min,max}, which allow enforcing utilization boosting and capping for all the tasks in a group. Specifically:

- uclamp.min: defines the minimum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run at least at a minimum frequency which corresponds to the uclamp.min utilization
- uclamp.max: defines the maximum utilization which should be considered, i.e. the RUNNABLE tasks of this group will run up to a maximum frequency which corresponds to the uclamp.max utilization

These attributes:

a) are available only for non-root nodes, both on default and legacy hierarchies, while system wide clamps are defined by a generic interface which does not depend on cgroups. This system wide interface enforces constraints on tasks in the root node.
b) enforce effective constraints at each level of the hierarchy which are a restriction of the group requests considering its parent's effective constraints. Root group effective constraints are defined by the system wide interface. This mechanism allows each (non-root) level of the hierarchy to:
   - request whatever clamp values it would like to get
   - effectively get only up to the maximum amount allowed by its parent
c) have higher priority than task-specific clamps, defined via sched_setattr(), thus allowing to control and restrict task requests.

Add two new attributes to the cpu controller to collect "requested" clamp values. Allow that at each non-root level of the hierarchy. Keep it simple by not caring now about "effective" values computation and propagation along the hierarchy. Update sysctl_sched_uclamp_handler() to use the newly introduced uclamp_mutex so that we serialize system default updates with cgroup related updates.
Signed-off-by: Patrick Bellasi <patrick.bellasi@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Michal Koutny <mkoutny@suse.com>
Acked-by: Tejun Heo <tj@kernel.org>
Cc: Alessio Balsini <balsini@android.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
Cc: Joel Fernandes <joelaf@google.com>
Cc: Juri Lelli <juri.lelli@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Paul Turner <pjt@google.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: Steve Muckle <smuckle@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Todd Kjos <tkjos@google.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
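
On the unified hierarchy the new attributes surface as cpu.uclamp.min and cpu.uclamp.max, taking percentages of maximum capacity. An illustrative sketch (the group name is made up; see the cgroup-v2 document for the exact value syntax):

    echo 20 > /sys/fs/cgroup/mygroup/cpu.uclamp.min   # floor: run at >= 20% capacity
    echo 80 > /sys/fs/cgroup/mygroup/cpu.uclamp.max   # cap: never request > 80%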

- 30 Aug 2019, 1 commit

By Ram Pai
Make the Enter-Secure-Mode (ESM) ultravisor call to switch the VM to secure mode. Pass the kernel base address and FDT address so that the Ultravisor is able to verify the integrity of the VM using information from the ESM blob. Add an "svm=" command line option to turn on switching to secure mode.
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
[ andmike: Generate an RTAS os-term hcall when the ESM ucall fails. ]
Signed-off-by: Michael Anderson <andmike@linux.ibm.com>
[ bauerman: Cleaned up the code a bit. ]
Signed-off-by: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190820021326.6884-5-bauerman@linux.ibm.com
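
The switch is opt-in from the guest's command line. A sketch (the option takes on/off per the powerpc documentation this series touches):

    # Guest kernel command line: request the transition to secure mode via ESM
    svm=on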

- 29 Aug 2019, 2 commits

By Tejun Heo
Add a script which can be used to generate device-specific iocost linear model coefficients.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

By Tejun Heo
This patchset implements an IO cost model based work-conserving proportional controller. While io.latency provides the capability to comprehensively prioritize and protect IOs depending on the cgroups, its protection is binary - the lowest latency target cgroup which is suffering is protected at the cost of all others. In many use cases, including stacking multiple workload containers in a single system, it's necessary to distribute IO capacity with better granularity. One challenge of controlling IO resources is the lack of a trivially observable cost metric. The most common metrics - bandwidth and iops - can be off by orders of magnitude depending on the device type and IO pattern. However, the cost isn't a complete mystery. Given several key attributes, we can make fairly reliable predictions on how expensive a given stream of IOs would be, at least compared to other IO patterns. The function which determines the cost of a given IO is the IO cost model for the device. This controller distributes IO capacity based on the costs estimated by such a model. The more accurate the cost model the better, but the controller adapts based on IO completion latency, and as long as the relative costs across different IO patterns are consistent and sensible, it'll adapt to the actual performance of the device. Currently, the only implemented cost model is a simple linear one with a few sets of default parameters for different classes of device. This covers most common devices reasonably well. All the infrastructure to tune and add different cost models is already in place and a later patch will also allow using bpf progs for cost models. Please see the top comment in blk-iocost.c and the documentation for more details.
v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix for a divide-by-zero bug in current_hweight() triggered by zero inuse_sum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
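
The controller is enabled per device from the cgroup root. A sketch of turning it on and distributing capacity by weight (io.cost.qos is the root-only control file; 8:0 is the target device's MAJ:MIN):

    # Enable iocost for /dev/sda with the default linear cost model and QoS
    echo "8:0 enable=1" > /sys/fs/cgroup/io.cost.qos
    # Give one group twice the default share of the device's capacity
    echo "default 200" > /sys/fs/cgroup/mygroup/io.weight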

- 28 Aug 2019, 2 commits

By Joakim Zhang
Add some documentation describing the DDR PMU residing in the Freescale i.MX8 SoC and its perf driver implementation in Linux.
Signed-off-by: Joakim Zhang <qiangqing.zhang@nxp.com>
Signed-off-by: Will Deacon <will@kernel.org>
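
From the referenced document, the counters are driven through the normal perf interface. A sketch (the imx8_ddr0 PMU name and cycles event follow that document and may differ between SoCs):

    # Count DDR cycles system-wide on the first DDR controller for one second
    perf stat -a -e imx8_ddr0/cycles/ sleep 1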

By Greg Kroah-Hartman
This reverts commit 690ff788. Based on a lot of email and in-person discussions, this patch series is being reworked to address a number of issues that were pointed out and need to be taken care of before it should be merged. It will be resubmitted with those changes, hopefully soon.
Cc: Frank Rowand <frowand.list@gmail.com>
Cc: Saravana Kannan <saravanak@google.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

- 23 Aug 2019, 2 commits

By Jaskaran Khurana
The verification is to support cases where the root hash is not secured by Trusted Boot, UEFI Secure Boot or similar technologies. One of the use cases for this is dm-verity volumes mounted after boot: the root hash provided during the creation of the dm-verity volume has to be secure, and thus the in-kernel validation implemented here will be used before we trust the root hash and allow the block device to be created. The signature being provided for verification must verify the root hash and must be trusted by the builtin keyring for verification to succeed. The hash is added as a key of type "user" and the description is passed to the kernel so it can look it up and use it for verification. Adds CONFIG_DM_VERITY_VERIFY_ROOTHASH_SIG which can be turned on if root hash verification is needed. The dm_verity module parameter 'require_signatures' on the kernel command line indicates whether to force root hash signature verification (for all dm-verity volumes).
Signed-off-by: Jaskaran Khurana <jaskarankhurana@linux.microsoft.com>
Tested-and-Reviewed-by: Milan Broz <gmazyland@gmail.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
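
A sketch of the resulting flow (the veritysetup flag and key handling are illustrative assumptions; the kernel-side hook is the signed root hash lookup described above):

    # Activate a verity volume, asking the kernel to verify the root hash
    # signature against the builtin keyring
    veritysetup open /dev/sdb1 vroot /dev/sdb2 "$ROOT_HASH" \
        --root-hash-signature=roothash.p7s
    # Kernel command line: force signature verification for all verity volumes:
    #   dm_verity.require_signatures=1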

By Joerg Roedel
This kernel parameter now also takes effect on X86.
Signed-off-by: Joerg Roedel <jroedel@suse.de>

- 22 Aug 2019, 1 commit

By Gustavo Romero
Document all options currently supported by the xmon debugger.
Signed-off-by: Gustavo Romero <gromero@linux.ibm.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20190814205638.25322-1-gromero@linux.ibm.com

- 20 Aug 2019, 2 commits

By Matthew Garrett
While existing LSMs can be extended to handle lockdown policy, distributions generally want to be able to apply a straightforward static policy. This patch adds a simple LSM that can be configured to reject either integrity or all lockdown queries, and can be configured at runtime (through securityfs), boot time (via a kernel parameter) or build time (via a kconfig option). Based on initial code by David Howells.
Signed-off-by: Matthew Garrett <mjg59@google.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Cc: David Howells <dhowells@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
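
The three configuration points map onto concrete knobs. A sketch (parameter values and the securityfs path as documented for the lockdown LSM):

    # Boot time: forbid integrity-violating operations ("confidentiality"
    # additionally blocks operations that could leak kernel secrets)
    lockdown=integrity
    # Runtime: query, and one-way tighten, the policy through securityfs
    cat /sys/kernel/security/lockdown
    echo integrity > /sys/kernel/security/lockdown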

By Tom Lendacky
There have been reports of RDRAND issues after resuming from suspend on some AMD family 15h and family 16h systems. This issue stems from a BIOS not performing the proper steps during resume to ensure RDRAND continues to function properly. RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND support using CPUID, including the kernel, will believe that RDRAND is not supported. Update the CPU initialization to clear the RDRAND CPUID bit for any family 15h and 16h processor that supports RDRAND. If it is known that the family 15h or family 16h system does not have an RDRAND resume issue, or that the system will not be placed in suspend, the "rdrand=force" kernel parameter can be used to stop the clearing of the RDRAND CPUID bit. Additionally, update the suspend and resume path to save and restore the MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in place after resuming from suspend. Note that clearing the RDRAND CPUID bit does not prevent a processor that normally supports the RDRAND instruction from executing it, so any code that determined the support based on family and model won't #UD.
Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Andrew Cooper <andrew.cooper3@citrix.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Chen Yu <yu.c.chen@intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
Cc: Nathan Chancellor <natechancellor@gmail.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: <stable@vger.kernel.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "x86@kernel.org" <x86@kernel.org>
Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
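
The effect is easy to observe on a running system (a sketch; rdrand is the standard /proc/cpuinfo flag name):

    # With the CPUID bit cleared, "rdrand" disappears from the flags
    grep -o rdrand /proc/cpuinfo | sort -u
    # Kernel command line opt-out on systems known not to be affected
    rdrand=force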

- 17 Aug 2019, 1 commit

By Christoph Hellwig
The aim of this machvec is to support devices with < 32-bit dma masks. But given that ia64 only has a ZONE_DMA32 and not a ZONE_DMA, that isn't supported by swiotlb either.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lkml.kernel.org/r/20190813072514.23299-21-hch@lst.de
Signed-off-by: Tony Luck <tony.luck@intel.com>

- 14 Aug 2019, 1 commit

By Paul E. McKenney
This commit changes the name of the rcu_nocb_leader_stride kernel boot parameter to rcu_nocb_gp_stride in order to account for the new distinction between callback and grace-period no-CBs kthreads.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>

- 09 Aug 2019, 2 commits

By Stephen Hemminger
Both IPX and TR have not been supported for a while now. Remove them from the /proc/sys/net documentation.
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

By Alexey Kardashevskiy
The "pci=resource_alignment" parameter is described as requiring an order (not a size) and the code in pci_specified_resource_alignment() expects an order. But the example wrongly shows a size. Convert the example to an order.
Fixes: 8b078c60 ("PCI: Update "pci=resource_alignment" documentation")
Link: https://lore.kernel.org/r/20190606032557.107542-1-aik@ozlabs.ru
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

- 04 Aug 2019, 1 commit

By Josh Poimboeuf
Add documentation to the Spectre document about the new swapgs variant of Spectre v1.
Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
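
The documented state can be checked at runtime through the standard reporting file, whose output mentions swapgs barriers once the mitigation is in place (a sketch):

    cat /sys/devices/system/cpu/vulnerabilities/spectre_v1
    # e.g. "Mitigation: usercopy/swapgs barriers and __user pointer sanitization"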

- 02 Aug 2019, 1 commit

By Paul E. McKenney
This commit adds an rcu_cpu_stall_ftrace_dump kernel boot parameter that, when set, causes the trace buffer to be dumped after an RCU CPU stall warning is printed. This kernel boot parameter is disabled by default, maintaining compatibility with previous behavior.
Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
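
Usage sketch (the rcupdate. prefix is an assumption worth verifying against kernel-parameters.txt for your tree):

    # Dump the ftrace ring buffer whenever an RCU CPU stall warning is printed
    rcupdate.rcu_cpu_stall_ftrace_dump=1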

- 01 Aug 2019, 3 commits

By Saravana Kannan
Add device-links after the devices are created (but before they are probed) by looking at common DT bindings like clocks and interconnects. Automatically adding device-links for functional dependencies at the framework level provides the following benefits:

- Optimizes device probe order and avoids the useless work of attempting probes of devices that will not probe successfully (because their suppliers aren't present or haven't probed yet). For example, in a commonly available mobile SoC, registering just one consumer device's driver at an initcall level earlier than the supplier device's driver causes 11 failed probe attempts before the consumer device probes successfully. This was with a kernel with all the drivers statically compiled in. This problem gets a lot worse if all the drivers are loaded as modules without direct symbol dependencies.

- Supplier devices like clock providers, interconnect providers, etc. need to keep the resources they provide active and at a particular state(s) during boot up, even if their current set of consumers don't request the resource to be active. This is because the rest of the consumers might not have probed yet, and turning off the resource before all the consumers have probed could lead to a hang or undesired user experience. Some frameworks (e.g. regulator) handle this today by turning off "unused" resources at late_initcall_sync and hoping all the devices have probed by then. This is not a valid assumption for systems with loadable modules. Other frameworks (e.g. clock) just don't handle this due to the lack of a clear signal for when they can turn off resources. This leads to downstream hacks to handle cases like this that can easily be solved in the upstream kernel. By linking devices before they are probed, we give suppliers a clear count of the number of dependent consumers. Once all of the consumers are active, the suppliers can turn off the unused resources without making assumptions about the number of consumers.

By default we just add device-links to track "driver presence" (probe succeeded) of the supplier device. If any other functionality provided by device-links is needed, it is left to the consumer/supplier devices to change the link when they probe.
The kbuild test robot reported a clang error about a missing const.
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Saravana Kannan <saravanak@google.com>
Link: https://lore.kernel.org/r/20190731221721.187713-4-saravanak@google.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

By Mauro Carvalho Chehab
The filenames of the cifs documentation do not follow the same convention as almost all other kernel documents, so rename them to more appropriate names. Then, manually convert the documentation files for CIFS to ReST. By doing a manual conversion, we can preserve the original author's style, while making it look more like the other kernel documents. Most of the conversion here is trivial. The most complex one was the README file (which was renamed to usage.rst).
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>

By Mauro Carvalho Chehab
Manually convert the wimax documentation to ReST and add it to the kernel doc body, inside the admin-guide.
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>