- 15 January 2020, 11 commits
-
-
Committed by Xiaoguang Wang
The await value reported by iostat is not precise enough: it is somewhat coarse and cannot show a request's latency on the device driver's side. Add a new counter to track an io request's d2c time; with this patch, iostat can easily be extended to show this value. Note: iostat only reads the fields it needs, so it is not affected by this change, and neither is tsar.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by zhangliguang
This is a temporary workaround to avoid the limitation when creating a hard link across two project IDs (projids).
Signed-off-by: zhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Dan Williams
commit 730926c3b0998943654019f00296cf8e3b02277e upstream
On the expectation that some environments may not upgrade libdaxctl (the userspace component that depends on the /sys/class/dax hierarchy), provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat driver implements the original /sys/class/dax sysfs layout rather than /sys/bus/dax. When userspace is upgraded, it can blacklist this module and switch to the dax_pmem driver going forward. CONFIG_DEV_DAX_PMEM_COMPAT and its supporting code will be deleted according to the dax_pmem entry in Documentation/ABI/obsolete/.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
-
Committed by Srinivas Pandruvada
commit 2ee5bfc1efc81179c73abcd33098dd2c86019146 upstream.
Reserve ioctl numbers for the Intel Speed Select Technology interface drivers.
Signed-off-by: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Len Brown
commit 0e344d8c709fe01d882fc0fb5452bedfe5eba67a upstream.
Export die_id in the CPU topology, for the benefit of hardware that has multiple dies per package.
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.1557769318.git.len.brown@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
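A quick way to inspect the new attribute from userspace is the topology directory in sysfs (a hedged sketch; cpu0 is a placeholder and the values depend on the machine):
# cat /sys/devices/system/cpu/cpu0/topology/die_id
0
On single-die packages the value simply stays 0 for every CPU.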
-
Committed by Len Brown
commit 7745f03eb39587dd15a1fb26e6223678b8e906d2 upstream.
Some new systems have multiple software-visible dies within each package. Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle either CPUID.B or the new CPUID.1F. Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result. die_id will be non-zero only for multi-die/package systems.
Signed-off-by: Len Brown <len.brown@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.1557769318.git.len.brown@intel.com
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Len Brown
commit 3a1c779fb8f71e772e2145e68c262936ada815ed upstream.
Syntax only, no functional or semantic change.
Signed-off-by: Len Brown <len.brown@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/1ca56f8ea922a67f0017bd645912ea02a65a85ec.1551160674.git.len.brown@intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Keith Busch
commit 13bac55ef7aef8ecb67ff3005d24b05a464d28ea upstream.
Platforms may provide system memory where some physical address ranges perform differently than others, or are cached by the system on the memory side. Add documentation describing a high-level overview of such systems and the performance and caching attributes the kernel provides for applications wishing to query this information.
Reviewed-by: Mike Rapoport <rppt@linux.ibm.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Keith Busch
commit acc02a109b0497e917c83f986a89c51e47d0022c upstream.
System memory may have caches to help improve access speed to frequently requested address ranges. While the system-provided cache is transparent to the software accessing these memory ranges, applications can optimize their own access based on cache attributes. Provide a new API for the kernel to register these memory-side caches under the memory node that provides them. The new sysfs representation is modeled on the existing cpu cacheinfo attributes, as seen in /sys/devices/system/cpu/<cpu>/cache/. Unlike CPU cacheinfo, though, the node cache level is reported from the view of the memory: a higher level number is nearer to the CPU, while lower levels are closer to the last level memory. The exported attributes are the cache size, the line size, associativity indexing, and write-back policy, and the attributes for the system memory caches are added to the sysfs stable documentation.
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Brice Goglin <Brice.Goglin@inria.fr>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
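On platforms whose firmware actually describes a memory-side cache, the new attributes appear under the node directory; a hedged sketch of reading them (node and index numbers are placeholders, and the directory is absent when no such cache is reported):
# cd /sys/devices/system/node/node0/memory_side_cache/index1
# grep . size line_size indexing write_policy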
-
Committed by Keith Busch
commit e1cf33aafb8462c7d0a0e6349925870316f040ee upstream.
Heterogeneous memory systems provide memory nodes with different latency and bandwidth performance attributes. Provide a new kernel interface for subsystems to register the attributes under the memory target node's initiator access class. If the system provides this information, applications may query these attributes when deciding which node to request memory from. The following example shows the new sysfs hierarchy for a node exporting performance attributes:
# tree -P "read*|write*" /sys/devices/system/node/nodeY/accessZ/initiators/
/sys/devices/system/node/nodeY/accessZ/initiators/
|-- read_bandwidth
|-- read_latency
|-- write_bandwidth
`-- write_latency
The bandwidth is exported in MB/s and the latency is reported in nanoseconds. The values are taken from the platform as reported by the manufacturer. Memory accesses from an initiator node that is not one of the memory's access "Z" initiator nodes linked in the same directory may observe different performance than reported here. When a subsystem makes use of this interface, initiators of a different access number may not have the same performance relative to initiators in other access numbers, or may be omitted from an access class's initiators entirely. Descriptions of the memory access initiator performance attributes are added to the sysfs stable documentation.
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
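Reading the individual files returns the raw numbers; for example (a hedged sketch with placeholder node/access numbers and values, units as described above: nanoseconds and MB/s):
# cat /sys/devices/system/node/node1/access0/initiators/read_latency
120
# cat /sys/devices/system/node/node1/access0/initiators/read_bandwidth
30720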
-
Committed by Keith Busch
commit 08d9dbe72b1f899468b2b34f9309e88a84f440f2 upstream.
Systems may be constructed with various specialized nodes. Some nodes may provide memory, some provide compute devices that access and use that memory, and others may provide both. Nodes that provide memory are referred to as memory targets, and nodes that can initiate memory access are referred to as memory initiators. Memory targets will often have varying access characteristics from different initiators, and platforms may have ways to express those relationships. In preparation for these systems, provide interfaces for the kernel to export the memory relationship among different nodes' memory targets and their initiators with symlinks to each other. If a system provides access locality for each initiator-target pair, nodes may be grouped into ranked access classes relative to other nodes. The new interface allows a subsystem to register relationships of varying classes if available and desired to be exported. A memory initiator may have multiple memory targets in the same access class. The target memory's initiators in a given class indicate that the nodes' access characteristics share the same performance relative to other linked initiator nodes. Each target within an initiator's access class, though, does not necessarily perform the same as the others. A memory target node may have multiple memory initiators. All linked initiators in a target's class have the same access characteristics to that target. The following example shows the nodes' new sysfs hierarchy for a memory target node 'Y' with access class 0 from initiator node 'X':
# symlinks -v /sys/devices/system/node/nodeX/access0/
relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY
# symlinks -v /sys/devices/system/node/nodeY/access0/
relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX
The new attributes are added to the sysfs stable documentation.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Brice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Fan Du <fan.du@intel.com>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 27 December 2019, 12 commits
-
-
Committed by Tejun Heo
commit 8504dea783b044cab620acbaef87b86ee84646fe upstream.
Add a script which can be used to generate device-specific iocost linear model coefficients.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Tejun Heo
commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.
This patchset implements an IO-cost-model-based, work-conserving proportional controller. While io.latency provides the capability to comprehensively prioritize and protect IOs depending on the cgroups, its protection is binary - the cgroup with the lowest latency target which is suffering is protected at the cost of all others. In many use cases, including stacking multiple workload containers in a single system, it's necessary to distribute IO capacity with better granularity. One challenge of controlling IO resources is the lack of a trivially observable cost metric. The most common metrics - bandwidth and iops - can be off by orders of magnitude depending on the device type and IO pattern. However, the cost isn't a complete mystery. Given several key attributes, we can make fairly reliable predictions on how expensive a given stream of IOs would be, at least compared to other IO patterns. The function which determines the cost of a given IO is the IO cost model for the device. This controller distributes IO capacity based on the costs estimated by such a model. The more accurate the cost model the better, but the controller adapts based on IO completion latency, and as long as the relative costs across different IO patterns are consistent and sensible, it'll adapt to the actual performance of the device. Currently, the only implemented cost model is a simple linear one with a few sets of default parameters for different classes of device. This covers most common devices reasonably well. All the infrastructure to tune and add different cost models is already in place and a later patch will also allow using bpf progs for cost models. Please see the top comment in blk-iocost.c and the documentation for more details.
v2: Rebased on top of the RQ_ALLOC_TIME changes and folded in Rik's fix for a divide-by-zero bug in current_hweight() triggered by zero inuse_sum.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
[Joseph: fix conflicts with ioc_rqos_throttle()]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
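A minimal usage sketch on a cgroup2 mount, assuming the interface files described in the cgroup-v2 documentation (the device number 8:16 and the weights below are placeholders):
# echo "8:16 enable=1" > /sys/fs/cgroup/io.cost.qos
# echo "default 200" > /sys/fs/cgroup/fast/io.weight
# echo "default 100" > /sys/fs/cgroup/slow/io.weight
The first command enables cost-based control for one device from the cgroup root; the other two distribute that device's capacity between two groups in a 2:1 proportion.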
-
Committed by Tejun Heo
commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.
cgroup already uses floating point for percent[ile] numbers and there are several controllers which want to take them as input. Add a generic parse helper to handle the inputs. Update the interface convention documentation about the use of percentage numbers. While at it, also clarify the default time unit.
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Zhen Lei
commit 68a6efe86f6a16e25556a2aff40efad41097b486 upstream
Add a generic command line option to enable lazy unmapping via IOVA flush queues, which will initially be supported by iommu-dma. This echoes the semantics of "intel_iommu=strict" (albeit with the opposite default value), but in the driver-agnostic fashion of "iommu.passthrough".
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
[rm: move handling out of SMMUv3 driver, clean up documentation]
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
[will: dropped broken printk when parsing command-line option]
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Zou Cao <zoucao@linux.alibaba.com>
Reviewed-by: Baoyou Xie <xie.baoyou@linux.alibaba.com>
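On kernels carrying this series, the behaviour is selected on the kernel command line (a hedged sketch; the parameter spelling is taken from the upstream commit and documented in kernel-parameters.txt):
iommu.strict=0
selects lazy invalidation via flush queues, while iommu.strict=1 keeps strict, synchronous TLB invalidation on unmap. Whether the option was passed can be checked with:
# grep -o 'iommu.strict=[01]' /proc/cmdline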
-
Committed by Suren Baghdasaryan
commit 0e94682b73bfa6c44c98af7a26771c9c08c055d5 upstream.
The psi monitor aims to provide a low-latency, short-term pressure detection mechanism configurable by users. It allows users to monitor psi metric growth and trigger events whenever a metric rises above a user-defined threshold within a user-defined time window. The time window and threshold are both expressed in usecs. Multiple psi resources with different thresholds and window sizes can be monitored concurrently. Psi monitors activate when the system enters a stall state for the monitored psi metric and deactivate upon exit from the stall state. While the system is in the stall state, psi signal growth is monitored at a rate of 10 times per tracking window. The minimum window size is 500ms, therefore the minimum monitoring interval is 50ms. The maximum window size is 10s, with a monitoring interval of 1s. When activated, a psi monitor stays active for at least the duration of one tracking window to avoid repeated activations/deactivations when the psi signal is bouncing. Notifications to the users are rate-limited to one per tracking window.
Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.com
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Waiman Long
commit be87ab0afd680ac35486d16c0963c56d9be1d8a0 upstream.
The output of the PSI files shows a bunch of numbers with no unit. The psi.txt documentation file also does not indicate what units are used. One can only find out by looking at the source code. The units are percentage for the averages and microseconds for the total. Make the information easier to find by documenting the units in psi.txt.
Link: http://lkml.kernel.org/r/20190402193810.3450-1-longman@redhat.com
Signed-off-by: Waiman Long <longman@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Johannes Weiner
commit e0c274472d5d27f277af722e017525e0b33784cd upstream.
Mel Gorman reports a hackbench regression with psi that would prohibit shipping the SUSE kernel with it default-enabled, but he'd still like users to be able to opt in at little to no cost to others. With the current combination of CONFIG_PSI and the psi_disabled bool set from the command line, this is a challenge. Do the following things to make it easier:
1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros to enable CONFIG_PSI in their kernel but leave the feature disabled unless a user requests it at boot time. To avoid double negatives, rename psi_disabled= to psi=.
2. Make psi_disabled a static branch to eliminate any branch costs when the feature is disabled.
In terms of numbers before and after this patch, Mel says:
: The following is a comparison using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                    4.20.0-rc4           4.20.0-rc4           4.20.0-rc4
:           kconfigdisable-v1r1              vanilla      psidisable-v1r1
: Amean  1    1.3100 (  0.00%)    1.3923 ( -6.28%)    1.3427 ( -2.49%)
: Amean  3    3.8860 (  0.00%)    4.1230 * -6.10%*    3.8860 ( -0.00%)
: Amean  5    6.8847 (  0.00%)    8.0390 * -16.77%*   6.7727 (  1.63%)
: Amean  7    9.9310 (  0.00%)   10.8367 * -9.12%*    9.9910 ( -0.60%)
: Amean  12  16.6577 (  0.00%)   18.2363 * -9.48%*   17.1083 ( -2.71%)
: Amean  18  26.5133 (  0.00%)   27.8833 * -5.17%*   25.7663 (  2.82%)
: Amean  24  34.3003 (  0.00%)   34.6830 ( -1.12%)   32.0450 (  6.58%)
: Amean  30  40.0063 (  0.00%)   40.5800 ( -1.43%)   41.5087 ( -3.76%)
: Amean  32  40.1407 (  0.00%)   41.2273 ( -2.71%)   39.9417 (  0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;
Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Tested-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
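With CONFIG_PSI=y and CONFIG_PSI_DEFAULT_DISABLED=y, pressure tracking is opted into at boot via the renamed parameter (a hedged sketch):
psi=1
# grep -o 'psi=[01]' /proc/cmdline
psi=1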
-
Committed by Johannes Weiner
commit 2ce7135adc9ad081aa3c49744144376ac74fea60 upstream.
On a system that executes multiple cgrouped jobs and independent workloads, we don't just care about the health of the overall system, but also that of individual jobs, so that we can ensure individual job health, fairness between jobs, or prioritize some jobs over others. This patch implements pressure stall tracking for cgroups. In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure, and io.pressure files that track aggregate pressure stall times for only the tasks inside the cgroup.
Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Tejun Heo <tj@kernel.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Joseph: fix apply conflicts in cgroup_create()]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Conflicts: kernel/cgroup/cgroup.c
-
Committed by Johannes Weiner
commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.
When systems are overcommitted and resources become contended, it's hard to tell exactly what impact this has on workload productivity, or how close the system is to lockups and OOM kills. In particular, when machines work multiple jobs concurrently, the impact of overcommit in terms of latency and throughput on the individual job can be enormous. In order to maximize hardware utilization without sacrificing individual job health or risking complete machine lockups, this patch implements a way to quantify resource pressure in the system. A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that expose the percentage of time the system is stalled on CPU, memory, or IO, respectively. Stall states are aggregate versions of the per-task delay accounting delays:
cpu: some tasks are runnable but not executing on a CPU
memory: tasks are reclaiming, or waiting for swapin or thrashing cache
io: tasks are waiting for io completions
These percentages of walltime can be thought of as pressure percentages, and they give a general sense of system health and productivity loss incurred by resource overcommit. They can also indicate when the system is approaching lockup scenarios and OOMs. To do this, psi keeps track of the task states associated with each CPU and samples the time they spend in stall states. Every 2 seconds, the samples are averaged across CPUs - weighted by the CPUs' non-idle time to eliminate artifacts from unused CPUs - and translated into percentages of walltime. A running average of those percentages is maintained over 10s, 1m, and 5m periods (similar to the loadaverage).
[hannes@cmpxchg.org: doc fixlet, per Randy]
Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: Daniel Drake <drake@endlessm.com>
Tested-by: Suren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[Joseph: fix apply conflicts in task_struct]
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
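On a kernel with PSI enabled, the new files can be read directly; a hedged sketch of the output format (the numbers are obviously system-dependent):
# cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0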
-
Committed by George Zhang
By default the tcp_tw_timeout value is 60 seconds. The minimum is 1 second and the maximum is 600. This setting is useful on systems under heavy TCP load. NOTE: setting tcp_tw_timeout below 60 seconds violates the "quiet time" restriction and puts your system at risk of having some old data accepted as new, or new data rejected as an old duplicate, by some receivers.
Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793
Signed-off-by: George Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
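Assuming the knob is exposed as a normal ipv4 sysctl named tcp_tw_timeout (an assumption based on the description above, not confirmed by this log), tuning it would look like:
# sysctl net.ipv4.tcp_tw_timeout
# sysctl -w net.ipv4.tcp_tw_timeout=30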
-
Committed by Jiufei Xue
Writeback control is now supported for the cgroup v1 interface as well. However, it comes with some restrictions, so introduce a new kernel boot parameter to control the behavior; it is disabled by default. Users can enable writeback control for cgroup v1 with the command line parameter "cgwb_v1".
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
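A hedged example of turning it on: append the parameter to the kernel command line in the bootloader configuration, reboot, and confirm it was picked up:
cgwb_v1
# grep -o cgwb_v1 /proc/cmdline
cgwb_v1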
-
Committed by Eryu Guan
Prior to the xdragon platform 20181230 release (e.g. the 0930 release), vring_use_dma_api() is required to return 'true' unconditionally. Introduce a new kernel boot parameter called "vring_force_dma_api" to control the behavior; boot the xdragon host with the "vring_force_dma_api" command line option to make ENI hotplug work, so that normal ECS hosts keep the original behavior.
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Eryu Guan <eguan@linux.alibaba.com>
-
- 18 December 2019, 2 commits
-
-
Committed by Michal Hocko
[ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]
Userspace falls short when trying to find out whether a specific memory range is eligible for THP. There are usecases that would like to know that:
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.
The only way to deduce this now is to query for the hg resp. nh flags and confront the state with the global setting. Except that there is also PR_SET_THP_DISABLE that might change the picture, so the final logic is not trivial. Moreover, the eligibility of the vma depends on the type of VMA as well. In the past we supported only anonymous memory VMAs, but things have changed: shmem-based vmas are supported as well these days, and the query logic gets even more complicated because the eligibility depends on the mount option and another global configuration knob. Simplify the current state and report the THP eligibility in /proc/<pid>/smaps for each existing vma. Reuse transparent_hugepage_enabled for this purpose. The original implementation of this function assumes that the caller knows that the vma itself is supported for THP, so make the core checks into __transparent_hugepage_enabled and use it for existing callers. __show_smap just uses the new transparent_hugepage_enabled, which also checks the vma support status (please note that this one has to be out of line due to include dependency issues).
[mhocko@kernel.org: fix oops with NULL ->f_mapping]
Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
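With this change the eligibility can be read per mapping straight from smaps; a hedged sketch (1234 is a placeholder PID, and the reported value is 1 for eligible VMAs, 0 otherwise):
# grep THPeligible /proc/1234/smaps | sort | uniq -c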
-
Committed by Oliver Neukum
commit 65cc8bf99349f651a0a2cee69333525fe581f306 upstream.
Document which flags apply to usb-storage, to UAS, or to both.
Signed-off-by: Oliver Neukum <oneukum@suse.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20191114112758.32747-4-oneukum@suse.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
- 13 December 2019, 1 commit
-
-
Committed by Baruch Siach
[ Upstream commit 73852e56827f5cb5db9d6e8dd8191fc2f2e8f424 ]
The abracon,tc-resistor property value is in kOhm.
Signed-off-by: Baruch Siach <baruch@tkos.co.il>
Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
- 05 December 2019, 1 commit
-
-
Committed by Peter Hutterer
[ Upstream commit 46b14eef59a8157138dc02f916a7f97c73b3ec53 ]
Signed-off-by: Peter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: Jiri Kosina <jkosina@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
- 01 December 2019, 2 commits
-
-
Committed by Waiman Long
commit 64870ed1b12e235cfca3f6c6da75b542c973ff78 upstream.
For MDS-vulnerable processors with TSX support, enabling either the MDS or the TAA mitigation will enable the use of VERW to flush internal processor buffers at the right code path. IOW, they are either both mitigated or both not. However, if the command line options are inconsistent, the vulnerabilities sysfs files may not report the mitigation status correctly. For example, with only the "mds=off" option:
vulnerabilities/mds:Vulnerable; SMT vulnerable
vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable
The mds vulnerabilities file has the wrong status in this case. Similarly, the taa vulnerability file will be wrong with the mds mitigation on, but taa off. Change taa_select_mitigation() to sync up the two mitigation statuses and have them turned off if both "mds=off" and "tsx_async_abort=off" are present. Update the documentation to emphasize that both "mds=off" and "tsx_async_abort=off" have to be specified together to be effective on processors that are affected by both TAA and MDS.
[ bp: Massage and add kernel-parameters.txt change too. ]
Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: Mark Gross <mgross@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
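To turn both mitigations off on hardware affected by both issues, the two options therefore have to be passed together, and the result can be checked in sysfs (a hedged sketch; the exact output strings vary by CPU, microcode and kernel version):
mds=off tsx_async_abort=off
# grep . /sys/devices/system/cpu/vulnerabilities/{mds,tsx_async_abort}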
-
Committed by Keiji Hayashibara
[ Upstream commit 3511ba7d4ca6f39e2d060bb94e42a41ad1fee7bf ]
This commit fixes an incorrect property description that did not match the actual hardware. The '#address-cells' and '#size-cells' parameters were removed, and 'interrupts', 'pinctrl-names' and 'pinctrl-0' were added.
Fixes: 4dcd5c27 ("spi: add DT bindings for UniPhier SPI controller")
Signed-off-by: Keiji Hayashibara <hayashibara.keiji@socionext.com>
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
- 21 November 2019, 2 commits
-
-
Committed by Florian Fainelli
[ Upstream commit b78ac6ecd1b6b46f8767cbafa95a7b0b51b87ad8 ]
Allow the configuration of the MDIO clock divider when the Device Tree contains a 'clock-frequency' property (similar to the I2C and SPI buses). Because the hardware may have lost its state during suspend/resume, re-apply the MDIO clock divider upon resumption.
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
Committed by Geert Uytterhoeven
[ Upstream commit 27582f0ea97fe3e4a38beb98ab36cce4b6f029d5 ]
With recent dtc and W=1:
Warning (graph_port): video-receiver@70/port@10: graph node unit address error, expected "a"
Warning (graph_port): video-receiver@70/port@11: graph node unit address error, expected "b"
Unit addresses are always hexadecimal (without prefix), while the bases of reg property values depend on their prefixes.
Fixes: e6959517 ("media: adv748x: Add adv7481, adv7482 bindings")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Rob Herring <robh@kernel.org>
Reviewed-by: Kieran Bingham <kieran.bingham+renesas@ideasonboard.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
-
- 13 November 2019, 9 commits
-
-
Committed by Junaid Shahid
commit 1aa9b9572b10529c2e64e2b8f44025d86e124308 upstream.
The page table pages corresponding to broken-down large pages are zapped in FIFO order, so that the large page can potentially be recovered if it is no longer being used for execution. This removes the performance penalty for walking deeper EPT page tables. By default, one large page will last about one hour once the guest reaches a steady state.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Paolo Bonzini
commit b8e8c8303ff28c61046a4d0f6ea99aea609a7dc0 upstream.
With some Intel processors, putting the same virtual address in the TLB as both a 4 KiB and a 2 MiB page can confuse the instruction fetch unit and cause the processor to issue a machine check resulting in a CPU lockup. Unfortunately, when EPT page tables use huge pages, it is possible for a malicious guest to cause this situation. Add a knob to mark huge pages as non-executable. When the nx_huge_pages parameter is enabled (and we are using EPT), all huge pages are marked as NX. If the guest attempts to execute in one of those pages, the page is broken down into 4K pages, which are then marked executable. This is not an issue for shadow paging (except nested EPT), because then the host is in control of TLB flushes and the problematic situation cannot happen. With nested EPT, again the nested guest can cause problems, so shadow and direct EPT are treated in the same way.
[ tglx: Fixup default to auto and massage wording a bit ]
Originally-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
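The knob is a kvm module parameter; a hedged sketch of inspecting and changing it at runtime (accepted values are along the lines of auto, off and force):
# cat /sys/module/kvm/parameters/nx_huge_pages
# echo off > /sys/module/kvm/parameters/nx_huge_pages
It can also be set at boot with kvm.nx_huge_pages=off on the kernel command line.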
-
Committed by Junaid Shahid
commit 0d9ce162cf46c99628cc5da9510b959c7976735b upstream.
It doesn't seem as if there is any particular need for kvm_lock to be a spinlock, so convert the lock to a mutex so that sleepable functions (in particular cond_resched()) can be called while holding it.
Signed-off-by: Junaid Shahid <junaids@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Gomez Iglesias, Antonio
commit 7f00cc8d4a51074eb0ad4c3f16c15757b1ddfb7d upstream.
Add the initial ITLB_MULTIHIT documentation.
[ tglx: Add it to the index so it gets actually built. ]
Signed-off-by: Antonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: Nelson D'Souza <nelson.dsouza@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Vineela Tummalapalli
commit db4d30fbb71b47e4ecb11c4efa5d8aad4b03dfae upstream.
Some processors may incur a machine check error, possibly resulting in an unrecoverable CPU lockup, when an instruction fetch encounters a TLB multi-hit in the instruction TLB. This can occur when the page size is changed along with either the physical address or the cache type. The relevant erratum can be found here: https://bugzilla.kernel.org/show_bug.cgi?id=205195 There are other affected processors for which the erratum does not fully disclose the impact. This issue affects both bare-metal x86 page tables and EPT. It can be mitigated by either eliminating the use of large pages or by using careful TLB invalidations when changing the page size in the page tables. Just like Spectre, Meltdown, L1TF and MDS, a new bit has been allocated in MSR_IA32_ARCH_CAPABILITIES (PSCHANGE_MC_NO) and will be set on CPUs which are mitigated against this issue.
Signed-off-by: Vineela Tummalapalli <vineela.tummalapalli@intel.com>
Co-developed-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
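The series also reports the issue through the common vulnerabilities directory; a hedged check (the mitigation string depends on whether KVM is loaded and how it is configured):
# cat /sys/devices/system/cpu/vulnerabilities/itlb_multihit
KVM: Mitigation: Split huge pages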
-
Committed by Pawan Gupta
commit a7a248c593e4fd7a67c50b5f5318fe42a0db335e upstream.
Add the documentation for TSX Async Abort. Include a description of the issue, how to check the mitigation state, how to control the mitigation, and guidance for system administrators.
[ bp: Add proper SPDX tags, touch ups by Josh and me. ]
Co-developed-by: Antonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Antonio Gomez Iglesias <antonio.gomez.iglesias@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Mark Gross <mgross@linux.intel.com>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Pawan Gupta
commit 7531a3596e3272d1f6841e0d601a614555dc6b65 upstream.
Platforms which are not affected by X86_BUG_TAA may want the TSX feature enabled. Add an "auto" option to the tsx cmdline parameter. With tsx=auto, disable TSX when X86_BUG_TAA is present, otherwise enable TSX. More details on X86_BUG_TAA can be found here: https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/tsx_async_abort.html
[ bp: Extend the arg buffer to accommodate "auto\0". ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
-
Committed by Pawan Gupta
commit 95c5824f75f3ba4c9e8e5a4b1a623c95390ac266 upstream.
Add a kernel cmdline parameter "tsx" to control the Transactional Synchronization Extensions (TSX) feature. On CPUs that support TSX control, use "tsx=on|off" to enable or disable TSX. Not specifying this option is equivalent to "tsx=off". This is because on certain processors TSX may be used as part of a speculative side-channel attack. Carve out the TSX-controlling functionality into a separate compilation unit, because TSX is a CPU feature while the TSX async abort control machinery will go to cpu/bugs.c.
[ bp: - Massage, shorten and clear the arg buffer. - Clarifications of the tsx= possible options - Josh. - Expand on TSX_CTRL availability - Pawan. ]
Signed-off-by: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
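Putting the TSX patches above together, a hedged example of a boot configuration and the corresponding runtime check (the output wording varies with CPU and microcode):
tsx=auto tsx_async_abort=full
# cat /sys/devices/system/cpu/vulnerabilities/tsx_async_abort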
-
Committed by Dave Chiluk
commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
It has been observed that highly-threaded, non-cpu-bound applications running under cpu.cfs_quota_us constraints can hit a high percentage of periods throttled while simultaneously not consuming the allocated amount of quota. This use case is typical of user-interactive, non-cpu-bound applications, such as those running in kubernetes or mesos when run on multiple cpu cores. This has been root-caused to cpu-local run queues being allocated per-cpu bandwidth slices and then not fully using that slice within the period, at which point the slice and quota expire. This expiration of unused slice results in applications not being able to utilize the quota for which they are allocated. The non-expiration of per-cpu slices was recently fixed by commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition"). Prior to that, it appears that this had been broken since at least commit 51f2176d ("sched/fair: Fix unlocked reads of some cfs_b->quota/period"), which was introduced in v3.16-rc1 in 2014. That added the following conditional, which resulted in slices never being expired:
if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
	/* extend local deadline, drift is bounded above by 2 ticks */
	cfs_rq->runtime_expires += TICK_NSEC;
Because this was broken for nearly 5 years, and has recently been fixed and is now being noticed by many users running kubernetes (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion that the mechanisms around expiring runtime should be removed altogether. This allows quota already allocated to per-cpu run-queues to live longer than the period boundary. This allows threads on runqueues that do not use much CPU to continue to use their remaining slice over a longer period of time than cpu.cfs_period_us. However, this helps prevent the above condition of hitting throttling while also not fully utilizing your cpu quota. This theoretically allows a machine to use slightly more than its allotted quota in some periods. This overflow would be bounded by the remaining quota left on each per-cpu runqueue. This is typically no more than min_cfs_rq_runtime=1ms per cpu. For CPU-bound tasks this will change nothing, as they should theoretically fully utilize all of their quota in each period. For user-interactive tasks as described above this provides a much better user/application experience, as their cpu utilization will more closely match the amount they requested when they hit throttling. This means that cpu limits no longer strictly apply per period for non-cpu-bound applications, but they are still accurate over longer timeframes. This greatly improves the performance of high-thread-count, non-cpu-bound applications with low cfs_quota_us allocation on high-core-count machines. In the case of an artificial testcase (10ms/100ms of quota on an 80 CPU machine), this commit resulted in an almost 30x performance improvement, while still maintaining correct cpu quota restrictions. That testcase is available at https://github.com/indeedeng/fibtest.
Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Phil Auld <pauld@redhat.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: John Hammond <jhammond@indeed.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kyle Anderson <kwa@yelp.com>
Cc: Gabriel Munos <gmunoz@netflix.com>
Cc: Peter Oskolkov <posk@posk.io>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Brendan Gregg <bgregg@netflix.com>
Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
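For reference, the quota being discussed is configured through the cpu controller; a hedged cgroup v1 sketch reproducing the commit's 10ms/100ms example (paths assume the usual /sys/fs/cgroup/cpu mount and a hypothetical group named app):
# mkdir /sys/fs/cgroup/cpu/app
# echo 100000 > /sys/fs/cgroup/cpu/app/cpu.cfs_period_us
# echo 10000 > /sys/fs/cgroup/cpu/app/cpu.cfs_quota_us
# cat /sys/fs/cgroup/cpu/app/cpu.stat
The nr_throttled and throttled_time counters reported by cpu.stat are what improve for the workloads described above.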
-