1. 08 December 2019: 1 commit
    • alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Authored by Yang Shi
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur unexpected
      stalls in applications which are sensitive to latency.  Introduce a
      background async page reclaim mechanism, like what kswapd does.
      
      Define a memcg memory usage water mark by introducing a wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100), and
      wmark_low is (wmark_high - wmark_high >> 8), which is an empirical value.
      If wmark_ratio is 0, the water mark is disabled and both wmark_low and
      wmark_high are max, which is the default.
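      
      As a minimal sketch of the watermark math described above (the helper name
      and signature here are illustrative, not the ones used in the patch):
      
      	static void memcg_calc_wmarks(unsigned long max, unsigned int wmark_ratio,
      				      unsigned long *wmark_low, unsigned long *wmark_high)
      	{
      		if (wmark_ratio == 0) {
      			/* Water mark disabled: both marks sit at the max limit. */
      			*wmark_high = max;
      			*wmark_low = max;
      			return;
      		}
      		*wmark_high = max * wmark_ratio / 100;
      		/* wmark_low trails wmark_high by the empirical wmark_high >> 8. */
      		*wmark_low = *wmark_high - (*wmark_high >> 8);
      	}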
      
      If wmark_ratio is set, then when charging a page, if usage is greater than
      wmark_high (meaning the available memory of the memcg is low), a work item
      is scheduled to do background page reclaim until memory usage is reduced
      to wmark_low, if possible.
      
      Define a dedicated unbound workqueue for scheduling water mark reclaim
      work items.
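      
      A rough sketch of the charge-path trigger and the unbound workqueue; the
      memcg field and function names here are illustrative, while the real patch
      wires this into the memcg charge path and reclaims down to wmark_low from
      the work function:
      
      	static struct workqueue_struct *memcg_wmark_wq;
      
      	/* Created once at init time; unbound so reclaim can run on any CPU. */
      	static int memcg_wmark_wq_init(void)
      	{
      		memcg_wmark_wq = alloc_workqueue("memcg_wmark",
      						 WQ_UNBOUND | WQ_MEM_RECLAIM, 0);
      		return memcg_wmark_wq ? 0 : -ENOMEM;
      	}
      
      	/* Called from the charge path (illustrative). */
      	static void memcg_check_wmark(struct mem_cgroup *memcg)
      	{
      		if (page_counter_read(&memcg->memory) > memcg->wmark_high)
      			queue_work(memcg_wmark_wq, &memcg->wmark_work);
      	}
      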
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
  2. 28 November 2019: 2 commits
  3. 20 November 2019: 9 commits
  4. 07 November 2019: 3 commits
    • blkcg: add tools/cgroup/iocost_coef_gen.py · 2868b62d
      Authored by Tejun Heo
      commit 8504dea783b044cab620acbaef87b86ee84646fe upstream.
      
      Add a script which can be used to generate device-specific iocost
      linear model coefficients.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • blkcg: implement blk-iocost · abd942b2
      Authored by Tejun Heo
      commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.
      
      This patchset implements an IO cost model based, work-conserving
      proportional controller.
      
      While io.latency provides the capability to comprehensively prioritize
      and protect IOs depending on the cgroups, its protection is binary -
      the lowest latency target cgroup which is suffering is protected at
      the cost of all others.  In many use cases including stacking multiple
      workload containers in a single system, it's necessary to distribute
      IO capacity with better granularity.
      
      One challenge of controlling IO resources is the lack of a trivially
      observable cost metric.  The most common metrics - bandwidth and iops
      - can be off by orders of magnitude depending on the device type and
      IO pattern.  However, the cost isn't a complete mystery.  Given
      several key attributes, we can make fairly reliable predictions on how
      expensive a given stream of IOs would be, at least compared to other
      IO patterns.
      
      The function which determines the cost of a given IO is the IO cost
      model for the device.  This controller distributes IO capacity based
      on the costs estimated by such a model.  The more accurate the cost
      model, the better, but the controller adapts based on IO completion
      latency, and as long as the relative costs across different IO
      patterns are consistent and sensible, it will adapt to the actual
      performance of the device.
      
      Currently, the only implemented cost model is a simple linear one with
      a few sets of default parameters for different classes of device.
      This covers most common devices reasonably well.  All the
      infrastructure to tune and add different cost models is already in
      place and a later patch will also allow using bpf progs for cost
      models.
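      
      As a rough illustration of the linear model idea, the cost of one IO can be
      thought of as a fixed seek/command term plus a per-page transfer term.  The
      structure and coefficient names below are made up for this example; the
      real parameters and scaling live in blk-iocost.c:
      
      	struct iocost_linear_coefs {
      		u64 seq_io_cost;	/* fixed cost of one sequential IO */
      		u64 rand_io_cost;	/* fixed cost of one random IO */
      		u64 page_cost;		/* cost per page transferred */
      	};
      
      	static u64 iocost_estimate(const struct iocost_linear_coefs *c,
      				   bool is_seq, u64 pages)
      	{
      		u64 cost = is_seq ? c->seq_io_cost : c->rand_io_cost;
      
      		return cost + c->page_cost * pages;	/* linear in IO size */
      	}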
      
      Please see the top comment in blk-iocost.c and documentation for
      more details.
      
      v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
          for a divide-by-zero bug in current_hweight() triggered by zero
          inuse_sum.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Andy Newell <newella@fb.com>
      Cc: Josef Bacik <jbacik@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      [Joseph: fix conflicts with ioc_rqos_throttle()]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
    • cgroup: add cgroup_parse_float() · cb2f6e75
      Authored by Tejun Heo
      commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.
      
      cgroup already uses floating point for percent[ile] numbers and there
      are several controllers which want to take them as input.  Add a
      generic parse helper to handle inputs.
      
      Update the interface convention documentation about the use of
      percentage numbers.  While at it, also clarify the default time unit.
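      
      A hedged usage sketch, assuming the upstream helper signature
      int cgroup_parse_float(const char *input, unsigned dec_shift, s64 *v),
      where dec_shift selects how many decimal places to keep, as a controller's
      write handler might call it:
      
      	s64 ratio;
      
      	/* With dec_shift == 2, the string "12.34" parses to 1234. */
      	if (cgroup_parse_float("12.34", 2, &ratio))
      		return -EINVAL;
      	/* ratio now holds the percentage as a fixed-point value scaled by 100. */
      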
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
  5. 01 November 2019: 1 commit
    • sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices · 192fa322
      Authored by Dave Chiluk
      commit de53fd7aedb100f03e5d2231cfce0e4993282425 upstream.
      
      It has been observed that highly-threaded, non-cpu-bound applications
      running under cpu.cfs_quota_us constraints can hit a high percentage of
      periods throttled while simultaneously not consuming the allocated
      amount of quota. This use case is typical of user-interactive non-cpu
      bound applications, such as those running in kubernetes or mesos when
      run on multiple cpu cores.
      
      This has been root caused to cpu-local run queues being allocated per-cpu
      bandwidth slices, and then not fully using those slices within the period,
      at which point the slice and quota expire.  This expiration of unused
      slices results in applications not being able to utilize the quota for
      which they are allocated.
      
      The non-expiration of per-cpu slices was recently fixed by
      'commit 512ac999 ("sched/fair: Fix bandwidth timer clock drift
      condition")'. Prior to that it appears that this had been broken since
      at least 'commit 51f2176d ("sched/fair: Fix unlocked reads of some
      cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
      added the following conditional which resulted in slices never being
      expired.
      
      if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
              /* extend local deadline, drift is bounded above by 2 ticks */
              cfs_rq->runtime_expires += TICK_NSEC;
      
      Because this was broken for nearly 5 years, and has recently been fixed
      and is now being noticed by many users running kubernetes
      (https://github.com/kubernetes/kubernetes/issues/67577) it is my opinion
      that the mechanisms around expiring runtime should be removed
      altogether.
      
      This allows quota already allocated to per-cpu run-queues to live longer
      than the period boundary. This allows threads on runqueues that do not
      use much CPU to continue to use their remaining slice over a longer
      period of time than cpu.cfs_period_us. However, this helps prevent the
      above condition of hitting throttling while also not fully utilizing
      your cpu quota.
      
      This theoretically allows a machine to use slightly more than its
      allotted quota in some periods.  This overflow would be bounded by the
      remaining quota left on each per-cpu runqueue.  This is typically no
      more than min_cfs_rq_runtime=1ms per cpu.  For CPU bound tasks this will
      change nothing, as they should theoretically fully utilize all of their
      quota in each period. For user-interactive tasks as described above this
      provides a much better user/application experience as their cpu
      utilization will more closely match the amount they requested when they
      hit throttling. This means that cpu limits no longer strictly apply per
      period for non-cpu bound applications, but that they are still accurate
      over longer timeframes.
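      
      To put a rough number on that bound, under the commit's own 80-CPU example
      and the default min_cfs_rq_runtime of 1ms (a back-of-the-envelope sketch,
      not code from the patch):
      
      	/* Rough aggregate upper bound on unexpired slices across all CPUs. */
      	unsigned int nr_cpus = 80;
      	u64 per_cpu_carryover_ns = 1 * NSEC_PER_MSEC;		/* min_cfs_rq_runtime */
      	u64 worst_case_ns = nr_cpus * per_cpu_carryover_ns;	/* roughly 80ms total */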
      
      This greatly improves performance of high-thread-count, non-cpu bound
      applications with low cfs_quota_us allocation on high-core-count
      machines.  In the case of an artificial testcase (10ms/100ms of quota on
      an 80-CPU machine), this commit resulted in an almost 30x performance
      improvement, while still maintaining correct cpu quota restrictions.
      That testcase is available at https://github.com/indeedeng/fibtest.
      
      Fixes: 512ac999 ("sched/fair: Fix bandwidth timer clock drift condition")
      Signed-off-by: Dave Chiluk <chiluk+linux@indeed.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: Phil Auld <pauld@redhat.com>
      Reviewed-by: Ben Segall <bsegall@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: John Hammond <jhammond@indeed.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kyle Anderson <kwa@yelp.com>
      Cc: Gabriel Munos <gmunoz@netflix.com>
      Cc: Peter Oskolkov <posk@posk.io>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: Brendan Gregg <bgregg@netflix.com>
      Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com
      Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
      Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
  6. 30 October 2019: 9 commits
  7. 18 October 2019: 1 commit
  8. 12 October 2019: 3 commits
  9. 21 September 2019: 1 commit
  10. 16 September 2019: 5 commits
  11. 29 August 2019: 1 commit
    • x86/CPU/AMD: Clear RDRAND CPUID bit on AMD family 15h/16h · e063b03b
      Authored by Tom Lendacky
      commit c49a0a80137c7ca7d6ced4c812c9e07a949f6f24 upstream.
      
      There have been reports of RDRAND issues after resuming from suspend on
      some AMD family 15h and family 16h systems. This issue stems from a BIOS
      not performing the proper steps during resume to ensure RDRAND continues
      to function properly.
      
      RDRAND support is indicated by CPUID Fn00000001_ECX[30]. This bit can be
      reset by clearing MSR C001_1004[62]. Any software that checks for RDRAND
      support using CPUID, including the kernel, will believe that RDRAND is
      not supported.
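      
      A hedged sketch of the clearing step described above; the MSR address and
      bit position come from the text, while the macro and function names here
      are illustrative rather than the ones used in the actual patch:
      
      	#define MSR_AMD_CPUID_EXT_FEATURES	0xc0011004	/* MSR C001_1004 */
      
      	static void clear_rdrand_cpuid_bit_sketch(void)
      	{
      		u64 val;
      
      		/* Clearing bit 62 hides CPUID Fn00000001_ECX[30] (RDRAND). */
      		rdmsrl(MSR_AMD_CPUID_EXT_FEATURES, val);
      		val &= ~BIT_ULL(62);
      		wrmsrl(MSR_AMD_CPUID_EXT_FEATURES, val);
      	}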
      
      Update the CPU initialization to clear the RDRAND CPUID bit for any family
      15h and 16h processor that supports RDRAND. If it is known that the family
      15h or family 16h system does not have an RDRAND resume issue or that the
      system will not be placed in suspend, the "rdrand=force" kernel parameter
      can be used to stop the clearing of the RDRAND CPUID bit.
      
      Additionally, update the suspend and resume path to save and restore the
      MSR C001_1004 value to ensure that the RDRAND CPUID setting remains in
      place after resuming from suspend.
      
      Note that clearing the RDRAND CPUID bit does not prevent a processor
      that normally supports the RDRAND instruction from executing it.  So any
      code that determined the support based on family and model won't #UD.
      Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Cc: Andrew Cooper <andrew.cooper3@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chen Yu <yu.c.chen@intel.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Josh Poimboeuf <jpoimboe@redhat.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: "linux-doc@vger.kernel.org" <linux-doc@vger.kernel.org>
      Cc: "linux-pm@vger.kernel.org" <linux-pm@vger.kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: <stable@vger.kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "x86@kernel.org" <x86@kernel.org>
      Link: https://lkml.kernel.org/r/7543af91666f491547bd86cebb1e17c66824ab9f.1566229943.git.thomas.lendacky@amd.com
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. 07 August 2019: 2 commits
    • Documentation: Add swapgs description to the Spectre v1 documentation · 7634b9cd
      Authored by Josh Poimboeuf
      commit 4c92057661a3412f547ede95715641d7ee16ddac upstream
      
      Add documentation to the Spectre document about the new swapgs variant of
      Spectre v1.
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/speculation: Enable Spectre v1 swapgs mitigations · 23e7a7b3
      Authored by Josh Poimboeuf
      commit a2059825986a1c8143fd6698774fa9d83733bb11 upstream
      
      The previous commit added macro calls in the entry code which mitigate the
      Spectre v1 swapgs issue if the X86_FEATURE_FENCE_SWAPGS_* features are
      enabled.  Enable those features where applicable.
      
      The mitigations may be disabled with "nospectre_v1" or "mitigations=off".
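      
      A rough sketch of what enabling those features can look like in the
      mitigation selection code; the exact conditions in the real patch depend on
      the CPU vendor, PTI state, and command line, and only the feature flag
      names are taken from the text above:
      
      	static void __init spectre_v1_swapgs_mitigation_sketch(void)
      	{
      		/* Fence after a swapgs taken on entry from user space. */
      		setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_USER);
      
      		/*
      		 * Fence on kernel-entry paths too, for CPUs where a swapgs
      		 * coming from kernel space can be speculatively executed
      		 * (see the Intel note below).
      		 */
      		setup_force_cpu_cap(X86_FEATURE_FENCE_SWAPGS_KERNEL);
      	}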
      
      There are different features which can affect the risk of attack:
      
      - When FSGSBASE is enabled, unprivileged users are able to place any
        value in GS, using the wrgsbase instruction.  This means they can
        write a GS value which points to any value in kernel space, which can
        be useful with the following gadget in an interrupt/exception/NMI
        handler:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	// dependent load or store based on the value of %reg1
      	// for example: mov (%reg1), %reg2
      
        If an interrupt is coming from user space, and the entry code
        speculatively skips the swapgs (due to user branch mistraining), it
        may speculatively execute the GS-based load and a subsequent dependent
        load or store, exposing the kernel data to an L1 side channel leak.
      
        Note that, on Intel, a similar attack exists in the above gadget when
        coming from kernel space, if the swapgs gets speculatively executed to
        switch back to the user GS.  On AMD, this variant isn't possible
        because swapgs is serializing with respect to future GS-based
        accesses.
      
        NOTE: The FSGSBASE patch set hasn't been merged yet, so the above case
      	doesn't exist quite yet.
      
      - When FSGSBASE is disabled, the issue is mitigated somewhat because
        unprivileged users must use prctl(ARCH_SET_GS) to set GS, which
        restricts GS values to user space addresses only.  That means the
        gadget would need an additional step, since the target kernel address
        needs to be read from user space first.  Something like:
      
      	if (coming from user space)
      		swapgs
      	mov %gs:<percpu_offset>, %reg1
      	mov (%reg1), %reg2
      	// dependent load or store based on the value of %reg2
      	// for example: mov (%reg2), %reg3
      
        It's difficult to audit for this gadget in all the handlers, so while
        there are no known instances of it, it's entirely possible that it
        exists somewhere (or could be introduced in the future).  Without
        tooling to analyze all such code paths, consider it vulnerable.
      
        Effects of SMAP on the !FSGSBASE case:
      
        - If SMAP is enabled, and the CPU reports RDCL_NO (i.e., not
          susceptible to Meltdown), the kernel is prevented from speculatively
          reading user space memory, even L1 cached values.  This effectively
          disables the !FSGSBASE attack vector.
      
        - If SMAP is enabled, but the CPU *is* susceptible to Meltdown, SMAP
          still prevents the kernel from speculatively reading user space
          memory.  But it does *not* prevent the kernel from reading the
          user value from L1, if it has already been cached.  This is probably
          only a small hurdle for an attacker to overcome.
      
      Thanks to Dave Hansen for contributing the speculative_smap() function.
      
      Thanks to Andrew Cooper for providing the inside scoop on whether swapgs
      is serializing on AMD.
      
      [ tglx: Fixed the USER fence decision and polished the comment as suggested
        	by Dave Hansen ]
      Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Dave Hansen <dave.hansen@intel.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. 26 July 2019: 2 commits
    • dt-bindings: allow up to four clocks for orion-mdio · 404f59e2
      Authored by Josua Mayer
      commit 80785f5a22e9073e2ded5958feb7f220e066d17b upstream.
      
      Armada 8040 needs four clocks to be enabled for MDIO accesses to work.
      Update the binding to allow the extra clock to be specified.
      
      Cc: stable@vger.kernel.org
      Fixes: 6d6a331f ("dt-bindings: allow up to three clocks for orion-mdio")
      Reviewed-by: Andrew Lunn <andrew@lunn.ch>
      Signed-off-by: Josua Mayer <josua@solid-run.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • x86/atomic: Fix smp_mb__{before,after}_atomic() · 9e0bcb59
      Authored by Peter Zijlstra
      [ Upstream commit 69d927bba39517d0980462efc051875b7f4db185 ]
      
      Recent probing at the Linux Kernel Memory Model uncovered a
      'surprise'. Strongly ordered architectures where the atomic RmW
      primitive implies full memory ordering and
      smp_mb__{before,after}_atomic() are a simple barrier() (such as x86)
      fail for:
      
      	*x = 1;
      	atomic_inc(u);
      	smp_mb__after_atomic();
      	r0 = *y;
      
      Because, while the atomic_inc() implies memory order, it
      (surprisingly) does not provide a compiler barrier. This then allows
      the compiler to re-order like so:
      
      	atomic_inc(u);
      	*x = 1;
      	smp_mb__after_atomic();
      	r0 = *y;
      
      Which the CPU is then allowed to re-order (under TSO rules) like:
      
      	atomic_inc(u);
      	r0 = *y;
      	*x = 1;
      
      And this very much was not intended. Therefore strengthen the atomic
      RmW ops to include a compiler barrier.
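      
      As a hedged illustration of what "include a compiler barrier" means on x86,
      one way to do this is to add a "memory" clobber to the LOCK-prefixed asm;
      a sketch of the idea, not the exact upstream diff:
      
      	/*
      	 * x86 arch_atomic_inc() after the change: the "memory" clobber makes
      	 * the LOCK-prefixed asm act as a compiler barrier as well; before the
      	 * change the clobber list was empty, so only CPU ordering was implied.
      	 */
      	static __always_inline void arch_atomic_inc(atomic_t *v)
      	{
      		asm volatile(LOCK_PREFIX "incl %0"
      			     : "+m" (v->counter) : : "memory");
      	}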
      
      NOTE: atomic_{or,and,xor} and the bitops already had the compiler
      barrier.
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>