提交 · 37cf14520105e776a5869f48fedf8e267ff03f8d · openanolis / cloud-kernel

22 2月, 2020 1 次提交

alinux: doc: Add Documentation/alibaba/interfaces.rst · 37cf1452

由 Xunlei Pang 提交于 9月 04, 2019

This file collects all the interfaces specific to Alibaba Cloud Kernel.

Add "memory.wmark_min_adj", "memory.exstat", and "zombie memcgs reaper"
descriptions.
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

37cf1452

19 1月, 2020 1 次提交

perf/smmuv3: Enable HiSilicon Erratum 162001800 quirk · 8c6138de

由 Shameer Kolothum 提交于 3月 26, 2019

commit 24062fe85860debfdae0eeaa495f27c9971ec163 upstream

HiSilicon erratum 162001800 describes the limitation of
SMMUv3 PMCG implementation on HiSilicon Hip08 platforms.

On these platforms, the PMCG event counter registers
(SMMU_PMCG_EVCNTRn) are read only and as a result it
is not possible to set the initial counter period value
on event monitor start.

To work around this, the current value of the counter
is read and used for delta calculations. OEM information
from ACPI header is used to identify the affected hardware
platforms.
Signed-off-by: NShameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Reviewed-by: NHanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
Acked-by: NLorenzo Pieralisi <lorenzo.pieralisi@arm.com>
[will: update silicon-errata.txt and add reason string to acpi match]
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: Zou Cao<zoucao@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

8c6138de

14 1月, 2020 3 次提交

mm: don't raise MEMCG_OOM event due to failed high-order allocation · 9ddf0a56

由 Roman Gushchin 提交于 10月 26, 2018

commit 7a1adfddaf0d11a39fdcaf6e82a88e9c0586e08b upstream.

It was reported that on some of our machines containers were restarted
with OOM symptoms without an obvious reason.  Despite there were almost no
memory pressure and plenty of page cache, MEMCG_OOM event was raised
occasionally, causing the container management software to think, that OOM
has happened.  However, no tasks have been killed.

The following investigation showed that the problem is caused by a failing
attempt to charge a high-order page.  In such case, the OOM killer is
never invoked.  As shown below, it can happen under conditions, which are
very far from a real OOM: e.g.  there is plenty of clean page cache and no
memory pressure.

There is no sense in raising an OOM event in this case, as it might
confuse a user and lead to wrong and excessive actions (e.g.  restart the
workload, as in my case).

Let's look at the charging path in try_charge().  If the memory usage is
about memory.max, which is absolutely natural for most memory cgroups, we
try to reclaim some pages.  Even if we were able to reclaim enough memory
for the allocation, the following check can fail due to a race with
another concurrent allocation:

    if (mem_cgroup_margin(mem_over_limit) >= nr_pages)
        goto retry;

For regular pages the following condition will save us from triggering
the OOM:

   if (nr_reclaimed && nr_pages <= (1 << PAGE_ALLOC_COSTLY_ORDER))
       goto retry;

But for high-order allocation this condition will intentionally fail.  The
reason behind is that we'll likely fall to regular pages anyway, so it's
ok and even preferred to return ENOMEM.

In this case the idea of raising MEMCG_OOM looks dubious.

Fix this by moving MEMCG_OOM raising to mem_cgroup_oom() after allocation
order check, so that the event won't be raised for high order allocations.
This change doesn't affect regular pages allocation and charging.

Link: http://lkml.kernel.org/r/20181004214050.7417-1-guro@fb.comSigned-off-by: NRoman Gushchin <guro@fb.com>
Acked-by: NDavid Rientjes <rientjes@google.com>
Acked-by: NMichal Hocko <mhocko@kernel.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

9ddf0a56

mm, memcg: introduce memory.events.local · 5fcab459

由 Shakeel Butt 提交于 12月 16, 2019

commit 1e577f970f66a53d429cbee37b36177c9712f488 upstream.

The memory controller in cgroup v2 exposes memory.events file for each
memcg which shows the number of times events like low, high, max, oom
and oom_kill have happened for the whole tree rooted at that memcg.
Users can also poll or register notification to monitor the changes in
that file. Any event at any level of the tree rooted at memcg will
notify all the listeners along the path till root_mem_cgroup. There are
existing users which depend on this behavior.

However there are users which are only interested in the events
happening at a specific level of the memcg tree and not in the events in
the underlying tree rooted at that memcg. One such use-case is a
centralized resource monitor which can dynamically adjust the limits of
the jobs running on a system. The jobs can create their sub-hierarchy
for their own sub-tasks. The centralized monitor is only interested in
the events at the top level memcgs of the jobs as it can then act and
adjust the limits of the jobs. Using the current memory.events for such
centralized monitor is very inconvenient. The monitor will keep
receiving events which it is not interested and to find if the received
event is interesting, it has to read memory.event files of the next
level and compare it with the top level one. So, let's introduce
memory.events.local to the memcg which shows and notify for the events
at the memcg level.

Now, does memory.stat and memory.pressure need their local versions. IMHO
no due to the no internal process contraint of the cgroup v2. The
memory.stat file of the top level memcg of a job shows the stats and
vmevents of the whole tree. The local stats or vmevents of the top level
memcg will only change if there is a process running in that memcg but v2
does not allow that. Similarly for memory.pressure there will not be any
process in the internal nodes and thus no chance of local pressure.

Link: http://lkml.kernel.org/r/20190527174643.209172-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NRoman Gushchin <guro@fb.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
Cc: Chris Down <chris@chrisdown.name>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

5fcab459

HYGON: i2c-piix4: Add Hygon Dhyana SMBus support · 299b8f26

由 Pu Wen 提交于 1月 10, 2020

commit 24beb83ad289c68bce7c01351cb90465bbb1940a upstream.

The Hygon Dhyana CPU has the SMBus device with PCI device ID 0x790b,
which is the same as AMD CZ SMBus device. So add Hygon Dhyana support
to the i2c-piix4 driver by using the code path of AMD.
Signed-off-by: NPu Wen <puwen@hygon.cn>
Reviewed-by: NJean Delvare <jdelvare@suse.de>
Signed-off-by: NWolfram Sang <wsa@the-dreams.de>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

299b8f26

07 1月, 2020 1 次提交

fs: add an iopoll method to struct file_operations · 22a4f6b4

由 Christoph Hellwig 提交于 11月 22, 2018

commit fb7e160019f4abb4082740bfeb27a38f6389c745 upstream.

This new methods is used to explicitly poll for I/O completion for an
iocb.  It must be called for any iocb submitted asynchronously (that
is with a non-null ki_complete) which has the IOCB_HIPRI flag set.

The method is assisted by a new ki_cookie field in struct iocb to store
the polling cookie.
Reviewed-by: NHannes Reinecke <hare@suse.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

22a4f6b4

02 1月, 2020 16 次提交

alinux: introduce psi_v1 boot parameter · 106824ae

由 Joseph Qi 提交于 12月 25, 2019

Instead using static kconfig CONFIG_PSI_CGROUP_V1, we introduce a boot
parameter psi_v1 to enable psi cgroup v1 support. Default it is
disabled, which means when passing psi=1 boot parameter, we only support
cgroup v2.
This is to keep consistent with other cgroup v1 features such as cgroup
writeback v1 (cgwb_v1).
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXunlei Pang <xlpang@linux.alibaba.com>

106824ae

alinux: mm: Support kidled · f55ac551

由 Gavin Shan 提交于 8月 30, 2019

This enables scanning pages in fixed interval to determine their access
frequency (hot/cold). The result is exported to user land on basis of
memory cgroup by "memory.idle_page_stats". The design is highlighted as
below:

   * A kernel thread is spawn when this feature is enabled by writing
     non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
     The thread sequentially scans the nodes and their pages that have
     been chained up in LRU list.

   * For each page, its corresponding age information is stored in the
     page flags or array in node. The age represents the scanning intervals
     in which the page isn't accessed. Also, the page flag (PG_idle) is
     leveraged. The page's age is increased by one if the idle flag isn't
     cleared in two consective scans. Otherwise, the page's age is cleared out.
     Also, the page's age information is cleared when it's free'd so that
     the stale age information won't be fetched when it's allocated.

   * Initially, the flag is set, while the access bit in its PTE is cleared
     out by the thread. In next scanning period, its PTE access bit is
     synchronized with the page flag: clear the flag if access bit is set.
     The flag is kept otherwise. For unmapped pages, the flag is cleared
     when it's accessed.

   * Eventually, the page's aging information is updated to the unstable
     bucket of its corresponding memory cgroup, taking as statistics. The
     unstable bucket (statistics) is copied to stable bucket when all pages
     in all nodes are scanned for once. The stable bucket (statistics) is
     exported to user land through "memory.idle_page_stats".

TESTING
=======

   * cgroup1, unmapped pagecache

     # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
     #
     # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
     # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
     # mkdir -p /cgroup/memory
     # mount -tcgroup -o memory /cgroup/memory
     # echo 1 > /cgroup/memory/memory.use_hierarchy
     # mkdir -p /cgroup/memory/test
     # echo 1 > /cgroup/memory/test/memory.use_hierarchy
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # dd if=/ext4/test.data of=/dev/null bs=1M count=128
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   0   0   134217728   0   0   0   0

   * cgroup1, mapped pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and access the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfei
       cfei   0   134217728   0   0   0   0   0   0

   * cgroup1, mapped and locked pagecache

     # < create same file and memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap the whole created file and mlock the area >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep cfui
       cfui   0   134217728   0   0   0   0   0   0

   * cgroup1, anonymous and locked area

     # < create memory cgroups as above >
     #
     # echo $$ > /cgroup/memory/test/cgroup.procs
     # < run program to mmap anonymous area and mlock it >
     # < wait a few minutes >
     # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0
     # cat /cgroup/memory/memory.idle_page_stats | grep csui
       csui   0   0   134217728   0   0   0   0   0

   * Rerun above test cases in cgroup2 and the results are no exceptional.
     However, the cgroups are populated in different way as below:

     # mkdir -p /cgroup
     # mount -tcgroup2 none /cgroup
     # echo "+memory" > /cgroup/cgroup.subtree_control
     # mkdir -p /cgroup/test
Signed-off-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

f55ac551

alinux: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 666beb72

由 Yang Shi 提交于 8月 17, 2019

Introduce a new interface, wmark_scale_factor, which defines the
distance between wmark_high and wmark_low.  The unit is in fractions of
10,000. The default value of 50 means the distance between wmark_high
and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
is 1000, or 10% of the max limit.

The distance between wmark_low and wmark_high have impact on how hard
memcg kswapd would reclaim.
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>

666beb72

alinux: mm: memcontrol: add background reclaim support for cgroupv2 · 2053fb86

由 Yang Shi 提交于 8月 14, 2019

Like v1, add background reclaim support for cgroup v2. The interfaces
are exactly same with v1. However, if high limit is setup for v2, the
water mark would be calculated by high limit instead of max limit.
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>

2053fb86

alinux: mm: memcontrol: support background async page reclaim · 8b371635

由 Yang Shi 提交于 8月 14, 2019

Currently when memory usage exceeds memory cgroup limit, memory cgroup
just can do sync direct reclaim.  This may incur unexpected stall on
some applications which are sensitive to latency.  Introduce background
async page reclaim mechanism, like what kswapd does.

Define memcg memory usage water mark by introducing wmark_ratio interface,
which is from 0 to 100 and represents percentage of max limit.  The
wmark_high is calculated by (max * wmark_ratio / 100), the wmark_low is
(wmark_high - wmark_high >> 8), which is an empirical value.  If wmark_ratio
is 0, it means water mark is disabled, both wmark_low and wmark_high is max,
which is the default value.

If wmark_ratio is setup, when charging page, if usage is greater than
wmark_high, which means the available memory of memcg is low, a work
would be scheduled to do background page reclaim until memory usage is
reduced to wmark_low if possible.

Define a dedicated unbound workqueue for scheduling water mark reclaim
works.
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>

8b371635

alinux: block: add counter to track io request's d2c time · 81389d70

由 Xiaoguang Wang 提交于 6月 19, 2019

Indeed tool iostat's await is not good enough, which is somewhat sketchy
and could not show request's latency on device driver's side.

Here we add a new counter to track io request's d2c time, also with this
patch, we can extend iostat to show this value easily.

Note:
I had checked how iostat is implemented, it just reads fields it needs,
so iostat won't be affected by this change, so does tsar.
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

81389d70

alinux: fs,ext4: remove projid limit when create hard link · 4f39f80c

由 zhangliguang 提交于 12月 27, 2018

This is a temporary workaround plan to avoid the limitation when
creating hard link cross two projids.
Signed-off-by: Nzhangliguang <zhangliguang@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

4f39f80c

device-dax: Add /sys/class/dax backwards compatibility · 87896a12

由 Dan Williams 提交于 8月 24, 2019

commit 730926c3b0998943654019f00296cf8e3b02277e upstream

On the expectation that some environments may not upgrade libdaxctl
(userspace component that depends on the /sys/class/dax hierarchy),
provide a default / legacy dax_pmem_compat driver. The dax_pmem_compat
driver implements the original /sys/class/dax sysfs layout rather than
/sys/bus/dax. When userspace is upgraded it can blacklist this module
and switch to the dax_pmem driver going forward.

CONFIG_DEV_DAX_PMEM_COMPAT and supporting code will be deleted according
to the dax_pmem entry in Documentation/ABI/obsolete/.
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>

87896a12

ICX: platform/x86: ISST: Update ioctl-number.txt for Intel Speed Select interface · c60e7b18

由 Srinivas Pandruvada 提交于 6月 26, 2019

commit 2ee5bfc1efc81179c73abcd33098dd2c86019146 upstream.

Reserve ioctl numbers for intel Speed Select Technology interface
drivers.
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NAndy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

c60e7b18

ICX: cpu/topology: Export die_id · c9e7ee49

由 Len Brown 提交于 5月 13, 2019

commit 0e344d8c709fe01d882fc0fb5452bedfe5eba67a upstream.

Export die_id in cpu topology, for the benefit of hardware that has
multiple-die/package.
Signed-off-by: NLen Brown <len.brown@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NIngo Molnar <mingo@kernel.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/e7d1caaf4fbd24ee40db6d557ab28d7d83298900.1557769318.git.len.brown@intel.comSigned-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

c9e7ee49

ICX: x86/topology: Add CPUID.1F multi-die/package support · 0a50b4c3

由 Len Brown 提交于 5月 13, 2019

commit 7745f03eb39587dd15a1fb26e6223678b8e906d2 upstream.

Some new systems have multiple software-visible die within each package.

Update Linux parsing of the Intel CPUID "Extended Topology Leaf" to handle
either CPUID.B, or the new CPUID.1F.

Add cpuinfo_x86.die_id and cpuinfo_x86.max_dies to store the result.

die_id will be non-zero only for multi-die/package systems.
Signed-off-by: NLen Brown <len.brown@intel.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Reviewed-by: NIngo Molnar <mingo@kernel.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: linux-doc@vger.kernel.org
Link: https://lkml.kernel.org/r/7b23d2d26d717b8e14ba137c94b70943f1ae4b5c.1557769318.git.len.brown@intel.comSigned-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

0a50b4c3

ICX: topology: Simplify cputopology.txt formatting and wording · e69ee0fb

由 Len Brown 提交于 2月 26, 2019

commit 3a1c779fb8f71e772e2145e68c262936ada815ed upstream.

Syntax only, no functional or semantic change.
Signed-off-by: NLen Brown <len.brown@intel.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: linux-doc@vger.kernel.org
Link: http://lkml.kernel.org/r/1ca56f8ea922a67f0017bd645912ea02a65a85ec.1551160674.git.len.brown@intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NYouquan Song <youquan.song@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

e69ee0fb

ICX: doc/mm: New documentation for memory performance · 1a838cd0

由 Keith Busch 提交于 3月 11, 2019

commit 13bac55ef7aef8ecb67ff3005d24b05a464d28ea upstream.

Platforms may provide system memory where some physical address ranges
perform differently than others, or is cached by the system on the
memory side.

Add documentation describing a high level overview of such systems and the
perforamnce and caching attributes the kernel provides for applications
wishing to query this information.
Reviewed-by: NMike Rapoport <rppt@linux.ibm.com>
Reviewed-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Tested-by: NBrice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NFan Du <fan.du@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

1a838cd0

ICX: node: Add memory-side caching attributes · 2969e478

由 Keith Busch 提交于 3月 11, 2019

commit acc02a109b0497e917c83f986a89c51e47d0022c upstream.

System memory may have caches to help improve access speed to frequently
requested address ranges. While the system provided cache is transparent
to the software accessing these memory ranges, applications can optimize
their own access based on cache attributes.

Provide a new API for the kernel to register these memory-side caches
under the memory node that provides it.

The new sysfs representation is modeled from the existing cpu cacheinfo
attributes, as seen from /sys/devices/system/cpu/<cpu>/cache/.  Unlike CPU
cacheinfo though, the node cache level is reported from the view of the
memory. A higher level number is nearer to the CPU, while lower levels
are closer to the last level memory.

The exported attributes are the cache size, the line size, associativity
indexing, and write back policy, and add the attributes for the system
memory caches to sysfs stable documentation.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: NBrice Goglin <Brice.Goglin@inria.fr>
Tested-by: NBrice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NFan Du <fan.du@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

2969e478

ICX: node: Add heterogenous memory access attributes · 52b6f021

由 Keith Busch 提交于 3月 11, 2019

commit e1cf33aafb8462c7d0a0e6349925870316f040ee upstream.

Heterogeneous memory systems provide memory nodes with different latency
and bandwidth performance attributes. Provide a new kernel interface
for subsystems to register the attributes under the memory target
node's initiator access class. If the system provides this information,
applications may query these attributes when deciding which node to
request memory.

The following example shows the new sysfs hierarchy for a node exporting
performance attributes:

  # tree -P "read*|write*"/sys/devices/system/node/nodeY/accessZ/initiators/
  /sys/devices/system/node/nodeY/accessZ/initiators/
  |-- read_bandwidth
  |-- read_latency
  |-- write_bandwidth
  `-- write_latency

The bandwidth is exported as MB/s and latency is reported in
nanoseconds. The values are taken from the platform as reported by the
manufacturer.

Memory accesses from an initiator node that is not one of the memory's
access "Z" initiator nodes linked in the same directory may observe
different performance than reported here. When a subsystem makes use
of this interface, initiators of a different access number may not have
the same performance relative to initiators in other access numbers, or
omitted from the any access class' initiators.

Descriptions for memory access initiator performance access attributes
are added to sysfs stable documentation.
Acked-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NBrice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NFan Du <fan.du@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

52b6f021

ICX: node: Link memory nodes to their compute nodes · c88dea57

由 Keith Busch 提交于 3月 11, 2019

commit 08d9dbe72b1f899468b2b34f9309e88a84f440f2 upstream.

Systems may be constructed with various specialized nodes. Some nodes
may provide memory, some provide compute devices that access and use
that memory, and others may provide both. Nodes that provide memory are
referred to as memory targets, and nodes that can initiate memory access
are referred to as memory initiators.

Memory targets will often have varying access characteristics from
different initiators, and platforms may have ways to express those
relationships. In preparation for these systems, provide interfaces for
the kernel to export the memory relationship among different nodes memory
targets and their initiators with symlinks to each other.

If a system provides access locality for each initiator-target pair, nodes
may be grouped into ranked access classes relative to other nodes. The
new interface allows a subsystem to register relationships of varying
classes if available and desired to be exported.

A memory initiator may have multiple memory targets in the same access
class. The target memory's initiators in a given class indicate the
nodes access characteristics share the same performance relative to other
linked initiator nodes. Each target within an initiator's access class,
though, do not necessarily perform the same as each other.

A memory target node may have multiple memory initiators. All linked
initiators in a target's class have the same access characteristics to
that target.

The following example show the nodes' new sysfs hierarchy for a memory
target node 'Y' with access class 0 from initiator node 'X':

  # symlinks -v /sys/devices/system/node/nodeX/access0/
  relative: /sys/devices/system/node/nodeX/access0/targets/nodeY -> ../../nodeY

  # symlinks -v /sys/devices/system/node/nodeY/access0/
  relative: /sys/devices/system/node/nodeY/access0/initiators/nodeX -> ../../nodeX

The new attributes are added to the sysfs stable documentation.
Reviewed-by: NJonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NBrice Goglin <Brice.Goglin@inria.fr>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NFan Du <fan.du@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

c88dea57

27 12月, 2019 12 次提交

blkcg: add tools/cgroup/iocost_coef_gen.py · 0581637c

由 Tejun Heo 提交于 8月 28, 2019

commit 8504dea783b044cab620acbaef87b86ee84646fe upstream.

Add a script which can be used to generate device-specific iocost
linear model coefficients.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0581637c

blkcg: implement blk-iocost · e383d72b

由 Tejun Heo 提交于 8月 28, 2019

commit 7caa47151ab2e644dd221f741ec7578d9532c9a3 upstream.

This patchset implements IO cost model based work-conserving
proportional controller.

While io.latency provides the capability to comprehensively prioritize
and protect IOs depending on the cgroups, its protection is binary -
the lowest latency target cgroup which is suffering is protected at
the cost of all others.  In many use cases including stacking multiple
workload containers in a single system, it's necessary to distribute
IO capacity with better granularity.

One challenge of controlling IO resources is the lack of trivially
observable cost metric.  The most common metrics - bandwidth and iops
- can be off by orders of magnitude depending on the device type and
IO pattern.  However, the cost isn't a complete mystery.  Given
several key attributes, we can make fairly reliable predictions on how
expensive a given stream of IOs would be, at least compared to other
IO patterns.

The function which determines the cost of a given IO is the IO cost
model for the device.  This controller distributes IO capacity based
on the costs estimated by such model.  The more accurate the cost
model the better but the controller adapts based on IO completion
latency and as long as the relative costs across differents IO
patterns are consistent and sensible, it'll adapt to the actual
performance of the device.

Currently, the only implemented cost model is a simple linear one with
a few sets of default parameters for different classes of device.
This covers most common devices reasonably well.  All the
infrastructure to tune and add different cost models is already in
place and a later patch will also allow using bpf progs for cost
models.

Please see the top comment in blk-iocost.c and documentation for
more details.

v2: Rebased on top of RQ_ALLOC_TIME changes and folded in Rik's fix
    for a divide-by-zero bug in current_hweight() triggered by zero
    inuse_sum.
Signed-off-by: NTejun Heo <tj@kernel.org>
Cc: Andy Newell <newella@fb.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
[Joseph: fix confilcts with ioc_rqos_throttle()]
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

e383d72b

cgroup: add cgroup_parse_float() · e4b4935f

由 Tejun Heo 提交于 5月 13, 2019

commit a5e112e6424adb77d953eac20e6936b952fd6b32 upstream.

cgroup already uses floating point for percent[ile] numbers and there
are several controllers which want to take them as input.  Add a
generic parse helper to handle inputs.

Update the interface convention documentation about the use of
percentage numbers.  While at it, also clarify the default time unit.
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

e4b4935f

iommu: Add "iommu.strict" command line option · eb59e389

由 Zhen Lei 提交于 9月 20, 2018

commit 68a6efe86f6a16e25556a2aff40efad41097b486 upstream

Add a generic command line option to enable lazy unmapping via IOVA
flush queues, which will initally be suuported by iommu-dma. This echoes
the semantics of "intel_iommu=strict" (albeit with the opposite default
value), but in the driver-agnostic fashion of "iommu.passthrough".
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
[rm: move handling out of SMMUv3 driver, clean up documentation]
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
[will: dropped broken printk when parsing command-line option]
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

eb59e389

psi: introduce psi monitor · c1cc0998

由 Suren Baghdasaryan 提交于 5月 14, 2019

commit 0e94682b73bfa6c44c98af7a26771c9c08c055d5 upstream.

Psi monitor aims to provide a low-latency short-term pressure detection
mechanism configurable by users.  It allows users to monitor psi metrics
growth and trigger events whenever a metric raises above user-defined
threshold within user-defined time window.

Time window and threshold are both expressed in usecs.  Multiple psi
resources with different thresholds and window sizes can be monitored
concurrently.

Psi monitors activate when system enters stall state for the monitored
psi metric and deactivate upon exit from the stall state.  While system
is in the stall state psi signal growth is monitored at a rate of 10
times per tracking window.  Min window size is 500ms, therefore the min
monitoring interval is 50ms.  Max window size is 10s with monitoring
interval of 1s.

When activated psi monitor stays active for at least the duration of one
tracking window to avoid repeated activations/deactivations when psi
signal is bouncing.

Notifications to the users are rate-limited to one per tracking window.

Link: http://lkml.kernel.org/r/20190319235619.260832-8-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Li Zefan <lizefan@huawei.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

c1cc0998

psi: clarify the units used in pressure files · 92f5c2ae

由 Waiman Long 提交于 4月 05, 2019

commit be87ab0afd680ac35486d16c0963c56d9be1d8a0 upstream.

The output of the PSI files show a bunch of numbers with no unit.  The
psi.txt documentation file also does not indicate what units are used.
One can only find out by looking at the source code.  The units are
percentage for the averages and useconds for the total.  Make the
information easier to find by documenting the units in psi.txt.

Link: http://lkml.kernel.org/r/20190402193810.3450-1-longman@redhat.comSigned-off-by: NWaiman Long <longman@redhat.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

92f5c2ae

psi: make disabling/enabling easier for vendor kernels · 8570f933

由 Johannes Weiner 提交于 11月 30, 2018

commit e0c274472d5d27f277af722e017525e0b33784cd upstream.

Mel Gorman reports a hackbench regression with psi that would prohibit
shipping the suse kernel with it default-enabled, but he'd still like
users to be able to opt in at little to no cost to others.

With the current combination of CONFIG_PSI and the psi_disabled bool set
from the commandline, this is a challenge.  Do the following things to
make it easier:

1. Add a config option CONFIG_PSI_DEFAULT_DISABLED that allows distros
   to enable CONFIG_PSI in their kernel but leave the feature disabled
   unless a user requests it at boot-time.

   To avoid double negatives, rename psi_disabled= to psi=.

2. Make psi_disabled a static branch to eliminate any branch costs
   when the feature is disabled.

In terms of numbers before and after this patch, Mel says:

: The following is a comparision using CONFIG_PSI=n as a baseline against
: your patch and a vanilla kernel
:
:                          4.20.0-rc4             4.20.0-rc4             4.20.0-rc4
:                 kconfigdisable-v1r1                vanilla        psidisable-v1r1
: Amean     1       1.3100 (   0.00%)      1.3923 (  -6.28%)      1.3427 (  -2.49%)
: Amean     3       3.8860 (   0.00%)      4.1230 *  -6.10%*      3.8860 (  -0.00%)
: Amean     5       6.8847 (   0.00%)      8.0390 * -16.77%*      6.7727 (   1.63%)
: Amean     7       9.9310 (   0.00%)     10.8367 *  -9.12%*      9.9910 (  -0.60%)
: Amean     12     16.6577 (   0.00%)     18.2363 *  -9.48%*     17.1083 (  -2.71%)
: Amean     18     26.5133 (   0.00%)     27.8833 *  -5.17%*     25.7663 (   2.82%)
: Amean     24     34.3003 (   0.00%)     34.6830 (  -1.12%)     32.0450 (   6.58%)
: Amean     30     40.0063 (   0.00%)     40.5800 (  -1.43%)     41.5087 (  -3.76%)
: Amean     32     40.1407 (   0.00%)     41.2273 (  -2.71%)     39.9417 (   0.50%)
:
: It's showing that the vanilla kernel takes a hit (as the bisection
: indicated it would) and that disabling PSI by default is reasonably
: close in terms of performance for this particular workload on this
: particular machine so;

Link: http://lkml.kernel.org/r/20181127165329.GA29728@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Tested-by: NMel Gorman <mgorman@techsingularity.net>
Reported-by: NMel Gorman <mgorman@techsingularity.net>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

8570f933

psi: cgroup support · d26da194

由 Johannes Weiner 提交于 10月 26, 2018

commit 2ce7135adc9ad081aa3c49744144376ac74fea60 upstream.

On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system, but
also that of individual jobs, so that we can ensure individual job health,
fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.  In kernels
with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
and io.pressure files that track aggregate pressure stall times for only
the tasks inside the cgroup.

Link: http://lkml.kernel.org/r/20180828172258.3185-10-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: NDaniel Drake <drake@endlessm.com>
Tested-by: NSuren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
[Joseph: fix apply conflicts in cgroup_create()]
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

Conflicts:
    kernel/cgroup/cgroup.c

d26da194

psi: pressure stall information for CPU, memory, and IO · 8b2bf079

由 Johannes Weiner 提交于 10月 26, 2018

commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.

When systems are overcommitted and resources become contended, it's hard
to tell exactly the impact this has on workload productivity, or how close
the system is to lockups and OOM kills.  In particular, when machines work
multiple jobs concurrently, the impact of overcommit in terms of latency
and throughput on the individual job can be enormous.

In order to maximize hardware utilization without sacrificing individual
job health or risk complete machine lockups, this patch implements a way
to quantify resource pressure in the system.

A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
expose the percentage of time the system is stalled on CPU, memory, or IO,
respectively.  Stall states are aggregate versions of the per-task delay
accounting delays:

       cpu: some tasks are runnable but not executing on a CPU
       memory: tasks are reclaiming, or waiting for swapin or thrashing cache
       io: tasks are waiting for io completions

These percentages of walltime can be thought of as pressure percentages,
and they give a general sense of system health and productivity loss
incurred by resource overcommit.  They can also indicate when the system
is approaching lockup scenarios and OOMs.

To do this, psi keeps track of the task states associated with each CPU
and samples the time they spend in stall states.  Every 2 seconds, the
samples are averaged across CPUs - weighted by the CPUs' non-idle time to
eliminate artifacts from unused CPUs - and translated into percentages of
walltime.  A running average of those percentages is maintained over 10s,
1m, and 5m periods (similar to the loadaverage).

[hannes@cmpxchg.org: doc fixlet, per Randy]
  Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
[hannes@cmpxchg.org: code optimization]
  Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
[hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
  Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
[hannes@cmpxchg.org: fix build]
  Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Tested-by: NDaniel Drake <drake@endlessm.com>
Tested-by: NSuren Baghdasaryan <surenb@google.com>
Cc: Christopher Lameter <cl@linux.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Johannes Weiner <jweiner@fb.com>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Enderborg <peter.enderborg@sony.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
[Joseph: fix apply conflicts in task_struct]
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

8b2bf079

alinux: net/tcp: Support tunable tcp timeout value in TIME-WAIT state · 0e7b6b7f

由 George Zhang 提交于 3月 28, 2018

By default the tcp_tw_timeout value is 60 seconds. The minimum is
1 second and the maximum is 600. This setting is useful on system under
heavy tcp load.

NOTE: set the tcp_tw_timeout below 60 seconds voilates the "quiet time"
restriction, and make your system into the risk of causing some old data
to be accepted as new or new data rejected as old duplicated by some
receivers.

Link: http://web.archive.org/web/20150102003320/http://tools.ietf.org/html/rfc793Signed-off-by: NGeorge Zhang <georgezhang@linux.alibaba.com>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0e7b6b7f

alinux: writeback: introduce cgwb_v1 boot param · 6f7afdf3

由 Jiufei Xue 提交于 5月 13, 2019

So far writeback control is supported for cgroup v1 interface. However
it also has some restrictions, so introduce a new kernel boot parameter
to control the behavior which is disabled by default. Users can enable
the writeback control for cgroup v1 with the command line "cgwb_v1".
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

6f7afdf3

alinux: drivers/virtio: add vring_force_dma_api boot param · 9c4a8e9d

由 Eryu Guan 提交于 12月 24, 2018

Prior to xdragon platform 20181230 release (e.g. 0930 release),
vring_use_dma_api() is required to return 'true' unconditionally.

Introduce a new kernel boot parameter called "vring_force_dma_api" to
control the behavior, boot xdragon host with "vring_force_dma_api"
command line to make ENI hotplug work, so that normal ECS hosts keep the
original behavior.
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NEryu Guan <eguan@linux.alibaba.com>

9c4a8e9d

18 12月, 2019 2 次提交

mm, thp, proc: report THP eligibility for each vma · c76adee3

由 Michal Hocko 提交于 12月 28, 2018

[ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]

Userspace falls short when trying to find out whether a specific memory
range is eligible for THP.  There are usecases that would like to know
that
http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
: This is used to identify heap mappings that should be able to fault thp
: but do not, and they normally point to a low-on-memory or fragmentation
: issue.

The only way to deduce this now is to query for hg resp.  nh flags and
confronting the state with the global setting.  Except that there is also
PR_SET_THP_DISABLE that might change the picture.  So the final logic is
not trivial.  Moreover the eligibility of the vma depends on the type of
VMA as well.  In the past we have supported only anononymous memory VMAs
but things have changed and shmem based vmas are supported as well these
days and the query logic gets even more complicated because the
eligibility depends on the mount option and another global configuration
knob.

Simplify the current state and report the THP eligibility in
/proc/<pid>/smaps for each existing vma.  Reuse
transparent_hugepage_enabled for this purpose.  The original
implementation of this function assumes that the caller knows that the vma
itself is supported for THP so make the core checks into
__transparent_hugepage_enabled and use it for existing callers.
__show_smap just use the new transparent_hugepage_enabled which also
checks the vma support status (please note that this one has to be out of
line due to include dependency issues).

[mhocko@kernel.org: fix oops with NULL ->f_mapping]
  Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NVlastimil Babka <vbabka@suse.cz>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Paul Oppenheimer <bepvte@gmail.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>

c76adee3

USB: documentation: flags on usb-storage versus UAS · 9e002edd

由 Oliver Neukum 提交于 11月 14, 2019

commit 65cc8bf99349f651a0a2cee69333525fe581f306 upstream.

Document which flags work storage, UAS or both
Signed-off-by: NOliver Neukum <oneukum@suse.com>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20191114112758.32747-4-oneukum@suse.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

9e002edd

13 12月, 2019 1 次提交

rtc: dt-binding: abx80x: fix resistance scale · 0281b46b

由 Baruch Siach 提交于 11月 19, 2018

[ Upstream commit 73852e56827f5cb5db9d6e8dd8191fc2f2e8f424 ]

The abracon,tc-resistor property value is in kOhm.
Signed-off-by: NBaruch Siach <baruch@tkos.co.il>
Signed-off-by: NAlexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>

0281b46b

05 12月, 2019 1 次提交

HID: doc: fix wrong data structure reference for UHID_OUTPUT · 16877902

由 Peter Hutterer 提交于 12月 13, 2018

[ Upstream commit 46b14eef59a8157138dc02f916a7f97c73b3ec53 ]
Signed-off-by: NPeter Hutterer <peter.hutterer@who-t.net>
Signed-off-by: NJiri Kosina <jkosina@suse.cz>
Signed-off-by: NSasha Levin <sashal@kernel.org>

16877902

01 12月, 2019 2 次提交

x86/speculation: Fix incorrect MDS/TAA mitigation status · 0af5ae26

由 Waiman Long 提交于 11月 15, 2019

commit 64870ed1b12e235cfca3f6c6da75b542c973ff78 upstream.

For MDS vulnerable processors with TSX support, enabling either MDS or
TAA mitigations will enable the use of VERW to flush internal processor
buffers at the right code path. IOW, they are either both mitigated
or both not. However, if the command line options are inconsistent,
the vulnerabilites sysfs files may not report the mitigation status
correctly.

For example, with only the "mds=off" option:

  vulnerabilities/mds:Vulnerable; SMT vulnerable
  vulnerabilities/tsx_async_abort:Mitigation: Clear CPU buffers; SMT vulnerable

The mds vulnerabilities file has wrong status in this case. Similarly,
the taa vulnerability file will be wrong with mds mitigation on, but
taa off.

Change taa_select_mitigation() to sync up the two mitigation status
and have them turned off if both "mds=off" and "tsx_async_abort=off"
are present.

Update documentation to emphasize the fact that both "mds=off" and
"tsx_async_abort=off" have to be specified together for processors that
are affected by both TAA and MDS to be effective.

 [ bp: Massage and add kernel-parameters.txt change too. ]

Fixes: 1b42f017415b ("x86/speculation/taa: Add mitigation for TSX Async Abort")
Signed-off-by: NWaiman Long <longman@redhat.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: linux-doc@vger.kernel.org
Cc: Mark Gross <mgross@linux.intel.com>
Cc: <stable@vger.kernel.org>
Cc: Pawan Gupta <pawan.kumar.gupta@linux.intel.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: x86-ml <x86@kernel.org>
Link: https://lkml.kernel.org/r/20191115161445.30809-2-longman@redhat.comSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>

0af5ae26

spi: uniphier: fix incorrect property items · adcb6d9f

由 Keiji Hayashibara 提交于 10月 24, 2018

[ Upstream commit 3511ba7d4ca6f39e2d060bb94e42a41ad1fee7bf ]

This commit fixes incorrect property because it was different
from the actual.
The parameters of '#address-cells' and '#size-cells' were removed,
and 'interrupts', 'pinctrl-names' and 'pinctrl-0' were added.

Fixes: 4dcd5c27 ("spi: add DT bindings for UniPhier SPI controller")
Signed-off-by: NKeiji Hayashibara <hayashibara.keiji@socionext.com>
Signed-off-by: NMark Brown <broonie@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>

adcb6d9f

openanolis / cloud-kernel 接近 2 年 前同步成功

openanolis / cloud-kernel
接近 2 年前同步成功