Commits · 7deb8737b2f3255c185f35d412ae3247c6fa1702 · openanolis / cloud-kernel

22 Apr, 2020 25 commits

spi: spi-dw: Add lock protect dw_spi rx/tx to prevent concurrent calls · 7deb8737

Committed by wuxu.wu on Jan 01, 2020

fix #25872428

commit 19b61392c5a852b4e8a0bf35aecb969983c5932d upstream

dw_spi_irq() and dw_spi_transfer_one() can be called concurrently.

I found a panic in dw_writer(): txw = *(u8 *)(dws->tx), when dw->tx==null,
dw->len==4, and dw->tx_end==1.

When a tpm driver message times out, dw_spi_irq() and dw_spi_transfer_one()
may access dw_spi concurrently, so I think the dw_spi structure lacks
protection.

In addition, dw_spi_transfer_one() sets the dw rx/tx buffers and then enables
the irq; its stores to dw rx/tx and the loads of dw rx/tx on other cores
handling the irq may be reordered.

	[ 1025.321302] Call trace:
	...
	[ 1025.321319]  __crash_kexec+0x98/0x148
	[ 1025.321323]  panic+0x17c/0x314
	[ 1025.321329]  die+0x29c/0x2e8
	[ 1025.321334]  die_kernel_fault+0x68/0x78
	[ 1025.321337]  __do_kernel_fault+0x90/0xb0
	[ 1025.321346]  do_page_fault+0x88/0x500
	[ 1025.321347]  do_translation_fault+0xa8/0xb8
	[ 1025.321349]  do_mem_abort+0x68/0x118
	[ 1025.321351]  el1_da+0x20/0x8c
	[ 1025.321362]  dw_writer+0xc8/0xd0
	[ 1025.321364]  interrupt_transfer+0x60/0x110
	[ 1025.321365]  dw_spi_irq+0x48/0x70
	...
Signed-off-by: wuxu.wu <wuxu.wu@huawei.com>
Link: https://lore.kernel.org/r/1577849981-31489-1-git-send-email-wuxu.wu@huawei.com
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
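
A minimal userspace sketch of the locking idea in this commit, not the driver's code: a pthread mutex stands in for the kernel spinlock, and `xfer_state`, `xfer_init()`, `xfer_start()`, and `xfer_write()` are hypothetical names. The setup path publishes both buffer pointers under the lock, and the interrupt-side writer only dereferences them while holding the same lock, so it should never see a half-initialized transfer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the tx part of the driver state. */
struct xfer_state {
	pthread_mutex_t buf_lock;	/* assumption: models a spinlock */
	const uint8_t *tx;		/* next byte to send, NULL when idle */
	const uint8_t *tx_end;		/* one past the last byte to send */
};

void xfer_init(struct xfer_state *s)
{
	assert(s);
	pthread_mutex_init(&s->buf_lock, NULL);
	s->tx = NULL;
	s->tx_end = NULL;
}

/* Transfer-setup path: both pointers change inside one critical section. */
void xfer_start(struct xfer_state *s, const uint8_t *buf, size_t len)
{
	pthread_mutex_lock(&s->buf_lock);
	s->tx = buf;
	s->tx_end = buf ? buf + len : NULL;
	pthread_mutex_unlock(&s->buf_lock);
}

/* Interrupt-side path: copy up to max bytes, return how many were copied. */
size_t xfer_write(struct xfer_state *s, uint8_t *fifo, size_t max)
{
	size_t n = 0;

	pthread_mutex_lock(&s->buf_lock);
	/* tx is checked under the lock, so a NULL buffer is never read. */
	while (s->tx && s->tx < s->tx_end && n < max)
		fifo[n++] = *s->tx++;
	pthread_mutex_unlock(&s->buf_lock);
	return n;
}
```

Calling `xfer_write()` before `xfer_start()` simply copies nothing, which is the userspace analogue of the writer no longer dereferencing a NULL tx pointer.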

iommu/amd: Fix IOMMU AVIC not properly update the is_run bit in IRTE · 649821ec

Committed by Suravee Suthikulpanit on Mar 30, 2020

fix #26319040

commit 730ad0ede130015a773229573559e97ba0943065 upstream.

Commit b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC
(de-)activation code") accidentally left out the ir_data pointer when
calling modify_irte_ga(), which causes the function amd_iommu_update_ga()
to return prematurely because struct amd_ir_data.ref is NULL, so
the "is_run" bit of the IRTE does not get updated properly.

This results in bad I/O performance since IOMMU AVIC always generates a GA Log
entry and notifies the IOMMU driver and KVM when it receives an interrupt from
the PCI pass-through device instead of directly injecting the interrupt into
the vCPU.

Fix by passing ir_data when calling modify_irte_ga() as done previously.

Fixes: b9c6ff94e43a ("iommu/amd: Re-factor guest virtual APIC (de-)activation code")
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Re-factor guest virtual APIC (de-)activation code · d5a28aba

Committed by Suthikulpanit, Suravee on Mar 30, 2020

fix #26319040

commit b9c6ff94e43a0ee053e0c1d983fba1ac4953b762 upstream.

Re-factor the logic for activating/deactivating guest virtual APIC mode
(GAM) into helper functions, and export them for other drivers (e.g. SVM)
to support run-time activation/deactivation of SVM AVIC.

Cc: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Lock code paths traversing protection_domain->dev_list · fd130609

Committed by Joerg Roedel on Mar 30, 2020

fix #26319040

commit 2a78f9962565e53b78363eaf516eb052009e8020 upstream.

The traversing of this list requires protection_domain->lock to be taken
to avoid nasty races with attach/detach code. Make sure the lock is held
on all code-paths traversing this list.
Reported-by: Filippo Sironi <sironi@amazon.de>
Fixes: 92d420ec ("iommu/amd: Relax locking in dma_ops path")
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Lock dev_data in attach/detach code paths · 30061d95

Committed by Joerg Roedel on Mar 30, 2020

fix #26319040

commit ab7b2577f0d119052b98b8d913bad369ac2760eb upstream.

Make sure that attaching and detaching a device can't race against each
other and protect the iommu_dev_data with a spin_lock in these code
paths.

Fixes: 92d420ec ("iommu/amd: Relax locking in dma_ops path")
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Check for busy devices earlier in attach_device() · 9fcb8428

Committed by tianyi on Mar 30, 2020

fix #26319040

commit 45e528d9c479aeef2d3d1db1e619b243f91e324f upstream.

Check early in attach_device whether the device is already attached to a
domain. This also simplifies the code path so that __attach_device() can
be removed.

Fixes: 92d420ec ("iommu/amd: Relax locking in dma_ops path")
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Take domain->lock for complete attach/detach path · 19082e2a

Committed by Joerg Roedel on Mar 30, 2020

fix #26319040

commit f6c0bfce271b2dd613e8b8e009eefe89c1f788e8 upstream.

The code-paths before __attach_device() and __detach_device() are called
also access and modify domain state, so take the domain lock there too.
This allows to get rid of the __detach_device() function.

Fixes: 92d420ec ("iommu/amd: Relax locking in dma_ops path")
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Remove amd_iommu_devtable_lock · 7ba34a94

Committed by Joerg Roedel on Mar 30, 2020

fix #26319040

commit 3a11905b69eb026402448c750f97a0eadfa76b08 upstream.

The lock is not necessary because the device table does not
contain shared state that needs protection. Locking is only
needed on an individual entry basis, and that needs to
happen on the iommu_dev_data level.

Fixes: 92d420ec ("iommu/amd: Relax locking in dma_ops path")
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

iommu/amd: Remove domain->updated · 1eec40e2

Committed by Joerg Roedel on Mar 30, 2020

fix #26319040

commit f15d9a992f901d4f22db868adf800844d1cac9f2 upstream.

iommu/amd: Remove domain->updated

This struct member was used to track whether a domain
change requires updates to the device-table and IOMMU cache
flushes. The problem is, that access to this field is racy
since locking in the common mapping code-paths has been
eliminated.

Move the updated field to the stack to get rid of all
potential races and remove the field from the struct.

Fixes: 92d420ec ("iommu/amd: Relax locking in dma_ops path")
Reviewed-by: Filippo Sironi <sironi@amazon.de>
Reviewed-by: Jerry Snitselaar <jsnitsel@redhat.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: tianyi <fujunkang@linux.alibaba.com>
Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com>
Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>

ACPI: PPTT: Consistently use unsigned int as parameter type · dfedf6da

Committed by Tian Tao on Dec 30, 2019

to #25688970

commit 643956e61ced913a2bbdcf2c95f3d03026b39d1c upstream

The fourth parameter 'level' of function 'acpi_find_cache_level()' is
a signed integer, but its caller 'acpi_find_cache_node()' passes that
parameter an unsigned integer.

Make the parameter type inconsistency go away.
Signed-off-by: Tian Tao <tiantao6@huawei.com>
Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
[ rjw: Subject/changelog ]
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI/PPTT: Add function to return ACPI 6.3 Identical tokens · 428adfc0

Committed by Jeremy Linton on Jun 26, 2019

to #25688970

commit 56855a99f3d0d1e9f1f4e24f5851f9bf14c83296 upstream

ACPI 6.3 adds a flag to indicate that child nodes are all
identical cores. This is useful to authoritatively determine
if a set of (possibly offline) cores are identical or not.

Since the flag doesn't give us a unique id we can generate
one and use it to create bitmaps of sibling nodes, or simply
in a loop to determine if a subset of cores are identical.
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI/PPTT: Modify node flag detection to find last IDENTICAL · 03bca653

Committed by Jeremy Linton on Jun 26, 2019

to #25688970

commit ed2b664fcc8073c09394393756df3fc86977bbac upstream

The ACPI specification implies that the IDENTICAL flag should be
set on all non leaf nodes where the children are identical.
This means that we need to be searching for the last node with
the identical flag set rather than the first one.

Since this flag is also dependent on the table revision, we
need to add a bit of extra code to verify the table revision,
and the next node's state in the traversal. Since we want to
avoid function pointers here, lets just special case
the IDENTICAL flag.
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: Hanjun Guo <hanjun.guo@linaro.org>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI: Fix comment typos · ef551233

Committed by Bjorn Helgaas on Mar 25, 2019

to #25688970

commit 603fadf33604a2e170eb833f99f569d3597f1f09 upstream

Fix some misspellings in comments.  No functional change intended.
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI: tables: Simplify PPTT leaf node detection · e3ff3ded

Committed by Jeremy Linton on Mar 01, 2019

to #25688970

commit 4909e6df213a7c3e5e282538356f31ab68828793 upstream

ACPI 6.3 bumps the PPTT table revision and adds a LEAF_NODE flag.

This allows us to avoid a second pass through the table to assure
that the node in question is a leaf.
Signed-off-by: Jeremy Linton <jeremy.linton@arm.com>
Reviewed-by: Sudeep Holla <sudeep.holla@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

ACPI/PPTT: Add acpi_pptt_warn_missing() to consolidate logs · 92cceebb

Committed by John Garry on Feb 08, 2019

to #25688970

commit 6cafe700b08cfd261a279b9e5ed99f3a346fe3b0 upstream

For a system using ACPI-based FW without a PPTT, we may get many warnings
about the lack of a PPTT, as shown:

root@(none)$ dmesg | grep -i pptt
[    0.010125] ACPI PPTT: No PPTT table found, cpu topology may be inaccurate
[    7.138339] ACPI PPTT: No PPTT table found, cache topology may be inaccurate
[    7.145368] ACPI PPTT: No PPTT table found, cache topology may be inaccurate

These logs are generated with pr_warn_once(), so the intention was for a
single log, but the logs overlap, so consolidate them.
Signed-off-by: John Garry <john.garry@huawei.com>
Reviewed-by: Jeremy Linton <jeremy.linton@arm.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: luanshi <zhangliguang@linux.alibaba.com>
Acked-by: zou cao <zoucao@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

drm/amdgpu: add VM eviction lock v3 · d3dc53f4

Committed by Christian König on Mar 18, 2020

to #25447038

commit b4ff0f8a85f3c523942e57b716e8722e7f6799cc upstream.

This allows to invalidate VM entries without taking the reservation
lock.

v3: use -EBUSY
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: mfkang <mfkang@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>

drm/amdgpu: move VM eviction decision into amdgpu_vm.c · 50638c21

Committed by Christian König on Mar 19, 2020

to #25447038

commit 6ceeb144b1d6952a36afa6c29718beac575f2a3f upstream.

When a page table needs to be evicted the VM code should
decide if that is possible or not.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: mfkang <mfkang@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>

drm/amdgpu: stop evicting busy PDs/PTs · c7723fa2

Committed by Christian König on Nov 07, 2018

to #25447038

commit 1bd4e4ca7bb8f681ff4e2b05c97ce975ccd781d6 upstream.

Otherwise we won't be able to cleanly handle page faults.
Signed-off-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Chunming Zhou <david1.zhou@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: mfkang <mfkang@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>

sysctl: handle overflow in proc_get_long · 662ef34f

Committed by Christian Brauner on Mar 07, 2019

fix #27124689

commit 7f2923c4f73f21cfd714d12a2d48de8c21f11cfe upstream.

proc_get_long() is a funny function.  It uses simple_strtoul() and for a
good reason.  proc_get_long() wants to always succeed the parse and
return the maybe incorrect value and the trailing characters to check
against a pre-defined list of acceptable trailing values.  However,
simple_strtoul() explicitly ignores overflows which can cause funny
things like the following to happen:

  echo 18446744073709551616 > /proc/sys/fs/file-max
  cat /proc/sys/fs/file-max
  0

(Which will cause your system to silently die behind your back.)

On the other hand kstrtoul() does do overflow detection but does not
return the trailing characters, and also fails the parse when anything
other than '\n' is a trailing character whereas proc_get_long() wants to
be more lenient.

Now, before adding another kstrtoul() function let's simply add a static
parse strtoul_lenient() which:
 - fails on overflow with -ERANGE
 - returns the trailing characters to the caller

The reason why we should fail on ERANGE is that we already do a partial
fail on overflow right now.  Namely, when the TMPBUFLEN is exceeded.  So
we already reject values such as 184467440737095516160 (21 chars) but
accept values such as 18446744073709551616 (20 chars) but both are
overflows.  So we should just always reject 64bit overflows and not
special-case this based on the number of chars.

Link: http://lkml.kernel.org/r/20190107222700.15954-2-christian@brauner.io
Signed-off-by: Christian Brauner <christian@brauner.io>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Luis Chamberlain <mcgrof@kernel.org>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Waiman Long <longman@redhat.com>
Cc: Dominik Brodowski <linux@dominikbrodowski.net>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
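
The two properties the message asks of the parse helper (fail on overflow with -ERANGE, hand the trailing characters back to the caller) can be sketched in userspace C. This is an illustration built on the C library's `strtoull()`, not the kernel's implementation; the name `strtoul_lenient_sketch()` and its signature are assumptions, and `strtoull()` is more permissive than the kernel helpers (it also accepts leading whitespace and a sign).

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Parse an unsigned number, tolerate trailing characters, and report
 * overflow instead of silently wrapping. Returns 0 on success or -ERANGE
 * if the value does not fit in an unsigned long long.
 */
int strtoul_lenient_sketch(const char *s, char **endp, unsigned int base,
			   unsigned long long *res)
{
	char *end;
	unsigned long long val;

	assert(s && res);
	errno = 0;
	val = strtoull(s, &end, (int)base);
	if (errno == ERANGE)
		return -ERANGE;	/* overflow: reject rather than wrap */
	if (endp)
		*endp = end;	/* trailing characters for the caller to vet */
	*res = val;
	return 0;
}
```

With this shape, an input such as "1024k" parses to 1024 and leaves the end pointer at "k", so the caller can still check the suffix against its list of acceptable trailing characters.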

configs: align configs of aarch64 to x86_64 · d56f1adb

Committed by Shile Zhang on Apr 14, 2020

to #26536261

Keep the common configs same between x86_64 and aarch64.
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

configs: update aarch64 config · 753594c9

Committed by Shile Zhang on Apr 14, 2020

to #24582903

Update aarch64 configs for the gcc version and other minor changes.
Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge · a10b7c40

Committed by Yihao Wu on Apr 06, 2020

fix #25707555

commit 43e33924c38e8faeb0c12035481cb150e602e39d linux-next

Deleting list entry within hlist_for_each_entry_safe is not safe unless
next pointer (tmp) is protected too. It's not, because once hash_lock
is released, cache_clean may delete the entry that tmp points to. Then
cache_purge can walk to a deleted entry and tries to double free it.

Fix this bug by holding only the deleted entry's reference.
Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: NeilBrown <neilb@suse.de>
[ cel: removed unused variable ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
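
A userspace sketch of the general pattern behind this kind of fix: when the lock has to be dropped for each entry, keep no saved "next" pointer across the unlock and re-read the list head after re-taking the lock. The types and function names here are hypothetical, a pthread mutex stands in for hash_lock, and a plain singly linked list stands in for the hash chains.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct entry {
	struct entry *next;
	int key;
};

struct cache {
	pthread_mutex_t lock;	/* assumption: models hash_lock */
	struct entry *head;
};

void cache_init(struct cache *c)
{
	assert(c);
	pthread_mutex_init(&c->lock, NULL);
	c->head = NULL;
}

int cache_add(struct cache *c, int key)
{
	struct entry *e = malloc(sizeof(*e));

	if (!e)
		return -1;
	e->key = key;
	pthread_mutex_lock(&c->lock);
	e->next = c->head;
	c->head = e;
	pthread_mutex_unlock(&c->lock);
	return 0;
}

/* Free every entry; returns how many were freed. */
size_t cache_purge(struct cache *c)
{
	size_t freed = 0;

	pthread_mutex_lock(&c->lock);
	while (c->head) {
		struct entry *e = c->head;

		c->head = e->next;		/* unlink while the lock is held */
		pthread_mutex_unlock(&c->lock);
		free(e);			/* only this thread can reach e now */
		freed++;
		pthread_mutex_lock(&c->lock);	/* loop re-reads head, not a stale next */
	}
	pthread_mutex_unlock(&c->lock);
	return freed;
}
```

Because each iteration starts from the current head, a concurrent cleaner that removes other entries while the lock is dropped cannot leave this loop holding a pointer to freed memory.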

sched: Avoid scale real weight down to zero · 9b83fd88

Committed by Michael Wang on Mar 27, 2020

fix #26198889

commit 26cf52229efc87e2effa9d788f9b33c40fb3358a linux-next

During our testing, we found a case where shares no longer
work correctly. The cgroup topology is like:

  /sys/fs/cgroup/cpu/A		(shares=102400)
  /sys/fs/cgroup/cpu/A/B	(shares=2)
  /sys/fs/cgroup/cpu/A/B/C	(shares=1024)

  /sys/fs/cgroup/cpu/D		(shares=1024)
  /sys/fs/cgroup/cpu/D/E	(shares=1024)
  /sys/fs/cgroup/cpu/D/E/F	(shares=1024)

The same benchmark is running in groups C and F, no other tasks are
running, and the benchmark is capable of consuming all the CPUs.

We expect group C to win more CPU resources since it could
enjoy all the shares of group A, but it's F who wins much more.

The reason is that group B has shares of 2. Since
A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
A->cfs_rq.load.weight becomes very small.

And in calc_group_shares() we calculate shares as:

  load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
  shares = (tg_shares * load) / tg_weight;

Since 'cfs_rq->load.weight' is too small, the load becomes 0
after scaling down. Although 'tg_shares' is 102400, the shares of the se
which stands for group A on the root cfs_rq become 2.

Meanwhile the se of D on the root cfs_rq is far bigger than 2, so it
wins the battle.

Thus when scale_load_down() scales the real weight down to 0, it's no
longer telling the real story; the caller gets the wrong
information and the calculation becomes buggy.

This patch adds a check in scale_load_down() so that the real weight will
be >= MIN_SHARES after scaling. With the patch applied, group C wins as
expected.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Michael Wang <yun.wang@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
Acked-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
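
The effect of the clamp can be illustrated with a small userspace sketch. The constants below are assumptions chosen to mirror the kernel names used in the message (a 10-bit fixed-point shift and MIN_SHARES of 2), and treating a weight of exactly 0 as "stay at 0" is also an assumption of this sketch; the function names are hypothetical.

```c
#include <assert.h>

/* Assumed values, named after the kernel constants the message refers to. */
#define SCHED_FIXEDPOINT_SHIFT	10
#define MIN_SHARES		2UL

/* Plain shift: any non-zero weight below 1024 collapses to 0. */
unsigned long scale_load_down_plain(unsigned long w)
{
	return w >> SCHED_FIXEDPOINT_SHIFT;
}

/* Clamped variant: a non-zero weight never scales below MIN_SHARES. */
unsigned long scale_load_down_clamped(unsigned long w)
{
	unsigned long scaled;

	if (w == 0)
		return 0;	/* an empty queue keeps a weight of 0 */
	scaled = w >> SCHED_FIXEDPOINT_SHIFT;
	return scaled < MIN_SHARES ? MIN_SHARES : scaled;
}
```

For a small weight such as 512, the plain shift yields 0 while the clamped variant yields 2, so a later `max(load, ...)` or share computation still sees a non-zero load.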

sched/fair: Fix race between runtime distribution and assignment · 70a23044

Committed by Huaixin Chang on Mar 24, 2020

fix #25892693

commit 26a8b12747c975b33b4a82d62e4a307e1c07f31b upstream

Currently, there is a potential race between distribute_cfs_runtime()
and assign_cfs_rq_runtime(). The race happens when cfs_b->runtime is read,
distribution proceeds without holding the lock, and it then turns out there
is not enough runtime to charge against after distribution, because
assign_cfs_rq_runtime() might be called during distribution and use
cfs_b->runtime at the same time.

Fibtest is the tool to test this race. Assume all gcfs_rq are throttled
and the cfs period timer runs; slow threads might run and sleep, returning
unused cfs_rq runtime and keeping min_cfs_rq_runtime in their local
pool. If all this happens sufficiently quickly, cfs_b->runtime will drop
a lot. If the runtime distributed is large too, over-use of runtime happens.

A runtime over-use of about 70 percent of quota is seen when we
test fibtest on a 96-core machine. We run fibtest with 1 fast thread and
95 slow threads in the test group, configure a 10ms quota for this group, and
see that the CPU usage of fibtest is 17.0%, which is far more than the
expected 10%.

On a smaller machine with 32 cores, we also run fibtest with 96
threads. CPU usage is more than 12%, which is also more than the expected
10%. This shows that on similar workloads, this race does affect CPU
bandwidth control.

Solve this by holding lock inside distribute_cfs_runtime().

Fixes: c06f04c7 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Reviewed-by: Ben Segall <bsegall@google.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Link: https://lore.kernel.org/lkml/20200325092602.22471-1-changhuaixin@linux.alibaba.com/
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
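
A userspace sketch of why the lock matters: if the read of the remaining runtime, the clamp, and the subtraction all happen in one critical section, two concurrent takers cannot both be granted runtime from the same remaining budget. `bandwidth_pool`, `pool_init()`, and `pool_take()` are hypothetical names, and a pthread mutex stands in for the kernel's raw spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

struct bandwidth_pool {
	pthread_mutex_t lock;	/* assumption: models cfs_b->lock */
	uint64_t runtime;	/* remaining quota, like cfs_b->runtime */
};

void pool_init(struct bandwidth_pool *p, uint64_t quota)
{
	assert(p);
	pthread_mutex_init(&p->lock, NULL);
	p->runtime = quota;
}

/* Grant up to 'want' units; never hands out more than is left. */
uint64_t pool_take(struct bandwidth_pool *p, uint64_t want)
{
	uint64_t granted;

	pthread_mutex_lock(&p->lock);
	/* Read, clamp, and subtract inside a single critical section. */
	granted = want > p->runtime ? p->runtime : want;
	p->runtime -= granted;
	pthread_mutex_unlock(&p->lock);
	return granted;
}
```

With a quota of 10, three requests for 4 units are granted 4, 4, and 2, and the total handed out never exceeds the quota regardless of how the callers interleave.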

alinux: cgroup: Fix task_css_check rcu warnings · 798cfa76

Committed by Xunlei Pang on Mar 23, 2020

to #26424323

task_css() should be protected by rcu, fix several callers.

Fixes: 1f49a738 ("alinux: psi: Support PSI under cgroup v1")
Acked-by: Michael Wang <yun.wany@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>

21 Apr, 2020 1 commit

alinux: config: disable CONFIG_NFS_V3_ACL and CONFIG_NFSD_V3_ACL · 29846134

Committed by Chunmei Xu on Apr 20, 2020

to #26616987

Disable CONFIG_NFS_V3_ACL and CONFIG_NFSD_V3_ACL for aarch64,
to match x86.
Signed-off-by: Chunmei Xu <xuchunmei@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>

17 Apr, 2020 9 commits

alinux: kernel: reap zombie process by specified pid · ac2b5c94

Committed by zhongjiang-ali on Feb 26, 2020

to #26788859

We've met several real-world issues where the child reaper
(i.e. systemd) gets stuck in some aborted state and can't
reap its zombie children, so we provide an interface to reap
them by specifying the pid.
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>

alinux: Fix an potential null pointer reference in dump_header · e483e6eb

Committed by zhongjiang-ali on Feb 25, 2020

to #26424311

Commit 5028e358 ("alinux: mm: oom_kill: show killed task's cgroup
info in global oom") introduces a potential null pointer dereference,
because the task 'p' may be a null pointer in some code paths.

Fixes: 5028e358 ("alinux: mm: oom_kill: show killed task's cgroup
info in global oom")
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm: do not allow MADV_PAGEOUT for CoW pages · 7e691477

Committed by Michal Hocko on Mar 31, 2020

task #25182720

commit 12e967fd8e4e6c3d275b4c69c890adc838891300 upstream

Jann has brought up a very interesting point [1].  While shared pages
are excluded from MADV_PAGEOUT normally, CoW pages can be easily
reclaimed that way.  This can lead to all sorts of hard to debug
problems.  E.g.  performance problems outlined by Daniel [2].

There are runtime environments where a substantial amount of memory is
shared among security domains via CoW memory, and an easy way to reclaim
that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either
performance degradation for the parent process, which might be more
privileged, or even open side channel attacks.

The feasibility of the latter is not really clear to me TBH but there is
no real reason for exposure at this stage.  It seems there is no real
use case to depend on reclaiming CoW memory via madvise at this stage so
it is much easier to simply disallow it and this is what this patch
does.  Put it simply MADV_{PAGEOUT,COLD} can operate only on the
exclusively owned memory which is a straightforward semantic.

[1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
[2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Daniel Colascione <dancol@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: mm: Pin code section of process in memory · 7d6cb94f

Committed by Xunlei Pang on Sep 19, 2019

to #26782094

Pin code section of process in memory for the corresponding
VMAs like mlock does.

Usage:
- pin process "PID"
  echo PID > /proc/unevictable/add_pid
- unpin it
  echo PID > /proc/unevictable/del_pid
- show all pinned process pids
  cat /proc/unevictable/add_pid

For easy maintenance, we place it in the kernel because it has no
side effects if it is not used.
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: kidled: make kidled_inc_page_age return latest page age · f579b6f9

Committed by Xu Yu on Apr 06, 2020

fix #26416752

The idle page age shown in idle_page_stats is one scan period behind the
theoretical idle age.  The cause is that kidled_inc_page_age returned
the ancient value, instead of the latest value, which leads to not
accounting in corresponding memcg.

This makes kidled_inc_page_age return the increased age of the page,
i.e., the latest page age, when KIDLED_AGE_NOT_IN_PAGE_FLAGS is not set.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

mm, vmstat: reduce zone->lock holding time by /proc/pagetypeinfo · 30496217

Committed by Michal Hocko on Nov 05, 2019

fix #25820910

commit 93b3a674485f6a4b8ffff85d1682d5e8b7c51560 upstream.

pagetypeinfo_showfree_print is called by zone->lock held in irq mode.
This is not really nice because it blocks both any interrupts on that
cpu and the page allocator.  On large machines this might even trigger
the hard lockup detector.

Considering the pagetypeinfo is a debugging tool we do not really need
exact numbers here.  The primary reason to look at the output is to see
how pageblocks are spread among different migratetypes, and a low number of
pages is much more interesting; therefore putting a bound on the number
of pages on the free_list sounds like a reasonable tradeoff.

The new output will simply tell
  [...]
  Node    6, zone   Normal, type      Movable >100000 >100000 >100000 >100000  41019  31560  23996  10054   3229    983    648

instead of
  Node    6, zone   Normal, type      Movable 399568 294127 221558 102119  41019  31560  23996  10054   3229    983    648

The limit has been chosen arbitrary and it is a subject of a future
change should there be a need for that.

While we are at it, also drop the zone lock after each free_list
iteration which will help with the IRQ and page allocator responsiveness
even further as the IRQ lock held time is always bound to those 100k
pages.

[akpm@linux-foundation.org: tweak comment text, per David Hildenbrand]
Link: http://lkml.kernel.org/r/20191025072610.18526-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Suggested-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Waiman Long <longman@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Rafael Aquini <aquini@redhat.com>
Acked-by: David Rientjes <rientjes@google.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Jann Horn <jannh@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Roman Gushchin <guro@fb.com>
Cc: Song Liu <songliubraving@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
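
The bounded walk described above can be sketched in userspace C: stop counting once a cap is reached and remember that the count is only a lower bound. The list type and the name `bounded_count()` are hypothetical; the cap is a parameter here, whereas the message uses 100000.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *next;
};

/*
 * Count list entries, but give up once 'limit' entries have been seen.
 * *overflow is set when the walk was cut short, so the caller can print
 * the result as ">limit" instead of an exact number.
 */
unsigned long bounded_count(const struct node *head, unsigned long limit,
			    bool *overflow)
{
	unsigned long count = 0;

	assert(overflow);
	*overflow = false;
	for (const struct node *n = head; n; n = n->next) {
		if (++count >= limit) {
			*overflow = true;	/* the work per call is now bounded */
			break;
		}
	}
	return count;
}
```

A caller could then format the column as `printf("%s%6lu ", overflow ? ">" : "", count)`. Note that in this sketch a list holding exactly `limit` entries is also reported as overflowed, which is acceptable for a debugging aid that does not need exact numbers.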

alinux: mm, memcg: optimize division operation with memsli counters · 3d5ca29d

Committed by Xu Yu on Mar 18, 2020

to #26424368

Specifically, replace `val / 1000000` with `val >> 20` to do the
optimization.

This also fixes the possible compiling error when building with
ARCH=i386, which reports undefined reference to `__udivdi3`.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
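
A short sketch of what the replacement trades away: `val >> 20` divides by 2^20 = 1048576 rather than by 1000000, so the result is an approximation that reads roughly 5% low. The function names are hypothetical, and treating `val` as a nanosecond count converted to milliseconds is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Exact conversion: a full 64-bit division. */
uint64_t ns_to_ms_exact(uint64_t ns)
{
	return ns / 1000000;
}

/* Shift-based approximation: divides by 1048576 instead of 1000000. */
uint64_t ns_to_ms_shift(uint64_t ns)
{
	return ns >> 20;
}
```

For one second (1000000000 ns) the exact version returns 1000 and the shift returns 953. The shift needs no 64-bit division helper, which is what avoids the `__udivdi3` reference mentioned in the message.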

alinux: mm, memcg: rework memsli interfaces · 055ed63b

Committed by Xu Yu on Jan 13, 2020

to #26424368

This reworks memsli "start", "end", "update" interfaces to make it more
clear and symmetrical, by merging "update" action into "end", just like
what psi_memstall_{enter, leave} does.

Now the latency probe pattern of memsli is as follows:

memcg_lat_stat_start(&start);
/* kernel codes being probed */
memcg_lat_stat_end(MEM_LAT_XXX, start);

This also formats the codes and fixes the warning(s) produced when
CONFIG_MEMSLI is not set.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

alinux: config: enable CONFIG_MEMSLI · a2feb0da

Committed by Xu Yu on Apr 11, 2020

to #26424368

Enable CONFIG_MEMSLI by default.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

16 Apr, 2020 5 commits

alinux: mm, memcg: add kconfig MEMSLI · 9e58d704

Committed by Xu Yu on Jan 08, 2020

to #26424368

This introduces the new bool kconfig MEMSLI, determining whether the
memsli (memory latency histogram) feature should be built-in or not.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: mm, memcg: add memsli procfs switch interface · 892970b7

Committed by Xu Yu on Dec 25, 2019

to #26424368

Since memsli also records latency histogram for swapout and swapin,
which are NOT in the slow memory path, the overhead of memsli could
be nonnegligible in some specific scenarios.

For example, in scenarios with frequent swapping out and in, memsli
could introduce overhead of ~1% of total run time of the synthetic
testcase.

This adds procfs interface for memsli switch. The memsli feature is
enabled by default, and you can now disable it by:

$ echo 0 > /proc/memsli/enabled

Apparently, you can check current memsli switch status by:

$ cat /proc/memsli/enabled

Note that disabling memsli at runtime will NOT clear the existing
latency histogram. You still need to manually reset the specified
latency histogram(s) by echo 0 into the corresponding cgroup control
file(s).
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

alinux: mm, memcg: gather memsli/exstat from all possible cpus · 77663a9d

Committed by Xu Yu on Dec 30, 2019

to #26424368

CPU hotplug may occur in some business scenarios, resulting in
unavailable per-cpu memsli/exstat data on those non-present or
offline CPU(s).

This fixes the potential problem by using for_each_possible_cpu
macro when gathering per-cpu memsli/exstat data.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

alinux: mm, memcg: account throttle over memory.high for memcg direct reclaim · 6c4e1cc3

Committed by Xu Yu on Nov 12, 2019

to #26424368

Commit 6202ab24 ("mm, memcg: throttle
allocators when failing reclaim over memory.high") introduces explicit
throttling when reclaim is failing to keep memcg size contained at the
memory.high setting.

Just account this latency on memcg direct reclaim latency histogram.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

alinux: mm, memcg: record latency of swapout and swapin in every memcg · ddfd4d5e

Committed by Xu Yu on Nov 04, 2019

to #26424368

Probe and calculate the latency of global swapout, memcg swapout and
swapin respectively, and then group into the latency histogram in struct
mem_cgroup.

Note that the latency in each memcg is aggregated from all child memcgs.

Usage:

$ cat memory.direct_swapout_global_latency
0-1ms:  98313
1-5ms:  0
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      52

Each line is the count of global swapout within the appropriate latency
range.

To clear the latency histogram:

$ echo 0 > memory.direct_swapout_global_latency
$ cat memory.direct_swapout_global_latency
0-1ms:  0
1-5ms:  0
5-10ms:         0
10-100ms:       0
100-500ms:      0
500-1000ms:     0
>=1000ms:       0
total(ms):      0

The usage of memory.direct_swapout_memcg_latency and
memory.direct_swapin_latency is the same as
memory.direct_swapout_global_latency.
Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
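
The bucket layout in the sample output can be sketched as a small userspace histogram. The names `latency_bucket()`, `lat_hist`, and `lat_hist_record()` are hypothetical, and two details are assumptions of this sketch: each range includes its lower bound and excludes its upper bound, and latencies arrive already rounded down to whole milliseconds (the sample output suggests the kernel accumulates at a finer granularity).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Exclusive upper bounds, in ms, of every bucket except the last one. */
static const uint64_t bucket_upper_ms[] = { 1, 5, 10, 100, 500, 1000 };

#define NR_BUCKETS (sizeof(bucket_upper_ms) / sizeof(bucket_upper_ms[0]) + 1)

/* Map a latency to its bucket: 0 is "0-1ms", NR_BUCKETS - 1 is ">=1000ms". */
size_t latency_bucket(uint64_t ms)
{
	size_t i;

	for (i = 0; i < NR_BUCKETS - 1; i++)
		if (ms < bucket_upper_ms[i])
			return i;
	return NR_BUCKETS - 1;
}

struct lat_hist {
	uint64_t count[NR_BUCKETS];	/* one counter per output line */
	uint64_t total_ms;		/* the "total(ms)" line */
};

void lat_hist_record(struct lat_hist *h, uint64_t ms)
{
	assert(h);
	h->count[latency_bucket(ms)]++;
	h->total_ms += ms;
}
```

Clearing the histogram, as `echo 0` does for the control file, would correspond to zeroing the whole `struct lat_hist`.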

openanolis / cloud-kernel · synced successfully more than 1 year ago
