Commits · 3585bdafa1d03306352d6d86dd354b223ddfb03a · openeuler / Kernel

07 Jun, 2023 2 commits

crypto: hisilicon/qm - save capability registers in qm init process · 3585bdaf

Authored by Zhiqi Song on Jun 06, 2023

driver inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7BANJ
CVE: NA

----------------------------------------------------------------------

We found that in the reset scenario, if the reset fails and the MSE
is disabled, the values of the capability registers become invalid.
When the device is removed in this state, the unregister process
reads the related irq vector directly from the capability register
with the mask. The result is an invalid, out-of-range value that
cannot be used to get the right irq number through pci_irq_vector().
This causes the following call traces:

	| Call trace:
	|  pci_irq_vector+0xfc/0x140
	|  hisi_qm_uninit+0x278/0x3b0 [hisi_qm]
	|  hpre_remove+0x16c/0x1c0 [hisi_hpre]
	|  pci_device_remove+0x6c/0x264
	|  device_release_driver_internal+0x1ec/0x3e0
	|  device_release_driver+0x3c/0x60
	|  pci_stop_bus_device+0xfc/0x22c
	|  pci_stop_and_remove_bus_device+0x38/0x70
	|  pci_iov_remove_virtfn+0x108/0x1c0
	|  sriov_disable+0x7c/0x1e4
	|  pci_disable_sriov+0x4c/0x6c
	|  hisi_qm_sriov_disable+0x90/0x160 [hisi_qm]
	|  hpre_remove+0x1a8/0x1c0 [hisi_hpre]
	|  pci_device_remove+0x6c/0x264
	|  device_release_driver_internal+0x1ec/0x3e0
	|  driver_detach+0x168/0x2d0
	|  bus_remove_driver+0xc0/0x230
	|  driver_unregister+0x58/0xdc
	|  pci_unregister_driver+0x40/0x220
	|  hpre_exit+0x34/0x64 [hisi_hpre]
	|  __arm64_sys_delete_module+0x374/0x620
	[...]

	| Call trace:
	|  free_msi_irqs+0x25c/0x300
	|  pci_disable_msi+0x19c/0x264
	|  pci_free_irq_vectors+0x4c/0x70
	|  hisi_qm_pci_uninit+0x44/0x90 [hisi_qm]
	|  hisi_qm_uninit+0x28c/0x3b0 [hisi_qm]
	|  hpre_remove+0x16c/0x1c0 [hisi_hpre]
	|  pci_device_remove+0x6c/0x264
	[...]

So we pre-store the valid values of the capability registers in a global
array during the qm init process, and read the register values from this
array when we need them. This ensures that we always get valid values.
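
The pre-store approach can be illustrated with a small userspace model. This is a minimal sketch, not the driver's code: `fake_dev`, `cap_cache`, `CAP_COUNT`, `CAP_INVALID` and all helper names are hypothetical, and the all-ones readback for an inaccessible device is an assumption made for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CAP_COUNT   3            /* hypothetical number of capability registers */
#define CAP_INVALID 0xffffffffu  /* assumed readback when the device is inaccessible */

/* Simulated register file standing in for MMIO reads. */
struct fake_dev {
	uint32_t regs[CAP_COUNT];
	int accessible;          /* 0 models "reset failed and MSE disabled" */
};

/* Snapshot of the capability registers taken at init time. */
struct cap_cache {
	uint32_t val[CAP_COUNT];
};

/* Direct register read: returns garbage once the device is inaccessible. */
uint32_t dev_read_cap(const struct fake_dev *dev, size_t idx)
{
	assert(idx < CAP_COUNT);
	return dev->accessible ? dev->regs[idx] : CAP_INVALID;
}

/* Called once during init, while the registers are still valid. */
void cap_cache_init(struct cap_cache *cache, const struct fake_dev *dev)
{
	for (size_t i = 0; i < CAP_COUNT; i++)
		cache->val[i] = dev_read_cap(dev, i);
}

/* Later users extract a field from the snapshot, never from the device. */
uint32_t cap_cache_get(const struct cap_cache *cache, size_t idx,
		       uint32_t shift, uint32_t mask)
{
	assert(idx < CAP_COUNT);
	return (cache->val[idx] >> shift) & mask;
}
```

In this model the teardown path calls `cap_cache_get()` only, so the field it extracts stays in range even after `accessible` drops to 0.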

Signed-off-by: Zhiqi Song <songzhiqi1@huawei.com>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>

crypto: hisilicon/qm - add a function to set qm algs · c1e54cbb

Authored by Wenkai Lin on Jun 06, 2023

driver inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7BANJ
CVE: NA

----------------------------------------------------------------------

Extract a public function to set qm algs and
remove the similar code for setting qm algs
in each module.

Signed-off-by: Hao Fang <fanghao11@huawei.com>
Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>

06 Jun, 2023 1 commit

crypto: hisilicon/qm - stop function and write data to memory · b0620a52

Authored by Weili Qian on Jun 05, 2023

driver inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7AUVE
CVE: NA

----------------------------------------------------------------------

Before the system is shut down, the accelerator driver
needs to stop the device and write data to the memory.
This prevents the accelerator from accessing addresses
and writing data to the memory after the memory is reclaimed
by the system, which would cause device exceptions and generate NFE errors.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>
(cherry picked from commit 23bdb7d8)

05 Jun, 2023 1 commit

tcp/dccp: Add another way to allocate local ports in connect() · 4820557e

Authored by Lu Wei on Jun 05, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I7AO8G
CVE: NA

--------------------------------

Commit 07f4c900 ("tcp/dccp: try to not exhaust ip_local_port_range
in connect()") allocates even ports for connect() first while leaving
odd ports for bind(), and this works well on busy servers.

But this strategy causes severe performance degradation on busy clients.
When a client has used more than half of the local ports set in
/proc/sys/net/ipv4/ip_local_port_range and tries to connect to a server
again, the connect time increases rapidly because it traverses all the
even ports even though they are exhausted.

So this patch provides another strategy by introducing a system option:
local_port_allocation. On a busy client, users should set it to 1 to use
sequential allocation; in other situations it should be set to 0. Its
default value is 0.
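
The cost difference between the two strategies can be shown with a simplified userspace model. This is only a sketch under stated assumptions: `pick_port` and its parameters are hypothetical, parity is taken relative to the low end of the range, and details of the real connect() path such as the starting offset are omitted.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Return the first free port in [low, high], or -1 if none is free.
 * in_use[i] describes port low + i. *probes counts how many ports were
 * examined, which stands in for the connect() latency.
 */
int pick_port(const bool *in_use, int low, int high, int sequential,
	      int *probes)
{
	int n = high - low + 1;

	assert(low <= high);
	*probes = 0;

	if (sequential) {
		/* local_port_allocation = 1: walk the range in order. */
		for (int i = 0; i < n; i++) {
			(*probes)++;
			if (!in_use[i])
				return low + i;
		}
		return -1;
	}

	/* Default: try every even offset first, then every odd offset. */
	for (int parity = 0; parity < 2; parity++) {
		for (int i = parity; i < n; i += 2) {
			(*probes)++;
			if (!in_use[i])
				return low + i;
		}
	}
	return -1;
}
```

With all even offsets occupied, the default strategy has to examine every even port before it reaches the first odd one, while the sequential walk finds a free port after two probes in this model.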

Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: Liu Jian <liujian56@huawei.com>
(cherry picked from commit 726c5265)

02 Jun, 2023 2 commits

uacce: use filep->f_mapping to replace inode->i_mapping · 5d1b4729

Authored by Zhangfei Gao on Jun 02, 2023

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I79JRM
CVE: NA

----------------------------------------------------------------------

The inode can be different in a container. For example, a docker container
and the host may both open the same uacce parent device, which uses the same
uacce struct but a different inode, so uacce->inode is not enough.

What's worse, when the docker container is stopped, the inode is destroyed
as well, causing a use-after-free in uacce_remove.

So use q->filep->f_mapping to replace uacce->inode->i_mapping.

Signed-off-by: Zhangfei Gao <zhangfei.gao@linaro.org>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>
(cherry picked from commit f29efbf3)

crypto:hisilicon/qm: bugfix queue parameter issue · 017eb679

Authored by Longfang Liu on Jun 02, 2023

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I79JRM
CVE: NA

----------------------------------------------------------------------

After the queue isolation function is enabled in the BIOS, if the
current default number of queues is used to enable the PF, the default
number of queues will be greater than the number of queues supported by
the function set in the BIOS, which causes the driver to fail to load.

With this change, if queue isolation is enabled and the default queue
parameter is larger than the number supported by the function, the
number of enabled queues is changed to the number supported by the
function, so that the driver can be loaded normally.
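
The adjusted behaviour amounts to clamping the requested queue count. A minimal sketch, assuming hypothetical names (`effective_qp_num`, `requested`, `supported`, `isolation_enabled`) that do not come from the driver:

```c
#include <assert.h>

/*
 * Pick the number of queues to enable for the PF.
 * requested: default queue parameter of the driver.
 * supported: number of queues the function supports, as set in the BIOS.
 */
unsigned int effective_qp_num(unsigned int requested, unsigned int supported,
			      int isolation_enabled)
{
	assert(supported > 0); /* assumed: a function has at least one queue */

	/* Only clamp when isolation is on and the default is too large. */
	if (isolation_enabled && requested > supported)
		return supported;

	return requested;
}
```

Without the clamp, the oversized request is passed through unchanged, which corresponds to the load failure described above.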

Signed-off-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>
(cherry picked from commit 9bcf38ed)

01 Jun, 2023 2 commits

sched: fix performance degradation on lmbench · c6aaa310

Authored by Hui Tang on Jun 01, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718

--------------------------------

Performance is worse with the 'Fixes' commit
when running "./lat_ctx -P $SYNC_MAX -s 64 16".

The 'Fixes' commit allocates memory for p->prefer_cpus
even if "prefer_cpus" is not set.

Before the 'Fixes' commit, only "p->prefer_cpus" was tested;
after it, the test "!cpumask_empty(p->prefer_cpus)" is added,
which causes the performance degradation.

select_task_rq_fair
  ->set_task_select_cpus
    ->prefer_cpus_valid  ----  test cpumask_empty(p->prefer_cpus)

Fixes: ebeb84ad ("cpuset: Introduce new interface for scheduler ...")
Signed-off-by: Hui Tang <tanghui20@huawei.com>
(cherry picked from commit d8f77f89)

cgroup: Stop task iteration when rebinding subsystem · 7ad6b560

Authored by Xiu Jianfeng on May 31, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I798WQ
CVE: NA

----------------------------------------------------------------------

We found a refcount UAF bug as follows:

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
Workqueue: events cpuset_hotplug_workfn
Call trace:
 refcount_warn_saturate+0xa0/0x148
 __refcount_add.constprop.0+0x5c/0x80
 css_task_iter_advance_css_set+0xd8/0x210
 css_task_iter_advance+0xa8/0x120
 css_task_iter_next+0x94/0x158
 update_tasks_root_domain+0x58/0x98
 rebuild_root_domains+0xa0/0x1b0
 rebuild_sched_domains_locked+0x144/0x188
 cpuset_hotplug_workfn+0x138/0x5a0
 process_one_work+0x1e8/0x448
 worker_thread+0x228/0x3e0
 kthread+0xe0/0xf0
 ret_from_fork+0x10/0x20

then a kernel panic will be triggered as below:

Unable to handle kernel paging request at virtual address 00000000c0000010
Call trace:
 cgroup_apply_control_disable+0xa4/0x16c
 rebind_subsystems+0x224/0x590
 cgroup_destroy_root+0x64/0x2e0
 css_free_rwork_fn+0x198/0x2a0
 process_one_work+0x1d4/0x4bc
 worker_thread+0x158/0x410
 kthread+0x108/0x13c
 ret_from_fork+0x10/0x18

The race that causes this bug can be shown as below:

(hotplug cpu)                | (umount cpuset)
mutex_lock(&cpuset_mutex)    | mutex_lock(&cgroup_mutex)
cpuset_hotplug_workfn        |
 rebuild_root_domains        |  rebind_subsystems
  update_tasks_root_domain   |   spin_lock_irq(&css_set_lock)
   css_task_iter_start       |    list_move_tail(&cset->e_cset_node[ss->id]
   while(css_task_iter_next) |                  &dcgrp->e_csets[ss->id]);
   css_task_iter_end         |   spin_unlock_irq(&css_set_lock)
mutex_unlock(&cpuset_mutex)  | mutex_unlock(&cgroup_mutex)

Inside css_task_iter_start/next/end, css_set_lock is held and then
released, so while tasks are being iterated (left side), the css_set may
be moved to another list (right side). Then it->cset_head points to the
old list head and it->cset_pos->next points to the head node of the new
list, which can't be used as a struct css_set.

To fix this issue, introduce the CSS_TASK_ITER_STOPPED flag for
css_task_iter. When moving a css_set to dcgrp->e_csets[ss->id] in
rebind_subsystems(), stop the task iteration.

Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Link: https://www.spinics.net/lists/cgroups/msg37935.html
Fixes: f9a25f77 ("cpusets: Rebuild root domain deadline accounting information")
Signed-off-by: Xiu Jianfeng <xiujianfeng@huaweicloud.com>
Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
(cherry picked from commit e52586f4)

30 May, 2023 8 commits

sched/fair: Introduce multiple qos level · c51ad919

Authored by Zhao Wenhui on May 30, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I737X1

-------------------------------

Expand qos_level from {-1,0} to [-2, 2] to distinguish tasks that are
expected to have an extremely high or low priority level. Use
qos_level_weight to reweight the shares when calculating a group's weight.
Meanwhile, set the schedule policy of offline tasks to SCHED_IDLE so that
they can be preempted at check_preempt_wakeup.

Signed-off-by: Zhao Wenhui <zhaowenhui8@huawei.com>

sched/fair: Fix kabi borken in sched_domain · 00d7e686

Authored by Guan Jing on May 28, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I78WM8
CVE: NA

--------------------------------

Signed-off-by: Guan Jing <guanjing6@huawei.com>

sched/fair: Adjust the allowed NUMA imbalance when SD_NUMA spans multiple LLCs · edd5e1ef

Authored by Guan Jing on May 28, 2023

mainline inclusion
from mainline-v5.18-rc1
commit e496132e
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I78WM8

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?h=v6.4-rc3&id=e496132ebedd870b67f1f6d2428f9bb9d7ae27fd

--------------------------------

Commit 7d2b5dd0 ("sched/numa: Allow a floating imbalance between NUMA
nodes") allowed an imbalance between NUMA nodes such that communicating
tasks would not be pulled apart by the load balancer. This works fine when
there is a 1:1 relationship between LLC and node but can be suboptimal
for multiple LLCs if independent tasks prematurely use CPUs sharing cache.

Zen* has multiple LLCs per node with local memory channels and due to
the allowed imbalance, it's far harder to tune some workloads to run
optimally than it is on hardware that has 1 LLC per node. This patch
allows an imbalance to exist up to the point where LLCs should be balanced
between nodes.

On a Zen3 machine running STREAM parallelised with OMP to have one instance
per LLC and without binding, the results are

                             5.17.0-rc0              5.17.0-rc0
                                vanilla        sched-numaimb-v6
MB/sec copy-16     162596.94 (   0.00%)    580559.74 ( 257.05%)
MB/sec scale-16    136901.28 (   0.00%)    374450.52 ( 173.52%)
MB/sec add-16      157300.70 (   0.00%)    564113.76 ( 258.62%)
MB/sec triad-16    151446.88 (   0.00%)    564304.24 ( 272.61%)

STREAM can use directives to force the spread if the OpenMP is new
enough but that doesn't help if an application uses threads and
it's not known in advance how many threads will be created.

Coremark is a CPU and cache intensive benchmark parallelised with
threads. When running with 1 thread per core, the vanilla kernel
allows threads to contend on cache. With the patch;

                             5.17.0-rc0              5.17.0-rc0
                                vanilla        sched-numaimb-v5
Min      Score-16  368239.36 (   0.00%)    389816.06 (   5.86%)
Hmean    Score-16  388607.33 (   0.00%)    427877.08 *  10.11%*
Max      Score-16  408945.69 (   0.00%)    481022.17 (  17.62%)
Stddev   Score-16   15247.04 (   0.00%)     24966.82 ( -63.75%)
CoeffVar Score-16       3.92 (   0.00%)         5.82 ( -48.48%)

It can also make a big difference for semi-realistic workloads
like specjbb which can execute arbitrary numbers of threads without
advance knowledge of how they should be placed. Even in cases where
the average performance is neutral, the results are more stable.

                             5.17.0-rc0              5.17.0-rc0
                                vanilla        sched-numaimb-v6
Hmean tput-1        71631.55 (   0.00%)     73065.57 (   2.00%)
Hmean tput-8       582758.78 (   0.00%)    556777.23 (  -4.46%)
Hmean tput-16     1020372.75 (   0.00%)   1009995.26 (  -1.02%)
Hmean tput-24     1416430.67 (   0.00%)   1398700.11 (  -1.25%)
Hmean tput-32     1687702.72 (   0.00%)   1671357.04 (  -0.97%)
Hmean tput-40     1798094.90 (   0.00%)   2015616.46 *  12.10%*
Hmean tput-48     1972731.77 (   0.00%)   2333233.72 (  18.27%)
Hmean tput-56     2386872.38 (   0.00%)   2759483.38 (  15.61%)
Hmean tput-64     2909475.33 (   0.00%)   2925074.69 (   0.54%)
Hmean tput-72     2585071.36 (   0.00%)   2962443.97 (  14.60%)
Hmean tput-80     2994387.24 (   0.00%)   3015980.59 (   0.72%)
Hmean tput-88     3061408.57 (   0.00%)   3010296.16 (  -1.67%)
Hmean tput-96     3052394.82 (   0.00%)   2784743.41 (  -8.77%)
Hmean tput-104    2997814.76 (   0.00%)   2758184.50 (  -7.99%)
Hmean tput-112    2955353.29 (   0.00%)   2859705.09 (  -3.24%)
Hmean tput-120    2889770.71 (   0.00%)   2764478.46 (  -4.34%)
Hmean tput-128    2871713.84 (   0.00%)   2750136.73 (  -4.23%)
Stddev tput-1        5325.93 (   0.00%)      2002.53 (  62.40%)
Stddev tput-8        6630.54 (   0.00%)     10905.00 ( -64.47%)
Stddev tput-16      25608.58 (   0.00%)      6851.16 (  73.25%)
Stddev tput-24      12117.69 (   0.00%)      4227.79 (  65.11%)
Stddev tput-32      27577.16 (   0.00%)      8761.05 (  68.23%)
Stddev tput-40      59505.86 (   0.00%)      2048.49 (  96.56%)
Stddev tput-48     168330.30 (   0.00%)     93058.08 (  44.72%)
Stddev tput-56     219540.39 (   0.00%)     30687.02 (  86.02%)
Stddev tput-64     121750.35 (   0.00%)      9617.36 (  92.10%)
Stddev tput-72     223387.05 (   0.00%)     34081.13 (  84.74%)
Stddev tput-80     128198.46 (   0.00%)     22565.19 (  82.40%)
Stddev tput-88     136665.36 (   0.00%)     27905.97 (  79.58%)
Stddev tput-96     111925.81 (   0.00%)     99615.79 (  11.00%)
Stddev tput-104    146455.96 (   0.00%)     28861.98 (  80.29%)
Stddev tput-112     88740.49 (   0.00%)     58288.23 (  34.32%)
Stddev tput-120    186384.86 (   0.00%)     45812.03 (  75.42%)
Stddev tput-128     78761.09 (   0.00%)     57418.48 (  27.10%)

Similarly, for embarrassingly parallel problems like NPB-ep, there are
improvements due to better spreading across LLC when the machine is not
fully utilised.

                                vanilla        sched-numaimb-v6
Min      ep.D          31.79 (   0.00%)        26.11 (  17.87%)
Amean    ep.D          31.86 (   0.00%)        26.17 *  17.86%*
Stddev   ep.D           0.07 (   0.00%)         0.05 (  24.41%)
CoeffVar ep.D           0.22 (   0.00%)         0.20 (   7.97%)
Max      ep.D          31.93 (   0.00%)        26.21 (  17.91%)

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Gautham R. Shenoy <gautham.shenoy@amd.com>
Tested-by: K Prateek Nayak <kprateek.nayak@amd.com>
Link: https://lore.kernel.org/r/20220208094334.16379-3-mgorman@techsingularity.net
Signed-off-by: Guan Jing <guanjing6@huawei.com>

cpumask: introduce for_each_cpu_or · 0cba0556

Authored by Dave Chinner on Mar 15, 2023

mainline inclusion
from mainline-v6.3-rc4
commit 1470afef
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6VS35

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1470afefc3c42df5d1662f87d079b46651bdc95b

--------------------------------

Equivalent of for_each_cpu_and, except it ORs the two masks together
so it iterates all the CPUs present in either mask.
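
The iteration pattern can be modelled in userspace on single-word masks. This is a minimal sketch, not the kernel macro: `NBITS`, `next_bit_or`, `for_each_bit_or` and `collect_or` are hypothetical names, and real cpumasks span multiple words.

```c
#include <assert.h>
#include <limits.h>

#define NBITS ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))

/* Next bit >= start that is set in either mask, or NBITS if none. */
unsigned int next_bit_or(unsigned long a, unsigned long b, unsigned int start)
{
	unsigned long both = a | b;

	for (unsigned int i = start; i < NBITS; i++)
		if (both & (1UL << i))
			return i;
	return NBITS;
}

/* Visit every bit present in mask a, mask b, or both, in ascending order. */
#define for_each_bit_or(bit, a, b)                       \
	for ((bit) = next_bit_or((a), (b), 0);           \
	     (bit) < NBITS;                              \
	     (bit) = next_bit_or((a), (b), (bit) + 1))

/* Store the visited bit numbers in out[] and return how many there were. */
unsigned int collect_or(unsigned long a, unsigned long b, unsigned int *out)
{
	unsigned int bit, n = 0;

	assert(out);
	for_each_bit_or(bit, a, b)
		out[n++] = bit;
	return n;
}
```

A bit that is set in both masks is visited once, because the loop walks the union rather than each mask in turn.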

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>

lib: extend the scope of small_const_nbits() macro · 5f14daa6

Authored by Yury Norov on May 06, 2021

mainline inclusion
from mainline-v5.13-rc1
commit 586eaebe
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6VS35

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=586eaebea5988302c5a8b018096dd6c6f4564940

--------------------------------

find_bit would also benefit from small_const_nbits() optimizations.  The
detailed comment is provided by Rasmus Villemoes.
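
The idea behind the optimization is that a bitmap whose size is a compile-time constant no larger than one word can be searched with a single masked word operation instead of a loop. The following is a userspace approximation rather than the kernel code: `first_bit`, `first_bit_small` and `first_bit_generic` are hypothetical names, and `__builtin_constant_p` and `__builtin_ctzl` are GCC/Clang builtins.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((unsigned int)(sizeof(long) * CHAR_BIT))

/* True when nbits is a compile-time constant that fits in one word. */
#define small_const_nbits(nbits) \
	(__builtin_constant_p(nbits) && (nbits) <= BITS_PER_LONG && (nbits) > 0)

/* Fast path: one word, mask off the bits beyond nbits, count zeros. */
unsigned int first_bit_small(unsigned long word, unsigned int nbits)
{
	assert(nbits >= 1 && nbits <= BITS_PER_LONG);
	unsigned long val = word & (~0UL >> (BITS_PER_LONG - nbits));

	return val ? (unsigned int)__builtin_ctzl(val) : nbits;
}

/* Generic path: scan the bitmap word by word. */
unsigned int first_bit_generic(const unsigned long *addr, unsigned int nbits)
{
	for (unsigned int i = 0; i < nbits; i += BITS_PER_LONG) {
		unsigned long word = addr[i / BITS_PER_LONG];

		if (word) {
			unsigned int bit = i + (unsigned int)__builtin_ctzl(word);

			return bit < nbits ? bit : nbits;
		}
	}
	return nbits;
}

/* Index of the first set bit, or nbits if no bit is set. */
#define first_bit(addr, nbits)                               \
	(small_const_nbits(nbits)                            \
		 ? first_bit_small(*(addr), (nbits))         \
		 : first_bit_generic((addr), (nbits)))
```

Both paths return the same result by construction, so the choice between them affects only the generated code, not the behaviour.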

Link: https://lkml.kernel.org/r/20210401003153.97325-6-yury.norov@gmail.com
Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Signed-off-by: Yury Norov <yury.norov@gmail.com>
Acked-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
Acked-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Alexey Klimov <aklimov@redhat.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: David Sterba <dsterba@suse.com>
Cc: Dennis Zhou <dennis@kernel.org>
Cc: Geert Uytterhoeven <geert@linux-m68k.org>
Cc: Jianpeng Ma <jianpeng.ma@intel.com>
Cc: Joe Perches <joe@perches.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Rich Felker <dalias@libc.org>
Cc: Stefano Brivio <sbrivio@redhat.com>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Wolfram Sang <wsa+renesas@sang-engineering.com>
Cc: Yoshinori Sato <ysato@users.osdn.me>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Zeng Heng <zengheng4@huawei.com>

cpumask: Introduce DYING mask · 50612a26

Authored by Peter Zijlstra on Jan 19, 2021

mainline inclusion
from mainline-v5.13-rc1
commit e40f74c5
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6VS35

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e40f74c535b8a0ecf3ef0388b51a34cdadb34fb5

--------------------------------

Introduce a cpumask that indicates (for each CPU) what direction the
CPU hotplug is currently going. Notably, it tracks rollbacks. Eg. when
an up fails and we do a roll-back down, it will accurately reflect the
direction.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20210310150109.151441252@infradead.org
Signed-off-by: Zeng Heng <zengheng4@huawei.com>

crypto: hisilicon/qm - support no-sva feature · a1666f44

Authored by Weili Qian on May 22, 2023

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I773SD
CVE: NA

----------------------------------------------------------------------

Support the no-sva feature.

Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>

uacce: add UACCE_MODE_NOIOMMU for warpdrive · 92e58150

Authored by Kai Ye on May 22, 2023

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I773SD
CVE: NA

----------------------------------------------------------------------

1. UACCE_MODE_NOIOMMU for warpdrive.
2. some dfx logs
3. fix some static checking.

Signed-off-by: Kai Ye <yekai13@huawei.com>
Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>

29 May, 2023 1 commit

perf/smmuv3: Enable HiSilicon Erratum quirk · fda37f5b

Authored by Yicong Yang on May 29, 2023

driver inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I798Y2
CVE: NA

----------------------------------------------------------------------

Some HiSilicon SMMU PMCGs suffer from an erratum where the global PMU
disable control sometimes fails to disable the counters in use. This leads
to erroneous or inaccurate data, because before we enable the counters
they are still counting the event used in the last perf session.

This patch tries to fix this by hardening the global disable process.
Before disabling the PMU, write an invalid event type (0xff) to forcibly
stop the counters.

Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
Signed-off-by: Junhao He <hejunhao3@huawei.com>

23 May, 2023 3 commits

net/mlx5: Fix possible use-after-free in async command interface · 15fec526

Authored by Tariq Toukan on Oct 26, 2022

stable inclusion
from stable-v5.10.153
commit bbcc06933f35651294ea1e963757502312c2171f
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bbcc06933f35651294ea1e963757502312c2171f

--------------------------------

[ Upstream commit bacd22df ]

mlx5_cmd_cleanup_async_ctx should return only after all its callback
handlers have completed. Before this patch, the race below between
mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and
led to a use-after-free:

1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e.
   elevated by 1, a single inflight callback).
2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1.
3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and
   is about to call wake_up().
4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns
   immediately as the condition (num_inflight == 0) holds.
5. mlx5_cmd_cleanup_async_ctx returns.
6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx
   object.
7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed
   object.

Fix it by syncing using a completion object. Mark it completed when
num_inflight reaches 0.

Trace:

BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270
Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0

CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
 <IRQ>
 dump_stack_lvl+0x57/0x7d
 print_report.cold+0x2d5/0x684
 ? do_raw_spin_lock+0x23d/0x270
 kasan_report+0xb1/0x1a0
 ? do_raw_spin_lock+0x23d/0x270
 do_raw_spin_lock+0x23d/0x270
 ? rwlock_bug.part.0+0x90/0x90
 ? __delete_object+0xb8/0x100
 ? lock_downgrade+0x6e0/0x6e0
 _raw_spin_lock_irqsave+0x43/0x60
 ? __wake_up_common_lock+0xb9/0x140
 __wake_up_common_lock+0xb9/0x140
 ? __wake_up_common+0x650/0x650
 ? destroy_tis_callback+0x53/0x70 [mlx5_core]
 ? kasan_set_track+0x21/0x30
 ? destroy_tis_callback+0x53/0x70 [mlx5_core]
 ? kfree+0x1ba/0x520
 ? do_raw_spin_unlock+0x54/0x220
 mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core]
 ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
 ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
 mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core]
 ? dump_command+0xcc0/0xcc0 [mlx5_core]
 ? lockdep_hardirqs_on_prepare+0x400/0x400
 ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
 cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
 atomic_notifier_call_chain+0xd7/0x1d0
 mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core]
 atomic_notifier_call_chain+0xd7/0x1d0
 ? irq_release+0x140/0x140 [mlx5_core]
 irq_int_handler+0x19/0x30 [mlx5_core]
 __handle_irq_event_percpu+0x1f2/0x620
 handle_irq_event+0xb2/0x1d0
 handle_edge_irq+0x21e/0xb00
 __common_interrupt+0x79/0x1a0
 common_interrupt+0x78/0xa0
 </IRQ>
 <TASK>
 asm_common_interrupt+0x22/0x40
RIP: 0010:default_idle+0x42/0x60
Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00
RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242
RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110
RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc
RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3
R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005
R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000
 ? default_idle_call+0xcc/0x450
 default_idle_call+0xec/0x450
 do_idle+0x394/0x450
 ? arch_cpu_idle_exit+0x40/0x40
 ? do_idle+0x17/0x450
 cpu_startup_entry+0x19/0x20
 start_secondary+0x221/0x2b0
 ? set_cpu_sibling_map+0x2070/0x2070
 secondary_startup_64_no_verify+0xcd/0xdb
 </TASK>

Allocated by task 49502:
 kasan_save_stack+0x1e/0x40
 __kasan_kmalloc+0x81/0xa0
 kvmalloc_node+0x48/0xe0
 mlx5e_bulk_async_init+0x35/0x110 [mlx5_core]
 mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core]
 mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
 mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
 mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
 mlx5e_suspend+0xdb/0x140 [mlx5_core]
 mlx5e_remove+0x89/0x190 [mlx5_core]
 auxiliary_bus_remove+0x52/0x70
 device_release_driver_internal+0x40f/0x650
 driver_detach+0xc1/0x180
 bus_remove_driver+0x125/0x2f0
 auxiliary_driver_unregister+0x16/0x50
 mlx5e_cleanup+0x26/0x30 [mlx5_core]
 cleanup+0xc/0x4e [mlx5_core]
 __x64_sys_delete_module+0x2b5/0x450
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Freed by task 49502:
 kasan_save_stack+0x1e/0x40
 kasan_set_track+0x21/0x30
 kasan_set_free_info+0x20/0x30
 ____kasan_slab_free+0x11d/0x1b0
 kfree+0x1ba/0x520
 mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core]
 mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
 mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
 mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
 mlx5e_suspend+0xdb/0x140 [mlx5_core]
 mlx5e_remove+0x89/0x190 [mlx5_core]
 auxiliary_bus_remove+0x52/0x70
 device_release_driver_internal+0x40f/0x650
 driver_detach+0xc1/0x180
 bus_remove_driver+0x125/0x2f0
 auxiliary_driver_unregister+0x16/0x50
 mlx5e_cleanup+0x26/0x30 [mlx5_core]
 cleanup+0xc/0x4e [mlx5_core]
 __x64_sys_delete_module+0x2b5/0x450
 do_syscall_64+0x3d/0x90
 entry_SYSCALL_64_after_hwframe+0x46/0xb0

Fixes: e355477e ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>

media: videodev2.h: V4L2_DV_BT_BLANKING_HEIGHT should check 'interlaced' · bc22dbf2

Authored by Hans Verkuil on Oct 12, 2022

stable inclusion
from stable-v5.10.153
commit b6c7446d0a38725c64305bfb4728625d4f411f50
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b6c7446d0a38725c64305bfb4728625d4f411f50

--------------------------------

[ Upstream commit 8da7f097 ]

If it is a progressive (non-interlaced) format, then ignore the
interlaced timing values.
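
The effect of the check can be sketched on a reduced timing structure. This is an illustration under assumptions: `bt_timings` is a hypothetical cut-down stand-in that reuses field names from struct v4l2_bt_timings, and `BLANKING_HEIGHT` and `blanking_height` are hypothetical names, not the uapi macro itself.

```c
#include <assert.h>

/* Hypothetical subset of the fields relevant to vertical blanking. */
struct bt_timings {
	unsigned int interlaced;      /* 0 = progressive, non-zero = interlaced */
	unsigned int vfrontporch;
	unsigned int vsync;
	unsigned int vbackporch;
	unsigned int il_vfrontporch;  /* second-field values, interlaced only */
	unsigned int il_vsync;
	unsigned int il_vbackporch;
};

/* Add the il_* values only when the format is interlaced. */
#define BLANKING_HEIGHT(bt)                                             \
	((bt)->vfrontporch + (bt)->vsync + (bt)->vbackporch +           \
	 ((bt)->interlaced                                              \
		  ? ((bt)->il_vfrontporch + (bt)->il_vsync +            \
		     (bt)->il_vbackporch)                               \
		  : 0))

unsigned int blanking_height(const struct bt_timings *bt)
{
	assert(bt);
	return BLANKING_HEIGHT(bt);
}
```

With the check in place, stale il_* values left in a progressive timing no longer inflate the blanking height.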

Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Fixes: 7f68127f ([media] videodev2.h: defines to calculate blanking and frame sizes)
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>

media: v4l2: Fix v4l2_i2c_subdev_set_name function documentation · 441fb6eb

Authored by Alexander Stein on Jul 22, 2022

stable inclusion
from stable-v5.10.153
commit 4953a989b72d2b809b18dde7a4c2844cba4232d4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4953a989b72d2b809b18dde7a4c2844cba4232d4

--------------------------------

[ Upstream commit bb9ea2c3 ]

The doc says the I²C device's name is used if devname is NULL, but
actually the I²C device driver's name is used.

Fixes: 06582930 ("media: v4l: subdev: Add a function to set an I²C sub-device's name")
Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>

22 May, 2023 3 commits

bpf: support BPF_PROG_QUERY for progs attached to sockmap · 05038388

Authored by JofDiamonds on May 22, 2023

mainline inclusion
from mainline-v6.4-rc3
commit 748cd572
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I776SR
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=748cd5729ac7421091316e32dcdffb0578563880

----------------------------------------------------------------------

Right now there is no way to query whether BPF programs are
attached to a sockmap or not.

We can use the standard interface in libbpf to query, such as:
bpf_prog_query(mapFd, BPF_SK_SKB_STREAM_PARSER, 0, NULL, ...);
where mapFd is the fd of the sockmap.

Signed-off-by: Di Zhu <zhudi2@huawei.com>
Acked-by: Yonghong Song <yhs@fb.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/r/20220119014005.1209-1-zhudi2@huawei.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Conflicts:
	net/core/sock_map.c
	include/linux/bpf.h
Signed-off-by: JofDiamonds <kwb0523@163.com>
Reviewed-by: wuchangye <wuchangye@huawei.com>

netfilter: nf_tables: deactivate anonymous set from preparation phase · dcb69fcc

Authored by Pablo Neira Ayuso on May 22, 2023

stable inclusion
from stable-v5.10.180
commit e044a24447189419c3a7ccc5fa6da7516036dc55
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I71F49
CVE: CVE-2023-32233

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e044a24447189419c3a7ccc5fa6da7516036dc55

--------------------------------

commit c1592a89 upstream.

Toggle deleted anonymous sets as inactive in the next generation, so
users cannot perform any update on it. Clear the generation bitmask
in case the transaction is aborted.

The following KASAN splat shows a set element deletion for a bound
anonymous set that has been already removed in the same transaction.

[   64.921510] ==================================================================
[   64.923123] BUG: KASAN: wild-memory-access in nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.924745] Write of size 8 at addr dead000000000122 by task test/890
[   64.927903] CPU: 3 PID: 890 Comm: test Not tainted 6.3.0+ #253
[   64.931120] Call Trace:
[   64.932699]  <TASK>
[   64.934292]  dump_stack_lvl+0x33/0x50
[   64.935908]  ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.937551]  kasan_report+0xda/0x120
[   64.939186]  ? nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.940814]  nf_tables_commit+0xa24/0x1490 [nf_tables]
[   64.942452]  ? __kasan_slab_alloc+0x2d/0x60
[   64.944070]  ? nf_tables_setelem_notify+0x190/0x190 [nf_tables]
[   64.945710]  ? kasan_set_track+0x21/0x30
[   64.947323]  nfnetlink_rcv_batch+0x709/0xd90 [nfnetlink]
[   64.948898]  ? nfnetlink_rcv_msg+0x480/0x480 [nfnetlink]

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lu Wei <luwei32@huawei.com>
Reviewed-by: Yue Haibing <yuehaibing@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>

crypto: hisilicon/qm - support dumping stop queue status · 83430c8d

Committed by Weili Qian on May 19, 2023

driver inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I76TVJ
CVE: NA

----------------------------------------------------------------------

Add the debugfs files 'dev_state' and 'dev_timeout'.

dev_state: if dev_timeout is set, dev_state reports the result of
stopping the queue. 0 means the queue was stopped successfully; any
other value means the queue failed to stop. If dev_timeout is not
set, the value of dev_state is 0.

dev_timeout: if the queue fails to stop, the queue is released
after waiting dev_timeout * 20ms.
Signed-off-by: Weili Qian <qianweili@huawei.com>
Signed-off-by: Jiangshui Yang <yangjiangshui@h-partners.com>

May 19, 2023 (6 commits)

memcg: support ksm merge any mode per cgroup · 0f6fb357

Committed by Nanyong Sun on May 19, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

----------------------------------------------------------------------

Add the control file "memory.ksm" to enable KSM per cgroup.
Writing 1 to it puts all tasks currently in the cgroup into KSM
merge-any mode, which means KSM is enabled for all VMAs of each
process. Writing 0 disables KSM for them and unmerges the merged
pages.
Reading the file shows the current state and the KSM profit
statistics of this cgroup.
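
As a rough illustration of the interface described above, the sketch below toggles and reads "memory.ksm" for one cgroup. The helper names and the cgroup directory argument are hypothetical, and the layout of the text returned on read is not specified in this commit message, so it is returned unparsed.

```python
from pathlib import Path


def set_cgroup_ksm(cgroup_dir: str, enable: bool) -> None:
    """Write 1 or 0 to the cgroup's memory.ksm control file."""
    # cgroup_dir is an assumption: pass the directory of the target
    # cgroup under the local memory controller mount point.
    Path(cgroup_dir, "memory.ksm").write_text("1" if enable else "0")


def read_cgroup_ksm(cgroup_dir: str) -> str:
    """Return the raw contents of memory.ksm (state and profit figures)."""
    return Path(cgroup_dir, "memory.ksm").read_text()
```

Writing to a real cgroup normally requires sufficient privileges and a kernel built with this patch.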
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0 · 351ceedb

Committed by David Hildenbrand on May 19, 2023

mainline inclusion
from mainline-v6.4-rc1
commit 24139c07
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc

----------------------------------------------------------------------

Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup
disabling KSM", v2.

(1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE
does, (2) add a selftest for it and (3) factor out disabling of KSM from
s390/gmap code.

This patch (of 3):

Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear
the VM_MERGEABLE flag from all VMAs -- just like KSM would.  Of course,
only do that if we previously set PR_SET_MEMORY_MERGE=1.

Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com
Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Stefan Roesch <shr@devkernel.io>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Janosch Frank <frankja@linux.ibm.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	mm/ksm.c
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

mm: add new KSM process and sysfs knobs · a098d41e

Committed by Stefan Roesch on May 19, 2023

mainline inclusion
from mainline-v6.4-rc1
commit d21077fb
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f

----------------------------------------------------------------------

This adds the general_profit KSM sysfs knob and the process profit metric
knobs to ksm_stat.

1) expose general_profit metric

   The documentation mentions a general profit metric, however this
   metric is not calculated.  In addition the formula depends on the size
   of internal structures, which makes it more difficult for an
   administrator to make the calculation.  Adding the metric for a better
   user experience.

2) document general_profit sysfs knob

3) calculate ksm process profit metric

   The ksm documentation mentions the process profit metric and how to
   calculate it.  This adds the calculation of the metric.

4) mm: expose ksm process profit metric in ksm_stat

   This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
   The documentation mentions the formula for the ksm process profit
   metric, however it does not calculate it.  In addition the formula
   depends on the size of internal structures.  So it makes sense to
   expose it.

5) document new procfs ksm knobs

Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

mm: add new api to enable ksm per process · 2cd2cdfe

Committed by Stefan Roesch on May 19, 2023

mainline inclusion
from mainline-v6.4-rc1
commit d7597f59
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62

----------------------------------------------------------------------

Patch series "mm: process/cgroup ksm support", v9.

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

Use case 1:
  The madvise call is not available in the programming language.  An
  example for this are programs with forked workloads using a garbage
  collected language without pointers.  In such a language madvise cannot
  be made available.

  In addition the addresses of objects get moved around as they are
  garbage collected.  KSM sharing needs to be enabled "from the outside"
  for these types of workloads.

Use case 2:
  The same interpreter can also be used for workloads where KSM brings
  no benefit or even has overhead.  We'd like to be able to enable KSM on
  a workload by workload basis.

Use case 3:
  With the madvise call sharing opportunities are only enabled for the
  current process: it is a workload-local decision.  A considerable number
  of sharing opportunities may exist across multiple workloads or jobs (if
  they are part of the same security domain).  Only a higher level entity
  like a job scheduler or container can know for certain if it's running
  one or more instances of a job.  That job scheduler however doesn't have
  the necessary internal workload knowledge to make targeted madvise
  calls.

Security concerns:

  In previous discussions security concerns have been brought up.  The
  problem is that an individual workload does not have the knowledge about
  what else is running on a machine.  Therefore it has to be very
  conservative in what memory areas can be shared or not.  However, if the
  system is dedicated to running multiple jobs within the same security
  domain, it's the job scheduler that has the knowledge that sharing can be
  safely enabled and is even desirable.

Performance:

  Experiments with using UKSM have shown a capacity increase of around 20%.

  Here are the metrics from an instagram workload (taken from a machine
  with 64GB main memory):

   full_scans: 445
   general_profit: 20158298048
   max_page_sharing: 256
   merge_across_nodes: 1
   pages_shared: 129547
   pages_sharing: 5119146
   pages_to_scan: 4000
   pages_unshared: 1760924
   pages_volatile: 10761341
   run: 1
   sleep_millisecs: 20
   stable_node_chains: 167
   stable_node_chains_prune_millisecs: 2000
   stable_node_dups: 2751
   use_zero_pages: 0
   zero_pages_sharing: 0

After the service is running for 30 minutes to an hour, 4 to 5 million
shared pages are common for this workload when using KSM.

Detailed changes:

1. New options for prctl system command
   This patch series adds two new options to the prctl system call.
   The first one allows enabling KSM at the process level and the second
   one allows querying the setting.

The setting will be inherited by child processes.

With the above setting, KSM can be enabled for the seed process of a cgroup
and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing
   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMA's and enable KSM for the eligible VMA's.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

3. Add general_profit metric
   The general_profit metric of KSM is specified in the documentation,
   but not calculated.  This adds the general profit metric to
   /sys/kernel/debug/mm/ksm.

4. Add more metrics to ksm_stat
   This adds the process profit metric to /proc/<pid>/ksm_stat.

5. Add more tests to ksm_tests and ksm_functional_tests
   This adds an option to specify the merge type to the ksm_tests.
   This allows testing madvise and prctl KSM.

   It also adds two new tests to ksm_functional_tests: one to test
   the new prctl options and the other one is a fork test to verify that
   the KSM process setting is inherited by child processes.

This patch (of 3):

So far KSM can only be enabled by calling madvise for memory regions.  To
be able to use KSM for more workloads, KSM needs to have the ability to be
enabled / disabled at the process / cgroup level.

1. New options for prctl system command

   This patch series adds two new options to the prctl system call.
   The first one allows enabling KSM at the process level and the second
   one allows querying the setting.

   The setting will be inherited by child processes.

   With the above setting, KSM can be enabled for the seed process of a
   cgroup and all processes in the cgroup will inherit the setting.

2. Changes to KSM processing

   When KSM is enabled at the process level, the KSM code will iterate
   over all the VMA's and enable KSM for the eligible VMA's.

   When forking a process that has KSM enabled, the setting will be
   inherited by the new child process.

  1) Introduce new MMF_VM_MERGE_ANY flag

     This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
     is set, kernel samepage merging (ksm) gets enabled for all vma's of a
     process.

  2) Setting VM_MERGEABLE on VMA creation

     When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
     VM_MERGEABLE flag will be set for this VMA.

  3) support disabling of ksm for a process

     This adds the ability to disable ksm for a process if ksm has been
     enabled for the process with prctl.

  4) add new prctl option to get and set ksm for a process

     This adds two new options to the prctl system call
     - enable ksm for all vmas of a process (if the vmas support it).
     - query if ksm has been enabled for a process.

3. Disabling MMF_VM_MERGE_ANY for storage keys in s390

   In the s390 architecture when storage keys are used, the
   MMF_VM_MERGE_ANY will be disabled.
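
The two prctl options described above can be exercised from user space roughly as follows. This is a minimal sketch, not part of the patch: the option numbers 67 and 68 are the values of PR_SET_MEMORY_MERGE and PR_GET_MEMORY_MERGE in upstream v6.4 and are assumed to be unchanged in this backport, and the wrapper names are hypothetical.

```python
import ctypes
import os

# Assumption: upstream v6.4 values from include/uapi/linux/prctl.h;
# verify them against the headers of the kernel actually in use.
PR_SET_MEMORY_MERGE = 67
PR_GET_MEMORY_MERGE = 68


def _prctl(option: int, arg2: int = 0) -> int:
    """Call prctl(option, arg2, 0, 0, 0) and raise OSError on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.prctl.restype = ctypes.c_int
    libc.prctl.argtypes = [ctypes.c_int] + [ctypes.c_ulong] * 4
    ret = libc.prctl(option, arg2, 0, 0, 0)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    return ret


def set_memory_merge(enable: bool) -> None:
    """Enable or disable KSM for all eligible VMAs of the calling process."""
    _prctl(PR_SET_MEMORY_MERGE, 1 if enable else 0)


def get_memory_merge() -> bool:
    """Return True if KSM has been enabled for the calling process."""
    return _prctl(PR_GET_MEMORY_MERGE) == 1
```

On a kernel without this feature the calls are expected to fail, which surfaces here as an OSError. Because the setting is inherited by child processes, a launcher could call set_memory_merge(True) before forking its workload.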

Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	kernel/sys.c mm/ksm.c mm/mmap.c
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

ksm: count allocated ksm rmap_items for each process · 8c3ecf85

Committed by xu xin on May 19, 2023

mainline inclusion
from mainline-v6.1-rc1
commit cb4df4ca
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb4df4cae4f2bd8cf7a32eff81178fce31600f7c

----------------------------------------------------------------------

Patch series "ksm: count allocated rmap_items and update documentation",
v5.

KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.

To help users determine how beneficial the ksm-policy (like madvise) they
are using is, we add a new interface /proc/<pid>/ksm_stat for each process.
The value "ksm_rmap_items" in it indicates the total number of ksm
rmap_items allocated for this process.

The detailed description can be seen in the following patches' commit
message.

This patch (of 2):

KSM can save memory by merging identical pages, but also can consume
additional memory, because it needs to generate rmap_items to save each
scanned page's brief rmap information.  Some of these pages may be merged,
but some may not be mergeable even after being checked several times, and
the rmap_items kept for them are unprofitable memory consumption.

Whether KSM saves or consumes memory system-wide can be determined from
a combined calculation over pages_sharing, pages_shared, pages_unshared
and pages_volatile.  A simple approximate calculation:

	profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
	         sizeof(rmap_item);

where all_rmap_items equals the sum of pages_sharing, pages_shared,
pages_unshared and pages_volatile.
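
The system-wide approximation above can be written out as a small function. This is only a sketch: the default page size of 4096 bytes and the rmap_item size of 64 bytes are assumptions that depend on architecture and kernel build, and the function name is hypothetical.

```python
def ksm_general_profit(pages_sharing: int, pages_shared: int,
                       pages_unshared: int, pages_volatile: int,
                       page_size: int = 4096,
                       rmap_item_size: int = 64) -> int:
    """Approximate system-wide KSM profit in bytes.

    page_size and rmap_item_size are assumed defaults, not values
    taken from the commit message.
    """
    all_rmap_items = (pages_sharing + pages_shared
                      + pages_unshared + pages_volatile)
    return pages_sharing * page_size - all_rmap_items * rmap_item_size


# 10 merged pages, 20 rmap_items in total: 10 * 4096 - 20 * 64
print(ksm_general_profit(10, 2, 3, 5))  # 39680
```

A negative result would mean that the rmap_item bookkeeping costs more memory than merging saves.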

But we cannot calculate this kind of ksm profit for a single process,
because the number of ksm rmap_items of a process is not available.
If user applications could obtain this information, it would help
upper-level users know how beneficial the ksm-policy (like madvise) they
are using is, and then optimize their app code.  For example, if an
application madvises 1000 pages as MERGEABLE while only a few pages are
really merged, then it is not cost-efficient.

So we add a new interface /proc/<pid>/ksm_stat for each process.  For now
it shows only the value of ksm_rmap_items; more values can be added in
the future.

So similarly, we can calculate the ksm profit approximately for a single
process by:

	profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
		 sizeof(rmap_item);

where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
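
A user-space tool could apply the per-process formula along these lines. The sketch assumes that /proc/<pid>/ksm_stat contains one "name value" pair per line and that rmap_item is 64 bytes with 4096-byte pages; neither the exact file layout nor those sizes are stated in this message, and the function names are hypothetical.

```python
def parse_ksm_stat(text: str) -> dict:
    """Parse 'name value' lines (assumed layout of /proc/<pid>/ksm_stat)."""
    stats = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        try:
            stats[name] = int(value)
        except ValueError:
            continue  # skip blank or unexpected lines
    return stats


def ksm_process_profit(ksm_merging_pages: int, ksm_rmap_items: int,
                       page_size: int = 4096,
                       rmap_item_size: int = 64) -> int:
    """Approximate per-process KSM profit in bytes (sizes are assumptions)."""
    return ksm_merging_pages * page_size - ksm_rmap_items * rmap_item_size


stats = parse_ksm_stat("ksm_rmap_items 1000\n")
print(ksm_process_profit(100, stats["ksm_rmap_items"]))  # 345600
```

In practice the first argument would come from reading /proc/<pid>/ksm_merging_pages and the text from /proc/<pid>/ksm_stat of the same process.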

Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Bagas Sanjaya <bagasdotme@gmail.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Izik Eidus <izik.eidus@ravellosystems.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/mm_types.h
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

ksm: count ksm merging pages for each process · 44acbc78

Committed by xu xin on May 19, 2023

mainline inclusion
from mainline-v5.19-rc1
commit 76093853
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7609385337a4feb6236e42dcd0df2185683ce839

----------------------------------------------------------------------

Some applications or containers want to use KSM by calling madvise() to
advise areas of address space to be MERGEABLE.  But they may not know
which applications are more likely to cause real merges in the
deployment.  If this patch is applied, it helps them know their
corresponding number of merged pages, and then optimize their app code.

As KSM currently only counts merging pages (e.g. ksm_pages_sharing and
ksm_pages_shared) for the whole system, we cannot see finer-grained
merging information, so upper-level applications cannot easily choose
which areas to mark as mergeable according to the KSM page merging
probability of each process.  Therefore, it is necessary to add extra
statistics so that upper-level users can know the detailed KSM merging
information of each process.

We add a new proc file named ksm_merging_pages under /proc/<pid>/ to
indicate the number of ksm merging pages involved in this process.

[akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s]
Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn
Signed-off-by: xu xin <xu.xin16@zte.com.cn>
Reported-by: kernel test robot <lkp@intel.com>
Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Cc: Kees Cook <keescook@chromium.org>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Ohhoon Kwon <ohoono.kwon@samsung.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
Cc: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/mm_types.h
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>

May 18, 2023 (11 commits)

userswap: add checks for input addresses · 03714218

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add checks for new_addr in uswap_mremap() and src_addr in
uswap_check_copy_mode(), including user mode checks, overlapping
checks, etc.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: fix some type and logical bugs · 74c0e7cd

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Fix the following type and logic bugs:
1) Change the type of the index variable from int to unsigned long to
support large memory registration.
2) Fix the bug that USWAP_PAGES_DIRTY does not take effect.
3) Take the mmap_read_lock() when using the VMA in
uswap_adjust_uffd_range().
4) Do some code refactoring and cleanup.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: split uswap_register() to validate address ranges · 8c509665

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Split uswap_register() into uswap_register() and uswap_adjust_uffd_range().
Before validate_range(), use uswap_register() to handle uswap mode.
After validate_range(), use uswap_adjust_uffd_range() to change address
range to VMA range, which could reduce fragmentation caused by VMA
splitting.
By splitting uswap_register(), we could prevent the userswap registration
of invalid input address ranges.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: convert enable_userswap to static key · cbf06b7d

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Replace enable_userswap with struct static_key_false userswap_enabled.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: move userswap feature code into mm/userswap.c · 4a55c5b4

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

This patch moves the code related to enable_userswap and CONFIG_USERSWAP
to mm/userswap.c. This allows for better encapsulation and easier
maintenance.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: provide cpu info in userfault msg · 2ca987f1

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

The uffd_msg.reserved3 field is used to carry the CPU information of
the page fault.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: introduce new flag to determine the first page fault · ef3a1632

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Introduce new flag to determine the first page fault.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: fix VM_BUG_ON() in handle_userfault() · 010932f5

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

When CONFIG_VM_BUG_ON=y and userswap feature is used, there is a kernel
BUG in handle_userfault(). VM_BUG_ON() didn't allow more than one reason
flag.
Fix this by skipping VM_BUG_ON() if reason is VM_UFFD_MISSING|VM_USWAP.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: introduce MREMAP_USWAP_SET_PTE to remap for swapping out · c97cdd7e

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

We introduce MREMAP_USWAP_SET_PTE to implement remapping in the swap-out
phase. Unmap the pages between 'addr ~ addr+old_len' and remap them to
'new_addr ~ new_addr+new_len'. During unmapping, the PTE of old_addr is
set to SWP_USERSWAP_ENTRY.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

userswap: introduce UFFDIO_COPY_MODE_DIRECT_MAP to map without copying · 444ec524

Committed by ZhangPeng on May 18, 2023

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6CAIM

--------------------------------

Add a new UFFDIO_COPY mode UFFDIO_COPY_MODE_DIRECT_MAP to map physical
pages without copy_from_user().
We introduce uswap_unmap_anon_page() to unmap an anonymous page and
uswap_map_anon_page() to map page to src addr. We also introduce
mfill_atomic_pte_nocopy() to achieve zero copy by unmapping src_addr to the
physical page and establishing the mapping from dst_addr to the physical
page.
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>

udp: Update reuse->has_conns under reuseport_lock. · 3c424294

Committed by Kuniyuki Iwashima on October 14, 2022

stable inclusion
from stable-v5.10.152
commit 43d5109296fab30b7467d7d399bb51f1bb27eff4
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I73HJ0

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=43d5109296fab30b7467d7d399bb51f1bb27eff4

--------------------------------

commit 69421bf9 upstream.

When we call connect() for a UDP socket in a reuseport group, we have
to update sk->sk_reuseport_cb->has_conns to 1.  Otherwise, the kernel
could select an unconnected socket wrongly for packets sent to the
connected socket.

However, the current way of setting has_conns is invalid and can
trigger that problem.  reuseport_has_conns() changes has_conns under
rcu_read_lock(), which upgrades the RCU reader to the updater.  Then,
it must do the update under the updater's lock, reuseport_lock, but
it doesn't for now.

For this reason, there is a race below where we fail to set has_conns
resulting in the wrong socket selection.  To avoid the race, let's split
the reader and updater with proper locking.

 cpu1                               cpu2
+----+                             +----+

__ip[46]_datagram_connect()        reuseport_grow()
.                                  .
|- reuseport_has_conns(sk, true)   |- more_reuse = __reuseport_alloc(more_socks_size)
|  .                               |
|  |- rcu_read_lock()
|  |- reuse = rcu_dereference(sk->sk_reuseport_cb)
|  |
|  |                               |  /* reuse->has_conns == 0 here */
|  |                               |- more_reuse->has_conns = reuse->has_conns
|  |- reuse->has_conns = 1         |  /* more_reuse->has_conns SHOULD BE 1 HERE */
|  |                               |
|  |                               |- rcu_assign_pointer(reuse->socks[i]->sk_reuseport_cb,
|  |                               |                     more_reuse)
|  `- rcu_read_unlock()            `- kfree_rcu(reuse, rcu)
|
|- sk->sk_state = TCP_ESTABLISHED
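
The lost update in the diagram above can be replayed deterministically with a small user-space model. This is a Python illustration only, not kernel code: the Reuse and Sock classes, the function names and the use of threading.Lock as a stand-in for reuseport_lock are all assumptions made for the sketch.

```python
import threading
from dataclasses import dataclass


@dataclass
class Reuse:
    has_conns: int = 0


class Sock:
    def __init__(self) -> None:
        self.reuseport_cb = Reuse()


def racy_interleaving() -> int:
    """Replay the diagram's interleaving; the update is lost."""
    sk = Sock()
    reuse = sk.reuseport_cb                        # cpu1: rcu_dereference()
    more_reuse = Reuse(has_conns=reuse.has_conns)  # cpu2: copies has_conns == 0
    reuse.has_conns = 1                            # cpu1: writes to the old object
    sk.reuseport_cb = more_reuse                   # cpu2: rcu_assign_pointer()
    return sk.reuseport_cb.has_conns


reuseport_lock = threading.Lock()  # stands in for the kernel's reuseport_lock


def set_has_conns(sk: Sock) -> None:
    """Updater side: modify has_conns only while holding the lock."""
    with reuseport_lock:
        sk.reuseport_cb.has_conns = 1


def grow(sk: Sock) -> None:
    """Copy and replace the reuse object while holding the same lock."""
    with reuseport_lock:
        sk.reuseport_cb = Reuse(has_conns=sk.reuseport_cb.has_conns)


def serialized(update_first: bool) -> int:
    """Run both steps under the lock in either order."""
    sk = Sock()
    steps = [set_has_conns, grow] if update_first else [grow, set_has_conns]
    for step in steps:
        step(sk)
    return sk.reuseport_cb.has_conns


print(racy_interleaving())                  # 0
print(serialized(True), serialized(False))  # 1 1
```

Once both the copy and the update run under the same lock, either ordering leaves has_conns set on the object the socket ends up pointing to.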

Note the likely(reuse) in reuseport_has_conns_set() is always true,
but we put the test there for ease of review.  [0]

For the record, usually, sk_reuseport_cb is changed under lock_sock().
The only exception is reuseport_grow() & TCP reqsk migration case.

  1) shutdown() TCP listener, which is moved into the latter part of
     reuse->socks[] to migrate reqsk.

  2) New listen() overflows reuse->socks[] and calls reuseport_grow().

  3) reuse->max_socks overflows u16 with the new listener.

  4) reuseport_grow() pops the old shutdown()ed listener from the array
     and updates its sk->sk_reuseport_cb to NULL without lock_sock().

shutdown()ed TCP sk->sk_reuseport_cb can be changed without lock_sock(),
but, reuseport_has_conns_set() is called only for UDP under lock_sock(),
so likely(reuse) can never be false in reuseport_has_conns_set().

[0]: https://lore.kernel.org/netdev/CANn89iLja=eQHbsM_Ta2sQF0tOGU8vAGrh_izRuuHjuO1ouUag@mail.gmail.com/

Fixes: acdcecc6 ("udp: correct reuseport selection with connected sockets")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20221014182625.89913-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>
