1. 07 Jun, 2023 2 commits
    • Z
      crypto: hisilicon/qm - save capability registers in qm init process · 3585bdaf
      Zhiqi Song committed
      driver inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BANJ
      CVE: NA
      
      ----------------------------------------------------------------------
      
      We found that in the reset scenario, if the reset fails and the MSE
      is disabled, the values of the capability registers become invalid.
      When the device is removed in this situation, the unregister process
      reads the related irq vector directly from the capability register
      with the mask. It then gets an out-of-range value that cannot be
      used to obtain the right irq number via pci_irq_vector().
      This will cause the following call trace:
      
      	| Call trace:
      	|  pci_irq_vector+0xfc/0x140
      	|  hisi_qm_uninit+0x278/0x3b0 [hisi_qm]
      	|  hpre_remove+0x16c/0x1c0 [hisi_hpre]
      	|  pci_device_remove+0x6c/0x264
      	|  device_release_driver_internal+0x1ec/0x3e0
      	|  device_release_driver+0x3c/0x60
      	|  pci_stop_bus_device+0xfc/0x22c
      	|  pci_stop_and_remove_bus_device+0x38/0x70
      	|  pci_iov_remove_virtfn+0x108/0x1c0
      	|  sriov_disable+0x7c/0x1e4
      	|  pci_disable_sriov+0x4c/0x6c
      	|  hisi_qm_sriov_disable+0x90/0x160 [hisi_qm]
      	|  hpre_remove+0x1a8/0x1c0 [hisi_hpre]
      	|  pci_device_remove+0x6c/0x264
      	|  device_release_driver_internal+0x1ec/0x3e0
      	|  driver_detach+0x168/0x2d0
      	|  bus_remove_driver+0xc0/0x230
      	|  driver_unregister+0x58/0xdc
      	|  pci_unregister_driver+0x40/0x220
      	|  hpre_exit+0x34/0x64 [hisi_hpre]
      	|  __arm64_sys_delete_module+0x374/0x620
      	[...]
      
      	| Call trace:
      	|  free_msi_irqs+0x25c/0x300
      	|  pci_disable_msi+0x19c/0x264
      	|  pci_free_irq_vectors+0x4c/0x70
      	|  hisi_qm_pci_uninit+0x44/0x90 [hisi_qm]
      	|  hisi_qm_uninit+0x28c/0x3b0 [hisi_qm]
      	|  hpre_remove+0x16c/0x1c0 [hisi_hpre]
      	|  pci_device_remove+0x6c/0x264
      	[...]
      So we pre-store the valid value of the capability register to a global
      array in qm init process, and read the register value from this array
      when we need it. This ensures we can always get valid values.
      Signed-off-by: Zhiqi Song <songzhiqi1@huawei.com>
      Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>
      3585bdaf
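The caching idea in the commit above can be sketched in a few lines. This is a simplified user-space model, not the driver code: `FakeDevice`, `Qm`, `CAP_OFFSETS`, and the register offsets and values are all hypothetical names invented for illustration. It only shows why decoding from a snapshot taken at init stays valid after live register reads start returning garbage.

```python
# Hypothetical model: offsets, values and class names are invented for illustration.
class FakeDevice:
    """Stands in for MMIO; reads return all-ones once the device is 'broken'."""

    def __init__(self, regs):
        self._regs = dict(regs)
        self.broken = False

    def read(self, offset):
        return 0xFFFFFFFF if self.broken else self._regs[offset]


CAP_OFFSETS = (0x3100, 0x3104)  # hypothetical capability register offsets


class Qm:
    def __init__(self, dev):
        self.dev = dev
        # Snapshot every capability register once, while the device is healthy.
        self.cap_cache = {off: dev.read(off) for off in CAP_OFFSETS}

    def get_cap(self, offset, shift, mask):
        # Always decode from the snapshot, never from the live register.
        return (self.cap_cache[offset] >> shift) & mask


dev = FakeDevice({0x3100: 0x00020001, 0x3104: 0x0})
qm = Qm(dev)
dev.broken = True  # simulate a failed reset with memory space disabled
irq_vector = qm.get_cap(0x3100, 16, 0xFFFF)
print(irq_vector)
```

A live read after the simulated failure returns 0xFFFFFFFF, while the cached lookup still yields the value captured at init.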
    • W
      crypto: hisilicon/qm - add a function to set qm algs · c1e54cbb
      Wenkai Lin committed
      driver inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7BANJ
      CVE: NA
      
      ----------------------------------------------------------------------
      
      Extract a public function to set qm algs and
      remove the similar code for setting qm algs
      in each module.
      Signed-off-by: Hao Fang <fanghao11@huawei.com>
      Signed-off-by: Wenkai Lin <linwenkai6@hisilicon.com>
      Signed-off-by: JiangShui Yang <yangjiangshui@h-partners.com>
      c1e54cbb
  2. 06 Jun, 2023 1 commit
  3. 05 Jun, 2023 1 commit
    • L
      tcp/dccp: Add another way to allocate local ports in connect() · 4820557e
      Lu Wei committed
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7AO8G
      CVE: NA
      
      --------------------------------
      
      Commit 07f4c900 ("tcp/dccp: try to not exhaust ip_local_port_range
      in connect()") allocates even ports for connect() first while leaving
      odd ports for bind() and this works well in busy servers.
      
      But this strategy causes severe performance degradation in busy clients.
      When a client has used more than half of the local ports set in
      /proc/sys/net/ipv4/ip_local_port_range and tries to connect to a server
      again, the connect time increases rapidly, since it traverses all the
      even ports even though they are exhausted.
      
      So this patch provides another strategy by introducing a system option:
      local_port_allocation. If it is a busy client, users should set it to 1
      to use sequential allocation while it should be set to 0 in other
      situations. Its default value is 0.
      Signed-off-by: Lu Wei <luwei32@huawei.com>
      Signed-off-by: Liu Jian <liujian56@huawei.com>
      (cherry picked from commit 726c5265)
      4820557e
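The cost difference between the two strategies described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the kernel algorithm: `pick_port` is a hypothetical helper, the port range is tiny, and the scan always starts at the low end of the range rather than at a hashed offset. Mode 0 probes even ports first and then odd ones; mode 1 probes sequentially.

```python
def pick_port(low, high, in_use, mode):
    """Return (port, probes).

    mode 0: probe even ports first, then odd ports.
    mode 1: probe ports sequentially.
    Hypothetical helper for illustration only.
    """
    ports = range(low, high + 1)
    if mode == 0:
        order = [p for p in ports if p % 2 == 0] + [p for p in ports if p % 2 == 1]
    else:
        order = list(ports)
    for probes, port in enumerate(order, start=1):
        if port not in in_use:
            return port, probes
    return None, len(order)


low, high = 32768, 32787  # 20 ports: 10 even, 10 odd
evens_used = {p for p in range(low, high + 1) if p % 2 == 0}

even_first = pick_port(low, high, evens_used, mode=0)
sequential = pick_port(low, high, evens_used, mode=1)
print(even_first, sequential)
```

With every even port taken, the even-first scan walks all ten even ports before it reaches a free odd one, while the sequential scan finds the same port on its second probe. With a real range of tens of thousands of ports the gap grows accordingly.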
  4. 02 Jun, 2023 2 commits
  5. 01 Jun, 2023 2 commits
    • H
      sched: fix performance degradation on lmbench · c6aaa310
      Hui Tang committed
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I7A718
      
      --------------------------------
      
      Performance is worse with the commit named in the 'Fixes' tag
      when running "./lat_ctx -P $SYNC_MAX -s 64 16".
      
      That commit allocates memory for p->prefer_cpus
      even if "prefer_cpus" is not set.
      
      Before that commit, only "p->prefer_cpus" was tested;
      after it, a "!cpumask_empty(p->prefer_cpus)" test was added,
      which causes the performance degradation.
      
      select_task_rq_fair
        ->set_task_select_cpus
          ->prefer_cpus_valid  ----  test cpumask_empty(p->prefer_cpus)
      
      Fixes: ebeb84ad ("cpuset: Introduce new interface for scheduler ...")
      Signed-off-by: Hui Tang <tanghui20@huawei.com>
      (cherry picked from commit d8f77f89)
      c6aaa310
    • X
      cgroup: Stop task iteration when rebinding subsystem · 7ad6b560
      Xiu Jianfeng committed
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I798WQ
      CVE: NA
      
      ----------------------------------------------------------------------
      
      We found a refcount UAF bug as follows:
      
      refcount_t: addition on 0; use-after-free.
      WARNING: CPU: 1 PID: 342 at lib/refcount.c:25 refcount_warn_saturate+0xa0/0x148
      Workqueue: events cpuset_hotplug_workfn
      Call trace:
       refcount_warn_saturate+0xa0/0x148
       __refcount_add.constprop.0+0x5c/0x80
       css_task_iter_advance_css_set+0xd8/0x210
       css_task_iter_advance+0xa8/0x120
       css_task_iter_next+0x94/0x158
       update_tasks_root_domain+0x58/0x98
       rebuild_root_domains+0xa0/0x1b0
       rebuild_sched_domains_locked+0x144/0x188
       cpuset_hotplug_workfn+0x138/0x5a0
       process_one_work+0x1e8/0x448
       worker_thread+0x228/0x3e0
       kthread+0xe0/0xf0
       ret_from_fork+0x10/0x20
      
      then a kernel panic will be triggered as below:
      
      Unable to handle kernel paging request at virtual address 00000000c0000010
      Call trace:
       cgroup_apply_control_disable+0xa4/0x16c
       rebind_subsystems+0x224/0x590
       cgroup_destroy_root+0x64/0x2e0
       css_free_rwork_fn+0x198/0x2a0
       process_one_work+0x1d4/0x4bc
       worker_thread+0x158/0x410
       kthread+0x108/0x13c
       ret_from_fork+0x10/0x18
      
      The race that causes this bug can be shown as below:
      
      (hotplug cpu)                | (umount cpuset)
      mutex_lock(&cpuset_mutex)    | mutex_lock(&cgroup_mutex)
      cpuset_hotplug_workfn        |
       rebuild_root_domains        |  rebind_subsystems
        update_tasks_root_domain   |   spin_lock_irq(&css_set_lock)
         css_task_iter_start       |    list_move_tail(&cset->e_cset_node[ss->id]
         while(css_task_iter_next) |                  &dcgrp->e_csets[ss->id]);
         css_task_iter_end         |   spin_unlock_irq(&css_set_lock)
      mutex_unlock(&cpuset_mutex)  | mutex_unlock(&cgroup_mutex)
      
      Inside css_task_iter_start/next/end, css_set_lock is held and then
      released, so while tasks are being iterated (left side), the css_set may
      be moved to another list (right side). Then it->cset_head points to the
      old list head and it->cset_pos->next points to the head node of the new
      list, which can't be used as a struct css_set.
      
      To fix this issue, introduce the CSS_TASK_ITER_STOPPED flag for
      css_task_iter. When moving a css_set to dcgrp->e_csets[ss->id] in
      rebind_subsystems(), stop the task iteration.
      Reported-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Link: https://www.spinics.net/lists/cgroups/msg37935.html
      Fixes: f9a25f77 ("cpusets: Rebuild root domain deadline accounting information")
      Signed-off-by: Xiu Jianfeng <xiujianfeng@huaweicloud.com>
      Signed-off-by: Gaosheng Cui <cuigaosheng1@huawei.com>
      Reviewed-by: Wang Weiyang <wangweiyang2@huawei.com>
      Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
      Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
      (cherry picked from commit e52586f4)
      7ad6b560
  6. 30 May, 2023 8 commits
  7. 29 May, 2023 1 commit
  8. 23 May, 2023 3 commits
    • T
      net/mlx5: Fix possible use-after-free in async command interface · 15fec526
      Tariq Toukan committed
      stable inclusion
      from stable-v5.10.153
      commit bbcc06933f35651294ea1e963757502312c2171f
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=bbcc06933f35651294ea1e963757502312c2171f
      
      --------------------------------
      
      [ Upstream commit bacd22df ]
      
      mlx5_cmd_cleanup_async_ctx should return only after all its callback
      handlers have completed. Before this patch, the race below between
      mlx5_cmd_cleanup_async_ctx and mlx5_cmd_exec_cb_handler was possible and
      led to a use-after-free:
      
      1. mlx5_cmd_cleanup_async_ctx is called while num_inflight is 2 (i.e.
         elevated by 1, a single inflight callback).
      2. mlx5_cmd_cleanup_async_ctx decreases num_inflight to 1.
      3. mlx5_cmd_exec_cb_handler is called, decreases num_inflight to 0 and
         is about to call wake_up().
      4. mlx5_cmd_cleanup_async_ctx calls wait_event, which returns
         immediately as the condition (num_inflight == 0) holds.
      5. mlx5_cmd_cleanup_async_ctx returns.
      6. The caller of mlx5_cmd_cleanup_async_ctx frees the mlx5_async_ctx
         object.
      7. mlx5_cmd_exec_cb_handler goes on and calls wake_up() on the freed
         object.
      
      Fix it by syncing using a completion object. Mark it completed when
      num_inflight reaches 0.
      
      Trace:
      
      BUG: KASAN: use-after-free in do_raw_spin_lock+0x23d/0x270
      Read of size 4 at addr ffff888139cd12f4 by task swapper/5/0
      
      CPU: 5 PID: 0 Comm: swapper/5 Not tainted 6.0.0-rc3_for_upstream_debug_2022_08_30_13_10 #1
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
      Call Trace:
       <IRQ>
       dump_stack_lvl+0x57/0x7d
       print_report.cold+0x2d5/0x684
       ? do_raw_spin_lock+0x23d/0x270
       kasan_report+0xb1/0x1a0
       ? do_raw_spin_lock+0x23d/0x270
       do_raw_spin_lock+0x23d/0x270
       ? rwlock_bug.part.0+0x90/0x90
       ? __delete_object+0xb8/0x100
       ? lock_downgrade+0x6e0/0x6e0
       _raw_spin_lock_irqsave+0x43/0x60
       ? __wake_up_common_lock+0xb9/0x140
       __wake_up_common_lock+0xb9/0x140
       ? __wake_up_common+0x650/0x650
       ? destroy_tis_callback+0x53/0x70 [mlx5_core]
       ? kasan_set_track+0x21/0x30
       ? destroy_tis_callback+0x53/0x70 [mlx5_core]
       ? kfree+0x1ba/0x520
       ? do_raw_spin_unlock+0x54/0x220
       mlx5_cmd_exec_cb_handler+0x136/0x1a0 [mlx5_core]
       ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
       ? mlx5_cmd_cleanup_async_ctx+0x220/0x220 [mlx5_core]
       mlx5_cmd_comp_handler+0x65a/0x12b0 [mlx5_core]
       ? dump_command+0xcc0/0xcc0 [mlx5_core]
       ? lockdep_hardirqs_on_prepare+0x400/0x400
       ? cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
       cmd_comp_notifier+0x7e/0xb0 [mlx5_core]
       atomic_notifier_call_chain+0xd7/0x1d0
       mlx5_eq_async_int+0x3ce/0xa20 [mlx5_core]
       atomic_notifier_call_chain+0xd7/0x1d0
       ? irq_release+0x140/0x140 [mlx5_core]
       irq_int_handler+0x19/0x30 [mlx5_core]
       __handle_irq_event_percpu+0x1f2/0x620
       handle_irq_event+0xb2/0x1d0
       handle_edge_irq+0x21e/0xb00
       __common_interrupt+0x79/0x1a0
       common_interrupt+0x78/0xa0
       </IRQ>
       <TASK>
       asm_common_interrupt+0x22/0x40
      RIP: 0010:default_idle+0x42/0x60
      Code: c1 83 e0 07 48 c1 e9 03 83 c0 03 0f b6 14 11 38 d0 7c 04 84 d2 75 14 8b 05 eb 47 22 02 85 c0 7e 07 0f 00 2d e0 9f 48 00 fb f4 <c3> 48 c7 c7 80 08 7f 85 e8 d1 d3 3e fe eb de 66 66 2e 0f 1f 84 00
      RSP: 0018:ffff888100dbfdf0 EFLAGS: 00000242
      RAX: 0000000000000001 RBX: ffffffff84ecbd48 RCX: 1ffffffff0afe110
      RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff835cc9bc
      RBP: 0000000000000005 R08: 0000000000000001 R09: ffff88881dec4ac3
      R10: ffffed1103bd8958 R11: 0000017d0ca571c9 R12: 0000000000000005
      R13: ffffffff84f024e0 R14: 0000000000000000 R15: dffffc0000000000
       ? default_idle_call+0xcc/0x450
       default_idle_call+0xec/0x450
       do_idle+0x394/0x450
       ? arch_cpu_idle_exit+0x40/0x40
       ? do_idle+0x17/0x450
       cpu_startup_entry+0x19/0x20
       start_secondary+0x221/0x2b0
       ? set_cpu_sibling_map+0x2070/0x2070
       secondary_startup_64_no_verify+0xcd/0xdb
       </TASK>
      
      Allocated by task 49502:
       kasan_save_stack+0x1e/0x40
       __kasan_kmalloc+0x81/0xa0
       kvmalloc_node+0x48/0xe0
       mlx5e_bulk_async_init+0x35/0x110 [mlx5_core]
       mlx5e_tls_priv_tx_list_cleanup+0x84/0x3e0 [mlx5_core]
       mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
       mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
       mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
       mlx5e_suspend+0xdb/0x140 [mlx5_core]
       mlx5e_remove+0x89/0x190 [mlx5_core]
       auxiliary_bus_remove+0x52/0x70
       device_release_driver_internal+0x40f/0x650
       driver_detach+0xc1/0x180
       bus_remove_driver+0x125/0x2f0
       auxiliary_driver_unregister+0x16/0x50
       mlx5e_cleanup+0x26/0x30 [mlx5_core]
       cleanup+0xc/0x4e [mlx5_core]
       __x64_sys_delete_module+0x2b5/0x450
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Freed by task 49502:
       kasan_save_stack+0x1e/0x40
       kasan_set_track+0x21/0x30
       kasan_set_free_info+0x20/0x30
       ____kasan_slab_free+0x11d/0x1b0
       kfree+0x1ba/0x520
       mlx5e_tls_priv_tx_list_cleanup+0x2e7/0x3e0 [mlx5_core]
       mlx5e_ktls_cleanup_tx+0x38f/0x760 [mlx5_core]
       mlx5e_cleanup_nic_tx+0xa7/0x100 [mlx5_core]
       mlx5e_detach_netdev+0x1ca/0x2b0 [mlx5_core]
       mlx5e_suspend+0xdb/0x140 [mlx5_core]
       mlx5e_remove+0x89/0x190 [mlx5_core]
       auxiliary_bus_remove+0x52/0x70
       device_release_driver_internal+0x40f/0x650
       driver_detach+0xc1/0x180
       bus_remove_driver+0x125/0x2f0
       auxiliary_driver_unregister+0x16/0x50
       mlx5e_cleanup+0x26/0x30 [mlx5_core]
       cleanup+0xc/0x4e [mlx5_core]
       __x64_sys_delete_module+0x2b5/0x450
       do_syscall_64+0x3d/0x90
       entry_SYSCALL_64_after_hwframe+0x46/0xb0
      
      Fixes: e355477e ("net/mlx5: Make mlx5_cmd_exec_cb() a safe API")
      Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
      Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      Link: https://lore.kernel.org/r/20221026135153.154807-8-saeed@kernel.org
      Signed-off-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>
      15fec526
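The synchronization pattern the commit above describes (signal a completion object when the in-flight count reaches zero, and have cleanup wait on that completion rather than re-check the counter) can be modelled in user space. This is a sketch, not the mlx5 code: `AsyncCtx`, `submit`, and `_put` are hypothetical names, a lock-protected integer stands in for the kernel's atomic counter, and `threading.Event` stands in for a completion. Python cannot exhibit the use-after-free itself; the sketch only shows the ordering guarantee.

```python
import threading


class AsyncCtx:
    """Hypothetical stand-in for an async command context."""

    def __init__(self):
        self._lock = threading.Lock()
        self.num_inflight = 1  # initial reference held by the owner
        self.inflight_done = threading.Event()

    def _put(self):
        with self._lock:
            self.num_inflight -= 1
            last = self.num_inflight == 0
        if last:
            # Only the final put signals; nothing touches the context afterwards.
            self.inflight_done.set()

    def submit(self, work):
        with self._lock:
            self.num_inflight += 1

        def handler():
            work()
            self._put()

        thread = threading.Thread(target=handler)
        thread.start()
        return thread

    def cleanup(self):
        self._put()                # drop the owner's reference
        self.inflight_done.wait()  # returns only after the final put has signalled


results = []
ctx = AsyncCtx()
threads = [ctx.submit(lambda i=i: results.append(i)) for i in range(4)]
ctx.cleanup()
finished = sorted(results)  # every callback has run by the time cleanup() returns
for t in threads:
    t.join()
print(finished)
```

Because the waiter blocks on the event rather than on the counter value, it cannot return between a handler's final decrement and that handler's wake-up call, which is the window the original race exploited.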
    • H
      media: videodev2.h: V4L2_DV_BT_BLANKING_HEIGHT should check 'interlaced' · bc22dbf2
      Hans Verkuil committed
      stable inclusion
      from stable-v5.10.153
      commit b6c7446d0a38725c64305bfb4728625d4f411f50
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=b6c7446d0a38725c64305bfb4728625d4f411f50
      
      --------------------------------
      
      [ Upstream commit 8da7f097 ]
      
      If it is a progressive (non-interlaced) format, then ignore the
      interlaced timing values.
      Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
      Fixes: 7f68127f ([media] videodev2.h: defines to calculate blanking and frame sizes)
      Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>
      bc22dbf2
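The rule stated in the commit above can be written out as a small calculation. This is a sketch of the intended behaviour, not the macro itself: `blanking_height` is a hypothetical helper, the dictionary keys follow the field names of `struct v4l2_bt_timings`, and the sample timing values are illustrative.

```python
def blanking_height(bt):
    """Vertical blanking lines for a set of BT timings (hypothetical helper).

    The il_* (second-field) values are counted only when the format is
    interlaced; for progressive formats they are ignored even if non-zero.
    """
    height = bt["vfrontporch"] + bt["vsync"] + bt["vbackporch"]
    if bt["interlaced"]:
        height += bt["il_vfrontporch"] + bt["il_vsync"] + bt["il_vbackporch"]
    return height


# Progressive timings carrying stale interlaced values (illustrative numbers).
progressive = {
    "interlaced": 0,
    "vfrontporch": 4, "vsync": 5, "vbackporch": 36,
    "il_vfrontporch": 2, "il_vsync": 5, "il_vbackporch": 16,
}
# Interlaced timings where both fields contribute.
interlaced = {
    "interlaced": 1,
    "vfrontporch": 2, "vsync": 5, "vbackporch": 15,
    "il_vfrontporch": 2, "il_vsync": 5, "il_vbackporch": 16,
}

print(blanking_height(progressive), blanking_height(interlaced))
```

Without the `interlaced` check, the progressive example would report 68 blanking lines instead of 45.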
    • A
      media: v4l2: Fix v4l2_i2c_subdev_set_name function documentation · 441fb6eb
      Alexander Stein committed
      stable inclusion
      from stable-v5.10.153
      commit 4953a989b72d2b809b18dde7a4c2844cba4232d4
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I64YCA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4953a989b72d2b809b18dde7a4c2844cba4232d4
      
      --------------------------------
      
      [ Upstream commit bb9ea2c3 ]
      
      The doc says the I²C device's name is used if devname is NULL, but
      actually the I²C device driver's name is used.
      
      Fixes: 06582930 ("media: v4l: subdev: Add a function to set an I²C sub-device's name")
      Signed-off-by: Alexander Stein <alexander.stein@ew.tq-group.com>
      Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
      Signed-off-by: Mauro Carvalho Chehab <mchehab@kernel.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Lipeng Sang <sanglipeng1@jd.com>
      441fb6eb
  9. 22 May, 2023 3 commits
  10. 19 May, 2023 6 commits
    • N
      memcg: support ksm merge any mode per cgroup · 0f6fb357
      Nanyong Sun committed
      hulk inclusion
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      ----------------------------------------------------------------------
      
      Add control file "memory.ksm" to enable ksm per cgroup.
      Writing 1 to it sets all tasks currently in the cgroup to ksm merge
      any mode, which means ksm gets enabled for all vma's of a process.
      Writing 0 disables ksm for them and unmerges the
      merged pages.
      Reading the file shows the current state and the ksm-related profits
      of this cgroup.
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      0f6fb357
    • D
      mm/ksm: unmerge and clear VM_MERGEABLE when setting PR_SET_MEMORY_MERGE=0 · 351ceedb
      David Hildenbrand committed
      mainline inclusion
      from mainline-v6.4-rc1
      commit 24139c07
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=24139c07f413ef4b555482c758343d71392a19bc
      
      ----------------------------------------------------------------------
      
      Patch series "mm/ksm: improve PR_SET_MEMORY_MERGE=0 handling and cleanup
      disabling KSM", v2.
      
      (1) Make PR_SET_MEMORY_MERGE=0 unmerge pages like setting MADV_UNMERGEABLE
      does, (2) add a selftest for it and (3) factor out disabling of KSM from
      s390/gmap code.
      
      This patch (of 3):
      
      Let's unmerge any KSM pages when setting PR_SET_MEMORY_MERGE=0, and clear
      the VM_MERGEABLE flag from all VMAs -- just like KSM would.  Of course,
      only do that if we previously set PR_SET_MEMORY_MERGE=1.
      
      Link: https://lkml.kernel.org/r/20230422205420.30372-1-david@redhat.com
      Link: https://lkml.kernel.org/r/20230422205420.30372-2-david@redhat.com
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Acked-by: Stefan Roesch <shr@devkernel.io>
      Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
      Cc: Claudio Imbrenda <imbrenda@linux.ibm.com>
      Cc: Heiko Carstens <hca@linux.ibm.com>
      Cc: Janosch Frank <frankja@linux.ibm.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Shuah Khan <shuah@kernel.org>
      Cc: Sven Schnelle <svens@linux.ibm.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	mm/ksm.c
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      351ceedb
    • S
      mm: add new KSM process and sysfs knobs · a098d41e
      Stefan Roesch committed
      mainline inclusion
      from mainline-v6.4-rc1
      commit d21077fb
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d21077fbc2fc987c2e593c34dc3b4d84e546dc9f
      
      ----------------------------------------------------------------------
      
      This adds the general_profit KSM sysfs knob and the process profit metric
      knobs to ksm_stat.
      
      1) expose general_profit metric
      
         The documentation mentions a general profit metric, however this
         metric is not calculated.  In addition the formula depends on the size
         of internal structures, which makes it more difficult for an
         administrator to make the calculation.  Adding the metric for a better
         user experience.
      
      2) document general_profit sysfs knob
      
      3) calculate ksm process profit metric
      
         The ksm documentation mentions the process profit metric and how to
         calculate it.  This adds the calculation of the metric.
      
      4) mm: expose ksm process profit metric in ksm_stat
      
         This exposes the ksm process profit metric in /proc/<pid>/ksm_stat.
         The documentation mentions the formula for the ksm process profit
         metric, however it does not calculate it.  In addition the formula
         depends on the size of internal structures.  So it makes sense to
         expose it.
      
      5) document new procfs ksm knobs
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-3-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Reviewed-by: Bagas Sanjaya <bagasdotme@gmail.com>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      a098d41e
    • S
      mm: add new api to enable ksm per process · 2cd2cdfe
      Stefan Roesch committed
      mainline inclusion
      from mainline-v6.4-rc1
      commit d7597f59
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7597f59d1d33e9efbffa7060deb9ee5bd119e62
      
      ----------------------------------------------------------------------
      
      Patch series "mm: process/cgroup ksm support", v9.
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      Use case 1:
        The madvise call is not available in the programming language.  An
        example for this are programs with forked workloads using a garbage
        collected language without pointers.  In such a language madvise cannot
        be made available.
      
        In addition the addresses of objects get moved around as they are
        garbage collected.  KSM sharing needs to be enabled "from the outside"
        for these types of workloads.
      
      Use case 2:
        The same interpreter can also be used for workloads where KSM brings
        no benefit or even has overhead.  We'd like to be able to enable KSM on
        a workload by workload basis.
      
      Use case 3:
        With the madvise call sharing opportunities are only enabled for the
        current process: it is a workload-local decision.  A considerable number
        of sharing opportunities may exist across multiple workloads or jobs (if
        they are part of the same security domain).  Only a higher level entity
        like a job scheduler or container can know for certain if it's running
        one or more instances of a job.  That job scheduler however doesn't have
        the necessary internal workload knowledge to make targeted madvise
        calls.
      
      Security concerns:
      
        In previous discussions security concerns have been brought up.  The
        problem is that an individual workload does not have the knowledge about
        what else is running on a machine.  Therefore it has to be very
        conservative in what memory areas can be shared or not.  However, if the
        system is dedicated to running multiple jobs within the same security
        domain, it's the job scheduler that has the knowledge that sharing can be
        safely enabled and is even desirable.
      
      Performance:
      
        Experiments with using UKSM have shown a capacity increase of around 20%.
      
        Here are the metrics from an instagram workload (taken from a machine
        with 64GB main memory):
      
         full_scans: 445
         general_profit: 20158298048
         max_page_sharing: 256
         merge_across_nodes: 1
         pages_shared: 129547
         pages_sharing: 5119146
         pages_to_scan: 4000
         pages_unshared: 1760924
         pages_volatile: 10761341
         run: 1
         sleep_millisecs: 20
         stable_node_chains: 167
         stable_node_chains_prune_millisecs: 2000
         stable_node_dups: 2751
         use_zero_pages: 0
         zero_pages_sharing: 0
      
      After the service is running for 30 minutes to an hour, 4 to 5 million
      shared pages are common for this workload when using KSM.
      
      Detailed changes:
      
      1. New options for prctl system command
         This patch series adds two new options to the prctl system call.
         The first one allows to enable KSM at the process level and the second
         one to query the setting.
      
      The setting will be inherited by child processes.
      
      With the above setting, KSM can be enabled for the seed process of a cgroup
      and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMA's and enable KSM for the eligible VMA's.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
      3. Add general_profit metric
         The general_profit metric of KSM is specified in the documentation,
         but not calculated.  This adds the general profit metric to
         /sys/kernel/debug/mm/ksm.
      
      4. Add more metrics to ksm_stat
         This adds the process profit metric to /proc/<pid>/ksm_stat.
      
      5. Add more tests to ksm_tests and ksm_functional_tests
         This adds an option to specify the merge type to the ksm_tests.
         This allows to test madvise and prctl KSM.
      
         It also adds two new tests to ksm_functional_tests: one to test
         the new prctl options and the other one is a fork test to verify that
         the KSM process setting is inherited by client processes.
      
      This patch (of 3):
      
      So far KSM can only be enabled by calling madvise for memory regions.  To
      be able to use KSM for more workloads, KSM needs to have the ability to be
      enabled / disabled at the process / cgroup level.
      
      1. New options for prctl system command
      
         This patch series adds two new options to the prctl system call.
         The first one allows to enable KSM at the process level and the second
         one to query the setting.
      
         The setting will be inherited by child processes.
      
         With the above setting, KSM can be enabled for the seed process of a
         cgroup and all processes in the cgroup will inherit the setting.
      
      2. Changes to KSM processing
      
         When KSM is enabled at the process level, the KSM code will iterate
         over all the VMA's and enable KSM for the eligible VMA's.
      
         When forking a process that has KSM enabled, the setting will be
         inherited by the new child process.
      
        1) Introduce new MMF_VM_MERGE_ANY flag
      
           This introduces the new MMF_VM_MERGE_ANY flag.  When this flag
           is set, kernel samepage merging (ksm) gets enabled for all vma's of a
           process.
      
        2) Setting VM_MERGEABLE on VMA creation
      
           When a VMA is created, if the MMF_VM_MERGE_ANY flag is set, the
           VM_MERGEABLE flag will be set for this VMA.
      
        3) support disabling of ksm for a process
      
           This adds the ability to disable ksm for a process if ksm has been
           enabled for the process with prctl.
      
        4) add new prctl option to get and set ksm for a process
      
           This adds two new options to the prctl system call
           - enable ksm for all vmas of a process (if the vmas support it).
           - query if ksm has been enabled for a process.
      
      3. Disabling MMF_VM_MERGE_ANY for storage keys in s390
      
         In the s390 architecture when storage keys are used, the
         MMF_VM_MERGE_ANY will be disabled.
      
      Link: https://lkml.kernel.org/r/20230418051342.1919757-1-shr@devkernel.io
      Link: https://lkml.kernel.org/r/20230418051342.1919757-2-shr@devkernel.io
      Signed-off-by: Stefan Roesch <shr@devkernel.io>
      Acked-by: David Hildenbrand <david@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	kernel/sys.c mm/ksm.c mm/mmap.c
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      2cd2cdfe
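From user space, the two prctl options described above could be exercised roughly as follows. This is a hedged sketch: the option numbers 67 and 68 are the values in the upstream `include/uapi/linux/prctl.h` and are assumed to be unchanged in this backport; `_prctl`, `ksm_merge_any_enabled`, and `set_ksm_merge_any` are hypothetical helper names. The helpers return `None` or `False` when the kernel or platform does not support the call.

```python
import ctypes
import ctypes.util

# Upstream uapi values; assumed (not verified) to match this backport.
PR_SET_MEMORY_MERGE = 67
PR_GET_MEMORY_MERGE = 68


def _prctl(option, arg2=0):
    """Call prctl(2); return its result, or None if unavailable or rejected."""
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c") or "libc.so.6",
                           use_errno=True)
        prctl = libc.prctl
    except (OSError, AttributeError):
        return None  # no libc or no prctl symbol on this platform
    prctl.restype = ctypes.c_int
    ul = ctypes.c_ulong
    # Unused arguments are passed as explicit zero unsigned longs.
    ret = prctl(ctypes.c_int(option), ul(arg2), ul(0), ul(0), ul(0))
    return None if ret < 0 else ret


def ksm_merge_any_enabled():
    """True/False for the calling process, or None if the query is unsupported."""
    ret = _prctl(PR_GET_MEMORY_MERGE)
    return None if ret is None else bool(ret)


def set_ksm_merge_any(enable):
    """Try to enable or disable process-wide KSM; True on success."""
    return _prctl(PR_SET_MEMORY_MERGE, 1 if enable else 0) == 0


state = ksm_merge_any_enabled()
print("KSM merge-any for this process:", state)
```

A job launcher could call `set_ksm_merge_any(True)` in the seed process before forking workers, relying on the inheritance behaviour the commit message describes.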
    • X
      ksm: count allocated ksm rmap_items for each process · 8c3ecf85
      xu xin committed
      mainline inclusion
      from mainline-v6.1-rc1
      commit cb4df4ca
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=cb4df4cae4f2bd8cf7a32eff81178fce31600f7c
      
      ----------------------------------------------------------------------
      
      Patch series "ksm: count allocated rmap_items and update documentation",
      v5.
      
      KSM can save memory by merging identical pages, but also can consume
      additional memory, because it needs to generate rmap_items to save each
      scanned page's brief rmap information.
      
      To let users determine how beneficial the ksm-policy (like madvise) they
      are using is, we add a new interface /proc/<pid>/ksm_stat for each process.
      The value "ksm_rmap_items" in it indicates the total allocated ksm
      rmap_items of this process.
      
      The detailed description can be seen in the following patches' commit
      message.
      
      This patch (of 2):
      
      KSM can save memory by merging identical pages, but also can consume
      additional memory, because it needs to generate rmap_items to save each
      scanned page's brief rmap information.  Some of these pages may be merged,
      but some may still not be mergeable after being checked several times,
      and the rmap_items kept for them are unprofitable memory consumption.
      
      Whether KSM saves or consumes memory system-wide can be determined by a
      comprehensive calculation over pages_sharing, pages_shared,
      pages_unshared and pages_volatile.  A simple approximate calculation:
      
      	profit =~ pages_sharing * sizeof(page) - (all_rmap_items) *
      	         sizeof(rmap_item);
      
      where all_rmap_items equals the sum of pages_sharing, pages_shared,
      pages_unshared and pages_volatile.
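
The system-wide estimate above can be sketched in Python. This is only an illustration and not part of the patch: the counters are assumed to be readable from /sys/kernel/mm/ksm/, the function names are hypothetical, and the 4096-byte page size and 64-byte rmap_item size are placeholder assumptions that should be checked against the running kernel.

```python
from pathlib import Path

# Assumed location of the system-wide KSM counters.
KSM_SYSFS = Path("/sys/kernel/mm/ksm")

COUNTER_NAMES = ("pages_sharing", "pages_shared", "pages_unshared", "pages_volatile")


def ksm_general_profit(counters, page_size=4096, rmap_item_size=64):
    """Approximate bytes saved by KSM system-wide (negative means a net cost).

    page_size and rmap_item_size are assumptions; sizeof(rmap_item)
    depends on the kernel build.
    """
    # all_rmap_items is the sum of the four counters, as in the commit message.
    all_rmap_items = sum(counters[name] for name in COUNTER_NAMES)
    return counters["pages_sharing"] * page_size - all_rmap_items * rmap_item_size


def read_ksm_counters(base=KSM_SYSFS):
    """Read the four counters used by the formula, one integer per file."""
    return {name: int((base / name).read_text()) for name in COUNTER_NAMES}
```

On a machine with KSM enabled, `ksm_general_profit(read_ksm_counters())` would give the estimate in bytes; passing a plain dict instead keeps the arithmetic easy to check by hand.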
      
      But we cannot calculate this kind of ksm profit for a single process,
      because the number of ksm rmap_items of a process is not available.
      If user applications could obtain this information, it would help
      upper-level users know how beneficial the ksm-policy (like madvise) they
      are using is, and then optimize their app code.  For example, if an
      application madvises 1000 pages as MERGEABLE while only a few pages are
      really merged, then it's not cost-efficient.
      
      So we add a new interface /proc/<pid>/ksm_stat for each process, which
      currently shows only the value of ksm_rmap_items; more values can be
      added in the future.
      
      So similarly, we can calculate the ksm profit approximately for a single
      process by:
      
      	profit =~ ksm_merging_pages * sizeof(page) - ksm_rmap_items *
      		 sizeof(rmap_item);
      
      where ksm_merging_pages is shown at /proc/<pid>/ksm_merging_pages, and
      ksm_rmap_items is shown in /proc/<pid>/ksm_stat.
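
A minimal Python sketch of the per-process estimate follows. It is not part of the patch. It assumes /proc/<pid>/ksm_stat contains whitespace-separated "name value" lines and that /proc/<pid>/ksm_merging_pages holds a single integer; the function names and the default sizes (4096-byte pages, 64-byte rmap_item) are hypothetical and should be adjusted to the actual kernel.

```python
from pathlib import Path


def parse_ksm_stat(text):
    """Parse 'name value' lines (the assumed layout of ksm_stat) into a dict."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            # Tolerate an optional trailing colon after the name.
            stats[fields[0].rstrip(":")] = int(fields[1])
    return stats


def ksm_process_profit(merging_pages, rmap_items, page_size=4096, rmap_item_size=64):
    """Approximate bytes saved for one process (negative means a net cost)."""
    return merging_pages * page_size - rmap_items * rmap_item_size


def read_process_profit(pid, proc=Path("/proc")):
    """Combine both proc files for one pid; raises OSError if they are missing."""
    pid_dir = proc / str(pid)
    merging_pages = int((pid_dir / "ksm_merging_pages").read_text())
    stats = parse_ksm_stat((pid_dir / "ksm_stat").read_text())
    return ksm_process_profit(merging_pages, stats["ksm_rmap_items"])
```

Under these assumed sizes, a process that marks 1000 pages MERGEABLE but merges only 10 of them gets `ksm_process_profit(10, 1000)`, which is negative: the rmap_items cost more than the merged pages save.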
      
      Link: https://lkml.kernel.org/r/20220830143731.299702-1-xu.xin16@zte.com.cn
      Link: https://lkml.kernel.org/r/20220830143838.299758-1-xu.xin16@zte.com.cn
      Signed-off-by: xu xin <xu.xin16@zte.com.cn>
      Reviewed-by: Xiaokai Ran <ran.xiaokai@zte.com.cn>
      Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
      Signed-off-by: CGEL ZTE <cgel.zte@gmail.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Bagas Sanjaya <bagasdotme@gmail.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Izik Eidus <izik.eidus@ravellosystems.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	include/linux/mm_types.h
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      8c3ecf85
    • X
      ksm: count ksm merging pages for each process · 44acbc78
      xu xin committed on
      mainline inclusion
      from mainline-v5.19-rc1
      commit 76093853
      category: feature
      bugzilla: https://gitee.com/openeuler/kernel/issues/I72R0B
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7609385337a4feb6236e42dcd0df2185683ce839
      
      ----------------------------------------------------------------------
      
      Some applications or containers want to use KSM by calling madvise() to
      advise areas of address space to be MERGEABLE.  But they may not know
      which applications are more likely to cause real merges in the
      deployment.  If this patch is applied, it helps them know their
      corresponding number of merged pages, and then optimize their app code.
      
      Currently KSM only counts the number of KSM merging pages (e.g.
      ksm_pages_sharing and ksm_pages_shared) for the whole system, so we
      cannot see merging at a finer granularity.  For upper-level application
      optimization, the merging area cannot easily be chosen according to the
      KSM page merging probability of each process.  Therefore, it is necessary
      to add extra statistics so that upper-level users can know the detailed
      KSM merging information of each process.
      
      We add a new proc file named ksm_merging_pages under /proc/<pid>/ to
      indicate the number of ksm merging pages involved in this process.
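
As a rough illustration of how this file might be consumed (not part of the patch), the Python sketch below collects ksm_merging_pages for every readable process and ranks them. It assumes the file holds a single integer; the function names are hypothetical, and processes that exit or cannot be read are simply skipped.

```python
from pathlib import Path


def ksm_merging_pages_by_pid(proc=Path("/proc")):
    """Return {pid: ksm_merging_pages} for every process that can be read."""
    result = {}
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue  # not a per-process directory
        try:
            result[int(entry.name)] = int((entry / "ksm_merging_pages").read_text())
        except (OSError, ValueError):
            # Process exited, access was denied, or the kernel lacks the file.
            continue
    return result


def top_mergers(pages_by_pid, limit=5):
    """Return (pid, pages) pairs sorted by merged page count, largest first."""
    ranked = sorted(pages_by_pid.items(), key=lambda item: item[1], reverse=True)
    return ranked[:limit]
```

Calling `top_mergers(ksm_merging_pages_by_pid())` would list the processes that benefit most from merging, which is the kind of per-process view the commit message argues was missing.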
      
      [akpm@linux-foundation.org: fix comment typo, remove BUG_ON()s]
      Link: https://lkml.kernel.org/r/20220325082318.2352853-1-xu.xin16@zte.com.cn
      Signed-off-by: xu xin <xu.xin16@zte.com.cn>
      Reported-by: kernel test robot <lkp@intel.com>
      Reviewed-by: Yang Yang <yang.yang29@zte.com.cn>
      Reviewed-by: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Reported-by: Zeal Robot <zealci@zte.com.cn>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Ohhoon Kwon <ohoono.kwon@samsung.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Stephen Brennan <stephen.s.brennan@oracle.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Feng Tang <feng.tang@intel.com>
      Cc: Yang Yang <yang.yang29@zte.com.cn>
      Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn>
      Cc: Zeal Robot <zealci@zte.com.cn>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Conflicts:
      	include/linux/mm_types.h
      Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
      44acbc78
  11. 18 May, 2023 11 commits