Commits · f4f56de3e7e89bcc598af77cfa93b205ded951d0 · openeuler / Kernel

14 Jul, 2021 1 commit

mm: support THPs in zero_user_segments · f4f56de3

Committed by Matthew Wilcox (Oracle) on Jul 12, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 0060ef3b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZE5V
CVE: NA

-------------------------------------------------

We can only kmap() one subpage of a THP at a time, so loop over all
relevant subpages, skipping ones which don't need to be zeroed.  This is
too large to inline when THPs are enabled and we actually need highmem, so
put it in highmem.c.
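
The loop described above can be modelled in user space. The following is only a sketch, not the kernel code: each subpage is a separate `bytearray` standing in for a page that must be mapped on its own, `PAGE_SIZE` is assumed to be 4096, and the return value (number of subpages that had to be "mapped") is an illustrative addition.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages


def zero_user_segments(subpages, start1, end1, start2, end2):
    """Zero [start1, end1) and [start2, end2) across per-subpage buffers.

    Returns how many subpages were touched; untouched ones are skipped,
    mirroring the "skip subpages which don't need to be zeroed" idea.
    """
    total = len(subpages) * PAGE_SIZE
    if end1 > total or end2 > total:
        raise ValueError("segment ends beyond the compound page")
    mapped = 0
    for i, buf in enumerate(subpages):
        base = i * PAGE_SIZE
        touched = False
        for start, end in ((start1, end1), (start2, end2)):
            # Clip the segment to this subpage; empty clips are skipped.
            lo = max(start, base) - base
            hi = min(end, base + PAGE_SIZE) - base
            if lo < hi:
                buf[lo:hi] = bytes(hi - lo)
                touched = True
        if touched:
            mapped += 1
    return mapped


pages = [bytearray(b"\xff" * PAGE_SIZE) for _ in range(4)]
print(zero_user_segments(pages, 100, 5000, 12000, 12288))
```

Only the subpages that intersect one of the two segments are written, which is the property the commit relies on to avoid mapping every subpage.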

[willy@infradead.org: start1 was allowed to be less than start2]

Link: https://lkml.kernel.org/r/20201124041507.28996-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

f4f56de3

10 Jul, 2021 6 commits

mm: vmstat: add cma statistics · 12d72fbe

Committed by Minchan Kim on Jul 08, 2021

mainline inclusion
from mainline-5.13-rc1
commit bbb26920
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZQ5G
CVE: NA

-------------------------------------------------

Since CMA is used more widely, it's worth having CMA allocation
statistics in vmstat.  With them, we can see how aggressively the system
uses cma allocation and how often it fails.
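
As a rough illustration of how such counters could be consumed, the sketch below computes a failure ratio from vmstat-style text. The counter names `cma_alloc_success` and `cma_alloc_fail` are an assumption (they are not spelled out in this message), and the sample text is made up for the example.

```python
def cma_failure_ratio(vmstat_text):
    """Return failed / total CMA allocation attempts, or None if there were none."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        # vmstat lines are "<name> <value>"; keep only the CMA allocation counters.
        if len(parts) == 2 and parts[0].startswith("cma_alloc_"):
            counters[parts[0]] = int(parts[1])
    ok = counters.get("cma_alloc_success", 0)    # assumed counter name
    failed = counters.get("cma_alloc_fail", 0)   # assumed counter name
    total = ok + failed
    return failed / total if total else None


sample = "nr_free_pages 12345\ncma_alloc_success 90\ncma_alloc_fail 10\n"
print(cma_failure_ratio(sample))
```

On a real system the text would come from reading /proc/vmstat; a kernel without these counters simply yields `None` here.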

Link: https://lkml.kernel.org/r/20210302183346.3707237-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit bbb26920)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

12d72fbe

memcg: enable memcg oom-kill for __GFP_NOFAIL · 883a63c7

Committed by Shakeel Butt on Jul 08, 2021

mainline inclusion
from mainline-v5.13-rc1
commit 3d0cbb98
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZXKY
CVE: NA

--------------------------------------

In the era of the async memcg oom-killer, commit a0d8b00a ("mm: memcg:
do not declare OOM from __GFP_NOFAIL allocations") added code to skip the
memcg oom-killer for __GFP_NOFAIL allocations.  The reason was that
__GFP_NOFAIL callers would not enter the async oom synchronization path and
would keep the task marked as in memcg oom.  At that time, tasks marked as
in memcg oom could bypass the memcg limits, and the oom synchronization
would only have happened in a later userspace-triggered page fault, thus
letting the task marked as under memcg oom bypass the memcg limit for an
arbitrary time.

With the synchronous memcg oom-killer (commit 29ef680a ("memcg, oom:
move out_of_memory back to the charge path")) and with tasks marked as
under memcg oom no longer allowed to bypass the memcg limits (commit
1f14c1ac ("mm: memcg: do not allow task about to OOM kill to bypass the
limit")), we can again allow __GFP_NOFAIL allocations to trigger memcg
oom-kill.  This will make memcg oom behavior closer to page allocator oom
behavior.

Link: https://lkml.kernel.org/r/20210223204337.2785120-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

883a63c7

mm/page_alloc: count CMA pages per zone and print them in /proc/zoneinfo · a7472711

Committed by David Hildenbrand on Jul 08, 2021

mainline inclusion
from mainline-5.12-rc1-dontuse
commit 3c381db1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZSR5
CVE: NA

-------------------------------------------------

Let's count the number of CMA pages per zone and print them in
/proc/zoneinfo.

Having access to the total number of CMA pages per zone is helpful for
debugging purposes to know where exactly the CMA pages ended up, and to
figure out how many pages of a zone might behave differently, even after
some of these pages might already have been allocated.

As one example, CMA pages part of a kernel zone cannot be used for
ordinary kernel allocations but instead behave more like ZONE_MOVABLE.

For now, we are only able to get the global nr+free cma pages from
/proc/meminfo and the free cma pages per zone from /proc/zoneinfo.

Example after this patch when booting a 6 GiB QEMU VM with
"hugetlb_cma=2G":
  # cat /proc/zoneinfo | grep cma
          cma      0
        nr_free_cma  0
          cma      0
        nr_free_cma  0
          cma      524288
        nr_free_cma  493016
          cma      0
          cma      0
  # cat /proc/meminfo | grep Cma
  CmaTotal:        2097152 kB
  CmaFree:         1972064 kB

Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way,
      one can be sure when spotting "cma 0", that there are definitely no
      CMA pages located in a zone.
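
The per-zone numbers in the example can be cross-checked against /proc/meminfo. The sketch below sums the `cma` and `nr_free_cma` lines quoted in the example and converts pages to kB; the 4 KiB page size is an assumption about the test VM, and the helper name is hypothetical.

```python
ZONEINFO_SAMPLE = """\
          cma      0
        nr_free_cma  0
          cma      0
        nr_free_cma  0
          cma      524288
        nr_free_cma  493016
          cma      0
          cma      0
"""


def cma_pages(text):
    """Sum per-zone 'cma' and 'nr_free_cma' page counts from zoneinfo-style lines."""
    total = free = 0
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        if fields[0] == "cma":
            total += int(fields[1])
        elif fields[0] == "nr_free_cma":
            free += int(fields[1])
    return total, free


PAGE_KB = 4  # assumption: 4 KiB base pages
total, free = cma_pages(ZONEINFO_SAMPLE)
print(f"CmaTotal: {total * PAGE_KB} kB, CmaFree: {free * PAGE_KB} kB")
```

With 4 KiB pages, 524288 pages is 2097152 kB and 493016 pages is 1972064 kB, which agrees with the CmaTotal and CmaFree values shown in the example.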

[david@redhat.com: v2]
  Link: https://lkml.kernel.org/r/20210128164533.18566-1-david@redhat.com
[david@redhat.com: v3]
  Link: https://lkml.kernel.org/r/20210129113451.22085-1-david@redhat.com

Link: https://lkml.kernel.org/r/20210127101813.6370-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3c381db1)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

a7472711

mm/page_owner: record the timestamp of all pages during free · 1b7273ab

Committed by Georgi Djakov on Jul 06, 2021

mainline inclusion
from mainline-5.13-rc1
commit 866b4852
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZD1N
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=866b485262173a2b873386162b2ddcfbcb542b4a

-------------------------------------------------

Collect the time when each allocation is freed, to help with memory
analysis with kdump/ramdump.  Add the timestamp also in the page_owner
debugfs file and print it in dump_page().

Having another timestamp when we free the page helps for debugging page
migration issues.  For example both alloc and free timestamps being the
same can give hints that there is an issue with migrating memory, as
opposed to a page just being dropped during migration.

Link: https://lkml.kernel.org/r/20210203175905.12267-1-georgi.djakov@linaro.org
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 866b4852)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

1b7273ab

mm/page_owner: record timestamp and pid · a89f505d

Committed by Liam Mark on Jul 06, 2021

mainline inclusion
from mainline-5.11-rc1
commit 9cc7e96a
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZD1N
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9cc7e96aa846f9086431d6c2d33ff9ab42d72b2d

-------------------------------------------------

Collect the time for each allocation recorded in page owner so that
allocation "surges" can be measured.

Record the pid for each allocation recorded in page owner so that the
source of allocation "surges" can be better identified.

The above is very useful when doing memory analysis.  On a crash for
example, we can get this information from kdump (or ramdump) and parse it
to figure out memory allocation problems.

Please note that on x86_64 this increases the size of struct page_owner
from 16 bytes to 32.
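
The jump from 16 to 32 bytes (rather than to 28) comes from alignment padding, which can be reproduced with `ctypes`. The field names and types below are an assumption about the layout of struct page_owner, chosen to match the sizes quoted above; they are not taken from this message.

```python
import ctypes

# Assumed layout before the change: 2 + 2 + 4 + 4 + 4 = 16 bytes.
BEFORE_FIELDS = [
    ("order", ctypes.c_ushort),
    ("last_migrate_reason", ctypes.c_short),
    ("gfp_mask", ctypes.c_uint),
    ("handle", ctypes.c_uint),
    ("free_handle", ctypes.c_uint),
]


class PageOwnerBefore(ctypes.Structure):
    _fields_ = BEFORE_FIELDS


class PageOwnerAfter(ctypes.Structure):
    # Assumed additions: a 64-bit timestamp and a pid (28 bytes of payload).
    _fields_ = BEFORE_FIELDS + [
        ("ts_nsec", ctypes.c_uint64),
        ("pid", ctypes.c_int),
    ]


print(ctypes.sizeof(PageOwnerBefore), ctypes.sizeof(PageOwnerAfter))
```

Where 64-bit integers are 8-byte aligned, as on x86_64, the 28 bytes of fields are rounded up to a multiple of 8, giving 32.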

Vlastimil: it's not a functionality intended for production, so unless
somebody says they need to enable page_owner for debugging and this
increase prevents them from fitting into available memory, let's not
complicate things with making this optional.

[lmark@codeaurora.org: v3]
  Link: https://lkml.kernel.org/r/20201210160357.27779-1-georgi.djakov@linaro.org

Link: https://lkml.kernel.org/r/20201209125153.10533-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9cc7e96a)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

a89f505d

mm/vmalloc: rework the drain logic · 9fe4fcab

Committed by Uladzislau Rezki (Sony) on Jul 07, 2021

mainline inclusion
from mainline-5.11-rc1
commit 96e2db45
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZG5D
CVE: NA

-------------------------------------------------

The current "lazy drain" model suffers from at least two issues.

The first is related to the unsorted list of vmap areas: identifying the
[min:max] range of areas to be drained requires a full list scan, which is
time-consuming if the list is long.

The second, as a next step, is about merging all fragments with the free
space.  This is also time-consuming because it has to iterate over the
entire list which holds outstanding lazy areas.

See below the "preemptirqsoff" tracer output that illustrates a high
latency.  It is ~24676us.  Our workloads like audio and video are affected
by such long latency:

<snip>
  tracer: preemptirqsoff

  preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
  --------------------------------------------------------------------
  latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
     -----------------
     | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
     -----------------
   => started at: __purge_vmap_area_lazy
   => ended at:   __purge_vmap_area_lazy

                   _------=> CPU#
                  / _-----=> irqs-off
                 | / _----=> need-resched
                 || / _---=> hardirq/softirq
                 ||| / _--=> preempt-depth
                 |||| /     delay
   cmd     pid   ||||| time  |   caller
      \   /      |||||  \    |   /
crtc_com-261     1...1    1us*: _raw_spin_lock <-__purge_vmap_area_lazy
[...]
crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
crtc_com-261     1...1 24683us : <stack trace>
 => free_vmap_area_noflush
 => remove_vm_area
 => __vunmap
 => vfree
 => drm_property_free_blob
 => drm_mode_object_unreference
 => drm_property_unreference_blob
 => __drm_atomic_helper_crtc_destroy_state
 => sde_crtc_destroy_state
 => drm_atomic_state_default_clear
 => drm_atomic_state_clear
 => drm_atomic_state_free
 => complete_commit
 => _msm_drm_commit_work_cb
 => kthread_worker_fn
 => kthread
 => ret_from_fork
<snip>

To address those two issues we can redesign the purging of the outstanding
lazy areas.  Instead of queuing vmap areas to the list, we replace it by
a separate rb-tree.  In that case an area is located in the tree/list in
ascending order.  It gives us the advantages below:

a) Outstanding vmap areas are merged creating bigger coalesced blocks,
   thus it becomes less fragmented.

b) It is possible to calculate a flush range [min:max] without scanning
   all elements.  It is O(1) access time or complexity;

c) The final merge of areas with the rb-tree that represents a free
   space is faster because of (a).  As a result the lock contention is
   also reduced.
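
Advantages (a) and (b) can be sketched in a few lines. This is an illustration only: a sorted Python list stands in for the rb-tree, areas are half-open `(start, end)` tuples, and the function names are hypothetical.

```python
import bisect


def insert_and_merge(areas, start, end):
    """Insert [start, end) into a sorted list of disjoint areas, merging neighbours."""
    i = bisect.bisect_left(areas, (start, end))
    # Merge with the predecessor if it ends exactly where the new area starts.
    if i > 0 and areas[i - 1][1] == start:
        i -= 1
        start = areas.pop(i)[0]
    # Merge with the successor if it starts exactly where the new area ends.
    if i < len(areas) and areas[i][0] == end:
        end = areas.pop(i)[1]
    areas.insert(i, (start, end))


def flush_range(areas):
    """Return the [min, max] flush range from the first and last entries only."""
    return (areas[0][0], areas[-1][1]) if areas else None


lazy = []
for area in [(10, 20), (30, 40), (50, 60), (20, 30)]:
    insert_and_merge(lazy, *area)
print(lazy, flush_range(lazy))
```

Because the areas are kept in ascending order, the flush range needs only the first and last entries, and adjacent areas coalesce at insertion time, so fewer blocks remain for the final merge.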

Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: huang ying <huang.ying.caritas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96e2db45)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

9fe4fcab

08 Jul, 2021 5 commits

arm64: clear_page: Add new implementation of clear_page() by STNP · aa316fa1

Committed by Wei Li on Jul 02, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN72
CVE: NA

---------------------------

Currently, clear_page() clears the page through 'dc zva', but the page is
mostly not used immediately, so the cache flush is in vain.

Add an optimized implementation of clear_page() using 'stnp' to improve
performance. It can be switched with the boot cmdline parameter
'mm.use_clearpage_stnp'.

In the hugetlb clear test, we gained about a 53.7% performance improvement:

Set mm.use_clearpage_stnp = 0          |  Set mm.use_clearpage_stnp = 1
[root@localhost liwei]# ./a.out 50 20  |  [root@localhost liwei]# ./a.out 50 20
size is 50 Gib, test times is 20       |  size is 50 Gib, test times is 20
test_time[0] : use 8.438046 sec        |  test_time[0] : use 3.722682 sec
test_time[1] : use 8.028493 sec        |  test_time[1] : use 3.640274 sec
test_time[2] : use 8.646547 sec        |  test_time[2] : use 4.095052 sec
test_time[3] : use 8.122490 sec        |  test_time[3] : use 3.998446 sec
test_time[4] : use 8.053038 sec        |  test_time[4] : use 4.084259 sec
test_time[5] : use 8.843512 sec        |  test_time[5] : use 3.933871 sec
test_time[6] : use 8.308906 sec        |  test_time[6] : use 3.934334 sec
test_time[7] : use 8.093817 sec        |  test_time[7] : use 3.869142 sec
test_time[8] : use 8.303504 sec        |  test_time[8] : use 3.902916 sec
test_time[9] : use 8.178336 sec        |  test_time[9] : use 3.541885 sec
test_time[10] : use 8.003625 sec       |  test_time[10] : use 3.595554 sec
test_time[11] : use 8.163807 sec       |  test_time[11] : use 3.583813 sec
test_time[12] : use 8.267464 sec       |  test_time[12] : use 3.863033 sec
test_time[13] : use 8.055326 sec       |  test_time[13] : use 3.770953 sec
test_time[14] : use 8.246986 sec       |  test_time[14] : use 3.808006 sec
test_time[15] : use 8.546992 sec       |  test_time[15] : use 3.653194 sec
test_time[16] : use 8.727256 sec       |  test_time[16] : use 3.722395 sec
test_time[17] : use 8.288951 sec       |  test_time[17] : use 3.683508 sec
test_time[18] : use 8.019322 sec       |  test_time[18] : use 4.253087 sec
test_time[19] : use 8.250685 sec       |  test_time[19] : use 4.082845 sec
hugetlb test end!                      |  hugetlb test end!
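
The headline figure can be recomputed from the 20 runs in each column. The sketch below takes the mean of each column and reports the relative reduction in run time; reading the "performance improvement" as a reduction in mean time is an interpretation, not something the message states.

```python
from statistics import mean

# Run times in seconds, copied from the two columns above.
dc_zva = [
    8.438046, 8.028493, 8.646547, 8.122490, 8.053038,
    8.843512, 8.308906, 8.093817, 8.303504, 8.178336,
    8.003625, 8.163807, 8.267464, 8.055326, 8.246986,
    8.546992, 8.727256, 8.288951, 8.019322, 8.250685,
]
stnp = [
    3.722682, 3.640274, 4.095052, 3.998446, 4.084259,
    3.933871, 3.934334, 3.869142, 3.902916, 3.541885,
    3.595554, 3.583813, 3.863033, 3.770953, 3.808006,
    3.653194, 3.722395, 3.683508, 4.253087, 4.082845,
]

# Fraction of mean run time saved by the stnp variant.
reduction = 1 - mean(stnp) / mean(dc_zva)
print(f"mean dc zva: {mean(dc_zva):.3f} s, mean stnp: {mean(stnp):.3f} s, "
      f"time reduction: {reduction:.1%}")
```

The mean drops from roughly 8.28 s to roughly 3.84 s, which is consistent with the quoted figure of about 53.7%.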

Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

aa316fa1

memcg: update the child's qos_level synchronously in memcg_qos_write() · 585de8f5

Committed by Jing Xiangfeng on Jul 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

If the parent's qos_level is set, iterate over all cgroups under this tree
to modify memory.qos_level synchronously. Currently qos_level supports 0
and -1.
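
A minimal user-space model of this propagation, assuming a simple tree of cgroup objects: the `Memcg` class and this Python `memcg_qos_write()` are illustrative stand-ins, not the kernel structures.

```python
class Memcg:
    """Toy stand-in for a memory cgroup with a qos_level and child cgroups."""

    def __init__(self, name, qos_level=0):
        self.name = name
        self.qos_level = qos_level
        self.children = []

    def add_child(self, child):
        self.children.append(child)
        return child


def memcg_qos_write(memcg, level):
    """Set qos_level on memcg and on every cgroup below it in the tree."""
    if level not in (0, -1):
        raise ValueError("qos_level must be 0 or -1")
    stack = [memcg]
    while stack:
        node = stack.pop()
        node.qos_level = level
        stack.extend(node.children)


root = Memcg("root")
parent = root.add_child(Memcg("parent"))
child = parent.add_child(Memcg("child"))
memcg_qos_write(parent, -1)
print(root.qos_level, parent.qos_level, child.qos_level)
```

Writing to `parent` updates `parent` and `child` but leaves `root` untouched, since only the subtree under the written cgroup is walked.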

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

585de8f5

memcg: Add static key for memcg priority · ce7fa1af

Committed by Jing Xiangfeng on Jul 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

This patch adds a default-false static key so that the memcg priority
feature is disabled by default. To enable it, write 1 to the sysctl:

echo 1 > /proc/sys/vm/memcg_qos_enable

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

ce7fa1af

memcg: enable CONFIG_MEMCG_QOS by default · f8905509

Committed by Jing Xiangfeng on Jul 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

Enable CONFIG_MEMCG_QOS to support memcg OOM priority.

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

f8905509

memcg: support priority for oom · 4da32073

Committed by Jing Xiangfeng on Jul 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

When OOM occurs, we first kill a process from the low-priority memcg.
If no such process is found, fall back to normal handling.
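
The two-step selection can be sketched as follows. Several details are assumptions made for the example only: a qos_level of -1 is treated as "low priority", the victim inside a group is the task with the largest memory footprint, and the data layout is hypothetical.

```python
def select_oom_victim(memcgs):
    """Pick a pid to kill.

    memcgs is a list of (qos_level, tasks); tasks is a list of (pid, rss_pages).
    """
    def biggest(tasks):
        # Assumed "normal handling": choose the task using the most memory.
        return max(tasks, key=lambda t: t[1])[0] if tasks else None

    # Step 1: only consider tasks in low-priority memcgs (assumed qos_level == -1).
    low_prio = [t for level, tasks in memcgs if level == -1 for t in tasks]
    victim = biggest(low_prio)
    if victim is not None:
        return victim
    # Step 2: nothing found there, so fall back to considering every task.
    return biggest([t for _, tasks in memcgs for t in tasks])


print(select_oom_victim([(0, [(1, 900)]), (-1, [(2, 100), (3, 300)])]))
```

Here pid 3 is chosen even though pid 1 uses more memory, because pid 1 lives in a normal-priority memcg.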

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

4da32073

07 Jul, 2021 2 commits

arm64: errata: enable HISILICON_ERRATUM_HIP08_RU_PREFETCH · 13ab4b7f

Committed by Xie XiuQi on Jul 07, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFV2
CVE: NA

------------------------------------------------------------

Enable CONFIG_HISILICON_ERRATUM_HIP08_RU_PREFETCH to add a
cmdline option to disable prefetch.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

13ab4b7f

arm64: errata: add option to disable cache readunique prefetch on HIP08 · 3b876a78

Committed by Kai Shen on Jul 07, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFV2
CVE: NA

-----------------------------------------------------------

Random performance decreases appear in Hackbench cases which test
pipe or socket communication among multiple threads on the Hisi HIP08 SoC.
Cache sharing caused by the change of the data layout and the
cache readunique prefetch mechanism both lead to this problem.

The readunique mechanism, which may be triggered by store operations,
invalidates cachelines on other cores during the data fetching stage,
which can cause frequent cacheline invalidation in a shared-data access
situation.

Disabling cache readunique prefetch can tackle this problem.
Test cases are like:
    for i in 20;do
        echo "--------pipe thread num=$i----------"
        for j in $(seq 1 10);do
            ./hackbench -pipe $i thread 1000
        done
    done

We disable readunique prefetch only in EL2, because disabling it in EL1
may cause a panic due to lack of the related privilege, which is often
set in the BIOS.

Introduce CONFIG_HISILICON_ERRATUM_HIP08_RU_PREFETCH and disable RU
prefetch using boot cmdline 'readunique_prefetch=off'.

Signed-off-by: Kai Shen <shenkai8@huawei.com>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
[XQ: adjusted context]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

3b876a78

06 Jul, 2021 26 commits

config: disable config ARM64_BOOTPARAM_HOTPLUG_CPU0 by default · 912d97dc

Committed by Zheng Zengkai on Jul 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3Z80Y
CVE: NA

-------------------------------------------------

Disable config ARM64_BOOTPARAM_HOTPLUG_CPU0 in openeuler_defconfig
by default.

Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>

912d97dc

arm64: Add config switch and kernel parameter for CPU0 hotplug · e02eaf91

Committed by Zheng Zengkai on Jul 06, 2021

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3Z80Y
CVE: NA

-------------------------------------------------

The new config switch CONFIG_ARM64_BOOTPARAM_HOTPLUG_CPU0 sets whether
the default state of arm64_cpu0_hotpluggable is on or off.

If the config switch is off, arm64_cpu0_hotpluggable is off
by default, but it can still be turned on with the kernel parameter
arm64_cpu0_hotplug at boot.

If the config switch is on, arm64_cpu0_hotpluggable is always on.

Whether CPU0 is hotpluggable depends on cpu_can_disable(0) and
arm64_cpu0_hotpluggable.

The default value of the config switch is off.
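
The rules above reduce to a small truth table. The sketch below models them with plain booleans; the Python function and parameter names are illustrative and only mirror the identifiers used in this message.

```python
def arm64_cpu0_hotpluggable(config_on, bootparam_given):
    """State of the flag: on if the config switch is on or the boot parameter is given."""
    return config_on or bootparam_given


def cpu0_is_hotpluggable(config_on, bootparam_given, cpu_can_disable_0):
    """CPU0 can be hot-unplugged only if both the platform and the flag allow it."""
    return cpu_can_disable_0 and arm64_cpu0_hotpluggable(config_on, bootparam_given)


# Default build (config off), no boot parameter: CPU0 stays fixed.
print(cpu0_is_hotpluggable(False, False, True))
# Default build, but arm64_cpu0_hotplug passed at boot: CPU0 may be unplugged.
print(cpu0_is_hotpluggable(False, True, True))
```

Note that the boot parameter can only turn the flag on; with the config switch enabled there is no way, in this model, to turn it back off.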

Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

e02eaf91

iommu/vt-d: Check for allocation failure in aux_detach_device() · a7efb36d

Committed by Dan Carpenter on Jul 02, 2021

mainline inclusion
from mainline-v5.13-rc4
commit 1a590a1c
category: bugfix
bugzilla: 108082
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1a590a1c8bf46bf80ea12b657ca44c345531ac80

-------------------------------------------------------------------------

In current kernels small allocations never fail, but checking for
allocation failure is the correct thing to do.

Fixes: 18abda7a ("iommu/vt-d: Fix general protection fault in aux_detach_device()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/YJuobKuSn81dOPLd@mwanda
Link: https://lore.kernel.org/r/20210519015027.108468-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

a7efb36d

iommu/vt-d: Fix ineffective devTLB invalidation for subdevices · 1867a962

Committed by Liu Yi L on Jul 02, 2021

mainline inclusion
from mainline-v5.11-rc3
commit 7c29ada5
category: bugfix
bugzilla: 108083
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c29ada5e70083805bc3a68daa23441df421fbee

-------------------------------------------------------------------------

iommu_flush_dev_iotlb() is called to invalidate caches on a device but
only loops over the devices which are fully-attached to the domain. For
sub-devices, this is ineffective and can result in invalid caching
entries left on the device.

Fix the missing invalidation by adding a loop over the subdevices and
ensuring that 'domain->has_iotlb_device' is updated when attaching to
subdevices.

Fixes: 67b8e02b ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1609949037-25291-4-git-send-email-yi.l.liu@intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

1867a962

iommu/vt-d: Fix general protection fault in aux_detach_device() · 172b3700

Committed by Liu Yi L on Jul 02, 2021

mainline inclusion
from mainline-v5.11-rc3
commit 18abda7a
category: bugfix
bugzilla: 108082
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18abda7a2d555783d28ea1701f3ec95e96237a86

-------------------------------------------------------------------------

The aux-domain attach/detach are not tracked, so some data structures might
be used after free. This causes general protection faults when multiple
subdevices are created and assigned to the same guest machine:

  | general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] SMP NOPTI
  | RIP: 0010:intel_iommu_aux_detach_device+0x12a/0x1f0
  | [...]
  | Call Trace:
  |  iommu_aux_detach_device+0x24/0x70
  |  vfio_mdev_detach_domain+0x3b/0x60
  |  ? vfio_mdev_set_domain+0x50/0x50
  |  iommu_group_for_each_dev+0x4f/0x80
  |  vfio_iommu_detach_group.isra.0+0x22/0x30
  |  vfio_iommu_type1_detach_group.cold+0x71/0x211
  |  ? find_exported_symbol_in_section+0x4a/0xd0
  |  ? each_symbol_section+0x28/0x50
  |  __vfio_group_unset_container+0x4d/0x150
  |  vfio_group_try_dissolve_container+0x25/0x30
  |  vfio_group_put_external_user+0x13/0x20
  |  kvm_vfio_group_put_external_user+0x27/0x40 [kvm]
  |  kvm_vfio_destroy+0x45/0xb0 [kvm]
  |  kvm_put_kvm+0x1bb/0x2e0 [kvm]
  |  kvm_vm_release+0x22/0x30 [kvm]
  |  __fput+0xcc/0x260
  |  ____fput+0xe/0x10
  |  task_work_run+0x8f/0xb0
  |  do_exit+0x358/0xaf0
  |  ? wake_up_state+0x10/0x20
  |  ? signal_wake_up_state+0x1a/0x30
  |  do_group_exit+0x47/0xb0
  |  __x64_sys_exit_group+0x18/0x20
  |  do_syscall_64+0x57/0x1d0
  |  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix the crash by tracking the subdevices when attaching and detaching
aux-domains.

Fixes: 67b8e02b ("iommu/vt-d: Aux-domain specific domain attach/detach")
Co-developed-by: Xin Zeng <xin.zeng@intel.com>
Signed-off-by: Xin Zeng <xin.zeng@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1609949037-25291-3-git-send-email-yi.l.liu@intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

172b3700

NFSv4: Refactor to use user namespaces for nfs4idmap · 4d636eff

Committed by Sargun Dhillon on Jun 29, 2021

mainline inclusion
from mainline-5.11-rc1
commit d3ff46fe
category: bugfix
bugzilla: 108595
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3ff46fe693683cb9660e9b93e8c932cc8e0c1f8

---------------------------

In several patches work has been done to enable NFSv4 to use user
namespaces:
58002399: NFSv4: Convert the NFS client idmapper to use the container user namespace
3b7eb5e3: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing
side of the filesystem (superblock s_user_ns) could be set to a non init
user namespace. This furthers the fs_context related refactoring, and
piggybacks on top of that logic, so the superblock user namespace, and the
NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities
with user namespaces and request-key that have yet to be addressed.

Eventually, we will need to at least:
  * Come up with an upcall mechanism that can be triggered inside of the container,
    or safely triggered outside, with the requisite context to do the right
    mapping.
  * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

4d636eff

NFS: NFSv2/NFSv3: Use cred from fs_context during mount · e671af6b

Committed by Sargun Dhillon on Jun 29, 2021

mainline inclusion
from mainline-5.11-rc1
commit d18a9d3f
category: bugfix
bugzilla: 108596
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d18a9d3fa0f27a47706fb67f1ee0f4d971587c4e

---------------------------

There was refactoring done to use the fs_context for mounting done in:
62a55d08: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns
that fsopen is called in). This change also makes it so that the credential
fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:
1a58e8a0: NFS: Store the credential of the mount process in the nfs_server
264d948c: NFS: Convert NFSv3 to use the container user namespace
c207db2f: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the
fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did
the actual credential check, and that's where current_creds() were fetched.
This meant that the user namespace which fsopen was called in could be a
non-init user namespace. This still requires that the user that calls
FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged
usernamespace, by forking off and calling fsopen in the unprivileged user
namespace. It can then pass back that fsfd to the privileged process which
can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE
before switching back into the mount namespace of the container, and finish
up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

e671af6b

seccomp: Remove bogus __user annotations · 411f925e

Committed by Jann Horn on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit fab686eb
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fab686eb0307121e7a2890b6d6c57edd2457863d

-------------------------------------------------

Buffers that are passed to read_actions_logged() and write_actions_logged()
are in kernel memory; the sysctl core takes care of copying from/to
userspace.

Fixes: 32927393 ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

411f925e

seccomp/cache: Report cache data through /proc/pid/seccomp_cache · 39be1ac0

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 0d8315dd
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0d8315dddd2899f519fe1ca3d4d5cdaf44ea421e

-------------------------------------------------

Currently the kernel does not provide an infrastructure to translate
architecture numbers to a human-readable name. Translating syscall
numbers to syscall names is possible through FTRACE_SYSCALL
infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache.
The file will be empty when no seccomp filters are loaded, or be
in the format of:
<arch name> <decimal syscall number> <ALLOW | FILTER>
where ALLOW means the cache is guaranteed to allow the syscall,
and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:
x86_64 0 ALLOW
x86_64 1 ALLOW
x86_64 2 ALLOW
x86_64 3 ALLOW
[...]
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
x86_64 138 ALLOW
x86_64 139 FILTER
x86_64 140 ALLOW
x86_64 141 ALLOW
[...]
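
The three-column format described above is easy to consume from a script. The following is a minimal sketch, not part of the patch: the function name parse_seccomp_cache and the SAMPLE string are hypothetical, and the sample rows are copied from the listing above rather than read from a live system.

```python
from typing import Dict, Iterable, Tuple


def parse_seccomp_cache(lines: Iterable[str]) -> Dict[Tuple[str, int], str]:
    """Map (arch name, syscall number) to "ALLOW" or "FILTER"."""
    table: Dict[Tuple[str, int], str] = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # an empty file means no seccomp filter is loaded
        arch, nr, verdict = line.split()
        if verdict not in ("ALLOW", "FILTER"):
            raise ValueError(f"unexpected verdict: {verdict!r}")
        table[(arch, int(nr))] = verdict
    return table


# Rows copied from the docker default profile listing above.
SAMPLE = """\
x86_64 132 ALLOW
x86_64 133 ALLOW
x86_64 134 FILTER
x86_64 135 FILTER
x86_64 136 FILTER
x86_64 137 ALLOW
"""

cache = parse_seccomp_cache(SAMPLE.splitlines())
# Syscalls that still have to go through the BPF filter.
filtered = sorted(nr for (_arch, nr), v in cache.items() if v == "FILTER")
print(filtered)
```

On a real system the input would come from /proc/<pid>/seccomp_cache, which per the description is only available with CONFIG_SECCOMP_CACHE_DEBUG enabled and to a reader holding CAP_SYS_ADMIN.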

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default
of N because I think certain users of seccomp might not want the
application to know which syscalls are definitely usable. For
the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

39be1ac0

xtensa: Enable seccomp architecture tracking · c831dd81

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 445247b0
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=445247b02342a05b7d528bba6d85d2d418875b69

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for xtensa.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/79669648ba167d668ea6ffb4884250abcd5ed254.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

c831dd81

sh: Enable seccomp architecture tracking · 5fbbf8a7

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 4c18bc05
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4c18bc054bffe415bec9e0edaa9ff1a84c1a6973

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for sh.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/61ae084cd4783b9b50860d9dedb4a348cf1b7b6f.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

5fbbf8a7

s390: Enable seccomp architecture tracking · 12ba673e

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit c09058ed
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c09058eda2654c37fd7ac28c2004c3aae8b988e9

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for s390.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/a381b10aa2c5b1e583642f3cd46ced842d9d4ce5.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

12ba673e

riscv: Enable seccomp architecture tracking · 1734beb7

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 673a11a7
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=673a11a7e4152b101bad6851c4e4c34c7c6d6dde

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for riscv.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/58ef925d00505cbb77478fa6bd2b48ab2d902460.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

1734beb7

powerpc: Enable seccomp architecture tracking · 6b84c18b

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit e7bcb462
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e7bcb4622ddf4473da6c03fa8423919a568c57dc

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for powerpc.

__LITTLE_ENDIAN__ is used here instead of CONFIG_CPU_LITTLE_ENDIAN
to keep it consistent with asm/syscall.h.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/0b64925362671cdaa26d01bfe50b3ba5e164adfd.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

6b84c18b

parisc: Enable seccomp architecture tracking · cc7ed7aa

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 6aa7923c
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6aa7923c8737d1f8fd2a06154155d68dec646464

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for parisc.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/9bb86c546eda753adf5270425e7353202dbce87c.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

cc7ed7aa

csky: Enable seccomp architecture tracking · 5ae35aa2

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 6e9ae6f9
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6e9ae6f98809e0d123ff4d769ba2e6f652119138

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for csky.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/f9219026d4803b22f3e57e3768b4e42e004ef236.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

5ae35aa2

arm: Enable seccomp architecture tracking · 36a0eb73

Committed by Kees Cook on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 424c9102
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=424c9102fa7b2a5c15afe47fd14278c849f4eefb

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for arm.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

36a0eb73

arm64: Enable seccomp architecture tracking · 956f9ca3

Committed by Kees Cook on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit ffde7034
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ffde703470b03b1000017ed35c4f90a90caa22cf

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static
mapping to the audit architecture and system call table size. Add these
for arm64.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

956f9ca3

selftests/seccomp: Compare bitmap vs filter overhead · be24e5cb

Committed by Kees Cook on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 192cf322
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=192cf32243ce39af65bd095625aec374b38c03df

-------------------------------------------------

As part of the seccomp benchmarking, include the expectations with
regard to the timing behavior of the constant action bitmaps, and report
inconsistencies better.

Example output with constant action bitmaps on x86:

$ sudo ./seccomp_benchmark 100000000
Current BPF sysctl settings:
net.core.bpf_jit_enable = 1
net.core.bpf_jit_harden = 0
Benchmarking 200000000 syscalls...
129.359381409 - 0.008724424 = 129350656985 (129.4s)
getpid native: 646 ns
264.385890006 - 129.360453229 = 135025436777 (135.0s)
getpid RET_ALLOW 1 filter (bitmap): 675 ns
399.400511893 - 264.387045901 = 135013465992 (135.0s)
getpid RET_ALLOW 2 filters (bitmap): 675 ns
545.872866260 - 399.401718327 = 146471147933 (146.5s)
getpid RET_ALLOW 3 filters (full): 732 ns
696.337101319 - 545.874097681 = 150463003638 (150.5s)
getpid RET_ALLOW 4 filters (full): 752 ns
Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
Estimated total seccomp overhead for 3 full filters: 86 ns
Estimated total seccomp overhead for 4 full filters: 106 ns
Estimated seccomp entry overhead: 29 ns
Estimated seccomp per-filter overhead (last 2 diff): 20 ns
Estimated seccomp per-filter overhead (filters / 4): 19 ns
Expectations:
	native ≤ 1 bitmap (646 ≤ 675): ✔️
	native ≤ 1 filter (646 ≤ 732): ✔️
	per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
	1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
	entry ≈ 1 bitmapped (29 ≈ 29): ✔️
	entry ≈ 2 bitmapped (29 ≈ 29): ✔️
	native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️
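
The derived figures in this output can be reproduced from the five per-call timings alone. The sketch below is a reading of the printed numbers, not the benchmark's source: the formulas are inferred from the labels in the output, and the roughly_equal helper with its 10% tolerance is a hypothetical stand-in for whatever comparison the tool really uses.

```python
# Per-call timings in ns, copied from the example output above.
native, bitmap1, bitmap2, full3, full4 = 646, 675, 675, 732, 752

# Total seccomp overhead of each configuration relative to native.
overhead = {
    "1 bitmapped": bitmap1 - native,
    "2 bitmapped": bitmap2 - native,
    "3 full": full3 - native,
    "4 full": full4 - native,
}

# Entry overhead: cost of entering seccomp when the bitmap answers directly.
entry = bitmap1 - native
# Per-filter cost, estimated two ways as in the output.
per_filter_last2 = full4 - full3
per_filter_avg = (full4 - native - entry) // 4
# Prediction checked on the last "Expectations" line.
predicted_full4 = native + entry + per_filter_last2 * 4


def roughly_equal(a: int, b: int, tolerance: float = 0.10) -> bool:
    """Hypothetical closeness test: values differ by at most `tolerance`."""
    return abs(a - b) <= tolerance * max(a, b)


print(overhead)
print(entry, per_filter_last2, per_filter_avg, predicted_full4)
print(roughly_equal(predicted_full4, full4))
```

The point of the comparison is that one or two bitmapped filters cost only the entry overhead, while each full filter adds a further per-filter cost on top of it.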

[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1b61df3db85c5f7f1b9202722c45e7b39df73ef2.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

be24e5cb

x86: Enable seccomp architecture tracking · a7118724

Committed by Kees Cook on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 25db9120
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=25db91209a910a0ccf8b093743088d0f4bf5659f

-------------------------------------------------

Provide seccomp internals with the details to calculate which syscall
table the running kernel is expecting to deal with. This allows for
efficient architecture pinning and paves the way for constant-action
bitmaps.

Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/da58c3733d95c4f2115dd94225dfbe2573ba4d87.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

a7118724

seccomp/cache: Add "emulator" to check if filter is constant allow · f6e12051

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit 8e01b51a
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61

-------------------------------------------------

SECCOMP_CACHE will only operate on syscalls that do not access
any syscall arguments or instruction pointer. To facilitate
this we need a static analyser to know whether a filter will
return allow regardless of syscall arguments for a given
architecture number / syscall number pair. This is implemented
here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is
emulated for every syscall (under each possible architecture), and
checked for any accesses of struct seccomp_data other than the "arch"
and "nr" (syscall) members. If only "arch" and "nr" are examined, and
the program returns allow, then we can be sure that the filter must
return allow independently of syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

BPF_LD  | BPF_W    | BPF_ABS
BPF_JMP | BPF_JEQ  | BPF_K
BPF_JMP | BPF_JGE  | BPF_K
BPF_JMP | BPF_JGT  | BPF_K
BPF_JMP | BPF_JSET | BPF_K
BPF_JMP | BPF_JA
BPF_RET | BPF_K
BPF_ALU | BPF_AND  | BPF_K

Each of these instructions is emulated. Any weirdness or loading
from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case,
if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter is stacked on top of
earlier filters, and one of those does not guarantee to allow the
syscall, then we skip the emulation of this syscall.
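
The emulation described above can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: is_constant_allow and the example program PROG are hypothetical, instructions are modelled as (code, jt, jf, k) tuples, and only the opcode and offset constants are taken from the Linux UAPI headers.

```python
# Opcode constants as defined in the Linux UAPI headers (bpf_common.h).
BPF_LD, BPF_ALU, BPF_JMP, BPF_RET = 0x00, 0x04, 0x05, 0x06
BPF_W, BPF_ABS, BPF_K = 0x00, 0x20, 0x00
BPF_JA, BPF_JEQ, BPF_JGT, BPF_JGE, BPF_JSET = 0x00, 0x10, 0x20, 0x30, 0x40
BPF_AND = 0x50

SECCOMP_RET_ALLOW = 0x7FFF0000
SECCOMP_RET_KILL = 0x00000000
AUDIT_ARCH_X86_64 = 0xC000003E
# Byte offsets into struct seccomp_data.
OFF_NR, OFF_ARCH, OFF_ARG0 = 0, 4, 16

CONDITIONS = {
    BPF_JMP | BPF_JEQ | BPF_K: lambda a, k: a == k,
    BPF_JMP | BPF_JGE | BPF_K: lambda a, k: a >= k,
    BPF_JMP | BPF_JGT | BPF_K: lambda a, k: a > k,
    BPF_JMP | BPF_JSET | BPF_K: lambda a, k: (a & k) != 0,
}


def is_constant_allow(prog, arch, nr):
    """True if prog allows (arch, nr) while reading only "arch" and "nr"."""
    acc, pc = 0, 0
    while pc < len(prog):
        code, jt, jf, k = prog[pc]
        pc += 1
        if code == BPF_LD | BPF_W | BPF_ABS:
            if k == OFF_NR:
                acc = nr
            elif k == OFF_ARCH:
                acc = arch
            else:
                return False  # reads arguments or instruction pointer: bail
        elif code == BPF_RET | BPF_K:
            return k == SECCOMP_RET_ALLOW
        elif code == BPF_JMP | BPF_JA:
            pc += k
        elif code in CONDITIONS:
            pc += jt if CONDITIONS[code](acc, k) else jf
        elif code == BPF_ALU | BPF_AND | BPF_K:
            acc &= k
        else:
            return False  # anything unexpected: bail
    return False  # ran off the end of the program


LD, RET = BPF_LD | BPF_W | BPF_ABS, BPF_RET | BPF_K
JEQ = BPF_JMP | BPF_JEQ | BPF_K
# Hypothetical filter: allow getpid (39) always, write (1) only to fd 1.
PROG = [
    (LD, 0, 0, OFF_ARCH),
    (JEQ, 1, 0, AUDIT_ARCH_X86_64),
    (RET, 0, 0, SECCOMP_RET_KILL),
    (LD, 0, 0, OFF_NR),
    (JEQ, 0, 1, 39),
    (RET, 0, 0, SECCOMP_RET_ALLOW),
    (JEQ, 0, 3, 1),
    (LD, 0, 0, OFF_ARG0),
    (JEQ, 0, 1, 1),
    (RET, 0, 0, SECCOMP_RET_ALLOW),
    (RET, 0, 0, SECCOMP_RET_KILL),
]

bitmap = {nr for nr in range(64) if is_constant_allow(PROG, AUDIT_ARCH_X86_64, nr)}
print(sorted(bitmap))
```

In this model write(2) never enters the bitmap, because deciding it requires loading a syscall argument, while getpid does because its verdict depends on "arch" and "nr" only.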

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

f6e12051

seccomp/cache: Lookup syscall allowlist bitmap for fast path · 67b04706

Committed by YiFei Zhu on Jun 30, 2021

stable inclusion
from stable-5.11-rc1
commit f9d480b6
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f9d480b6ffbeb336bf7f6ce44825c00f61b3abae

-------------------------------------------------

The overhead of running Seccomp filters has been part of some past
discussions [1][2][3]. Oftentimes, the filters have a large number
of instructions that check syscall numbers one by one and jump based
on that. Some users chain BPF filters, which further increases the
overhead. Recent work [6] comprehensively measures the Seccomp
overhead and shows that it is non-negligible and has a
non-trivial impact on application performance.

We observed some common filters, such as docker's [4] or
systemd's [5], will make most decisions based only on the syscall
numbers, and as past discussions considered, a bitmap where each bit
represents a syscall makes most sense for these filters.

The fast (common) path for seccomp should be that the filter permits
the syscall to pass through, and failing seccomp is expected to be
an exceptional case; it is not expected for userspace to call a
denylisted syscall over and over.

When it can be concluded that an allow must occur for the given
architecture and syscall pair (this determination is introduced in
the next commit), seccomp will immediately allow the syscall,
bypassing further BPF execution.

Each architecture number has its own bitmap. The architecture
number in seccomp_data is checked against the defined architecture
number constant before proceeding to test the bit against the
bitmap with the syscall number as the index of the bit in the
bitmap, and if the bit is set, seccomp returns allow. The bitmaps
are all clear in this patch and will be initialized in the next
commit.

When only one architecture exists, the check against architecture
number is skipped, suggested by Kees Cook [7].
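
The lookup described above amounts to a bounds-checked bit test per architecture. The following is a minimal userspace sketch, assuming a hypothetical ActionCache class and run_seccomp helper; it models the control flow only and makes no claim about the kernel's data layout.

```python
from typing import Callable, Dict


class ActionCache:
    """One allow bitmap per architecture number; all bits start clear."""

    def __init__(self, syscalls_per_arch: Dict[int, int]) -> None:
        self.nbits = dict(syscalls_per_arch)
        self.bitmaps = {
            arch: bytearray((n + 7) // 8) for arch, n in syscalls_per_arch.items()
        }

    def mark_allow(self, arch: int, nr: int) -> None:
        self.bitmaps[arch][nr >> 3] |= 1 << (nr & 7)

    def check(self, arch: int, nr: int) -> bool:
        bitmap = self.bitmaps.get(arch)
        if bitmap is None:
            return False  # unknown architecture: never take the fast path
        if nr < 0 or nr >= self.nbits[arch]:
            return False  # out-of-range syscall number
        return bool(bitmap[nr >> 3] & (1 << (nr & 7)))


def run_seccomp(cache: ActionCache, arch: int, nr: int,
                run_filters: Callable[[int, int], str]) -> str:
    """Return "ALLOW" from the bitmap if possible, else run the filters."""
    if cache.check(arch, nr):
        return "ALLOW"
    return run_filters(arch, nr)


AUDIT_ARCH_X86_64 = 0xC000003E
slow_calls = []


def slow_path(arch: int, nr: int) -> str:
    slow_calls.append(nr)  # stands in for executing the BPF programs
    return "ALLOW"


cache = ActionCache({AUDIT_ARCH_X86_64: 512})  # 512 is an arbitrary table size
first = run_seccomp(cache, AUDIT_ARCH_X86_64, 39, slow_path)   # bit clear
cache.mark_allow(AUDIT_ARCH_X86_64, 39)
second = run_seccomp(cache, AUDIT_ARCH_X86_64, 39, slow_path)  # bit set
print(first, second, slow_calls)
```

With every bit clear, as in this patch, each call falls through to the filters; the fast path only starts paying off once the bits are populated.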

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security
    https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

67b04706

usb: dwc3: core: fix kernel panic when do reboot · 7345598e

Committed by Peter Chen on Jun 30, 2021

stable inclusion
from stable-5.10.46
commit 174c27583b3807ac96228c442735b02622d8d1c3
bugzilla: 168323
CVE: NA

--------------------------------

commit 4bf584a0 upstream.

On system reboot, dwc3_shutdown is called and the whole debugfs tree
for dwc3 is removed first. When the gadget then tries to deinitialize
and remove the debugfs entries for its endpoints, it hits a NULL
pointer dereference in debugfs_lookup. Fix it by removing the whole
dwc3 debugfs tree after dwc3_drd_exit.

[ 2924.958838] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002
....
[ 2925.030994] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 2925.037005] pc : inode_permission+0x2c/0x198
[ 2925.041281] lr : lookup_one_len_common+0xb0/0xf8
[ 2925.045903] sp : ffff80001276ba70
[ 2925.049218] x29: ffff80001276ba70 x28: ffff0000c01f0000 x27: 0000000000000000
[ 2925.056364] x26: ffff800011791e70 x25: 0000000000000008 x24: dead000000000100
[ 2925.063510] x23: dead000000000122 x22: 0000000000000000 x21: 0000000000000001
[ 2925.070652] x20: ffff8000122c6188 x19: 0000000000000000 x18: 0000000000000000
[ 2925.077797] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004
[ 2925.084943] x14: ffffffffffffffff x13: 0000000000000000 x12: 0000000000000030
[ 2925.092087] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f x9 : ffff8000102b2420
[ 2925.099232] x8 : 7f7f7f7f7f7f7f7f x7 : feff73746e2f6f64 x6 : 0000000000008080
[ 2925.106378] x5 : 61c8864680b583eb x4 : 209e6ec2d263dbb7 x3 : 000074756f307065
[ 2925.113523] x2 : 0000000000000001 x1 : 0000000000000000 x0 : ffff8000122c6188
[ 2925.120671] Call trace:
[ 2925.123119]  inode_permission+0x2c/0x198
[ 2925.127042]  lookup_one_len_common+0xb0/0xf8
[ 2925.131315]  lookup_one_len_unlocked+0x34/0xb0
[ 2925.135764]  lookup_positive_unlocked+0x14/0x50
[ 2925.140296]  debugfs_lookup+0x68/0xa0
[ 2925.143964]  dwc3_gadget_free_endpoints+0x84/0xb0
[ 2925.148675]  dwc3_gadget_exit+0x28/0x78
[ 2925.152518]  dwc3_drd_exit+0x100/0x1f8
[ 2925.156267]  dwc3_remove+0x11c/0x120
[ 2925.159851]  dwc3_shutdown+0x14/0x20
[ 2925.163432]  platform_shutdown+0x28/0x38
[ 2925.167360]  device_shutdown+0x15c/0x378
[ 2925.171291]  kernel_restart_prepare+0x3c/0x48
[ 2925.175650]  kernel_restart+0x1c/0x68
[ 2925.179316]  __do_sys_reboot+0x218/0x240
[ 2925.183247]  __arm64_sys_reboot+0x28/0x30
[ 2925.187262]  invoke_syscall+0x48/0x100
[ 2925.191017]  el0_svc_common.constprop.0+0x48/0xc8
[ 2925.195726]  do_el0_svc+0x28/0x88
[ 2925.199045]  el0_svc+0x20/0x30
[ 2925.202104]  el0_sync_handler+0xa8/0xb0
[ 2925.205942]  el0_sync+0x148/0x180
[ 2925.209270] Code: a9025bf5 2a0203f5 121f0056 370802b5 (79400660)
[ 2925.215372] ---[ end trace 124254d8e485a58b ]---
[ 2925.220012] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 2925.227676] Kernel Offset: disabled
[ 2925.231164] CPU features: 0x00001001,20000846
[ 2925.235521] Memory Limit: none
[ 2925.238580] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Fixes: 8d396bb0 ("usb: dwc3: debugfs: Add and remove endpoint dirs dynamically")
Cc: Jack Pham <jackp@codeaurora.org>
Tested-by: Jack Pham <jackp@codeaurora.org>
Signed-off-by: Peter Chen <peter.chen@kernel.org>
Link: https://lore.kernel.org/r/20210608105656.10795-1-peter.chen@kernel.org
(cherry picked from commit 2a042767)
Link: https://lore.kernel.org/r/20210615080847.GA10432@jackp-linux.qualcomm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

7345598e

usb: dwc3: debugfs: Add and remove endpoint dirs dynamically · 58251724

Committed by Jack Pham on Jun 30, 2021

stable inclusion
from stable-5.10.46
commit e52d43c82f2f6556f0b7a790c19c072c1e99a95f
bugzilla: 168323
CVE: NA

--------------------------------

commit 8d396bb0 upstream.

The DWC3 DebugFS directory and files are currently created once
during probe.  This includes creation of subdirectories for each
of the gadget's endpoints.  This works fine for peripheral-only
controllers, as dwc3_core_init_mode() calls dwc3_gadget_init()
just prior to calling dwc3_debugfs_init().

However, for dual-role controllers, dwc3_core_init_mode() will
instead call dwc3_drd_init() which is problematic in a few ways.
First, the initial state must be determined, then dwc3_set_mode()
will have to schedule drd_work and by then dwc3_debugfs_init()
could have already been invoked.  Even if the initial mode is
peripheral, dwc3_gadget_init() happens after the DebugFS files
are created, and worse so if the initial state is host and the
controller switches to peripheral much later.  And secondly,
even if the gadget endpoints' debug entries were successfully
created, if the controller exits peripheral mode, its dwc3_eps
are freed so the debug files would now hold stale references.

So it is best if the DebugFS endpoint entries are created and
removed dynamically at the same time the underlying dwc3_eps are.
Do this by calling dwc3_debugfs_create_endpoint_dir() as each
endpoint is created, and conversely remove the DebugFS entry when
the endpoint is freed.

Fixes: 41ce1456 ("usb: dwc3: core: make dwc3_set_mode() work properly")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Peter Chen <peter.chen@kernel.org>
Signed-off-by: Jack Pham <jackp@codeaurora.org>
Link: https://lore.kernel.org/r/20210529192932.22912-1-jackp@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

58251724

perf beauty: Update copy of linux/socket.h with the kernel sources · 82200035

Committed by Arnaldo Carvalho de Melo on Jun 30, 2021

stable inclusion
from stable-5.10.46
commit 1b5fbb66182f5cab525be163327ce1a1fdbb9f15
bugzilla: 168323
CVE: NA

--------------------------------

commit ef83f9ef upstream.

To pick the changes in:

  ea6932d7 ("net: make get_net_ns return error if NET_NS is disabled")

That doesn't result in any changes in the tables generated from that
header.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/perf/trace/beauty/include/linux/socket.h' differs from latest version at 'include/linux/socket.h'
  diff -u tools/perf/trace/beauty/include/linux/socket.h include/linux/socket.h

Cc: Changbin Du <changbin.du@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

82200035

tools headers UAPI: Sync linux/in.h copy with the kernel sources · b0992698

Committed by Arnaldo Carvalho de Melo on Jun 30, 2021

stable inclusion
from stable-5.10.46
commit 69371e0482ea3a39484642e8d29c3d51fb26a915
bugzilla: 168323
CVE: NA

--------------------------------

commit 1792a59e upstream.

To pick the changes in:

  32182747 ("icmp: don't send out ICMP messages with a source address of 0.0.0.0")

That doesn't result in any change in tooling, as the INADDR_ constants are
not used to generate the id->string tables used by 'perf trace'.

This addresses this build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Cc: David S. Miller <davem@davemloft.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

b0992698

openeuler / Kernel: synced successfully almost 2 years ago
