- 14 Jul, 2021 1 commit
-
-
Committed by Matthew Wilcox (Oracle)
mainline inclusion
from mainline-v5.11-rc1
commit 0060ef3b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZE5V
CVE: NA

-------------------------------------------------

We can only kmap() one subpage of a THP at a time, so loop over all relevant subpages, skipping ones which don't need to be zeroed. This is too large to inline when THPs are enabled and we actually need highmem, so put it in highmem.c.

[willy@infradead.org: start1 was allowed to be less than start2]

Link: https://lkml.kernel.org/r/20201124041507.28996-1-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Naresh Kamboju <naresh.kamboju@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
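The per-subpage loop described above can be illustrated with a small userspace sketch. This is not the kernel's code: `SUBPAGE_SIZE`, `zero_in_subpage()` and `zero_segments_sketch()` are hypothetical names, plain buffers stand in for subpages, and indexing `subpages[i]` stands in for mapping one subpage at a time with kmap().

```c
#include <stddef.h>
#include <string.h>

#define SUBPAGE_SIZE 4096u /* assumed subpage size for this sketch */

/*
 * Zero the part of the byte range [start, end) that falls inside
 * subpage number idx. Returns 1 if anything was zeroed, 0 if the
 * subpage could be skipped for this range.
 */
static int zero_in_subpage(unsigned char *subpage, unsigned idx,
                           unsigned start, unsigned end)
{
    unsigned lo = idx * SUBPAGE_SIZE;       /* first byte of this subpage */
    unsigned hi = lo + SUBPAGE_SIZE;        /* one past its last byte */
    unsigned s = start > lo ? start : lo;   /* clamp range to the subpage */
    unsigned e = end < hi ? end : hi;

    if (s >= e)
        return 0;                           /* no overlap: skip */
    memset(subpage + (s - lo), 0, e - s);
    return 1;
}

/*
 * Zero two byte ranges of a compound page made of nr subpages,
 * visiting one subpage at a time. The two ranges may come in either
 * order. Returns the number of subpages that were actually touched.
 */
unsigned zero_segments_sketch(unsigned char **subpages, unsigned nr,
                              unsigned start1, unsigned end1,
                              unsigned start2, unsigned end2)
{
    unsigned touched = 0;

    for (unsigned i = 0; i < nr; i++) {
        int t1 = zero_in_subpage(subpages[i], i, start1, end1);
        int t2 = zero_in_subpage(subpages[i], i, start2, end2);

        if (t1 || t2)
            touched++;
    }
    return touched;
}
```

Clamping each range to the current subpage keeps the skip test trivial: a subpage with no overlap is never written, which mirrors the "skipping ones which don't need to be zeroed" wording of the commit message.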
-
- 10 Jul, 2021 6 commits
-
-
Committed by Minchan Kim
mainline inclusion
from mainline-5.13-rc1
commit bbb26920
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZQ5G
CVE: NA

-------------------------------------------------

Since CMA is used more widely, it is worth adding CMA allocation statistics to vmstat. With them, we can tell how aggressively the system uses CMA allocation and how often it fails.

Link: https://lkml.kernel.org/r/20210302183346.3707237-1-minchan@kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
Cc: John Dias <joaodias@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit bbb26920)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Shakeel Butt
mainline inclusion
from mainline-v5.13-rc1
commit 3d0cbb98
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZXKY
CVE: NA

--------------------------------------

In the era of the async memcg oom-killer, commit a0d8b00a ("mm: memcg: do not declare OOM from __GFP_NOFAIL allocations") added code to skip the memcg oom-killer for __GFP_NOFAIL allocations. The reason was that __GFP_NOFAIL callers would not enter the async oom synchronization path and would keep the task marked as in memcg oom. At that time, tasks marked as in memcg oom could bypass the memcg limits, and the oom synchronization would only happen later, in a subsequent userspace-triggered page fault. This let a task marked as under memcg oom bypass the memcg limit for an arbitrary time.

With the synchronous memcg oom-killer (commit 29ef680a ("memcg, oom: move out_of_memory back to the charge path")) and with tasks marked under memcg oom no longer allowed to bypass the memcg limits (commit 1f14c1ac ("mm: memcg: do not allow task about to OOM kill to bypass the limit")), we can again allow __GFP_NOFAIL allocations to trigger a memcg oom-kill. This makes memcg oom behavior closer to page allocator oom behavior.

Link: https://lkml.kernel.org/r/20210223204337.2785120-1-shakeelb@google.com
Signed-off-by: Shakeel Butt <shakeelb@google.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by David Hildenbrand
mainline inclusion
from mainline-5.12-rc1-dontuse
commit 3c381db1
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZSR5
CVE: NA

-------------------------------------------------

Let's count the number of CMA pages per zone and print them in /proc/zoneinfo.

Having access to the total number of CMA pages per zone is helpful for debugging purposes to know where exactly the CMA pages ended up, and to figure out how many pages of a zone might behave differently, even after some of these pages might already have been allocated.

As one example, CMA pages that are part of a kernel zone cannot be used for ordinary kernel allocations but instead behave more like ZONE_MOVABLE.

For now, we are only able to get the global nr+free cma pages from /proc/meminfo and the free cma pages per zone from /proc/zoneinfo.

Example after this patch when booting a 6 GiB QEMU VM with "hugetlb_cma=2G":

  # cat /proc/zoneinfo | grep cma
          cma      0
        nr_free_cma  0
          cma      0
        nr_free_cma  0
          cma      524288
        nr_free_cma  493016
          cma      0
          cma      0
  # cat /proc/meminfo | grep Cma
  CmaTotal:        2097152 kB
  CmaFree:         1972064 kB

Note: We print even without CONFIG_CMA, just like "nr_free_cma"; this way, one can be sure when spotting "cma 0" that there are definitely no CMA pages located in a zone.

[david@redhat.com: v2]
Link: https://lkml.kernel.org/r/20210128164533.18566-1-david@redhat.com
[david@redhat.com: v3]
Link: https://lkml.kernel.org/r/20210129113451.22085-1-david@redhat.com
Link: https://lkml.kernel.org/r/20210127101813.6370-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Wei Yang <richard.weiyang@linux.alibaba.com>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 3c381db1)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Georgi Djakov
mainline inclusion
from mainline-5.13-rc1
commit 866b4852
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZD1N
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=866b485262173a2b873386162b2ddcfbcb542b4a

-------------------------------------------------

Collect the time when each allocation is freed, to help with memory analysis with kdump/ramdump. Add the timestamp also in the page_owner debugfs file and print it in dump_page().

Having another timestamp when we free the page helps for debugging page migration issues. For example, both alloc and free timestamps being the same can hint that there is an issue with migrating memory, as opposed to a page just being dropped during migration.

Link: https://lkml.kernel.org/r/20210203175905.12267-1-georgi.djakov@linaro.org
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 866b4852)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Liam Mark
mainline inclusion
from mainline-5.11-rc1
commit 9cc7e96a
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZD1N
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9cc7e96aa846f9086431d6c2d33ff9ab42d72b2d

-------------------------------------------------

Collect the time for each allocation recorded in page owner so that allocation "surges" can be measured.

Record the pid for each allocation recorded in page owner so that the source of allocation "surges" can be better identified.

The above is very useful when doing memory analysis. On a crash for example, we can get this information from kdump (or ramdump) and parse it to figure out memory allocation problems.

Please note that on x86_64 this increases the size of struct page_owner from 16 bytes to 32.

Vlastimil: it's not a functionality intended for production, so unless somebody says they need to enable page_owner for debugging and this increase prevents them from fitting into available memory, let's not complicate things with making this optional.

[lmark@codeaurora.org: v3]
Link: https://lkml.kernel.org/r/20201210160357.27779-1-georgi.djakov@linaro.org
Link: https://lkml.kernel.org/r/20201209125153.10533-1-georgi.djakov@linaro.org
Signed-off-by: Liam Mark <lmark@codeaurora.org>
Signed-off-by: Georgi Djakov <georgi.djakov@linaro.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 9cc7e96a)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Uladzislau Rezki (Sony)
mainline inclusion
from mainline-5.11-rc1
commit 96e2db45
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZG5D
CVE: NA

-------------------------------------------------

The current "lazy drain" model suffers from at least two issues.

The first is related to the unsorted list of vmap areas: identifying the [min:max] range of areas to be drained requires a full list scan, which is time consuming if the list is long.

The second, as a next step, is merging all fragments with the free space. This is also time consuming because it has to iterate over the entire list that holds the outstanding lazy areas.

See below the "preemptirqsoff" tracer output that illustrates the high latency. It is ~24676us. Our workloads, such as audio and video, are affected by such long latency:

<snip>
 tracer: preemptirqsoff

 preemptirqsoff latency trace v1.1.5 on 4.9.186-perf+
 --------------------------------------------------------------------
 latency: 24676 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 P:8)
    -----------------
    | task: crtc_commit:112-261 (uid:0 nice:0 policy:1 rt_prio:16)
    -----------------
  => started at: __purge_vmap_area_lazy
  => ended at:   __purge_vmap_area_lazy

                  _------=> CPU#
                 / _-----=> irqs-off
                | / _----=> need-resched
                || / _---=> hardirq/softirq
                ||| / _--=> preempt-depth
                |||| /     delay
  cmd     pid   ||||| time  |   caller
     \   /      |||||  \    |   /
 crtc_com-261     1...1     1us*: _raw_spin_lock <-__purge_vmap_area_lazy
 [...]
 crtc_com-261     1...1 24675us : _raw_spin_unlock <-__purge_vmap_area_lazy
 crtc_com-261     1...1 24677us : trace_preempt_on <-__purge_vmap_area_lazy
 crtc_com-261     1...1 24683us : <stack trace>
  => free_vmap_area_noflush
  => remove_vm_area
  => __vunmap
  => vfree
  => drm_property_free_blob
  => drm_mode_object_unreference
  => drm_property_unreference_blob
  => __drm_atomic_helper_crtc_destroy_state
  => sde_crtc_destroy_state
  => drm_atomic_state_default_clear
  => drm_atomic_state_clear
  => drm_atomic_state_free
  => complete_commit
  => _msm_drm_commit_work_cb
  => kthread_worker_fn
  => kthread
  => ret_from_fork
<snip>

To address those two issues we can redesign the purging of the outstanding lazy areas. Instead of queuing vmap areas to the list, we replace it by a separate rb-tree. In that case an area is located in the tree/list in ascending order. This gives us the following advantages:

a) Outstanding vmap areas are merged, creating bigger coalesced blocks, so the space becomes less fragmented.

b) It is possible to calculate a flush range [min:max] without scanning all elements. It is O(1) access time or complexity.

c) The final merge of areas with the rb-tree that represents the free space is faster because of (a). As a result the lock contention is also reduced.

Link: https://lkml.kernel.org/r/20201116220033.1837-2-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Hillf Danton <hdanton@sina.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: huang ying <huang.ying.caritas@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 96e2db45)
Signed-off-by: Yue Zou <zouyue3@huawei.com>
Reviewed-by: chenwandun <chenwandun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
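The effect of keeping outstanding areas sorted and coalesced can be shown with a minimal userspace sketch. It deliberately swaps the kernel's rb-tree plus list for a small sorted array, so insertion here is linear rather than logarithmic; `struct lazy_set`, `lazy_insert()` and `lazy_flush_range()` are hypothetical names, and the areas are assumed not to overlap.

```c
#include <stddef.h>
#include <string.h>

#define MAX_AREAS 64 /* arbitrary capacity for the sketch */

struct area {
    unsigned long start, end;   /* half-open range [start, end) */
};

struct lazy_set {
    struct area a[MAX_AREAS];   /* kept in ascending address order */
    size_t n;
};

/*
 * Insert [start, end), keeping ascending order and merging with a
 * neighbour whenever the ranges are adjacent. Returns 0 on success,
 * -1 if the array is full.
 */
int lazy_insert(struct lazy_set *s, unsigned long start, unsigned long end)
{
    size_t i = 0;

    /* Find the first area that starts at or after the new one. */
    while (i < s->n && s->a[i].start < start)
        i++;

    int merge_prev = i > 0 && s->a[i - 1].end == start;
    int merge_next = i < s->n && s->a[i].start == end;

    if (merge_prev && merge_next) {
        /* The new area bridges two existing ones: fold all three. */
        s->a[i - 1].end = s->a[i].end;
        memmove(&s->a[i], &s->a[i + 1], (s->n - i - 1) * sizeof(s->a[0]));
        s->n--;
    } else if (merge_prev) {
        s->a[i - 1].end = end;
    } else if (merge_next) {
        s->a[i].start = start;
    } else {
        if (s->n == MAX_AREAS)
            return -1;
        memmove(&s->a[i + 1], &s->a[i], (s->n - i) * sizeof(s->a[0]));
        s->a[i].start = start;
        s->a[i].end = end;
        s->n++;
    }
    return 0;
}

/*
 * With sorted, non-overlapping areas the flush range is simply the
 * first start and the last end: no scan is needed.
 * Returns 0 on success, -1 if the set is empty.
 */
int lazy_flush_range(const struct lazy_set *s,
                     unsigned long *min, unsigned long *max)
{
    if (s->n == 0)
        return -1;
    *min = s->a[0].start;
    *max = s->a[s->n - 1].end;
    return 0;
}
```

Freeing three adjacent areas in any order leaves a single coalesced entry, which corresponds to advantage (a), and `lazy_flush_range()` reads only the two ends of the set, which corresponds to advantage (b).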
-
- 08 Jul, 2021 5 commits
-
-
Committed by Wei Li
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN72
CVE: NA

---------------------------

Currently, clear_page() clears the page through 'dc zva'. In most cases the page is not used immediately, so the cache flush is in vain. Add an optimized implementation of clear_page() that uses 'stnp' to improve performance.

It can be switched on with the boot cmdline 'mm.use_clearpage_stnp'.

In the hugetlb clear test, we gained about 53.7% performance improvement:

Set mm.use_clearpage_stnp = 0            | Set mm.use_clearpage_stnp = 1
[root@localhost liwei]# ./a.out 50 20    | [root@localhost liwei]# ./a.out 50 20
size is 50 Gib, test times is 20         | size is 50 Gib, test times is 20
test_time[0] : use 8.438046 sec          | test_time[0] : use 3.722682 sec
test_time[1] : use 8.028493 sec          | test_time[1] : use 3.640274 sec
test_time[2] : use 8.646547 sec          | test_time[2] : use 4.095052 sec
test_time[3] : use 8.122490 sec          | test_time[3] : use 3.998446 sec
test_time[4] : use 8.053038 sec          | test_time[4] : use 4.084259 sec
test_time[5] : use 8.843512 sec          | test_time[5] : use 3.933871 sec
test_time[6] : use 8.308906 sec          | test_time[6] : use 3.934334 sec
test_time[7] : use 8.093817 sec          | test_time[7] : use 3.869142 sec
test_time[8] : use 8.303504 sec          | test_time[8] : use 3.902916 sec
test_time[9] : use 8.178336 sec          | test_time[9] : use 3.541885 sec
test_time[10] : use 8.003625 sec         | test_time[10] : use 3.595554 sec
test_time[11] : use 8.163807 sec         | test_time[11] : use 3.583813 sec
test_time[12] : use 8.267464 sec         | test_time[12] : use 3.863033 sec
test_time[13] : use 8.055326 sec         | test_time[13] : use 3.770953 sec
test_time[14] : use 8.246986 sec         | test_time[14] : use 3.808006 sec
test_time[15] : use 8.546992 sec         | test_time[15] : use 3.653194 sec
test_time[16] : use 8.727256 sec         | test_time[16] : use 3.722395 sec
test_time[17] : use 8.288951 sec         | test_time[17] : use 3.683508 sec
test_time[18] : use 8.019322 sec         | test_time[18] : use 4.253087 sec
test_time[19] : use 8.250685 sec         | test_time[19] : use 4.082845 sec
hugetlb test end!                        | hugetlb test end!

Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jing Xiangfeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

If the parent's qos_level is set, iterate over all cgroups under this tree to modify memory.qos_level synchronously.

Currently qos_level supports 0 and -1.

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jing Xiangfeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

This patch adds a default-false static key to disable the memcg priority feature. To enable it, write 1:

  echo 1 > /proc/sys/vm/memcg_qos_enable

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jing Xiangfeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

Enable CONFIG_MEMCG_QOS to support memcg OOM priority.

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jing Xiangfeng
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZN3O
CVE: NA

--------------------------------------

If OOM occurs, we first kill a process from a low priority memcg. If no such process is found, fall back to the normal OOM handling.

Signed-off-by: Jing Xiangfeng <jingxiangfeng@huawei.com>
Reviewed-by: Liu Shixin <liushixin2@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
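The "low priority first, then fall back" selection can be sketched as a plain function over a task table. Everything here is illustrative rather than the patch's code: `struct task_info`, its `badness` score and `select_victim()` are hypothetical, and the sketch assumes that a negative qos_level marks the low priority memcg.

```c
#include <stddef.h>

/* Hypothetical per-task summary used only by this sketch. */
struct task_info {
    int pid;
    int qos_level;          /* assumed: -1 = low priority, 0 = normal */
    unsigned long badness;  /* larger means a better OOM victim */
};

/*
 * Pick the task with the highest badness among low priority tasks.
 * If there is no low priority task, fall back to the highest badness
 * overall. Returns NULL when the table is empty.
 */
const struct task_info *select_victim(const struct task_info *t, size_t n)
{
    const struct task_info *low = NULL; /* best low priority candidate */
    const struct task_info *any = NULL; /* best candidate overall */

    for (size_t i = 0; i < n; i++) {
        if (!any || t[i].badness > any->badness)
            any = &t[i];
        if (t[i].qos_level < 0 && (!low || t[i].badness > low->badness))
            low = &t[i];
    }
    return low ? low : any;
}
```

Tracking both candidates in one pass means the fallback costs nothing extra when no low priority task exists.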
-
- 07 Jul, 2021 2 commits
-
-
Committed by Xie XiuQi
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFV2
CVE: NA

------------------------------------------------------------

Enable CONFIG_HISILICON_ERRATUM_HIP08_RU_PREFETCH to add a cmdline option to disable prefetch.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Kai Shen
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZFV2
CVE: NA

-----------------------------------------------------------

Random performance drops appear in Hackbench cases that test pipe or socket communication among multiple threads on the Hisi HIP08 SoC. Two things lead to this problem: cache sharing caused by a change of the data layout, and the cache readunique prefetch mechanism.

The readunique mechanism, which may be triggered by a store operation, invalidates cachelines on other cores during the data fetching stage. When data is shared between cores, this makes cacheline invalidation happen frequently.

Disabling cache readunique prefetch can tackle this problem. Test cases are like:

  for i in 20;do
      echo "--------pipe thread num=$i----------"
      for j in $(seq 1 10);do
          ./hackbench -pipe $i thread 1000
      done
  done

We disable readunique prefetch only in el2, because disabling it in el1 may cause a panic due to the lack of the related priority, which is often set in BIOS.

Introduce CONFIG_HISILICON_ERRATUM_HIP08_RU_PREFETCH and disable RU prefetch using the boot cmdline 'readunique_prefetch=off'.

Signed-off-by: Kai Shen <shenkai8@huawei.com>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
[XQ: adjusted context]
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 06 Jul, 2021 26 commits
-
-
Committed by Zheng Zengkai
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3Z80Y
CVE: NA

-------------------------------------------------

Disable config ARM64_BOOTPARAM_HOTPLUG_CPU0 in openeuler_defconfig by default.

Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
-
Committed by Zheng Zengkai
hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3Z80Y
CVE: NA

-------------------------------------------------

The new config switch CONFIG_ARM64_BOOTPARAM_HOTPLUG_CPU0 sets whether the default state of arm64_cpu0_hotpluggable is on or off.

If the config switch is off, arm64_cpu0_hotpluggable is off by default, but it can still be turned on at boot with the kernel parameter arm64_cpu0_hotplug.

If the config switch is on, arm64_cpu0_hotpluggable is always on.

Whether CPU0 is hotpluggable depends on both cpu_can_disable(0) and arm64_cpu0_hotpluggable.

The default value of the config switch is off.

Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
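The decision described above reduces to a small truth table, sketched here as a pure function. The function and parameter names are hypothetical: `config_on` stands for CONFIG_ARM64_BOOTPARAM_HOTPLUG_CPU0, `bootparam_given` for the arm64_cpu0_hotplug kernel parameter, and `can_disable` for the result of cpu_can_disable(0).

```c
#include <stdbool.h>

/*
 * Model of the gating logic only, not the kernel's code.
 * The flag is on when the config switch is on, or when the boot
 * parameter was given; CPU0 is hotpluggable only if the platform
 * also allows disabling it.
 */
bool cpu0_hotpluggable_sketch(bool config_on, bool bootparam_given,
                              bool can_disable)
{
    bool arm64_cpu0_hotpluggable = config_on || bootparam_given;

    return can_disable && arm64_cpu0_hotpluggable;
}
```

With the config switch off and no boot parameter, the function returns false regardless of `can_disable`, matching the stated default.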
-
Committed by Dan Carpenter
mainline inclusion
from mainline-v5.13-rc4
commit 1a590a1c
category: bugfix
bugzilla: 108082
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1a590a1c8bf46bf80ea12b657ca44c345531ac80

-------------------------------------------------------------------------

In current kernels small allocations never fail, but checking for allocation failure is the correct thing to do.

Fixes: 18abda7a ("iommu/vt-d: Fix general protection fault in aux_detach_device()")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/YJuobKuSn81dOPLd@mwanda
Link: https://lore.kernel.org/r/20210519015027.108468-2-baolu.lu@linux.intel.com
Signed-off-by: Joerg Roedel <jroedel@suse.de>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Liu Yi L
mainline inclusion
from mainline-v5.11-rc3
commit 7c29ada5
category: bugfix
bugzilla: 108083
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7c29ada5e70083805bc3a68daa23441df421fbee

-------------------------------------------------------------------------

iommu_flush_dev_iotlb() is called to invalidate caches on a device but only loops over the devices which are fully-attached to the domain. For sub-devices, this is ineffective and can result in invalid caching entries left on the device.

Fix the missing invalidation by adding a loop over the subdevices and ensuring that 'domain->has_iotlb_device' is updated when attaching to subdevices.

Fixes: 67b8e02b ("iommu/vt-d: Aux-domain specific domain attach/detach")
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1609949037-25291-4-git-send-email-yi.l.liu@intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Liu Yi L
mainline inclusion
from mainline-v5.11-rc3
commit 18abda7a
category: bugfix
bugzilla: 108082
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=18abda7a2d555783d28ea1701f3ec95e96237a86

-------------------------------------------------------------------------

The aux-domain attach/detach are not tracked, some data structures might be used after free. This causes general protection faults when multiple subdevices are created and assigned to a same guest machine:

| general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] SMP NOPTI
| RIP: 0010:intel_iommu_aux_detach_device+0x12a/0x1f0
| [...]
| Call Trace:
|  iommu_aux_detach_device+0x24/0x70
|  vfio_mdev_detach_domain+0x3b/0x60
|  ? vfio_mdev_set_domain+0x50/0x50
|  iommu_group_for_each_dev+0x4f/0x80
|  vfio_iommu_detach_group.isra.0+0x22/0x30
|  vfio_iommu_type1_detach_group.cold+0x71/0x211
|  ? find_exported_symbol_in_section+0x4a/0xd0
|  ? each_symbol_section+0x28/0x50
|  __vfio_group_unset_container+0x4d/0x150
|  vfio_group_try_dissolve_container+0x25/0x30
|  vfio_group_put_external_user+0x13/0x20
|  kvm_vfio_group_put_external_user+0x27/0x40 [kvm]
|  kvm_vfio_destroy+0x45/0xb0 [kvm]
|  kvm_put_kvm+0x1bb/0x2e0 [kvm]
|  kvm_vm_release+0x22/0x30 [kvm]
|  __fput+0xcc/0x260
|  ____fput+0xe/0x10
|  task_work_run+0x8f/0xb0
|  do_exit+0x358/0xaf0
|  ? wake_up_state+0x10/0x20
|  ? signal_wake_up_state+0x1a/0x30
|  do_group_exit+0x47/0xb0
|  __x64_sys_exit_group+0x18/0x20
|  do_syscall_64+0x57/0x1d0
|  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fix the crash by tracking the subdevices when attaching and detaching aux-domains.

Fixes: 67b8e02b ("iommu/vt-d: Aux-domain specific domain attach/detach")
Co-developed-by: Xin Zeng <xin.zeng@intel.com>
Signed-off-by: Xin Zeng <xin.zeng@intel.com>
Signed-off-by: Liu Yi L <yi.l.liu@intel.com>
Acked-by: Lu Baolu <baolu.lu@linux.intel.com>
Link: https://lore.kernel.org/r/1609949037-25291-3-git-send-email-yi.l.liu@intel.com
Signed-off-by: Will Deacon <will@kernel.org>
Signed-off-by: Zhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Sargun Dhillon
mainline inclusion
from mainline-5.11-rc1
commit d3ff46fe
category: bugfix
bugzilla: 108595
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d3ff46fe693683cb9660e9b93e8c932cc8e0c1f8

---------------------------

In several patches work has been done to enable NFSv4 to use user namespaces:

  58002399: NFSv4: Convert the NFS client idmapper to use the container user namespace
  3b7eb5e3: NFS: When mounting, don't share filesystems between different user namespaces

Unfortunately, the userspace APIs were only such that the userspace facing side of the filesystem (superblock s_user_ns) could be set to a non init user namespace. This furthers the fs_context related refactoring, and piggybacks on top of that logic, so the superblock user namespace, and the NFS user namespace are the same.

Users can still use rpc.idmapd if they choose to, but there are complexities with user namespaces and request-key that have yet to be addressed.

Eventually, we will need to at least:
  * Come up with an upcall mechanism that can be triggered inside of the container, or safely triggered outside, with the requisite context to do the right mapping.
  * Handle whatever refactoring needs to be done in net/sunrpc.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Sargun Dhillon
mainline inclusion
from mainline-5.11-rc1
commit d18a9d3f
category: bugfix
bugzilla: 108596
CVE: NA

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d18a9d3fa0f27a47706fb67f1ee0f4d971587c4e

---------------------------

There was refactoring done to use the fs_context for mounting done in:

  62a55d08: NFS: Additional refactoring for fs_context conversion

This made it so that the net_ns is fetched from the fs_context (the netns that fsopen is called in). This change also makes it so that the credential fetched during fsopen is used as well as the net_ns.

NFS has already had a number of changes to prepare it for user namespaces:

  1a58e8a0: NFS: Store the credential of the mount process in the nfs_server
  264d948c: NFS: Convert NFSv3 to use the container user namespace
  c207db2f: NFS: Convert NFSv2 to use the container user namespace

Previously, different credentials could be used for creation of the fs_context versus creation of the nfs_server, as FSCONFIG_CMD_CREATE did the actual credential check, and that's where current_creds() were fetched. This meant that the user namespace which fsopen was called in could be a non-init user namespace. This still requires that the user that calls FSCONFIG_CMD_CREATE has CAP_SYS_ADMIN in the init user ns.

This roughly allows a privileged user to mount on behalf of an unprivileged usernamespace, by forking off and calling fsopen in the unprivileged user namespace. It can then pass back that fsfd to the privileged process which can configure the NFS mount, and then it can call FSCONFIG_CMD_CREATE before switching back into the mount namespace of the container, and finish up the mounting process and call fsmount and move_mount.

Signed-off-by: Sargun Dhillon <sargun@sargun.me>
Tested-by: Alban Crequy <alban.crequy@gmail.com>
Fixes: 62a55d08 ("NFS: Additional refactoring for fs_context conversion")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jann Horn
stable inclusion
from stable-5.11-rc1
commit fab686eb
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=fab686eb0307121e7a2890b6d6c57edd2457863d

-------------------------------------------------

Buffers that are passed to read_actions_logged() and write_actions_logged() are in kernel memory; the sysctl core takes care of copying from/to userspace.

Fixes: 32927393 ("sysctl: pass kernel pointers to ->proc_handler")
Reviewed-by: Tyler Hicks <code@tyhicks.com>
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20201120170545.1419332-1-jannh@google.com
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 0d8315dd
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=0d8315dddd2899f519fe1ca3d4d5cdaf44ea421e

-------------------------------------------------

Currently the kernel does not provide an infrastructure to translate architecture numbers to a human-readable name. Translating syscall numbers to syscall names is possible through FTRACE_SYSCALL infrastructure but it does not provide support for compat syscalls.

This will create a file for each PID as /proc/pid/seccomp_cache. The file will be empty when no seccomp filters are loaded, or be in the format of:

  <arch name> <decimal syscall number> <ALLOW | FILTER>

where ALLOW means the cache is guaranteed to allow the syscall, and FILTER means the cache will pass the syscall to the BPF filter.

For the docker default profile on x86_64 it looks like:

  x86_64 0 ALLOW
  x86_64 1 ALLOW
  x86_64 2 ALLOW
  x86_64 3 ALLOW
  [...]
  x86_64 132 ALLOW
  x86_64 133 ALLOW
  x86_64 134 FILTER
  x86_64 135 FILTER
  x86_64 136 FILTER
  x86_64 137 ALLOW
  x86_64 138 ALLOW
  x86_64 139 FILTER
  x86_64 140 ALLOW
  x86_64 141 ALLOW
  [...]

This file is guarded by CONFIG_SECCOMP_CACHE_DEBUG with a default of N because I think certain users of seccomp might not want the application to know which syscalls are definitely usable. For the same reason, it is also guarded by CAP_SYS_ADMIN.

Suggested-by: Jann Horn <jannh@google.com>
Link: https://lore.kernel.org/lkml/CAG48ez3Ofqp4crXGksLmZY6=fGrF_tWyUCg7PBkAetvbbOPeOA@mail.gmail.com/
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/94e663fa53136f5a11f432c661794d1ee7060779.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
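A consumer of this file only needs to split each line into the three fields of the documented format. The following is a minimal userspace sketch of such a parser; `struct cache_entry` and `parse_seccomp_cache_line()` are hypothetical names, and the 31-character limit on the architecture name is an assumption of the sketch, not something the file format states.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/pid/seccomp_cache (sketch-only type). */
struct cache_entry {
    char arch[32];  /* architecture name, e.g. "x86_64" */
    int nr;         /* decimal syscall number */
    int allow;      /* 1 for ALLOW, 0 for FILTER */
};

/*
 * Parse "<arch name> <decimal syscall number> <ALLOW | FILTER>".
 * Returns 0 on success, -1 if the line does not match the format.
 */
int parse_seccomp_cache_line(const char *line, struct cache_entry *out)
{
    char action[16];

    /* Field widths keep sscanf() from overflowing the buffers. */
    if (sscanf(line, "%31s %d %15s", out->arch, &out->nr, action) != 3)
        return -1;

    if (strcmp(action, "ALLOW") == 0)
        out->allow = 1;
    else if (strcmp(action, "FILTER") == 0)
        out->allow = 0;
    else
        return -1;
    return 0;
}
```

A caller would typically read the file line by line with fgets() and count the FILTER entries to see how many syscalls still reach the BPF filter.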
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 445247b0
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=445247b02342a05b7d528bba6d85d2d418875b69

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for xtensa.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/79669648ba167d668ea6ffb4884250abcd5ed254.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 4c18bc05
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4c18bc054bffe415bec9e0edaa9ff1a84c1a6973

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for sh.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/61ae084cd4783b9b50860d9dedb4a348cf1b7b6f.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit c09058ed
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=c09058eda2654c37fd7ac28c2004c3aae8b988e9

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for s390.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Acked-by: Heiko Carstens <hca@linux.ibm.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/a381b10aa2c5b1e583642f3cd46ced842d9d4ce5.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 673a11a7
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=673a11a7e4152b101bad6851c4e4c34c7c6d6dde

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for riscv.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/58ef925d00505cbb77478fa6bd2b48ab2d902460.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit e7bcb462
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=e7bcb4622ddf4473da6c03fa8423919a568c57dc

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for powerpc.

__LITTLE_ENDIAN__ is used here instead of CONFIG_CPU_LITTLE_ENDIAN to keep it consistent with asm/syscall.h.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/0b64925362671cdaa26d01bfe50b3ba5e164adfd.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 6aa7923c
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6aa7923c8737d1f8fd2a06154155d68dec646464

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for parisc.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Acked-by: Helge Deller <deller@gmx.de>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/9bb86c546eda753adf5270425e7353202dbce87c.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 6e9ae6f9
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=6e9ae6f98809e0d123ff4d769ba2e6f652119138

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for csky.

Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/f9219026d4803b22f3e57e3768b4e42e004ef236.1605101222.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Kees Cook
stable inclusion
from stable-5.11-rc1
commit 424c9102
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=424c9102fa7b2a5c15afe47fd14278c849f4eefb

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for arm.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Kees Cook
stable inclusion
from stable-5.11-rc1
commit ffde7034
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ffde703470b03b1000017ed35c4f90a90caa22cf

-------------------------------------------------

To enable seccomp constant action bitmaps, we need to have a static mapping to the audit architecture and system call table size. Add these for arm64.

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Kees Cook
stable inclusion
from stable-5.11-rc1
commit 192cf322
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=192cf32243ce39af65bd095625aec374b38c03df

-------------------------------------------------

As part of the seccomp benchmarking, include the expectations with regard to the timing behavior of the constant action bitmaps, and report inconsistencies better.

Example output with constant action bitmaps on x86:

    $ sudo ./seccomp_benchmark 100000000
    Current BPF sysctl settings:
    net.core.bpf_jit_enable = 1
    net.core.bpf_jit_harden = 0
    Benchmarking 200000000 syscalls...
    129.359381409 - 0.008724424 = 129350656985 (129.4s)
    getpid native: 646 ns
    264.385890006 - 129.360453229 = 135025436777 (135.0s)
    getpid RET_ALLOW 1 filter (bitmap): 675 ns
    399.400511893 - 264.387045901 = 135013465992 (135.0s)
    getpid RET_ALLOW 2 filters (bitmap): 675 ns
    545.872866260 - 399.401718327 = 146471147933 (146.5s)
    getpid RET_ALLOW 3 filters (full): 732 ns
    696.337101319 - 545.874097681 = 150463003638 (150.5s)
    getpid RET_ALLOW 4 filters (full): 752 ns
    Estimated total seccomp overhead for 1 bitmapped filter: 29 ns
    Estimated total seccomp overhead for 2 bitmapped filters: 29 ns
    Estimated total seccomp overhead for 3 full filters: 86 ns
    Estimated total seccomp overhead for 4 full filters: 106 ns
    Estimated seccomp entry overhead: 29 ns
    Estimated seccomp per-filter overhead (last 2 diff): 20 ns
    Estimated seccomp per-filter overhead (filters / 4): 19 ns
    Expectations:
        native ≤ 1 bitmap (646 ≤ 675): ✔️
        native ≤ 1 filter (646 ≤ 732): ✔️
        per-filter (last 2 diff) ≈ per-filter (filters / 4) (20 ≈ 19): ✔️
        1 bitmapped ≈ 2 bitmapped (29 ≈ 29): ✔️
        entry ≈ 1 bitmapped (29 ≈ 29): ✔️
        entry ≈ 2 bitmapped (29 ≈ 29): ✔️
        native + entry + (per filter * 4) ≈ 4 filters total (755 ≈ 752): ✔️

[YiFei: Changed commit message to show stats for this patch series]
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/1b61df3db85c5f7f1b9202722c45e7b39df73ef2.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
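The derived figures in the seccomp benchmark output above can be re-checked from the raw timings. The sketch below is an assumption-laden reconstruction: the formulas are inferred from the numbers printed in the example output (integer nanoseconds, truncating division), not copied from the selftest source, and the names `bench_estimates`, `ns_per_call`, and `estimate` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Figures derived from per-call timings, all in nanoseconds. */
struct bench_estimates {
    uint64_t entry;                  /* 1 bitmapped filter minus native */
    uint64_t per_filter_last_2_diff; /* 4 full filters minus 3 full filters */
    uint64_t per_filter_filters_4;   /* (4 full - native - entry) / 4 */
    uint64_t predicted_4_filters;    /* native + entry + per-filter * 4 */
};

/* Average cost of one call, truncated to whole nanoseconds. */
static uint64_t ns_per_call(uint64_t elapsed_ns, uint64_t calls)
{
    assert(calls != 0);
    return elapsed_ns / calls;
}

/*
 * Inferred arithmetic behind the "Estimated ..." lines. Assumes each
 * filtered timing is at least the native timing, so the unsigned
 * subtractions cannot wrap.
 */
static struct bench_estimates estimate(uint64_t native, uint64_t bitmap1,
                                       uint64_t full3, uint64_t full4)
{
    struct bench_estimates e;

    e.entry = bitmap1 - native;
    e.per_filter_last_2_diff = full4 - full3;
    e.per_filter_filters_4 = (full4 - native - e.entry) / 4;
    e.predicted_4_filters = native + e.entry + e.per_filter_last_2_diff * 4;
    return e;
}
```

With the values from the example run (646, 675, 732, and 752 ns), this reproduces the 29 ns entry overhead, the 20 ns and 19 ns per-filter estimates, and the 755 ns prediction that the last expectation compares against the measured 752 ns.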
Committed by Kees Cook
stable inclusion
from stable-5.11-rc1
commit 25db9120
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=25db91209a910a0ccf8b093743088d0f4bf5659f

-------------------------------------------------

Provide seccomp internals with the details to calculate which syscall table the running kernel is expecting to deal with. This allows for efficient architecture pinning and paves the way for constant-action bitmaps.

Co-developed-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/da58c3733d95c4f2115dd94225dfbe2573ba4d87.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit 8e01b51a
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=8e01b51a31a1e08e2c3e8fcc0ef6790441be2f61

-------------------------------------------------

SECCOMP_CACHE will only operate on syscalls that do not access any syscall arguments or instruction pointer. To facilitate this we need a static analyser to know whether a filter will return allow regardless of syscall arguments for a given architecture number / syscall number pair. This is implemented here with a pseudo-emulator, and stored in a per-filter bitmap.

In order to build this bitmap at filter attach time, each filter is emulated for every syscall (under each possible architecture), and checked for any accesses of struct seccomp_data other than the "arch" and "nr" (syscall) members. If only "arch" and "nr" are examined, and the program returns allow, then we can be sure that the filter must return allow independent of syscall arguments.

Nearly all seccomp filters are built from these cBPF instructions:

    BPF_LD  | BPF_W    | BPF_ABS
    BPF_JMP | BPF_JEQ  | BPF_K
    BPF_JMP | BPF_JGE  | BPF_K
    BPF_JMP | BPF_JGT  | BPF_K
    BPF_JMP | BPF_JSET | BPF_K
    BPF_JMP | BPF_JA
    BPF_RET | BPF_K
    BPF_ALU | BPF_AND  | BPF_K

Each of these instructions is emulated. Any weirdness or loading from a syscall argument will cause the emulator to bail.

The emulation is also halted if it reaches a return. In that case, if it returns SECCOMP_RET_ALLOW, the syscall is marked as good.

Emulator structure and comments are from Kees [1] and Jann [2].

Emulation is done at attach time. If a filter depends on more filters, and if the dependee does not guarantee to allow the syscall, then we skip the emulation of this syscall.

[1] https://lore.kernel.org/lkml/20200923232923.3142503-5-keescook@chromium.org/
[2] https://lore.kernel.org/lkml/CAG48ez1p=dR_2ikKq=xVxkoGg0fYpTBpkhJSv1w-6BG=76PAvw@mail.gmail.com/

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/71c7be2db5ee08905f41c3be5c1ad6e2601ce88f.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
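The pseudo-emulator idea can be illustrated in userspace. The sketch below is not the kernel implementation: it is a simplified model that assumes a Linux build environment with the UAPI headers `<linux/filter.h>` and `<linux/seccomp.h>`, assumes the filter has already been validated (jump targets in range), and uses a hypothetical function name, `filter_is_const_allow`. It handles only the instructions listed in the commit message and bails on anything else.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/*
 * Return true only if the cBPF program provably returns SECCOMP_RET_ALLOW
 * for the given (arch, nr) pair while reading nothing but those two fields
 * of struct seccomp_data. Any other load or opcode makes the answer false.
 */
static bool filter_is_const_allow(const struct sock_filter *insns, size_t len,
                                  uint32_t arch, int nr)
{
    uint32_t reg_a = 0; /* emulated accumulator */
    size_t pc;

    assert(insns != NULL);

    for (pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &insns[pc];
        uint32_t k = insn->k;
        bool taken;

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (k == offsetof(struct seccomp_data, nr))
                reg_a = (uint32_t)nr;
            else if (k == offsetof(struct seccomp_data, arch))
                reg_a = arch;
            else
                return false; /* argument or instruction pointer access */
            break;
        case BPF_RET | BPF_K:
            return k == SECCOMP_RET_ALLOW;
        case BPF_JMP | BPF_JA:
            pc += k; /* offset is relative to the next instruction */
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
        case BPF_JMP | BPF_JGE | BPF_K:
        case BPF_JMP | BPF_JGT | BPF_K:
        case BPF_JMP | BPF_JSET | BPF_K:
            switch (BPF_OP(insn->code)) {
            case BPF_JEQ:
                taken = reg_a == k;
                break;
            case BPF_JGE:
                taken = reg_a >= k;
                break;
            case BPF_JGT:
                taken = reg_a > k;
                break;
            default: /* BPF_JSET */
                taken = (reg_a & k) != 0;
                break;
            }
            pc += taken ? insn->jt : insn->jf;
            break;
        case BPF_ALU | BPF_AND | BPF_K:
            reg_a &= k;
            break;
        default:
            return false; /* unknown instruction: bail */
        }
    }
    return false; /* ran off the end without a return */
}
```

Running this for every syscall number under each supported architecture, and recording the true results in a bitmap, corresponds to the attach-time pass the commit message describes. The sketch leaves out the chaining rule for dependent filters.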
-
Committed by YiFei Zhu
stable inclusion
from stable-5.11-rc1
commit f9d480b6
bugzilla: 167382
CVE: N/A

Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f9d480b6ffbeb336bf7f6ce44825c00f61b3abae

-------------------------------------------------

The overhead of running Seccomp filters has been part of some past discussions [1][2][3]. Oftentimes, the filters have a large number of instructions that check syscall numbers one by one and jump based on that. Some users chain BPF filters, which further enlarges the overhead. A recent work [6] comprehensively measures the Seccomp overhead and shows that the overhead is non-negligible and has a non-trivial impact on application performance.

We observed that some common filters, such as docker's [4] or systemd's [5], will make most decisions based only on the syscall numbers, and as past discussions considered, a bitmap where each bit represents a syscall makes most sense for these filters.

The fast (common) path for seccomp should be that the filter permits the syscall to pass through, and failing seccomp is expected to be an exceptional case; it is not expected for userspace to call a denylisted syscall over and over.

When it can be concluded that an allow must occur for the given architecture and syscall pair (this determination is introduced in the next commit), seccomp will immediately allow the syscall, bypassing further BPF execution.

Each architecture number has its own bitmap. The architecture number in seccomp_data is checked against the defined architecture number constant before proceeding to test the bit against the bitmap with the syscall number as the index of the bit in the bitmap, and if the bit is set, seccomp returns allow. The bitmaps are all clear in this patch and will be initialized in the next commit.

When only one architecture exists, the check against architecture number is skipped, suggested by Kees Cook [7].

[1] https://lore.kernel.org/linux-security-module/c22a6c3cefc2412cad00ae14c1371711@huawei.com/T/
[2] https://lore.kernel.org/lkml/202005181120.971232B7B@keescook/T/
[3] https://github.com/seccomp/libseccomp/issues/116
[4] https://github.com/moby/moby/blob/ae0ef82b90356ac613f329a8ef5ee42ca923417d/profiles/seccomp/default.json
[5] https://github.com/systemd/systemd/blob/6743a1caf4037f03dc51a1277855018e4ab61957/src/shared/seccomp-util.c#L270
[6] Draco: Architectural and Operating System Support for System Call Security, https://tianyin.github.io/pub/draco.pdf, MICRO-53, Oct. 2020
[7] https://lore.kernel.org/bpf/202010091614.8BB0EB64@keescook/

Co-developed-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: Dimitrios Skarlatos <dskarlat@cs.cmu.edu>
Signed-off-by: YiFei Zhu <yifeifz2@illinois.edu>
Reviewed-by: Jann Horn <jannh@google.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/10f91a367ec4fcdea7fc3f086de3f5f13a4a7436.1602431034.git.yifeifz2@illinois.edu
Signed-off-by: GONG, Ruiqi <gongruiqi1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Peter Chen
stable inclusion
from stable-5.10.46
commit 174c27583b3807ac96228c442735b02622d8d1c3
bugzilla: 168323
CVE: NA

--------------------------------

commit 4bf584a0 upstream.

On system reboot, dwc3_shutdown is called and the whole debugfs tree for dwc3 is removed first. When the gadget then deinitializes and tries to remove the debugfs entries for its endpoints, the call to debugfs_lookup hits a NULL pointer dereference. Fix it by removing the whole dwc3 debugfs later than dwc3_drd_exit.

[ 2924.958838] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000002
....
[ 2925.030994] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 2925.037005] pc : inode_permission+0x2c/0x198
[ 2925.041281] lr : lookup_one_len_common+0xb0/0xf8
[ 2925.045903] sp : ffff80001276ba70
[ 2925.049218] x29: ffff80001276ba70 x28: ffff0000c01f0000 x27: 0000000000000000
[ 2925.056364] x26: ffff800011791e70 x25: 0000000000000008 x24: dead000000000100
[ 2925.063510] x23: dead000000000122 x22: 0000000000000000 x21: 0000000000000001
[ 2925.070652] x20: ffff8000122c6188 x19: 0000000000000000 x18: 0000000000000000
[ 2925.077797] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000004
[ 2925.084943] x14: ffffffffffffffff x13: 0000000000000000 x12: 0000000000000030
[ 2925.092087] x11: 0101010101010101 x10: 7f7f7f7f7f7f7f7f x9 : ffff8000102b2420
[ 2925.099232] x8 : 7f7f7f7f7f7f7f7f x7 : feff73746e2f6f64 x6 : 0000000000008080
[ 2925.106378] x5 : 61c8864680b583eb x4 : 209e6ec2d263dbb7 x3 : 000074756f307065
[ 2925.113523] x2 : 0000000000000001 x1 : 0000000000000000 x0 : ffff8000122c6188
[ 2925.120671] Call trace:
[ 2925.123119] inode_permission+0x2c/0x198
[ 2925.127042] lookup_one_len_common+0xb0/0xf8
[ 2925.131315] lookup_one_len_unlocked+0x34/0xb0
[ 2925.135764] lookup_positive_unlocked+0x14/0x50
[ 2925.140296] debugfs_lookup+0x68/0xa0
[ 2925.143964] dwc3_gadget_free_endpoints+0x84/0xb0
[ 2925.148675] dwc3_gadget_exit+0x28/0x78
[ 2925.152518] dwc3_drd_exit+0x100/0x1f8
[ 2925.156267] dwc3_remove+0x11c/0x120
[ 2925.159851] dwc3_shutdown+0x14/0x20
[ 2925.163432] platform_shutdown+0x28/0x38
[ 2925.167360] device_shutdown+0x15c/0x378
[ 2925.171291] kernel_restart_prepare+0x3c/0x48
[ 2925.175650] kernel_restart+0x1c/0x68
[ 2925.179316] __do_sys_reboot+0x218/0x240
[ 2925.183247] __arm64_sys_reboot+0x28/0x30
[ 2925.187262] invoke_syscall+0x48/0x100
[ 2925.191017] el0_svc_common.constprop.0+0x48/0xc8
[ 2925.195726] do_el0_svc+0x28/0x88
[ 2925.199045] el0_svc+0x20/0x30
[ 2925.202104] el0_sync_handler+0xa8/0xb0
[ 2925.205942] el0_sync+0x148/0x180
[ 2925.209270] Code: a9025bf5 2a0203f5 121f0056 370802b5 (79400660)
[ 2925.215372] ---[ end trace 124254d8e485a58b ]---
[ 2925.220012] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
[ 2925.227676] Kernel Offset: disabled
[ 2925.231164] CPU features: 0x00001001,20000846
[ 2925.235521] Memory Limit: none
[ 2925.238580] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b ]---

Fixes: 8d396bb0 ("usb: dwc3: debugfs: Add and remove endpoint dirs dynamically")
Cc: Jack Pham <jackp@codeaurora.org>
Tested-by: Jack Pham <jackp@codeaurora.org>
Signed-off-by: Peter Chen <peter.chen@kernel.org>
Link: https://lore.kernel.org/r/20210608105656.10795-1-peter.chen@kernel.org
(cherry picked from commit 2a042767)
Link: https://lore.kernel.org/r/20210615080847.GA10432@jackp-linux.qualcomm.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Jack Pham
stable inclusion
from stable-5.10.46
commit e52d43c82f2f6556f0b7a790c19c072c1e99a95f
bugzilla: 168323
CVE: NA

--------------------------------

commit 8d396bb0 upstream.

The DWC3 DebugFS directory and files are currently created once during probe. This includes creation of subdirectories for each of the gadget's endpoints. This works fine for peripheral-only controllers, as dwc3_core_init_mode() calls dwc3_gadget_init() just prior to calling dwc3_debugfs_init().

However, for dual-role controllers, dwc3_core_init_mode() will instead call dwc3_drd_init() which is problematic in a few ways. First, the initial state must be determined, then dwc3_set_mode() will have to schedule drd_work and by then dwc3_debugfs_init() could have already been invoked. Even if the initial mode is peripheral, dwc3_gadget_init() happens after the DebugFS files are created, and worse so if the initial state is host and the controller switches to peripheral much later. And secondly, even if the gadget endpoints' debug entries were successfully created, if the controller exits peripheral mode, its dwc3_eps are freed so the debug files would now hold stale references.

So it is best if the DebugFS endpoint entries are created and removed dynamically at the same time the underlying dwc3_eps are. Do this by calling dwc3_debugfs_create_endpoint_dir() as each endpoint is created, and conversely remove the DebugFS entry when the endpoint is freed.

Fixes: 41ce1456 ("usb: dwc3: core: make dwc3_set_mode() work properly")
Cc: stable <stable@vger.kernel.org>
Reviewed-by: Peter Chen <peter.chen@kernel.org>
Signed-off-by: Jack Pham <jackp@codeaurora.org>
Link: https://lore.kernel.org/r/20210529192932.22912-1-jackp@codeaurora.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Arnaldo Carvalho de Melo
stable inclusion
from stable-5.10.46
commit 1b5fbb66182f5cab525be163327ce1a1fdbb9f15
bugzilla: 168323
CVE: NA

--------------------------------

commit ef83f9ef upstream.

To pick the changes in:

  ea6932d7 ("net: make get_net_ns return error if NET_NS is disabled")

That doesn't result in any changes in the tables generated from that header.

This silences this perf build warning:

  Warning: Kernel ABI header at 'tools/perf/trace/beauty/include/linux/socket.h' differs from latest version at 'include/linux/socket.h'
  diff -u tools/perf/trace/beauty/include/linux/socket.h include/linux/socket.h

Cc: Changbin Du <changbin.du@intel.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Committed by Arnaldo Carvalho de Melo
stable inclusion
from stable-5.10.46
commit 69371e0482ea3a39484642e8d29c3d51fb26a915
bugzilla: 168323
CVE: NA

--------------------------------

commit 1792a59e upstream.

To pick the changes in:

  32182747 ("icmp: don't send out ICMP messages with a source address of 0.0.0.0")

That doesn't result in any change in tooling, as INADDR_ are not used to generate id->string tables used by 'perf trace'.

This addresses this build warning:

  Warning: Kernel ABI header at 'tools/include/uapi/linux/in.h' differs from latest version at 'include/uapi/linux/in.h'
  diff -u tools/include/uapi/linux/in.h include/uapi/linux/in.h

Cc: David S. Miller <davem@davemloft.net>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-