- 26 Sep, 2021: 40 commits
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add KEY_ALLOC_DOMAIN_* flags so that the key domain tag can be specified at key creation. This is done to separate the key domain setting from the key type. If applied to a keyring, the flag sets the requested domain tag for every key added to that keyring. IMA uses the existing key_type_asymmetric for appraisal, but also has to specify the key domain to bind the appraisal key to the ima namespace.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
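As a rough illustration of how a caller might request a domain tag at key-creation time under this series, a minimal sketch follows. KEY_ALLOC_DOMAIN_IMA is an assumed flag name based on the description above; keyring_alloc() and the other flags are existing kernel API.

    #include <linux/key.h>
    #include <linux/cred.h>
    #include <linux/err.h>

    /* Sketch only: KEY_ALLOC_DOMAIN_IMA is assumed from this patch series. */
    static struct key *ima_ns_alloc_keyring(void)
    {
            /* Keys added to this keyring later inherit the requested domain tag. */
            return keyring_alloc(".ima_ns", GLOBAL_ROOT_UID, GLOBAL_ROOT_GID,
                                 current_cred(),
                                 KEY_POS_SEARCH | KEY_POS_READ | KEY_USR_VIEW,
                                 KEY_ALLOC_NOT_IN_QUOTA | KEY_ALLOC_DOMAIN_IMA,
                                 NULL, NULL);
    }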
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add a domain tag to key_match_data. If set, check the domain tag in the default match function and in the asymmetric key match functions. This allows the key domain tag to be used as a search criterion for the iterative search, not only for the direct lookup that is based on the index key.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add keyring_search_tag(), a version of keyring_search() that allows the key domain tag to be specified.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
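For reference, the existing mainline helper is keyring_search(keyring, type, description, recurse); the tagged variant described above presumably adds a domain-tag argument along these lines (signature assumed for illustration, not taken from the patch):

    /* Assumed prototype, mirroring keyring_search(); the extra parameter
     * carries the key domain tag to match against. */
    key_ref_t keyring_search_tag(key_ref_t keyring,
                                 struct key_type *type,
                                 const char *description,
                                 bool recurse,
                                 struct key_tag *domain_tag);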
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

If a subject-based rule is added to the policy before the user namespace UID mapping is defined, the ID has to be recalculated. This can happen if the new user namespace is created alongside the new ima namespace: the default policy rules are loaded when the first process is born into the new ima namespace, so the user has no chance to define the mapping first. It can also happen for custom policy rules loaded from within the new ima namespace before the mapping is created.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add a function that checks whether the UID map is defined. It will be used by ima to check whether ID remapping in subject-based rules is necessary.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
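A minimal sketch of what such a check could look like. The helper name is hypothetical; uid_map and its extent count are existing fields of struct user_namespace, and an extent count of zero means no mapping has been written yet.

    #include <linux/user_namespace.h>

    /* Hypothetical helper name: returns true once a uid mapping has been
     * written for the namespace, i.e. at least one extent is installed. */
    static inline bool ima_ns_uid_map_defined(const struct user_namespace *user_ns)
    {
            return user_ns->uid_map.nr_extents > 0;
    }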
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Parse the per-namespace ima policy file. The path is passed, as for the root ima namespace, through the ima securityfs 'policy' entry.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add ima securityfs entries to configure, per ima namespace:
- the path to the x509 certificate
- the ima kernel boot parameters

The x509 certificate will be parsed and loaded when the first process is born into the new ima namespace; paths are not validated when written. Kernel boot parameters are pre-parsed and applied when the first process is born into the new namespace.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

It is possible that the user first unshares the ima namespace and then creates a new user namespace using clone3(). In that case the owning user namespace is the newly created one, because it is associated with the first process in the new ima namespace.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Violations are now tracked per namespace.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add "others" permissions to the namespaced ima securityfs entries. This is necessary so that root in the user namespace that is the parent of the given ima namespace has access to the ima-related data. The loosened DAC restrictions are compensated by an extra check for the SYS_ADMIN capability in the ima code. Access is given only to the namespaced data, e.g. the root user in the new ima namespace will see measurement list entries collected for that namespace and not for other existing namespaces. The only exception is made for the admin in the initial user namespace, who has access to all the data.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

To detect ToMToU violations, the reader counter of the given inode is checked. This is not enough, because the reader may exist in a different ima namespace. The per-inode reader counter tracks readers in all ima namespaces, whereas a per-namespace counter is necessary to avoid false positives. Add a new reader counter to the integrity inode cache entry.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Modify the ima securityfs interface so that only measurement list entries that belong to the given ima namespace are visible/counted. The initial ima namespace is an exception: its processes have access to all measurement list entries.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add a new ima-ns template: "d-ng|n-ng|ns".

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Compare the namespace id during the digest entry lookup. Remove digests from the hash table when the namespace is destroyed, but keep the global measurement list entry.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Maintain a per-namespace ima measurement list. It will be used to provide information about the namespace measurements in securityfs and to clean up hash table entries when the namespace is destroyed. The global measurement list remains and is not modified; it is necessary to keep it so that the PCR value can be recreated.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add the ima namespace id to ima_event_data and ima_template_entry. This is done so that template entries can be tracked per ima namespace. The following patches will add new templates that include the namespace id, but the namespace id has to be stored separately so that the namespace functionality is enabled for every template. After kexec, all entries from the old measurement list will be associated with the new root ima namespace. This prevents users in new ima namespaces from accessing the old entries if an ima namespace id is reused.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Set the ima policy per namespace and remove the global settings. Operations on objects may now have an impact in more than one ima namespace, so iterate over all active ima namespaces when necessary. Read-write violations can now happen across namespaces and should be checked in all namespaces for each relevant ima hook. Inform all concerned ima namespaces about actions on an object when the object is freed. E.g. if an object had been appraised in ima_ns_1 and is then modified in ima_ns_2, the appraised flag in ima_ns_1 is cleared and the object will be re-appraised in the ima_ns_1 namespace.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add an iint tree to the ima namespace. Each namespace should track operations on its objects separately. The per-namespace iint tree is not yet used; that will be done in the following patches.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

The inode integrity cache will be maintained per ima namespace. Add new functions that allow callers to specify which iint tree to use.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add an ima namespace pointer to the input parameters of the relevant functions. This is a preparation for policy namespacing; more functions may be modified later, when other aspects of ima are namespaced.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

The IMA subsystem is configured at boot time using kernel command-line parameters, e.g.: ima_policy=tcb|appraise_tcb|secure_boot. The same configuration options should be available for the new ima namespace. Add new functions to parse the configuration string and store the parsed data in the new policy data structures. Don't implement them yet; just add the dummy interface.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Collate the global variables describing the ima policy in one structure and add it to the ima namespace. Collate the setup data (parsed kernel boot parameters) in a separate structure. The per-namespace policy is not yet properly set and is not used; this will be done in the following patches.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

An IMA namespace reference will be required in ima_file_free() to check the policy and find inode integrity data for the correct ima namespace. ima_file_free() is called from __fput(), and __fput() may run after the namespaces have been released in exit_task_namespaces() in do_exit(), so the nsproxy reference cannot be used: it is already set to NULL. This is a preparation for namespacing the policy and the inode integrity data.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

Add a list of the installed ima namespaces. An IMA namespace is considered installed if there is at least one process born in that namespace. This list will be used to check read-write violations and to detect any object-related changes relevant across namespaces.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Krzysztof Struczynski

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I49KW1
CVE: NA

--------------------------------

The IMA namespace wraps the global ima resources in an abstraction, to enable ima to work with containers. Currently, the ima namespace contains no useful data, only a dummy interface. IMA resources related to the different aspects of IMA, namely IMA-audit, IMA-measurement and IMA-appraisal, will be added in the following patches.

The way the ima namespace is created is analogous to the time namespace: the unshare(CLONE_NEWIMA) system call creates a new ima namespace but doesn't assign it to the current process. All children of the process will be born in the new ima namespace, or a process can use the setns() system call to join the new ima namespace. A call to clone3() with CLONE_NEWIMA creates a new namespace, which the new process joins instantly.

This scheme allows the new ima namespace to be configured before any process appears in it; see the userspace sketch below. If the user initially unshares the new ima namespace, ima can be configured using the ima entries in securityfs. If the user calls the clone3() system call directly, the new ima namespace can be configured using the clone arguments. To allow this, new securityfs entries have to be added, and the structures clone_args and kernel_clone_args have to be extended. Early configuration is crucial: the new ima policies must apply to the first process in the new namespace, and the appraisal key has to be loaded beforehand.

Add a new CONFIG_IMA_NS option to the kernel configuration that enables creating new IMA namespaces. IMA namespace functionality is disabled by default.

Signed-off-by: Krzysztof Struczynski <krzysztof.struczynski@huawei.com>
Reviewed-by: Zhang Tianxing <zhangtianxing3@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
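A minimal userspace sketch of the unshare() flow described above. CLONE_NEWIMA is defined by this patch series' uapi headers, not by mainline headers, so the program only builds against a patched kernel tree; everything else is standard libc.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
            /* CLONE_NEWIMA must come from the patched <linux/sched.h>. */
            if (unshare(CLONE_NEWIMA) != 0) {
                    perror("unshare(CLONE_NEWIMA)");
                    return EXIT_FAILURE;
            }

            /* The caller stays in its original ima namespace; only children
             * are born into the new one, so this is the window to configure
             * it (e.g. via the namespaced securityfs entries) before fork. */
            pid_t child = fork();
            if (child == 0) {
                    execlp("sh", "sh", "-c", "echo running in the new ima namespace", (char *)NULL);
                    _exit(127);
            }
            waitpid(child, NULL, 0);
            return EXIT_SUCCESS;
    }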
-
Submitted by Chuck Lever

mainline inclusion
from mainline-5.14-rc2
commit 06147843
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=061478438d04779181c2ce4d7ffeeca343a70a98

-------------------------------------------------

The author of commit b3b64ebd ("mm/page_alloc: do bulk array bounds check after checking populated elements") was possibly confused by the mixture of return values throughout the function. The API contract is clear that the function "Returns the number of pages on the list or array." It does not list zero as a unique return value with a special meaning. Therefore zero is a plausible return value only if @nr_pages is zero or less.

Clean up the return logic to make it clear that the returned value is always the total number of pages in the array/list, not the number of pages that were allocated during this call.

The only change in behavior with this patch is the value returned if prepare_alloc_pages() fails. To match the API contract, the number of pages currently in the array/list is returned in this case.

The call site in __page_pool_alloc_pages_slow() also seems to be confused on this matter. It should be attended to by someone who is familiar with that code.

[mel@techsingularity.net: Return nr_populated if 0 pages are requested]

Link: https://lkml.kernel.org/r/20210713152100.10381-4-mgorman@techsingularity.net
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Cc: Zhang Qiang <Qiang.Zhang@windriver.com>
Cc: Yanfei Xu <yanfei.xu@windriver.com>
Cc: Matteo Croce <mcroce@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 06147843)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
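A caller-side sketch of the contract described above: the return value is the total number of populated entries in the array, whether they were filled by this call or were already present, so partial population and allocation failure are handled the same way (kernel context assumed, simplified).

    /* pages[] may already be partially populated by an earlier call. */
    static unsigned long fill_page_array(struct page **pages, unsigned long nr_pages)
    {
            unsigned long filled = alloc_pages_bulk_array(GFP_KERNEL, nr_pages, pages);

            /* 'filled' is the total populated count, not the count allocated
             * here; fall back to single-page allocations for the remainder. */
            while (filled < nr_pages) {
                    pages[filled] = alloc_page(GFP_KERNEL);
                    if (!pages[filled])
                            break;
                    filled++;
            }
            return filled;
    }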
-
Submitted by Yanfei Xu

mainline inclusion
from mainline-5.14-rc2
commit e5c15cea
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5c15cea339115edf99dc92282865f173cf84510

-------------------------------------------------

If the array passed in is already partially populated, we should return "nr_populated" even when failing at the argument-preparation stage.

Link: https://lkml.kernel.org/r/20210713152100.10381-3-mgorman@techsingularity.net
Signed-off-by: Yanfei Xu <yanfei.xu@windriver.com>
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Link: https://lore.kernel.org/r/20210709102855.55058-1-yanfei.xu@windriver.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e5c15cea)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Mel Gorman

mainline inclusion
from mainline-5.14-rc2
commit 187ad460
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=187ad460b8413e863c951998cb321a117a717868

-------------------------------------------------

Syzbot is reporting potential deadlocks due to pagesets.lock when PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as follows:

  __alloc_pages_bulk()
    local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
    prep_new_page():
      post_alloc_hook():
        set_page_owner():
          __set_page_owner():
            save_stack():
              stack_depot_save():
                alloc_pages():
                  alloc_page_interleave():
                    __alloc_pages():
                      get_page_from_freelist():
                        rmqueue():
                          rmqueue_pcplist():
                            local_lock_irqsave(&pagesets.lock, flags);
                            *** DEADLOCK ***

Zhang, Qiang also reported:

  BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
  in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
  .....
  __dump_stack lib/dump_stack.c:79 [inline]
  dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
  ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
  prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
  __alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
  alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
  alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
  stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
  save_stack+0x15e/0x1e0 mm/page_owner.c:120
  __set_page_owner+0x50/0x290 mm/page_owner.c:181
  prep_new_page mm/page_alloc.c:2445 [inline]
  __alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
  alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
  vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
  __vmalloc_area_node mm/vmalloc.c:2845 [inline]
  __vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
  __vmalloc_node mm/vmalloc.c:2996 [inline]
  vzalloc+0x67/0x80 mm/vmalloc.c:3066

There are a number of ways it could be fixed. The page owner code could be audited to strip GFP flags that allow sleeping, but that would impair the functionality of PAGE_OWNER if allocations fail. The bulk allocator could add a special case to release/reacquire the lock for prep_new_page and look up the PCP after the lock is reacquired, at the cost of performance. The pages requiring prep could be tracked using the least significant bit and looping through the array, although that is more complicated for the list interface. The options are relatively complex and the second one still incurs a performance penalty when PAGE_OWNER is active, so this patch takes the simple approach: disable bulk allocation if PAGE_OWNER is active. The caller will be forced to allocate one page at a time, incurring a performance penalty, but PAGE_OWNER is already a performance penalty. See the sketch of the resulting guard below.

Link: https://lkml.kernel.org/r/20210708081434.GV3840@techsingularity.net
Fixes: dbbee9d5 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Reported-by: Desmond Cheong Zhi Xi <desmondcheongzx@gmail.com>
Reported-by: "Zhang, Qiang" <Qiang.Zhang@windriver.com>
Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
Acked-by: Rafael Aquini <aquini@redhat.com>
Cc: Shuah Khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: tong tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
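The guard boils down to bailing out of the bulk path when page_owner is active. A sketch of the idea, close to but not necessarily verbatim the upstream hunk in __alloc_pages_bulk():

    #ifdef CONFIG_PAGE_OWNER
            /*
             * PAGE_OWNER may recurse into the allocator to save the stack while
             * pagesets.lock is held with IRQs disabled, so force the caller onto
             * the single-page path instead of the bulk allocator.
             */
            if (static_branch_unlikely(&page_owner_inited))
                    goto failed;
    #endif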
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 18bb473e
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

The number of deferred objects might wind up at an absurd number, and it results in clamping of slab objects. That is undesirable for sustaining the workingset.

So shrink deferred objects proportionally to priority and cap nr_deferred to twice the number of cache items; see the arithmetic sketch below. The idea is borrowed from Dave Chinner's patch:
https://lore.kernel.org/linux-xfs/20191031234618.15403-13-david@fromorbit.com/

Tested with kernel build and a vfs-metadata-heavy workload in our production environment; no regression has been spotted so far.

Link: https://lkml.kernel.org/r/20210311190845.9708-14-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Roman Gushchin <guro@fb.com>
Cc: Shakeel Butt <shakeelb@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
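In rough arithmetic, the behaviour described above amounts to scaling the carried-over work by reclaim priority and capping the carry-over. Variable names follow the general shape of do_shrink_slab() and are assumed for illustration, not copied from the diff:

    /* Only a priority-scaled share of the old deferred count contributes to
     * this round's scan target ... */
    total_scan = (nr >> priority) + delta;

    /* ... and whatever could not be scanned is carried over, but never more
     * than twice the number of freeable objects in the cache. */
    next_deferred = max_t(long, nr + delta - scanned, 0);
    next_deferred = min(next_deferred, 2 * freeable);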
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit a178015c
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Now that the shrinker's nr_deferred is per-memcg for memcg-aware shrinkers, add it to the parent's corresponding nr_deferred when the memcg goes offline.

Link: https://lkml.kernel.org/r/20210311190845.9708-13-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 476b30a0
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Now that nr_deferred is available at the per-memcg level for memcg-aware shrinkers, there is no need to allocate shrinker->nr_deferred for such shrinkers anymore.

prealloc_memcg_shrinker() returns -ENOSYS if !CONFIG_MEMCG or if memcg is disabled by the kernel command line; in that case the shrinker's SHRINKER_MEMCG_AWARE flag is cleared. This makes the implementation of this patch simpler; a simplified sketch of the resulting flow follows below.

Link: https://lkml.kernel.org/r/20210311190845.9708-12-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
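A simplified sketch of the registration flow this enables, following the general shape of prealloc_shrinker() in mm/vmscan.c; it is not the verbatim diff, and prealloc_memcg_shrinker() is internal to that file:

    static int prealloc_shrinker_sketch(struct shrinker *shrinker)
    {
            unsigned int size = sizeof(atomic_long_t);

            if (shrinker->flags & SHRINKER_MEMCG_AWARE) {
                    int err = prealloc_memcg_shrinker(shrinker);

                    if (err != -ENOSYS)
                            return err;             /* 0 on success, or a real error */

                    /* !CONFIG_MEMCG or memcg disabled: degrade to a plain shrinker. */
                    shrinker->flags &= ~SHRINKER_MEMCG_AWARE;
            }

            /* Only non-memcg-aware shrinkers still need shrinker-level counters. */
            if (shrinker->flags & SHRINKER_NUMA_AWARE)
                    size *= nr_node_ids;

            shrinker->nr_deferred = kzalloc(size, GFP_KERNEL);
            return shrinker->nr_deferred ? 0 : -ENOMEM;
    }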
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 86750830
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Use the per-memcg nr_deferred for memcg-aware shrinkers. The shrinker's own nr_deferred will still be used in the following cases:
1. Non-memcg-aware shrinkers
2. !CONFIG_MEMCG
3. memcg is disabled by boot parameter

See the helper sketch below.

Link: https://lkml.kernel.org/r/20210311190845.9708-11-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
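The case split above maps naturally onto a small helper. A sketch consistent with the description (function and helper names assumed, simplified):

    static long xchg_nr_deferred_sketch(struct shrinker *shrinker,
                                        struct shrink_control *sc)
    {
            int nid = sc->nid;

            if (!(shrinker->flags & SHRINKER_NUMA_AWARE))
                    nid = 0;

            /* Per-memcg counter only for memcg-aware shrinkers with memcg on. */
            if (sc->memcg && (shrinker->flags & SHRINKER_MEMCG_AWARE))
                    return xchg_nr_deferred_memcg(nid, shrinker, sc->memcg);

            /* Cases 1-3 above: fall back to the shrinker-wide counter. */
            return atomic_long_xchg(&shrinker->nr_deferred[nid], 0);
    }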
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 3c6f17e6
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Currently the number of deferred objects is per shrinker, but some slabs, for example the vfs inode/dentry cache, are per memcg; this results in poor isolation among memcgs. The deferred objects are typically generated by __GFP_NOFS allocations: one memcg with excessive __GFP_NOFS allocations may blow up deferred objects, and then other innocent memcgs may suffer from over-shrink, excessive reclaim latency, etc.

For example, two workloads run in memcgA and memcgB respectively, and the workload in B is a vfs-heavy workload. The workload in A generates excessive deferred objects, and then B's vfs cache might be hit heavily (dropping half of the caches) by B's limit reclaim or by global reclaim.

We observed this in our production environment, which was running a vfs-heavy workload, as shown in the tracing log below:

  <...>-409454 [016] .... 28286961.747146: mm_shrink_slab_start: super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 1 objects to shrink 3641681686040 gfp_flags GFP_HIGHUSER_MOVABLE|__GFP_ZERO pgs_scanned 1 lru_pgs 15721 cache items 246404277 delta 31345 total_scan 123202138
  <...>-409454 [022] .... 28287105.928018: mm_shrink_slab_end: super_cache_scan+0x0/0x1a0 ffff9a83046f3458: nid: 1 unused scan count 3641681686040 new scan count 3641798379189 total_scan 602 last shrinker return val 123186855

The vfs cache to page cache ratio was 10:1 on this machine, and half of the caches were dropped. This also resulted in a significant amount of page cache being dropped due to inode eviction.

Making nr_deferred per-memcg for memcg-aware shrinkers would solve the unfairness and bring better isolation.

The following patch will add nr_deferred to the parent memcg when the memcg goes offline. To preserve nr_deferred when reparenting memcgs to root, the root memcg needs shrinker_info allocated too.

When memcg is not enabled (!CONFIG_MEMCG or memcg disabled), the shrinker's nr_deferred is used. And non-memcg-aware shrinkers use the shrinker's nr_deferred all the time.

Link: https://lkml.kernel.org/r/20210311190845.9708-10-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 41ca668a
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Currently a registered shrinker is indicated by a non-NULL shrinker->nr_deferred. This approach is fine with nr_deferred at the shrinker level, but the following patches will move MEMCG_AWARE shrinkers' nr_deferred to the memcg level, so their shrinker->nr_deferred would always be NULL. This would prevent the shrinkers from unregistering correctly. Use a new SHRINKER_REGISTERED flag to indicate that a shrinker is registered, and remove SHRINKER_REGISTERING since we can check whether a shrinker was registered successfully via the new flag.

Link: https://lkml.kernel.org/r/20210311190845.9708-9-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 468ab843
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

The shrinker_info is dereferenced in a couple of places via rcu_dereference_protected with different calling conventions: for example, using the mem_cgroup_nodeinfo helper or dereferencing memcg->nodeinfo[nid]->shrinker_info directly. A later patch will add more dereference sites. So extract the dereference into a helper to make the code more readable. No functional change.

[akpm@linux-foundation.org: retain rcu_dereference_protected() in free_shrinker_info(), per Hugh]

Link: https://lkml.kernel.org/r/20210311190845.9708-8-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
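The extracted helper is essentially a one-liner wrapping the lockdep-checked dereference; a sketch consistent with the description above (shrinker_rwsem is the existing registration lock in mm/vmscan.c):

    static struct shrinker_info *shrinker_info_protected(struct mem_cgroup *memcg,
                                                         int nid)
    {
            /* Only valid while shrinker_rwsem is held, as lockdep asserts. */
            return rcu_dereference_protected(memcg->nodeinfo[nid]->shrinker_info,
                                             lockdep_is_held(&shrinker_rwsem));
    }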
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit e4262c4f
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

The following patch is going to add nr_deferred into shrinker_map; with that change shrinker_map will no longer contain only the map, so rename it to "memcg_shrinker_info". This should make the patch adding nr_deferred cleaner and more readable and make review easier. Also remove the "memcg_" prefix.

Link: https://lkml.kernel.org/r/20210311190845.9708-7-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 72673e86
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Use kvfree_rcu() to free the old shrinker_maps instead of call_rcu(). We no longer have to define a dedicated callback for call_rcu().

Link: https://lkml.kernel.org/r/20210311190845.9708-6-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
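The before/after shape of the change described here (the old callback name and the rcu_head field name are assumed for the sketch; kvfree_rcu() itself is existing kernel API):

    /* Before: a dedicated callback had to be defined just to kvfree the map. */
    call_rcu(&old->rcu, free_shrinker_map_rcu);

    /* After: kvfree_rcu() frees the kvmalloc'ed map once a grace period has
     * elapsed, using the rcu_head embedded in the structure. */
    kvfree_rcu(old, rcu);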
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit a2fb1261
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Both memcg_shrinker_map_size and shrinker_nr_max are maintained, but the map size can actually be calculated from shrinker_nr_max, so it seems unnecessary to keep both. Remove memcg_shrinker_map_size, since shrinker_nr_max is also used when iterating the bit map.

Link: https://lkml.kernel.org/r/20210311190845.9708-5-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Roman Gushchin <guro@fb.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit d27cf2aa
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

Since memcg_shrinker_map_size can only be changed while holding shrinker_rwsem exclusively, the read side can be protected by holding the read lock, so a dedicated mutex is superfluous. Kirill Tkhai suggested using the write lock since:

* We want the assignment to shrinker_maps to be visible to shrink_slab_memcg().
* The rcu_dereference_protected() dereference in shrink_slab_memcg() would not actually be protected if we used the READ lock in alloc_shrinker_maps().
* The READ lock makes alloc_shrinker_info() racy against memory allocation failure: alloc_shrinker_info()->free_shrinker_info() may free memory right after shrink_slab_memcg() dereferenced it. You may say shrink_slab_memcg()->mem_cgroup_online() protects us from it? Yes, sure, but this is not the thing we want to have to remember in the future, since it spreads modularity.

And a test with a heavy paging workload didn't show that the write lock makes things worse. A simplified sketch of the resulting locking follows below.

Link: https://lkml.kernel.org/r/20210311190845.9708-4-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
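In effect, map (re)allocation now nests under the rwsem that shrinker registration already takes, rather than a private mutex. A simplified sketch; the expand helper name is assumed for illustration:

    /* Writer side: expanding the per-memcg shrinker maps under the write
     * lock keeps shrink_slab_memcg()'s
     * rcu_dereference_protected(..., lockdep_is_held(&shrinker_rwsem))
     * check valid without a dedicated mutex. */
    down_write(&shrinker_rwsem);
    ret = expand_shrinker_maps(new_nr_max);
    up_write(&shrinker_rwsem);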
-
Submitted by Yang Shi

mainline inclusion
from mainline-v5.13-rc1
commit 2bfd3637
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I48N0H
CVE: NA

-------------------------------------------------

The shrinker map management is not purely memcg specific; it is at the intersection between memory cgroups and shrinkers. It is the allocation and assignment of a structure, and the only memcg-specific bit is that the map is stored in a memcg structure. So move the shrinker_maps handling code into vmscan.c for tighter integration with the shrinker code, and remove the "memcg_" prefix. There is no functional change.

Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com
Signed-off-by: Yang Shi <shy828301@gmail.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Acked-by: Roman Gushchin <guro@fb.com>
Reviewed-by: Shakeel Butt <shakeelb@google.com>
Cc: Dave Chinner <david@fromorbit.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts: mm/memcontrol.c
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-