- 28 April 2023, 1 commit
By Ma Wupeng

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA

--------------------------------

Since struct mempolicy is commonly used by external users, use a
wrapper to fix the KABI breakage in struct mempolicy.

Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
- 27 April 2023, 9 commits
By Mathieu Desnoyers
mainline inclusion
from mainline-v6.2-rc1
commit 38ce7c9b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=38ce7c9bdfc228c14d7621ba36d3eebedd9d4f76

--------------------------------

When encountering any vma in the range with policy other than MPOL_BIND
or MPOL_PREFERRED_MANY, an error is returned without issuing a
mpol_put() on the policy just allocated with mpol_dup().  This allows
arbitrary users to leak kernel memory.

Link: https://lkml.kernel.org/r/20221215194621.202816-1-mathieu.desnoyers@efficios.com
Fixes: c6018b4b ("mm/mempolicy: add set_mempolicy_home_node syscall")
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <stable@vger.kernel.org> [5.17+]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
By Feng Tang

mainline inclusion
from mainline-v6.1-rc1
commit d2226ebd
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d2226ebd5484afcf9f9b71b394ec1567a7730eb1

--------------------------------

Muchun Song found that after the MPOL_PREFERRED_MANY policy was
introduced in commit b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY
for multiple preferred nodes"), the semantics of
policy_nodemask_current() for this new policy changed: it returns
'preferred' nodes instead of 'allowed' nodes.

With the changed semantics of policy_nodemask_current(), a task with
MPOL_PREFERRED_MANY policy could fail to get its reservation even
though it can fall back to other nodes (either defined by cpusets or
all online nodes) for that reservation, failing mmap calls
unnecessarily early.

The fix is to not consider MPOL_PREFERRED_MANY for reservations at all,
because it, unlike MPOL_BIND, does not pose any actual hard constraint.

Michal suggested that policy_nodemask_current() is only used by
hugetlb, and could be moved into the hugetlb code with a more explicit
name to enforce the 'allowed' semantics, for which only the MPOL_BIND
policy matters.

apply_policy_zone() is made extern so it can be called from hugetlb
code, and its return value is changed to bool.

[1]. https://lore.kernel.org/lkml/20220801084207.39086-1-songmuchun@bytedance.com/t/

Link: https://lkml.kernel.org/r/20220805005903.95563-1-feng.tang@intel.com
Fixes: b27abacc ("mm/mempolicy: add MPOL_PREFERRED_MANY for multiple preferred nodes")
Signed-off-by: Feng Tang <feng.tang@intel.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ben Widawsky <bwidawsk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Conflicts:
	include/linux/mempolicy.h
	mm/hugetlb.c
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
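For reference, a sketch of the resulting hugetlb-local helper,
paraphrased from the mainline commit (a sketch only; the backported
code may differ slightly):

    static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
    {
    #ifdef CONFIG_NUMA
        struct mempolicy *mpol = get_task_policy(current);

        /* Only enforce MPOL_BIND: it is the one real hard constraint. */
        if (mpol->mode == MPOL_BIND &&
            apply_policy_zone(mpol, gfp_zone(gfp)) &&
            cpuset_nodemask_valid_mems_allowed(&mpol->nodes))
                return &mpol->nodes;
    #endif
        return NULL;
    }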
-
By Aneesh Kumar K.V

mainline inclusion
from mainline-v5.17-rc1
commit c6018b4b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c6018b4b254971863bd0ad36bb5e7d0fa0f0ddb0

--------------------------------

This syscall can be used to set a home node for the MPOL_BIND and
MPOL_PREFERRED_MANY memory policies.  Users should use this syscall
after setting up a memory policy for the specified range, as shown
below:

	mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
	      new_nodes->size + 1, 0);
	sys_set_mempolicy_home_node((unsigned long)p, nr_pages * page_size,
				    home_node, 0);

The syscall allows specifying a home node/preferred node from which the
kernel will fulfill memory allocation requests first.

For an address range with MPOL_BIND memory policy, if the nodemask
specifies more than one node, page allocations will come from the node
in the nodemask with sufficient free memory that is closest to the home
node/preferred node.

For MPOL_PREFERRED_MANY, if the nodemask specifies more than one node,
page allocation will come from the node in the nodemask with sufficient
free memory that is closest to the home node/preferred node.  If there
is not enough memory on all the nodes specified in the nodemask, the
allocation will be attempted from the NUMA node closest to the home
node in the system.

This helps applications hint at a preferred memory allocation node and
fall back to _only_ a set of nodes if the memory is not available on
the preferred node.  The fallback allocation is attempted from the node
nearest to the preferred node.  This gives applications control over
the NUMA nodes used for memory allocation and avoids the default
fallback to slow-memory NUMA nodes.  For example, on a system with NUMA
nodes 1, 2 and 3 backed by DRAM and nodes 10, 11 and 12 backed by slow
memory:

	new_nodes = numa_bitmask_alloc(nr_nodes);

	numa_bitmask_setbit(new_nodes, 1);
	numa_bitmask_setbit(new_nodes, 2);
	numa_bitmask_setbit(new_nodes, 3);

	p = mmap(NULL, nr_pages * page_size, protflag, mapflag, -1, 0);
	mbind(p, nr_pages * page_size, MPOL_BIND, new_nodes->maskp,
	      new_nodes->size + 1, 0);

	sys_set_mempolicy_home_node(p, nr_pages * page_size, 2, 0);

This will allocate from nodes closer to node 2, and will make sure the
kernel only allocates from nodes 1, 2 and 3.  Memory will not be
allocated from slow-memory nodes 10, 11 and 12.  This differs from the
default MPOL_BIND behavior, where the allocation would be attempted
from the node closest to the local node.  One of the reasons to specify
a home node is to allow allocations from a CPU-less NUMA node and its
nearby NUMA nodes.

With MPOL_PREFERRED_MANY, on the other hand, the kernel will first try
to allocate from the node closest to node 2 out of the node list 1, 2
and 3.  If those nodes don't have enough memory, the kernel will
allocate from whichever of the slow-memory nodes 10, 11 and 12 is
closest to node 2.

Link: https://lkml.kernel.org/r/20211202123810.267175-3-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
	include/linux/mempolicy.h
	mm/mempolicy.c
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
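A complete, hedged userspace sketch of the flow above, without libnuma,
using a raw bitmask and the raw syscall.  Assumptions to verify against
your headers: the glibc wrapper may not exist, and 450 is the generic
syscall number for set_mempolicy_home_node on recent kernels.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <numaif.h>

    #ifndef __NR_set_mempolicy_home_node
    #define __NR_set_mempolicy_home_node 450   /* assumption: generic table */
    #endif

    int main(void)
    {
        long page_size = sysconf(_SC_PAGESIZE);
        size_t len = 64 * page_size;
        /* nodes 1, 2 and 3, as in the example above */
        unsigned long nodes = (1UL << 1) | (1UL << 2) | (1UL << 3);

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
                return 1;

        /* Hard-restrict the range to nodes 1-3 ... */
        if (mbind(p, len, MPOL_BIND, &nodes, 8 * sizeof(nodes), 0))
                perror("mbind");

        /* ... then ask the kernel to start allocations near node 2. */
        if (syscall(__NR_set_mempolicy_home_node,
                    (unsigned long)p, len, 2, 0))
                perror("set_mempolicy_home_node");

        return 0;
    }

Compile with -lnuma for the mbind() wrapper from <numaif.h>.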
-
By Aneesh Kumar K.V

mainline inclusion
from mainline-v5.17-rc1
commit c0455116
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c04551162167368022a61899843821bbf015b473

--------------------------------

Patch series "mm: add new syscall set_mempolicy_home_node", v6.

This patch (of 3):

A followup patch will enable setting a home node with the
MPOL_PREFERRED_MANY memory policy.  To facilitate that, switch to using
the policy_node() helper.  There is no functional change in this patch.

Link: https://lkml.kernel.org/r/20211202123810.267175-1-aneesh.kumar@linux.ibm.com
Link: https://lkml.kernel.org/r/20211202123810.267175-2-aneesh.kumar@linux.ibm.com
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Feng Tang <feng.tang@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: <linux-api@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
By Ben Widawsky

mainline inclusion
from mainline-v5.15-rc1
commit a38a59fd
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a38a59fdfa10be55d08e4530923d950e739ac6a2

--------------------------------

Adds a new mode to the existing mempolicy modes, MPOL_PREFERRED_MANY.

MPOL_PREFERRED_MANY will be adequately documented in the internal
admin-guide with this patch.  Eventually, the man pages for mbind(2),
get_mempolicy(2), set_mempolicy(2) and numactl(8) will also have text
about this mode.  Those shall contain the canonical reference.

NUMA systems continue to become more prevalent.  New technologies like
PMEM make finer-grained control over memory access patterns
increasingly desirable.  MPOL_PREFERRED_MANY allows userspace to
specify a set of nodes that will be tried first when performing
allocations.  If those allocations fail, all remaining nodes will be
tried.  It's a straightforward API which solves many of the presumptive
needs of system administrators wanting to optimize workloads on such
machines.  The mode will work either per VMA or per thread.

[Michal Hocko: refine kernel doc for MPOL_PREFERRED_MANY]

Link: https://lore.kernel.org/r/20200630212517.308045-13-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-5-git-send-email-feng.tang@intel.com
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
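A minimal, hedged sketch of the new mode from userspace.  It assumes
MPOL_PREFERRED_MANY is exposed by the installed headers; the raw value
5 below matches the upstream UAPI but should be verified locally.

    #include <numaif.h>

    #ifndef MPOL_PREFERRED_MANY
    #define MPOL_PREFERRED_MANY 5   /* assumption: upstream uapi value */
    #endif

    /*
     * Prefer nodes 0 and 2; unlike MPOL_BIND, allocations fall back to
     * the remaining nodes instead of invoking the OOM killer or
     * raising SIGSEGV.
     */
    static int prefer_nodes_0_and_2(void)
    {
        unsigned long nodes = (1UL << 0) | (1UL << 2);

        return set_mempolicy(MPOL_PREFERRED_MANY, &nodes,
                             8 * sizeof(nodes));
    }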
-
By Feng Tang

mainline inclusion
from mainline-v5.15-rc1
commit 4c54d949
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c54d94908e089e9741513797eac30a8b8217034

--------------------------------

The semantics of MPOL_PREFERRED_MANY are similar to MPOL_PREFERRED: it
will first try to allocate memory from the preferred node(s), and fall
back to all nodes in the system when the first try fails.

Add a dedicated function, alloc_pages_preferred_many(), for it, just
like for the 'interleave' policy.  It will be used by the two general
memory allocation APIs: alloc_pages() and alloc_pages_vma().

Link: https://lore.kernel.org/r/20200630212517.308045-9-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-3-git-send-email-feng.tang@intel.com
Suggested-by: Michal Hocko <mhocko@suse.com>
Originally-by: Ben Widawsky <ben.widawsky@intel.com>
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
By Dave Hansen

mainline inclusion
from mainline-v5.15-rc1
commit b27abacc
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b27abaccf8e8b012f126da0c2a1ab32723ec8b9f

--------------------------------

Patch series "Introduce multi-preference mempolicy", v7.

This patch series introduces the concept of the MPOL_PREFERRED_MANY
mempolicy.  This mempolicy mode can be used with either the
set_mempolicy(2) or mbind(2) interfaces.  Like the MPOL_PREFERRED
interface, it allows an application to set a preference for nodes which
will fulfil memory allocation requests.  Unlike the MPOL_PREFERRED
mode, it takes a set of nodes.  Like the MPOL_BIND interface, it works
over a set of nodes.  Unlike MPOL_BIND, it will not cause a SIGSEGV or
invoke the OOM killer if those preferred nodes are not available.

Along with these patches are patches for libnuma, numactl, numademo,
and memhog.  They still need some polish, but can be found here:
https://gitlab.com/bwidawsk/numactl/-/tree/prefer-many
It allows new usage: `numactl -P 0,3,4`

The goal of the new mode is to enable some use-cases when using tiered
memory usage models, which I've lovingly named:

1a. The Hare - The interconnect is fast enough to meet bandwidth and
    latency requirements allowing preference to be given to all nodes
    with "fast" memory.
1b. The Indiscriminate Hare - An application knows it wants fast memory
    (or perhaps slow memory), but doesn't care which node it runs on.
    The application can prefer a set of nodes and then xpu bind to the
    local node (cpu, accelerator, etc).  This reverses how nodes are
    chosen today, where the kernel attempts to use memory local to the
    CPU whenever possible.  This will instead attempt to use the
    accelerator local to the memory.
2.  The Tortoise - The administrator (or the application itself) is
    aware it only needs slow memory, and so can prefer that.

Much of this is almost achievable with the bind interface, but the bind
interface suffers from an inability to fall back to another set of
nodes if binding fails to all nodes in the nodemask.

Like MPOL_BIND, a nodemask is given.  Inherently this removes ordering
from the preference.

	/* Set first two nodes as preferred in an 8 node system. */
	const unsigned long nodes = 0x3;
	set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);

	/* Mimic interleave policy, but have fallback. */
	const unsigned long nodes = 0xaa;
	set_mempolicy(MPOL_PREFER_MANY, &nodes, 8);

Some internal discussion took place around the interface.  There are
two alternatives which we have discussed, plus one I stuck in:

1. Ordered list of nodes.  Currently it's believed that the added
   complexity is not needed for expected use-cases.
2. A flag for bind to allow falling back to other nodes.  This confuses
   the notion of binding and is less flexible than the current
   solution.
3. Create flags or new modes that help with some ordering.  This offers
   both a friendlier API as well as a solution for more customized
   usage.  It's unknown if it's worth the complexity to support this.
   Here is sample code for how this might work:

	// Prefer specific nodes for something wacky
	set_mempolicy(MPOL_PREFER_MANY, 0x17c, 1024);

	// Default
	set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_SOCKET, NULL, 0);
	// which is the same as
	set_mempolicy(MPOL_DEFAULT, NULL, 0);

	// The Hare
	set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, NULL, 0);

	// The Tortoise
	set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE_REV, NULL, 0);

	// Prefer the fast memory of the first two sockets
	set_mempolicy(MPOL_PREFER_MANY | MPOL_F_PREFER_ORDER_TYPE, -1, 2);

This patch (of 5):

The NUMA APIs currently allow passing in a "preferred node" as a single
bit set in a nodemask.  If more than one bit is set, bits after the
first are ignored.

This single node is generally OK for location-based NUMA where memory
being allocated will eventually be operated on by a single CPU.
However, in systems with multiple memory types, folks want to target a
*type* of memory instead of a location.  For instance, someone might
want some high-bandwidth memory but does not care about the CPU next to
which it is allocated.  Or, they want a cheap, high-capacity allocation
and want to target all NUMA nodes which have persistent memory in
volatile mode.  In both of these cases, the application wants to target
a *set* of nodes, but does not want strict MPOL_BIND behavior, as that
could lead to the OOM killer or SIGSEGV.

So add the MPOL_PREFERRED_MANY policy to support the
multiple-preferred-nodes requirement.  This is not a pie-in-the-sky
dream for an API.  This was a response to a specific ask of more than
one group at Intel.  Specifically:

1. There are existing libraries that target memory types such as
   https://github.com/memkind/memkind.  These are known to suffer from
   SIGSEGVs when memory is low on targeted memory "kinds" that span
   more than one node.  The MCDRAM on a Xeon Phi in "Cluster on Die"
   mode is an example of this.
2. Volatile-use persistent memory users want to have a memory policy
   which is targeted at either "cheap and slow" (PMEM) or "expensive
   and fast" (DRAM).  However, they do not want to experience
   allocation failures when the targeted type is unavailable.
3. Allocate-then-run.  Generally, we let the process scheduler decide
   on which physical CPU to run a task.  That location provides a
   default allocation policy, and memory availability is not generally
   considered when placing tasks.  For situations where memory is
   valuable and constrained, some users want to allocate memory first,
   *then* allocate close compute resources to the allocation.  This is
   the reverse of the normal (CPU) model.  Accelerators such as GPUs
   that operate on core-mm-managed memory are interested in this model.

A check is added in sanitize_mpol_flags() to not permit the
'prefer_many' policy to be used for now; it will be removed in a later
patch after all implementations for 'prefer_many' are ready, as
suggested by Michal Hocko.

[mhocko@kernel.org: suggest to refine policy_node/policy_nodemask handling]

Link: https://lkml.kernel.org/r/1627970362-61305-1-git-send-email-feng.tang@intel.com
Link: https://lore.kernel.org/r/20200630212517.308045-4-ben.widawsky@intel.com
Link: https://lkml.kernel.org/r/1627970362-61305-2-git-send-email-feng.tang@intel.com
Co-developed-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Ben Widawsky <ben.widawsky@intel.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
	mm/mempolicy.c
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
By Feng Tang

mainline inclusion
from mainline-v5.14-rc1
commit 95837924
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=95837924587c60425f941dc8cbfba61cb964fcb5

--------------------------------

Currently kernel_mbind() and kernel_set_mempolicy() do almost the same
parameter sanity checks.  Add a helper function to unify the code,
reduce the redundancy, and make it easier to change the sanity check
code in the future.

[thanks to David Rientjes for suggesting using a helper function
instead of a macro]
[feng.tang@intel.com: add comment]

Link: https://lkml.kernel.org/r/1622560492-1294-4-git-send-email-feng.tang@intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-4-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
By Feng Tang

mainline inclusion
from mainline-v5.14-rc1
commit 7858d7bc
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I6I1Z2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7858d7bca7fbbbbd5b940d2ec371b2d060b21b84

--------------------------------

The MPOL_LOCAL policy has been set up as a real policy, but it is still
handled like a fake MPOL_PREFERRED policy with the internal
MPOL_F_LOCAL flag bit set, and there are many places that have to judge
between the real 'prefer' and the 'local' policy, which is quite
confusing.

In the current code, there are 4 cases where MPOL_LOCAL is used:

1. the user specifies the 'local' policy
2. the user specifies the 'prefer' policy, but with an empty nodemask
3. the system 'default' policy is used
4. 'prefer' policy + a valid 'preferred' node with the
   MPOL_F_STATIC_NODES flag set; when it is 'rebind' to a nodemask
   which doesn't contain the 'preferred' node, it will behave as the
   'local' policy

So make 'local' a real policy instead of a fake 'prefer' one, and kill
the MPOL_F_LOCAL bit, which greatly reduces the confusion when reading
the code.

For case 4, the logic of mpol_rebind_preferred() is confusing, as
Michal Hocko pointed out:

: I do believe that rebinding preferred policy is just bogus and it
: should be dropped altogether on the ground that a preference is a mere
: hint from userspace where to start the allocation.  Unless I am
: missing something cpusets will be always authoritative for the final
: placement.  The preferred node just acts as a starting point and it
: should be really preserved when cpusets changes.  Otherwise we have a
: very subtle behavior corner cases.

So dump all the tricky transformations between 'prefer' and 'local',
and just record the new nodemask of the rebinding.

[feng.tang@intel.com: fix a problem in mpol_set_nodemask(), per Michal Hocko]
Link: https://lkml.kernel.org/r/1622560492-1294-3-git-send-email-feng.tang@intel.com
[feng.tang@intel.com: refine code and comments of mpol_set_nodemask(), per Michal]
Link: https://lkml.kernel.org/r/20210603081807.GE56979@shbuild999.sh.intel.com
Link: https://lkml.kernel.org/r/1622469956-82897-3-git-send-email-feng.tang@intel.com
Signed-off-by: Feng Tang <feng.tang@intel.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Ben Widawsky <ben.widawsky@intel.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Ma Wupeng <mawupeng1@huawei.com>
-
- 10 November 2022, 1 commit
By Wang Cheng

stable inclusion
from stable-v5.10.134
commit ddb3f0b68863bd1c5f43177eea476bce316d4993
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5ZVR7
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ddb3f0b68863bd1c5f43177eea476bce316d4993

--------------------------------

commit 018160ad upstream.

mpol_set_nodemask() (mm/mempolicy.c) does not set up the nodemask when
pol->mode is MPOL_LOCAL.  Check pol->mode before accessing
pol->w.cpuset_mems_allowed in mpol_rebind_policy() (mm/mempolicy.c).

  BUG: KMSAN: uninit-value in mpol_rebind_policy mm/mempolicy.c:352 [inline]
  BUG: KMSAN: uninit-value in mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
   mpol_rebind_policy mm/mempolicy.c:352 [inline]
   mpol_rebind_task+0x2ac/0x2c0 mm/mempolicy.c:368
   cpuset_change_task_nodemask kernel/cgroup/cpuset.c:1711 [inline]
   cpuset_attach+0x787/0x15e0 kernel/cgroup/cpuset.c:2278
   cgroup_migrate_execute+0x1023/0x1d20 kernel/cgroup/cgroup.c:2515
   cgroup_migrate kernel/cgroup/cgroup.c:2771 [inline]
   cgroup_attach_task+0x540/0x8b0 kernel/cgroup/cgroup.c:2804
   __cgroup1_procs_write+0x5cc/0x7a0 kernel/cgroup/cgroup-v1.c:520
   cgroup1_tasks_write+0x94/0xb0 kernel/cgroup/cgroup-v1.c:539
   cgroup_file_write+0x4c2/0x9e0 kernel/cgroup/cgroup.c:3852
   kernfs_fop_write_iter+0x66a/0x9f0 fs/kernfs/file.c:296
   call_write_iter include/linux/fs.h:2162 [inline]
   new_sync_write fs/read_write.c:503 [inline]
   vfs_write+0x1318/0x2030 fs/read_write.c:590
   ksys_write+0x28b/0x510 fs/read_write.c:643
   __do_sys_write fs/read_write.c:655 [inline]
   __se_sys_write fs/read_write.c:652 [inline]
   __x64_sys_write+0xdb/0x120 fs/read_write.c:652
   do_syscall_x64 arch/x86/entry/common.c:51 [inline]
   do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
   entry_SYSCALL_64_after_hwframe+0x44/0xae

  Uninit was created at:
   slab_post_alloc_hook mm/slab.h:524 [inline]
   slab_alloc_node mm/slub.c:3251 [inline]
   slab_alloc mm/slub.c:3259 [inline]
   kmem_cache_alloc+0x902/0x11c0 mm/slub.c:3264
   mpol_new mm/mempolicy.c:293 [inline]
   do_set_mempolicy+0x421/0xb70 mm/mempolicy.c:853
   kernel_set_mempolicy mm/mempolicy.c:1504 [inline]
   __do_sys_set_mempolicy mm/mempolicy.c:1510 [inline]
   __se_sys_set_mempolicy+0x44c/0xb60 mm/mempolicy.c:1507
   __x64_sys_set_mempolicy+0xd8/0x110 mm/mempolicy.c:1507
   do_syscall_x64 arch/x86/entry/common.c:51 [inline]
   do_syscall_64+0x54/0xd0 arch/x86/entry/common.c:82
   entry_SYSCALL_64_after_hwframe+0x44/0xae

KMSAN: uninit-value in mpol_rebind_task (2)
https://syzkaller.appspot.com/bug?id=d6eb90f952c2a5de9ea718a1b873c55cb13b59dc

This patch seems to fix the below bug too:

KMSAN: uninit-value in mpol_rebind_mm (2)
https://syzkaller.appspot.com/bug?id=f2fecd0d7013f54ec4162f60743a2b28df40926b

The uninit-value is pol->w.cpuset_mems_allowed in mpol_rebind_policy().
When the syzkaller reproducer runs to the beginning of mpol_new(),

	    mpol_new() mm/mempolicy.c
	  do_mbind() mm/mempolicy.c
	kernel_mbind() mm/mempolicy.c

`mode` is 1 (MPOL_PREFERRED), nodes_empty(*nodes) is `true` and `flags`
is 0.  Then

	mode = MPOL_LOCAL;
	...
	policy->mode = mode;
	policy->flags = flags;

will be executed.  So in mpol_set_nodemask(),

	    mpol_set_nodemask() mm/mempolicy.c
	  do_mbind()
	kernel_mbind()

pol->mode is 4 (MPOL_LOCAL), so the `nodemask` in `pol` is not
initialized, and it will be accessed in mpol_rebind_policy().

Link: https://lkml.kernel.org/r/20220512123428.fq3wofedp6oiotd4@ppc.localdomain
Signed-off-by: Wang Cheng <wanngchenng@gmail.com>
Reported-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
Tested-by: <syzbot+217f792c92599518a2ab@syzkaller.appspotmail.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
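The shape of the upstream fix, paraphrased as a sketch (not the exact
backported diff): bail out before pol->w.cpuset_mems_allowed is ever
read when the policy is MPOL_LOCAL.

    static void mpol_rebind_policy(struct mempolicy *pol,
                                   const nodemask_t *newmask)
    {
        /* new: MPOL_LOCAL never initialized pol->w, so skip it */
        if (!pol || pol->mode == MPOL_LOCAL)
                return;
        if (!mpol_store_user_nodemask(pol) &&
            nodes_equal(pol->w.cpuset_mems_allowed, *newmask))
                return;

        mpol_ops[pol->mode].rebind(pol, newmask);
    }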
-
- 18 July 2022, 1 commit
By Miaohe Lin

stable inclusion
from stable-v5.10.111
commit f7e183b0a7136b6dc9c7b9b2a85a608a8feba894
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5GL1Z
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=f7e183b0a7136b6dc9c7b9b2a85a608a8feba894

--------------------------------

commit 4ad09955 upstream.

If mpol_new is allocated but not used in the restart loop, mpol_new
will be freed via mpol_put() before returning to the caller.  But its
refcnt is not initialized yet, so mpol_put() cannot do the right thing
and might leak the unused mpol_new.  This would happen if the mempolicy
was updated on the shared shmem file while sp->lock had been dropped
during the memory allocation.

This issue could be triggered easily with the below code snippet if
there are many processes doing the below work at the same time:

	shmid = shmget((key_t)5566, 1024 * PAGE_SIZE, 0666|IPC_CREAT);
	shm = shmat(shmid, 0, 0);
	loop many times {
		mbind(shm, 1024 * PAGE_SIZE, MPOL_LOCAL, mask, maxnode, 0);
		mbind(shm + 128 * PAGE_SIZE, 128 * PAGE_SIZE, MPOL_DEFAULT,
		      mask, maxnode, 0);
	}

Link: https://lkml.kernel.org/r/20220329111416.27954-1-linmiaohe@huawei.com
Fixes: 42288fe3 ("mm: mempolicy: Convert shared_policy mutex to spinlock")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: <stable@vger.kernel.org> [3.8]
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
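The shape of the fix, sketched with the surrounding code elided:
initialize the refcount right after allocation so an unused mpol_new is
actually released by mpol_put() on the restart path.

    alloc_new:
        ...
        mpol_new = kmem_cache_alloc(policy_cache, GFP_KERNEL);
        if (!mpol_new)
                goto err_out;
        atomic_set(&mpol_new->refcnt, 1);   /* new: make mpol_put() work */
        goto restart;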
-
- 6 July 2022, 1 commit
By Hugh Dickins

stable inclusion
from stable-v5.10.110
commit 4bcefc78c87409da495eda4afe12b37ef5aa9ea1
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AL
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=4bcefc78c87409da495eda4afe12b37ef5aa9ea1

--------------------------------

commit 4e090600 upstream.

v2.6.34 commit 9d8cebd4 ("mm: fix mbind vma merge problem") introduced
vma_merge() to mbind_range(); but unlike madvise, mlock and mprotect,
it put a "continue" to the next vma where its precedents go on to
update flags on the current vma before advancing: that left the vma
with the wrong setting in the infamous vma_merge() case 8.

v3.10 commit 1444f92c ("mm: merging memory blocks resets mempolicy")
tried to fix that in vma_adjust(), without fully understanding the
issue.

v3.11 commit 3964acd0 ("mm: mempolicy: fix mbind_range() &&
vma_adjust() interaction") reverted that, and went about the fix in the
right way, but chose to optimize out an unnecessary mpol_dup() with a
prior mpol_equal() test.  But on tmpfs, that also pessimized out the
vital call to its ->set_policy(), leaving the new mbind unenforced.

The user-visible effect was that the pages got allocated on the local
node (which happened to be 0), after the mbind() caller had
specifically asked for them to be allocated on node 1.  There was not
any page migration involved in the case reported: the pages simply got
allocated on the wrong node.

Just delete that optimization now (though it could be made conditional
on the vma not having a set_policy).  Also remove the "next" variable:
it turned out to be blameless, but also pointless.

Link: https://lkml.kernel.org/r/319e4db9-64ae-4bca-92f0-ade85d342ff@google.com
Fixes: 3964acd0 ("mm: mempolicy: fix mbind_range() && vma_adjust() interaction")
Signed-off-by: Hugh Dickins <hughd@google.com>
Acked-by: Oleg Nesterov <oleg@redhat.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 7 June 2022, 1 commit
By Zhang Jian

hulk inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I54IL8
CVE: NA
backport: openEuler-22.03-LTS

-----------------------------

When passing a NUMA node id to sp_alloc(), the node id sometimes does
not take effect.  This is because the memory policy, if one is set,
will change the node id to a preferred one.  Fix the error by
mbind()ing the virtual address to the desired NUMA node.

Signed-off-by: Zhang Jian <zhangjian210@huawei.com>
Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com>
Signed-off-by: Wang Wensheng <wangwensheng4@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 21 April 2022, 1 commit
By yanghui

mainline inclusion
from mainline-v5.15-rc1
commit 276aeee1
category: bugfix
bugzilla: 181417, https://gitee.com/openeuler/kernel/issues/I53CSV
backport: openEuler-22.03-LTS
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=276aeee1c5fc00df700f0782060beae126600472

-------------------------------

Servers hit the panic below (kernel version 5.4.56):

  BUG: unable to handle page fault for address: 0000000000002c48
  RIP: 0010:__next_zones_zonelist+0x1d/0x40
  Call Trace:
   __alloc_pages_nodemask+0x277/0x310
   alloc_page_interleave+0x13/0x70
   handle_mm_fault+0xf99/0x1390
   __do_page_fault+0x288/0x500
   do_page_fault+0x30/0x110
   page_fault+0x3e/0x50

The reason for the panic is that MAX_NUMNODES is passed as the third
parameter (preferred_nid) to __alloc_pages_nodemask().  The access to
zonelist->zoneref->zone_idx in __next_zones_zonelist() then causes the
panic.

In offset_il_node(), first_node() returns a nid from pol->v.nodes;
after this, other threads may change pol->v.nodes before next_node() is
called.  This race condition can make next_node() return MAX_NUMNODES.
So put pol->nodes in a local variable.

The race is between offset_il_node() and cpuset_change_task_nodemask():

  CPU0:                                 CPU1:
  alloc_pages_vma()
    interleave_nid(pol,)
      offset_il_node(pol,)
        first_node(pol->v.nodes)        cpuset_change_task_nodemask
                        //nodes==0xc      mpol_rebind_task
                                            mpol_rebind_policy
                                              mpol_rebind_nodemask(pol,nodes)
                        //nodes==0x3
        next_node(nid, pol->v.nodes)    //return MAX_NUMNODES

Link: https://lkml.kernel.org/r/20210906034658.48721-1-yanghui.def@bytedance.com
Signed-off-by: yanghui <yanghui.def@bytedance.com>
Reviewed-by: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 276aeee1)
Conflicts:
	mm/mempolicy.c
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
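A hedged sketch of the fix (field naming per mainline, where the mask
is pol->nodes; older kernels use pol->v.nodes): snapshot the nodemask
once so first_node() and next_node() operate on the same, stable mask.

    static unsigned offset_il_node(struct mempolicy *pol, unsigned long n)
    {
        nodemask_t nodemask = pol->nodes;   /* local snapshot */
        unsigned int target, nnodes;
        int i, nid;

        /* stabilize the local copy against concurrent rebinds */
        barrier();

        nnodes = nodes_weight(nodemask);
        if (!nnodes)
                return numa_node_id();
        target = (unsigned int)n % nnodes;
        nid = first_node(nodemask);
        for (i = 0; i < target; i++)
                nid = next_node(nid, nodemask);
        return nid;
    }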
-
- 26 January 2022, 1 commit
By Andrey Ryabinin

stable inclusion
from stable-v5.10.89
commit ee6f34215c5dfa2257298cc362cd79e14af5a25a
bugzilla: 186140, https://gitee.com/openeuler/kernel/issues/I4S8HA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ee6f34215c5dfa2257298cc362cd79e14af5a25a

--------------------------------

alloc_pages_vma() may try to allocate a THP page on the local NUMA node
first:

	page = __alloc_pages_node(hpage_node,
		gfp | __GFP_THISNODE | __GFP_NORETRY, order);

and if the allocation fails it retries allowing remote memory:

	if (!page && (gfp & __GFP_DIRECT_RECLAIM))
		page = __alloc_pages_node(hpage_node, gfp, order);

However, this retry allocation completely ignores the memory policy
nodemask, allowing allocation to escape restrictions.

The first appearance of this bug seems to be commit ac5b2c18 ("mm: thp:
relax __GFP_THISNODE for MADV_HUGEPAGE mappings").  The bug disappeared
later in commit 89c83fb5 ("mm, thp: consolidate THP gfp handling into
alloc_hugepage_direct_gfpmask") and reappeared again in a slightly
different form in commit 76e654cc ("mm, page_alloc: allow hugepage
fallback to remote nodes when madvised").

Fix this by passing the correct nodemask to the __alloc_pages() call.

The demonstration/reproducer of the problem:

	$ mount -oremount,size=4G,huge=always /dev/shm/
	$ echo always > /sys/kernel/mm/transparent_hugepage/defrag
	$ cat mbind_thp.c
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/stat.h>
	#include <fcntl.h>
	#include <assert.h>
	#include <stdlib.h>
	#include <stdio.h>
	#include <numaif.h>

	#define SIZE 2ULL << 30
	int main(int argc, char **argv)
	{
		int fd;
		unsigned long long i;
		char *addr;
		pid_t pid;
		char buf[100];
		unsigned long nodemask = 1;

		fd = open("/dev/shm/test", O_RDWR|O_CREAT);
		assert(fd > 0);
		assert(ftruncate(fd, SIZE) == 0);

		addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
			    MAP_SHARED, fd, 0);
		assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2,
			     MPOL_MF_STRICT|MPOL_MF_MOVE) == 0);

		for (i = 0; i < SIZE; i += 4096)
			addr[i] = 1;

		pid = getpid();
		snprintf(buf, sizeof(buf),
			 "grep shm /proc/%d/numa_maps", pid);
		system(buf);
		sleep(10000);
		return 0;
	}
	$ gcc mbind_thp.c -o mbind_thp -lnuma
	$ numactl -H
	available: 2 nodes (0-1)
	node 0 cpus: 0 2
	node 0 size: 1918 MB
	node 0 free: 1595 MB
	node 1 cpus: 1 3
	node 1 size: 2014 MB
	node 1 free: 1731 MB
	node distances:
	node   0   1
	  0:  10  20
	  1:  20  10
	$ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp
	7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4

Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com
Fixes: ac5b2c18 ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
Signed-off-by: Andrey Ryabinin <arbn@yandex-team.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Acked-by: David Rientjes <rientjes@google.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
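The shape of the fix, sketched (mainline signature shown; the 5.10
backport achieves the same effect via __alloc_pages_nodemask()):

    nmask = policy_nodemask(gfp, pol);
    page = __alloc_pages_node(hpage_node,
                              gfp | __GFP_THISNODE | __GFP_NORETRY, order);
    if (!page && (gfp & __GFP_DIRECT_RECLAIM))
            /* was: __alloc_pages_node(hpage_node, gfp, order) */
            page = __alloc_pages(gfp, order, hpage_node, nmask);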
-
- 29 November 2021, 3 commits
By Lijun Fang

ascend inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
CVE: NA

-----------------

CDM nodes should not be part of mems_allowed.  However, allocating from
a CDM node must still be allowed when mpol->mode is MPOL_BIND.

Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Anshuman Khandual

ascend inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
CVE: NA

-------------------

CDM nodes need a mechanism for explicit memory allocation from user
space.  After the previous FALLBACK zonelist rebuilding changes, an
mbind(MPOL_BIND) based allocation request targeting a CDM node fails.
This is because the FALLBACK zonelist of the requesting local node is
selected for further nodemask processing targeted at the MPOL_BIND
implementation.  As the CDM node's zones are not part of any other
regular node's FALLBACK zonelist, the allocation simply fails without
finding any valid zone.

The allocation-requesting node is always going to be different from the
CDM node, which does not have any CPU.  Hence the MPOL_BIND
implementation must choose the given CDM node's FALLBACK zonelist
instead of the requesting local node's FALLBACK zonelist.  This
implements that change.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Anshuman Khandual

ascend inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4JMLR
CVE: NA

-------------------

Mark all the applicable VMAs with VM_CDM explicitly during an
mbind(MPOL_BIND) call if the user-provided nodemask has a CDM node.
Also mark the corresponding VMA with the VM_CDM flag if the allocated
page happens to come from a CDM node.  This can be expensive from a
performance standpoint.  There are multiple checks to avoid an
expensive page_to_nid() lookup, but it can be optimized further.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
Signed-off-by: Lijun Fang <fanglijun3@huawei.com>
Reviewed-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 14 July 2021, 6 commits
By Wei Li
hulk inclusion
category: feature
bugzilla: 169576
CVE: NA

-------------------------------------------------

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS).  Add it for arm64.

Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Matthew Wilcox (Oracle)
mainline inclusion
from mainline-5.13-rc1
commit 5f076944
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5f076944f06988391a6dbd458fc6485a71088e57

-------------------------------------------------

Sphinx interprets the Return section as a list and complains about it.
Turn it into a sentence and move it to the end of the kernel-doc to fit
the kernel-doc style.

Link: https://lkml.kernel.org/r/20210225150642.2582252-8-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 5f076944)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
(fix conflicts: add "kernel-doc:: mm/mempolicy.c" at the end of mm-api.rst)
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Matthew Wilcox (Oracle)
mainline inclusion
from mainline-5.13-rc1
commit eb350739
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb35073960510762dee417574589b7a8971c68ab

-------------------------------------------------

The current formatting doesn't quite work with kernel-doc.

Link: https://lkml.kernel.org/r/20210225150642.2582252-7-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit eb350739)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Matthew Wilcox (Oracle)
mainline inclusion
from mainline-5.13-rc1
commit 6421ec76
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6421ec764a62c51f810c5dc40cd45eeb15801ad9

-------------------------------------------------

Document alloc_pages() for both NUMA and non-NUMA cases, as kernel-doc
doesn't care.

Link: https://lkml.kernel.org/r/20210225150642.2582252-6-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Mike Rapoport <rppt@linux.ibm.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 6421ec76)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Matthew Wilcox (Oracle)
mainline inclusion
from mainline-5.13-rc1
commit d7f946d0
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d7f946d0faf90014547ee5d090e9d05018278c7a

-------------------------------------------------

When CONFIG_NUMA is enabled, alloc_pages() is a wrapper around
alloc_pages_current().  This is pointless; just implement alloc_pages()
directly.

Link: https://lkml.kernel.org/r/20210225150642.2582252-5-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d7f946d0)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
By Matthew Wilcox (Oracle)

mainline inclusion
from mainline-5.13-rc1
commit 84172f4b
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I3ZVL2
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=84172f4bb752424415756351a40f8da5714e1554

-------------------------------------------------

There are only two callers of __alloc_pages(), so prune the thicket of
alloc_page variants by combining the two functions.  Current callers of
__alloc_pages() simply add an extra 'NULL' parameter, and current
callers of __alloc_pages_nodemask() call __alloc_pages() instead.

Link: https://lkml.kernel.org/r/20210225150642.2582252-4-willy@infradead.org
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 84172f4b)
Signed-off-by: Yongqiang Liu <liuyongqiang13@huawei.com>
Reviewed-by: Tong Tiangen <tongtiangen@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
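The resulting change in shape, as a sketch (signatures paraphrased; the
pre-5.13 helper was a static inline):

    /* before: two entry points */
    struct page *__alloc_pages_nodemask(gfp_t gfp, unsigned int order,
                                        int preferred_nid,
                                        nodemask_t *nodemask);
    static inline struct page *__alloc_pages(gfp_t gfp, unsigned int order,
                                             int preferred_nid)
    {
        return __alloc_pages_nodemask(gfp, order, preferred_nid, NULL);
    }

    /* after: one function; former __alloc_pages() callers add NULL */
    struct page *__alloc_pages(gfp_t gfp, unsigned int order,
                               int preferred_nid, nodemask_t *nodemask);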
-
- 3 November 2020, 1 commit
By Shijie Luo

When flags in queue_pages_pte_range() don't have the MPOL_MF_MOVE or
MPOL_MF_MOVE_ALL bits set, the loop breaks early, and passing the
original pte - 1 to pte_unmap_unlock() is not a good idea.

queue_pages_pte_range() can run in MPOL_MF_MOVE_ALL mode, which doesn't
migrate misplaced pages but returns EIO when encountering such a page.
Since commit a7f40cfe ("mm: mempolicy: make mbind() return -EIO when
MPOL_MF_STRICT is specified"), an early break on the first pte in the
range results in pte_unmap_unlock() on an underflowed pte.  This can
lead to lockups later on when somebody tries to lock the pte, resp.
page_table_lock, again.

Fixes: a7f40cfe ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
Signed-off-by: Shijie Luo <luoshijie1@huawei.com>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Feilong Lin <linfeilong@huawei.com>
Cc: Shijie Luo <luoshijie1@huawei.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
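The shape of the fix, sketched: remember the pte returned by
pte_offset_map_lock() so the early break cannot unmap an underflowed
pointer.

    mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
    for (; addr != end; pte++, addr += PAGE_SIZE) {
        ...
        if (!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))) {
                has_unmovable = true;
                break;  /* pte may not have advanced; don't touch pte - 1 */
        }
        ...
    }
    pte_unmap_unlock(mapped_pte, ptl);  /* was: pte_unmap_unlock(pte - 1, ptl) */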
-
- 14 October 2020, 2 commits
By Wei Yang

No one uses this macro anymore.

Also fix the code style of policy_node().

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200921021401.84508-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Wei Yang

It is not necessary to hold the lock of current when setting the
nodemask of a new policy.

Signed-off-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: https://lkml.kernel.org/r/20200921040416.86185-1-richard.weiyang@linux.alibaba.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 15 August 2020, 1 commit
By Matthew Wilcox (Oracle)
The thp prefix is more frequently used than hpage and we should be
consistent between the various functions.

[akpm@linux-foundation.org: fix mm/migrate.c]

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 13 August 2020, 5 commits
By Joonsoo Kim
There is a well-defined migration target allocation callback.  Use it.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-7-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Joonsoo Kim

There is no difference between the two migration callback functions,
alloc_huge_page_node() and alloc_huge_page_nodemask(), except for the
__GFP_THISNODE handling.  It's redundant to have two almost identical
functions just to handle this flag.  So this patch removes one of them
by introducing a new argument, gfp_mask, to alloc_huge_page_nodemask().

After introducing the gfp_mask argument, it's the caller's job to
provide a correct gfp_mask.  So, every callsite of
alloc_huge_page_nodemask() is changed to provide the gfp_mask.

Note that it's safe to remove the node id check in
alloc_huge_page_node(), since there is no caller passing NUMA_NO_NODE
as a node id.

Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Roman Gushchin <guro@fb.com>
Link: http://lkml.kernel.org/r/1594622517-20681-4-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Wenchao Hao

The previous implementation called untagged_addr() before the error
check; if the error check fails and EINVAL is returned, the
untagged_addr() call was just wasted work.

Signed-off-by: Wenchao Hao <haowenchao22@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200801090825.5597-1-haowenchao22@gmail.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Krzysztof Kozlowski
Fix W=1 compile warnings (invalid kerneldoc):

  mm/mempolicy.c:137: warning: Function parameter or member 'node' not described in 'numa_map_to_online_node'
  mm/mempolicy.c:137: warning: Excess function parameter 'nid' description in 'numa_map_to_online_node'

Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Link: http://lkml.kernel.org/r/20200728171109.28687-3-krzk@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Muchun Song

In the reservation routine, we only check whether the cpuset meets the
memory allocation requirements, but we ignore the mempolicy in the
MPOL_BIND case.  If someone successfully mmap()s hugetlb memory, the
subsequent memory allocation may still fail due to mempolicy
restrictions, and the process receives the SIGBUS signal.  This can be
reproduced with the following steps:

1) Compile the test case.

	cd tools/testing/selftests/vm/
	gcc map_hugetlb.c -o map_hugetlb

2) Pre-allocate huge pages.  Suppose there are 2 numa nodes in the
   system; each node will pre-allocate one huge page.

	echo 2 > /proc/sys/vm/nr_hugepages

3) Run the test case (mmap 4MB).  We receive the SIGBUS signal.

	numactl --membind=0 ./map_hugetlb 4

With this patch applied, the mmap fails in step 3) and reports "mmap:
Cannot allocate memory".

[akpm@linux-foundation.org: include sched.h for `current']

Reported-by: Jianchao Guo <guojianchao@bytedance.com>
Suggested-by: Michal Hocko <mhocko@kernel.org>
Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michel Lespinasse <walken@google.com>
Cc: Baoquan He <bhe@redhat.com>
Link: http://lkml.kernel.org/r/20200728034938.14993-1-songmuchun@bytedance.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
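A sketch of the check added on the hugetlb side, paraphrased from the
mainline commit: count free huge pages only on nodes allowed by both
the cpuset and an MPOL_BIND mempolicy.

    static unsigned int allowed_mems_nr(struct hstate *h)
    {
        int node;
        unsigned int nr = 0;
        nodemask_t *mpol_allowed;
        unsigned int *array = h->free_huge_pages_node;
        gfp_t gfp_mask = htlb_alloc_mask(h);

        /* NULL unless the task has an MPOL_BIND policy */
        mpol_allowed = policy_nodemask_current(gfp_mask);

        for_each_node_mask(node, cpuset_current_mems_allowed)
                if (!mpol_allowed || node_isset(node, *mpol_allowed))
                        nr += array[node];

        return nr;
    }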
-
- 17 July 2020, 1 commit
By Kees Cook
Using uninitialized_var() is dangerous as it papers over real bugs[1]
(or can in the future), and suppresses unrelated compiler warnings
(e.g. "unused variable").  If the compiler thinks it is uninitialized,
either simply initialize the variable or make compiler changes.

In preparation for removing[2] the[3] macro[4], remove all remaining
needless uses with the following script:

	git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \
		xargs perl -pi -e \
			's/\buninitialized_var\(([^\)]+)\)/\1/g;
			 s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;'

drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid
pathological white-space.

No outstanding warnings were found building allmodconfig with GCC 9.3.0
for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips,
sparc64, alpha, and m68k.

[1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/
[2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/
[3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/
[4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/

Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5
Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB
Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers
Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs
Signed-off-by: Kees Cook <keescook@chromium.org>
-
- 10 June 2020, 3 commits
By Michel Lespinasse
Convert comments that reference mmap_sem to reference mmap_lock
instead.

[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Michel Lespinasse
Convert comments that reference old mmap_sem APIs to reference the
corresponding new mmap locking APIs instead.

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Laurent Dufour <ldufour@linux.ibm.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Michel Lespinasse
This change converts the existing mmap_sem rwsem calls to use the new
mmap locking API instead.

The change is generated using coccinelle with the following rule:

	// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

	@@
	expression mm;
	@@
	(
	-init_rwsem
	+mmap_init_lock
	|
	-down_write
	+mmap_write_lock
	|
	-down_write_killable
	+mmap_write_lock_killable
	|
	-down_write_trylock
	+mmap_write_trylock
	|
	-up_write
	+mmap_write_unlock
	|
	-downgrade_write
	+mmap_write_downgrade
	|
	-down_read
	+mmap_read_lock
	|
	-down_read_killable
	+mmap_read_lock_killable
	|
	-down_read_trylock
	+mmap_read_trylock
	|
	-up_read
	+mmap_read_unlock
	)
	-(&mm->mmap_sem)
	+(mm)

Signed-off-by: Michel Lespinasse <walken@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Reviewed-by: Laurent Dufour <ldufour@linux.ibm.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Cc: Davidlohr Bueso <dbueso@suse.de>
Cc: David Rientjes <rientjes@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Liam Howlett <Liam.Howlett@oracle.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ying Han <yinghan@google.com>
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 4 June 2020, 1 commit
By Michal Hocko

Commit ba841078 ("mm/mempolicy: Allow lookup_node() to handle fatal
signal") added special casing for a 0 return value, because that was a
possible gup return value when interrupted by a fatal signal.  This has
since been fixed by ae46d2aa ("mm/gup: Let __get_user_pages_locked()
return -EINTR for fatal signal"), so ba841078 can be reverted.

This patch, however, doesn't go all the way to reverting it, because
the check for 0 is wrong and confusing here.  Firstly, it is inherently
unsafe to access the page when get_user_pages_locked() returns 0 (aka
no page returned).

Fortunately, this will not happen, because get_user_pages_locked() will
not return 0 when nr_pages > 0, unless FOLL_NOWAIT is specified, which
is not the case here.  Document this potential error code in gup code
while we are at it.

Signed-off-by: Michal Hocko <mhocko@suse.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Xu <peterx@redhat.com>
Link: http://lkml.kernel.org/r/20200421071026.18394-1-mhocko@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 8 April 2020, 1 commit
By Peter Xu

lookup_node() uses gup to pin the page and get node information.  It
checks against ret >= 0, assuming the page will be filled in.  However,
it's also possible that gup will return zero, for example, when the
thread is quickly killed with a fatal signal.  Teach lookup_node() to
gracefully return an error (-EFAULT) if that happens.

Meanwhile, initialize "page" to NULL to avoid the potential risk of
exploiting the pointer.

Fixes: 4426e945 ("mm/gup: allow VM_FAULT_RETRY for multiple times")
Reported-by: syzbot+693dc11fcb53120b5559@syzkaller.appspotmail.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
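A hedged sketch of the resulting check, paraphrased from the era of
this commit (pre-mmap_lock API, so up_read() on mmap_sem): only
dereference the page on a strictly positive gup return, and map 0 to
-EFAULT.

    static int lookup_node(struct mm_struct *mm, unsigned long addr)
    {
        struct page *p = NULL;  /* don't trust gup to fill it on failure */
        int locked = 1;
        int err;

        err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
        if (err == 0) {
                /* e.g. GUP interrupted by fatal signal */
                err = -EFAULT;
        } else if (err > 0) {
                err = page_to_nid(p);
                put_page(p);
        }
        if (locked)
                up_read(&mm->mmap_sem);
        return err;
    }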
-