- 22 Oct, 2021 6 commits
-
-
Committed by Tang Yizhou
ascend inclusion
category: bugfix
bugzilla: 46906, https://gitee.com/openeuler/kernel/issues/I4DZ7Q
CVE: NA

-------------------------------------------------

When CONFIG_MEMCG is disabled and CONFIG_MM_OWNER is enabled, we encounter a compilation error as follows:

mm/hugepage_tuning.c: In function ‘get_mem_cgroup_from_path’:
mm/hugepage_tuning.c:130:8: error: implicit declaration of function ‘mem_cgroup_from_css’; did you mean ‘mem_cgroup_from_obj’? [-Werror=implicit-function-declaration]
  mcg = mem_cgroup_from_css(of_css(of));
        ^~~~~~~~~~~~~~~~~~~
        mem_cgroup_from_obj
mm/hugepage_tuning.c:130:6: warning: assignment makes pointer from integer without a cast [-Wint-conversion]
  mcg = mem_cgroup_from_css(of_css(of));

To fix it, let mm_update_next_owner() depend on CONFIG_MEMCG.

Fixes: 719e31550652 ("arm64/ascend: Add auto tuning hugepage module")
Signed-off-by: Tang Yizhou <tangyizhou@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Ding Tianhong <dingtianhong@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Miklos Szeredi
mainline inclusion
from mainline-5.14
commit 76224355
category: bugfix
bugzilla: 181107
CVE: NA

---------------------------

fuse_finish_open() will be called with FUSE_NOWRITE in case of atomic O_TRUNC. This can deadlock with fuse_wait_on_page_writeback() in fuse_launder_page() triggered by invalidate_inode_pages2().

Fix by replacing invalidate_inode_pages2() in fuse_finish_open() with a truncate_pagecache() call. This makes sense regardless of FOPEN_KEEP_CACHE or fc->writeback cache, so do it unconditionally.

Reported-by: Xie Yongji <xieyongji@bytedance.com>
Reported-and-tested-by: syzbot+bea44a5189836d956894@syzkaller.appspotmail.com
Fixes: e4648309 ("fuse: truncate pending writes on O_TRUNC")
Cc: <stable@vger.kernel.org>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Conflict: fs/fuse/file.c
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Zhang Yi
mainline inclusion
from mainline-5.15-rc4
commit cc883236
category: perf
bugzilla: NA
CVE: NA

---------------------------

After we factor out the inline data write procedure from ext4_da_write_end(), we don't need to start a journal handle for the cases of both buffer overwrite and append-write. If we need to update i_disksize, mark_inode_dirty() does start a handle and update the inode buffer. So we could just remove all the journal handle code in the delalloc write procedure.

After this patch, we could get a lot of performance improvement. Below is the Unixbench comparison data tested on my machine with an 'Intel Xeon Gold 5120' CPU and an nvme SSD backend.

Test cmd:

  ./Run -c 56 -i 3 fstime fsbuffer fsdisk

Before this patch:

  System Benchmarks Partial Index              BASELINE       RESULT    INDEX
  File Copy 1024 bufsize 2000 maxblocks          3960.0     422965.0   1068.1
  File Copy 256 bufsize 500 maxblocks            1655.0     105077.0    634.9
  File Copy 4096 bufsize 8000 maxblocks          5800.0    1429092.0   2464.0
                                                                       ======
  System Benchmarks Index Score (Partial Only)                         1186.6

After this patch:

  System Benchmarks Partial Index              BASELINE       RESULT    INDEX
  File Copy 1024 bufsize 2000 maxblocks          3960.0     732716.0   1850.3
  File Copy 256 bufsize 500 maxblocks            1655.0     184940.0   1117.5
  File Copy 4096 bufsize 8000 maxblocks          5800.0    2427152.0   4184.7
                                                                       ======
  System Benchmarks Index Score (Partial Only)                         2053.0

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-5-yi.zhang@huawei.com
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Zhang Yi
mainline inclusion
from mainline-5.15-rc4
commit 6984aef5
category: perf
bugzilla: NA
CVE: NA

---------------------------

Now that the inline_data file write end procedure has fallen into the common write end functions, it is not clear. Factor it out and do some cleanup. This patch also drops ext4_da_write_inline_data_end() and switches to ext4_write_inline_data_end() instead, because we also need to do the same error processing if we fail to write data into the inline entry.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-4-yi.zhang@huawei.com
Conflicts:
  fs/ext4/inline.c
  fs/ext4/inode.c
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Zhang Yi
mainline inclusion
from mainline-5.15-rc4
commit 55ce2f64
category: perf
bugzilla: NA
CVE: NA

---------------------------

The current error path of ext4_write_inline_data_end() is not correct.

Firstly, it should pass out the error value if ext4_get_inode_loc() fails, or else it could trigger an infinite loop if we inject an error here. Secondly, it is better to add the inode to the orphan list if ext4_journal_stop() fails, otherwise we could not restore the inline xattr entry after a power failure. Finally, we need to reset the 'ret' value if ext4_write_inline_data_end() returns success in ext4_write_end() and ext4_journalled_write_end(), otherwise we could not get the error return value of ext4_journal_stop().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-3-yi.zhang@huawei.com
Conflicts:
  fs/ext4/inode.c
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Zhang Yi
mainline inclusion
from mainline-5.15-rc4
commit 4df031ff
category: perf
bugzilla: NA
CVE: NA

---------------------------

After commit 3da40c7b ("ext4: only call ext4_truncate when size <= isize"), i_disksize could always be updated to i_size in ext4_setattr(), and we can be sure that i_disksize <= i_size since we hold the inode lock, and if i_disksize < i_size there are delalloc writes pending in the range up to i_size. If the end of the current write is <= i_size, there's no need to touch i_disksize since writeback will push i_disksize up to i_size eventually. So we can switch to checking i_size instead of i_disksize in ext4_da_write_end() when writing to the end of the file. We could also remove ext4_mark_inode_dirty() because we defer inode dirtying to generic_write_end() or ext4_da_write_inline_data_end().

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Link: https://lore.kernel.org/r/20210716122024.1105856-2-yi.zhang@huawei.com
Conflicts:
  fs/ext4/inode.c
Reviewed-by: Yang Erkun <yangerkun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
- 21 Oct, 2021 1 commit
-
-
Committed by Dietmar Eggemann
mainline inclusion
from mainline-v5.12-rc1
commit 71e5f664
category: bugfix
bugzilla: 182847, https://gitee.com/openeuler/kernel/issues/I4EVBL
CVE: NA

----------------------------------------------------------

Commit "sched/topology: Make sched_init_numa() use a set for the deduplicating sort" allocates 'i + nr_levels (level)' instead of 'i + nr_levels + 1' sched_domain_topology_level.

This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

sched_init_domains
  build_sched_domains()
    __free_domain_allocs()
      __sdt_free() {
        ...
        for_each_sd_topology(tl)
          ...
          sd = *per_cpu_ptr(sdd->sd, j); <--
          ...
      }

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Vincent Guittot <vincent.guittot@linaro.org>
Tested-by: Barry Song <song.bao.hua@hisilicon.com>
Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com
Fixes: 620a6dc4 ("sched/topology: Make sched_init_numa() use a set for the deduplicating sort")
Signed-off-by: Jialin Zhang <zhangjialin11@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Cheng Jian <cj.chengjian@huawei.com>
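The off-by-one described in this commit can be illustrated with a minimal userspace sketch. It assumes the topology table is walked until an entry with an empty mask is found, which is why one extra zeroed slot is needed. The names `topo_level`, `alloc_levels`, `count_levels` and `build_and_count` are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a topology level: a NULL mask ends the table. */
struct topo_level {
    const char *mask;
};

/* Allocate `base` default levels plus `numa` NUMA levels, and one extra
 * zeroed slot that serves as the terminator ("i + nr_levels + 1"). */
static struct topo_level *alloc_levels(size_t base, size_t numa)
{
    return calloc(base + numa + 1, sizeof(struct topo_level));
}

/* Walk the table the way a sentinel-terminated loop would:
 * stop at the first entry whose mask is NULL. */
static size_t count_levels(const struct topo_level *tl)
{
    size_t n = 0;

    while (tl[n].mask)
        n++;
    return n;
}

/* Fill every real slot and report how many levels the walk sees.
 * Without the "+ 1" above, the walk would read past the allocation. */
size_t build_and_count(size_t base, size_t numa)
{
    struct topo_level *tl = alloc_levels(base, numa);
    size_t i, n;

    if (!tl)
        return 0;
    for (i = 0; i < base + numa; i++)
        tl[i].mask = "cpumask";
    n = count_levels(tl);
    free(tl);
    return n;
}
```

With the terminator slot in place, the walk stops after exactly `base + numa` entries; the sketch only models the allocation size, not the scheduler code itself.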
-
- 20 Oct, 2021 9 commits
-
-
Committed by Yu'an Wang
driver inclusion
category: bugfix
bugzilla: NA
CVE: NA

1. Add an input parameter check to the uacce_unregister API.
2. Change uacce_qfrt_str to an internal interface, because it is used only in uacce.c.

Signed-off-by: Yu'an Wang <wangyuan46@huawei.com>
Reviewed-by: Longfang Liu <liulongfang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by David Hildenbrand
mainline inclusion
from mainline-v5.10-rc1
commit 7fef431b
category: feature
bugzilla: 182882
CVE: NA

__free_pages_core() is used when exposing fresh memory to the buddy during system boot and when onlining memory in generic_online_page().

generic_online_page() is used in two cases:

1. Direct memory onlining in online_pages().
2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV balloon and virtio-mem), when parts of a section are kept fake-offline to be fake-onlined later on.

In 1, we already place pages to the tail of the freelist. Pages will be freed to MIGRATE_ISOLATE lists first and moved to the tail of the freelists via undo_isolate_page_range().

In 2, we currently don't implement a proper rule. In case of virtio-mem, where we currently always online MAX_ORDER - 1 pages, the pages will be placed to the HEAD of the freelist - undesirable. While the hyper-v balloon calls generic_online_page() with single pages, usually it will call it on successive single pages in a larger block.

The pages are fresh, so place them to the tail of the freelist and avoid the PCP. In __free_pages_core(), remove the now superfluous call to set_page_refcounted() and add a comment regarding page initialization and the refcount.

Note: In 2. we currently don't shuffle. If ever relevant (page shuffling is usually of limited use in virtualized environments), we might want to shuffle after a sequence of generic_online_page() calls in the relevant callers.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
  mm/page_alloc.c
[Peng Liu: adjust context]
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by David Hildenbrand
mainline inclusion
from mainline-v5.10-rc1
commit 293ffa5e
category: feature
bugzilla: 182882
CVE: NA

-----------------------------------------------

Whenever we move pages between freelists via move_to_free_list()/move_freepages_block(), we don't actually touch the pages:

1. Page isolation doesn't actually touch the pages, it simply isolates pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist. When undoing isolation, we move the pages back to the target list.
2. Page stealing (steal_suitable_fallback()) moves free pages directly between lists without touching them.
3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves free pages directly between freelists without touching them.

We already place pages to the tail of the freelists when undoing isolation via __putback_isolated_page(), let's do it in any case (e.g., if order <= pageblock_order) and document the behavior. To simplify, let's move the pages to the tail for all move_to_free_list()/move_freepages_block() users.

In 2., the target list is empty, so there should be no change. In 3., we might observe a change, however, highatomic is more concerned about allocations succeeding than cache hotness - if we ever realize this change degrades a workload, we can special-case this instance and add a proper comment.

This change results in all pages getting onlined via online_pages() to be placed to the tail of the freelist.

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
  mm/page_alloc.c
[Peng Liu: cherry-pick from 293ffa5e]
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by David Hildenbrand
mainline inclusion
from mainline-v5.10-rc1
commit 47b6a24a
category: feature
bugzilla: 182882
CVE: NA

-----------------------------------------------

__putback_isolated_page() already documents that pages will be placed to the tail of the freelist - this is, however, not the case for "order >= MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for all existing users.

This change affects two users:

- free page reporting
- page isolation, when undoing the isolation (including memory onlining).

This behavior is desirable for pages that haven't really been touched lately, so exactly the two users that don't actually read/write page content, but rather move untouched pages.

The new behavior is especially desirable for memory onlining, where we allow allocation of newly onlined pages via undo_isolate_page_range() in online_pages(). Right now, we always place them to the head of the freelist, resulting in undesirable behavior: Assume we add individual memory chunks via add_memory() and online them right away to the NORMAL zone. We create a dependency chain of unmovable allocations e.g., via the memmap. The memmap of the next chunk will be placed onto previous chunks - if the last block cannot get offlined+removed, all dependent ones cannot get offlined+removed. While this can already be observed with individual DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc DLPAR).

Document that this should only be used for optimizations, and no code should rely on this behavior for correctness (if the order of the freelists ever changes).

We won't care about page shuffling: memory onlining already properly shuffles after onlining. Free page reporting doesn't care about physically contiguous ranges, and there are already cases where page isolation will simply move (physically close) free pages to (currently) the head of the freelists via move_freepages_block() instead of shuffling. If this ever becomes relevant, we should shuffle the whole zone when undoing isolation of larger ranges, and after free_contig_range().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
  mm/page_alloc.c
[Peng Liu: cherry-pick from 47b6a24a]
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by David Hildenbrand
mainline inclusion
from mainline-v5.10-rc1
commit f04a5d5d
category: feature
bugzilla: 182882
CVE: NA

-----------------------------------------------

Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.

When adding separate memory blocks via add_memory*() and onlining them immediately, the metadata (especially the memmap) of the next block will be placed onto one of the just added+onlined blocks. This creates a chain of unmovable allocations: If the last memory block cannot get offlined+removed() so will all dependent ones. We directly have unmovable allocations all over the place.

This can be observed quite easily using virtio-mem, however, it can also be observed when using DIMMs. The freshly onlined pages will usually be placed to the head of the freelists, meaning they will be allocated next, turning the just-added memory usually immediately un-removable. The fresh pages are cold, preferring to allocate others (that might be hot) also feels to be the natural thing to do.

It also applies to the hyper-v balloon, xen-balloon, and ppc64 dlpar: when adding separate, successive memory blocks, each memory block will have unmovable allocations on them - for example gigantic pages will fail to allocate.

While the ZONE_NORMAL doesn't provide any guarantees that memory can get offlined+removed again (any kind of fragmentation with unmovable allocations is possible), there are many scenarios (hotplugging a lot of memory, running workload, hotunplug some memory/as much as possible) where we can offline+remove quite a lot with this patchset.

a) To visualize the problem, a very simple example:

Start a VM with 4GB and 8GB of virtio-mem memory:

 [root@localhost ~]# lsmem
 RANGE                                 SIZE  STATE REMOVABLE   BLOCK
 0x0000000000000000-0x00000000bfffffff   3G online       yes    0-23
 0x0000000100000000-0x000000033fffffff   9G online       yes  32-103

 Memory block size:       128M
 Total online memory:      12G
 Total offline memory:      0B

Then try to unplug as much as possible using virtio-mem. Observe which memory blocks are still around. Without this patch set:

 [root@localhost ~]# lsmem
 RANGE                                  SIZE  STATE REMOVABLE   BLOCK
 0x0000000000000000-0x00000000bfffffff    3G online       yes    0-23
 0x0000000100000000-0x000000013fffffff    1G online       yes   32-39
 0x0000000148000000-0x000000014fffffff  128M online       yes      41
 0x0000000158000000-0x000000015fffffff  128M online       yes      43
 0x0000000168000000-0x000000016fffffff  128M online       yes      45
 0x0000000178000000-0x000000017fffffff  128M online       yes      47
 0x0000000188000000-0x0000000197ffffff  256M online       yes   49-50
 0x00000001a0000000-0x00000001a7ffffff  128M online       yes      52
 0x00000001b0000000-0x00000001b7ffffff  128M online       yes      54
 0x00000001c0000000-0x00000001c7ffffff  128M online       yes      56
 0x00000001d0000000-0x00000001d7ffffff  128M online       yes      58
 0x00000001e0000000-0x00000001e7ffffff  128M online       yes      60
 0x00000001f0000000-0x00000001f7ffffff  128M online       yes      62
 0x0000000200000000-0x0000000207ffffff  128M online       yes      64
 0x0000000210000000-0x0000000217ffffff  128M online       yes      66
 0x0000000220000000-0x0000000227ffffff  128M online       yes      68
 0x0000000230000000-0x0000000237ffffff  128M online       yes      70
 0x0000000240000000-0x0000000247ffffff  128M online       yes      72
 0x0000000250000000-0x0000000257ffffff  128M online       yes      74
 0x0000000260000000-0x0000000267ffffff  128M online       yes      76
 0x0000000270000000-0x0000000277ffffff  128M online       yes      78
 0x0000000280000000-0x0000000287ffffff  128M online       yes      80
 0x0000000290000000-0x0000000297ffffff  128M online       yes      82
 0x00000002a0000000-0x00000002a7ffffff  128M online       yes      84
 0x00000002b0000000-0x00000002b7ffffff  128M online       yes      86
 0x00000002c0000000-0x00000002c7ffffff  128M online       yes      88
 0x00000002d0000000-0x00000002d7ffffff  128M online       yes      90
 0x00000002e0000000-0x00000002e7ffffff  128M online       yes      92
 0x00000002f0000000-0x00000002f7ffffff  128M online       yes      94
 0x0000000300000000-0x0000000307ffffff  128M online       yes      96
 0x0000000310000000-0x0000000317ffffff  128M online       yes      98
 0x0000000320000000-0x0000000327ffffff  128M online       yes     100
 0x0000000330000000-0x000000033fffffff  256M online       yes 102-103

 Memory block size:       128M
 Total online memory:     8.1G
 Total offline memory:      0B

With this patch set:

 [root@localhost ~]# lsmem
 RANGE                                 SIZE  STATE REMOVABLE BLOCK
 0x0000000000000000-0x00000000bfffffff   3G online       yes  0-23
 0x0000000100000000-0x000000013fffffff   1G online       yes 32-39

 Memory block size:       128M
 Total online memory:       4G
 Total offline memory:      0B

All memory can get unplugged, all memory blocks can get removed. Of course, no workload ran and the system was basically idle, but it highlights the issue - the fairly deterministic chain of unmovable allocations. When a huge page for the 2MB memmap is needed, a just-onlined 4MB page will be split. The remaining 2MB page will be used for the memmap of the next memory block. So one memory block will hold the memmap of the two following memory blocks. Finally the pages of the last-onlined memory block will get used for the next bigger allocations - if any allocation is unmovable, all dependent memory blocks cannot get unplugged and removed until that allocation is gone.

Note that with bigger memory blocks (e.g., 256MB), *all* memory blocks are dependent and none can get unplugged again!

b) Experiment with memory intensive workload

I performed an experiment with an older version of this patch set (before we used undo_isolate_page_range() in online_pages()): Hotplug 56GB to a VM with an initial 4GB, onlining all memory to ZONE_NORMAL right from the kernel when adding it. I then run various memory intensive workloads that consume most system memory for a total of 45 minutes.
Once finished, I try to unplug as much memory as possible.

With this change, I am able to remove via virtio-mem (adding individual 128MB memory blocks) 413 out of 448 added memory blocks. Via individual (256MB) DIMMs 380 out of 448 added memory blocks. (I don't have any numbers without this patchset, but looking at the above example, it's at most half of the 448 memory blocks for virtio-mem, and most probably none for DIMMs).

Again, there are workloads that might behave very differently due to the nature of ZONE_NORMAL.

This change also affects (besides memory onlining):

- Other users of undo_isolate_page_range(): Pages are always placed to the tail.
-- When memory offlining fails
-- When memory isolation fails after having isolated some pageblocks
-- When alloc_contig_range() either succeeds or fails
- Other users of __putback_isolated_page(): Pages are always placed to the tail.
-- Free page reporting
- Other users of __free_pages_core()
-- AFAIKs, any memory that is getting exposed to the buddy during boot. IIUC we will now usually allocate memory from lower addresses within a zone first (especially during boot).
- Other users of generic_online_page()
-- Hyper-V balloon

This patch (of 5):

Let's prepare for additional flags and avoid long parameter lists of bools. Follow-up patches will also make use of the flags in __free_pages_ok().

Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Vlastimil Babka <vbabka@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Wei Yang <richard.weiyang@linux.alibaba.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Mike Rapoport <rppt@kernel.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Scott Cheloha <cheloha@linux.ibm.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Michal Hocko <mhocko@kernel.org>
Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
  mm/page_alloc.c
[Peng Liu: cherry-pick from f04a5d5d]
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
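The idea behind this series (a flags argument instead of a list of bools, and fresh pages going to the freelist tail) can be sketched in userspace as follows. This is only an illustration under assumptions: the names `fpi_t`, `FPI_NONE`, `FPI_TO_TAIL`, `page_node`, `free_list` and the helper functions are hypothetical stand-ins, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag type: one bit per behaviour instead of one bool each. */
typedef unsigned int fpi_t;
#define FPI_NONE    0u
#define FPI_TO_TAIL (1u << 0)   /* place the freed page at the list tail */

struct page_node {
    int id;
    struct page_node *prev, *next;
};

/* Circular doubly linked list with a dummy head node. */
struct free_list {
    struct page_node head;
};

static void free_list_init(struct free_list *fl)
{
    fl->head.prev = fl->head.next = &fl->head;
}

static void insert_between(struct page_node *n, struct page_node *prev,
                           struct page_node *next)
{
    n->prev = prev;
    n->next = next;
    prev->next = n;
    next->prev = n;
}

/* Free a page to the head by default, or to the tail if requested. */
static void free_one_page(struct free_list *fl, struct page_node *p, fpi_t flags)
{
    if (flags & FPI_TO_TAIL)
        insert_between(p, fl->head.prev, &fl->head);
    else
        insert_between(p, &fl->head, fl->head.next);
}

/* Allocation always takes the page at the head of the list. */
static struct page_node *alloc_one_page(struct free_list *fl)
{
    struct page_node *p = fl->head.next;

    if (p == &fl->head)
        return NULL;
    p->prev->next = p->next;
    p->next->prev = p->prev;
    p->prev = p->next = NULL;
    return p;
}

/* Free an old page (id 1) normally, then a freshly onlined page (id 2)
 * with `flags`, and return the id of the page allocated next. */
int next_alloc_after_online(fpi_t flags)
{
    struct free_list fl;
    struct page_node old_page = { .id = 1 }, fresh_page = { .id = 2 };

    free_list_init(&fl);
    free_one_page(&fl, &old_page, FPI_NONE);
    free_one_page(&fl, &fresh_page, flags);
    return alloc_one_page(&fl)->id;
}
```

In this model, freeing the fresh page to the head makes it the very next allocation, while freeing it to the tail leaves the older page to be used first, which is the effect the series aims for with just-onlined memory.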
-
Committed by Alexander Duyck
mainline inclusion
from mainline-v5.7-rc1
commit 624f58d8
category: feature
bugzilla: 182882
CVE: NA

-----------------------------------------------

There are cases where we would benefit from avoiding having to go through the allocation and free cycle to return an isolated page.

Examples for this might include page poisoning in which we isolate a page and then put it back in the free list without ever having actually allocated it.

This will enable us to also avoid notifiers for the future free page reporting which will need to avoid retriggering page reporting when returning pages that have been reported on.

Signed-off-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Mel Gorman <mgorman@techsingularity.net>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Luiz Capitulino <lcapitulino@redhat.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pagupta@redhat.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Rik van Riel <riel@surriel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Wei Wang <wei.w.wang@intel.com>
Cc: Yang Zhang <yang.zhang.wz@gmail.com>
Cc: wei qi <weiqi4@huawei.com>
Link: http://lkml.kernel.org/r/20200211224624.29318.89287.stgit@localhost.localdomain
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
  mm/internal.h
[Peng Liu: cherry-pick from 624f58d8]
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Arun KS
mainline inclusion
from mainline-v5.1-rc1
commit a9cd410a
category: feature
bugzilla: 182882
CVE: NA

-----------------------------------------------

When pages are freed at a higher order, the time spent on coalescing pages by the buddy allocator can be reduced. With a section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of the online callback to align with the change.

[arunks@codeaurora.org: v11]
  Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
[akpm@linux-foundation.org: remove unused local, per Arun]
[akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
[akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
[arunks@codeaurora.org: v8]
  Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
[arunks@codeaurora.org: v9]
  Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Conflicts:
  mm/page_alloc.c
  mm/memory_hotplug.c
[Peng Liu: adjust context]
Signed-off-by: Peng Liu <liupeng256@huawei.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
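A rough back-of-the-envelope sketch of why higher-order freeing helps: the number of free operations needed to hand one 256MB section to the allocator drops sharply as the order grows. The 4 KiB page size and order 10 (1024 pages per chunk) used in the usage note are assumptions for illustration, and `free_calls` is a hypothetical helper, not a kernel function.

```c
#include <assert.h>

/* Number of free operations needed to release `bytes` of memory when each
 * operation hands over a chunk of 2^order pages of `page_size` bytes. */
unsigned long free_calls(unsigned long bytes, unsigned long page_size,
                         unsigned int order)
{
    unsigned long pages = bytes / page_size;
    unsigned long chunk = 1UL << order;

    /* Round up so that a partial trailing chunk still counts as one call. */
    return (pages + chunk - 1) / chunk;
}
```

Under those assumptions, `free_calls(256UL << 20, 4096, 0)` gives 65536 single-page frees, while `free_calls(256UL << 20, 4096, 10)` gives 64, so the buddy allocator has far fewer merges to perform. The sketch counts calls only and says nothing about the measured latencies quoted in the commit.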
-
Committed by Guoqing Jiang
mainline inclusion
from mainline-v5.15-rc1
commit 6607cd31
category: bugfix
bugzilla: 182883, https://gitee.com/openeuler/kernel/issues/I4ENHY
CVE: NA

-------------------------------------------------

We can't split write behind bio with more than BIO_MAX_VECS sectors, otherwise the below call trace was triggered because we could allocate oversized write behind bio later.

[ 8.097936] bvec_alloc+0x90/0xc0
[ 8.098934] bio_alloc_bioset+0x1b3/0x260
[ 8.099959] raid1_make_request+0x9ce/0xc50 [raid1]
[ 8.100988] ? __bio_clone_fast+0xa8/0xe0
[ 8.102008] md_handle_request+0x158/0x1d0 [md_mod]
[ 8.103050] md_submit_bio+0xcd/0x110 [md_mod]
[ 8.104084] submit_bio_noacct+0x139/0x530
[ 8.105127] submit_bio+0x78/0x1d0
[ 8.106163] ext4_io_submit+0x48/0x60 [ext4]
[ 8.107242] ext4_writepages+0x652/0x1170 [ext4]
[ 8.108300] ? do_writepages+0x41/0x100
[ 8.109338] ? __ext4_mark_inode_dirty+0x240/0x240 [ext4]
[ 8.110406] do_writepages+0x41/0x100
[ 8.111450] __filemap_fdatawrite_range+0xc5/0x100
[ 8.112513] file_write_and_wait_range+0x61/0xb0
[ 8.113564] ext4_sync_file+0x73/0x370 [ext4]
[ 8.114607] __x64_sys_fsync+0x33/0x60
[ 8.115635] do_syscall_64+0x33/0x40
[ 8.116670] entry_SYSCALL_64_after_hwframe+0x44/0xae

Thanks for the comment from Christoph.

[1]. https://bugs.archlinux.org/task/70992

Cc: stable@vger.kernel.org # v5.12+
Reported-by: Jens Stutte <jens@chianterastutte.eu>
Tested-by: Jens Stutte <jens@chianterastutte.eu>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guoqing Jiang <jiangguoqing@kylinos.cn>
Signed-off-by: Song Liu <songliubraving@fb.com>
Conflict:
  drivers/md/raid1.c
[ Mainline patch 6607cd31 ("raid1: ensure write behind bio has less than BIO_MAX_VECS sectors"), BIO_MAX_VECS is used directly, but the BIO_MAX_VECS was renamed previously and the corresponding patch a8affc03 ("block: rename BIO_MAX_PAGES to BIO_MAX_VECS") was not incorporated. So we modify BIO_MAX_VECS to the original BIO_MAX_PAGES.]
Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
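As a hedged illustration of the kind of limit this fix enforces, the sketch below clamps a request's sector count so that a write-behind copy never needs more vectors than the maximum. It assumes each vector holds one page and that a sector is 512 bytes; `clamp_write_behind_sectors` and its parameters are hypothetical, and the actual patch may express the limit differently.

```c
#include <assert.h>

#define SECTOR_SHIFT 9  /* 512-byte sectors (assumption for this sketch) */

/* Limit `max_sectors` so that the data fits in `max_vecs` vectors,
 * assuming each vector covers exactly one page of `page_size` bytes. */
int clamp_write_behind_sectors(int max_sectors, int max_vecs, int page_size)
{
    int limit = max_vecs * (page_size >> SECTOR_SHIFT);

    return max_sectors < limit ? max_sectors : limit;
}
```

For example, with 256 vectors and 4 KiB pages (both assumed values), the limit is 256 * 8 = 2048 sectors, so a 4096-sector request would be clamped to 2048 while a 1024-sector request passes through unchanged.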
-
Committed by Laibin Qiu
hulk inclusion category: bugfix bugzilla: 182135, https://gitee.com/openeuler/kernel/issues/I4ENC8 CVE: NA -------------------------- A block test reported the following stack. Some requests had been waiting for a wakeup in wbt_wait, and the vmcore showed that the wbt inflight counter was -1, so the requests could not be woken up. PID: 75416 TASK: ffff88836c098000 CPU: 2 COMMAND: "fsstress" [ffff8882e59a7608] __schedule at ffffffffb2d22a25 [ffff8882e59a7720] schedule at ffffffffb2d2358f [ffff8882e59a7738] io_schedule at ffffffffb2d23bdc [ffff8882e59a7750] rq_qos_wait at ffffffffb2400fde [ffff8882e59a7878] wbt_wait at ffffffffb243a051 [ffff8882e59a7910] __rq_qos_throttle at ffffffffb2400a20 [ffff8882e59a7930] blk_mq_make_request at ffffffffb23de038 [ffff8882e59a7a98] generic_make_request at ffffffffb23c393d [ffff8882e59a7b80] submit_bio at ffffffffb23c3db8 [ffff8882e59a7c48] submit_bio_wait at ffffffffb23b3a5d [ffff8882e59a7cf0] blkdev_issue_flush at ffffffffb23c8f4c [ffff8882e59a7d20] ext4_sync_fs at ffffffffc06dd708 [ext4] [ffff8882e59a7dd0] sync_filesystem at ffffffffb21e8335 [ffff8882e59a7df8] ovl_sync_fs at ffffffffc0fd853a [overlay] [ffff8882e59a7e10] sync_fs_one_sb at ffffffffb21e8221 [ffff8882e59a7e30] iterate_supers at ffffffffb218401e [ffff8882e59a7e70] ksys_sync at ffffffffb21e8588 [ffff8882e59a7f20] __x64_sys_sync at ffffffffb21e861f [ffff8882e59a7f28] do_syscall_64 at ffffffffb1c06bc8 [ffff8882e59a7f50] entry_SYSCALL_64_after_hwframe at ffffffffb2e000ad RIP: 00007f479ab13347 RSP: 00007ffd4dda9fe8 RFLAGS: 00000202 RAX: ffffffffffffffda RBX: 0000000000000068 RCX: 00007f479ab13347 RDX: 0000000000000000 RSI: 000000003e1b142d RDI: 0000000000000068 RBP: 0000000051eb851f R8: 00007f479abd4034 R9: 00007f479abd40a0 R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000402c20 R13: 0000000000000001 R14: 0000000000000000 R15: 7fffffffffffffff The ->inflight counter may become negative (-1) if: 1) blk-wbt was disabled when the I/O was issued, so the inflight count was not incremented for it; 2) blk-wbt was enabled before the completion of this I/O was tracked; 3) the ->inflight counter is then decremented from 0 to -1 in endio(). Fix the problem by freezing the queue while enabling wbt, so that no request is in flight at that point. Signed-off-by: Laibin Qiu <qiulaibin@huawei.com> Reviewed-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
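The three-step sequence in the message above can be replayed with a toy model. This is a sketch under stated assumptions, not the blk-wbt code: every name below is hypothetical, and "freezing the queue" is reduced to refusing the switch while any request is still outstanding (the kernel would wait for the queue to drain instead of returning).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the accounting, not the kernel structures. */
struct sketch_wbt {
    bool enabled;     /* is inflight accounting active? */
    int  inflight;    /* requests counted by the accounting */
    int  outstanding; /* all requests issued but not yet completed */
};

void sketch_issue(struct sketch_wbt *w)
{
    w->outstanding++;
    if (w->enabled)
        w->inflight++;
}

void sketch_complete(struct sketch_wbt *w)
{
    w->outstanding--;
    if (w->enabled)
        w->inflight--;
}

/* Unsafe: flips the switch while a request may still be outstanding. */
void sketch_enable_unsafe(struct sketch_wbt *w)
{
    w->enabled = true;
}

/*
 * Stand-in for enabling under a frozen queue: the switch only happens
 * once nothing is outstanding. Returns false if the caller must retry.
 */
bool sketch_enable_frozen(struct sketch_wbt *w)
{
    if (w->outstanding != 0)
        return false;
    w->enabled = true;
    return true;
}
```

Issuing a request, calling sketch_enable_unsafe() and then completing the request leaves inflight at -1, which is the state seen in the vmcore; with sketch_enable_frozen() the counter cannot go below zero in this model.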
-
- 19 Oct, 2021 7 commits
-
-
Committed by Xu Qiang
ascend inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4D63I CVE: NA ------------------------------------------------- Export console_flush_on_panic for bbox to use. Signed-off-by: Xu Qiang <xuqiang36@huawei.com> Signed-off-by: Fang Lijun <fanglijun3@huawei.com> Reviewed-by: Ding Tianhong <dingtianhong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Amir Goldstein
mainline inclusion from mainline-v5.7-rc1 commit 4d314f78 category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- There is no reason to deplete the system's global get_next_ino() pool for overlay non-persistent inode numbers and there is no reason at all to allocate non-persistent inode numbers for non-directories. For non-directories, it is much better to leave i_ino the same as real i_ino, to be consistent with st_ino/d_ino. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Zheng Liang <zhengliang6@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Amir Goldstein
mainline inclusion from mainline-v5.7-rc1 commit 62c832ed category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- Move i_ino initialization to ovl_inode_init() to avoid the dance of setting i_ino in ovl_fill_inode() sometimes on the first call and sometimes on the second call. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Zheng Liang <zhengliang6@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Amir Goldstein
mainline inclusion from mainline-v5.7-rc1 commit 2effc5c2 category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- Allocates and initializes the root dentry and inode. Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Zheng Liang <zhengliang6@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Amir Goldstein
mainline inclusion from mainline-v5.7-rc1 commit 735c907d category: bugfix bugzilla: NA CVE: NA ------------------------------------------------- ovl_inode_update() is no longer called from the create object code path. Fixes: 01b39dcc ("ovl: use inode_insert5() to hash a newly...") Signed-off-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Signed-off-by: Zheng Liang <zhengliang6@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Cheng Jian
ascend inclusion category: feature bugzilla: 46922, https://gitee.com/openeuler/kernel/issues/I4EHJ6 CVE: NA ------------------------------------- This reverts commit 1948fd64. Commit 1948fd64 ("cache: Workaround HiSilicon Taishan DC CVAU") breaks the kABI symbols cpu_hwcaps and cpu_hwcap_keys, so just revert it for now. Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
-
Committed by Cheng Jian
ascend inclusion category: feature bugzilla: 46922, https://gitee.com/openeuler/kernel/issues/I4EHJ6 CVE: NA ------------------------------------- This reverts commit d6ec90ce. Commit 1948fd64 ("cache: Workaround HiSilicon Taishan DC CVAU") breaks the kABI symbols cpu_hwcaps and cpu_hwcap_keys, so just revert it for now. Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
-
- 18 Oct, 2021 8 commits
-
-
Committed by Iwona Winiarska
stable inclusion from linux-4.19.207 commit 9c8891b638319ddba9cfa330247922cd960c95b0 CVE: CVE-2021-42252 -------------------------------- commit b49a0e69 upstream. The check mixes pages (vm_pgoff) with bytes (vm_start, vm_end) on one side of the comparison, and uses the resource address (rather than just the resource size) on the other side of the comparison. This can allow malicious userspace to easily bypass the boundary check and map pages that are located outside the memory region reserved by the driver. Fixes: 6c4e9767 ("drivers/misc: Add Aspeed LPC control driver") Cc: stable@vger.kernel.org Signed-off-by: Iwona Winiarska <iwona.winiarska@intel.com> Reviewed-by: Andrew Jeffery <andrew@aj.id.au> Tested-by: Andrew Jeffery <andrew@aj.id.au> Reviewed-by: Joel Stanley <joel@aj.id.au> Signed-off-by: Joel Stanley <joel@jms.id.au> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
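A consistent form of the boundary check described above might look like the following userspace sketch. It is not the driver's code: the function name and the 4 KiB page size are assumptions, and plain integers stand in for the fields of the kernel's vm_area_struct. The point it illustrates is that every term is converted to pages and only the region's size, never its base address, enters the comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12u /* assumption: 4 KiB pages */

/*
 * Hypothetical check: does a mapping of [vm_start, vm_end) at page
 * offset vm_pgoff fit inside a reserved region of region_size bytes?
 */
bool sketch_mapping_in_bounds(uint64_t vm_start, uint64_t vm_end,
                              uint64_t vm_pgoff, uint64_t region_size)
{
    uint64_t map_pages = (vm_end - vm_start) >> SKETCH_PAGE_SHIFT;
    uint64_t region_pages = region_size >> SKETCH_PAGE_SHIFT;

    /* Compare by subtraction so a huge offset cannot wrap around. */
    if (vm_pgoff > region_pages)
        return false;
    return map_pages <= region_pages - vm_pgoff;
}
```

Writing the test as a subtraction rather than `vm_pgoff + map_pages` is a choice of this sketch; it keeps an attacker-supplied offset from overflowing the sum.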
-
Committed by Kefeng Wang
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AHP2 CVE: NA ------------------------------------------------- Fix some format issues in mm/mmap.c. Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Xiongfeng Wang
hulk inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AHP2 CVE: NA ------------------------------------------------- When userswap is enabled, the memory pointed to by 'pages' is not freed in the error branches of do_mmap(). To fix the issue and keep do_mmap() mostly unchanged, we rename do_mmap() to __do_mmap() and move the memory allocation and free code out of __do_mmap(). When __do_mmap() returns an error value, we jump to the error label to free the memory. Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com> Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
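The wrapper-plus-error-label shape described above is a common C cleanup pattern, sketched here in userspace. All names are hypothetical and malloc()/free() stand in for the kernel allocation; the message does not say whether the real code keeps the buffer on success, so this sketch simply releases it on every path.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

int sketch_live_buffers; /* buffers allocated but not yet freed */

/* Hypothetical inner worker: never allocates or frees the buffer. */
long sketch_inner_map(void *pages, int fail)
{
    (void)pages;
    return fail ? -EINVAL : 0x1000;
}

/* Hypothetical outer wrapper: owns the buffer for the whole call. */
long sketch_outer_map(int fail)
{
    long ret;
    void *pages = malloc(64);

    if (pages == NULL)
        return -ENOMEM;
    sketch_live_buffers++;

    ret = sketch_inner_map(pages, fail);
    if (ret < 0)
        goto out_free;

    /* Success path: nothing more to do in this sketch. */
out_free:
    free(pages);
    sketch_live_buffers--;
    return ret;
}
```

Because ownership sits in one place, the inner function can return early from any branch without leaking the buffer.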
-
Committed by wenzhiwei11
kylin inclusion category: bugfix bugzilla: https://gitee.com/openeuler/kernel/issues/I4AHUL CVE: NA --------------------------------------------------- Initialize the value "ret" in "schemata_list_init()". Signed-off-by: wenzhiwei11 <wenzhiwei@kylinos.cn> # openEuler_contributor Reviewed-by: Wang ShaoBo <bobo.shaobowang@huawei.com> Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com> Signed-off-by: Cheng Jian <cj.chengjian@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Trond Myklebust
mainline inclusion from mainline-v5.7-rc4 commit 9c07b75b category: bugfix bugzilla: 182252 CVE: NA ----------------------------------------------- The struct nfs_server gets put on the cl_superblocks list before the server->super field has been initialised, in which case the call to nfs_sb_active() will Oops. Add a check to ensure that we skip such a list entry. Fixes: 3c9e502b ("NFS: Add a helper nfs_client_for_each_server()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
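The skip described above can be pictured with a small list walk. This is a sketch with hypothetical types and names, not the NFS client code: a singly linked list stands in for cl_superblocks, and a NULL `super` pointer stands in for a server whose superblock has not been set up yet.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a server entry on the client's list. */
struct sketch_server {
    struct sketch_server *next;
    void *super; /* NULL until the superblock has been initialised */
};

/*
 * Apply fn to every entry whose super pointer is set, skipping the
 * half-initialised ones. Stops early if fn returns non-zero.
 */
int sketch_for_each_server(struct sketch_server *head,
                           int (*fn)(struct sketch_server *, void *),
                           void *data)
{
    int ret = 0;

    for (struct sketch_server *s = head; s != NULL; s = s->next) {
        if (s->super == NULL)
            continue; /* not ready yet: touching it would be unsafe */
        ret = fn(s, data);
        if (ret != 0)
            break;
    }
    return ret;
}

/* Example callback: counts the entries it is handed. */
int sketch_count(struct sketch_server *s, void *data)
{
    (void)s;
    ++*(int *)data;
    return 0;
}
```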
-
Committed by Trond Myklebust
mainline inclusion from mainline-v5.7-rc1 commit af3b61bf category: bugfix bugzilla: 182252 CVE: NA ----------------------------------------------- Convert it to use the nfs_client_for_each_server() helper, and make it more efficient by skipping delegations for inodes we know are in the process of being freed. Also improve the efficiency of the cursor by skipping delegations that are being freed. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Trond Myklebust
mainline inclusion from mainline-v5.7-rc1 commit 3c9e502b category: bugfix bugzilla: 182252 CVE: NA ----------------------------------------------- Add a helper nfs_client_for_each_server() to iterate through all the filesystems that are attached to a struct nfs_client, and apply a function to all the active ones. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com> Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Zhihao Cheng
mainline inclusion from mainline-5.15-rc3 commit 5afedf67 category: bugfix bugzilla: 181454 CVE: NA --------------------------- There is a use-after-free problem triggered by the following sequence between P1 (sda) and P2 (sdb): P2: echo 0 > /sys/block/sdb/trace/enable -> blk_trace_remove_queue -> synchronize_rcu -> blk_trace_free -> relay_close; P1: rcu_read_lock -> __blk_add_trace -> trace_note_tsk (iterate running_trace_list); P2: relay_close_buf -> relay_destroy_buf -> kfree(buf); P1: trace_note(sdb's bt) -> relay_reserve -> buf->offset <- NULL pointer dereference (use-after-free) !!! -> rcu_read_unlock. [ 502.714379] BUG: kernel NULL pointer dereference, address: 0000000000000010 [ 502.715260] #PF: supervisor read access in kernel mode [ 502.715903] #PF: error_code(0x0000) - not-present page [ 502.716546] PGD 103984067 P4D 103984067 PUD 17592b067 PMD 0 [ 502.717252] Oops: 0000 [#1] SMP [ 502.720308] RIP: 0010:trace_note.isra.0+0x86/0x360 [ 502.732872] Call Trace: [ 502.733193] __blk_add_trace.cold+0x137/0x1a3 [ 502.733734] blk_add_trace_rq+0x7b/0xd0 [ 502.734207] blk_add_trace_rq_issue+0x54/0xa0 [ 502.734755] blk_mq_start_request+0xde/0x1b0 [ 502.735287] scsi_queue_rq+0x528/0x1140 ... [ 502.742704] sg_new_write.isra.0+0x16e/0x3e0 [ 502.747501] sg_ioctl+0x466/0x1100 Reproduce method: ioctl(/dev/sda, BLKTRACESETUP, blk_user_trace_setup[buf_size=127]) ioctl(/dev/sda, BLKTRACESTART) ioctl(/dev/sdb, BLKTRACESETUP, blk_user_trace_setup[buf_size=127]) ioctl(/dev/sdb, BLKTRACESTART) echo 0 > /sys/block/sdb/trace/enable & // Add delay(mdelay/msleep) before kernel enters blk_trace_free() ioctl$SG_IO(/dev/sda, SG_IO, ...) // Enters trace_note_tsk() after blk_trace_free() returned // Use mdelay in rcu region rather than msleep(which may schedule out) Remove blk_trace from running_list before calling blk_trace_free() via sysfs if blk_trace is in the Blktrace_running state. Fixes: c71a8961 ("blktrace: add ftrace plugin") Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Link: https://lore.kernel.org/r/20210923134921.109194-1-chengzhihao1@huawei.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
- 15 Oct, 2021 9 commits
-
-
Committed by Pavel Begunkov
mainline inclusion from mainline-5.12-rc1 commit 792bb6eb category: bugfix bugzilla: 182869 CVE: NA --------------------------- [ 97.866748] a.out/2890 is trying to acquire lock: [ 97.867829] ffff8881046763e8 (&ctx->uring_lock){+.+.}-{3:3}, at: io_wq_submit_work+0x155/0x240 [ 97.869735] [ 97.869735] but task is already holding lock: [ 97.871033] ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 97.873074] [ 97.873074] other info that might help us debug this: [ 97.874520] Possible unsafe locking scenario: [ 97.874520] [ 97.875845] CPU0 [ 97.876440] ---- [ 97.877048] lock(&ctx->uring_lock); [ 97.877961] lock(&ctx->uring_lock); [ 97.878881] [ 97.878881] *** DEADLOCK *** [ 97.878881] [ 97.880341] May be due to missing lock nesting notation [ 97.880341] [ 97.881952] 1 lock held by a.out/2890: [ 97.882873] #0: ffff88810dfe0be8 (&ctx->uring_lock){+.+.}-{3:3}, at: __x64_sys_io_uring_enter+0x3f0/0x5b0 [ 97.885108] [ 97.885108] stack backtrace: [ 97.890457] Call Trace: [ 97.891121] dump_stack+0xac/0xe3 [ 97.891972] __lock_acquire+0xab6/0x13a0 [ 97.892940] lock_acquire+0x2c3/0x390 [ 97.894894] __mutex_lock+0xae/0x9f0 [ 97.901101] io_wq_submit_work+0x155/0x240 [ 97.902112] io_wq_cancel_cb+0x162/0x490 [ 97.904126] io_async_find_and_cancel+0x3b/0x140 [ 97.905247] io_issue_sqe+0x86d/0x13e0 [ 97.909122] __io_queue_sqe+0x10b/0x550 [ 97.913971] io_queue_sqe+0x235/0x470 [ 97.914894] io_submit_sqes+0xcce/0xf10 [ 97.917872] __x64_sys_io_uring_enter+0x3fb/0x5b0 [ 97.921424] do_syscall_64+0x2d/0x40 [ 97.922329] entry_SYSCALL_64_after_hwframe+0x44/0xa9 While holding uring_lock, e.g. from inline execution, async cancel request may attempt cancellations through io_wq_submit_work, which may try to grab a lock. Delay it to task_work, so we do it from a clean context and don't have to worry about locking. 
Cc: <stable@vger.kernel.org> # 5.5+ Fixes: c07e6719 ("io_uring: hold uring_lock while completing failed polled io in io_wq_submit_work()") Reported-by: Abaci <abaci@linux.alibaba.com> Reported-by: Hao Xu <haoxu@linux.alibaba.com> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Conflicts: fs/io_uring.c [ eab30c4d ("io_uring: deduplicate failing task_work_add") is not applied. 87ceb6a6 ("io_uring: drop 'ctx' ref on task work cancelation") is not applied. 91989c70 ("task_work: cleanup notification modes") is not applied. ] Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Xiaoguang Wang
mainline inclusion from mainline-5.11-rc1 commit c07e6719 category: bugfix bugzilla: 182869 CVE: NA --------------------------- io_iopoll_complete() does not hold completion_lock to complete polled io, so in io_wq_submit_work() we cannot call io_req_complete() directly to complete polled io; otherwise there may be concurrent access to cqring, defer_list, etc., which is not safe. Commit dad1b124 ("io_uring: always let io_iopoll_complete() complete polled io") has fixed this issue, but Pavel reported that IOPOLL, apart from rw, can also do buf reg/unreg requests (IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not good. Given that io_iopoll_complete() is always called under uring_lock, for polled io we can also take uring_lock here to fix this issue. Fixes: dad1b124 ("io_uring: always let io_iopoll_complete() complete polled io") Cc: <stable@vger.kernel.org> # 5.5+ Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: don't deref 'req' after completing it'] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com> Reviewed-by: Yang Erkun <yangerkun@huawei.com> Reviewed-by: Zhang Yi <yi.zhang@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Laibin Qiu
hulk inclusion category: bugfix bugzilla: 182666 CVE: NA -------------------------- KASAN reports a use-after-free report when doing block test: [293762.535116] ================================================================== [293762.535129] BUG: KASAN: use-after-free in queued_spin_lock_slowpath+0x78/0x4c8 [293762.535133] Write of size 2 at addr ffff8000d5f12bc8 by task sh/9148 [293762.535135] [293762.535145] CPU: 1 PID: 9148 Comm: sh Kdump: loaded Tainted: G W 4.19.90-vhulk2108.6.0.h824.kasan.eulerosv2r10.aarch64 #1 [293762.535148] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015 [293762.535150] Call trace: [293762.535154] dump_backtrace+0x0/0x310 [293762.535158] show_stack+0x28/0x38 [293762.535165] dump_stack+0xec/0x15c [293762.535172] print_address_description+0x68/0x2d0 [293762.535177] kasan_report+0x130/0x2f0 [293762.535182] __asan_store2+0x80/0xa8 [293762.535189] queued_spin_lock_slowpath+0x78/0x4c8 [293762.535194] __ioc_clear_queue+0x158/0x160 [293762.535198] ioc_clear_queue+0x194/0x258 [293762.535202] elevator_switch_mq+0x64/0x170 [293762.535206] elevator_switch+0x140/0x270 [293762.535211] elv_iosched_store+0x1a4/0x2a0 [293762.535215] queue_attr_store+0x90/0xe0 [293762.535219] sysfs_kf_write+0xa8/0xe8 [293762.535222] kernfs_fop_write+0x1f8/0x378 [293762.535227] __vfs_write+0xe0/0x360 [293762.535233] vfs_write+0xf0/0x270 [293762.535237] ksys_write+0xdc/0x1b8 [293762.535241] __arm64_sys_write+0x50/0x60 [293762.535245] el0_svc_common+0xc8/0x320 [293762.535250] el0_svc_handler+0xf8/0x160 [293762.535253] el0_svc+0x10/0x218 [293762.535254] [293762.535258] Allocated by task 3466763: [293762.535264] kasan_kmalloc+0xe0/0x190 [293762.535269] kasan_slab_alloc+0x14/0x20 [293762.535276] kmem_cache_alloc_node+0x1b4/0x420 [293762.535280] create_task_io_context+0x40/0x210 [293762.535284] generic_make_request_checks+0xc78/0xe38 [293762.535288] generic_make_request+0xf8/0x640 [293762.535394] generic_file_direct_write+0x100/0x268 [293762.535401] 
__generic_file_write_iter+0x128/0x370 [293762.535467] vfs_iter_write+0x64/0x90 [293762.535489] ovl_write_iter+0x2f8/0x458 [overlay] [293762.535493] __vfs_write+0x264/0x360 [293762.535497] vfs_write+0xf0/0x270 [293762.535501] ksys_write+0xdc/0x1b8 [293762.535505] __arm64_sys_write+0x50/0x60 [293762.535509] el0_svc_common+0xc8/0x320 [293762.535387] ext4_direct_IO+0x3c8/0xe80 [ext4] [293762.535394] generic_file_direct_write+0x100/0x268 [293762.535401] __generic_file_write_iter+0x128/0x370 [293762.535452] ext4_file_write_iter+0x610/0xa80 [ext4] [293762.535457] do_iter_readv_writev+0x28c/0x390 [293762.535463] do_iter_write+0xfc/0x360 [293762.535467] vfs_iter_write+0x64/0x90 [293762.535489] ovl_write_iter+0x2f8/0x458 [overlay] [293762.535493] __vfs_write+0x264/0x360 [293762.535497] vfs_write+0xf0/0x270 [293762.535501] ksys_write+0xdc/0x1b8 [293762.535505] __arm64_sys_write+0x50/0x60 [293762.535509] el0_svc_common+0xc8/0x320 [293762.535513] el0_svc_handler+0xf8/0x160 [293762.535517] el0_svc+0x10/0x218 [293762.535521] [293762.535523] Freed by task 3466763: [293762.535528] __kasan_slab_free+0x120/0x228 [293762.535532] kasan_slab_free+0x10/0x18 [293762.535536] kmem_cache_free+0x68/0x248 [293762.535540] put_io_context+0x104/0x190 [293762.535545] put_io_context_active+0x204/0x2c8 [293762.535549] exit_io_context+0x74/0xa0 [293762.535553] do_exit+0x658/0xae0 [293762.535557] do_group_exit+0x74/0x1a8 [293762.535561] get_signal+0x21c/0xf38 [293762.535564] do_signal+0x10c/0x450 [293762.535568] do_notify_resume+0x224/0x4b0 [293762.535573] work_pending+0x8/0x10 [293762.535574] [293762.535578] The buggy address belongs to the object at ffff8000d5f12bb8 which belongs to the cache blkdev_ioc of size 136 [293762.535582] The buggy address is located 16 bytes inside of 136-byte region [ffff8000d5f12bb8, ffff8000d5f12c40) [293762.535583] The buggy address belongs to the page: [293762.535588] page:ffff7e000357c480 count:1 mapcount:0 mapping:ffff8000d8563c00 index:0x0 [293762.536201] flags: 
0x7ffff0000000100(slab) [293762.536540] raw: 07ffff0000000100 ffff7e0003118588 ffff8000d8adb530 ffff8000d8563c00 [293762.536546] raw: 0000000000000000 0000000000140014 00000001ffffffff 0000000000000000 [293762.536551] page dumped because: kasan: bad access detected [293762.536552] [293762.536554] Memory state around the buggy address: [293762.536558] ffff8000d5f12a80: 00 00 00 00 00 00 fc fc fc fc fc fc fc fc fb fb [293762.536562] ffff8000d5f12b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fc [293762.536566] >ffff8000d5f12b80: fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb fb [293762.536568] ^ [293762.536572] ffff8000d5f12c00: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc [293762.536576] ffff8000d5f12c80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [293762.536577] ================================================================== ioc_release_fn() destroys icqs from ioc->icq_list, and __ioc_clear_queue() destroys icqs from request_queue->icq_list. However, ioc_release_fn() first takes the ioc lock and finally frees the ioc. __ioc_clear_queue() then gets the ioc from the icq and takes the ioc lock, but the ioc has already been released, which results in a use-after-free. CPU0: put_io_context -> queue_work &ioc->release_work; CPU1: elevator_switch_mq -> ioc_clear_queue (splice q->icq_list) -> __ioc_clear_queue (get icq from icq_list) -> get ioc from icq->ioc; CPU0: ioc_release_fn -> spin_lock(ioc->lock) -> ioc_destroy_icq(icq) -> spin_unlock(ioc->lock) -> free(ioc); CPU1: spin_lock(ioc->lock) <= UAF. Fix by grabbing the request_queue->queue_lock in ioc_clear_queue() to avoid this race. Signed-off-by: Laibin Qiu <qiulaibin@huawei.com> Link: https://lore.kernel.org/lkml/1c9ad9f2-c487-c793-1ffc-5c3ec0fcc0ae@kernel.dk/ Conflicts: block/blk-ioc.c Reviewed-by: Jason Yan <yanaijie@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Zhou Guanghui
hulk inclusion category: feature bugzilla: https://gitee.com/openeuler/kernel/issues/I4D63I CVE: NA ------------------------------------------------- If the SMMU frequently reports a large number of events, the events in the event queue cannot be processed in time. As a result, the while loop cannot exit. So add a cond_resched() to avoid a softlockup. Signed-off-by: Zhou Guanghui <zhouguanghui1@huawei.com> Signed-off-by: Guo Mengqi <guomengqi3@huawei.com> Reviewed-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Eric Dumazet
mainline inclusion from mainline-v5.5-rc1 commit b60fa1c5 category: bugfix bugzilla: 182865 CVE: NA ------------------------------------------------- The introduction of this schedule point was done in commit 2ba2506c ("[NET]: Add preemption point in qdisc_run") at a time the loop was not bounded. Then later in commit d5b8aa1d ("net_sched: fix dequeuer fairness") we added a limit on the number of packets. Now is the time to remove the schedule point, since the default limit of 64 packets matches the number of packets a typical NAPI poll can process in a row. This solves a latency problem for most TCP receivers under moderate load: 1) host receives a packet. NET_RX_SOFTIRQ is raised by NIC hard IRQ handler 2) __do_softirq() does its first loop, handling NET_RX_SOFTIRQ and calling the driver napi->loop() function 3) TCP stores the skb in socket receive queue. 4) TCP calls sk->sk_data_ready() and wakes up a user thread waiting for EPOLLIN (as a result, need_resched() might now be true) 5) TCP cooks an ACK and sends it. 6) qdisc_run() processes one packet from qdisc, and sees need_resched(), this raises NET_TX_SOFTIRQ (even if there are no more packets in the qdisc) Then we go back to the __do_softirq() in 2), and we see that new softirqs were raised. Since need_resched() is true, we end up waking ksoftirqd in this path: if (pending) { if (time_before(jiffies, end) && !need_resched() && --max_restart) goto restart; wakeup_softirqd(); } So we have many wakeups of ksoftirqd kernel threads, and more calls to qdisc_run() with associated lock overhead. Note that another way to solve the issue would be to change TCP to first send the ACK packet, then signal the EPOLLIN, but this changes P99 latencies, as sending the ACK packet can add a long delay. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Lu Wei <luwei32@huawei.com> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
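A loop bounded only by a packet quota, with no scheduling check inside it, can be modelled as follows. This is a toy sketch, not the kernel's qdisc code: the struct and function are hypothetical, one packet is sent per iteration, and the quota of 64 is the default limit quoted in the message above. As a simplification, the model asks to be run again only when packets are actually left over.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_TX_WEIGHT 64 /* default packet limit quoted above */

/* Hypothetical queue model: a backlog count and a "run me again" flag. */
struct sketch_qdisc {
    int  backlog;
    bool rescheduled;
};

/*
 * Send packets until the queue is empty or the quota is used up.
 * Returns the number of packets sent in this run.
 */
int sketch_qdisc_run(struct sketch_qdisc *q)
{
    int sent = 0;

    q->rescheduled = false;
    while (q->backlog > 0) {
        if (sent == SKETCH_TX_WEIGHT) {
            q->rescheduled = true; /* leftover work is deferred */
            break;
        }
        q->backlog--;
        sent++;
    }
    return sent;
}
```

In this model a short queue never defers work, which mirrors the message's point that a single queued ACK should not cause a softirq to be raised again.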
-
Committed by Wen Gong
mainline inclusion from mainline-v5.13-rc4 commit 0dc267b1 category: bugfix bugzilla: 181870 CVE: CVE-2020-26141 ------------------------------------------------- TKIP Michael MIC was not verified properly for PCIe cases since the validation steps in ieee80211_rx_h_michael_mic_verify() in mac80211 did not get fully executed due to unexpected flag values in ieee80211_rx_status. Fix this by setting the flags property to meet mac80211 expectations for performing Michael MIC validation there. This fixes CVE-2020-26141. It does the same as ath10k_htt_rx_proc_rx_ind_hl() for SDIO which passed MIC verification case. This applies only to QCA6174/QCA9377 PCIe. Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1 Cc: stable@vger.kernel.org Signed-off-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Link: https://lore.kernel.org/r/20210511200110.c3f1d42c6746.I795593fcaae941c471425b8c7d5f7bb185d29142@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Wen Gong
mainline inclusion from mainline-v5.13-rc4 commit 65c415a1 category: bugfix bugzilla: 181870 CVE: CVE-2020-26145 ------------------------------------------------- Fragmentation is not used with multicast frames. Discard unexpected fragments with multicast DA. This fixes CVE-2020-26145. Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1 Cc: stable@vger.kernel.org Signed-off-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Link: https://lore.kernel.org/r/20210511200110.5a0bd289bda8.Idd6ebea20038fb1cfee6de924aa595e5647c9eae@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
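The rule stated above, that a fragment addressed to a multicast destination should be discarded, can be sketched against the generic 802.11 header fields. This does not reproduce the ath10k change, which works on the driver's own receive descriptors; the function names are hypothetical, and the frame control and sequence control values are assumed to be already converted to host byte order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_FCTL_MOREFRAGS 0x0400u /* More Fragments bit in frame control */
#define SKETCH_SCTL_FRAG      0x000Fu /* fragment number in sequence control */

/* A destination address is group-addressed if bit 0 of octet 0 is set. */
bool sketch_is_multicast(const uint8_t da[6])
{
    return (da[0] & 0x01u) != 0;
}

/* A frame is a fragment if more fragments follow or its number is non-zero. */
bool sketch_is_fragment(uint16_t frame_control, uint16_t seq_ctrl)
{
    return (frame_control & SKETCH_FCTL_MOREFRAGS) != 0 ||
           (seq_ctrl & SKETCH_SCTL_FRAG) != 0;
}

/* Hypothetical policy helper: drop fragments sent to a multicast DA. */
bool sketch_drop_frame(const uint8_t da[6], uint16_t frame_control,
                       uint16_t seq_ctrl)
{
    return sketch_is_multicast(da) &&
           sketch_is_fragment(frame_control, seq_ctrl);
}
```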
-
Committed by Wen Gong
mainline inclusion from mainline-v5.13-rc4 commit a1166b26 category: bugfix bugzilla: 181870 CVE: CVE-2020-26145 ------------------------------------------------- The PN replay check for non-fragmented frames is done in the firmware, but this was not done for fragmented frames when ath10k is used with QCA6174/QCA6377 PCIe. mac80211 has the function ieee80211_rx_h_defragment() for the PN replay check of fragmented frames, but this does not get checked with QCA6174 due to the ieee80211_has_protected() condition not matching the cleared Protected bit case. Validate the PN of received fragmented frames within ath10k when CCMP is used and drop the fragment if the PN is not correct (incremented by exactly one from the previous fragment). This applies only for QCA6174/QCA6377 PCIe. Tested-on: QCA6174 hw3.2 PCI WLAN.RM.4.4.1-00110-QCARMSWP-1 Cc: stable@vger.kernel.org Signed-off-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Jouni Malinen <jouni@codeaurora.org> Link: https://lore.kernel.org/r/20210511200110.9ba2664866a4.I756e47b67e210dba69966d989c4711ffc02dc6bc@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
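The acceptance rule quoted above (each fragment's PN must be exactly one more than the previous fragment's) can be sketched as below. The function names are hypothetical and this is not the ath10k implementation; the byte positions follow the standard 8-byte CCMP header layout (PN0, PN1, reserved, key ID octet, PN2 to PN5).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assemble the 48-bit packet number from an 8-byte CCMP header. */
uint64_t sketch_ccmp_pn(const uint8_t hdr[8])
{
    return (uint64_t)hdr[0] |
           ((uint64_t)hdr[1] << 8) |
           ((uint64_t)hdr[4] << 16) |
           ((uint64_t)hdr[5] << 24) |
           ((uint64_t)hdr[6] << 32) |
           ((uint64_t)hdr[7] << 40);
}

/* A follow-up fragment is accepted only if its PN is prev_pn + 1. */
bool sketch_fragment_pn_ok(uint64_t prev_pn, uint64_t pn)
{
    return pn == prev_pn + 1;
}
```

A replayed fragment (same PN) and a fragment that skips ahead are both rejected by this check, so the caller would drop them.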
-
Committed by Wen Gong
mainline inclusion from mainline-v5.3-rc1 commit e1bddde9 category: bugfix bugzilla: 181870 CVE: CVE-2020-26145 ------------------------------------------------- Add the struct for PN replay protection and fragment packet handler. Also fix the bitmask of HTT_RX_DESC_HL_INFO_MCAST_BCAST to match what's currently used by SDIO firmware. The defines are not used yet so it's safe to modify them. Remove the conflicting HTT_RX_DESC_HL_INFO_FRAGMENT as it's not used in ath10k either. Tested on QCA6174 SDIO with firmware WLAN.RMH.4.4.1-00007-QCARMSWP-1. Signed-off-by: Wen Gong <wgong@codeaurora.org> Signed-off-by: Kalle Valo <kvalo@codeaurora.org> conflict: drivers/net/wireless/ath/ath10k/htt.h Signed-off-by: Wang Hai <wanghai38@huawei.com> Reviewed-by: Yue Haibing <yuehaibing@huawei.com> Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-