- 27 9月, 2022 2 次提交
-
-
由 Ard Biesheuvel 提交于
Use a EFI configuration table to pass the initrd to the core kernel, instead of per-arch methods. This cleans up the code considerably, and should make it easier for architectures to get rid of their reliance on DT for doing EFI boot in the future. Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
由 Ard Biesheuvel 提交于
Currently, struct efi_boot_memmap is a struct that is passed around between callers of efi_get_memory_map() and the users of the resulting data, and which carries pointers to various variables whose values are provided by the EFI GetMemoryMap() boot service. This is overly complex, and it is much easier to carry these values in the struct itself. So turn the struct into one that carries these data items directly, including a flex array for the variable number of EFI memory descriptors that the boot service may return. Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
- 06 9月, 2022 1 次提交
-
-
由 Huacai Chen 提交于
This patch adds efistub booting support, which is the standard UEFI boot protocol for LoongArch to use. We use generic efistub, which means we can pass boot information (i.e., system table, memory map, kernel command line, initrd) via a light FDT and drop a lot of non-standard code. We use a flat mapping to map the efi runtime in the kernel's address space. In efi, VA = PA; in kernel, VA = PA + PAGE_OFFSET. As a result, flat mapping is not identity mapping, SetVirtualAddressMap() is still needed for the efi runtime. Tested-by: NXi Ruoyao <xry111@xry111.site> Signed-off-by: NHuacai Chen <chenhuacai@loongson.cn> [ardb: change fpic to fpie as suggested by Xi Ruoyao] Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
-
- 15 8月, 2022 1 次提交
-
-
由 Yury Norov 提交于
Radix tree header includes gfp.h for __GFP_BITS_SHIFT only. Now we have gfp_types.h for this. Fixes powerpc allmodconfig build: In file included from include/linux/nodemask.h:97, from include/linux/mmzone.h:17, from include/linux/gfp.h:7, from include/linux/radix-tree.h:12, from include/linux/idr.h:15, from include/linux/kernfs.h:12, from include/linux/sysfs.h:16, from include/linux/kobject.h:20, from include/linux/pci.h:35, from arch/powerpc/kernel/prom_init.c:24: include/linux/random.h: In function 'add_latent_entropy': >> include/linux/random.h:25:46: error: 'latent_entropy' undeclared (first use in this function); did you mean 'add_latent_entropy'? 25 | add_device_randomness((const void *)&latent_entropy, sizeof(latent_entropy)); | ^~~~~~~~~~~~~~ | add_latent_entropy include/linux/random.h:25:46: note: each undeclared identifier is reported only once for each function it appears in Reported-by: Nkernel test robot <lkp@intel.com> CC: Andy Shevchenko <andriy.shevchenko@linux.intel.com> CC: Andrew Morton <akpm@linux-foundation.org> CC: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: NYury Norov <yury.norov@gmail.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 13 8月, 2022 2 次提交
-
-
由 Stefan Metzmacher 提交于
We need to make sure (at build time) that struct io_cmd_data is not casted to a structure that's larger. Signed-off-by: NStefan Metzmacher <metze@samba.org> Link: https://lore.kernel.org/r/c024cdf25ae19fc0319d4180e2298bade8ed17b8.1660201408.git.metze@samba.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Keith Busch 提交于
This is a size sensitive structure and randomizing can introduce extra padding that breaks io_uring's fixed size expectations. There are few fields here as it is, half of which need a fixed order to optimally pack, so the randomization isn't providing much. Suggested-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NKeith Busch <kbusch@kernel.org> Link: https://lore.kernel.org/io-uring/b6f508ca-b1b2-5f40-7998-e4cff1cf7212@kernel.dk/Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 11 8月, 2022 15 次提交
-
-
由 Eugenio Pérez 提交于
This operation is optional: It it's not implemented, backend feature bit will not be exposed. Signed-off-by: NEugenio Pérez <eperezma@redhat.com> Message-Id: <20220810171512.2343333-2-eperezma@redhat.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Eli Cohen 提交于
Implement the suspend callback allowing to suspend the virtqueues so they stop processing descriptors. This is required to allow to query a consistent state of the virtqueue while live migration is taking place. Signed-off-by: NEli Cohen <elic@nvidia.com> Message-Id: <20220714113927.85729-2-elic@nvidia.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Introduce helper virtio_find_vqs_ctx_size() to call find_vqs and specify the maximum size of each vq ring. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-37-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
find_vqs() adds a new parameter sizes to specify the size of each vq vring. NULL as sizes means that all queues in find_vqs() use the maximum size. A value in the array is 0, which means that the corresponding queue uses the maximum size. In the split scenario, the meaning of size is the largest size, because it may be limited by memory, the virtio core will try a smaller size. And the size is power of 2. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NHans de Goede <hdegoede@redhat.com> Reviewed-by: NMathieu Poirier <mathieu.poirier@linaro.org> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-34-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Introduce new helpers to implement queue reset and get queue reset status. https://github.com/oasis-tcs/virtio-spec/issues/124 https://github.com/oasis-tcs/virtio-spec/issues/139Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-31-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Add queue_reset in virtio_pci_modern_common_cfg. https://github.com/oasis-tcs/virtio-spec/issues/124 https://github.com/oasis-tcs/virtio-spec/issues/139Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-30-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Introduce a new member reset to the structure virtqueue to determine whether the current vq is in the reset state. Subsequent patches will use it. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-29-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
This patch allows the new introduced __virtqueue_break()/__virtqueue_unbreak() to break/unbreak the virtqueue. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-27-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Add queue_notify_data in struct virtio_pci_common_cfg, which comes from here https://github.com/oasis-tcs/virtio-spec/issues/89 In order not to affect the API, add a dedicated structure struct virtio_pci_modern_common_cfg to virtio_pci_modern.h. Since I want to add queue_reset after queue_notify_data, I submitted this patch first. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-26-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Introduce virtqueue_resize() to implement the resize of vring. Based on these, the driver can dynamically adjust the size of the vring. For example: ethtool -G. virtqueue_resize() implements resize based on the vq reset function. In case of failure to allocate a new vring, it will give up resize and use the original vring. During this process, if the re-enable reset vq fails, the vq can no longer be used. Although the probability of this situation is not high. The parameter recycle is used to recycle the buffer that is no longer used. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-25-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
There is currently only one place to reference __vring_new_virtqueue() directly from the outside of virtio core. And here vring_new_virtqueue() can be used instead. Subsequent patches will modify __vring_new_virtqueue, so stop it as an export symbol for now. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-8-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
reset can be divided into the following four steps (example): 1. transport: notify the device to reset the queue 2. vring: recycle the buffer submitted 3. vring: reset/resize the vring (may re-alloc) 4. transport: mmap vring to device, and enable the queue In order to support queue reset, add two callbacks in struct virtio_config_ops to implement steps 1 and 4. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-3-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
virtio-net can display the maximum (supported by hardware) ring size in ethtool -g eth0. When the subsequent patch implements vring reset, it can judge whether the ring size passed by the driver is legal based on this. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Acked-by: NJason Wang <jasowang@redhat.com> Message-Id: <20220801063902.129329-2-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Xuan Zhuo 提交于
Rename the member len in the structure rpoc_vring to num. And remove 'in bytes' from the comment of it. This is misleading. Because this actually refers to the size of the virtio vring to be created. The unit is not bytes. Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com> Message-Id: <20220624025621.128843-2-xuanzhuo@linux.alibaba.com> Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
-
由 Hawkins Jiawei 提交于
Syzkaller reports refcount bug as follows: ------------[ cut here ]------------ refcount_t: saturated; leaking memory. WARNING: CPU: 1 PID: 3605 at lib/refcount.c:19 refcount_warn_saturate+0xf4/0x1e0 lib/refcount.c:19 Modules linked in: CPU: 1 PID: 3605 Comm: syz-executor208 Not tainted 5.18.0-syzkaller-03023-g7e062cda #0 <TASK> __refcount_add_not_zero include/linux/refcount.h:163 [inline] __refcount_inc_not_zero include/linux/refcount.h:227 [inline] refcount_inc_not_zero include/linux/refcount.h:245 [inline] sk_psock_get+0x3bc/0x410 include/linux/skmsg.h:439 tls_data_ready+0x6d/0x1b0 net/tls/tls_sw.c:2091 tcp_data_ready+0x106/0x520 net/ipv4/tcp_input.c:4983 tcp_data_queue+0x25f2/0x4c90 net/ipv4/tcp_input.c:5057 tcp_rcv_state_process+0x1774/0x4e80 net/ipv4/tcp_input.c:6659 tcp_v4_do_rcv+0x339/0x980 net/ipv4/tcp_ipv4.c:1682 sk_backlog_rcv include/net/sock.h:1061 [inline] __release_sock+0x134/0x3b0 net/core/sock.c:2849 release_sock+0x54/0x1b0 net/core/sock.c:3404 inet_shutdown+0x1e0/0x430 net/ipv4/af_inet.c:909 __sys_shutdown_sock net/socket.c:2331 [inline] __sys_shutdown_sock net/socket.c:2325 [inline] __sys_shutdown+0xf1/0x1b0 net/socket.c:2343 __do_sys_shutdown net/socket.c:2351 [inline] __se_sys_shutdown net/socket.c:2349 [inline] __x64_sys_shutdown+0x50/0x70 net/socket.c:2349 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x46/0xb0 </TASK> During SMC fallback process in connect syscall, kernel will replaces TCP with SMC. In order to forward wakeup smc socket waitqueue after fallback, kernel will sets clcsk->sk_user_data to origin smc socket in smc_fback_replace_callbacks(). Later, in shutdown syscall, kernel will calls sk_psock_get(), which treats the clcsk->sk_user_data as psock type, triggering the refcnt warning. So, the root cause is that smc and psock, both will use sk_user_data field. So they will mismatch this field easily. This patch solves it by using another bit(defined as SK_USER_DATA_PSOCK) in PTRMASK, to mark whether sk_user_data points to a psock object or not. This patch depends on a PTRMASK introduced in commit f1ff5ce2 ("net, sk_msg: Clear sk_user_data pointer on clone if tagged"). For there will possibly be more flags in the sk_user_data field, this patch also refactor sk_user_data flags code to be more generic to improve its maintainability. Reported-and-tested-by: syzbot+5f26f85569bd179c18ce@syzkaller.appspotmail.com Suggested-by: NJakub Kicinski <kuba@kernel.org> Acked-by: NWen Gu <guwen@linux.alibaba.com> Signed-off-by: NHawkins Jiawei <yin31149@gmail.com> Reviewed-by: NJakub Sitnicki <jakub@cloudflare.com> Signed-off-by: NJakub Kicinski <kuba@kernel.org>
-
- 10 8月, 2022 3 次提交
-
-
由 Mikulas Patocka 提交于
Let's have a look at this piece of code in __bread_slow: get_bh(bh); bh->b_end_io = end_buffer_read_sync; submit_bh(REQ_OP_READ, 0, bh); wait_on_buffer(bh); if (buffer_uptodate(bh)) return bh; Neither wait_on_buffer nor buffer_uptodate contain any memory barrier. Consequently, if someone calls sb_bread and then reads the buffer data, the read of buffer data may be executed before wait_on_buffer(bh) on architectures with weak memory ordering and it may return invalid data. Fix this bug by adding a memory barrier to set_buffer_uptodate and an acquire barrier to buffer_uptodate (in a similar way as folio_test_uptodate and folio_mark_uptodate). Signed-off-by: NMikulas Patocka <mpatocka@redhat.com> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Cc: stable@vger.kernel.org Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Trond Myklebust 提交于
Don't leak request pointers, but use the "device:inode" labelling that is used by all the other trace points. Furthermore, replace use of page indexes with an offset, again in order to align behaviour with other NFS trace points. Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com>
-
由 Youngmin Nam 提交于
In ns_to_kernel_old_timeval() definition, the function argument is defined with const identifier in kernel/time/time.c, but the prototype in include/linux/time32.h looks different. - The function is defined in kernel/time/time.c as below: struct __kernel_old_timeval ns_to_kernel_old_timeval(const s64 nsec) - The function is decalared in include/linux/time32.h as below: extern struct __kernel_old_timeval ns_to_kernel_old_timeval(s64 nsec); Because the variable of arithmethic types isn't modified in the calling scope, there's no need to mark arguments as const, which was already mentioned during review (Link[1) of the original patch. Likewise remove the "const" keyword in both definition and declaration of ns_to_timespec64() as requested by Arnd (Link[2]). Fixes: a84d1169 ("y2038: Introduce struct __kernel_old_timeval") Signed-off-by: NYoungmin Nam <youngmin.nam@samsung.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NArnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/all/20220712094715.2918823-1-youngmin.nam@samsung.com Link[1]: https://lore.kernel.org/all/20180310081123.thin6wphgk7tongy@gmail.com/ Link[2]: https://lore.kernel.org/all/CAK8P3a3nknJgEDESGdJH91jMj6R_xydFqWASd8r5BbesdvMBgA@mail.gmail.com/
-
- 09 8月, 2022 16 次提交
-
-
由 Al Viro 提交于
mechanical change; will be further massaged in subsequent commits Reviewed-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Most of the users immediately follow successful iov_iter_get_pages() with advancing by the amount it had returned. Provide inline wrappers doing that, convert trivial open-coded uses of those. BTW, iov_iter_get_pages() never returns more than it had been asked to; such checks in cifs ought to be removed someday... Reviewed-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
All their callers are next to each other; all of them want the total amount of pages and, possibly, the offset in the partial final buffer. Combine into a new helper (pipe_npages()), fix the bogosity in pipe_space_for_user(), while we are at it. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
We often need to find whether the last buffer is anon or not, and currently it's rather clumsy: check if ->iov_offset is non-zero (i.e. that pipe is not empty) if so, get the corresponding pipe_buffer and check its ->ops if it's &default_pipe_buf_ops, we have an anon buffer. Let's replace the use of ->iov_offset (which is nowhere near similar to its role for other flavours) with signed field (->last_offset), with the following rules: empty, no buffers occupied: 0 anon, with bytes up to N-1 filled: N zero-copy, with bytes up to N-1 filled: -N That way abs(i->last_offset) is equal to what used to be in i->iov_offset and empty vs. anon vs. zero-copy can be distinguished by the sign of i->last_offset. Checks for "should we extend the last buffer or should we start a new one?" become easier to follow that way. Note that most of the operations can only be done in a sane state - i.e. when the pipe has nothing past the current position of iterator. About the only thing that could be done outside of that state is iov_iter_advance(), which transitions to the sane state by truncating the pipe. There are only two cases where we leave the sane state: 1) iov_iter_get_pages()/iov_iter_get_pages_alloc(). Will be dealt with later, when we make get_pages advancing - the callers are actually happier that way. 2) iov_iter copied, then something is put into the copy. Since they share the underlying pipe, the original gets behind. When we decide that we are done with the copy (original is not usable until then) we advance the original. direct_io used to be done that way; nowadays it operates on the original and we do iov_iter_revert() to discard the excessive data. At the moment there's nothing in the kernel that could do that to ITER_PIPE iterators, so this reason for insane state is theoretical right now. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Al Viro 提交于
Equivalent of single-segment iovec. Initialized by iov_iter_ubuf(), checked for by iter_is_ubuf(), otherwise behaves like ITER_IOVEC ones. We are going to expose the things like ->write_iter() et.al. to those in subsequent commits. New predicate (user_backed_iter()) that is true for ITER_IOVEC and ITER_UBUF; places like direct-IO handling should use that for checking that pages we modify after getting them from iov_iter_get_pages() would need to be dirtied. DO NOT assume that replacing iter_is_iovec() with user_backed_iter() will solve all problems - there's code that uses iter_is_iovec() to decide how to poke around in iov_iter guts and for that the predicate replacement obviously won't suffice. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
由 Fabio M. De Francesco 提交于
kmap_local_page() should always be preferred in place of kmap() and kmap_atomic(). "Only use when really necessary." is not consistent with the Documentation/mm/highmem.rst and these kdocs it embeds. Therefore, delete the above-mentioned sentence from kdocs. Link: https://lkml.kernel.org/r/20220728154844.10874-7-fmdefrancesco@gmail.comSigned-off-by: NFabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: NIra Weiny <ira.weiny@intel.com> Reviewed-by: NIra Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Fabio M. De Francesco 提交于
In a recent thread about converting kmap() to kmap_local_page(), the safety of calling kmap_local_page() was questioned.[1] "any context" should probably be enough detail for users who want to know whether or not kmap_local_page() can be called from interrupts. However, Linux still has kmap_atomic() which might make users think they must use the latter in interrupts. Add "including interrupts" for better clarity. [1] https://lore.kernel.org/lkml/3187836.aeNJFYEL58@opensuse/ Link: https://lkml.kernel.org/r/20220728154844.10874-3-fmdefrancesco@gmail.comSigned-off-by: NFabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: NIra Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Will Deacon <will@kernel.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Fabio M. De Francesco 提交于
Patch series "highmem: Extend kmap_local_page() documentation", v2. The Highmem interface is evolving and the current documentation does not reflect the intended uses of each of the calls. Furthermore, after a recent series of reworks, the differences of the calls can still be confusing and may lead to the expanded use of calls which are deprecated. This series is the second round of changes towards an enhanced documentation of the Highmem's interface; at this stage the patches are only focused to kmap_local_page(). In addition it also contains some minor clean ups. This patch (of 7): In the kdocs of kmap_local_page(), the description of @page starts after several unnecessary spaces. Therefore, remove those spaces. Link: https://lkml.kernel.org/r/20220728154844.10874-1-fmdefrancesco@gmail.com Link: https://lkml.kernel.org/r/20220728154844.10874-2-fmdefrancesco@gmail.comSigned-off-by: NFabio M. De Francesco <fmdefrancesco@gmail.com> Suggested-by: NIra Weiny <ira.weiny@intel.com> Reviewed-by: NIra Weiny <ira.weiny@intel.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Peter Collingbourne <pcc@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Now error handling code is prepared, so remove the blocking code and enable memory error handling on 1GB hugepage. Link: https://lkml.kernel.org/r/20220714042420.1847125-9-naoya.horiguchi@linux.devSigned-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
If memory_failure() fails to grab page refcount on a hugetlb page because it's busy, it returns without setting PG_hwpoison on it. This not only loses a chance of error containment, but breaks the rule that action_result() should be called only when memory_failure() do any of handling work (even if that's just setting PG_hwpoison). This inconsistency could harm code maintainability. So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case. Link: https://lkml.kernel.org/r/20220714042420.1847125-6-naoya.horiguchi@linux.dev Fixes: 405ce051 ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()") Signed-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: kernel test robot <lkp@intel.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
Raw error info list needs to be removed when hwpoisoned hugetlb is unpoisoned. And unpoison handler needs to know how many errors there are in the target hugepage. So add them. HPageVmemmapOptimized(hpage) and HPageRawHwpUnreliable(hpage)) sometimes can't be unpoisoned, so skip them. Link: https://lkml.kernel.org/r/20220714042420.1847125-5-naoya.horiguchi@linux.devSigned-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Nkernel test robot <lkp@intel.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Naoya Horiguchi 提交于
When handling memory error on a hugetlb page, the error handler tries to dissolve and turn it into 4kB pages. If it's successfully dissolved, PageHWPoison flag is moved to the raw error page, so that's all right. However, dissolve sometimes fails, then the error page is left as hwpoisoned hugepage. It's useful if we can retry to dissolve it to save healthy pages, but that's not possible now because the information about where the raw error pages is lost. Use the private field of a few tail pages to keep that information. The code path of shrinking hugepage pool uses this info to try delayed dissolve. In order to remember multiple errors in a hugepage, a singly-linked list originated from SUBPAGE_INDEX_HWPOISON-th tail page is constructed. Only simple operations (adding an entry or clearing all) are required and the list is assumed not to be very long, so this simple data structure should be enough. If we failed to save raw error info, the hwpoison hugepage has errors on unknown subpage, then this new saving mechanism does not work any more, so disable saving new raw error info and freeing hwpoison hugepages. Link: https://lkml.kernel.org/r/20220714042420.1847125-4-naoya.horiguchi@linux.devSigned-off-by: NNaoya Horiguchi <naoya.horiguchi@nec.com> Reported-by: Nkernel test robot <lkp@intel.com> Reviewed-by: NMiaohe Lin <linmiaohe@huawei.com> Cc: David Hildenbrand <david@redhat.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Muchun Song 提交于
All the comments which explains how HVO works are moved to vmemmap_dedup.rst since commit 4917f55b ("mm/sparse-vmemmap: improve memory savings for compound devmaps") except some comments above page_fixed_fake_head(). This commit moves those comments to vmemmap_dedup.rst and improve vmemmap_dedup.rst as well. Link: https://lkml.kernel.org/r/20220628092235.91270-8-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Muchun Song 提交于
There is a discussion about the name of hugetlb_vmemmap_alloc/free in thread [1]. The suggestion suggested by David is rename "alloc/free" to "optimize/restore" to make functionalities clearer to users, "optimize" means the function will optimize vmemmap pages, while "restore" means restoring its vmemmap pages discared before. This commit does this. Another discussion is the confusion RESERVE_VMEMMAP_NR isn't used explicitly for vmemmap_addr but implicitly for vmemmap_end in hugetlb_vmemmap_alloc/free. David suggested we can compute what hugetlb_vmemmap_init() does now at runtime. We do not need to worry for the overhead of computing at runtime since the calculation is simple enough and those functions are not in a hot path. This commit has the following improvements: 1) The function suffixed name ("optimize/restore") is more expressive. 2) The logic becomes less weird in hugetlb_vmemmap_optimize/restore(). 3) The hugetlb_vmemmap_init() does not need to be exported anymore. 4) A ->optimize_vmemmap_pages field in struct hstate is killed. 5) There is only one place where checks is_power_of_2(sizeof(struct page)) instead of two places. 6) Add more comments for hugetlb_vmemmap_optimize/restore(). 7) For external users, hugetlb_optimize_vmemmap_pages() is used for detecting if the HugeTLB's vmemmap pages is optimizable originally. In this commit, it is killed and we introduce a new helper hugetlb_vmemmap_optimizable() to replace it. The name is more expressive. Link: https://lore.kernel.org/all/20220404074652.68024-2-songmuchun@bytedance.com/ [1] Link: https://lkml.kernel.org/r/20220628092235.91270-7-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Oscar Salvador <osalvador@suse.de> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Muchun Song 提交于
When I first introduced vmemmap manipulation functions related to HugeTLB, I thought those functions may be reused by other modules (e.g. using similar approach to optimize vmemmap pages, unfortunately, the DAX used the same approach but does not use those functions). After two years, we didn't see any other users. So move those functions to hugetlb_vmemmap.c. Code movement without any functional change. Link: https://lkml.kernel.org/r/20220628092235.91270-5-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-
由 Muchun Song 提交于
It it inconvenient to mention the feature of optimizing vmemmap pages associated with HugeTLB pages when communicating with others since there is no specific or abbreviated name for it when it is first introduced. Let us give it a name HVO (HugeTLB Vmemmap Optimization) from now. This commit also updates the document about "hugetlb_free_vmemmap" by the way discussed in thread [1]. Link: https://lore.kernel.org/all/21aae898-d54d-cc4b-a11f-1bb7fddcfffa@redhat.com/ [1] Link: https://lkml.kernel.org/r/20220628092235.91270-4-songmuchun@bytedance.comSigned-off-by: NMuchun Song <songmuchun@bytedance.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Will Deacon <will@kernel.org> Cc: Xiongchun Duan <duanxiongchun@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
-