- 27 May 2020, 9 commits
-
-
Committed by Eugene Syromiatnikov
to #26323578 commit 1292e972fff2b2d81e139e0c2fe5f50249e78c58 upstream. fds field of struct io_uring_files_update is problematic with regards to compat user space, as pointer size is different in 32-bit, 32-on-64-bit, and 64-bit user space. In order to avoid custom handling of compat in the syscall implementation, make fds __u64 and use u64_to_user_ptr in order to retrieve it. Also, align the field naturally and check that no garbage is passed there. Fixes: c3a31e605620c279 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE") Signed-off-by: Eugene Syromiatnikov <esyr@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
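For illustration, a rough sketch of the resulting layout and pointer recovery, following the upstream uapi field names (files_update_sketch is an illustrative stand-in for the real syscall path):

```c
#include <linux/errno.h>
#include <linux/kernel.h>  /* u64_to_user_ptr() */
#include <linux/types.h>

/* One layout for 32-bit, 32-on-64-bit, and 64-bit user space. */
struct io_uring_files_update {
	__u32 offset;
	__u32 resv;        /* must be zero: reject garbage from user space */
	__aligned_u64 fds; /* really a __s32 __user *, naturally aligned */
};

/* Kernel side: recover the user pointer without compat special-casing. */
static int files_update_sketch(const struct io_uring_files_update *up)
{
	__s32 __user *fds;

	if (up->resv)
		return -EINVAL;
	fds = u64_to_user_ptr(up->fds);
	/* ... copy_from_user() each fd out of fds[] ... */
	return 0;
}
```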
-
Committed by Jens Axboe
to #26323578 commit 9e3aa61ae3e01ce1ce6361a41ef725e1f4d1d2bf upstream. If we submit an unknown opcode and have fd == -1, io_op_needs_file() will return true as we default to needing a file. Then when we go and assign the file, we find the 'fd' invalid and return -EBADF. We really should be returning -EINVAL for that case, as we normally do for unsupported opcodes. Change io_op_needs_file() to have the following return values: 0 - does not need a file 1 - does need a file < 0 - error value and use this to pass back the right value for this invalid case. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
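A sketch of the revised tri-state convention (the opcode lists are abbreviated for illustration, not the full upstream switch):

```c
#include <linux/errno.h>
#include <linux/io_uring.h>  /* uapi opcodes and struct io_uring_sqe */

/* Return values: 0 = needs no file, 1 = needs a file, <0 = error. */
static int io_op_needs_file_sketch(const struct io_uring_sqe *sqe)
{
	switch (sqe->opcode) {
	case IORING_OP_NOP:
	case IORING_OP_TIMEOUT:
		return 0;        /* no file backs this request */
	case IORING_OP_READV:
	case IORING_OP_WRITEV:
		return 1;        /* file required */
	default:
		return -EINVAL;  /* unknown opcode: -EINVAL, not -EBADF */
	}
}
```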
-
Committed by Jens Axboe
to #26323578 commit 4e88d6e7793f2f445f43bd608828541d7f43b608 upstream. Some commands will invariably end in a failure in the sense that the completion result will be less than zero. One such example is timeouts that don't have a completion count set, they will always complete with -ETIME unless cancelled. For linked commands, we sever links and fail the rest of the chain if the result is less than zero. Since we have commands where we know that will happen, add IOSQE_IO_HARDLINK as a stronger link that doesn't sever regardless of the completion result. Note that the link will still sever if we fail submitting the parent request, hard links are only resilient in the presence of completion results for requests that did submit correctly. Cc: stable@vger.kernel.org # v5.4 Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reported-by: 李通洲 <carter.li@eoitek.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
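To see the difference from user space, a minimal liburing sketch (assuming liburing and a kernel carrying this patch): a pure timeout always completes with -ETIME, which would sever a plain IOSQE_IO_LINK chain, while a hard link keeps the follow-up request alive:

```c
#include <liburing.h>

/* Queue a 1s timeout hard-linked to a no-op; the no-op still runs even
 * though the timeout "fails" with -ETIME. Error handling abbreviated. */
static int queue_timed_nop(struct io_uring *ring)
{
	struct __kernel_timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
	struct io_uring_sqe *sqe;

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;
	io_uring_prep_timeout(sqe, &ts, 0, 0); /* count 0: always -ETIME */
	sqe->flags |= IOSQE_IO_HARDLINK;       /* don't sever on failure */

	sqe = io_uring_get_sqe(ring);
	if (!sqe)
		return -1;
	io_uring_prep_nop(sqe);                /* still runs after -ETIME */

	return io_uring_submit(ring);
}
```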
-
Committed by Jens Axboe
to #26323578 commit da8c96906990f1108cb626ee7865e69267a3263b upstream. If this flag is set, applications can be certain that any data for async offload has been consumed when the kernel has consumed the SQE. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
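The flag in question is IORING_FEAT_SUBMIT_STABLE, reported via io_uring_params at ring setup; a small liburing probe, as a sketch:

```c
#include <liburing.h>
#include <string.h>

/* Returns 1 if SQE data may be recycled once the kernel has consumed
 * the SQE, 0 if not, -1 on setup failure. */
static int ring_has_stable_submit(void)
{
	struct io_uring ring;
	struct io_uring_params p;
	int stable;

	memset(&p, 0, sizeof(p));
	if (io_uring_queue_init_params(8, &ring, &p) < 0)
		return -1;
	stable = !!(p.features & IORING_FEAT_SUBMIT_STABLE);
	io_uring_queue_exit(&ring);
	return stable;
}
```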
-
Committed by Jens Axboe
to #26323578 commit f499a021ea8c9f70321fce3d674d8eca5bbeee2c upstream. Just like commit f67676d160c6 for read/write requests, this one ensures that the sockaddr data has been copied for IORING_OP_CONNECT if we need to punt the request to async context. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Committed by Jens Axboe
to #26323578 commit 03b1230ca12a12e045d83b0357792075bf94a1e0 upstream. Just like commit f67676d160c6 for read/write requests, this one ensures that the msghdr data is fully copied if we need to punt a recvmsg or sendmsg system call to async context. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Committed by Jens Axboe
to #26323578 commit f8e85cf255ad57d65eeb9a9d0e59e3dec55bdd9e upstream. This allows an application to call connect() in an async fashion. Like other opcodes, we first try a non-blocking connect, then punt to async context if we have to. Note that we can still return -EINPROGRESS, and in that case the caller should use IORING_OP_POLL_ADD to do an async wait for completion of the connect request (just like for regular connect(2), except we can do it async here too). Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
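A hedged liburing sketch of that pattern: submit IORING_OP_CONNECT, and if its CQE carries res == -EINPROGRESS, arm IORING_OP_POLL_ADD for writability (NULL checks elided for brevity):

```c
#include <liburing.h>
#include <poll.h>
#include <sys/socket.h>

/* Queue an async connect on sockfd. */
static void queue_connect(struct io_uring *ring, int sockfd,
			  const struct sockaddr *addr, socklen_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_connect(sqe, sockfd, addr, len);
}

/* On res == -EINPROGRESS, wait for the socket to become writable,
 * analogous to poll(2) after a nonblocking connect(2). */
static void queue_connect_poll(struct io_uring *ring, int sockfd)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_poll_add(sqe, sockfd, POLLOUT);
}
```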
-
Committed by Jens Axboe
to #26323578 commit bd3ded3146daa2cbb57ed353749ef99cf75371b0 upstream. This is identical to __sys_connect(), except it takes a struct file instead of an fd, and it also allows passing in extra file->f_flags flags. The latter is done to support masking in O_NONBLOCK without manipulating the original file flags. No functional changes in this patch. Cc: netdev@vger.kernel.org Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
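The helper's shape, sketched from the commit description (the exact parameter types are an assumption modeled on __sys_connect()):

```c
#include <linux/file.h>
#include <linux/socket.h>

/* Like __sys_connect(), but takes a struct file plus extra flags that
 * are OR'd with file->f_flags only for the duration of the call: */
int __sys_connect_file(struct file *file, struct sockaddr __user *uservaddr,
		       int addrlen, int file_flags);

/* e.g. a nonblocking attempt without touching the file's real flags:
 *	ret = __sys_connect_file(file, addr, addrlen, O_NONBLOCK);
 */
```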
-
Committed by Jens Axboe
to #26323578 commit 915967f69c591b34c5a18d6618af021a81ffd700 upstream. We don't have shadow requests anymore, so get rid of the shadow argument. Add the user_data argument, as that's often useful to easily match up requests, instead of having to look at request pointers. Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
- 15 May 2020, 1 commit
-
-
Committed by xuanzhuo
to #26353046 TcpRT: Instrument and Diagnostic Analysis System for Service Quality of Cloud Databases at Massive Scale in Real-time. It can also provide information for all request/response services, such as HTTP requests. This is the kernel framework for TcpRT; more work needs TcpRT module support. A TcpRT module should call tcp_unregister_rt before rmmod. TcpRT hooks are called when a sock is initialized, data is received, data is sent, packets are acked, and the socket is destroyed. The private data is saved to icsk->icsk_tcp_rt_priv. Reviewed-by: Cambda Zhu <cambda@linux.alibaba.com> Acked-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: xuanzhuo <xuanzhuo@linux.alibaba.com>
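A hypothetical sketch of such a hook framework; the struct, field, and function names below are illustrative (suffixed _sketch) — the real ones live in this tree's TcpRT patches and may differ:

```c
#include <linux/skbuff.h>
#include <net/sock.h>

/* Callbacks fired at the points the commit message lists. */
struct tcp_rt_ops_sketch {
	void (*init)(struct sock *sk);                  /* sock init */
	void (*recv_data)(struct sock *sk, struct sk_buff *skb);
	void (*send_data)(struct sock *sk, struct sk_buff *skb);
	void (*pkts_acked)(struct sock *sk, u32 acked); /* packet acked */
	void (*release)(struct sock *sk);               /* sock destroyed */
};

int tcp_register_rt_sketch(struct tcp_rt_ops_sketch *ops);
/* Per the commit message, a module must unregister before rmmod: */
void tcp_unregister_rt_sketch(struct tcp_rt_ops_sketch *ops);
```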
-
- 08 May 2020, 1 commit
-
-
Committed by Naoya Horiguchi
to #26809468 commit 907ec5fca3dc38d37737de826f06f25b063aa08e upstream. Patch series "mm: Fix for movable_node boot option", v3. This patch series contains a fix for the movable_node boot option issue which was introduced by commit 124049de ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved"). The commit breaks the option because it changed the memory gap range to reserved memblock. So, the node is marked as Normal zone even if the SRAT has Hot pluggable affinity. First and second patch fix the original issue which the commit tried to fix, then revert the commit. This patch (of 3): There is a kernel panic that is triggered when reading /proc/kpageflags on the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]': BUG: unable to handle kernel paging request at fffffffffffffffe PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0 Oops: 0000 [#1] SMP PTI CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014 RIP: 0010:stable_page_flags+0x27/0x3c0 Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 <48> 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7 RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202 RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000 RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0 RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001 R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0 R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10 FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0 Call Trace: kpageflags_read+0xc7/0x120 proc_reg_read+0x3c/0x60 __vfs_read+0x36/0x170 vfs_read+0x89/0x130 ksys_pread64+0x71/0x90 do_syscall_64+0x5b/0x160 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7efc42e75e23 Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24 According to kernel bisection, this problem became visible due to commit f7f99100 which changes how struct pages are initialized. Memblock layout affects the pfn ranges covered by node/zone. Consider that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the default (no memmap= given) memblock layout is like below: MEMBLOCK configuration: memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x4 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0 memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... 
If you give memmap=1G!4G (so it just covers memory[0x2]), the range [0x100000000-0x13fffffff] is gone: MEMBLOCK configuration: memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000 memory.cnt = 0x3 memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0 memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0 memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0 ... This causes shrinking node 0's pfn range because it is calculated by the address range of memblock.memory. So some of the struct pages in the gap range are left uninitialized. We have a function zero_resv_unavail() which zeroes the struct pages outside memblock.memory, but currently it covers only the reserved unavailable range (i.e. memblock.memory && !memblock.reserved). This patch extends it to cover all unavailable range, which fixes the reported issue. Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com Fixes: f7f99100 ("mm: stop zeroing memory during allocation in vmemmap") Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Tested-by: Oscar Salvador <osalvador@suse.de> Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com> Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Shile Zhang <shile.zhang@linux.alibaba.com>
-
- 06 May 2020, 3 commits
-
-
Committed by Jan Kara
to #24913189 commit c780e86dd48ef6467a1146cf7d0fe1e05a635039 upstream. KASAN is reporting that __blk_add_trace() has a use-after-free issue when accessing q->blk_trace. Indeed the switching of block tracing (and thus eventual freeing of q->blk_trace) is completely unsynchronized with the currently running tracing and thus it can happen that the blk_trace structure is being freed just while __blk_add_trace() works on it. Protect accesses to q->blk_trace by RCU during tracing and make sure we wait for the end of the RCU grace period when shutting down tracing. Luckily that is a rare enough event that we can afford it. Note that postponing the freeing of blk_trace to an RCU callback should better be avoided as it could have unexpected user-visible side effects, as the debugfs files would still exist for a short while after block tracing has been shut down. Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711 CC: stable@vger.kernel.org Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Tested-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reported-by: Tristan Madani <tristmd@gmail.com> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Jens Axboe <axboe@kernel.dk> [bwh: Backported to 4.19: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> References: CVE-2019-19768 Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
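A hedged sketch of the pattern the fix introduces (names taken from the commit message; the real code also deals with the running/stopped state and queue locking):

```c
#include <linux/blkdev.h>
#include <linux/blktrace_api.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

/* Reader side, e.g. __blk_add_trace(): dereference only under RCU. */
static void add_trace_sketch(struct request_queue *q)
{
	struct blk_trace *bt;

	rcu_read_lock();
	bt = rcu_dereference(q->blk_trace);
	if (bt) {
		/* ... emit the trace record ... */
	}
	rcu_read_unlock();
}

/* Teardown: unpublish, wait out a grace period, then free. */
static void remove_trace_sketch(struct request_queue *q)
{
	struct blk_trace *bt = rcu_dereference_protected(q->blk_trace, 1);

	RCU_INIT_POINTER(q->blk_trace, NULL);
	synchronize_rcu(); /* wait for in-flight __blk_add_trace() readers */
	kfree(bt);
}
```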
-
Committed by Sabrina Dubroca
to #24913189 commit 6c8991f41546c3c472503dff1ea9daaddf9331c2 upstream. ipv6_stub uses the ip6_dst_lookup function to allow other modules to perform IPv6 lookups. However, this function skips the XFRM layer entirely. All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the ip_route_output_key and ip_route_output helpers) for their IPv4 lookups, which calls xfrm_lookup_route(). This patch fixes this inconsistent behavior by switching the stub to ip6_dst_lookup_flow, which also calls xfrm_lookup_route(). This requires some changes in all the callers, as these two functions take different arguments and have different return types. Fixes: 5f81bd2e ("ipv6: export a stub for IPv6 symbols used by vxlan") Reported-by: Xiumei Mu <xmu@redhat.com> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 4.19: - Drop change in lwt_bpf.c - Delete now-unused "ret" in mlx5e_route_lookup_ipv6() - Initialise "out_dev" in mlx5e_create_encap_header_ipv6() to avoid introducing a spurious "may be used uninitialised" warning - Adjust filenames, context, indentation] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> References: CVE-2020-1749 Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Committed by Sabrina Dubroca
to #24913189 commit c4e85f73afb6384123e5ef1bba3315b2e3ad031e upstream. This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow, as some modules currently pass a net argument without a socket to ip6_dst_lookup. This is equivalent to commit 343d60aa ("ipv6: change ipv6_stub_impl.ipv6_dst_lookup to take net argument"). Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Signed-off-by: David S. Miller <davem@davemloft.net> [bwh: Backported to 4.19: adjust context] Signed-off-by: Ben Hutchings <ben.hutchings@codethink.co.uk> Signed-off-by: Sasha Levin <sashal@kernel.org> References: CVE-2020-1749 [zsl: fixes conflicts in net/sctp/ipv6.c] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 30 April 2020, 5 commits
-
-
Committed by Pankaj Gupta
fix #27138800 commit 8c2e408e73f735d2e6e8b43f9b038c9abb082939 upstream. This patch fixes the sparse warnings below, related to the __virtio type in the virtio pmem driver, reported by the Intel test bot on the linux-next tree. nd_virtio.c:56:28: warning: incorrect type in assignment (different base types) nd_virtio.c:56:28: expected unsigned int [unsigned] [usertype] type nd_virtio.c:56:28: got restricted __virtio32 nd_virtio.c:93:59: warning: incorrect type in argument 2 (different base types) nd_virtio.c:93:59: expected restricted __virtio32 [usertype] val nd_virtio.c:93:59: got unsigned int [unsigned] [usertype] ret Reported-by: kbuild test robot <lkp@intel.com> Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
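The sparse-clean pattern, roughly (the struct and field here are illustrative, mirroring the warning text): device-endian fields carry the __virtio32 type and are converted explicitly:

```c
#include <linux/virtio.h>
#include <linux/virtio_config.h>

/* Illustrative request with a device-endian field. */
struct pmem_req_sketch {
	__virtio32 type;
};

static u32 fill_and_read_back(struct virtio_device *vdev,
			      struct pmem_req_sketch *req, u32 type)
{
	req->type = cpu_to_virtio32(vdev, type); /* CPU -> device endian */
	return virtio32_to_cpu(vdev, req->type); /* device -> CPU endian */
}
```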
-
Committed by Pankaj Gupta
fix #27138800 commit 32de1484648a837db5dea0a7007fe7136804e392 upstream. This patch introduces the 'daxdev_mapping_supported' helper, which checks if 'MAP_SYNC' is supported with the filesystem mapping. It also checks if the corresponding dax_device is synchronous. The virtio pmem device is asynchronous and does not support VM_SYNC. Suggested-by: Jan Kara <jack@suse.cz> Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
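Roughly how a filesystem mmap path would use the helper, as a sketch (surrounding DAX setup elided):

```c
#include <linux/dax.h>
#include <linux/errno.h>
#include <linux/mm.h>

/* Refuse MAP_SYNC mappings when the backing dax_device is asynchronous. */
static int dax_mmap_sketch(struct vm_area_struct *vma,
			   struct dax_device *dax_dev)
{
	if (!daxdev_mapping_supported(vma, dax_dev))
		return -EOPNOTSUPP; /* VM_SYNC set, device lacks sync flush */
	/* ... proceed with normal DAX mmap setup ... */
	return 0;
}
```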
-
Committed by Pankaj Gupta
fix #27138800 commit fefc1d97fa4b5e016bbe15447dc3edcd9e1bcb9f upstream. This patch adds the 'DAXDEV_SYNC' flag, which is set for an nd_region doing synchronous flush. This is later used to disable MAP_SYNC functionality for the ext4 and xfs filesystems on devices that don't support synchronous flush. Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Pankaj Gupta
fix #27138800 commit 6e84200c0a2994b991259d19450eee561029bf70 upstream. This patch adds the virtio-pmem driver for KVM guests. The guest reads the persistent memory range information from Qemu over VIRTIO and registers it on nvdimm_bus. It also creates an nd_region object with the persistent memory range information so that the existing 'nvdimm/pmem' driver can reserve this into the system memory map. This way the 'virtio-pmem' driver uses existing functionality of the pmem driver to register persistent memory compatible with DAX-capable filesystems. This also provides a function to perform a guest flush over VIRTIO from the 'pmem' driver when userspace performs a flush on a DAX memory range. Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jakub Staron <jstaron@google.com> Tested-by: Jakub Staron <jstaron@google.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Pankaj Gupta
fix #27138800 commit c5d4355d10d414a96ca870b731756b89d068d57a upstream. This patch adds functionality to perform flush from guest to host over VIRTIO. We register a callback based on the 'nd_region' type: the virtio_pmem driver requires this special flush function, while for the rest of the region types we register the existing flush function. Errors returned by host fsync failures are reported to userspace. Signed-off-by: Pankaj Gupta <pagupta@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 29 April 2020, 2 commits
-
-
Committed by Al Viro
fix #27211210 commit 6c2d4798a8d16cf4f3a28c3cd4af4f1dcbbb4d04 upstream. Most of the callers of lookup_one_len_unlocked() treat negatives as ERR_PTR(-ENOENT). Provide a helper that would do just that. Note that a pinned positive dentry remains positive - its ->d_inode is stable, etc.; a pinned _negative_ dentry can become positive at any point as long as you are not holding its parent at least shared. So using lookup_one_len_unlocked() needs to be careful; lookup_positive_unlocked() is safer and that's what the callers end up open-coding anyway. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
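The helper's behaviour, sketched (the real implementation reads ->d_flags with an acquire load; this version conveys the semantics):

```c
#include <linux/dcache.h>
#include <linux/err.h>
#include <linux/namei.h>

/* Like lookup_one_len_unlocked(), but a negative dentry becomes -ENOENT. */
static struct dentry *lookup_positive_unlocked_sketch(const char *name,
						      struct dentry *base,
						      int len)
{
	struct dentry *ret = lookup_one_len_unlocked(name, base, len);

	if (!IS_ERR(ret) && d_is_negative(ret)) {
		dput(ret);
		ret = ERR_PTR(-ENOENT);
	}
	return ret;
}
```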
-
Committed by Al Viro
fix #27211210 commit d41efb522e902364ab09c782d511c1bedc388ddd upstream. There are 4 callers; two proceed to check if the result is positive and fail with ENOENT if it isn't; one (in handle_lookup_down()) is guaranteed to yield a positive and one (in lookup_fast()) is _preceded_ by a positivity check. However, follow_managed() on a negative dentry is a (fairly cheap) no-op on anything other than autofs. And negative autofs dentries are never hashed, so lookup_fast() is not going to run into one of those. Moreover, successful follow_managed() on a _positive_ dentry never yields a negative one (and we significantly rely upon that in callers of lookup_fast()). In other words, we can easily transpose the positivity check and the call of follow_managed() in lookup_fast(). And that allows to fold the positivity check *into* follow_managed(), simplifying life for the code downstream of its calls. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 28 April 2020, 1 commit
-
-
Committed by Babu Moger
to #26613714 commit 6fe07ce35e8ad870ba1cf82e0481e0fc0f526eff upstream. The resource control feature is supported by both Intel and AMD. So, rename CONFIG_INTEL_RDT to the vendor-neutral CONFIG_RESCTRL. Now CONFIG_RESCTRL will be used for both Intel and AMD to enable Resource Control support. Update the texts in config and condition accordingly. [ bp: Simplify Kconfig text. ] Signed-off-by: Babu Moger <babu.moger@amd.com> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Andy Lutomirski <luto@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: "Chang S. Bae" <chang.seok.bae@intel.com> Cc: David Miller <davem@davemloft.net> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Dmitry Safonov <dima@arista.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jann Horn <jannh@google.com> Cc: Joerg Roedel <jroedel@suse.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Kate Stewart <kstewart@linuxfoundation.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: <linux-doc@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Philippe Ombredanne <pombredanne@nexb.com> Cc: Pu Wen <puwen@hygon.cn> Cc: <qianyue.zj@alibaba-inc.com> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Rian Hunter <rian@alum.mit.edu> Cc: Sherry Hurwitz <sherry.hurwitz@amd.com> Cc: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Thomas Lendacky <Thomas.Lendacky@amd.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: <xiaochen.shen@intel.com> Link: https://lkml.kernel.org/r/20181121202811.4492-9-babu.moger@amd.com [ Shile: fixed conflict in arch/x86/Kconfig ] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Tested-by: WANG Siyuan <Siyuan.Wang@amd.com> Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 26 April 2020, 1 commit
-
-
Committed by zhongjiang-ali
to #24843736 Too many printed memcg OOM messages can trigger a softlockup. In general, the same ratelimit, oom_rs, is shared between system and memcg OOM to limit the printed messages, but the memcg side exceeds its limit far more frequently because memcg OOM happens much more easily, and the resulting flood of printed information is likely to trigger a softlockup. This patch uses separate ratelimits for memcg and system OOM. We tested the patch with the default values in the memcg, and the issue goes away. [xuyu: adjust corresponding sysctl indexes] Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
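A hedged sketch of the split (intervals and bursts below are illustrative, not the patch's actual defaults):

```c
#include <linux/ratelimit.h>

/* Separate ratelimit state for memcg OOM reports vs. global OOM
 * reports, so frequent memcg OOMs cannot flood the console. */
static DEFINE_RATELIMIT_STATE(oom_rs_sketch, DEFAULT_RATELIMIT_INTERVAL,
			      DEFAULT_RATELIMIT_BURST);
static DEFINE_RATELIMIT_STATE(memcg_oom_rs_sketch, 60 * HZ, 10);

static bool oom_report_allowed(bool is_memcg)
{
	return __ratelimit(is_memcg ? &memcg_oom_rs_sketch : &oom_rs_sketch);
}
```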
-
- 24 April 2020, 5 commits
-
-
Committed by Yihao Wu
to #26424323 We account iowait when the cgroup's se is idle and it has blocked tasks on the hierarchy of se->my_q. To achieve this, we also add cg_nr_running to track the hierarchical number of blocked tasks. We update it when a blocked task wakes up or a task blocks. Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Committed by Yihao Wu
to #26424323 From the previous patch, we know there are 4 possible states. Since the steal state's transitions are complex, we choose to account its complement: steal = elapse - idle - sum_exec_raw - ineffective, where elapse is the time since the cgroup was created, sum_exec_raw is the running time including IRQ time, and ineffective is the total time during which the cpuacct-bound cpuset doesn't allow this cpu for the cgroup. Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
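In code form the accounting reduces to a subtraction; a sketch with illustrative names (all quantities in nanoseconds):

```c
#include <linux/types.h>

/* steal = elapse - idle - sum_exec_raw - ineffective */
static u64 cgroup_steal_sketch(u64 elapse, u64 idle,
			       u64 sum_exec_raw, u64 ineffective)
{
	return elapse - idle - sum_exec_raw - ineffective;
}
```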
-
Committed by Yihao Wu
to #26424323 Since we are concerned with idle, let's take idle as the center state and omit transitions between the other states. Below is the state transition graph: sleep->deque +-----------+ cpumask +-------+ exit->deque +-------+ |ineffective|-------- | idle | <-----------|running| +-----------+ +-------+ +-------+ ^ | unthrtl child -> deque | | wake -> deque | |thrtl child -> enque migrate -> deque | |migrate -> enque | v +-------+ | steal | +-------+ We conclude the idle state condition as: !se->on_rq && !my_q->throttled && cpu allowed. From this graph and condition, we can hook (de|en)queue_task_fair, update_cpumasks_hier, and (un|)throttle_cfs_rq to account the idle state. In the hooked functions, we also check the conditions, to avoid accounting unwanted cpu clocks. Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com> Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Committed by Xunlei Pang
to #26424323 Add the cgroup file "cpuacct.proc_stat"; we'll export per-cgroup cpu usages and some other scheduler statistics in this interface. Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
Committed by Xunlei Pang
to #26424323 It's relatively easy to maintain nr_uninterruptible in the scheduler compared to doing it in cpuacct. We assume that "cpu" and "cpuacct" are bound together, so that it can be used for per-cgroup load. This will be needed to calculate the per-cgroup load average later. Reviewed-by: Michael Wang <yun.wang@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com> Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
-
- 23 April 2020, 3 commits
-
-
Committed by Mel Gorman
to #26255339 commit 5e1f0f098b4649fad53011246bcaeff011ffdf5d upstream Compaction is inherently race-prone as a suitable page freed during compaction can be allocated by any parallel task. This patch uses a capture_control structure to isolate a page immediately when it is freed by a direct compactor in the slow path of the page allocator. The intent is to avoid redundant scanning. 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%* Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%) Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%) Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%) Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%) Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%) Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%* Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%) Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%) Latency is only moderately affected but the devil is in the details. A closer examination indicates that base page fault latency is reduced but latency of huge pages is increased as it takes greater care to succeed. Part of the "problem" is that allocation success rates are close to 100% even when under pressure and compaction gets harder 5.0.0-rc1 5.0.0-rc1 selective-v3r17 capture-v3r19 Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%) Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%) Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%) Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%) Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%) Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%) Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%) Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%) And scan rates are reduced as expected by 6% for the migration scanner and 29% for the free scanner indicating that there is less redundant work. Compaction migrate scanned 20815362 19573286 Compaction free scanned 16352612 11510663 [mgorman@techsingularity.net: remove redundant check] Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Mel Gorman
to #26255339 commit e332f741a8dd1ec9a6dc8aa997296ecbfe64323e upstream Pageblock hints are cleared when compaction restarts or kswapd makes enough progress that it can sleep but it's over-eager in that the bit is cleared for migration sources with no LRU pages and migration targets with no free pages. As pageblock skip hint flushes are relatively rare and out-of-band with respect to kswapd, this patch makes a few more expensive checks to see if it's appropriate to even clear the bit. Every pageblock that is not cleared will avoid 512 pages being scanned unnecessarily on x86-64. The impact is variable with different workloads showing small differences in latency, success rates and scan rates. This is expected as clearing the hints is not that common but doing a small amount of work out-of-band to avoid a large amount of work in-band later is generally a good thing. Link: http://lkml.kernel.org/r/20190118175136.31341-22-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Qian Cai <cai@lca.pw> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Dan Carpenter <dan.carpenter@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: YueHaibing <yuehaibing@huawei.com> [cai@lca.pw: no stuck in __reset_isolation_pfn()] Link: http://lkml.kernel.org/r/20190206034732.75687-1-cai@lca.pw Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Mel Gorman
to #26255339 commit a921444382b49cc7fdeca3fba3e278bc09484a27 upstream This is a preparation patch only, no functional change. Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 22 April 2020, 2 commits
-
-
Committed by Suthikulpanit, Suravee
fix #26319040 commit b9c6ff94e43a0ee053e0c1d983fba1ac4953b762 upstream. Refactor the logic for activating/deactivating guest virtual APIC mode (GAM) into helper functions, and export them for other drivers (e.g. SVM) to support run-time activation/deactivation of SVM AVIC. Cc: Joerg Roedel <joro@8bytes.org> Signed-off-by: Suravee Suthikulpanit <suravee.suthikulpanit@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: tianyi <fujunkang@linux.alibaba.com> Reviewed-by: zhangliguang <zhangliguang@linux.alibaba.com> Acked-by: zhangliguang <zhangliguang@linux.alibaba.com>
-
Committed by Jeremy Linton
to #25688970 commit 56855a99f3d0d1e9f1f4e24f5851f9bf14c83296 upstream ACPI 6.3 adds a flag to indicate that child nodes are all identical cores. This is useful to authoritatively determine if a set of (possibly offline) cores are identical or not. Since the flag doesn't give us a unique id, we can generate one and use it to create bitmaps of sibling nodes, or simply use it in a loop to determine if a subset of cores are identical. Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Tested-by: Hanjun Guo <hanjun.guo@linaro.org> Reviewed-by: Sudeep Holla <sudeep.holla@arm.com> Signed-off-by: Jeremy Linton <jeremy.linton@arm.com> Signed-off-by: Will Deacon <will@kernel.org> Signed-off-by: luanshi <zhangliguang@linux.alibaba.com> Acked-by: zou cao <zoucao@linux.alibaba.com> Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 17 April 2020, 2 commits
-
-
Committed by Xunlei Pang
to #26782094 Pin the code sections of a process in memory for the corresponding VMAs, like mlock does. Usage: - pin process "PID": echo PID > /proc/unevictable/add_pid - unpin it: echo PID > /proc/unevictable/del_pid - show all pinned process pids: cat /proc/unevictable/add_pid For easy maintenance, we place it in the kernel because it has no side effects if unused. Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu
to #26424368 This reworks the memsli "start", "end", and "update" interfaces to make them clearer and symmetrical, by merging the "update" action into "end", just like what psi_memstall_{enter, leave} does. Now the latency probe pattern of memsli is as follows: memcg_lat_stat_start(&start); /* kernel code being probed */ memcg_lat_stat_end(MEM_LAT_XXX, start); This also cleans up the code and fixes the warnings produced when CONFIG_MEMSLI is not set. Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
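The probe pattern from the commit message, laid out as a sketch (MEM_LAT_XXX stands in for one of the latency types; the u64 start type is an assumption):

```c
#include <linux/types.h>

static void memsli_probe_sketch(void)
{
	u64 start;

	memcg_lat_stat_start(&start);
	/* ... the kernel path being measured, e.g. direct reclaim ... */
	memcg_lat_stat_end(MEM_LAT_XXX, start); /* "end" also updates now */
}
```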
-
- 16 April 2020, 5 commits
-
-
Committed by Xu Yu
to #26424368 This introduces the new bool Kconfig option MEMSLI, determining whether the memsli (memory latency histogram) feature should be built in or not. Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu
to #26424368 Since memsli also records latency histograms for swapout and swapin, which are NOT in the slow memory path, the overhead of memsli could be non-negligible in some specific scenarios. For example, in scenarios with frequent swapping out and in, memsli could introduce overhead of ~1% of the total run time of the synthetic testcase. This adds a procfs interface for the memsli switch. The memsli feature is enabled by default, and you can now disable it by: $ echo 0 > /proc/memsli/enabled Apparently, you can check the current memsli switch status by: $ cat /proc/memsli/enabled Note that disabling memsli at runtime will NOT clear the existing latency histograms. You still need to manually reset the specified latency histogram(s) by echoing 0 into the corresponding cgroup control file(s). Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu
to #26424368 Probe and calculate the latency of global swapout, memcg swapout and swapin respectively, and then group into the latency histogram in struct mem_cgroup. Note that the latency in each memcg is aggregated from all child memcgs. Usage: $ cat memory.direct_swapout_global_latency 0-1ms: 98313 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 52 Each line is the count of global swapout within the appropriate latency range. To clear the latency histogram: $ echo 0 > memory.direct_swapout_global_latency $ cat memory.direct_swapout_global_latency 0-1ms: 0 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 0 The usage of memory.direct_swapout_memcg_latency and memory.direct_swapin_latency is the same as memory.direct_swapout_global_latency. Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu
to #26424368 There is some duplicated code in the original implementation of the memory latency histogram, such as {x, y, z}_show and {x, y, z}_write, where x, y, z represent various types of memory latency. This reworks the common code of the memory latency histogram to make it easier to add more types of memory latency later. Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by Xu Yu
to #26424368 Probe and calculate the latency of direct compact, and then group into the latency histogram in struct mem_cgroup. Note that the latency in each memcg is aggregated from all child memcgs. Usage: $ cat memory.direct_compact_latency 0-1ms: 1176 1-5ms: 259 5-10ms: 17 10-100ms: 10 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 921 Each line is the count of direct compact within the appropriate latency range. To clear the latency histogram: $ echo 0 > memory.direct_compact_latency $ cat memory.direct_compact_latency 0-1ms: 0 1-5ms: 0 5-10ms: 0 10-100ms: 0 100-500ms: 0 500-1000ms: 0 >=1000ms: 0 total(ms): 0 Signed-off-by: Xu Yu <xuyu@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-