- 04 11月, 2020 40 次提交
-
-
由 Hugh Dickins 提交于
stable inclusion from linux-4.19.151 commit fbe96d5aab1ef3c992b1dd7a0a4a5aeb21093571 -------------------------------- commit 033b5d77 upstream. There have been elusive reports of filemap_fault() hitting its VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built with CONFIG_READ_ONLY_THP_FOR_FS=y. Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged without NUMA reuses the same huge page after collapse_file() failed (whereas NUMA targets its allocation to the respective node each time). And most of us were usually testing with CONFIG_NUMA=y kernels. collapse_file(old start) new_page = khugepaged_alloc_page(hpage) __SetPageLocked(new_page) new_page->index = start // hpage->index=old offset new_page->mapping = mapping xas_store(&xas, new_page) filemap_fault page = find_get_page(mapping, offset) // if offset falls inside hpage then // compound_head(page) == hpage lock_page_maybe_drop_mmap() __lock_page(page) // collapse fails xas_store(&xas, old page) new_page->mapping = NULL unlock_page(new_page) collapse_file(new start) new_page = khugepaged_alloc_page(hpage) __SetPageLocked(new_page) new_page->index = start // hpage->index=new offset new_page->mapping = mapping // mapping becomes valid again // since compound_head(page) == hpage // page_to_pgoff(page) got changed VM_BUG_ON_PAGE(page_to_pgoff(page) != offset) An initial patch replaced __SetPageLocked() by lock_page(), which did fix the race which Suren illustrates above. But testing showed that it's not good enough: if the racing task's __lock_page() gets delayed long after its find_get_page(), then it may follow collapse_file(new start)'s successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE. It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a check and retry (as is done for mapping), with similar relaxations in find_lock_entry() and pagecache_get_page(): but it's not obvious what else might get caught out; and khugepaged non-NUMA appears to be unique in exposing a page to page cache, then revoking, without going through a full cycle of freeing before reuse. Instead, non-NUMA khugepaged_prealloc_page() release the old page if anyone else has a reference to it (1% of cases when I tested). Although never reported on huge tmpfs, I believe its find_lock_entry() has been at similar risk; but huge tmpfs does not rely on khugepaged for its normal working nearly so much as READ_ONLY_THP_FOR_FS does. Reported-by: NDenis Lisov <dennis.lissov@gmail.com> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569 Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.orgReported-by: NQian Cai <cai@lca.pw> Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pwReported-and-analyzed-by: NSuren Baghdasaryan <surenb@google.com> Fixes: 87c460a0 ("mm/khugepaged: collapse_shmem() without freezing new_page") Signed-off-by: NHugh Dickins <hughd@google.com> Cc: stable@vger.kernel.org # v4.9+ Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Chaitanya Kulkarni 提交于
stable inclusion from linux-4.19.151 commit b6df5afc3d81e34d32f0b092d59b7fe8915d824b -------------------------------- commit 4bab6909 upstream. When try_module_get() fails in the nvme_dev_open() it returns without releasing the ctrl reference which was taken earlier. Put the ctrl reference which is taken before calling the try_module_get() in the error return code path. Fixes: 52a3974f "nvme-core: get/put ctrl and transport module in nvme_dev_open/release()" Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Linus Torvalds 提交于
stable inclusion from linux-4.19.151 commit 33acb78c859f1a0bd3c6b67801fada16f99614f6 -------------------------------- commit 4013c149 upstream. Kernel threads intentionally do CLONE_FS in order to follow any changes that 'init' does to set up the root directory (or cwd). It is admittedly a bit odd, but it avoids the situation where 'init' does some extensive setup to initialize the system environment, and then we execute a usermode helper program, and it uses the original FS setup from boot time that may be very limited and incomplete. [ Both Al Viro and Eric Biederman point out that 'pivot_root()' will follow the root regardless, since it fixes up other users of root (see chroot_fs_refs() for details), but overmounting root and doing a chroot() would not. ] However, Vegard Nossum noticed that the CLONE_FS not only means that we follow the root and current working directories, it also means we share umask with whatever init changed it to. That wasn't intentional. Just reset umask to the original default (0022) before actually starting the usermode helper program. Reported-by: NVegard Nossum <vegard.nossum@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: NEric W. Biederman <ebiederm@xmission.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Conflicts: kernel/umh.c [yyl: adjust context] Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NXie XiuQi <xiexiuqi@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Will McVicker 提交于
stable inclusion from linux-4.19.150 commit 289fe546ea16c2dcb57c5198c5a7b7387604530e -------------------------------- commit 1cc5ef91 upstream. The indexes to the nf_nat_l[34]protos arrays come from userspace. So check the tuple's family, e.g. l3num, when creating the conntrack in order to prevent an OOB memory access during setup. Here is an example kernel panic on 4.14.180 when userspace passes in an index greater than NFPROTO_NUMPROTO. Internal error: Oops - BUG: 0 [#1] PREEMPT SMP Modules linked in:... Process poc (pid: 5614, stack limit = 0x00000000a3933121) CPU: 4 PID: 5614 Comm: poc Tainted: G S W O 4.14.180-g051355490483 Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM task: 000000002a3dfffe task.stack: 00000000a3933121 pc : __cfi_check_fail+0x1c/0x24 lr : __cfi_check_fail+0x1c/0x24 ... Call trace: __cfi_check_fail+0x1c/0x24 name_to_dev_t+0x0/0x468 nfnetlink_parse_nat_setup+0x234/0x258 ctnetlink_parse_nat_setup+0x4c/0x228 ctnetlink_new_conntrack+0x590/0xc40 nfnetlink_rcv_msg+0x31c/0x4d4 netlink_rcv_skb+0x100/0x184 nfnetlink_rcv+0xf4/0x180 netlink_unicast+0x360/0x770 netlink_sendmsg+0x5a0/0x6a4 ___sys_sendmsg+0x314/0x46c SyS_sendmsg+0xb4/0x108 el0_svc_naked+0x34/0x38 This crash is not happening since 5.4+, however, ctnetlink still allows for creating entries with unsupported layer 3 protocol number. Fixes: c1d10adb ("[NETFILTER]: Add ctnetlink port for nf_conntrack") Signed-off-by: NWill McVicker <willmcvicker@google.com> [pablo@netfilter.org: rebased original patch on top of nf.git] Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Al Viro 提交于
stable inclusion from linux-4.19.150 commit ced8ce5d2157142c469eccc5eef5ea8ad579fa5e -------------------------------- commit 3701cb59 upstream. or get freed, for that matter, if it's a long (separately stored) name. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Al Viro 提交于
stable inclusion from linux-4.19.150 commit 90ef231ba534d43033884b8560df26e608ca0a21 -------------------------------- commit fe0a916c upstream. Checking for the lack of epitems refering to the epoll we want to insert into is not enough; we might have an insertion of that epoll into another one that has already collected the set of files to recheck for excessive reverse paths, but hasn't gotten to creating/inserting the epitem for it. However, any such insertion in progress can be detected - it will update the generation count in our epoll when it's done looking through it for files to check. That gets done under ->mtx of our epoll and that allows us to detect that safely. We are *not* holding epmutex here, so the generation count is not stable. However, since both the update of ep->gen by loop check and (later) insertion into ->f_ep_link are done with ep->mtx held, we are fine - the sequence is grab epmutex bump loop_check_gen ... grab tep->mtx // 1 tep->gen = loop_check_gen ... drop tep->mtx // 2 ... grab tep->mtx // 3 ... insert into ->f_ep_link ... drop tep->mtx // 4 bump loop_check_gen drop epmutex and if the fastpath check in another thread happens for that eventpoll, it can come * before (1) - in that case fastpath is just fine * after (4) - we'll see non-empty ->f_ep_link, slow path taken * between (2) and (3) - loop_check_gen is stable, with ->mtx providing barriers and we end up taking slow path. Note that ->f_ep_link emptiness check is slightly racy - we are protected against insertions into that list, but removals can happen right under us. Not a problem - in the worst case we'll end up taking a slow path for no good reason. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Al Viro 提交于
stable inclusion from linux-4.19.150 commit ff329915a5b1f6778344a6fc7b060c991376b095 -------------------------------- commit 18306c40 upstream. removes the need to clear it, along with the races. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Al Viro 提交于
stable inclusion from linux-4.19.150 commit 3e3bbc4d23eeb90bf282e98c7dfeca7702df3169 -------------------------------- commit f8d4f44d upstream. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Laurent Dufour 提交于
stable inclusion from linux-4.19.150 commit b6f69f72c15d7f973f5709c5351f378f235b3654 -------------------------------- commit f85086f9 upstream. In register_mem_sect_under_node() the system_state's value is checked to detect whether the call is made during boot time or during an hot-plug operation. Unfortunately, that check against SYSTEM_BOOTING is wrong because regular memory is registered at SYSTEM_SCHEDULING state. In addition, memory hot-plug operation can be triggered at this system state by the ACPI [1]. So checking against the system state is not enough. The consequence is that on system with interleaved node's ranges like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] This can be seen on PowerPC LPAR after multiple memory hot-plug and hot-unplug operations are done. At the next reboot the node's memory ranges can be interleaved and since the call to link_mem_sections() is made in topology_init() while the system is in the SYSTEM_SCHEDULING state, the node's id is not checked, and the sections registered to multiple nodes: $ ls -l /sys/devices/system/memory/memory21/node* total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 In that case, the system is able to boot but if later one of theses memory blocks is hot-unplugged and then hot-plugged, the sysfs inconsistency is detected and this is triggering a BUG_ON(): kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This patch addresses the root cause by not relying on the system_state value to detect whether the call is due to a hot-plug operation. An extra parameter is added to link_mem_sections() detailing whether the operation is due to a hot-plug operation. [1] According to Oscar Salvador, using this qemu command line, ACPI memory hotplug operations are raised at SYSTEM_SCHEDULING state: $QEMU -enable-kvm -machine pc -smp 4,sockets=4,cores=1,threads=1 -cpu host -monitor pty \ -m size=$MEM,slots=255,maxmem=4294967296k \ -numa node,nodeid=0,cpus=0-3,mem=512 -numa node,nodeid=1,mem=512 \ -object memory-backend-ram,id=memdimm0,size=134217728 -device pc-dimm,node=0,memdev=memdimm0,id=dimm0,slot=0 \ -object memory-backend-ram,id=memdimm1,size=134217728 -device pc-dimm,node=0,memdev=memdimm1,id=dimm1,slot=1 \ -object memory-backend-ram,id=memdimm2,size=134217728 -device pc-dimm,node=0,memdev=memdimm2,id=dimm2,slot=2 \ -object memory-backend-ram,id=memdimm3,size=134217728 -device pc-dimm,node=0,memdev=memdimm3,id=dimm3,slot=3 \ -object memory-backend-ram,id=memdimm4,size=134217728 -device pc-dimm,node=1,memdev=memdimm4,id=dimm4,slot=4 \ -object memory-backend-ram,id=memdimm5,size=134217728 -device pc-dimm,node=1,memdev=memdimm5,id=dimm5,slot=5 \ -object memory-backend-ram,id=memdimm6,size=134217728 -device pc-dimm,node=1,memdev=memdimm6,id=dimm6,slot=6 \ Fixes: 4fbce633 ("mm/memory_hotplug.c: make register_mem_sect_under_node() a callback of walk_memory_range()") Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J. Wysocki" <rafael@kernel.org> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-3-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Laurent Dufour 提交于
stable inclusion from linux-4.19.150 commit 25eaea1b33f2569f69a82dfddb3fb05384143bd0 -------------------------------- commit c1d0da83 upstream. Patch series "mm: fix memory to node bad links in sysfs", v3. Sometimes, firmware may expose interleaved memory layout like this: Early memory node ranges node 1: [mem 0x0000000000000000-0x000000011fffffff] node 2: [mem 0x0000000120000000-0x000000014fffffff] node 1: [mem 0x0000000150000000-0x00000001ffffffff] node 0: [mem 0x0000000200000000-0x000000048fffffff] node 2: [mem 0x0000000490000000-0x00000007ffffffff] In that case, we can see memory blocks assigned to multiple nodes in sysfs: $ ls -l /sys/devices/system/memory/memory21 total 0 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1 lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2 -rw-r--r-- 1 root root 65536 Aug 24 05:27 online -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index drwxr-xr-x 2 root root 0 Aug 24 05:27 power -r--r--r-- 1 root root 65536 Aug 24 05:27 removable -rw-r--r-- 1 root root 65536 Aug 24 05:27 state lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones The same applies in the node's directory with a memory21 link in both the node1 and node2's directory. This is wrong but doesn't prevent the system to run. However when later, one of these memory blocks is hot-unplugged and then hot-plugged, the system is detecting an inconsistency in the sysfs layout and a BUG_ON() is raised: kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084! LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4 CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25 Call Trace: add_memory_resource+0x23c/0x340 (unreliable) __add_memory+0x5c/0xf0 dlpar_add_lmb+0x1b4/0x500 dlpar_memory+0x1f8/0xb80 handle_dlpar_errorlog+0xc0/0x190 dlpar_store+0x198/0x4a0 kobj_attr_store+0x30/0x50 sysfs_kf_write+0x64/0x90 kernfs_fop_write+0x1b0/0x290 vfs_write+0xe8/0x290 ksys_write+0xdc/0x130 system_call_exception+0x160/0x270 system_call_common+0xf0/0x27c This has been seen on PowerPC LPAR. The root cause of this issue is that when node's memory is registered, the range used can overlap another node's range, thus the memory block is registered to multiple nodes in sysfs. There are two issues here: (a) The sysfs memory and node's layouts are broken due to these multiple links (b) The link errors in link_mem_sections() should not lead to a system panic. To address (a) register_mem_sect_under_node should not rely on the system state to detect whether the link operation is triggered by a hot plug operation or not. This is addressed by the patches 1 and 2 of this series. Issue (b) will be addressed separately. This patch (of 2): The memmap_context enum is used to detect whether a memory operation is due to a hot-add operation or happening at boot time. Make it general to the hotplug operation and rename it as meminit_context. There is no functional change introduced by this patch Suggested-by: NDavid Hildenbrand <david@redhat.com> Signed-off-by: NLaurent Dufour <ldufour@linux.ibm.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NDavid Hildenbrand <david@redhat.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "Rafael J . Wysocki" <rafael@kernel.org> Cc: Nathan Lynch <nathanl@linux.ibm.com> Cc: Scott Cheloha <cheloha@linux.ibm.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Thibaut Sautereau 提交于
stable inclusion from linux-4.19.150 commit a4ebc2d6aa3ac2aa92cac8f6f53662df2c4904c9 -------------------------------- [ Upstream commit 09a6b0bc ] Commit f227e3ec ("random32: update the net random state on interrupt and activity") broke compilation and was temporarily fixed by Linus in 83bdc727 ("random32: remove net_rand_state from the latent entropy gcc plugin") by entirely moving net_rand_state out of the things handled by the latent_entropy GCC plugin. From what I understand when reading the plugin code, using the __latent_entropy attribute on a declaration was the wrong part and simply keeping the __latent_entropy attribute on the variable definition was the correct fix. Fixes: 83bdc727 ("random32: remove net_rand_state from the latent entropy gcc plugin") Acked-by: NWilly Tarreau <w@1wt.eu> Cc: Emese Revfy <re.emese@gmail.com> Signed-off-by: NThibaut Sautereau <thibaut.sautereau@ssi.gouv.fr> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Jeffrey Mitchell 提交于
stable inclusion from linux-4.19.150 commit 345c6f260c89e417de6e7d81f3366bd5079f48a3 -------------------------------- [ Upstream commit d33030e2 ] nfs_readdir_page_filler() iterates over entries in a directory, reusing the same security label buffer, but does not reset the buffer's length. This causes decode_attr_security_label() to return -ERANGE if an entry's security label is longer than the previous one's. This error, in nfs4_decode_dirent(), only gets passed up as -EAGAIN, which causes another failed attempt to copy into the buffer. The second error is ignored and the remaining entries do not show up in ls, specifically the getdents64() syscall. Reproduce by creating multiple files in NFS and giving one of the later files a longer security label. ls will not see that file nor any that are added afterwards, though they will exist on the backend. In nfs_readdir_page_filler(), reset security label buffer length before every reuse Signed-off-by: NJeffrey Mitchell <jeffrey.mitchell@starlab.io> Fixes: b4487b93 ("nfs: Fix getxattr kernel panic and memory overflow") Signed-off-by: NTrond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Chaitanya Kulkarni 提交于
stable inclusion from linux-4.19.150 commit c2df194a0d50bc1370c6761f5b80d3a32f42bcd4 -------------------------------- [ Upstream commit 52a3974f ] Get and put the reference to the ctrl in the nvme_dev_open() and nvme_dev_release() before and after module get/put for ctrl in char device file operations. Introduce char_dev relase function, get/put the controller and module which allows us to fix the potential Oops which can be easily reproduced with a passthru ctrl (although the problem also exists with pure user access): Entering kdb (current=0xffff8887f8290000, pid 3128) on processor 30 Oops: (null) due to oops @ 0xffffffffa01019ad CPU: 30 PID: 3128 Comm: bash Tainted: G W OE 5.8.0-rc4nvme-5.9+ #35 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.4 RIP: 0010:nvme_free_ctrl+0x234/0x285 [nvme_core] Code: 57 10 a0 e8 73 bf 02 e1 ba 3d 11 00 00 48 c7 c6 98 33 10 a0 48 c7 c7 1d 57 10 a0 e8 5b bf 02 e1 8 RSP: 0018:ffffc90001d63de0 EFLAGS: 00010246 RAX: ffffffffa05c0440 RBX: ffff8888119e45a0 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff8888177e9550 RDI: ffff8888119e43b0 RBP: ffff8887d4768000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: ffffc90001d63c90 R12: ffff8888119e43b0 R13: ffff8888119e5108 R14: dead000000000100 R15: ffff8888119e5108 FS: 00007f1ef27b0740(0000) GS:ffff888817600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffffffffa05c0470 CR3: 00000007f6bee000 CR4: 00000000003406e0 Call Trace: device_release+0x27/0x80 kobject_put+0x98/0x170 nvmet_passthru_ctrl_disable+0x4a/0x70 [nvmet] nvmet_passthru_enable_store+0x4c/0x90 [nvmet] configfs_write_file+0xe6/0x150 vfs_write+0xba/0x1e0 ksys_write+0x5f/0xe0 do_syscall_64+0x52/0xb0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f1ef1eb2840 Code: Bad RIP value. RSP: 002b:00007fffdbff0eb8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f1ef1eb2840 RDX: 0000000000000002 RSI: 00007f1ef27d2000 RDI: 0000000000000001 RBP: 00007f1ef27d2000 R08: 000000000000000a R09: 00007f1ef27b0740 R10: 0000000000000001 R11: 0000000000000246 R12: 00007f1ef2186400 R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000000000 With this patch fix we take the module ref count in nvme_dev_open() and release that ref count in newly introduced nvme_dev_release(). Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Steven Rostedt (VMware) 提交于
stable inclusion from linux-4.19.150 commit 2fd5a462eb7b39694ae013450dc47d84cdf7204a -------------------------------- commit b40341fa upstream. The first thing that the ftrace function callback helper functions should do is to check for recursion. Peter Zijlstra found that when "rcu_is_watching()" had its notrace removed, it caused perf function tracing to crash. This is because the call of rcu_is_watching() is tested before function recursion is checked and and if it is traced, it will cause an infinite recursion loop. rcu_is_watching() should still stay notrace, but to prevent this should never had crashed in the first place. The recursion prevention must be the first thing done in callback functions. Link: https://lore.kernel.org/r/20200929112541.GM2628@hirez.programming.kicks-ass.net Cc: stable@vger.kernel.org Cc: Paul McKenney <paulmck@kernel.org> Fixes: c68c0fa2 ("ftrace: Have ftrace_ops_get_func() handle RCU and PER_CPU flags too") Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Reported-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Gao Xiang 提交于
stable inclusion from linux-4.19.149 commit f3e8ed3d33fa963f1b6827977696235852cdd8d9 -------------------------------- commit 41663430 upstream. SWP_FS is used to make swap_{read,write}page() go through the filesystem, and it's only used for swap files over NFS. So, !SWP_FS means non NFS for now, it could be either file backed or device backed. Something similar goes with legacy SWP_FILE. So in order to achieve the goal of the original patch, SWP_BLKDEV should be used instead. FS corruption can be observed with SSD device + XFS + fragmented swapfile due to CONFIG_THP_SWAP=y. I reproduced the issue with the following details: Environment: QEMU + upstream kernel + buildroot + NVMe (2 GB) Kernel config: CONFIG_BLK_DEV_NVME=y CONFIG_THP_SWAP=y Some reproducible steps: mkfs.xfs -f /dev/nvme0n1 mkdir /tmp/mnt mount /dev/nvme0n1 /tmp/mnt bs="32k" sz="1024m" # doesn't matter too much, I also tried 16m xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -F -S 0 -b $bs 0 $sz" -c "fdatasync" /tmp/mnt/sw xfs_io -f -c "pwrite -R -b $bs 0 $sz" -c "fsync" /tmp/mnt/sw mkswap /tmp/mnt/sw swapon /tmp/mnt/sw stress --vm 2 --vm-bytes 600M # doesn't matter too much as well Symptoms: - FS corruption (e.g. checksum failure) - memory corruption at: 0xd2808010 - segfault Fixes: f0eea189 ("mm, THP, swap: Don't allocate huge cluster for file backed swap device") Fixes: 38d8b4e6 ("mm, THP, swap: delay splitting THP during swap out") Signed-off-by: NGao Xiang <hsiangkao@redhat.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: N"Huang, Ying" <ying.huang@intel.com> Reviewed-by: NYang Shi <shy828301@gmail.com> Acked-by: NRafael Aquini <aquini@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Carlos Maiolino <cmaiolino@redhat.com> Cc: Eric Sandeen <esandeen@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20200820045323.7809-1-hsiangkao@redhat.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Conflicts: mm/swapfile.c [yyl: keep same as mainline] Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Masami Hiramatsu 提交于
stable inclusion from linux-4.19.149 commit ce7ff920092130f249b75f9fe177edb3362fefe8 -------------------------------- commit 3031313e upstream. Commit 0cb2f137 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") fixed one bug but not completely fixed yet. If we run a kprobe_module.tc of ftracetest, kernel showed a warning as below. # ./ftracetest test.d/kprobe/kprobe_module.tc === Ftrace unit tests === [1] Kprobe dynamic event - probing module ... [ 22.400215] ------------[ cut here ]------------ [ 22.400962] Failed to disarm kprobe-ftrace at trace_printk_irq_work+0x0/0x7e [trace_printk] (-2) [ 22.402139] WARNING: CPU: 7 PID: 200 at kernel/kprobes.c:1091 __disarm_kprobe_ftrace.isra.0+0x7e/0xa0 [ 22.403358] Modules linked in: trace_printk(-) [ 22.404028] CPU: 7 PID: 200 Comm: rmmod Not tainted 5.9.0-rc2+ #66 [ 22.404870] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014 [ 22.406139] RIP: 0010:__disarm_kprobe_ftrace.isra.0+0x7e/0xa0 [ 22.406947] Code: 30 8b 03 eb c9 80 3d e5 09 1f 01 00 75 dc 49 8b 34 24 89 c2 48 c7 c7 a0 c2 05 82 89 45 e4 c6 05 cc 09 1f 01 01 e8 a9 c7 f0 ff <0f> 0b 8b 45 e4 eb b9 89 c6 48 c7 c7 70 c2 05 82 89 45 e4 e8 91 c7 [ 22.409544] RSP: 0018:ffffc90000237df0 EFLAGS: 00010286 [ 22.410385] RAX: 0000000000000000 RBX: ffffffff83066024 RCX: 0000000000000000 [ 22.411434] RDX: 0000000000000001 RSI: ffffffff810de8d3 RDI: ffffffff810de8d3 [ 22.412687] RBP: ffffc90000237e10 R08: 0000000000000001 R09: 0000000000000001 [ 22.413762] R10: 0000000000000000 R11: 0000000000000001 R12: ffff88807c478640 [ 22.414852] R13: ffffffff8235ebc0 R14: ffffffffa00060c0 R15: 0000000000000000 [ 22.415941] FS: 00000000019d48c0(0000) GS:ffff88807d7c0000(0000) knlGS:0000000000000000 [ 22.417264] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 22.418176] CR2: 00000000005bb7e3 CR3: 0000000078f7a000 CR4: 00000000000006a0 [ 22.419309] Call Trace: [ 22.419990] kill_kprobe+0x94/0x160 [ 22.420652] kprobes_module_callback+0x64/0x230 [ 22.421470] notifier_call_chain+0x4f/0x70 [ 22.422184] blocking_notifier_call_chain+0x49/0x70 [ 22.422979] __x64_sys_delete_module+0x1ac/0x240 [ 22.423733] do_syscall_64+0x38/0x50 [ 22.424366] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 22.425176] RIP: 0033:0x4bb81d [ 22.425741] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 e0 ff ff ff f7 d8 64 89 01 48 [ 22.428726] RSP: 002b:00007ffc70fef008 EFLAGS: 00000246 ORIG_RAX: 00000000000000b0 [ 22.430169] RAX: ffffffffffffffda RBX: 00000000019d48a0 RCX: 00000000004bb81d [ 22.431375] RDX: 0000000000000000 RSI: 0000000000000880 RDI: 00007ffc70fef028 [ 22.432543] RBP: 0000000000000880 R08: 00000000ffffffff R09: 00007ffc70fef320 [ 22.433692] R10: 0000000000656300 R11: 0000000000000246 R12: 00007ffc70fef028 [ 22.434635] R13: 0000000000000000 R14: 0000000000000002 R15: 0000000000000000 [ 22.435682] irq event stamp: 1169 [ 22.436240] hardirqs last enabled at (1179): [<ffffffff810df542>] console_unlock+0x422/0x580 [ 22.437466] hardirqs last disabled at (1188): [<ffffffff810df19b>] console_unlock+0x7b/0x580 [ 22.438608] softirqs last enabled at (866): [<ffffffff81c0038e>] __do_softirq+0x38e/0x490 [ 22.439637] softirqs last disabled at (859): [<ffffffff81a00f42>] asm_call_on_stack+0x12/0x20 [ 22.440690] ---[ end trace 1e7ce7e1e4567276 ]--- [ 22.472832] trace_kprobe: This probe might be able to register after target module is loaded. Continue. This is because the kill_kprobe() calls disarm_kprobe_ftrace() even if the given probe is not enabled. In that case, ftrace_set_filter_ip() fails because the given probe point is not registered to ftrace. Fix to check the given (going) probe is enabled before invoking disarm_kprobe_ftrace(). Link: https://lkml.kernel.org/r/159888672694.1411785.5987998076694782591.stgit@devnote2 Fixes: 0cb2f137 ("kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler") Cc: Ingo Molnar <mingo@kernel.org> Cc: "Naveen N . Rao" <naveen.n.rao@linux.ibm.com> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: David Miller <davem@davemloft.net> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Chengming Zhou <zhouchengming@bytedance.com> Cc: stable@vger.kernel.org Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Tom Rix 提交于
stable inclusion from linux-4.19.149 commit 240dd5118a9e0454f280ffeae63f22bd14735733 -------------------------------- commit 46bbe5c6 upstream. clang static analyzer reports this problem trace_events_hist.c:3824:3: warning: Attempt to free released memory kfree(hist_data->attrs->var_defs.name[i]); In parse_var_defs() if there is a problem allocating var_defs.expr, the earlier var_defs.name is freed. This free is duplicated by free_var_defs() which frees the rest of the list. Because free_var_defs() has to run anyway, remove the second free fom parse_var_defs(). Link: https://lkml.kernel.org/r/20200907135845.15804-1-trix@redhat.com Cc: stable@vger.kernel.org Fixes: 30350d65 ("tracing: Add variable support to hist triggers") Reviewed-by: NTom Zanussi <tom.zanussi@linux.intel.com> Signed-off-by: NTom Rix <trix@redhat.com> Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Yonghong Song 提交于
stable inclusion from linux-4.19.149 commit e1a75e94a3acf78e6afdd548a5d504fc29cbc953 -------------------------------- [ Upstream commit ce880cb8 ] Running selftest ./btf_btf -p the kernel had the following warning: [ 51.528185] WARNING: CPU: 3 PID: 1756 at kernel/bpf/hashtab.c:717 htab_map_get_next_key+0x2eb/0x300 [ 51.529217] Modules linked in: [ 51.529583] CPU: 3 PID: 1756 Comm: test_btf Not tainted 5.9.0-rc1+ #878 [ 51.530346] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.el7.centos 04/01/2014 [ 51.531410] RIP: 0010:htab_map_get_next_key+0x2eb/0x300 ... [ 51.542826] Call Trace: [ 51.543119] map_seq_next+0x53/0x80 [ 51.543528] seq_read+0x263/0x400 [ 51.543932] vfs_read+0xad/0x1c0 [ 51.544311] ksys_read+0x5f/0xe0 [ 51.544689] do_syscall_64+0x33/0x40 [ 51.545116] entry_SYSCALL_64_after_hwframe+0x44/0xa9 The related source code in kernel/bpf/hashtab.c: 709 static int htab_map_get_next_key(struct bpf_map *map, void *key, void *next_key) 710 { 711 struct bpf_htab *htab = container_of(map, struct bpf_htab, map); 712 struct hlist_nulls_head *head; 713 struct htab_elem *l, *next_l; 714 u32 hash, key_size; 715 int i = 0; 716 717 WARN_ON_ONCE(!rcu_read_lock_held()); In kernel/bpf/inode.c, bpffs map pretty print calls map->ops->map_get_next_key() without holding a rcu_read_lock(), hence causing the above warning. To fix the issue, just surrounding map->ops->map_get_next_key() with rcu read lock. Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap") Reported-by: NAlexei Starovoitov <ast@kernel.org> Signed-off-by: NYonghong Song <yhs@fb.com> Signed-off-by: NAlexei Starovoitov <ast@kernel.org> Acked-by: NAndrii Nakryiko <andriin@fb.com> Cc: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20200916004401.146277-1-yhs@fb.comSigned-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Sven Schnelle 提交于
stable inclusion from linux-4.19.149 commit aafa75ff39d05ad8011c1b8fa118c36acec9661a -------------------------------- [ Upstream commit 73ac74c7 ] Switch order so that locking state is consistent even if the IRQ tracer calls into lockdep again. Acked-by: NPeter Zijlstra <peterz@infradead.org> Signed-off-by: NSven Schnelle <svens@linux.ibm.com> Signed-off-by: NVasily Gorbik <gor@linux.ibm.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Anthony Iliopoulos 提交于
stable inclusion from linux-4.19.149 commit 906c9129787bf890f3f1b562ddac45c3ec0965a8 -------------------------------- [ Upstream commit 05b29021 ] Commit 3b4b1972 ("nvme: fix possible deadlock when I/O is blocked") reverted multipath head disk revalidation due to deadlocks caused by holding the bd_mutex during revalidate. Updating the multipath disk blockdev size is still required though for userspace to be able to observe any resizing while the device is mounted. Directly update the bdev inode size to avoid unnecessarily holding the bdev->bd_mutex. Fixes: 3b4b1972 ("nvme: fix possible deadlock when I/O is blocked") Signed-off-by: NAnthony Iliopoulos <ailiop@suse.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Jin Yao 提交于
stable inclusion from linux-4.19.149 commit 31c5c44707d8eb6809100a512b0877da51f795c2 -------------------------------- [ Upstream commit 8510895b ] A big uncore event group is split into multiple small groups which only include the uncore events from the same PMU. This has been supported in the commit 3cdc5c2c ("perf parse-events: Handle uncore event aliases in small groups properly"). If the event's PMU name starts to repeat, it must be a new event. That can be used to distinguish the leader from other members. But now it only compares the pointer of pmu_name (leader->pmu_name == evsel->pmu_name). If we use "perf stat -M LLC_MISSES.PCIE_WRITE -a" on cascadelakex, the event list is: evsel->name evsel->pmu_name --------------------------------------------------------------- unc_iio_data_req_of_cpu.mem_write.part0 uncore_iio_4 (as leader) unc_iio_data_req_of_cpu.mem_write.part0 uncore_iio_2 unc_iio_data_req_of_cpu.mem_write.part0 uncore_iio_0 unc_iio_data_req_of_cpu.mem_write.part0 uncore_iio_5 unc_iio_data_req_of_cpu.mem_write.part0 uncore_iio_3 unc_iio_data_req_of_cpu.mem_write.part0 uncore_iio_1 unc_iio_data_req_of_cpu.mem_write.part1 uncore_iio_4 ...... For the event "unc_iio_data_req_of_cpu.mem_write.part1" with "uncore_iio_4", it should be the event from PMU "uncore_iio_4". It's not a new leader for this PMU. But if we use "(leader->pmu_name == evsel->pmu_name)", the check would be failed and the event is stored to leaders[] as a new PMU leader. So this patch uses strcmp to compare the PMU name between events. Fixes: d4953f7e ("perf parse-events: Fix 3 use after frees found with clang ASAN") Signed-off-by: NJin Yao <yao.jin@linux.intel.com> Acked-by: NJiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Jin Yao <yao.jin@intel.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20200430003618.17002-1-yao.jin@linux.intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Zeng Tao 提交于
stable inclusion from linux-4.19.149 commit 0d1682ca6d1314c27d07afacda4dd51baf5fcd94 -------------------------------- [ Upstream commit b872d064 ] The vfio_pci_release call will free and clear the error and request eventfd ctx while these ctx could be in use at the same time in the function like vfio_pci_request, and it's expected to protect them under the vdev->igate mutex, which is missing in vfio_pci_release. This issue is introduced since commit 1518ac27 ("vfio/pci: fix memory leaks of eventfd ctx"),and since commit 5c5866c5 ("vfio/pci: Clear error and request eventfd ctx after releasing"), it's very easily to trigger the kernel panic like this: [ 9513.904346] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008 [ 9513.913091] Mem abort info: [ 9513.915871] ESR = 0x96000006 [ 9513.918912] EC = 0x25: DABT (current EL), IL = 32 bits [ 9513.924198] SET = 0, FnV = 0 [ 9513.927238] EA = 0, S1PTW = 0 [ 9513.930364] Data abort info: [ 9513.933231] ISV = 0, ISS = 0x00000006 [ 9513.937048] CM = 0, WnR = 0 [ 9513.940003] user pgtable: 4k pages, 48-bit VAs, pgdp=0000007ec7d12000 [ 9513.946414] [0000000000000008] pgd=0000007ec7d13003, p4d=0000007ec7d13003, pud=0000007ec728c003, pmd=0000000000000000 [ 9513.956975] Internal error: Oops: 96000006 [#1] PREEMPT SMP [ 9513.962521] Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio hclge hns3 hnae3 [last unloaded: vfio_pci] [ 9513.972998] CPU: 4 PID: 1327 Comm: bash Tainted: G W 5.8.0-rc4+ #3 [ 9513.980443] Hardware name: Huawei TaiShan 2280 V2/BC82AMDC, BIOS 2280-V2 CS V3.B270.01 05/08/2020 [ 9513.989274] pstate: 80400089 (Nzcv daIf +PAN -UAO BTYPE=--) [ 9513.994827] pc : _raw_spin_lock_irqsave+0x48/0x88 [ 9513.999515] lr : eventfd_signal+0x6c/0x1b0 [ 9514.003591] sp : ffff800038a0b960 [ 9514.006889] x29: ffff800038a0b960 x28: ffff007ef7f4da10 [ 9514.012175] x27: ffff207eefbbfc80 x26: ffffbb7903457000 [ 9514.017462] x25: ffffbb7912191000 x24: ffff007ef7f4d400 [ 9514.022747] x23: ffff20be6e0e4c00 x22: 0000000000000008 [ 9514.028033] x21: 0000000000000000 x20: 0000000000000000 [ 9514.033321] x19: 0000000000000008 x18: 0000000000000000 [ 9514.038606] x17: 0000000000000000 x16: ffffbb7910029328 [ 9514.043893] x15: 0000000000000000 x14: 0000000000000001 [ 9514.049179] x13: 0000000000000000 x12: 0000000000000002 [ 9514.054466] x11: 0000000000000000 x10: 0000000000000a00 [ 9514.059752] x9 : ffff800038a0b840 x8 : ffff007ef7f4de60 [ 9514.065038] x7 : ffff007fffc96690 x6 : fffffe01faffb748 [ 9514.070324] x5 : 0000000000000000 x4 : 0000000000000000 [ 9514.075609] x3 : 0000000000000000 x2 : 0000000000000001 [ 9514.080895] x1 : ffff007ef7f4d400 x0 : 0000000000000000 [ 9514.086181] Call trace: [ 9514.088618] _raw_spin_lock_irqsave+0x48/0x88 [ 9514.092954] eventfd_signal+0x6c/0x1b0 [ 9514.096691] vfio_pci_request+0x84/0xd0 [vfio_pci] [ 9514.101464] vfio_del_group_dev+0x150/0x290 [vfio] [ 9514.106234] vfio_pci_remove+0x30/0x128 [vfio_pci] [ 9514.111007] pci_device_remove+0x48/0x108 [ 9514.115001] device_release_driver_internal+0x100/0x1b8 [ 9514.120200] device_release_driver+0x28/0x38 [ 9514.124452] pci_stop_bus_device+0x68/0xa8 [ 9514.128528] pci_stop_and_remove_bus_device+0x20/0x38 [ 9514.133557] pci_iov_remove_virtfn+0xb4/0x128 [ 9514.137893] sriov_disable+0x3c/0x108 [ 9514.141538] pci_disable_sriov+0x28/0x38 [ 9514.145445] hns3_pci_sriov_configure+0x48/0xb8 [hns3] [ 9514.150558] sriov_numvfs_store+0x110/0x198 [ 9514.154724] dev_attr_store+0x44/0x60 [ 9514.158373] sysfs_kf_write+0x5c/0x78 [ 9514.162018] kernfs_fop_write+0x104/0x210 [ 9514.166010] __vfs_write+0x48/0x90 [ 9514.169395] vfs_write+0xbc/0x1c0 [ 9514.172694] ksys_write+0x74/0x100 [ 9514.176079] __arm64_sys_write+0x24/0x30 [ 9514.179987] el0_svc_common.constprop.4+0x110/0x200 [ 9514.184842] do_el0_svc+0x34/0x98 [ 9514.188144] el0_svc+0x14/0x40 [ 9514.191185] el0_sync_handler+0xb0/0x2d0 [ 9514.195088] el0_sync+0x140/0x180 [ 9514.198389] Code: b9001020 d2800000 52800022 f9800271 (885ffe61) [ 9514.204455] ---[ end trace 648de00c8406465f ]--- [ 9514.212308] note: bash[1327] exited with preempt_count 1 Cc: Qian Cai <cai@lca.pw> Cc: Alex Williamson <alex.williamson@redhat.com> Fixes: 1518ac27 ("vfio/pci: fix memory leaks of eventfd ctx") Signed-off-by: NZeng Tao <prime.zeng@hisilicon.com> Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Sagi Grimberg 提交于
stable inclusion from linux-4.19.149 commit 03dfb191acea76e6f92379abdbb5335139b28ffa -------------------------------- [ Upstream commit 3b4b1972 ] Revert fab7772b ("nvme-multipath: revalidate nvme_ns_head gendisk in nvme_validate_ns") When adding a new namespace to the head disk (via nvme_mpath_set_live) we will see partition scan which triggers I/O on the mpath device node. This process will usually be triggered from the scan_work which holds the scan_lock. If I/O blocks (if we got ana change currently have only available paths but none are accessible) this can deadlock on the head disk bd_mutex as both partition scan I/O takes it, and head disk revalidation takes it to check for resize (also triggered from scan_work on a different path). See trace [1]. The mpath disk revalidation was originally added to detect online disk size change, but this is no longer needed since commit cb224c3a ("nvme: Convert to use set_capacity_revalidate_and_notify") which already updates resize info without unnecessarily revalidating the disk (the mpath disk doesn't even implement .revalidate_disk fop). [1]: Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Zhang Xiaoxu 提交于
stable inclusion from linux-4.19.149 commit 5f7ca306c7db558fc81d9b1a45d59d5e1332a8a0 -------------------------------- [ Upstream commit 95a3d8f3 ] When xfstests generic/451, there is an BUG at mm/memcontrol.c: page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea index:0xf mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451" flags: 0x2fffff80000001(locked) raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210 raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000 page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup) page->mem_cgroup:ffff88817287d000 ------------[ cut here ]------------ kernel BUG at mm/memcontrol.c:2659! invalid opcode: 0000 [#1] SMP CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_ 073836-buildvm-ppc64le-16.ppc.4 RIP: 0010:commit_charge+0x35/0x50 Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7 c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9 RSP: 0018:ffffc90002023a50 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0 RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005 R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000 R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0 FS: 00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: mem_cgroup_charge+0x166/0x4f0 __add_to_page_cache_locked+0x4a9/0x710 add_to_page_cache_locked+0x15/0x20 cifs_readpages+0x217/0x1270 read_pages+0x29a/0x670 page_cache_readahead_unbounded+0x24f/0x390 __do_page_cache_readahead+0x3f/0x60 ondemand_readahead+0x1f1/0x470 page_cache_async_readahead+0x14c/0x170 generic_file_buffered_read+0x5df/0x1100 generic_file_read_iter+0x10c/0x1d0 cifs_strict_readv+0x139/0x170 new_sync_read+0x164/0x250 __vfs_read+0x39/0x60 vfs_read+0xb5/0x1e0 ksys_pread64+0x85/0xf0 __x64_sys_pread64+0x22/0x30 do_syscall_64+0x69/0x150 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7f5071fcb1af Code: Bad RIP value. RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011 RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003 RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001 R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000 Modules linked in: ---[ end trace 725fa14a3e1af65c ]--- Since commit 3fea5a49 ("mm: memcontrol: convert page cache to a new mem_cgroup_charge() API") not cancel the page charge, the pages maybe double add to pagecache: thread1 | thread2 cifs_readpages readpages_get_pages add_to_page_cache_locked(head,index=n)=0 | readpages_get_pages | add_to_page_cache_locked(head,index=n+1)=0 add_to_page_cache_locked(head, index=n+1)=-EEXIST then, will next loop with list head page's index=n+1 and the page->mapping not NULL readpages_get_pages add_to_page_cache_locked(head, index=n+1) commit_charge VM_BUG_ON_PAGE So, we should not do the next loop when any page add to page cache failed. Reported-by: NHulk Robot <hulkci@huawei.com> Signed-off-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com> Signed-off-by: NSteve French <stfrench@microsoft.com> Acked-by: NRonnie Sahlberg <lsahlber@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Alex Williamson 提交于
stable inclusion from linux-4.19.149 commit 41a77298809e7be112f91972d794aa231fbe27aa -------------------------------- [ Upstream commit 5c5866c5 ] The next use of the device will generate an underflow from the stale reference. Cc: Qian Cai <cai@lca.pw> Fixes: 1518ac27 ("vfio/pci: fix memory leaks of eventfd ctx") Reported-by: NDaniel Wagner <dwagner@suse.de> Reviewed-by: NCornelia Huck <cohuck@redhat.com> Tested-by: NDaniel Wagner <dwagner@suse.de> Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Adrian Hunter 提交于
stable inclusion from linux-4.19.149 commit a63689c06a6dd5c0cf2a9221927b9b1b2b2bb9c1 -------------------------------- [ Upstream commit 61f82e3f ] In the absence of any modules, no "modules" map is created, but there are other executable pages to map, due to eBPF JIT, kprobe or ftrace. Map them by recognizing that the first "module" symbol is not necessarily from a module, and adjust the map accordingly. Signed-off-by: NAdrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Leo Yan <leo.yan@linaro.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mathieu Poirier <mathieu.poirier@linaro.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (VMware) <rostedt@goodmis.org> Cc: x86@kernel.org Link: http://lore.kernel.org/lkml/20200512121922.8997-10-adrian.hunter@intel.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ian Rogers 提交于
stable inclusion from linux-4.19.149 commit cc6ae85020035734eb13597fd6e8b0074897b837 -------------------------------- [ Upstream commit a159e2fe ] Avoid a simple memory leak. Signed-off-by: NIan Rogers <irogers@google.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Andi Kleen <ak@linux.intel.com> Cc: Andrii Nakryiko <andriin@fb.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jin Yao <yao.jin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: John Fastabend <john.fastabend@gmail.com> Cc: John Garry <john.garry@huawei.com> Cc: Kajol Jain <kjain@linux.ibm.com> Cc: Kan Liang <kan.liang@linux.intel.com> Cc: Kim Phillips <kim.phillips@amd.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Martin KaFai Lau <kafai@fb.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Song Liu <songliubraving@fb.com> Cc: Stephane Eranian <eranian@google.com> Cc: Vince Weaver <vincent.weaver@maine.edu> Cc: Yonghong Song <yhs@fb.com> Cc: bpf@vger.kernel.org Cc: kp singh <kpsingh@chromium.org> Cc: netdev@vger.kernel.org Link: http://lore.kernel.org/lkml/20200508053629.210324-10-irogers@google.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xie XiuQi 提交于
stable inclusion from linux-4.19.149 commit dd155a48a0c9b53404b30f6f92ccf9f8160378c1 -------------------------------- [ Upstream commit 07e9a6f5 ] Need to free "str" before return when asprintf() failed to avoid memory leak. Signed-off-by: NXie XiuQi <xiexiuqi@huawei.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Hongbo Yao <yaohongbo@huawei.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Li Bin <huawei.libin@huawei.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Link: http://lore.kernel.org/lkml/20200521133218.30150-4-liwei391@huawei.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Jiri Olsa 提交于
stable inclusion from linux-4.19.149 commit d911653688c588c22bdbc83459f87961c9d4399e -------------------------------- [ Upstream commit ea9eb1f4 ] Joakim reported wrong duration_time value for interval bigger than 4000 [1]. The problem is in the interval value we pass to update_stats function, which is typed as 'unsigned int' and overflows when we get over 2^32 (happens between intervals 4000 and 5000). Retyping the passed value to unsigned long long. [1] https://www.spinics.net/lists/linux-perf-users/msg11777.html Fixes: b90f1333 ("perf stat: Update walltime_nsecs_stats in interval mode") Reported-by: NJoakim Zhang <qiangqing.zhang@nxp.com> Signed-off-by: NJiri Olsa <jolsa@kernel.org> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Michael Petlan <mpetlan@redhat.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lore.kernel.org/lkml/20200518131445.3745083-1-jolsa@kernel.orgSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ian Rogers 提交于
stable inclusion from linux-4.19.149 commit 56540590ce7c316947d6740edc0403182a1e1ade -------------------------------- [ Upstream commit 3efc899d ] If allocated, perf_pkg_mask and metric_events need freeing. Signed-off-by: NIan Rogers <irogers@google.com> Reviewed-by: NAndi Kleen <ak@linux.intel.com> Cc: Adrian Hunter <adrian.hunter@intel.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Jiri Olsa <jolsa@redhat.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Link: http://lore.kernel.org/lkml/20200512235918.10732-1-irogers@google.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Qian Cai 提交于
stable inclusion from linux-4.19.149 commit b7e24664cc816717ca2a45b773d950a9188fb5c1 -------------------------------- [ Upstream commit 1518ac27 ] Finished a qemu-kvm (-device vfio-pci,host=0001:01:00.0) triggers a few memory leaks after a while because vfio_pci_set_ctx_trigger_single() calls eventfd_ctx_fdget() without the matching eventfd_ctx_put() later. Fix it by calling eventfd_ctx_put() for those memory in vfio_pci_release() before vfio_device_release(). unreferenced object 0xebff008981cc2b00 (size 128): comm "qemu-kvm", pid 4043, jiffies 4294994816 (age 9796.310s) hex dump (first 32 bytes): 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de ....kkkk.....N.. ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........ backtrace: [<00000000917e8f8d>] slab_post_alloc_hook+0x74/0x9c [<00000000df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4 [<000000005fcec025>] do_eventfd+0x54/0x1ac [<0000000082791a69>] __arm64_sys_eventfd2+0x34/0x44 [<00000000b819758c>] do_el0_svc+0x128/0x1dc [<00000000b244e810>] el0_sync_handler+0xd0/0x268 [<00000000d495ef94>] el0_sync+0x164/0x180 unreferenced object 0x29ff008981cc4180 (size 128): comm "qemu-kvm", pid 4043, jiffies 4294994818 (age 9796.290s) hex dump (first 32 bytes): 01 00 00 00 6b 6b 6b 6b 00 00 00 00 ad 4e ad de ....kkkk.....N.. ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........ backtrace: [<00000000917e8f8d>] slab_post_alloc_hook+0x74/0x9c [<00000000df0f2aa2>] kmem_cache_alloc_trace+0x2b4/0x3d4 [<000000005fcec025>] do_eventfd+0x54/0x1ac [<0000000082791a69>] __arm64_sys_eventfd2+0x34/0x44 [<00000000b819758c>] do_el0_svc+0x128/0x1dc [<00000000b244e810>] el0_sync_handler+0xd0/0x268 [<00000000d495ef94>] el0_sync+0x164/0x180 Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAlex Williamson <alex.williamson@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Shreyas Joshi 提交于
stable inclusion from linux-4.19.149 commit c6a9585611a538466c8ad2421035c0ffa7fabc77 -------------------------------- [ Upstream commit 48021f98 ] If uboot passes a blank string to console_setup then it results in a trashed memory. Ultimately, the kernel crashes during freeing up the memory. This fix checks if there is a blank parameter being passed to console_setup from uboot. In case it detects that the console parameter is blank then it doesn't setup the serial device and it gracefully exits. Link: https://lore.kernel.org/r/20200522065306.83-1-shreyas.joshi@biamp.comSigned-off-by: NShreyas Joshi <shreyas.joshi@biamp.com> Acked-by: NSergey Senozhatsky <sergey.senozhatsky@gmail.com> [pmladek@suse.com: Better format the commit message and code, remove unnecessary brackets.] Signed-off-by: NPetr Mladek <pmladek@suse.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Anshuman Khandual 提交于
stable inclusion from linux-4.19.149 commit e682e0d53c390467100dadd0cebcf8f4f0b9498e -------------------------------- [ Upstream commit 1ed1b90a ] ID_DFR0 based TraceFilt feature should not be exposed to guests. Hence lets drop it. Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: James Morse <james.morse@arm.com> Cc: Suzuki K Poulose <suzuki.poulose@arm.com> Cc: linux-arm-kernel@lists.infradead.org Cc: linux-kernel@vger.kernel.org Suggested-by: NMark Rutland <mark.rutland@arm.com> Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: NSuzuki K Poulose <suzuki.poulose@arm.com> Link: https://lore.kernel.org/r/1589881254-10082-3-git-send-email-anshuman.khandual@arm.comSigned-off-by: NWill Deacon <will@kernel.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Miklos Szeredi 提交于
stable inclusion from linux-4.19.149 commit 59da76a1713f7fd82d9c18ec72be99085b557027 -------------------------------- [ Upstream commit 32f98877 ] page_count() is unstable. Unless there has been an RCU grace period between when the page was removed from the page cache and now, a speculative reference may exist from the page cache. Reported-by: NMatthew Wilcox <willy@infradead.org> Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Ian Rogers 提交于
stable inclusion from linux-4.19.149 commit 318af7241223eea9fc16413b04a6915518ab1e9c -------------------------------- [ Upstream commit 266150c9 ] Realloc of size zero is a free not an error, avoid this causing a double free. Caught by clang's address sanitizer: ==2634==ERROR: AddressSanitizer: attempting double-free on 0x6020000015f0 in thread T0: #0 0x5649659297fd in free llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3 #1 0x5649659e9251 in __zfree tools/lib/zalloc.c:13:2 #2 0x564965c0f92c in mem2node__exit tools/perf/util/mem2node.c:114:2 #3 0x564965a08b4c in perf_c2c__report tools/perf/builtin-c2c.c:2867:2 #4 0x564965a0616a in cmd_c2c tools/perf/builtin-c2c.c:2989:10 #5 0x564965944348 in run_builtin tools/perf/perf.c:312:11 #6 0x564965943235 in handle_internal_command tools/perf/perf.c:364:8 #7 0x5649659440c4 in run_argv tools/perf/perf.c:408:2 #8 0x564965942e41 in main tools/perf/perf.c:538:3 0x6020000015f0 is located 0 bytes inside of 1-byte region [0x6020000015f0,0x6020000015f1) freed by thread T0 here: #0 0x564965929da3 in realloc third_party/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:164:3 #1 0x564965c0f55e in mem2node__init tools/perf/util/mem2node.c:97:16 #2 0x564965a08956 in perf_c2c__report tools/perf/builtin-c2c.c:2803:8 #3 0x564965a0616a in cmd_c2c tools/perf/builtin-c2c.c:2989:10 #4 0x564965944348 in run_builtin tools/perf/perf.c:312:11 #5 0x564965943235 in handle_internal_command tools/perf/perf.c:364:8 #6 0x5649659440c4 in run_argv tools/perf/perf.c:408:2 #7 0x564965942e41 in main tools/perf/perf.c:538:3 previously allocated by thread T0 here: #0 0x564965929c42 in calloc third_party/llvm/llvm-project/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3 #1 0x5649659e9220 in zalloc tools/lib/zalloc.c:8:9 #2 0x564965c0f32d in mem2node__init tools/perf/util/mem2node.c:61:12 #3 0x564965a08956 in perf_c2c__report tools/perf/builtin-c2c.c:2803:8 #4 0x564965a0616a in cmd_c2c tools/perf/builtin-c2c.c:2989:10 #5 0x564965944348 in run_builtin tools/perf/perf.c:312:11 #6 0x564965943235 in handle_internal_command tools/perf/perf.c:364:8 #7 0x5649659440c4 in run_argv tools/perf/perf.c:408:2 #8 0x564965942e41 in main tools/perf/perf.c:538:3 v2: add a WARN_ON_ONCE when the free condition arises. Signed-off-by: NIan Rogers <irogers@google.com> Acked-by: NJiri Olsa <jolsa@redhat.com> Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Namhyung Kim <namhyung@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Stephane Eranian <eranian@google.com> Cc: clang-built-linux@googlegroups.com Link: http://lore.kernel.org/lkml/20200320182347.87675-1-irogers@google.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Douglas Anderson 提交于
stable inclusion from linux-4.19.149 commit b6256c2966706c279f54bdd2c6582c7c370e9467 -------------------------------- [ Upstream commit b849dd84 ] While trying to "dd" to the block device for a USB stick, I encountered a hung task warning (blocked for > 120 seconds). I managed to come up with an easy way to reproduce this on my system (where /dev/sdb is the block device for my USB stick) with: while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done With my reproduction here are the relevant bits from the hung task detector: INFO: task udevd:294 blocked for more than 122 seconds. ... udevd D 0 294 1 0x00400008 Call trace: ... mutex_lock_nested+0x40/0x50 __blkdev_get+0x7c/0x3d4 blkdev_get+0x118/0x138 blkdev_open+0x94/0xa8 do_dentry_open+0x268/0x3a0 vfs_open+0x34/0x40 path_openat+0x39c/0xdf4 do_filp_open+0x90/0x10c do_sys_open+0x150/0x3c8 ... ... Showing all locks held in the system: ... 1 lock held by dd/2798: #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204 ... dd D 0 2798 2764 0x00400208 Call trace: ... schedule+0x8c/0xbc io_schedule+0x1c/0x40 wait_on_page_bit_common+0x238/0x338 __lock_page+0x5c/0x68 write_cache_pages+0x194/0x500 generic_writepages+0x64/0xa4 blkdev_writepages+0x24/0x30 do_writepages+0x48/0xa8 __filemap_fdatawrite_range+0xac/0xd8 filemap_write_and_wait+0x30/0x84 __blkdev_put+0x88/0x204 blkdev_put+0xc4/0xe4 blkdev_close+0x28/0x38 __fput+0xe0/0x238 ____fput+0x1c/0x28 task_work_run+0xb0/0xe4 do_notify_resume+0xfc0/0x14bc work_pending+0x8/0x14 The problem appears related to the fact that my USB disk is terribly slow and that I have a lot of RAM in my system to cache things. Specifically my writes seem to be happening at ~15 MB/s and I've got ~4 GB of RAM in my system that can be used for buffering. To write 4 GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds. The 267 second number is a problem because in __blkdev_put() we call sync_blockdev() while holding the bd_mutex. Any other callers who want the bd_mutex will be blocked for the whole time. The problem is made worse because I believe blkdev_put() specifically tells other tasks (namely udev) to go try to access the device at right around the same time we're going to hold the mutex for a long time. Putting some traces around this (after disabling the hung task detector), I could confirm: dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb A simple fix for this is to realize that sync_blockdev() works fine if you're not holding the mutex. Also, it's not the end of the world if you sync a little early (though it can have performance impacts). Thus we can make a guess that we're going to need to do the sync and then do it without holding the mutex. We still do one last sync with the mutex but it should be much, much faster. With this, my hung task warnings for my test case are gone. Signed-off-by: NDouglas Anderson <dianders@chromium.org> Reviewed-by: NGuenter Roeck <groeck@chromium.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Jaewon Kim 提交于
stable inclusion from linux-4.19.149 commit 6bee7991f63e6ae8faba0c704f4d98575bb0312f -------------------------------- [ Upstream commit 09ef5283 ] On passing requirement to vm_unmapped_area, arch_get_unmapped_area and arch_get_unmapped_area_topdown did not set align_offset. Internally on both unmapped_area and unmapped_area_topdown, if info->align_mask is 0, then info->align_offset was meaningless. But commit df529cab ("mm: mmap: add trace point of vm_unmapped_area") always prints info->align_offset even though it is uninitialized. Fix this uninitialized value issue by setting it to 0 explicitly. Before: vm_unmapped_area: addr=0x755b155000 err=0 total_vm=0x15aaf0 flags=0x1 len=0x109000 lo=0x8000 hi=0x75eed48000 mask=0x0 ofs=0x4022 After: vm_unmapped_area: addr=0x74a4ca1000 err=0 total_vm=0x168ab1 flags=0x1 len=0x9000 lo=0x8000 hi=0x753d94b000 mask=0x0 ofs=0x0 Signed-off-by: NJaewon Kim <jaewon31.kim@samsung.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michel Lespinasse <walken@google.com> Cc: Borislav Petkov <bp@suse.de> Link: http://lkml.kernel.org/r/20200409094035.19457-1-jaewon31.kim@samsung.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Conflict: mm/mmap.c [yyl: adjust context] Signed-off-by: NYang Yingliang <yangyingliang@huawei.com> Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Qian Cai 提交于
stable inclusion from linux-4.19.149 commit b73c744019721ea47340b37440a7f6a263beea54 -------------------------------- [ Upstream commit 5644e1fb ] pgdat->kswapd_classzone_idx could be accessed concurrently in wakeup_kswapd(). Plain writes and reads without any lock protection result in data races. Fix them by adding a pair of READ|WRITE_ONCE() as well as saving a branch (compilers might well optimize the original code in an unintentional way anyway). While at it, also take care of pgdat->kswapd_order and non-kswapd threads in allow_direct_reclaim(). The data races were reported by KCSAN, BUG: KCSAN: data-race in wakeup_kswapd / wakeup_kswapd write to 0xffff9f427ffff2dc of 4 bytes by task 7454 on cpu 13: wakeup_kswapd+0xf1/0x400 wakeup_kswapd at mm/vmscan.c:3967 wake_all_kswapds+0x59/0xc0 wake_all_kswapds at mm/page_alloc.c:4241 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_slowpath at mm/page_alloc.c:4512 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x16e/0x6f0 __handle_mm_fault+0xcd5/0xd40 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 1 lock held by mtest01/7454: #0: ffff9f425afe8808 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9 do_user_addr_fault at arch/x86/mm/fault.c:1405 (inlined by) do_page_fault at arch/x86/mm/fault.c:1539 irq event stamp: 6944085 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x34c/0x57c irq_exit+0xa2/0xc0 read to 0xffff9f427ffff2dc of 4 bytes by task 7472 on cpu 38: wakeup_kswapd+0xc8/0x400 wake_all_kswapds+0x59/0xc0 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x16e/0x6f0 __handle_mm_fault+0xcd5/0xd40 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 1 lock held by mtest01/7472: #0: ffff9f425a9ac148 (&mm->mmap_sem#2){++++}, at: do_page_fault+0x143/0x6f9 irq event stamp: 6793561 count_memcg_event_mm+0x1a6/0x270 count_memcg_event_mm+0x119/0x270 __do_softirq+0x34c/0x57c irq_exit+0xa2/0xc0 BUG: KCSAN: data-race in kswapd / wakeup_kswapd write to 0xffff90973ffff2dc of 4 bytes by task 820 on cpu 6: kswapd+0x27c/0x8d0 kthread+0x1e0/0x200 ret_from_fork+0x27/0x50 read to 0xffff90973ffff2dc of 4 bytes by task 6299 on cpu 0: wakeup_kswapd+0xf3/0x450 wake_all_kswapds+0x59/0xc0 __alloc_pages_slowpath+0xdcc/0x1290 __alloc_pages_nodemask+0x3bb/0x450 alloc_pages_vma+0x8a/0x2c0 do_anonymous_page+0x170/0x700 __handle_mm_fault+0xc9f/0xd00 handle_mm_fault+0xfc/0x2f0 do_page_fault+0x263/0x6f9 page_fault+0x34/0x40 Signed-off-by: NQian Cai <cai@lca.pw> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Cc: Marco Elver <elver@google.com> Cc: Matthew Wilcox <willy@infradead.org> Link: http://lkml.kernel.org/r/1582749472-5171-1-git-send-email-cai@lca.pwSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Xianting Tian 提交于
stable inclusion from linux-4.19.149 commit cebefe4f6fc0cf5721d443b91e8f43a66766fb06 -------------------------------- [ Upstream commit faffdfa0 ] Mount failure issue happens under the scenario: Application forked dozens of threads to mount the same number of cramfs images separately in docker, but several mounts failed with high probability. Mount failed due to the checking result of the page(read from the superblock of loop dev) is not uptodate after wait_on_page_locked(page) returned in function cramfs_read: wait_on_page_locked(page); if (!PageUptodate(page)) { ... } The reason of the checking result of the page not uptodate: systemd-udevd read the loopX dev before mount, because the status of loopX is Lo_unbound at this time, so loop_make_request directly trigger the calling of io_end handler end_buffer_async_read, which called SetPageError(page). So It caused the page can't be set to uptodate in function end_buffer_async_read: if(page_uptodate && !PageError(page)) { SetPageUptodate(page); } Then mount operation is performed, it used the same page which is just accessed by systemd-udevd above, Because this page is not uptodate, it will launch a actual read via submit_bh, then wait on this page by calling wait_on_page_locked(page). When the I/O of the page done, io_end handler end_buffer_async_read is called, because no one cleared the page error(during the whole read path of mount), which is caused by systemd-udevd reading, so this page is still in "PageError" status, which can't be set to uptodate in function end_buffer_async_read, then caused mount failure. But sometimes mount succeed even through systemd-udeved read loopX dev just before, The reason is systemd-udevd launched other loopX read just between step 3.1 and 3.2, the steps as below: 1, loopX dev default status is Lo_unbound; 2, systemd-udved read loopX dev (page is set to PageError); 3, mount operation 1) set loopX status to Lo_bound; ==>systemd-udevd read loopX dev<== 2) read loopX dev(page has no error) 3) mount succeed As the loopX dev status is set to Lo_bound after step 3.1, so the other loopX dev read by systemd-udevd will go through the whole I/O stack, part of the call trace as below: SYS_read vfs_read do_sync_read blkdev_aio_read generic_file_aio_read do_generic_file_read: ClearPageError(page); mapping->a_ops->readpage(filp, page); here, mapping->a_ops->readpage() is blkdev_readpage. In latest kernel, some function name changed, the call trace as below: blkdev_read_iter generic_file_read_iter generic_file_buffered_read: /* * A previous I/O error may have been due to temporary * failures, eg. mutipath errors. * Pg_error will be set again if readpage fails. */ ClearPageError(page); /* Start the actual read. The read will unlock the page*/ error=mapping->a_ops->readpage(flip, page); We can see ClearPageError(page) is called before the actual read, then the read in step 3.2 succeed. This patch is to add the calling of ClearPageError just before the actual read of read path of cramfs mount. Without the patch, the call trace as below when performing cramfs mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page do_read_cache_page: filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, the call trace as below when performing mount: do_mount cramfs_read cramfs_blkdev_read read_cache_page: do_read_cache_page: ClearPageError(page); <== new add filler(data, page); or mapping->a_ops->readpage(data, page); With the patch, mount operation trigger the calling of ClearPageError(page) before the actual read, the page has no error if no additional page error happen when I/O done. Signed-off-by: NXianting Tian <xianting_tian@126.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Cc: Jan Kara <jack@suse.cz> Cc: <yubin@h3c.com> Link: http://lkml.kernel.org/r/1583318844-22971-1-git-send-email-xianting_tian@126.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-
由 Nathan Chancellor 提交于
stable inclusion from linux-4.19.149 commit afe001488e7e8e1108a2d9fcac3757713ffae503 -------------------------------- [ Upstream commit b0d14fc4 ] Clang warns: mm/kmemleak.c:1955:28: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) ^ mm/kmemleak.c:1955:60: warning: array comparison always evaluates to a constant [-Wtautological-compare] if (__start_ro_after_init < _sdata || __end_ro_after_init > _edata) These are not true arrays, they are linker defined symbols, which are just addresses. Using the address of operator silences the warning and does not change the resulting assembly with either clang/ld.lld or gcc/ld (tested with diff + objdump -Dr). Suggested-by: NNick Desaulniers <ndesaulniers@google.com> Signed-off-by: NNathan Chancellor <natechancellor@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NCatalin Marinas <catalin.marinas@arm.com> Link: https://github.com/ClangBuiltLinux/linux/issues/895 Link: http://lkml.kernel.org/r/20200220051551.44000-1-natechancellor@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NSasha Levin <sashal@kernel.org> Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
-