- 02 9月, 2020 40 次提交
-
-
由 Alex Shi 提交于
task #29077503 commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d upstream We have two types of users of page isolation: 1. Memory offlining: Offline memory so it can be unplugged. Memory won't be touched. 2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to become the owner of the memory and make use of it. For example, in case we want to offline memory, we can ignore (skip over) PageHWPoison() pages, as the memory won't get used. We can allow to offline memory. In contrast, we don't want to allow to allocate such memory. Let's generalize the approach so we can special case other types of pages we want to skip over in case we offline memory. While at it, also pass the same flags to test_pages_isolated(). Original-by: NDavid Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Suggested-by: NMichal Hocko <mhocko@suse.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from ccommit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Conflicts: reenable patch context on all files.
-
由 Michal Hocko 提交于
task #29077503 commit d381c54760dcfad23743da40516e7e003d73952a upstream Heiko has complained that his log is swamped by warnings from has_unmovable_pages [ 20.536664] page dumped because: has_unmovable_pages [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0 [ 20.536794] flags: 0x3fffe0000010200(slab|head) [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600 [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000 [ 20.536797] page dumped because: has_unmovable_pages [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 20.536815] flags: 0x7fffe0000000000() [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000 [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000 which are not triggered by the memory hotplug but rather CMA allocator. The original idea behind dumping the page state for all call paths was that these messages will be helpful debugging failures. From the above it seems that this is not the case for the CMA path because we are lacking much more context. E.g the second reported page might be a CMA allocated page. It is still interesting to see a slab page in the CMA area but it is hard to tell whether this is bug from the above output alone. Address this issue by dumping the page state only on request. Both start_isolate_page_range and has_unmovable_pages already have an argument to ignore hwpoison pages so make this argument more generic and turn it into flags and allow callers to combine non-default modes into a mask. While we are at it, has_unmovable_pages call from is_pageblock_removable_nolock (sysfs removable file) is questionable to report the failure so drop it from there as well. Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from ccommit d381c54760dcfad23743da40516e7e003d73952a) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Conflicts: mm/page_alloc.c
-
由 David Hildenbrand 提交于
task #29077503 commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. [akpm@linux-foundation.org: fix comment text, per David] Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com> Acked-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NPankaj gupta <pagupta@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from ccommit ca215086b14b89a0e70fc211314944aa6ce50020) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
-
由 Baolin Wang 提交于
fix #29327388 Now the blk-mq did not support multi-page bvec, which means each bvec can only contain one page. Though the physical segment is 1 in one bio, the bio still can contains multiple bvecs which are physically contiguous, so we can not use one bvec length to map the request, instead we should use the full length of the request to mapping the request, when the physical segment is 1 in a request. In future if we support multi-page bvecs, this patch can be dropped. Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Baolin Wang 提交于
fix #29327388 Now the blk-mq did not support multi-page bvec, which means each bvec can only contain one page. For simple PRP, it just support one bvec in the bio though the physical segment is only 1, otherwise it can not map the whole request. In future if we support multi-page bvecs, this patch can be dropped. Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Klaus Birkelund Jensen 提交于
fix #29327388 commit 049bf37262c61c99f45438910711b55054b24838 upstream The shortcut for single segment SGL requests did not set the PSDT field to mark the request as using SGLs. Fixes: 297910571f08 ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: NKlaus Birkelund Jensen <klaus.jensen@cnexlabs.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Kevin Hao 提交于
fix #29327388 commit a4f40484e7f1dff56bb9f286cc59ffa36e0259eb upstream In the current code, the nvme is using a fixed 4k PRP entry size, but if the kernel use a page size which is more than 4k, we should consider the situation that the bv_offset may be larger than the dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then cause the command can't be executed correctly. Fixes: dff824b2aadb ("nvme-pci: optimize mapping of small single segment requests") Cc: stable@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKevin Hao <haokexin@gmail.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 70479b71bc80ae6f63c8d6644cc76dff99f79686 upstream Remove two pointless local variables, remove ret assignment that is never used, move the use_sgl initialization closer to where it is used. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 297910571f08f1d7e398793df6e606ebb375a3f1 upstream If the controller supports SGLs we can take another short cut for single segment request, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit dff824b2aadb7808f50ceb0927acaec5ad750ce7 upstream If a request is single segment and fits into one or two PRP entries we do not have to create a scatterlist for it, but can just map the bio_vec directly. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit d43f1ccfad053dbefba1d15443cdc36ca60958f0 upstream We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 4aedb705437f6f98b45f45c394e6803ca67abd33 upstream This prepares for some bigger changes to the data mapping helpers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 783b94bd9250478154904fa782d2cfc46336cdf6 upstream We always have exactly one segment, so we can simply call dma_map_bvec. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit b15c592de37ed9d71499a3b8a750d1b235fcba3d upstream This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 7fe07d14f71fabef642a478626248a9121e95b7b upstream This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 915f04c93db4e3a7388c8ad8ddfc28830e4cbce3 upstream Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 9b048119a153590b934ef49aae309b723587f527 upstream nvme_init_iod should really be split into two parts: initialize a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs into nvme_map_data with the code making use of it. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 3ab3a0313cb8c50391d74e40fd46a3408d8e4de9 upstream Provide a nice little shortcut for mapping a single bvec. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 9d9de535f385a8b3ba0e88ca0abf386c5704bbfc upstream In a lot of places we want to know the DMA direction for a given struct request. Add a little helper to make it a littler easier. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 2a876f5e25e8ec9fa5777d36e5695ee33dd63f6f upstream This provides a nice little shortcut to get the integrity data for drivers like NVMe that only support a single integrity segment. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
fix #29327388 commit 3aef3cae4342c1d8137a1c0782cbb66f1be3943c upstream Return the currently active bvec segment, potentially spanning multiple pages. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Keith Busch 提交于
fix #29327388 commit 39f8e36401142d73e33a954ac4bdf844fb5de9ae upstream Signed-off-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 YangYuxi 提交于
fix #29256237 commit a01a9445c00eca3e37523eb6b0d87f494eceeb4b TencentOS-kernel Since 'commit f719e375 ("ipvs: drop first packet to redirect conntrack")', when a new TCP connection meet the conditions that need reschedule, the first syn packet is dropped, this cause one second latency for the new connection, more discussion about this problem can easy search from google, such as: 1)One second connection delay in masque https://marc.info/?t=151683118100004&r=1&w=2 2)IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747 3)Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544 4)Additional 1s latency in `host -> service IP -> pod` https://github.com/kubernetes/kubernetes/issues/90854 5)kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client https://github.com/kubernetes/kubernetes/issues/81775 The root cause is when the old session is expired, the conntrack related to the session is dropped by ip_vs_conn_drop_conntrack. The code is as follows: ``` static void ip_vs_conn_expire(struct timer_list *t) { ... if ((cp->flags & IP_VS_CONN_F_NFCT) && !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) { /* Do not access conntracks during subsys cleanup * because nf_conntrack_find_get can not be used after * conntrack cleanup for the net. */ smp_rmb(); if (ipvs->enable) ip_vs_conn_drop_conntrack(cp); } ... } ``` As shown in the code, only when condition (cp->flags & IP_VS_CONN_F_NFCT) is true, the function ip_vs_conn_drop_conntrack will be called. So we optimize this by following steps (Administrators can choose the following optimization by setting net.ipv4.vs.conn_reuse_old_conntrack=1): 1) erase the IP_VS_CONN_F_NFCT flag (it is safely because no packets will use the old session) 2) call ip_vs_conn_expire_now to release the old session, then the related conntrack will not be dropped 3) then ipvs unnecessary to drop the first syn packet, it just continue to pass the syn packet to the next process, create a new ipvs session, and the new session will related to the old conntrack(which is reopened by conntrack as a new one), the next whole things is just as normal as that the old session isn't used to exist. The above processing has no problems except for passive FTP, for passive FTP situation, ipvs can judging from condition (atomic_read(&cp->n_control)) and condition (cp->control). So, for other conditions(means not FTP), ipvs should give users the right to choose,they can choose a high performance one processing logical by setting net.ipv4.vs.conn_reuse_old_conntrack=1. It is necessary because most business scenarios (such as kubernetes) are very sensitive to TCP short connection latency. This patch has been verified on our thousands of kubernets node servers on Tencent Inc. Signed-off-by: NYangYuxi <yx.atom1@gmail.com> [Tony: add the missing sysctl knob and disable it by default] Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
由 Bijan Mottahedeh 提交于
to #28991349 commit 9515743bfb39c61aaf3d4f3219a645c8d1fe9a0e upstream Completions need to consumed in the same order the controller submitted them, otherwise future completion entries may overwrite ones we haven't handled yet. Hold the nvme queue's poll lock while completing new CQEs to prevent another thread from freeing command tags for reuse out-of-order. Fixes: dabcefab45d3 ("nvme: provide optimized poll function for separate poll queues") Signed-off-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Bart Van Assche 提交于
to #28991349 commit 0542cd57d266074114d70791ab245e18f750cd32 upstream This patch avoids that the kernel-doc script complains about these function headers when building with W=1. Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0. Fixes: e42b3867de4b ("blk-mq-rdma: pass in queue map to blk_mq_rdma_map_queues") # v5.0. Reviewed-by: NChaitanya Kulkarni <chiatanya.kulkarni@wdc.com> Signed-off-by: NBart Van Assche <bvanassche@acm.org> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
to #28991349 commit 5aceaeb26394538858a9dbae5830d628469a44cf upstream We should check if a given queue map actually has queues enabled before dispatching to it. This allows drivers to not initialize optional but not used map types, which subsequently will allow fixing problems with queue map rebuilds for that case. Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit e5edd5f298fafda28284bafb8371e6f0b7681035 upstream From 7e849dd9cf37 ("nvme-pci: don't share queue maps"), the mapping table won't be initialized actually if map->nr_queues is zero, so we can't use blk_mq_map_queue_type() to retrieve hctx any more. This way still may cause broken mapping, fix it by skipping zero-queues maps in blk_mq_map_swqueue(). Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
to #28991349 commit e5edd5f298fafda28284bafb8371e6f0b7681035 upstream Now that the block layer checks if a queue map has any queues inside it there is no more reason to duplicate the maps for the non-default types. Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Bart Van Assche 提交于
to #28991349 commit 6e66b49392419f3fe134e1be583323ef75da1e4b upstream blk_mq_map_queues() and multiple .map_queues() implementations expect that set->map[HCTX_TYPE_DEFAULT].nr_queues is set to the number of hardware queues. Hence set .nr_queues before calling these functions. This patch fixes the following kernel warning: WARNING: CPU: 0 PID: 2501 at include/linux/cpumask.h:137 Call Trace: blk_mq_run_hw_queue+0x19d/0x350 block/blk-mq.c:1508 blk_mq_run_hw_queues+0x112/0x1a0 block/blk-mq.c:1525 blk_mq_requeue_work+0x502/0x780 block/blk-mq.c:775 process_one_work+0x9af/0x1740 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x361/0x430 kernel/kthread.c:255 Fixes: ed76e329d74a ("blk-mq: abstract out queue map") # v5.0 Reported-by: syzbot+d44e1b26ce5c3e77458d@syzkaller.appspotmail.com Signed-off-by: NBart Van Assche <bvanassche@acm.org> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Cc: Johannes Thumshirn <jth@kernel.org> Cc: Hannes Reinecke <hare@suse.com> Cc: Ming Lei <ming.lei@redhat.com> Cc: Christoph Hellwig <hch@infradead.org> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 5938870247be4453ef6602c7ce467bebb48113c8 upstream Now almost all .map_queues() implementation based on managed irq affinity doesn't update queue mapping and it just retrieves the old built mapping, so if nr_hw_queues is changed, the mapping talbe includes stale mapping. And only blk_mq_map_queues() may rebuild the mapping talbe. One case is that we limit .nr_hw_queues as 1 in case of kdump kernel. However, drivers often builds queue mapping before allocating tagset via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set as 1 in case of kdump kernel, so wrong queue mapping is used, and kernel panic[1] is observed during booting. This patch fixes the kernel panic triggerd on nvme by rebulding the mapping table via blk_mq_map_queues(). [1] kernel panic log [ 4.438371] nvme nvme0: 16/0/0 default/read/poll queues [ 4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 4.444681] PGD 0 P4D 0 [ 4.445367] Oops: 0000 [#1] SMP NOPTI [ 4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459 [ 4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222 [ 4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6 [ 4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286 [ 4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001 [ 4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001 [ 4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002 [ 4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538 [ 4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001 [ 4.469220] FS: 0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000 [ 4.471554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0 [ 4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4.477061] PKRU: 55555554 [ 4.477464] Call Trace: [ 4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad [ 4.479595] blk_mq_init_queue+0x32/0x4e [ 4.480178] nvme_validate_ns+0x98/0x623 [nvme_core] [ 4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core] [ 4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core] [ 4.482601] nvme_scan_work+0x23a/0x29b [nvme_core] [ 4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38 [ 4.483930] ? try_to_wake_up+0x38d/0x3b3 [ 4.484478] ? process_one_work+0x179/0x2fc [ 4.485118] process_one_work+0x1d3/0x2fc [ 4.485655] ? rescuer_thread+0x2ae/0x2ae [ 4.486196] worker_thread+0x1e9/0x2be [ 4.486841] kthread+0x115/0x11d [ 4.487294] ? kthread_park+0x76/0x76 [ 4.487784] ret_from_fork+0x3a/0x50 [ 4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables [ 4.489428] Dumping ftrace buffer: [ 4.489939] (ftrace buffer empty) [ 4.490492] CR2: 0000000000000098 [ 4.491052] ---[ end trace 03cd268ad5a86ff7 ]--- Cc: Christoph Hellwig <hch@lst.de> Cc: linux-nvme@lists.infradead.org Cc: David Milburn <dmilburn@redhat.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit c45b1fa2433c65e44bdf48f513cb37289f3116b9 upstream When -ENOSPC is returned from pci_alloc_irq_vectors_affinity(), we still try to allocate multiple irq vectors again, so irq queues covers the admin queue actually. But we don't consider that, then number of the allocated irq vector may be same with sum of io_queues[HCTX_TYPE_DEFAULT] and io_queues[HCTX_TYPE_READ], this way is obviously wrong, and finally breaks nvme_pci_map_queues(), and warning from pci_irq_get_affinity() is triggered. IRQ queues should cover admin queues, this patch makes this point explicitely in nvme_calc_io_queues(). We got severl boot failure internal report on aarch64, so please consider to fix it in v4.20. Fixes: 6451fe73fa0f ("nvme: fix irq vs io_queue calculations") Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Tested-by: Nfin4478 <fin4478@hotmail.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 6451fe73fa0f542a49bfacd7205b88a597897f58 upstream Guenter reported an boot hang issue on HPPA after we default to 0 poll queues. We have two issues in the queue count calculations: 1) We don't separate the poll queues from the read/write queues. This is important, since the former doesn't need interrupts. 2) The adjust logic is broken. Adjust the poll queue count before doing nvme_calc_io_queues(). The poll queue count is only limited by the IO queue count we were able to get from the controller, not failures in the IRQ allocation loop. This leaves nvme_calc_io_queues() just adjusting the read/write queue map. Reported-by: NReported-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
to #28991349 commit e20ba6e1da029136ded295f33076483d65ddf50a upstream Having another indirect all in the fast path doesn't really help in our post-spectre world. Also having too many queue type is just going to create confusion, so I'd rather manage them centrally. Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues. Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit dabcefab45d36ecb5a22f16577bb0f298876a22d upstream If we have separate poll queues, we know that they aren't using interrupts. Hence we don't need to disable interrupts around finding completions. Provide a separate set of blk_mq_ops for such devices. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 07b35eb5a364fa59f88f65e6c786192f2c9163be upstream Type of each element in queue mapping table is 'unsigned int, intead of 'struct blk_mq_queue_map)', so fix it. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit a4668d9ba4be1ca9f4a39798ba3419fdfef0750d upstream We need a better way of configuring this, and given that polling is (still) a bit niche, let's default to using 0 poll queues. That way we'll have the same read/write/poll behavior as 4.20, and users that want to test/use polling are required to do manual configuration of the number of poll queues. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit db29eb059cdc571f9d75cec4a41b9884b3b8286a upstream At least on SPARC, if MSI/MSI-X isn't supported, we get EINVAL if we ask for more than one vector. This isn't covered by our ENOSPC check. If we get EINVAL, decrease our ask to just one vector, instead of bailing out in error. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: NGuenter Roeck <linux@roeck-us.net> Tested-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 30e066286e232772cad72c87008a958e23e40a33 upstream NVMe always asks for io_queues + 1 worth of IRQ vectors, which means that even when we scale all the way down, we still ask for 2 vectors and get -ENOSPC in return if the system can't support more than 1. Getting just 1 vector is fine, it just means that we'll have 1 IO queue and 1 admin queue, with a shared vector between them. Check for this case and don't add our + 1 if it happens. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream Spread queues among present CPUs first, then building mapping on other non-present CPUs. So we can minimize count of dead queues which are mapped by un-present CPUs only. Then bad IO performance can be avoided by unbalanced mapping between present CPUs and queues. The similar policy has been applied on Managed IRQ affinity. Cc: Yi Zhang <yi.zhang@redhat.com> Reported-by: NYi Zhang <yi.zhang@redhat.com> Reviewed-by: NBob Liu <bob.liu@oracle.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 4b04cc6a8f86c4842314def22332de1f15de8523 upstream Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts! Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-