- 02 September 2020, 40 commits
-
Committed by David Hildenbrand
task #29077503 commit aa218795cb5fd583c94fc838dc76b7379dc4976a upstream virtio-mem wants to allow to offline memory blocks of which some parts were unplugged (allocated via alloc_contig_range()), especially, to later offline and remove completely unplugged memory blocks. The important part is that PageOffline() has to remain set until the section is offline, so these pages will never get accessed (e.g., when dumping). The pages should not be handed back to the buddy (which would require clearing PageOffline() and result in issues if offlining fails and the pages are suddenly in the buddy). Let's allow to do that by allowing to isolate any PageOffline() page when offlining. This way, we can reach the memory hotplug notifier MEM_GOING_OFFLINE, where the driver can signal that he is fine with offlining this page by dropping its reference count. PageOffline() pages with a reference count of 0 can then be skipped when offlining the pages (like if they were free, however they are not in the buddy). Anybody who uses PageOffline() pages and does not agree to offline them (e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not decrement the reference count and make offlining fail when trying to migrate such an unmovable page. So there should be no observable change. Same applies to balloon compaction users (movable PageOffline() pages), the pages will simply be migrated. Note 1: If offlining fails, a driver has to increment the reference count again in MEM_CANCEL_OFFLINE. Note 2: A driver that makes use of this has to be aware that re-onlining the memory block has to be handled by hooking into onlining code (online_page_callback_t), resetting the page PageOffline() and not giving them to the buddy. Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Acked-by: NMichal Hocko <mhocko@suse.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Juergen Gross <jgross@suse.com> Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Anthony Yznaga <anthony.yznaga@oracle.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Qian Cai <cai@lca.pw> Cc: Pingfan Liu <kernelfans@gmail.com> Signed-off-by: NDavid Hildenbrand <david@redhat.com> cherry picked from ccommit aa218795cb5fd583c94fc838dc76b7379dc4976a Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Conflicts: keep non-related code old, and remove offlined_pages++ mm/memory_hotplug.c mm/page_alloc.c mm/page_isolation.c Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
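To make Notes 1 and 2 above concrete, a driver owning PageOffline() pages could wire up the memory hotplug notifier roughly as sketched below; the my_drv_*() helpers and the per-range bookkeeping are hypothetical placeholders, not part of the patch:

```c
#include <linux/memory.h>
#include <linux/notifier.h>

/* Hypothetical helpers: drop or retake the driver's reference on each
 * PageOffline() page in [start_pfn, start_pfn + nr_pages). */
static void my_drv_drop_page_refs(unsigned long start_pfn, unsigned long nr_pages);
static void my_drv_retake_page_refs(unsigned long start_pfn, unsigned long nr_pages);

static int my_drv_memory_cb(struct notifier_block *nb, unsigned long action,
			    void *arg)
{
	struct memory_notify *mhp = arg;

	switch (action) {
	case MEM_GOING_OFFLINE:
		/* Agree to offline: a refcount of 0 lets the pages be skipped. */
		my_drv_drop_page_refs(mhp->start_pfn, mhp->nr_pages);
		break;
	case MEM_CANCEL_OFFLINE:
		/* Offlining failed: take the references back (Note 1). */
		my_drv_retake_page_refs(mhp->start_pfn, mhp->nr_pages);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block my_drv_memory_nb = {
	.notifier_call = my_drv_memory_cb,
};

/* register_memory_notifier(&my_drv_memory_nb) at probe time. */
```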
-
Committed by David Hildenbrand
task #29077503 commit f2af6d3978d74a7891d0f428537b4494498202cb upstream virtio-mem device (and, therefore, its memory) belongs. Add a new virtio-mem feature flag and export pxm_to_node, so it can be used in kernel module context. Acked-by: Michal Hocko <mhocko@suse.com> # for the export Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> # for the export Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Len Brown <lenb@kernel.org> Cc: linux-acpi@vger.kernel.org Signed-off-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com> (cherry picked from commit f2af6d3978d74a7891d0f428537b4494498202cb) Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Conflicts: move drivers/acpi/numa/srat.c modification into drivers/acpi/numa.c Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by David Hildenbrand
task #29077503 commit 5f1f79bbc9e26fa9412fa9522f957bb8f030c442 upstream for adding/removing memory from that memory region on request. When the device driver starts up, the requested amount of memory is queried and then plugged to Linux. On request, further memory can be plugged or unplugged. This patch only implements the plugging part. On x86-64, memory can currently be plugged in 4MB ("subblock") granularity. When required, a new memory block will be added (e.g., usually 128MB on x86-64) in order to plug more subblocks. Only x86-64 was tested for now. The online_page callback is used to keep unplugged subblocks offline when onlining memory - similar to the Hyper-V balloon driver. Unplugged pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to skip them. User space is usually responsible for onlining the added memory. The memory hotplug notifier is used to synchronize virtio-mem activity against memory onlining/offlining. Each virtio-mem device can belong to a NUMA node, which allows us to easily add/remove small chunks of memory to/from a specific NUMA node by using multiple virtio-mem devices. Something that works even when the guest has no idea about the NUMA topology. One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many "sub-DIMMS". This patch directly introduces the basic infrastructure to implement memory unplug. Especially the memory block states and subblock bitmaps will be heavily used there. Notes: - In case memory is to be onlined by user space, we limit the amount of offline memory blocks, to not run out of memory. This is esp. an issue if memory is added faster than it is getting onlined. - Suspend/Hibernate is not supported due to the way virtio-mem devices behave. Limited support might be possible in the future. - Reloading the device driver is not supported. Reviewed-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Tested-by: NPankaj Gupta <pankaj.gupta.linux@gmail.com> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Jason Wang <jasowang@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Michal Hocko <mhocko@kernel.org> Cc: Igor Mammedov <imammedo@redhat.com> Cc: Dave Young <dyoung@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Len Brown <lenb@kernel.org> Cc: linux-acpi@vger.kernel.org Signed-off-by: NDavid Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.comSigned-off-by: NMichael S. Tsirkin <mst@redhat.com> (cherry picked from ccommit 5f1f79bbc9e26fa9412fa9522f957bb8f030c442) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Conflicts: drivers/virtio/Makefile include/uapi/linux/virtio_ids.h Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
-
Committed by David Hildenbrand
task #29077503 commit 18db149120c106cf2b1a2595f82f3229f9d223b8 upstream Let's replace the __online_page...() functions by generic_online_page(). Hyper-V only wants to delay the actual onlining of un-backed pages, so we can simply re-use the generic function. This patch (of 3): Let's expose generic_online_page() so online_page_callback users can simply fall back to the generic implementation when actually deciding to online the pages. Link: http://lkml.kernel.org/r/20190909114830.662-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Sasha Levin <sashal@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 18db149120c106cf2b1a2595f82f3229f9d223b8) Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
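A rough sketch of how an online_page_callback user can now fall back to the exported generic_online_page(); the plugged-subblock check is a hypothetical placeholder, and the callback signature follows this patch series:

```c
#include <linux/memory_hotplug.h>
#include <linux/page-flags.h>

/* Hypothetical predicate: does this page belong to a plugged subblock? */
static bool my_drv_page_is_plugged(struct page *page);

static void my_drv_online_page(struct page *page, unsigned int order)
{
	if (my_drv_page_is_plugged(page)) {
		generic_online_page(page, order);	/* hand it to the buddy */
	} else {
		unsigned long i;

		/* Keep the whole chunk logically offline. */
		for (i = 0; i < (1ul << order); i++)
			__SetPageOffline(page + i);
	}
}

/* set_online_page_callback(&my_drv_online_page) when the device probes,
 * restore_online_page_callback(&my_drv_online_page) on removal. */
```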
-
Committed by Arun KS
task #29077503 commit a9cd410a3d296846a8125aa43d97a573a354c472 upstream When freeing pages are done with higher order, time spent on coalescing pages by buddy allocator can be reduced. With section size of 256MB, hot add latency of a single section shows improvement from 50-60 ms to less than 1 ms, hence improving the hot add latency by 60 times. Modify external providers of online callback to align with the change. [arunks@codeaurora.org: v11] Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org [akpm@linux-foundation.org: remove unused local, per Arun] [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar] [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch] [arunks@codeaurora.org: v8] Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org [arunks@codeaurora.org: v9] Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.orgSigned-off-by: NArun KS <arunks@codeaurora.org> Reviewed-by: NAndrew Morton <akpm@linux-foundation.org> Acked-by: NMichal Hocko <mhocko@suse.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Reviewed-by: NAlexander Duyck <alexander.h.duyck@linux.intel.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Juergen Gross <jgross@suse.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mathieu Malaterre <malat@debian.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Souptick Joarder <jrdr.linux@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Aaron Lu <aaron.lu@intel.com> Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org> Cc: Vinayak Menon <vinmenon@codeaurora.org> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from ccommit a9cd410a3d296846a8125aa43d97a573a354c472) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Conflicts: replace totalram_pages_add as old way.
-
Committed by Alex Shi
task #29077503 commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d upstream We have two types of users of page isolation: 1. Memory offlining: Offline memory so it can be unplugged. Memory won't be touched. 2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to become the owner of the memory and make use of it. For example, in case we want to offline memory, we can ignore (skip over) PageHWPoison() pages, as the memory won't get used. We can allow to offline memory. In contrast, we don't want to allow to allocate such memory. Let's generalize the approach so we can special case other types of pages we want to skip over in case we offline memory. While at it, also pass the same flags to test_pages_isolated(). Original-by: David Hildenbrand <david@redhat.com> Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Suggested-by: Michal Hocko <mhocko@suse.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: David Hildenbrand <david@redhat.com> Cc: Pingfan Liu <kernelfans@gmail.com> Cc: Qian Cai <cai@lca.pw> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> (cherry picked from commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d) Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com> Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com> Conflicts: reenable patch context on all files.
-
Committed by Michal Hocko
task #29077503 commit d381c54760dcfad23743da40516e7e003d73952a upstream Heiko has complained that his log is swamped by warnings from has_unmovable_pages [ 20.536664] page dumped because: has_unmovable_pages [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0 [ 20.536794] flags: 0x3fffe0000010200(slab|head) [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600 [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000 [ 20.536797] page dumped because: has_unmovable_pages [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0 [ 20.536815] flags: 0x7fffe0000000000() [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000 [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000 which are not triggered by the memory hotplug but rather CMA allocator. The original idea behind dumping the page state for all call paths was that these messages will be helpful debugging failures. From the above it seems that this is not the case for the CMA path because we are lacking much more context. E.g the second reported page might be a CMA allocated page. It is still interesting to see a slab page in the CMA area but it is hard to tell whether this is bug from the above output alone. Address this issue by dumping the page state only on request. Both start_isolate_page_range and has_unmovable_pages already have an argument to ignore hwpoison pages so make this argument more generic and turn it into flags and allow callers to combine non-default modes into a mask. While we are at it, has_unmovable_pages call from is_pageblock_removable_nolock (sysfs removable file) is questionable to report the failure so drop it from there as well. Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com> Reported-by: NHeiko Carstens <heiko.carstens@de.ibm.com> Reviewed-by: NOscar Salvador <osalvador@suse.de> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from ccommit d381c54760dcfad23743da40516e7e003d73952a) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com> Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Conflicts: mm/page_alloc.c
-
Committed by David Hildenbrand
task #29077503 commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream pages inflated in virtio-balloon. Nowadays, it is only a marker that a page is part of virtio-balloon and therefore logically offline. We also want to make use of this flag in other balloon drivers - for inflated pages or when onlining a section but keeping some pages offline (e.g. used right now by XEN and Hyper-V via set_online_page_callback()). We are going to expose this flag to dump tools like makedumpfile. But instead of exposing PG_balloon, let's generalize the concept of marking pages as logically offline, so it can be reused for other purposes later on. Rename PG_balloon to PG_offline. This is an indicator that the page is logically offline, the content stale and that it should not be touched (e.g. a hypervisor would have to allocate backing storage in order for the guest to dump an unused page). We can then e.g. exclude such pages from dumps. We replace and reuse KPF_BALLOON (23), as this shouldn't really harm (and for now the semantics stay the same). In following patches, we will make use of this bit also in other balloon drivers. While at it, document PGTABLE. [akpm@linux-foundation.org: fix comment text, per David] Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com> Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com> Acked-by: NMichael S. Tsirkin <mst@redhat.com> Acked-by: NPankaj gupta <pagupta@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Mike Rapoport <rppt@linux.vnet.ibm.com> Cc: Christian Hansen <chansen3@cisco.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Pavel Tatashin <pasha.tatashin@oracle.com> Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com> Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Miles Chen <miles.chen@mediatek.com> Cc: David Rientjes <rientjes@google.com> Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com> Cc: Dave Young <dyoung@redhat.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Haiyang Zhang <haiyangz@microsoft.com> Cc: Juergen Gross <jgross@suse.com> Cc: Julien Freche <jfreche@vmware.com> Cc: Kairui Song <kasong@redhat.com> Cc: "K. Y. Srinivasan" <kys@microsoft.com> Cc: Len Brown <len.brown@intel.com> Cc: Lianbo Jiang <lijiang@redhat.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Nadav Amit <namit@vmware.com> Cc: Omar Sandoval <osandov@fb.com> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Stephen Hemminger <sthemmin@microsoft.com> Cc: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Xavier Deguillard <xdeguillard@vmware.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> (cherry picked from ccommit ca215086b14b89a0e70fc211314944aa6ce50020) Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
-
Committed by Christoph Hellwig
fix #29327388 commit 3ab3a0313cb8c50391d74e40fd46a3408d8e4de9 upstream Provide a nice little shortcut for mapping a single bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Christoph Hellwig
fix #29327388 commit 9d9de535f385a8b3ba0e88ca0abf386c5704bbfc upstream In a lot of places we want to know the DMA direction for a given struct request. Add a little helper to make it a little easier. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
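The helper essentially derives the DMA direction from the request's operation; a minimal sketch of the idea and of typical driver usage (the open-coded inline below is an illustration, the real helper lives in the block layer headers):

```c
#include <linux/blkdev.h>
#include <linux/dma-direction.h>

/* Roughly what rq_dma_dir() boils down to: writes move data to the
 * device, everything else moves data from it. */
static inline enum dma_data_direction my_rq_dma_dir(struct request *rq)
{
	return op_is_write(req_op(rq)) ? DMA_TO_DEVICE : DMA_FROM_DEVICE;
}

/* Typical use, replacing open-coded rq_data_dir() ternaries:
 *
 *	dma_map_page(dev, page, offset, len, rq_dma_dir(rq));
 */
```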
-
Committed by Christoph Hellwig
fix #29327388 commit 2a876f5e25e8ec9fa5777d36e5695ee33dd63f6f upstream This provides a nice little shortcut to get the integrity data for drivers like NVMe that only support a single integrity segment. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Christoph Hellwig
fix #29327388 commit 3aef3cae4342c1d8137a1c0782cbb66f1be3943c upstream Return the currently active bvec segment, potentially spanning multiple pages. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
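Combined with the rq_dma_dir() and dma_map_bvec() helpers from the neighbouring commits in this series, a driver's single-segment fast path could look roughly like this sketch (not the actual NVMe code):

```c
#include <linux/blkdev.h>
#include <linux/dma-mapping.h>

/* Sketch: map the one and only data segment of a request, e.g. for a
 * small NVMe I/O that fits in a single bvec. */
static dma_addr_t my_map_single_segment(struct device *dev, struct request *rq)
{
	struct bio_vec bv = req_bvec(rq);	/* currently active bvec */

	return dma_map_bvec(dev, &bv, rq_dma_dir(rq), 0);
}
```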
-
Committed by YangYuxi
fix #29256237 commit a01a9445c00eca3e37523eb6b0d87f494eceeb4b TencentOS-kernel Since 'commit f719e375 ("ipvs: drop first packet to redirect conntrack")', when a new TCP connection meet the conditions that need reschedule, the first syn packet is dropped, this cause one second latency for the new connection, more discussion about this problem can easy search from google, such as: 1)One second connection delay in masque https://marc.info/?t=151683118100004&r=1&w=2 2)IPVS low throughput #70747 https://github.com/kubernetes/kubernetes/issues/70747 3)Apache Bench can fill up ipvs service proxy in seconds #544 https://github.com/cloudnativelabs/kube-router/issues/544 4)Additional 1s latency in `host -> service IP -> pod` https://github.com/kubernetes/kubernetes/issues/90854 5)kube-proxy ipvs conn_reuse_mode setting causes errors with high load from single client https://github.com/kubernetes/kubernetes/issues/81775 The root cause is when the old session is expired, the conntrack related to the session is dropped by ip_vs_conn_drop_conntrack. The code is as follows: ``` static void ip_vs_conn_expire(struct timer_list *t) { ... if ((cp->flags & IP_VS_CONN_F_NFCT) && !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) { /* Do not access conntracks during subsys cleanup * because nf_conntrack_find_get can not be used after * conntrack cleanup for the net. */ smp_rmb(); if (ipvs->enable) ip_vs_conn_drop_conntrack(cp); } ... } ``` As shown in the code, only when condition (cp->flags & IP_VS_CONN_F_NFCT) is true, the function ip_vs_conn_drop_conntrack will be called. So we optimize this by following steps (Administrators can choose the following optimization by setting net.ipv4.vs.conn_reuse_old_conntrack=1): 1) erase the IP_VS_CONN_F_NFCT flag (it is safely because no packets will use the old session) 2) call ip_vs_conn_expire_now to release the old session, then the related conntrack will not be dropped 3) then ipvs unnecessary to drop the first syn packet, it just continue to pass the syn packet to the next process, create a new ipvs session, and the new session will related to the old conntrack(which is reopened by conntrack as a new one), the next whole things is just as normal as that the old session isn't used to exist. The above processing has no problems except for passive FTP, for passive FTP situation, ipvs can judging from condition (atomic_read(&cp->n_control)) and condition (cp->control). So, for other conditions(means not FTP), ipvs should give users the right to choose,they can choose a high performance one processing logical by setting net.ipv4.vs.conn_reuse_old_conntrack=1. It is necessary because most business scenarios (such as kubernetes) are very sensitive to TCP short connection latency. This patch has been verified on our thousands of kubernets node servers on Tencent Inc. Signed-off-by: NYangYuxi <yx.atom1@gmail.com> [Tony: add the missing sysctl knob and disable it by default] Signed-off-by: NTony Lu <tonylu@linux.alibaba.com> Acked-by: NDust Li <dust.li@linux.alibaba.com>
-
Committed by Christoph Hellwig
to #28991349 commit e20ba6e1da029136ded295f33076483d65ddf50a upstream Having another indirect call in the fast path doesn't really help in our post-spectre world. Also having too many queue types is just going to create confusion, so I'd rather manage them centrally. Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
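The centrally managed queue types end up looking roughly like the enum below (default first, then the optional read and poll queues, as described above):

```c
/* Sketch of the resulting block-layer queue map types. */
enum hctx_type {
	HCTX_TYPE_DEFAULT,	/* all I/O not otherwise accounted for */
	HCTX_TYPE_READ,		/* optional dedicated read queues */
	HCTX_TYPE_POLL,		/* optional polled queues */

	HCTX_MAX_TYPES,
};
```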
-
Committed by Jens Axboe
to #28991349 commit 4b04cc6a8f86c4842314def22332de1f15de8523 upstream Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts! Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit 843477d4cc5c4bb4e346c561ecd3b9d0bd67e8c8 upstream Add a queue offset to the tag map. This enables users to map iteratively, for each queue map type they support. Bump maximum number of supported maps to 2, we're now fully able to support more than 1 map. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit ea4f995ee8b8f0578b3319949f2edd5d812fdb0a upstream We call blk_mq_map_queue() a lot, at least two times for each request per IO, sometimes more. Since we now have an indirect call as well in that function, cache the mapping so we don't have to re-call blk_mq_map_queue() for the same request multiple times. Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Hannes Reinecke <hare@suse.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit b3c661b15d5ab11d982e58bee23e05c1780528a1 upstream Add support for the tag set carrying multiple queue maps, and for the driver to inform blk-mq how many it wishes to support through setting set->nr_maps. This adds an mq_ops helper for drivers that support more than 1 map, mq_ops->rq_flags_to_type(). The function takes request/bio flags and CPU, and returns a queue map index for that. We then use the type information in blk_mq_map_queue() to index the map set. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream The mapping used to be dependent on just the CPU location, but now it's a tuple of (type, cpu) instead. This is a prep patch for allowing a single software queue to map to multiple hardware queues. No functional changes in this patch. This changes the software queue count to an unsigned short to save a bit of space. We can still support 64K-1 CPUs, which should be enough. Add a check to catch a wrap. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit ed76e329d74a4b15ac0f5fd3adbd52ec0178a134 upstream This is in preparation for allowing multiple sets of maps per queue, if so desired. Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit a8908939af569ce2419f43fd56eeaf003bc3d85d upstream It's just a pointer to set->mq_map, use that instead. Move the assignment a bit earlier, so we always know it's valid. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Jens Axboe
to #28991349 commit 6da4b3ab9a6e9b1b5f90322ab3fa3a7dd18edb19 upstream A driver may have a need to allocate multiple sets of MSI/MSI-X interrupts, and have them appropriately affinitized. Add support for defining a number of sets in the irq_affinity structure, of varying sizes, and get each set affinitized correctly across the machine. [ tglx: Minor changelog tweaks ] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Hannes Reinecke <hare@suse.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Cc: linux-block@vger.kernel.org Link: https://lkml.kernel.org/r/20181102145951.31979-5-ming.lei@redhat.com Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
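A driver wanting, say, one set of regular completion vectors and one set of poll vectors would describe them through the extended struct irq_affinity; a hedged sketch with purely illustrative vector counts:

```c
#include <linux/interrupt.h>
#include <linux/pci.h>

/* Illustrative only: one pre-vector (e.g. an admin queue) plus two
 * affinitized sets of 4 and 2 vectors. */
static int my_alloc_vectors(struct pci_dev *pdev)
{
	int sets[2] = { 4, 2 };
	struct irq_affinity affd = {
		.pre_vectors	= 1,
		.nr_sets	= 2,
		.sets		= sets,
	};

	return pci_alloc_irq_vectors_affinity(pdev, 7, 7,
			PCI_IRQ_MSIX | PCI_IRQ_AFFINITY, &affd);
}
```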
-
Committed by Hou Tao
to #29361128 commit 3d24430694077313c75c6b89f618db09943621e4 upstream. Currently rq->data_len will be decreased by partial completion or zeroed by completion, so when blk_stat_add() is invoked, data_len will be zero and there will never be samples in poll_cb because blk_mq_poll_stats_bkt() will return -1 if data_len is zero. We could move blk_stat_add() back to __blk_mq_complete_request(), but that would make the effort of trying to call ktime_get_ns() once in vain. Instead we can reuse the throtl_size field, and use it for both block stats and block throttle, and adjust the logic in blk_mq_poll_stats_bkt() accordingly. Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()") Tested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Tejun Heo
to #29361128 commit 54c52e10dc9b939084a7e6e3d32ce8fd8dee7898 upstream. The use_delay mechanism was introduced by blk-iolatency to hold memory allocators accountable for the reclaim and other shared IOs they cause. The duration of the delay is dynamically balanced between iolatency increasing the value on each target miss and it auto-decaying as time passes and threads get delayed on it. While this works well for iolatency, iocost's control model isn't compatible with it. There are no repeated "violation" events which can be balanced against auto-decaying. iocost instead knows how much a given cgroup is over budget and wants to prevent that cgroup from issuing IOs while over budget. Until now, iocost has been adding the cost of force-issued IOs. However, this doesn't reflect the amount which is already over budget and is simply not enough to counter the auto-decaying, allowing an anon-memory leaking low priority cgroup to go over its allotted share of IOs. As auto-decaying doesn't make much sense for iocost, this patch introduces a different mode of operation for use_delay - when blkcg_set_delay() is used instead of blkcg_add/use_delay(), the delay duration is not auto-decayed until it is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the delay duration synchronized to the budget overage amount. With this change, iocost can effectively police cgroups which generate a significant amount of force-issued IOs. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
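Under the new non-decaying mode, a policy such as iocost keeps the delay in lockstep with how far the cgroup is over budget; a minimal sketch, with the blkg and the overage calculation assumed to come from the policy itself:

```c
#include <linux/blk-cgroup.h>

/* Sketch: publish the current budget overage as a non-decaying delay,
 * and drop it entirely once the cgroup is back within budget. */
static void my_policy_sync_delay(struct blkcg_gq *blkg, u64 overage_nsec)
{
	if (overage_nsec)
		blkcg_set_delay(blkg, overage_nsec);	/* not auto-decayed */
	else
		blkcg_clear_delay(blkg);
}
```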
-
Committed by Waiman Long
to #29361128 commit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream. Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]" argument of the iocost_ioc_vrate_adj trace entry defined in include/trace/events/iocost.h leading to the following error: /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8: error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token , u32[]* __tracepoint_arg_missed_ppm That argument type is indeed rather complex and hard to read. Looking at block/blk-iocost.c, it is just a 2-entry u32 array. By simplifying the argument to a simple "u32 *missed_ppm" and adjusting the trace entry accordingly, the compilation error was gone. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Waiman Long <longman@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Committed by Yihao Wu
to #28739709 /proc/loadavg can reflect the waiting tasks over a period of time to some extent, but becoming an SLI requires better precision and a quicker response. Furthermore, I/O blocking is not a concern here, and bandwidth control is excluded from cpu_stress. This patch adds a new interface, /proc/cpu_stress. It's based on task runtime tracking, so we don't need to deal with complex state transitions. And because task runtime tracking is done in most scheduler events, the precision is good enough. Like loadavg, cpu_stress has 3 average windows too (1, 5, 15 min). Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com> Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Committed by David Howells
task #29263287 commit 3f19b2ab97a97b413c24b66c67ae16daa4f56c35 upstream Make the inode hash table RCU searchable so that searches that want to access or modify an inode without taking a ref on that inode can do so without taking the inode hash table lock. The main thing this requires is some RCU annotation on the list manipulation operations. Inodes are already freed by RCU in most cases. Users of this interface must take care as the inode may be still under construction or may be being torn down around them. There are at least three instances where this can be of use: (1) Testing whether the inode number iunique() is going to return is currently unique (the iunique_lock is still held). (2) Ext4 date stamp updating. (3) AFS callback breaking. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> cc: linux-ext4@vger.kernel.org cc: linux-afs@lists.infradead.org [jeffle: resolve collision in afs_break_one_callback since code base change] Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
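A caller that only needs to peek at an inode (for instance to break a callback) can now search the hash under RCU instead of taking inode_hash_lock; a rough sketch of the usage pattern, where the by-inode-number lookup helper is assumed to be the one added by this series:

```c
#include <linux/fs.h>
#include <linux/rcupdate.h>

/* Sketch: look at an inode by number without the inode hash lock. The
 * inode may be under construction or being torn down, so only checks
 * that are safe in those states may be performed here. */
static void my_peek_inode(struct super_block *sb, unsigned long ino)
{
	struct inode *inode;

	rcu_read_lock();
	inode = find_inode_by_ino_rcu(sb, ino);	/* assumed helper from this series */
	if (inode) {
		/* e.g. compare a generation number or test a flag */
	}
	rcu_read_unlock();
}
```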
-
Committed by Xiaoguang Wang
to #29233603 commit 6d5f904904608a9cd32854d7d0a4dd65b27f9935 upstream For those applications which are not willing to use io_uring_enter() to reap and handle cqes, they may completely rely on liburing's io_uring_peek_cqe(), but if cq ring has overflowed, currently because io_uring_peek_cqe() is not aware of this overflow, it won't enter kernel to flush cqes, below test program can reveal this bug: static void test_cq_overflow(struct io_uring *ring) { struct io_uring_cqe *cqe; struct io_uring_sqe *sqe; int issued = 0; int ret = 0; do { sqe = io_uring_get_sqe(ring); if (!sqe) { fprintf(stderr, "get sqe failed\n"); break;; } ret = io_uring_submit(ring); if (ret <= 0) { if (ret != -EBUSY) fprintf(stderr, "sqe submit failed: %d\n", ret); break; } issued++; } while (ret > 0); assert(ret == -EBUSY); printf("issued requests: %d\n", issued); while (issued) { ret = io_uring_peek_cqe(ring, &cqe); if (ret) { if (ret != -EAGAIN) { fprintf(stderr, "peek completion failed: %s\n", strerror(ret)); break; } printf("left requets: %d\n", issued); continue; } io_uring_cqe_seen(ring, cqe); issued--; printf("left requets: %d\n", issued); } } int main(int argc, char *argv[]) { int ret; struct io_uring ring; ret = io_uring_queue_init(16, &ring, 0); if (ret) { fprintf(stderr, "ring setup failed: %d\n", ret); return 1; } test_cq_overflow(&ring); return 0; } To fix this issue, export cq overflow status to userspace by adding new IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly. Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
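With the flag exported through the mmap'ed SQ ring, a userspace peek path can notice the overflow and enter the kernel once to flush the pending cqes; a hedged sketch (field and constant names follow liburing and the uapi header, and the raw syscall is used for brevity):

```c
#include <liburing.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: if the kernel flagged a CQ overflow, enter the kernel so the
 * overflowed completions are flushed into the CQ ring. A real
 * implementation should use an acquire load on the flags word. */
static void flush_cq_if_overflowed(struct io_uring *ring)
{
	if (*ring->sq.kflags & IORING_SQ_CQ_OVERFLOW)
		syscall(__NR_io_uring_enter, ring->ring_fd, 0, 0,
			IORING_ENTER_GETEVENTS, NULL, 0);
}
```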
-
Committed by Dave Jiang
to #27305291 commit 89fa9d8ea7bdfa841d19044485cec5f4171069e5 upstream. With Intel DSM 1.8 [1], two new security DSMs are introduced: enable/update master passphrase and master secure erase. The master passphrase allows a secure erase to be performed without the user passphrase that is set on the NVDIMM. The commands of master_update and master_erase are added to the sysfs knob in order to initiate the DSMs. They are similar in operation mechanism compared to update and erase. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 7d988097c546187ada602cc9bccd0f03d473eb8f upstream. Add support for the NVDIMM_FAMILY_INTEL "overwrite" capability as described by the Intel DSM spec v1.7. This will allow triggering of overwrite on Intel NVDIMMs. The overwrite operation can take tens of minutes. When the overwrite DSM is issued successfully, the NVDIMMs will be inaccessible. The kernel will do backoff polling to detect when the overwrite process is completed. According to the DSM spec v1.7, the 128G NVDIMMs can take up to 15mins to perform overwrite and larger DIMMs will take longer. Given that overwrite puts the DIMM in an indeterminate state until it completes, introduce the NDD_SECURITY_OVERWRITE flag to prevent other operations from executing when overwrite is happening. The NDD_WORK_PENDING flag is added to denote that there is a device reference on the nvdimm device for an async workqueue thread context. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 64e77c8c047fb91ea8c7800c1238108a72f0bf9c upstream. Add support to issue a secure erase DSM to the Intel nvdimm. The required passphrase is acquired from an encrypted key in the kernel user keyring. To trigger the action, "erase <keyid>" is written to the "security" sysfs attribute. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 03b65b22ada8115a7a7bfdf0789f6a94adfd6070 upstream. Add support to disable passphrase (security) for the Intel nvdimm. The passphrase used for disabling is pulled from an encrypted-key in the kernel user keyring. The action is triggered by writing "disable <keyid>" to the sysfs attribute "security". Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 4c6926a23b76ea23403976290cd45a7a143f6500 upstream. Add support to unlock the dimm via the kernel key management APIs. The passphrase is expected to be pulled from userspace through keyutils. The key management and sysfs attributes are libnvdimm generic. Encrypted keys are used to protect the nvdimm passphrase at rest. The master key can be a trusted-key sealed in a TPM (preferred) or an encrypted-key (more flexible, but more exposed to a potential attacker). Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 37833fb7989a9d3c3e26354e6878e682c340d718 upstream. Add support for freeze security on Intel nvdimm. This locks out any changes to security for the DIMM until a hard reset of the DIMM is performed. This is triggered by writing "freeze" to the generic nvdimm/nmemX "security" sysfs attribute. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
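Triggering the freeze from userspace is a plain sysfs write; a small sketch in C (the nmem0 path is illustrative, pick the right nmemX device):

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Sketch: lock out further security changes on nmem0 until a hard reset. */
static int freeze_nvdimm_security(void)
{
	int fd = open("/sys/bus/nd/devices/nmem0/security", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = write(fd, "freeze", strlen("freeze"));
	close(fd);
	return ret < 0 ? -1 : 0;
}
```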
-
Committed by Dave Jiang
to #27305291 commit f2989396553a0bd13f4b25f567a3dee3d722ce40 upstream. Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command set, expose a security capability to lock the DIMMs at poweroff and require a passphrase to unlock them. The security model is derived from ATA security. In anticipation of other DIMMs implementing a similar scheme, and to abstract the core security implementation away from the device-specific details, introduce nvdimm_security_ops. Initially only a status retrieval operation, ->state(), is defined, along with the base infrastructure and definitions for future operations. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Co-developed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit 76ef5e17252789da79db78341851922af0c16181 upstream. Export lookup_user_key() symbol in order to allow nvdimm passphrase update to retrieve user injected keys. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Acked-by: David Howells <dhowells@redhat.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit d6548ae4d16dc231dec22860c9c472bcb991fb15 upstream. The generated dimm id is needed for the sysfs attribute as well as being used as the identifier/description for the security key. Since it's constant and should never change, store it as a member of struct nvdimm. As nvdimm_create() continues to grow parameters relative to NFIT driver requirements, do not require other implementations to keep pace. Introduce __nvdimm_create() to carry the new parameters and keep nvdimm_create() with the long standing default api. Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [ Shile: fixed conflict in drivers/acpi/nfit/nfit.h ] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Dave Jiang
to #27305291 commit b3ed2ce024c36054e51cca2eb31a1cdbe4a5f11e upstream. Add command definition for security commands defined in Intel DSM specification v1.8 [1]. This includes "get security state", "set passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite", "overwrite query", "master passphrase enable/disable", and "master erase". Since this adds several Intel definitions, move the relevant bits to their own header. These commands mutate physical data, but that manipulation is not cache coherent. The requirement to flush and invalidate caches makes these commands unsuitable to be called from userspace, so extra logic is added to detect and block these commands from being submitted via the ioctl command submission path. Lastly, the commands may contain sensitive key material that should not be dumped in a standard debug session. Update the nvdimm-command payload-dump facility to move security command payloads behind a default-off compile time switch. [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf Signed-off-by: Dave Jiang <dave.jiang@intel.com> Signed-off-by: Dan Williams <dan.j.williams@intel.com> [ Shile: fixed conflicts: This patch updated the file "drivers/acpi/nfit/intel.h". The header file is introduced by commit 0ead111 ("acpi, nfit: Collect shutdown status") in upstream, which also updates the test files. So let's fetch this part to fix the conflict: - tools/testing/nvdimm/test/nfit.c - tools/testing/nvdimm/test/nfit_test.h ] Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com> Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
-
Committed by Mel Gorman
to #28825456 commit 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 upstream. syzbot reported the following regression in the latest merge window and it was confirmed by Qian Cai that a similar bug was visible from a different context. ====================================================== WARNING: possible circular locking dependency detected 4.20.0+ #297 Not tainted ------------------------------------------------------ syz-executor0/8529 is trying to acquire lock: 000000005e7fb829 (&pgdat->kswapd_wait){....}, at: __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120 but task is already holding lock: 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock include/linux/spinlock.h:329 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk mm/page_alloc.c:2548 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist mm/page_alloc.c:3021 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist mm/page_alloc.c:3050 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue mm/page_alloc.c:3072 [inline] 000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491 It appears to be a false positive in that the only way the lock ordering should be inverted is if kswapd is waking itself and the wakeup allocates debugging objects which should already be allocated if it's kswapd doing the waking. Nevertheless, the possibility exists and so it's best to avoid the problem. This patch flags a zone as needing a kswapd using the, surprisingly, unused zone flag field. The flag is read without the lock held to do the wakeup. It's possible that the flag setting context is not the same as the flag clearing context or for small races to occur. However, each race possibility is harmless and there is no visible degredation in fragmentation treatment. While zone->flag could have continued to be unused, there is potential for moving some existing fields into the flags field instead. Particularly read-mostly ones like zone->initialized and zone->contiguous. Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs") Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com Signed-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Tested-by: NQian Cai <cai@lca.pw> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: include/linux/mmzone.h Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
-
Committed by Mel Gorman
to #28825456 commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream. An external fragmentation event was previously described as When the page allocator fragments memory, it records the event using the mm_page_alloc_extfrag event. If the fallback_order is smaller than a pageblock order (order-9 on 64-bit x86) then it's considered an event that will cause external fragmentation issues in the future. The kernel reduces the probability of such events by increasing the watermark sizes by calling set_recommended_min_free_kbytes early in the lifetime of the system. This works reasonably well in general but if there are enough sparsely populated pageblocks then the problem can still occur as enough memory is free overall and kswapd stays asleep. This patch introduces a watermark_boost_factor sysctl that allows a zone watermark to be temporarily boosted when an external fragmentation causing events occurs. The boosting will stall allocations that would decrease free memory below the boosted low watermark and kswapd is woken if the calling context allows to reclaim an amount of memory relative to the size of the high watermark and the watermark_boost_factor until the boost is cleared. When kswapd finishes, it wakes kcompactd at the pageblock order to clean some of the pageblocks that may have been affected by the fragmentation event. kswapd avoids any writeback, slab shrinkage and swap from reclaim context during this operation to avoid excessive system disruption in the name of fragmentation avoidance. Care is taken so that kswapd will do normal reclaim work if the system is really low on memory. This was evaluated using the same workloads as "mm, page_alloc: Spread allocations across zones before introducing fragmentation". 1-socket Skylake machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 1 THP allocating thread -------------------------------------- 4.20-rc3 extfrag events < order 9: 804694 4.20-rc3+patch: 408912 (49% reduction) 4.20-rc3+patch1-4: 18421 (98% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%) Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%) Note that external fragmentation causing events are massively reduced by this path whether in comparison to the previous kernel or the vanilla kernel. The fault latency for huge pages appears to be increased but that is only because THP allocations were successful with the patch applied. 1-socket Skylake machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 291392 4.20-rc3+patch: 191187 (34% reduction) 4.20-rc3+patch1-4: 13464 (95% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%) Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%) Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%) Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%* 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%) As before, massive reduction in external fragmentation events, some jitter on latencies and an increase in THP allocation success rates. 
2-socket Haswell machine config-global-dhp__workload_thpfioscale XFS (no special madvise) 4 fio threads, 5 THP allocating threads ---------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 215698 4.20-rc3+patch: 200210 (7% reduction) 4.20-rc3+patch1-4: 14263 (93% reduction) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%) Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%) There is a 93% reduction in fragmentation causing events, there is a big reduction in the huge page fault latency and allocation success rate is higher. 2-socket Haswell machine global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE) ----------------------------------------------------------------- 4.20-rc3 extfrag events < order 9: 166352 4.20-rc3+patch: 147463 (11% reduction) 4.20-rc3+patch1-4: 11095 (93% reduction) thpfioscale Fault Latencies 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%* Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%) 4.20.0-rc3 4.20.0-rc3 lowzone-v5r8 boost-v5r8 Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%) There is a large reduction in fragmentation events with some jitter around the latencies and success rates. As before, the high THP allocation success rate does mean the system is under a lot of pressure. However, as the fragmentation events are reduced, it would be expected that the long-term allocation success rate would be higher. Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net> Acked-by: NVlastimil Babka <vbabka@suse.cz> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Michal Hocko <mhocko@kernel.org> Cc: Zi Yan <zi.yan@cs.rutgers.edu> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NXu Yu <xuyu@linux.alibaba.com> Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
-