Commits · f5fa6847df7143f04f8594c1614124dec95b86d1 · openanolis / cloud-kernel

02 Sep, 2020 40 commits

virtiofs: Set FR_SENT flag only after request has been sent · f5fa6847

Authored by Vivek Goyal on Oct 15, 2019

task #28910367
commit 5dbe190f341206a7896f7e40c1e3a36933d812f3 upstream

FR_SENT flag should be set when the request has been successfully sent
over the virtqueue. This is used by the interrupt logic to figure out
whether an interrupt request should be sent or not.

Also add it to the fpq->processing list after sending it successfully.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

virtiofs: No need to check fpq->connected state · da960fb2

Authored by Vivek Goyal on Oct 15, 2019

task #28910367
commit 7ee1e2e631dbf0ff0df2a67a1e01ba3c1dce7a46 upstream

In virtiofs we keep per-queue connected state in virtio_fs_vq->connected
and use that to end a request if the queue is not connected. virtiofs does
not even touch the fpq->connected state.

We probably need to merge these two at some point. For now,
simplify the code a bit and do not worry about checking the state of
fpq->connected.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

virtiofs: Do not end request in submission context · c3694064

Authored by Vivek Goyal on Oct 15, 2019

task #28910367
commit 51fecdd2555b3e0e05a78d30093c638d164a32f9 upstream

Submission context can hold some locks which end request code tries to hold
again and deadlock can occur. For example, fc->bg_lock. If a background
request is being submitted, it might hold fc->bg_lock and if we could not
submit request (because device went away) and tried to end request, then
deadlock happens. During testing, I also got a warning from deadlock
detection code.

So put requests on a list and end requests from a worker thread.
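
The deferral described above can be sketched in plain user-space C. This is a single-threaded illustration only, not the driver's code: `struct request`, `struct queue`, `submit_request()` and `end_deferred_requests()` are hypothetical names, and a real implementation would protect the list with its own lock and run the drain function from a worker.

```c
#include <stdbool.h>
#include <stddef.h>

struct request {
    int id;
    bool ended;
    struct request *next;
};

struct queue {
    bool device_connected;
    struct request *end_list;   /* requests waiting to be ended later */
};

/* Submission context: the caller may be holding locks that the
 * end-request path would try to take again. */
bool submit_request(struct queue *q, struct request *req)
{
    if (q->device_connected)
        return true;            /* a real driver would queue req to the device here */

    /* Submission failed: park the request instead of ending it here. */
    req->next = q->end_list;
    q->end_list = req;
    return false;
}

/* Worker context: no submission-side locks are held, so ending is safe. */
size_t end_deferred_requests(struct queue *q)
{
    size_t ended = 0;

    while (q->end_list) {
        struct request *req = q->end_list;

        q->end_list = req->next;
        req->next = NULL;
        req->ended = true;      /* stands in for the real end-request call */
        ended++;
    }
    return ended;
}
```

The point of the split is that `submit_request()` never calls the completion path itself, so it cannot re-acquire a lock its caller already holds.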

I got following warning from deadlock detector.

[  603.137138] WARNING: possible recursive locking detected
[  603.137142] --------------------------------------------
[  603.137144] blogbench/2036 is trying to acquire lock:
[  603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse]
[  603.140701]
[  603.140701] but task is already holding lock:
[  603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse]
[  603.140713]
[  603.140713] other info that might help us debug this:
[  603.140714]  Possible unsafe locking scenario:
[  603.140714]
[  603.140715]        CPU0
[  603.140716]        ----
[  603.140716]   lock(&(&fc->bg_lock)->rlock);
[  603.140718]   lock(&(&fc->bg_lock)->rlock);
[  603.140719]
[  603.140719]  *** DEADLOCK ***
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>

virtio-fs: Change module name to virtiofs.ko · 5cb5b717

Authored by Vivek Goyal on Oct 11, 2019

task #28910367
commit 112e72373d1f60f1e4558d0a7f0de5da39a1224d upstream

We have been calling it virtio_fs and even file name is virtio_fs.c. Module
name is virtio_fs.ko but when registering file system user is supposed to
specify filesystem type as "virtiofs".

Masayoshi Mizuma reported that he specified filesystem type as "virtio_fs"
and got this warning on console.

  ------------[ cut here ]------------
  request_module fs-virtio_fs succeeded, but still no fs?
  WARNING: CPU: 1 PID: 1234 at fs/filesystems.c:274 get_fs_type+0x12c/0x138
  Modules linked in: ... virtio_fs fuse virtio_net net_failover ...
  CPU: 1 PID: 1234 Comm: mount Not tainted 5.4.0-rc1 #1

So looks like kernel could find the module virtio_fs.ko but could not find
filesystem type after that.

It probably is better to rename module name to virtiofs.ko so that above
warning goes away in case user ends up specifying wrong fs name.
Reported-by: Masayoshi Mizuma <msys.mizuma@gmail.com>
Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Tested-by: Masayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
(cherry picked from commit 112e72373d1f60f1e4558d0a7f0de5da39a1224d)
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

virtio-fs: add virtiofs filesystem · 917f6dfb

Authored by Stefan Hajnoczi on Jun 12, 2018

task #28910367
commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a upstream

Add a basic file system module for virtio-fs.  This does not yet contain
shared data support between host and guest or metadata coherency speedups.
However it is already significantly faster than virtio-9p.

Design Overview
===============

With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.

 - Use fuse protocol (instead of 9p) for communication between guest and
   host.  Guest kernel will be fuse client and a fuse server will run on
   host to serve the requests.

 - For data access inside guest, mmap portion of file in QEMU address space
   and guest accesses this memory using dax.  That way guest page cache is
   bypassed and there is only one copy of data (on host).  This will also
   enable mmap(MAP_SHARED) between guests.

 - For metadata coherency, there is a shared memory region which contains
   version number associated with metadata and any guest changing metadata
   updates version number and other guests refresh metadata on next access.
   This is yet to be implemented.

How virtio-fs differs from existing approaches
==============================================

The unique idea behind virtio-fs is to take advantage of the co-location of
the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor.  The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.

By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols.  In addition, this also makes it easier to achieve
local file system semantics (coherency).

These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine.  This is why we decided to build virtio-fs rather than
focus on 9P or NFS.

Caching Modes
=============

Like virtio-9p, different caching modes are supported which determine the
coherency level as well.  The “cache=FOO” and “writeback” options control
the level of coherence between the guest and host filesystems.

 - cache=none
   metadata, data and pathname lookup are not cached in guest.  They are
   always fetched from host and any changes are immediately pushed to host.

 - cache=always
   metadata, data and pathname lookup are cached in guest and never expire.

 - cache=auto
   metadata and pathname lookup cache expires after a configured amount of
   time (default is 1 second).  Data is cached while the file is open
   (close to open consistency).

 - writeback/no_writeback
   These options control the writeback strategy.  If writeback is disabled,
   then normal writes will immediately be synchronized with the host fs.
   If writeback is enabled, then writes may be cached in the guest until
   the file is closed or an fsync(2) performed.  This option has no effect
   on mmap-ed writes or writes going through the DAX mechanism.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

(cherry picked from commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a)
[Liubo: given that 4.19 lacks fs_context support for parsing mount
options, I changed this back to the 4.19 way, so we still use -o
tag=myfs-1 to get a virtiofs mount.]
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

virtio-fs: add Documentation/filesystems/virtiofs.rst · 3fcc16fc

Authored by Stefan Hajnoczi on Aug 29, 2019

task #28910367
commit 2d1d25d0a224dcd2021004d52342fc1727ccd85f upstream

Add information about the new "virtiofs" file system.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: reserve values for mapping protocol · 30efb5f4

Authored by Dr. David Alan Gilbert on Aug 02, 2019

task #28910367
commit c4bb667eaf520f21b3a3db0489682becc9c49bcc upstream

SETUPMAPPING is a command for use with 'virtiofsd', a fuse-over-virtio
implementation; it may find use in other fuse implementations as well in
which the kernel does not have access to the address space of the daemon
directly.

A SETUPMAPPING operation causes a section of a file to be mapped into a
memory window visible to the kernel.  The offsets in the file and the
window are defined by the kernel performing the operation.

The daemon may reject the request, for reasons including permissions and
limited resources.

When a request perfectly overlaps a previous mapping, the previous mapping
is replaced.  When a mapping partially overlaps a previous mapping, the
previous mapping is split into one or two smaller mappings.
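
The overlap rule above (replace on a perfect overlap, split into one or two pieces on a partial overlap) can be illustrated with a small range helper. This is a sketch under assumptions, not code from virtiofsd or the kernel: `struct map_range` and `split_mapping()` are hypothetical names, and ranges are treated as half-open byte intervals in the window.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open byte range [start, end) inside the mapping window. */
struct map_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Compute what is left of an existing mapping `old` once a new request
 * `req` has been mapped over it. Writes up to two remainders into `out`
 * and returns how many were written:
 *   0 - req covers old completely (old is replaced)
 *   1 - req overlaps one end of old, or does not overlap it at all
 *   2 - req lies strictly inside old
 */
size_t split_mapping(struct map_range old, struct map_range req,
                     struct map_range out[2])
{
    size_t n = 0;

    if (req.end <= old.start || req.start >= old.end) {
        out[n++] = old;             /* no overlap: old mapping is untouched */
        return n;
    }
    if (old.start < req.start) {    /* piece left of the new mapping */
        out[n].start = old.start;
        out[n].end = req.start;
        n++;
    }
    if (req.end < old.end) {        /* piece right of the new mapping */
        out[n].start = req.end;
        out[n].end = old.end;
        n++;
    }
    return n;
}
```

For example, mapping [0x2000, 0x3000) over an existing [0x1000, 0x5000) leaves two remainders, [0x1000, 0x2000) and [0x3000, 0x5000).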

REMOVEMAPPING is the complement to SETUPMAPPING; it unmaps a range of
mapped files from the window visible to the kernel.

The map_alignment field communicates the alignment constraint for
FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING and allows the daemon to constrain the
addresses and file offsets chosen by the kernel.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: reserve byteswapped init opcodes · 4b61cdee

Authored by Michael S. Tsirkin on Sep 04, 2019

task #28910367
commit 501ae8ecae2ba5122774dee4445003505a7fd01b upstream

virtio fs tunnels fuse over a virtio channel.  One issue is that the two
sides might be speaking different endianness. To detect this, the host side
looks at the opcode value in the FUSE_INIT command.  This works fine at the
moment but might fail if a future version of fuse uses such an opcode for
initialization.  Let's reserve this opcode so we remember and don't do
this.

Same for CUSE_INIT.
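
A minimal sketch of the detection idea, assuming the opcode is read as a 32-bit value. The opcode numbers 26 (FUSE_INIT) and 4096 (CUSE_INIT) are the protocol's values; `bswap32()`, `detect_endianness()` and the enum are hypothetical helpers written for this illustration, not the host daemon's code.

```c
#include <stdint.h>

#define FUSE_INIT_OPCODE 26u    /* FUSE_INIT in the fuse protocol */
#define CUSE_INIT_OPCODE 4096u  /* CUSE_INIT in the fuse protocol */

/* Reverse the byte order of a 32-bit value. */
static uint32_t bswap32(uint32_t v)
{
    return ((v & 0x000000ffu) << 24) |
           ((v & 0x0000ff00u) << 8)  |
           ((v & 0x00ff0000u) >> 8)  |
           ((v & 0xff000000u) >> 24);
}

enum peer_endianness { PEER_SAME, PEER_SWAPPED, PEER_UNKNOWN };

/* Classify the peer from the opcode of the first request it sends. */
enum peer_endianness detect_endianness(uint32_t first_opcode)
{
    if (first_opcode == FUSE_INIT_OPCODE)
        return PEER_SAME;
    if (first_opcode == bswap32(FUSE_INIT_OPCODE))
        return PEER_SWAPPED;    /* only unambiguous if this value is reserved */
    return PEER_UNKNOWN;
}
```

The byteswapped values work out to 436207616 (26 << 24) for FUSE_INIT and 1048576 (4096 << 8) for CUSE_INIT, which is why those two numbers must never be assigned to real opcodes.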
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: delete dentry if timeout is zero · 9ffcf1ac

Authored by Miklos Szeredi on Aug 15, 2018

task #28910367
commit 8fab010644363f8f80194322aa7a81e38c867af3 upstream

Don't hold onto a dentry in the lru list if we need to look it up again
anyway at the next access.  Only do this if explicitly enabled, otherwise
it could result in a performance regression.

A more advanced version of this patch would periodically flush out dentries
from the lru which have gone stale.
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: Export fuse_dequeue_forget() function · 09db4841

Authored by Vivek Goyal on May 31, 2019

task #28910367
commit 4388c5aac4bae5c83a2c66882043942002ba09a2 upstream

Stacked file systems like virtio-fs do not have to play directly with the
forget list data structures. There is a helper function; use that instead.

Rename dequeue_forget() to fuse_dequeue_forget() and export it so that
stacked filesystems can use it.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: export fuse_get_unique() · f96b6dd6

Authored by Stefan Hajnoczi on Jun 22, 2018

task #28910367
commit 79d96efffda7597b41968d5d8813b39fc2965f1b upstream

virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c.  Make the symbol visible.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: Separate fuse device allocation and installation in fuse_conn · 0213de76

Authored by Vivek Goyal on Mar 06, 2019

task #28910367
commit 0cd1eb9a4160a96e0ec9b93b2e7b489f449bf22d upstream

As of now fuse_dev_alloc() both allocates a fuse device and installs it
in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation
fails.

virtio-fs needs to initialize multiple fuse devices (one per virtio
queue). It initializes one fuse device as part of call to
fuse_fill_super_common() and rest of the devices are allocated and
installed after that.

But, we can't afford to fail after calling fuse_fill_super_common() as
we don't have a way to undo all the actions done by fuse_fill_super_common().
So to avoid failures after the call to fuse_fill_super_common(),
pre-allocate all fuse devices early and install them into fuse connection
later.

This patch provides two separate helpers for fuse device allocation and
fuse device installation in fuse_conn.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: add fuse_iqueue_ops callbacks · 57f16587

Authored by Stefan Hajnoczi on Jun 18, 2018

task #28910367
commit ae3aad77f46fbba56eff7141b2fc49870b60827e upstream

The /dev/fuse device uses fiq->waitq and fasync to signal that requests
are available.  These mechanisms do not apply to virtio-fs.  This patch
introduces callbacks so alternative behavior can be used.

Note that queue_interrupt() changes along these lines:

  spin_lock(&fiq->waitq.lock);
  wake_up_locked(&fiq->waitq);
+ kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
  spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);

Since queue_request() and queue_forget() also call kill_fasync() inside
the spinlock this should be safe.
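
The callback idea can be sketched as an ops table that the queueing code calls instead of hard-coding one wake-up mechanism. This is a user-space illustration under assumptions: `struct input_queue`, `struct input_queue_ops`, `wake_pending` and the two sample transports are hypothetical names, not the kernel's definitions.

```c
struct input_queue;

/* Transport-specific hook, called when new work has been queued. */
struct input_queue_ops {
    void (*wake_pending)(struct input_queue *q);
};

struct input_queue {
    const struct input_queue_ops *ops;
    unsigned int pending;         /* requests queued so far */
    unsigned int reader_wakeups;  /* counts wake-ups of a sleeping reader */
    unsigned int device_kicks;    /* counts notifications sent to a device */
};

/* Character-device style transport: wake a sleeping reader. */
static void chardev_wake_pending(struct input_queue *q)
{
    q->reader_wakeups++;
}

/* Virtqueue style transport: notify the device instead. */
static void virtqueue_wake_pending(struct input_queue *q)
{
    q->device_kicks++;
}

const struct input_queue_ops chardev_ops = {
    .wake_pending = chardev_wake_pending,
};

const struct input_queue_ops virtqueue_ops = {
    .wake_pending = virtqueue_wake_pending,
};

/* Generic queueing code: it no longer knows how the transport is woken. */
void queue_request(struct input_queue *q)
{
    q->pending++;
    q->ops->wake_pending(q);
}
```

Keeping the wake-up behind a function pointer lets the generic code stay identical for both transports; only the ops table chosen at setup time differs.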
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: Export fuse_send_init_request() · 6769b1fd

Authored by Vivek Goyal on Mar 06, 2019

task #28910367
commit 95a84cdb11c26315a6d34664846f82c438c961a1 upstream

This will be used by virtio-fs to send the init request to the fuse server
after initialization of the virtqueues.
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: export fuse_len_args() · 609c1cf3

Authored by Stefan Hajnoczi on Jun 21, 2018

task #28910367
commit 14d46d7abc3973a47e8eb0eb5eb87ee8d910a505 upstream

virtio-fs will need to query the length of fuse_arg lists.  Make the
symbol visible.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: export fuse_end_request() · 63b1ffab

Authored by Stefan Hajnoczi on Jun 21, 2018

task #28910367
commit 04ec5af0776e9baefed59891f12adbcb5fa71a23 upstream

virtio-fs will need to complete requests from outside fs/fuse/dev.c.
Make the symbol visible.
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

fuse: extract fuse_fill_super_common() · 0fdc23c2

Authored by Stefan Hajnoczi on Jun 13, 2018

task #28910367
commit 0cc2656cdb0b1f234e6d29378cb061e29d7522bc upstream

fuse_fill_super() includes code to process the fd= option and link the
struct fuse_dev to the fd's struct file.  In virtio-fs there is no file
descriptor because /dev/fuse is not used.

This patch extracts fuse_fill_super_common() so that both classic fuse
and virtio-fs can share the code to initialize a mount.

parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
caller has access to the mount options.  This allows classic fuse to
handle the fd= option outside fuse_fill_super_common().
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

virtio_mem: convert device block size into 64bit · 1e5d7fb5

Authored by Michael S. Tsirkin on Jun 08, 2020

task #29077503
commit 544fc7dbbf920a3e64d109c416ee229e8e1763c5 upstream
can overflow. Rather than try to catch all instances of that,
let's tweak block size to 64 bit.
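
The excerpt above is cut off before it says which computation overflows, so the following is only an assumed illustration: a byte count computed from a 32-bit block size wraps once the product reaches 4 GiB, while the same computation with a 64-bit block size does not. `span_bytes_32()` and `span_bytes_64()` are hypothetical helpers, not driver functions.

```c
#include <stdint.h>

/* 32-bit block size: the product is computed modulo 2^32 and can wrap. */
uint32_t span_bytes_32(uint32_t block_size, uint32_t nb_blocks)
{
    return block_size * nb_blocks;
}

/* 64-bit block size: the multiplication is carried out in 64 bits. */
uint64_t span_bytes_64(uint64_t block_size, uint32_t nb_blocks)
{
    return block_size * nb_blocks;
}
```

With a 1 GiB block size and 8 blocks, the 32-bit version returns 0 because 2^33 wraps, whereas the 64-bit version returns the expected 8 GiB.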

It ripples through UAPI, which is an ABI change, but it's not too late to
make it, and it will allow supporting >4Gbyte blocks, which might
become necessary down the road.

Fixes: 5f1f79bbc9e26 ("virtio-mem: Paravirtualized memory hotplug")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm/memory_hotplug: set node_start_pfn of hotadded pgdat to 0 · 64f827bf

Authored by David Hildenbrand on Jun 04, 2020

task #29077503
commit c68ab18c6aee0397574afb418f6775f23379198e upstream
Patch series "mm/memory_hotplug: handle memblocks only with
CONFIG_ARCH_KEEP_MEMBLOCK", v1.

A hotadded node/pgdat will span no pages at all, until memory is moved to
the zone/node via move_pfn_range_to_zone() -> resize_pgdat_range - e.g.,
when onlining memory blocks.  We don't have to initialize the
node_start_pfn to the memory we are adding.

This patch (of 2):

Especially, there is an inconsistency:
 - Hotplugging memory to a memory-less node with cpus: node_start_pfn == 0
 - Offlining and removing last memory from a node: node_start_pfn == 0
 - Hotplugging memory to a memory-less node without cpus: node_start_pfn != 0

As soon as memory is onlined, node_start_pfn is overwritten with the
actual start.  E.g., when adding two DIMMs but only onlining one of both,
only that DIMM (with online memory blocks) is spanned by the node.

Currently, the validity of node_start_pfn really is linked to
node_spanned_pages != 0.  With node_spanned_pages == 0 (e.g., before
onlining memory), it has no meaning.

So let's stop setting node_start_pfn, just to be overwritten via
move_pfn_range_to_zone().  This avoids confusion when looking at the code,
wondering which magic will be performed with the node_start_pfn in this
function, when hotadding a pgdat.
Signed-off-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Baoquan He <bhe@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Michal Hocko <mhocko@suse.com>
Link: http://lkml.kernel.org/r/20200422155353.25381-1-david@redhat.com
Link: http://lkml.kernel.org/r/20200422155353.25381-2-david@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c68ab18c6aee0397574afb418f6775f23379198e)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: silence a static checker warning · 99d68412

Authored by Dan Carpenter on Jun 10, 2020

task #29077503
commit 1c3d69ab5348b661616992206357a3ebf19b1008 upstream
statement on the first iteration through the loop.  I suspect that this
can't happen in real life, but returning a zero literal is cleaner and
silences the static checker warning.

Fixes: 5f1f79bbc9e2 ("virtio-mem: Paravirtualized memory hotplug")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Link: https://lore.kernel.org/r/20200610085911.GC5439@mwanda
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 1c3d69ab5348b661616992206357a3ebf19b1008)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: drop unnecessary initialization · a640f759

Authored by Michael S. Tsirkin on Jun 08, 2020

task #29077503
commit b3fb6de7c6019c5d8495c3a115d42a0f118f631c upstream

Fixes: 5f1f79bbc9e2 ("virtio-mem: Paravirtualized memory hotplug")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
(cherry picked from commit b3fb6de7c6019c5d8495c3a115d42a0f118f631c)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Don't rely on implicit compiler padding for requests · 8e6d8cc8

Authored by David Hildenbrand on May 15, 2020

task #29077503
commit fce8afd76e3a4d8c59c92f84f8027569fd7031d0 upstream
The compiler will add padding after the last member, make that explicit.
The size of a request is always 24 bytes. The size of a response is always
10 bytes. Add compile-time checks.
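
A sketch of the technique, assuming a typical ABI where `uint64_t` is 8-byte aligned. The layouts below are hypothetical stand-ins chosen to reach the sizes stated above (24 and 10 bytes); they are not the UAPI definitions, and `demo_req` / `demo_resp` are made-up names.

```c
#include <stddef.h>
#include <stdint.h>

struct demo_req {
    uint16_t type;
    uint16_t padding[3];       /* explicit padding up to the 8-byte payload */
    uint64_t addr;
    uint16_t nb_blocks;
    uint16_t tail_padding[3];  /* explicit padding after the last member */
};

struct demo_resp {
    uint16_t type;
    uint16_t padding[3];
    uint16_t state;            /* only 2-byte members, so no tail padding */
};

/* Compile-time checks: the build fails if a layout ever changes size. */
_Static_assert(offsetof(struct demo_req, addr) == 8, "payload starts at 8");
_Static_assert(sizeof(struct demo_req) == 24, "request must be 24 bytes");
_Static_assert(sizeof(struct demo_resp) == 10, "response must be 10 bytes");
```

Without `tail_padding`, the compiler would still round the request up to 24 bytes because of the 8-byte member, but the trailing bytes would be implicit; spelling them out keeps the wire format visible in the source.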

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200515101402.16597-1-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit fce8afd76e3a4d8c59c92f84f8027569fd7031d0)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Try to unplug the complete online memory block first · fe30f1c1

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 72f9525ad76b1ddfe663285805982e9d57c7b2c2 upstream
Right now, we always try to unplug single subblocks when processing an
online memory block. Let's try to unplug the complete online memory block
first, in case it is fully plugged and the unplug request is large
enough. Fallback to single subblocks in case the memory block cannot get
unplugged as a whole.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-16-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 72f9525ad76b1ddfe663285805982e9d57c7b2c2)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Use -ETXTBSY as error code if the device is busy · 7540dce9

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 8d4edcfe78c0008d95effc0c90455cee59e18d10 upstream
Let's be able to distinguish whether the device or the memory is busy.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-15-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 8d4edcfe78c0008d95effc0c90455cee59e18d10)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Unplug subblocks right-to-left · c9d27e7b

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 562e08cd249f98af3a3e0845998f3b27b56b0067 upstream
right-to-left.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-14-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 562e08cd249f98af3a3e0845998f3b27b56b0067)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Drop manual check for already present memory · e134eeb0

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 3c42e198e668e4040ef5cf3ad60d57765abc08a4 upstream
Registering our parent resource will fail if any memory is still present
(e.g., because somebody unloaded the driver and tries to reload it). No
need for the manual check.

Move our "unplug all" handling to after registering the resource.

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-13-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 3c42e198e668e4040ef5cf3ad60d57765abc08a4)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Add parent resource for all added "System RAM" · 94271a31

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit ebf71552bb0e690cad523ad175e8c4c89a33c333 upstream
Let's add a parent resource, named after the virtio device (inspired by
drivers/dax/kmem.c). This allows user space to identify which memory
belongs to which virtio-mem device.

With this change and two virtio-mem devices:
	:/# cat /proc/iomem
	00000000-00000fff : Reserved
	00001000-0009fbff : System RAM
	[...]
	140000000-333ffffff : virtio0
	  140000000-147ffffff : System RAM
	  148000000-14fffffff : System RAM
	  150000000-157ffffff : System RAM
	[...]
	334000000-3033ffffff : virtio1
	  338000000-33fffffff : System RAM
	  340000000-347ffffff : System RAM
	  348000000-34fffffff : System RAM
	[...]

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-12-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
(cherry picked from commit ebf71552bb0e690cad523ad175e8c4c89a33c333)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Better retry handling · f362154a

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 23e77b5dc9cd88709c48ada936c07bdd72c49426 upstream
we reach 5 minutes, in case we keep getting errors. Reset the retry
interval in case we succeeded.

The two main reasons for having to retry are
- The hypervisor is busy and cannot process our request
- We cannot reach the desired requested_size (esp., not enough memory can
  get unplugged because we can't allocate any subblocks).
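
The retry policy can be sketched as a capped back-off that resets on success. Only the 5 minute ceiling and the reset come from the text above; the 30 second starting value and the doubling step are assumptions made for this illustration, and `next_retry_ms()` is a hypothetical helper rather than the driver's function.

```c
#define RETRY_MS_MIN  30000u   /* assumed starting interval (30 seconds) */
#define RETRY_MS_MAX 300000u   /* 5 minutes, the ceiling named above */

/*
 * Return the interval to wait before the next attempt.
 * On success the interval resets; on failure it grows (here: doubles,
 * an assumption) until it reaches the ceiling.
 */
unsigned int next_retry_ms(unsigned int current_ms, int succeeded)
{
    if (succeeded)
        return RETRY_MS_MIN;
    if (current_ms >= RETRY_MS_MAX / 2)
        return RETRY_MS_MAX;    /* doubling would overshoot: clamp */
    return current_ms * 2;
}
```

Under these assumptions, repeated failures step through 30 s, 60 s, 120 s, 240 s and then stay at 300 s until a request succeeds.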
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-11-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 23e77b5dc9cd88709c48ada936c07bdd72c49426)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Offline and remove completely unplugged memory blocks · bf191d23

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit a573238786f8f16aca6946fc7b804b965e3038e9 upstream
Let's offline+remove memory blocks once all subblocks are unplugged. We
can use the new Linux MM interface for that. As no memory is in use
anymore, this shouldn't take a long time and shouldn't fail. There might
be corner cases where the offlining could still fail (especially, if
another notifier NACKs the offlining request).
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-10-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit a573238786f8f16aca6946fc7b804b965e3038e9)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

mm/memory_hotplug: Introduce offline_and_remove_memory() · c062d118

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 08b3acd7a68fc17902e1cb6b146389322840deab upstream
virtio-mem wants to offline and remove a memory block once it unplugged
all subblocks (e.g., using alloc_contig_range()). Let's provide
an interface to do that from a driver. virtio-mem already supports to
offline partially unplugged memory blocks. Offlining a fully unplugged
memory block will not require to migrate any pages. All unplugged
subblocks are PageOffline() and have a reference count of 0 - so
offlining code will simply skip them.

All we need is an interface to offline and remove the memory from kernel
module context, where we don't have access to the memory block devices
(esp. find_memory_block() and device_offline()) and the device hotplug
lock.

To keep things simple, allow to only work on a single memory block.
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 08b3acd7a68fc17902e1cb6b146389322840deab)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

virtio-mem: Allow to offline partially unplugged memory blocks · 97fe5a44

Authored by David Hildenbrand on May 07, 2020

task #29077503
commit 8e5c921ca0cd9aa59386e6be4b86b32f0ba7296b upstream
Dropping the reference count of PageOffline() pages during MEM_GOING_OFFLINE
allows offlining code to skip them. However, we also have to clear
PG_reserved, because PG_reserved pages get detected as unmovable right
away. Take care of restoring the reference count when offlining is
canceled.

Clarify why we don't have to perform any action when unloading the
driver. Also, let's add a warning if anybody is still holding a
reference to unplugged pages when offlining.
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-8-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 8e5c921ca0cd9aa59386e6be4b86b32f0ba7296b)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

97fe5a44

mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · 20500825

Committed by David Hildenbrand on August 27, 2019

task #29077503
commit aa218795cb5fd583c94fc838dc76b7379dc4976a upstream
virtio-mem wants to allow to offline memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially, to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's allow to do that by allowing to isolate any PageOffline() page
when offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that it is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (as if they were free; however, they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
decrement the reference count and make offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
Same applies to balloon compaction users (movable PageOffline() pages), the
pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
	count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
	the memory block has to be handled by hooking into onlining code
	(online_page_callback_t), resetting the page PageOffline() and
	not giving them to the buddy.
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
(cherry picked from commit aa218795cb5fd583c94fc838dc76b7379dc4976a)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts: keep the old unrelated code, and remove offlined_pages++
	mm/memory_hotplug.c
	mm/page_alloc.c
	mm/page_isolation.c
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

20500825

virtio-mem: Paravirtualized memory hotunplug part 2 · 733f2794

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit 255f598507083905995ecab96392770ae03aac7f upstream
and, therefore, managed by the buddy), and eventually replug it later.

When requested to unplug memory, we use alloc_contig_range() to allocate
subblocks in online memory blocks (so we are the owner) and send them to
our hypervisor. When requested to plug memory, we can replug such memory
using free_contig_range() after asking our hypervisor.

We also want to mark all allocated pages PG_offline, so nobody will
touch them. To differentiate pages that were never onlined when
onlining the memory block from pages allocated via alloc_contig_range(), we
use PageDirty(). Based on this flag, virtio_mem_fake_online() can either
online the pages for the first time or use free_contig_range().

It is worth noting that there are no guarantees on how much memory can
actually get unplugged again. All device memory might completely be
fragmented with unmovable data, such that no subblock can get unplugged.

We are not touching the ZONE_MOVABLE. If memory is onlined to the
ZONE_MOVABLE, it can only get unplugged after that memory was offlined
manually by user space. In normal operation, virtio-mem memory is
suggested to be onlined to ZONE_NORMAL. In the future, we will try to
make unplug more likely to succeed.

Add a module parameter to control if online memory shall be touched.

As we want to access alloc_contig_range()/free_contig_range() from
kernel module context, export the symbols.

Note: Whenever virtio-mem uses alloc_contig_range(), all affected pages
are on the same node, in the same zone, and contain no holes.

Acked-by: Michal Hocko <mhocko@suse.com> # to export contig range allocator API
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Alexander Potapenko <glider@google.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-6-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 255f598507083905995ecab96392770ae03aac7f)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	minimal fix in mm/page_alloc.c
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

733f2794

virtio-mem: Paravirtualized memory hotunplug part 1 · 91f30ea2

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit c627ff5d982276908188fae86dbe727ed49c9594 upstream
have to do is watch out for concurrent onlining activity.
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-5-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit c627ff5d982276908188fae86dbe727ed49c9594)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

91f30ea2

virtio-mem: Allow to specify an ACPI PXM as nid · c9f36272

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit f2af6d3978d74a7891d0f428537b4494498202cb upstream
virtio-mem device (and, therefore, its memory) belongs. Add a new
virtio-mem feature flag and export pxm_to_node, so it can be used in kernel
module context.

Acked-by: Michal Hocko <mhocko@suse.com> # for the export
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> # for the export
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-acpi@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit f2af6d3978d74a7891d0f428537b4494498202cb)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	move drivers/acpi/numa/srat.c modification into
	drivers/acpi/numa.c
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

c9f36272

virtio-mem: Paravirtualized memory hotplug · d3997ceb

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit 5f1f79bbc9e26fa9412fa9522f957bb8f030c442 upstream
for adding/removing memory from that memory region on request.

When the device driver starts up, the requested amount of memory is
queried and then plugged to Linux. On request, further memory can be
plugged or unplugged. This patch only implements the plugging part.

On x86-64, memory can currently be plugged in 4MB ("subblock") granularity.
When required, a new memory block will be added (e.g., usually 128MB on
x86-64) in order to plug more subblocks. Only x86-64 was tested for now.
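
The 4MB subblock and 128MB memory block sizes above imply 32 subblocks per memory block on x86-64, which fits a one-bit-per-subblock bitmap. The following user-space sketch only illustrates that bookkeeping idea; all `sketch_*` names are hypothetical and are not taken from the virtio-mem driver, whose actual data structures may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sizes from the commit message (x86-64). All names are hypothetical. */
#define SKETCH_MB_SIZE   (128ULL << 20)  /* one memory block: 128MB */
#define SKETCH_SB_SIZE   (4ULL << 20)    /* one subblock: 4MB */
#define SKETCH_SB_PER_MB ((unsigned int)(SKETCH_MB_SIZE / SKETCH_SB_SIZE))

/* One bit per subblock: 1 = plugged, 0 = unplugged. 32 bits suffice here. */
struct sketch_mb {
	uint32_t plugged;
};

/* Mark subblock sb of this memory block as plugged. */
static inline void sketch_plug(struct sketch_mb *mb, unsigned int sb)
{
	assert(sb < SKETCH_SB_PER_MB);
	mb->plugged |= UINT32_C(1) << sb;
}

/* Mark subblock sb of this memory block as unplugged. */
static inline void sketch_unplug(struct sketch_mb *mb, unsigned int sb)
{
	assert(sb < SKETCH_SB_PER_MB);
	mb->plugged &= ~(UINT32_C(1) << sb);
}

/* Test whether subblock sb is currently plugged. */
static inline bool sketch_is_plugged(const struct sketch_mb *mb,
				     unsigned int sb)
{
	assert(sb < SKETCH_SB_PER_MB);
	return (mb->plugged >> sb) & 1u;
}

/* Count plugged subblocks by clearing the lowest set bit until none remain. */
static inline unsigned int sketch_count_plugged(const struct sketch_mb *mb)
{
	unsigned int n = 0;

	for (uint32_t v = mb->plugged; v; v &= v - 1)
		n++;
	return n;
}
```

A memory block whose count drops to zero would be a candidate for complete removal, which is the case the later unplug patches in this series are concerned with.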

The online_page callback is used to keep unplugged subblocks offline
when onlining memory - similar to the Hyper-V balloon driver. Unplugged
pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to
skip them.

User space is usually responsible for onlining the added memory. The
memory hotplug notifier is used to synchronize virtio-mem activity
against memory onlining/offlining.

Each virtio-mem device can belong to a NUMA node, which allows us to
easily add/remove small chunks of memory to/from a specific NUMA node by
using multiple virtio-mem devices. Something that works even when the
guest has no idea about the NUMA topology.

One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many
"sub-DIMMS".

This patch directly introduces the basic infrastructure to implement memory
unplug. Especially the memory block states and subblock bitmaps will be
heavily used there.

Notes:
- In case memory is to be onlined by user space, we limit the amount of
  offline memory blocks, to not run out of memory. This is esp. an
  issue if memory is added faster than it is getting onlined.
- Suspend/Hibernate is not supported due to the way virtio-mem devices
  behave. Limited support might be possible in the future.
- Reloading the device driver is not supported.
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-acpi@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 5f1f79bbc9e26fa9412fa9522f957bb8f030c442)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	drivers/virtio/Makefile
	include/uapi/linux/virtio_ids.h
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

d3997ceb

mm/memory_hotplug: export generic_online_page() · 0c6a9eb5

Committed by David Hildenbrand on November 30, 2019

task #29077503
commit 18db149120c106cf2b1a2595f82f3229f9d223b8 upstream

Let's replace the __online_page...() functions by generic_online_page().
Hyper-V only wants to delay the actual onlining of un-backed pages, so
we can simply re-use the generic function.

This patch (of 3):

Let's expose generic_online_page() so online_page_callback users can
simply fall back to the generic implementation when actually deciding to
online the pages.

Link: http://lkml.kernel.org/r/20190909114830.662-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 18db149120c106cf2b1a2595f82f3229f9d223b8)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

0c6a9eb5

mm/page_alloc.c: memory hotplug: free pages as higher order · bd6aced3

Committed by Arun KS on March 05, 2019

task #29077503
commit a9cd410a3d296846a8125aa43d97a573a354c472 upstream
When pages are freed at a higher order, the buddy allocator spends less
time coalescing them.  With a section size of 256MB, the hot add latency
of a single section improves from 50-60 ms to less than 1 ms, i.e., by a
factor of roughly 60.  Modify external providers of the online callback
to align with the change.
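
To get a feel for why freeing at a higher order helps, one can count how many free operations a 256MB section needs at different orders. The sketch below is plain arithmetic in user-space C, assuming 4 KiB pages and a largest buddy order of 10; both values are assumptions for illustration, and the `sketch_*` names are hypothetical. The latency figures themselves come from the commit message, not from this calculation.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define SKETCH_TOP_ORDER  10	/* assumption: largest buddy order */

/* Number of pages in a section of section_mb megabytes. */
static inline unsigned long sketch_section_pages(unsigned long section_mb)
{
	return (section_mb << 20) >> SKETCH_PAGE_SHIFT;
}

/*
 * Number of free operations needed to hand nr_pages to the allocator
 * in chunks of 2^order pages. nr_pages must be a multiple of the chunk.
 */
static inline unsigned long sketch_free_calls(unsigned long nr_pages,
					      unsigned int order)
{
	unsigned long chunk = 1UL << order;

	assert(order <= SKETCH_TOP_ORDER);
	assert(nr_pages % chunk == 0);
	return nr_pages / chunk;
}
```

Under these assumptions, a 256MB section holds 65536 pages, so freeing page by page takes 65536 operations, while freeing in order-10 chunks takes 64, and none of those chunks has to be merged upward from single pages.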

[arunks@codeaurora.org: v11]
  Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
[akpm@linux-foundation.org: remove unused local, per Arun]
[akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
[akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
[arunks@codeaurora.org: v8]
  Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
[arunks@codeaurora.org: v9]
  Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit a9cd410a3d296846a8125aa43d97a573a354c472)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	replace totalram_pages_add with the old way.

bd6aced3

mm, memory_hotplug: deobfuscate migration part of offlining · 5eee4728

Committed by Michal Hocko on December 28, 2018

task #29077503
commit bb8965bd82fd4ed433a888f1383016ab3fa0d7de upstream
Memory migration might fail during offlining and we keep retrying in that
case.  This is currently obfuscated by a goto retry loop.  The code is hard
to follow and, as a result, it is even suboptimal because each retry round
scans the full range from start_pfn even though we have successfully
scanned/migrated the [start_pfn, pfn] range already.  This is all only
because a check_pages_isolated failure has to rescan the full range again.

De-obfuscate the migration retry loop by promoting it to a real for loop.
In fact, remove the goto altogether by making it a proper double loop
(yeah, gotos are nasty in this specific case).  In the end we get
slightly more optimal code which is more readable.
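
The double-loop shape described above can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the kernel code: the page states, the "pending" state that a drain step resolves, and every `sketch_*` name are hypothetical. The inner loop resumes scanning from the last position instead of from the start of the range; the outer loop repeats only when the final isolation check fails.

```c
#include <assert.h>
#include <stddef.h>

/* Toy page states (hypothetical, not kernel definitions). */
enum sketch_state { SKETCH_ISOLATED, SKETCH_IN_USE, SKETCH_PENDING };

struct sketch_page {
	enum sketch_state state;
	int fails_left;	/* transient migration failures still to come */
};

/* Return the first in-use index in [start, end), or end if none. */
static inline size_t sketch_scan_movable(const struct sketch_page *p,
					 size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		if (p[i].state == SKETCH_IN_USE)
			return i;
	return end;
}

/* Try to migrate each in-use page; a migrated page becomes pending. */
static inline void sketch_migrate_range(struct sketch_page *p,
					size_t start, size_t end)
{
	for (size_t i = start; i < end; i++) {
		if (p[i].state != SKETCH_IN_USE)
			continue;
		if (p[i].fails_left > 0)
			p[i].fails_left--;	/* this attempt failed */
		else
			p[i].state = SKETCH_PENDING;
	}
}

/* Model of a drain step: pending pages become isolated. */
static inline void sketch_drain(struct sketch_page *p, size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		if (p[i].state == SKETCH_PENDING)
			p[i].state = SKETCH_ISOLATED;
}

/* Final check: every page in the range must be isolated. */
static inline int sketch_all_isolated(const struct sketch_page *p,
				      size_t start, size_t end)
{
	for (size_t i = start; i < end; i++)
		if (p[i].state != SKETCH_ISOLATED)
			return 0;
	return 1;
}

/* Double loop without goto; returns the number of outer rounds. */
static inline unsigned int sketch_offline(struct sketch_page *p,
					  size_t start, size_t end)
{
	unsigned int rounds = 0;

	do {
		rounds++;
		sketch_drain(p, start, end);
		/* Inner loop: continue from pfn, not from start. */
		for (size_t pfn = sketch_scan_movable(p, start, end);
		     pfn < end;
		     pfn = sketch_scan_movable(p, pfn, end))
			sketch_migrate_range(p, pfn, end);
	} while (!sketch_all_isolated(p, start, end));

	return rounds;
}
```

In this model a page that fails migration a few times is simply retried by the inner loop at the same position, while the outer loop only runs again because migrated pages are still pending when the isolation check is made.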

[akpm@linux-foundation.org: reflow comments to 80 cols]
Link: http://lkml.kernel.org/r/20181211142741.2607-3-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Hugh Dickins <hughd@google.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: William Kucharski <william.kucharski@oracle.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit bb8965bd82fd4ed433a888f1383016ab3fa0d7de)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

5eee4728

mm, memory_hotplug: __offline_pages fix wrong locking · a6785cdc

Committed by Michal Hocko on February 01, 2019

task #29077503
commit e3df4c6e4836ce93cd5cf92d9cbdeaf4439a0241 upstream
offlining a page range.  This is indeed the case when
test_pages_in_a_zone resp.  start_isolate_page_range fail.  This was an
omission when forward porting the debugging patch from an older kernel.

Fix the issue by dropping mem_hotplug_done from the failure condition
and keeping the single unlock in the catch all failure path.

Link: http://lkml.kernel.org/r/20190115120307.22768-1-mhocko@kernel.org
Fixes: 7960509329c2 ("mm, memory_hotplug: print reason for the offlining failure")
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Jan Kara <jack@suse.cz>
Reviewed-by: Jan Kara <jack@suse.cz>
Tested-by: Jan Kara <jack@suse.cz>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit e3df4c6e4836ce93cd5cf92d9cbdeaf4439a0241)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

a6785cdc
