Commits · 75abfe7d7de12fc6c77df9348dbf2cfb0973ac81 · openanolis / cloud-kernel

02 Sep, 2020 40 commits

genirq: Provide interrupt injection mechanism · 75abfe7d

Committed by Thomas Gleixner on Jul 27, 2020

task #29600094

commit acd26bcf362708594ea081ef55140e37d0854ed2 upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

Error injection mechanisms need a reasonably safe way to inject interrupts,
as invoking generic_handle_irq() or the actual device interrupt handler
directly from e.g. a debugfs write is not guaranteed to be safe.

On x86 generic_handle_irq() is unsafe due to the hardware trainwreck which
is the base of x86 interrupt delivery and affinity management.

Move the irq debugfs injection code into a separate function which can be
used by error injection code as well.

The implementation at least prevents state corruption, but it cannot
close a very tiny race window on x86 which might result in a stale and not
serviced device interrupt under very unlikely circumstances.

This is explicitly for debugging and testing and not for production use or
abuse in random driver code.
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.990928309@linutronix.de

(cherry picked from commit acd26bcf362708594ea081ef55140e37d0854ed2)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>

Conflicts:
	include/linux/interrupt.h
	kernel/irq/debugfs.c
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

75abfe7d

genirq: Add protection against unsafe usage of generic_handle_irq() · 90000805

Committed by Thomas Gleixner on Jul 27, 2020

task #29600094

commit c16816acd08697b02a53f56f8936497a9f6f6e7a upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

In general calling generic_handle_irq() with interrupts disabled from non
interrupt context is harmless. For some interrupt controllers like the x86
trainwrecks this is outright dangerous as it might corrupt state if an
interrupt affinity change is pending.

Add infrastructure which allows marking interrupts as unsafe and catching
such usage in generic_handle_irq().

Reported-by: sathyanarayanan.kuppuswamy@linux.intel.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Link: https://lkml.kernel.org/r/20200306130623.590923677@linutronix.de

(cherry picked from commit c16816acd08697b02a53f56f8936497a9f6f6e7a)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>

Conflicts:
	include/linux/irq.h
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

90000805

PCI/AER: Save AER Capability for suspend/resume · 920bc476

Committed by Patel, Mayurkumar on Jul 27, 2020

task #29600094

commit af65d1ad416bc6e069ccb9e649faeda224248f96 upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

Previously we did not save and restore the AER configuration on
suspend/resume, so the configuration may be lost after resume.

Save the AER configuration during suspend and restore it during resume.

[bhelgaas: commit log]
Link: https://lore.kernel.org/r/92EBB4272BF81E4089A7126EC1E7B28492C3B007@IRSMSX101.ger.corp.intel.com
Signed-off-by: Mayurkumar Patel <mayurkumar.patel@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>

(cherry picked from commit af65d1ad416bc6e069ccb9e649faeda224248f96)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

920bc476

PCI: Get rid of dev->has_secondary_link flag · 9a20a746

Committed by Mika Westerberg on Jul 27, 2020

task #29600094

commit ca78410403dd64ac0ee0e3cc8646b38335271bfd upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

In some systems, the Device/Port Type in the PCI Express Capabilities
register incorrectly identifies upstream ports as downstream ports.

d0751b98 ("PCI: Add dev->has_secondary_link to track downstream PCIe
links") addressed this by adding pci_dev.has_secondary_link, which is set
for downstream ports. But this is confusing because pci_pcie_type()
sometimes gives the wrong answer, and it's not obvious that we should use
pci_dev.has_secondary_link instead.

Reduce the confusion by correcting the type of the port itself so that
pci_pcie_type() returns the actual type regardless of what the Device/Port
Type register claims it is. Update the users to call pci_pcie_type() and
pcie_downstream_port() accordingly, and remove pci_dev.has_secondary_link
completely.

Link: https://lore.kernel.org/linux-pci/20190703133953.GK128603@google.com/
Suggested-by: Bjorn Helgaas <bhelgaas@google.com>
Link: https://lore.kernel.org/r/20190822085553.62697-2-mika.westerberg@linux.intel.com
Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
(cherry picked from commit ca78410403dd64ac0ee0e3cc8646b38335271bfd)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

9a20a746

PCI: Assign bus numbers present in EA capability for bridges · f3eb83cf

Committed by Subbaraya Sundeep on Jul 27, 2020

task #29600094

commit 2dbce590117981196fe355efc0569bc6f949ae9b upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The "Enhanced Allocation (EA) for Memory and I/O Resources" ECN, approved
23 October 2014, sec 6.9.1.2, specifies a second DW in the capability for
type 1 (bridge) functions to describe fixed secondary and subordinate bus
numbers.  This ECN was included in the PCIe r4.0 spec, but sec 6.9.1.2 was
omitted, presumably by mistake.

Read fixed bus numbers from the EA capability for bridges.
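
As a rough illustration of decoding the second DW described above, here is a minimal Python sketch. The bit layout (secondary bus in bits 7:0, subordinate bus in bits 15:8), the validity check, and the function name are assumptions made for this example; they are not taken from the commit message.

```python
# Assumed layout of the EA second DW for bridge functions (not stated above):
# bits 7:0 = fixed secondary bus number, bits 15:8 = fixed subordinate bus number.
EA_SEC_BUS_MASK = 0x00FF
EA_SUB_BUS_MASK = 0xFF00
EA_SUB_BUS_SHIFT = 8


def ea_fixed_busnrs(dw):
    """Return (secondary, subordinate) decoded from the DW, or None if unusable."""
    sec = dw & EA_SEC_BUS_MASK
    sub = (dw & EA_SUB_BUS_MASK) >> EA_SUB_BUS_SHIFT
    # Treat a zero secondary bus or an inverted range as "no fixed bus numbers".
    if sec == 0 or sub < sec:
        return None
    return sec, sub
```

Returning a "no result" value loosely mirrors the return value that the [bhelgaas] note below says was added to pci_ea_fixed_busnrs(), so that a caller can fall back to normal bus number assignment.
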
Signed-off-by: Subbaraya Sundeep <sbhatta@marvell.com>
[bhelgaas: add pci_ea_fixed_busnrs() return value]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

(cherry picked from commit 2dbce590117981196fe355efc0569bc6f949ae9b)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

f3eb83cf

PCI/ATS: Add pci_ats_page_aligned() interface · 1ef6e384

Committed by Kuppuswamy Sathyanarayanan on Jul 27, 2020

task #29600094

commit 8c938ddc6df3bbe72809db1be6c9f3af83f5d7a9 upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

Return the Page Aligned Request bit in the ATS Capability Register.

As per PCIe spec r4.0, sec 10.5.1.2, if the Page Aligned Request bit is
set, it indicates the Untranslated Addresses generated by the device are
always aligned to a 4096 byte boundary.

An IOMMU that can only translate page-aligned addresses can only be used
with devices that always produce aligned Untranslated Addresses. This
interface will be used by drivers for such IOMMUs to determine whether
devices can use the ATS service.
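
The check described above amounts to testing a single bit. The sketch below models it in Python; the bit position 0x0020 and both helper names are assumptions for illustration only.

```python
# Assumed position of the Page Aligned Request bit in the ATS Capability Register.
ATS_CAP_PAGE_ALIGNED = 0x0020


def ats_page_aligned(ats_cap_reg):
    """True if the device reports that its Untranslated Addresses are page aligned."""
    return bool(ats_cap_reg & ATS_CAP_PAGE_ALIGNED)


def is_4k_aligned(addr):
    """True if addr lies on a 4096 byte boundary."""
    return addr % 4096 == 0
```

An IOMMU driver of the kind described above would refuse to enable ATS for a device when the first check is false.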

Cc: Ashok Raj <ashok.raj@intel.com>
Cc: Jacob Pan <jacob.jun.pan@linux.intel.com>
Cc: Keith Busch <keith.busch@intel.com>
Suggested-by: Ashok Raj <ashok.raj@intel.com>
Signed-off-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Signed-off-by: Joerg Roedel <jroedel@suse.de>
(cherry picked from commit 8c938ddc6df3bbe72809db1be6c9f3af83f5d7a9)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

1ef6e384

PCI: Make link active reporting detection generic · 02160c1f

Committed by Keith Busch on Jul 27, 2020

task #29600094

commit f0157160b359b1d263ee9d4e0a435a7ad85bbcea upstream.
Backport summary: for 4.19 kernel ICX PCIe Gen4 support.

The spec has timing requirements when waiting for a link to become active
after a conventional reset.  Implement those hard delays when waiting for
an active link so pciehp and dpc drivers don't need to duplicate this.

For devices that don't support data link layer active reporting, wait the
fixed time recommended by the PCIe spec.
Signed-off-by: Keith Busch <keith.busch@intel.com>
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Sinan Kaya <okaya@kernel.org>
(cherry picked from commit f0157160b359b1d263ee9d4e0a435a7ad85bbcea)
Signed-off-by: Ethan Zhao <haifeng.zhao@intel.com>
Signed-off-by: Artie Ding <artie.ding@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>

02160c1f

Intel: cpuidle: Allow idle states to be disabled by default · 07049896

Committed by Rafael J. Wysocki on Jul 16, 2020

task #29239886

commit 75a80267410e38ab76c4ceb39753f96d72113781 upstream

In certain situations it may be useful to prevent some idle states
from being used by default while allowing user space to enable them
later on.

For this purpose, introduce a new state flag, CPUIDLE_FLAG_OFF, to
mark idle states that should be disabled by default, make the core
set CPUIDLE_STATE_DISABLED_BY_USER for those states at the
initialization time and add a new state attribute in sysfs,
"default_status", to inform user space of the initial status of
the given idle state ("disabled" if CPUIDLE_FLAG_OFF is set for it,
"enabled" otherwise).
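
The initialization rule described above can be modelled in a few lines of Python. The numeric bit values and the function name below are hypothetical; only the flag names and the "default_status" strings come from the text.

```python
# Hypothetical bit values; only the names appear in the commit message.
CPUIDLE_FLAG_OFF = 1 << 4
CPUIDLE_STATE_DISABLED_BY_USER = 1 << 0


def init_state(flags):
    """Return (initial disable mask, default_status string) for one idle state."""
    off_by_default = bool(flags & CPUIDLE_FLAG_OFF)
    disable_mask = CPUIDLE_STATE_DISABLED_BY_USER if off_by_default else 0
    default_status = "disabled" if off_by_default else "enabled"
    return disable_mask, default_status
```

Because the state starts out as "disabled by user" rather than permanently disabled, user space can still enable it later, which is the point of the change.
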
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: yjia <yingbao.jia@intel.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>

07049896

Intel: cpuidle: use BIT() for idle state flags and remove CPUIDLE_DRIVER_FLAGS_MASK · abb7d817

Committed by Yangtao Li on Jul 16, 2020

task #29239886

commit 44021606298870e4adc641ef3927e7bb47ca8236 upstream

Use BIT() macro to do a small tidy-up.

CPUIDLE_DRIVER_FLAGS_MASK is not used, so remove it.
Signed-off-by: Yangtao Li <tiny.windzz@gmail.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: yjia <yingbao.jia@intel.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>

abb7d817

Intel: ACPI: processor: Export acpi_processor_evaluate_cst() · c8af7266

Committed by Rafael J. Wysocki on Jul 16, 2020

task #29239886

commit 77fb4e0a559a960eb36d0b2c50c781c5492577eb upstream

The intel_idle driver will be modified to use ACPI _CST subsequently
and it will need to call acpi_processor_evaluate_cst(), so move that
function to acpi_processor.c so that it is always present (which is
required by intel_idle) and export it to modules to allow the ACPI
processor driver (which is modular) to call it.

No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: yjia <yingbao.jia@intel.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>

c8af7266

Intel: ACPI: processor: Export function to claim _CST control · 19805ef7

Committed by Rafael J. Wysocki on Jul 16, 2020

task #29239886

commit bc94638886ab21f8247d3f7f39573d3feb7d8284 upstream

The intel_idle driver will be modified to use ACPI _CST subsequently
and it will need to notify the platform firmware of that if
acpi_gbl_FADT.cst_control is set, so add a routine for this purpose,
acpi_processor_claim_cst_control(), to acpi_processor.c (so that it
is always present which is required by intel_idle) and export it
to allow the ACPI processor driver (which is modular) to call it.

No intentional functional impact.
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: yjia <yingbao.jia@intel.com>
Signed-off-by: Erwei Deng <erwei@linux.alibaba.com>
Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>

19805ef7

virtio-fs: add virtiofs filesystem · 917f6dfb

Committed by Stefan Hajnoczi on Jun 12, 2018

task #28910367
commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a upstream

Add a basic file system module for virtio-fs.  This does not yet contain
shared data support between host and guest or metadata coherency speedups.
However it is already significantly faster than virtio-9p.

Design Overview
===============

With the goal of designing something with better performance and local file
system semantics, a bunch of ideas were proposed.

 - Use fuse protocol (instead of 9p) for communication between guest and
   host.  Guest kernel will be fuse client and a fuse server will run on
   host to serve the requests.

 - For data access inside guest, mmap portion of file in QEMU address space
   and guest accesses this memory using dax.  That way guest page cache is
   bypassed and there is only one copy of data (on host).  This will also
   enable mmap(MAP_SHARED) between guests.

 - For metadata coherency, there is a shared memory region which contains
   version number associated with metadata and any guest changing metadata
   updates version number and other guests refresh metadata on next access.
   This is yet to be implemented.

How virtio-fs differs from existing approaches
==============================================

The unique idea behind virtio-fs is to take advantage of the co-location of
the virtual machine and hypervisor to avoid communication (vmexits).

DAX allows file contents to be accessed without communication with the
hypervisor.  The shared memory region for metadata avoids communication in
the common case where metadata is unchanged.

By replacing expensive communication with cheaper shared memory accesses,
we expect to achieve better performance than approaches based on network
file system protocols.  In addition, this also makes it easier to achieve
local file system semantics (coherency).

These techniques are not applicable to network file system protocols since
the communications channel is bypassed by taking advantage of shared memory
on a local machine.  This is why we decided to build virtio-fs rather than
focus on 9P or NFS.

Caching Modes
=============

Like virtio-9p, different caching modes are supported which determine the
coherency level as well.  The “cache=FOO” and “writeback” options control
the level of coherence between the guest and host filesystems.

 - cache=none
   metadata, data and pathname lookup are not cached in guest.  They are
   always fetched from host and any changes are immediately pushed to host.

 - cache=always
   metadata, data and pathname lookup are cached in guest and never expire.

 - cache=auto
   metadata and pathname lookup cache expires after a configured amount of
   time (default is 1 second).  Data is cached while the file is open
   (close to open consistency).

 - writeback/no_writeback
   These options control the writeback strategy.  If writeback is disabled,
   then normal writes will immediately be synchronized with the host fs.
   If writeback is enabled, then writes may be cached in the guest until
   the file is closed or an fsync(2) performed.  This option has no effect
   on mmap-ed writes or writes going through the DAX mechanism.
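
The cache modes listed above differ mainly in how long cached metadata and lookups stay valid. A minimal Python model of that rule follows; the function name and parameters are hypothetical and this is not how the kernel implements it.

```python
def cached_metadata_valid(mode, age_seconds, timeout_seconds=1.0):
    """Model of whether a cached metadata or lookup entry may still be used.

    mode is one of "none", "always", "auto"; age_seconds is the time since
    the entry was fetched from the host.
    """
    if mode == "none":
        return False  # always refetch from the host
    if mode == "always":
        return True  # entries never expire
    if mode == "auto":
        return age_seconds < timeout_seconds  # expires after the timeout
    raise ValueError("unknown cache mode: " + mode)
```
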
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>

(cherry picked from commit a62a8ef9d97da23762a588592c8b8eb50a8deb6a)
[Liubo: given that 4.19 lacks the support of fs_context to parse mount
option, here I just change it back to the 4.19 way, so we still use -o
tag=myfs-1 to get virtiofs mount.]
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

917f6dfb

fuse: reserve values for mapping protocol · 30efb5f4

Committed by Dr. David Alan Gilbert on Aug 02, 2019

task #28910367
commit c4bb667eaf520f21b3a3db0489682becc9c49bcc upstream

SETUPMAPPING is a command for use with 'virtiofsd', a fuse-over-virtio
implementation; it may find use in other fuse implementations as well in
which the kernel does not have access to the address space of the daemon
directly.

A SETUPMAPPING operation causes a section of a file to be mapped into a
memory window visible to the kernel.  The offsets in the file and the
window are defined by the kernel performing the operation.

The daemon may reject the request, for reasons including permissions and
limited resources.

When a request perfectly overlaps a previous mapping, the previous mapping
is replaced.  When a mapping partially overlaps a previous mapping, the
previous mapping is split into one or two smaller mappings.
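
The replace-or-split rule above can be sketched as interval bookkeeping. The Python model below tracks only window ranges as half-open (start, end) pairs; the function name and data representation are hypothetical and say nothing about how a real daemon stores its mappings.

```python
def setup_mapping(mappings, start, length):
    """Add the window range [start, start + length) to a list of mappings.

    mappings is a list of non-overlapping half-open (start, end) ranges.
    A fully covered old mapping is replaced; a partially covered one is
    trimmed into one or two smaller remainders.
    """
    end = start + length
    result = []
    for m_start, m_end in mappings:
        if m_end <= start or m_start >= end:
            result.append((m_start, m_end))  # untouched, no overlap
            continue
        if m_start < start:
            result.append((m_start, start))  # remainder on the left
        if m_end > end:
            result.append((end, m_end))  # remainder on the right
    result.append((start, end))
    return sorted(result)
```

For example, mapping 20 bytes at offset 40 into an existing 0..100 mapping leaves two remainders, 0..40 and 60..100, around the new 40..60 range.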

REMOVEMAPPING is the complement to SETUPMAPPING; it unmaps a range of
mapped files from the window visible to the kernel.

The map_alignment field communicates the alignment constraint for
FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING and allows the daemon to constrain the
addresses and file offsets chosen by the kernel.
Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

30efb5f4

fuse: reserve byteswapped init opcodes · 4b61cdee

Committed by Michael S. Tsirkin on Sep 04, 2019

task #28910367
commit 501ae8ecae2ba5122774dee4445003505a7fd01b upstream

virtio fs tunnels fuse over a virtio channel.  One issue is that the two
sides might use different endianness.  To detect this, the host side looks
at the opcode value in the FUSE_INIT command.  This works fine at the moment
but might fail if a future version of fuse uses such an opcode for
initialization.  Let's reserve this opcode so we remember and don't do
this.

Same for CUSE_INIT.
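
To see which values end up reserved, one can byte-swap the init opcodes as 32-bit integers. The Python sketch below does this; the opcode values 26 and 4096 are assumptions taken from memory of the FUSE protocol header, not from this commit message.

```python
# Assumed opcode values; they are not stated in the text above.
FUSE_INIT = 26
CUSE_INIT = 4096


def bswap32(value):
    """Reverse the byte order of a 32-bit unsigned integer."""
    return int.from_bytes(value.to_bytes(4, "little"), "big")


FUSE_INIT_BSWAPPED = bswap32(FUSE_INIT)
CUSE_INIT_BSWAPPED = bswap32(CUSE_INIT)
```

A host that reads the byte-swapped value in the first request can conclude that the guest uses the opposite byte order, provided no real opcode ever takes that value.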
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

4b61cdee

virtio_mem: convert device block size into 64bit · 1e5d7fb5

Committed by Michael S. Tsirkin on Jun 08, 2020

task #29077503
commit 544fc7dbbf920a3e64d109c416ee229e8e1763c5 upstream
can overflow. Rather than try to catch all instances of that,
let's tweak block size to 64 bit.

It ripples through UAPI which is an ABI change, but it's not too late to
make it, and it will allow supporting >4Gbyte blocks, which might
become necessary down the road.
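
The kind of overflow at issue can be reproduced by emulating fixed-width arithmetic. The Python sketch below assumes a 1 GiB block size and hypothetical helper names; Python integers do not overflow on their own, so the masks stand in for 32-bit and 64-bit C types.

```python
U32_MASK = 0xFFFFFFFF
U64_MASK = 0xFFFFFFFFFFFFFFFF
GIB = 1 << 30


def offset_u32(block_index, block_size):
    """Byte offset of a block when the multiplication is done in 32 bits."""
    return (block_index * block_size) & U32_MASK


def offset_u64(block_index, block_size):
    """Byte offset of a block when the multiplication is done in 64 bits."""
    return (block_index * block_size) & U64_MASK
```

With a 1 GiB block size, block index 4 already wraps to offset 0 in 32-bit math, while the 64-bit result is the expected 4 GiB.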

Fixes: 5f1f79bbc9e26 ("virtio-mem: Paravirtualized memory hotplug")
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

1e5d7fb5

virtio-mem: Don't rely on implicit compiler padding for requests · 8e6d8cc8

Committed by David Hildenbrand on May 15, 2020

task #29077503
commit fce8afd76e3a4d8c59c92f84f8027569fd7031d0 upstream
The compiler will add padding after the last member, make that explicit.
The size of a request is always 24 bytes. The size of a response always
10 bytes. Add compile-time checks.
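
The effect of explicit padding on the wire size can be checked with Python's struct module, using the "<" prefix to disable implicit alignment. The field layouts below are hypothetical, chosen only so that they add up to the 24 and 10 bytes stated above.

```python
import struct

# Hypothetical little-endian layouts with every padding field spelled out.
# Request: u16 type, u16 padding[3], u64 addr, u16 nb_blocks, u16 padding[3]
REQUEST_FORMAT = "<H3HQH3H"
# Response: u16 type, u16 padding[3], u16 state
RESPONSE_FORMAT = "<H3HH"

REQUEST_SIZE = struct.calcsize(REQUEST_FORMAT)
RESPONSE_SIZE = struct.calcsize(RESPONSE_FORMAT)


def pack_request(req_type, addr, nb_blocks):
    """Pack a request with all padding fields set to zero."""
    return struct.pack(REQUEST_FORMAT, req_type, 0, 0, 0, addr, nb_blocks, 0, 0, 0)
```

Since the trailing padding is part of the format, the size no longer depends on what a particular compiler would add after the last member.
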

Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: teawater <teawaterz@linux.alibaba.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200515101402.16597-1-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit fce8afd76e3a4d8c59c92f84f8027569fd7031d0)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

8e6d8cc8

mm/memory_hotplug: Introduce offline_and_remove_memory() · c062d118

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit 08b3acd7a68fc17902e1cb6b146389322840deab upstream
virtio-mem wants to offline and remove a memory block once it unplugged
all subblocks (e.g., using alloc_contig_range()). Let's provide
an interface to do that from a driver. virtio-mem already supports to
offline partially unplugged memory blocks. Offlining a fully unplugged
memory block will not require to migrate any pages. All unplugged
subblocks are PageOffline() and have a reference count of 0 - so
offlining code will simply skip them.

All we need is an interface to offline and remove the memory from kernel
module context, where we don't have access to the memory block devices
(esp. find_memory_block() and device_offline()) and the device hotplug
lock.

To keep things simple, allow to only work on a single memory block.
Acked-by: Michal Hocko <mhocko@suse.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Acked-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Qian Cai <cai@lca.pw>
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-9-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 08b3acd7a68fc17902e1cb6b146389322840deab)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

c062d118

mm: Allow to offline unmovable PageOffline() pages via MEM_GOING_OFFLINE · 20500825

Committed by David Hildenbrand on Aug 27, 2019

task #29077503
commit aa218795cb5fd583c94fc838dc76b7379dc4976a upstream
virtio-mem wants to allow to offline memory blocks of which some parts
were unplugged (allocated via alloc_contig_range()), especially, to later
offline and remove completely unplugged memory blocks. The important part
is that PageOffline() has to remain set until the section is offline, so
these pages will never get accessed (e.g., when dumping). The pages should
not be handed back to the buddy (which would require clearing PageOffline()
and result in issues if offlining fails and the pages are suddenly in the
buddy).

Let's allow to do that by allowing to isolate any PageOffline() page
when offlining. This way, we can reach the memory hotplug notifier
MEM_GOING_OFFLINE, where the driver can signal that he is fine with
offlining this page by dropping its reference count. PageOffline() pages
with a reference count of 0 can then be skipped when offlining the
pages (like if they were free, however they are not in the buddy).

Anybody who uses PageOffline() pages and does not agree to offline them
(e.g., Hyper-V balloon, XEN balloon, VMWare balloon for 2MB pages) will not
decrement the reference count and make offlining fail when trying to
migrate such an unmovable page. So there should be no observable change.
Same applies to balloon compaction users (movable PageOffline() pages), the
pages will simply be migrated.

Note 1: If offlining fails, a driver has to increment the reference
	count again in MEM_CANCEL_OFFLINE.

Note 2: A driver that makes use of this has to be aware that re-onlining
	the memory block has to be handled by hooking into onlining code
	(online_page_callback_t), resetting the page PageOffline() and
	not giving them to the buddy.
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Anthony Yznaga <anthony.yznaga@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pingfan Liu <kernelfans@gmail.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
(cherry picked from commit aa218795cb5fd583c94fc838dc76b7379dc4976a)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts: keep non-related code old, and remove offlined_pages++
	mm/memory_hotplug.c
	mm/page_alloc.c
	mm/page_isolation.c
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

20500825

virtio-mem: Allow to specify an ACPI PXM as nid · c9f36272

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit f2af6d3978d74a7891d0f428537b4494498202cb upstream
virtio-mem device (and, therefore, its memory) belongs. Add a new
virtio-mem feature flag and export pxm_to_node, so it can be used in kernel
module context.

Acked-by: Michal Hocko <mhocko@suse.com> # for the export
Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> # for the export
Acked-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-acpi@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-4-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit f2af6d3978d74a7891d0f428537b4494498202cb)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	move drivers/acpi/numa/srat.c modification into
	drivers/acpi/numa.c
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

c9f36272

virtio-mem: Paravirtualized memory hotplug · d3997ceb

Committed by David Hildenbrand on May 07, 2020

task #29077503
commit 5f1f79bbc9e26fa9412fa9522f957bb8f030c442 upstream
for adding/removing memory from that memory region on request.

When the device driver starts up, the requested amount of memory is
queried and then plugged to Linux. On request, further memory can be
plugged or unplugged. This patch only implements the plugging part.

On x86-64, memory can currently be plugged in 4MB ("subblock") granularity.
When required, a new memory block will be added (e.g., usually 128MB on
x86-64) in order to plug more subblocks. Only x86-64 was tested for now.
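
The granularity figures above imply 32 subblocks per 128MB memory block. A minimal Python sketch of per-block subblock tracking with an integer bitmap follows; all names are hypothetical and the model ignores everything except the arithmetic.

```python
MIB = 1 << 20
SUBBLOCK_SIZE = 4 * MIB  # plug granularity from the text
MEMORY_BLOCK_SIZE = 128 * MIB  # usual x86-64 memory block size from the text
SUBBLOCKS_PER_BLOCK = MEMORY_BLOCK_SIZE // SUBBLOCK_SIZE


def plug(bitmap, subblock):
    """Mark one subblock of a memory block as plugged."""
    return bitmap | (1 << subblock)


def unplug(bitmap, subblock):
    """Mark one subblock of a memory block as unplugged."""
    return bitmap & ~(1 << subblock)


def memory_blocks_needed(requested_bytes):
    """Number of memory blocks required to hold the requested amount."""
    return -(-requested_bytes // MEMORY_BLOCK_SIZE)  # ceiling division
```

A block whose bitmap is back to zero has no plugged subblocks left, which is the condition under which a whole memory block becomes a candidate for removal.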

The online_page callback is used to keep unplugged subblocks offline
when onlining memory - similar to the Hyper-V balloon driver. Unplugged
pages are marked PG_offline, to tell dump tools (e.g., makedumpfile) to
skip them.

User space is usually responsible for onlining the added memory. The
memory hotplug notifier is used to synchronize virtio-mem activity
against memory onlining/offlining.

Each virtio-mem device can belong to a NUMA node, which allows us to
easily add/remove small chunks of memory to/from a specific NUMA node by
using multiple virtio-mem devices. Something that works even when the
guest has no idea about the NUMA topology.

One way to view virtio-mem is as a "resizable DIMM" or a DIMM with many
"sub-DIMMS".

This patch directly introduces the basic infrastructure to implement memory
unplug. Especially the memory block states and subblock bitmaps will be
heavily used there.

Notes:
- In case memory is to be onlined by user space, we limit the amount of
  offline memory blocks, to not run out of memory. This is esp. an
  issue if memory is added faster than it is getting onlined.
- Suspend/Hibernate is not supported due to the way virtio-mem devices
  behave. Limited support might be possible in the future.
- Reloading the device driver is not supported.
Reviewed-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Tested-by: Pankaj Gupta <pankaj.gupta.linux@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Igor Mammedov <imammedo@redhat.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Len Brown <lenb@kernel.org>
Cc: linux-acpi@vger.kernel.org
Signed-off-by: David Hildenbrand <david@redhat.com>
Link: https://lore.kernel.org/r/20200507140139.17083-2-david@redhat.com
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
(cherry picked from commit 5f1f79bbc9e26fa9412fa9522f957bb8f030c442)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	drivers/virtio/Makefile
	include/uapi/linux/virtio_ids.h
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>

d3997ceb

mm/memory_hotplug: export generic_online_page() · 0c6a9eb5

Committed by David Hildenbrand on Nov 30, 2019

task #29077503
commit 18db149120c106cf2b1a2595f82f3229f9d223b8 upstream

Let's replace the __online_page...() functions by generic_online_page().
Hyper-V only wants to delay the actual onlining of un-backed pages, so
we can simply re-use the generic function.

This patch (of 3):

Let's expose generic_online_page() so online_page_callback users can
simply fall back to the generic implementation when actually deciding to
online the pages.

Link: http://lkml.kernel.org/r/20190909114830.662-2-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 18db149120c106cf2b1a2595f82f3229f9d223b8)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

0c6a9eb5

mm/page_alloc.c: memory hotplug: free pages as higher order · bd6aced3

Committed by Arun KS on Mar 05, 2019

task #29077503
commit a9cd410a3d296846a8125aa43d97a573a354c472 upstream
When freeing pages are done with higher order, time spent on coalescing
pages by buddy allocator can be reduced.  With section size of 256MB,
hot add latency of a single section shows improvement from 50-60 ms to
less than 1 ms, hence improving the hot add latency by 60 times.  Modify
external providers of online callback to align with the change.
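
A back-of-the-envelope count shows why freeing at higher order helps. The Python sketch below assumes 4 KiB pages and a largest free order of 10 (4 MiB chunks); both values and the function name are assumptions, not figures from the commit message.

```python
PAGE_SIZE = 4096  # assumed base page size
MAX_FREE_ORDER = 10  # assumed largest order: 2**10 pages = 4 MiB
SECTION_SIZE = 256 << 20  # 256MB section from the text


def free_calls(section_bytes, order):
    """Number of free operations needed to hand a section to the allocator."""
    return section_bytes // (PAGE_SIZE << order)
```

Under these assumptions a 256MB section takes 65536 order-0 frees but only 64 frees at order 10, and the buddy allocator has far less coalescing work to do.
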

[arunks@codeaurora.org: v11]
  Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
[akpm@linux-foundation.org: remove unused local, per Arun]
[akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
[akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
[arunks@codeaurora.org: v8]
  Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
[arunks@codeaurora.org: v9]
  Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
Signed-off-by: Arun KS <arunks@codeaurora.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Michal Hocko <mhocko@suse.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: K. Y. Srinivasan <kys@microsoft.com>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Mathieu Malaterre <malat@debian.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Souptick Joarder <jrdr.linux@gmail.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Aaron Lu <aaron.lu@intel.com>
Cc: Srivatsa Vaddagiri <vatsa@codeaurora.org>
Cc: Vinayak Menon <vinmenon@codeaurora.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit a9cd410a3d296846a8125aa43d97a573a354c472)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	replace totalram_pages_add as old way.

bd6aced3

mm/page_isolation.c: convert SKIP_HWPOISON to MEMORY_OFFLINE · 80aa4777

Committed by Alex Shi on May 11, 2020

task #29077503
commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d upstream

We have two types of users of page isolation:

 1. Memory offlining:  Offline memory so it can be unplugged. Memory
                       won't be touched.

 2. Memory allocation: Allocate memory (e.g., alloc_contig_range()) to
                       become the owner of the memory and make use of
                       it.

For example, in case we want to offline memory, we can ignore (skip
over) PageHWPoison() pages, as the memory won't get used.  We can allow
such memory to be offlined.  In contrast, we don't want to allow such
memory to be allocated.

Let's generalize the approach so we can special case other types of
pages we want to skip over in case we offline memory.  While at it, also
pass the same flags to test_pages_isolated().
Original-by: David Hildenbrand <david@redhat.com>
Link: http://lkml.kernel.org/r/20191021172353.3056-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Suggested-by: Michal Hocko <mhocko@suse.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Pingfan Liu <kernelfans@gmail.com>
Cc: Qian Cai <cai@lca.pw>
Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
Cc: Wei Yang <richard.weiyang@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 756d25be457fc5497da0ceee0f3d0c9eb4d8535d)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	reenable patch context on all files.

80aa4777

mm: only report isolation failures when offlining memory · 5316eb6e

Committed by Michal Hocko on Dec 28, 2018

task #29077503
commit d381c54760dcfad23743da40516e7e003d73952a upstream

Heiko has complained that his log is swamped by warnings from
has_unmovable_pages:

[   20.536664] page dumped because: has_unmovable_pages
[   20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
[   20.536794] flags: 0x3fffe0000010200(slab|head)
[   20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
[   20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
[   20.536797] page dumped because: has_unmovable_pages
[   20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
[   20.536815] flags: 0x7fffe0000000000()
[   20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
[   20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000

which are not triggered by memory hotplug but rather by the CMA allocator.
The original idea behind dumping the page state for all call paths was
that these messages would be helpful when debugging failures.  From the
above it seems that this is not the case for the CMA path because we are
lacking much more context.  E.g. the second reported page might be a CMA
allocated page.  It is still interesting to see a slab page in the CMA
area but it is hard to tell whether this is a bug from the above output
alone.

Address this issue by dumping the page state only on request.  Both
start_isolate_page_range and has_unmovable_pages already have an argument
to ignore hwpoison pages so make this argument more generic and turn it
into flags and allow callers to combine non-default modes into a mask.
While we are at it, reporting the failure from the has_unmovable_pages
call in is_pageblock_removable_nolock (sysfs removable file) is
questionable, so drop it from there as well.
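
The flags-as-mask idea can be sketched in a few lines. The flag names below follow the upstream page-isolation header as best I recall it, the values are illustrative, and the page structure and dump counter are hypothetical stand-ins for the real page state and dump_page().

```c
#include <stdbool.h>

/* Names as recalled from upstream; the numeric values are illustrative. */
#define SKIP_HWPOISON	0x1
#define REPORT_FAILURE	0x2

/* Hypothetical stand-in for the relevant bits of struct page. */
struct sketch_page {
	bool hwpoison;
	bool movable;
};

/* Counts how often a page would have been dumped to the log. */
static int sketch_dump_count;

/* Returns true if the page blocks isolation; dumps only when asked to. */
static bool sketch_has_unmovable_page(const struct sketch_page *page,
				      int flags)
{
	/* Offlining callers may skip over hwpoison pages. */
	if (page->hwpoison && (flags & SKIP_HWPOISON))
		return false;
	if (page->movable)
		return false;
	/* Only callers that opted in get the page state dumped. */
	if (flags & REPORT_FAILURE)
		sketch_dump_count++;
	return true;
}
```

Under this sketch, a memory-offlining caller would pass SKIP_HWPOISON | REPORT_FAILURE, while a CMA-style caller would pass 0 and stay quiet on failure.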

Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
Signed-off-by: Michal Hocko <mhocko@suse.com>
Reported-by: Heiko Carstens <heiko.carstens@de.ibm.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Anshuman Khandual <anshuman.khandual@arm.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d381c54760dcfad23743da40516e7e003d73952a)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>

Conflicts:
	mm/page_alloc.c

5316eb6e

mm: convert PG_balloon to PG_offline · 59df23d6

Committed by David Hildenbrand on Mar 05, 2019

task #29077503
commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream

PG_balloon was introduced to implement page migration/compaction for
pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
page is part of virtio-balloon and therefore logically offline.

We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile.  But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes later
on.

Rename PG_balloon to PG_offline.  This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g.  a hypervisor would have to allocate backing storage in order for
the guest to dump an unused page).  We can then e.g.  exclude such pages
from dumps.

We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same).  In following patches, we
will make use of this bit also in other balloon drivers.  While at it,
document PGTABLE.

[akpm@linux-foundation.org: fix comment text, per David]
Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.com
Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Konstantin Khlebnikov <koct9i@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Acked-by: Pankaj gupta <pagupta@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Freche <jfreche@vmware.com>
Cc: Kairui Song <kasong@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

(cherry picked from commit ca215086b14b89a0e70fc211314944aa6ce50020)
Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>

59df23d6

block: add dma_map_bvec helper · 3a9cc9e5

Committed by Christoph Hellwig on Mar 03, 2019

fix #29327388

commit 3ab3a0313cb8c50391d74e40fd46a3408d8e4de9 upstream

Provide a nice little shortcut for mapping a single bvec.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

3a9cc9e5

block: add a rq_dma_dir helper · 1c114fa2

Committed by Christoph Hellwig on Mar 03, 2019

fix #29327388

commit 9d9de535f385a8b3ba0e88ca0abf386c5704bbfc upstream

In a lot of places we want to know the DMA direction for a given
struct request.  Add a little helper to make it a little easier.
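
A user-space sketch of what such a helper boils down to: writes move data to the device, everything else is treated as data coming from the device. The enum and struct names are hypothetical stand-ins for the kernel's request and DMA direction types, not the definitions added by this commit.

```c
/* Hypothetical stand-ins for the kernel's DMA direction constants. */
enum sketch_dma_dir {
	SKETCH_DMA_TO_DEVICE,
	SKETCH_DMA_FROM_DEVICE,
};

/* Hypothetical stand-ins for request operations. */
enum sketch_op {
	SKETCH_OP_READ = 0,
	SKETCH_OP_WRITE = 1,
};

struct sketch_request {
	enum sketch_op op;
};

/* Derive the DMA direction from the request's operation. */
static enum sketch_dma_dir sketch_rq_dma_dir(const struct sketch_request *rq)
{
	return rq->op == SKETCH_OP_WRITE ? SKETCH_DMA_TO_DEVICE
					 : SKETCH_DMA_FROM_DEVICE;
}
```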
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

1c114fa2

block: add a rq_integrity_vec helper · 152dc603

Committed by Christoph Hellwig on Mar 03, 2019

fix #29327388

commit 2a876f5e25e8ec9fa5777d36e5695ee33dd63f6f upstream

This provides a nice little shortcut to get the integrity data for
drivers like NVMe that only support a single integrity segment.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

152dc603

block: add a req_bvec helper · b381a9ce

Committed by Christoph Hellwig on Mar 03, 2019

fix #29327388

commit 3aef3cae4342c1d8137a1c0782cbb66f1be3943c upstream

Return the currently active bvec segment, potentially spanning multiple
pages.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b381a9ce

TencentOS-kernel: ipvs: avoid drop first packet by reusing conntrack · 173a8024

Committed by YangYuxi on Jun 23, 2020

fix #29256237

commit a01a9445c00eca3e37523eb6b0d87f494eceeb4b TencentOS-kernel

Since commit f719e375 ("ipvs: drop first packet to
redirect conntrack"), when a new TCP connection meets
the conditions that require rescheduling, the first SYN packet
is dropped, which causes one second of latency for the new
connection. More discussion of this problem is easy to
find online, for example:

1) One second connection delay in masque
https://marc.info/?t=151683118100004&r=1&w=2

2) IPVS low throughput #70747
https://github.com/kubernetes/kubernetes/issues/70747

3) Apache Bench can fill up ipvs service proxy in seconds #544
https://github.com/cloudnativelabs/kube-router/issues/544

4) Additional 1s latency in `host -> service IP -> pod`
https://github.com/kubernetes/kubernetes/issues/90854

5) kube-proxy ipvs conn_reuse_mode setting causes errors
with high load from single client
https://github.com/kubernetes/kubernetes/issues/81775

The root cause is that when the old session expires, the
conntrack related to the session is dropped by
ip_vs_conn_drop_conntrack. The code is as follows:
```
static void ip_vs_conn_expire(struct timer_list *t)
{
...

     if ((cp->flags & IP_VS_CONN_F_NFCT) &&
         !(cp->flags & IP_VS_CONN_F_ONE_PACKET)) {
             /* Do not access conntracks during subsys cleanup
              * because nf_conntrack_find_get can not be used after
              * conntrack cleanup for the net.
              */
             smp_rmb();
             if (ipvs->enable)
                     ip_vs_conn_drop_conntrack(cp);
     }
...
}
```
As the code shows, ip_vs_conn_drop_conntrack is called only when
the condition (cp->flags & IP_VS_CONN_F_NFCT) is true.

So we optimize this with the following steps (administrators
can enable the optimization by setting
net.ipv4.vs.conn_reuse_old_conntrack=1):
1) clear the IP_VS_CONN_F_NFCT flag (this is safe because
   no packets will use the old session)
2) call ip_vs_conn_expire_now to release the old session,
   so the related conntrack is not dropped
3) ipvs then no longer needs to drop the first SYN packet; it
   just passes the SYN packet on to the next stage and
   creates a new ipvs session, which is associated
   with the old conntrack (reopened by conntrack as a new
   one). Everything after that proceeds as if the old
   session had never existed.
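
The three steps can be sketched in user-space C as follows. The structure and helper are hypothetical stand-ins for the kernel's struct ip_vs_conn and ip_vs_conn_expire_now(), the flag value is illustrative, and the check on the control fields is my reading of the commit's remark about passive FTP rather than the patch's actual code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stands in for IP_VS_CONN_F_NFCT; the value is illustrative. */
#define SKETCH_CONN_F_NFCT 0x10000u

/* Hypothetical stand-in for the relevant fields of struct ip_vs_conn. */
struct sketch_conn {
	unsigned int flags;
	int n_control;			/* models atomic_read(&cp->n_control) */
	struct sketch_conn *control;	/* models cp->control */
	bool expired;
};

/* Returns true when the first SYN may pass instead of being dropped. */
static bool sketch_reuse_old_conntrack(struct sketch_conn *cp,
				       bool sysctl_enabled)
{
	if (!sysctl_enabled)
		return false;
	/* Connections with a control relation (e.g. passive FTP) are left alone. */
	if (cp->n_control || cp->control)
		return false;
	/* Step 1: clear the flag so expiry will not drop the conntrack. */
	cp->flags &= ~SKETCH_CONN_F_NFCT;
	/* Step 2: models ip_vs_conn_expire_now() releasing the old session. */
	cp->expired = true;
	/* Step 3: the caller lets the SYN continue and creates a new session. */
	return true;
}
```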

The above processing has no problems except for passive FTP.
For the passive FTP case, ipvs can detect it from the
conditions (atomic_read(&cp->n_control)) and (cp->control).
So, for the other cases (i.e. not FTP), ipvs should give users
the right to choose: they can select the higher-performance processing
logic by setting net.ipv4.vs.conn_reuse_old_conntrack=1. This is necessary
because most business scenarios (such as Kubernetes) are very sensitive
to TCP short connection latency.

This patch has been verified on thousands of our Kubernetes
node servers at Tencent Inc.
Signed-off-by: YangYuxi <yx.atom1@gmail.com>
[Tony: add the missing sysctl knob and disable it by default]
Signed-off-by: Tony Lu <tonylu@linux.alibaba.com>
Acked-by: Dust Li <dust.li@linux.alibaba.com>

173a8024

block: move queues types to the block layer · cd87b6e5

Committed by Christoph Hellwig on Dec 02, 2018

to #28991349

commit e20ba6e1da029136ded295f33076483d65ddf50a upstream

Having another indirect call in the fast path doesn't really help
in our post-spectre world.  Also having too many queue types is just
going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the
first index now is the default queue for everything not explicitly
marked, the optional ones are read and poll queues.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

cd87b6e5

nvme: add separate poll queue map · b244a53e

Committed by Jens Axboe on Nov 05, 2018

to #28991349

commit 4b04cc6a8f86c4842314def22332de1f15de8523 upstream

Adds support for defining a variable number of poll queues, currently
configurable with the 'poll_queues' module parameter. Defaults to
a single poll queue.

And now we finally have poll support without triggering interrupts!
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

b244a53e

blk-mq: initial support for multiple queue maps · c454176d

Committed by Jens Axboe on Oct 24, 2018

to #28991349

commit 843477d4cc5c4bb4e346c561ecd3b9d0bd67e8c8 upstream

Add a queue offset to the tag map. This enables users to map
iteratively, for each queue map type they support.

Bump maximum number of supported maps to 2, we're now fully
able to support more than 1 map.
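
A small sketch of how a per-map queue offset lets a driver map each queue type in turn. The field names echo the queue map fields as I recall them from upstream, but the fixed CPU count, the round-robin spread and both helper functions are assumptions made for illustration.

```c
#define SKETCH_NR_CPUS 4

/* Hypothetical, simplified queue map. */
struct sketch_queue_map {
	unsigned int mq_map[SKETCH_NR_CPUS];	/* CPU -> hardware queue index */
	unsigned int nr_queues;			/* queues owned by this map */
	unsigned int queue_offset;		/* first queue owned by this map */
};

/* Spread CPUs round-robin over this map's queues, shifted by its offset. */
static void sketch_map_queues(struct sketch_queue_map *qmap)
{
	for (unsigned int cpu = 0; cpu < SKETCH_NR_CPUS; cpu++)
		qmap->mq_map[cpu] = qmap->queue_offset + cpu % qmap->nr_queues;
}

/* Map every type in turn; each map starts where the previous one ended. */
static void sketch_map_all(struct sketch_queue_map *maps, unsigned int nr_maps)
{
	unsigned int offset = 0;

	for (unsigned int i = 0; i < nr_maps; i++) {
		maps[i].queue_offset = offset;
		sketch_map_queues(&maps[i]);
		offset += maps[i].nr_queues;
	}
}
```

With two maps of 2 and 1 queues, the first map covers hardware queues 0 and 1 and the second map starts at queue 2.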
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

c454176d

blk-mq: cache request hardware queue mapping · 107174b9

Committed by Jens Axboe on Oct 29, 2018

to #28991349

commit ea4f995ee8b8f0578b3319949f2edd5d812fdb0a upstream

We call blk_mq_map_queue() a lot, at least two times for each
request per IO, sometimes more. Since we now have an indirect
call in that function as well, cache the mapping so we don't
have to re-call blk_mq_map_queue() for the same request
multiple times.
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

107174b9

blk-mq: support multiple hctx maps · 2ca20c32

Committed by Jens Axboe on Oct 30, 2018

to #28991349

commit b3c661b15d5ab11d982e58bee23e05c1780528a1 upstream

Add support for the tag set carrying multiple queue maps, and
for the driver to inform blk-mq how many it wishes to support
through setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than 1
map, mq_ops->rq_flags_to_type(). The function takes request/bio
flags and CPU, and returns a queue map index for that. We then
use the type information in blk_mq_map_queue() to index the map
set.
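
The lookup described above can be sketched as a two-step index: flags and CPU select a map type, then the CPU selects a hardware queue inside that map. The type names, the polled flag and the table layout are hypothetical; only the overall shape (type first, then CPU) comes from the commit text.

```c
/* Hypothetical map types; a driver reports how many it supports. */
enum {
	SKETCH_TYPE_DEFAULT = 0,
	SKETCH_TYPE_POLL = 1,
	SKETCH_NR_MAPS = 2,
};

#define SKETCH_REQ_POLLED	0x1u	/* hypothetical request flag */
#define SKETCH_CPUS		4

struct sketch_map {
	unsigned int mq_map[SKETCH_CPUS];	/* CPU -> hardware queue index */
};

/* Models a driver's rq_flags_to_type(): flags and CPU in, map index out. */
static int sketch_rq_flags_to_type(unsigned int flags, unsigned int cpu)
{
	(void)cpu;	/* this sketch does not need the CPU to pick a type */
	return (flags & SKETCH_REQ_POLLED) ? SKETCH_TYPE_POLL
					   : SKETCH_TYPE_DEFAULT;
}

/* Models the core lookup: pick the map by type, then index it by CPU. */
static unsigned int sketch_map_queue(const struct sketch_map *maps,
				     unsigned int flags, unsigned int cpu)
{
	int type = sketch_rq_flags_to_type(flags, cpu);

	return maps[type].mq_map[cpu];
}
```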
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

2ca20c32

blk-mq: allow software queue to map to multiple hardware queues · d78f5292

Committed by Jens Axboe on Oct 29, 2018

to #28991349

commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream

The mapping used to be dependent on just the CPU location, but
now it's a tuple of (type, cpu) instead. This is a prep patch
for allowing a single software queue to map to multiple hardware
queues. No functional changes in this patch.

This changes the software queue count to an unsigned short
to save a bit of space. We can still support 64K-1 CPUs,
which should be enough. Add a check to catch a wrap.
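
A wrap check on an unsigned short counter can be as simple as testing for zero right after the increment, since unsigned arithmetic wraps modulo its range. The structure and function names below are hypothetical, and where the kernel would trip an assertion this sketch just returns false.

```c
#include <limits.h>
#include <stdbool.h>

/* Largest count the 16-bit-style counter can hold before wrapping. */
#define SKETCH_MAX_CTX USHRT_MAX

/* Hypothetical stand-in for the hardware queue context. */
struct sketch_hctx {
	unsigned short nr_ctx;	/* number of software queues mapped so far */
};

/* Count one more software queue; false means the counter wrapped to 0. */
static bool sketch_add_ctx(struct sketch_hctx *hctx)
{
	hctx->nr_ctx++;
	return hctx->nr_ctx != 0;
}
```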
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

d78f5292

blk-mq: abstract out queue map · f63859ea

Committed by Jens Axboe on Oct 29, 2018

to #28991349

commit ed76e329d74a4b15ac0f5fd3adbd52ec0178a134 upstream

This is in preparation for allowing multiple sets of maps per
queue, if so desired.
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

f63859ea

blk-mq: kill q->mq_map · 7c058cfc

Committed by Jens Axboe on Oct 16, 2018

to #28991349

commit a8908939af569ce2419f43fd56eeaf003bc3d85d upstream

It's just a pointer to set->mq_map, use that instead. Move the
assignment a bit earlier, so we always know it's valid.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

7c058cfc

genirq/affinity: Add support for allocating interrupt sets · 292e47ef

Committed by Jens Axboe on Nov 02, 2018

to #28991349

commit 6da4b3ab9a6e9b1b5f90322ab3fa3a7dd18edb19 upstream

A driver may have a need to allocate multiple sets of MSI/MSI-X interrupts,
and have them appropriately affinitized.

Add support for defining a number of sets in the irq_affinity structure, of
varying sizes, and get each set affinitized correctly across the machine.

[ tglx: Minor changelog tweaks ]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-block@vger.kernel.org
Link: https://lkml.kernel.org/r/20181102145951.31979-5-ming.lei@redhat.com
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

292e47ef

block: make rq sector size accessible for block stats · 9bdcaff2

Committed by Hou Tao on May 21, 2019

to #29361128

commit 3d24430694077313c75c6b89f618db09943621e4 upstream.

Currently rq->data_len will be decreased by partial completion or
zeroed by completion, so when blk_stat_add() is invoked, data_len
will be zero and there will never be samples in poll_cb because
blk_mq_poll_stats_bkt() will return -1 if data_len is zero.

We could move blk_stat_add() back to __blk_mq_complete_request(),
but that would make the effort of trying to call ktime_get_ns()
once in vain. Instead we can reuse throtl_size field, and use
it for both block stats and block throttle, and adjust the
logic in blk_mq_poll_stats_bkt() accordingly.
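
To see why a zero length yields no samples, here is a user-space sketch of a size-based bucket function modelled on my recollection of the upstream helper. The bucket count of 16, the 512-byte base and the log2 helper are assumptions; the point is only that a zero size produces a negative bucket, which is reported as -1.

```c
#define SKETCH_POLL_STATS_BKTS 16	/* assumed number of buckets */

/* Integer log2 that returns -1 for 0, like an fls()-based helper would. */
static int sketch_ilog2(unsigned int v)
{
	int log = -1;

	while (v) {
		v >>= 1;
		log++;
	}
	return log;
}

/* ddir is 0 for reads and 1 for writes; bytes is the request size. */
static int sketch_poll_stats_bkt(int ddir, unsigned int bytes)
{
	int bucket = ddir + 2 * (sketch_ilog2(bytes) - 9);

	if (bucket < 0)
		return -1;	/* too small to classify, e.g. a zeroed length */
	if (bucket >= SKETCH_POLL_STATS_BKTS)
		return ddir + SKETCH_POLL_STATS_BKTS - 2;	/* clamp to last pair */
	return bucket;
}
```

Feeding the function a size that has already been zeroed by completion always gives -1, which matches the symptom described in the commit message; keeping the original size in a separate field avoids that.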

Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()")
Tested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>

9bdcaff2
