- 29 June 2020, 9 commits
-
-
Submitted by James Morse

fix #28612342

commit eeb2555779471abdbcc6289a52dc54ce513feaf2 upstream

When CPER records are found, the address of the records is stashed in the
struct ghes. Once the records have been processed, this address is
overwritten with zero so that it won't be processed again without being
re-populated by firmware.

This goes wrong if a struct ghes can be processed concurrently, as can
happen at probe time when an NMI occurs. If the NMI arrives on another CPU,
the probing CPU may call ghes_clear_estatus() on the records before the
handler has finished with them. Even on the same CPU, once the interrupted
handler is resumed, it will call ghes_clear_estatus() on the NMI's records;
this memory may have already been re-used by firmware.

Avoid this stashing by letting the caller hold the address. A later patch
will do away with the use of ghes->flags in the read/clear code too.

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Submitted by James Morse

fix #28612342

commit fb7be08f1a091ec243780bfdad4bf0c492057808 upstream

Adding new NMI-like notifications duplicates the calls that grow and shrink
the estatus pool. This is all pretty pointless, as the size is capped to
64K. Allocate this for each ghes and drop the code that grows and shrinks
the pool.

Suggested-by: Borislav Petkov <bp@suse.de>
Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Submitted by James Morse

fix #28612342

commit e147133a42cb9df6cbc99503fdf58d0e6388bf2a upstream

ghes.c has a memory pool it uses for the estatus cache and the estatus
queue. The cache is initialised when registering the platform driver. For
the queue, an NMI-like notification has to grow/shrink the pool as it is
registered and unregistered. This is all pretty noisy when adding new
NMI-like notifications; it would be better to replace this with a static
pool size based on the number of users.

As a precursor, move the call that creates the pool from ghes_init() into
hest.c. Later this will take the number of ghes entries and consolidate the
queue allocations. Remove ghes_estatus_pool_exit() as hest.c doesn't have
anywhere to put this.

The pool is now initialised as part of ACPI's subsys_initcall():
(acpi_init(), acpi_scan_init(), acpi_pci_root_init(), acpi_hest_init()).
Before this patch it happened later, as a GHES-specific device_initcall().

Signed-off-by: James Morse <james.morse@arm.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: luanshi <zhangliguang@linux.alibaba.com>
-
Submitted by Jens Axboe

fix #28871358

commit d666ba98f849ad44c4405ecc2180390ebe80f4f9 upstream

blk-mq passes information to the hardware about any given request being
the last that we will issue in this sequence. The point is that hardware
can defer costly doorbell type writes to the last request. But if we run
into errors issuing a sequence of requests, we may never send the request
with bd->last == true set. For that case, we need a hook that tells the
hardware that nothing else is coming right now.

For failures returned by the driver's ->queue_rq() hook, the driver is
responsible for flushing pending requests, if it uses bd->last to optimize
that part. This works like before, no changes there.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
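As a hedged illustration of the new hook, here is a minimal sketch of a
driver wiring up ->commit_rqs(); the my_dev structure, its doorbell layout,
and my_queue_rq() are hypothetical, only the ->commit_rqs() member itself
comes from this commit:

    #include <linux/blk-mq.h>
    #include <linux/io.h>

    /* hypothetical per-device state, for illustration only */
    struct my_dev {
            u16 sq_tail;
            void __iomem *sq_doorbell;
    };

    static blk_status_t my_queue_rq(struct blk_mq_hw_ctx *hctx,
                                    const struct blk_mq_queue_data *bd);

    /*
     * Called when no more requests will follow, e.g. after an error in a
     * queueing sequence, so a deferred doorbell write is never lost.
     */
    static void my_commit_rqs(struct blk_mq_hw_ctx *hctx)
    {
            struct my_dev *dev = hctx->driver_data;

            writel(dev->sq_tail, dev->sq_doorbell);
    }

    static const struct blk_mq_ops my_mq_ops = {
            .queue_rq   = my_queue_rq,
            .commit_rqs = my_commit_rqs,
    };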
-
Submitted by Jens Axboe

fix #28871358

Only do it if we have requests for multiple queues in the same plug.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Christoph Hellwig

fix #28339081

commit edfbcb321faf07ca970e4191abe061deeb7d3788 upstream

The USB buffer allocation code is the only place in the usb core (and in
fact the whole kernel) that uses is_device_dma_capable, while the URB
mapping code uses the uses_dma flag in struct usb_bus. Switch the buffer
allocation to use the uses_dma flag used by the rest of the USB code, and
create a helper in hcd.h that checks this flag as well as CONFIG_HAS_DMA
to simplify the caller a bit.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20190811080520.21712-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
-
Submitted by Laurentiu Tudor

fix #28339081

commit 2d7a3dc3e24f43504b1f25eae8195e600f4cce8b upstream

With the addition of the local memory allocator, the HCD_LOCAL_MEM flag
can be dropped and the checks against it replaced with a check for the
localmem_pool ptr being initialized.

Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Tested-by: Fredrik Noring <noring@nocrew.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
-
Submitted by Laurentiu Tudor

fix #28339081

commit b0310c2f09bbe8aebefb97ed67949a3a7092aca6 upstream

For HCs that have local memory, replace the current DMA API usage with a
genalloc generic allocator to manage the mappings for these devices. To
help users, introduce a new HCD API, usb_hcd_setup_local_mem(), that will
set up the genalloc backing the device local memory. It will be used in
subsequent patches.

This is in preparation for dropping the existing "coherent" dma mem
declaration APIs. The current implementation was relying on a short
circuit in the DMA API that, in the end, was acting as an allocator for
this type of device.

Signed-off-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
Tested-by: Fredrik Noring <noring@nocrew.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
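A rough sketch of how an HC driver might call the new API; the
MY_LOCAL_MEM_* constants are hypothetical platform properties, not part of
this commit:

    #include <linux/sizes.h>
    #include <linux/usb/hcd.h>

    /* hypothetical location/size of the controller's on-chip memory */
    #define MY_LOCAL_MEM_PHYS       0x1f000000UL
    #define MY_LOCAL_MEM_DMA        0x1f000000UL
    #define MY_LOCAL_MEM_SIZE       SZ_1M

    static int my_hc_setup_mem(struct usb_hcd *hcd)
    {
            /*
             * Back hcd->localmem_pool with the device-local memory;
             * buffer allocations are then served from this genalloc pool.
             */
            return usb_hcd_setup_local_mem(hcd, MY_LOCAL_MEM_PHYS,
                                           MY_LOCAL_MEM_DMA,
                                           MY_LOCAL_MEM_SIZE);
    }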
-
Submitted by Fredrik Noring

fix #28339081

commit da83a722959a82733c3ca60030cc364ca2318c5a upstream

gen_pool_dma_zalloc() is a zeroed memory variant of gen_pool_dma_alloc().
Also document the return values of both, and indicate NULL as a "%NULL"
constant.

Signed-off-by: Fredrik Noring <noring@nocrew.org>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Alex Shi <alex.shi@linux.alibaba.com>
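A minimal usage sketch, assuming "pool" was previously set up over
device-local memory (the 256-byte size is arbitrary):

    #include <linux/genalloc.h>

    static void *alloc_ctrl_block(struct gen_pool *pool, dma_addr_t *dma)
    {
            /*
             * Like gen_pool_dma_alloc(), but the returned buffer is
             * zeroed; returns %NULL on failure.
             */
            return gen_pool_dma_zalloc(pool, 256, dma);
    }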
-
- 24 June 2020, 3 commits
-
-
Submitted by Dietmar Eggemann

to #28739709

commit 0e1fef63d92d61ed561e504c3a078a827a0f9bfe upstream

The sched domain per-rq load index files also disappear from the
/proc/sys/kernel/sched_domain/cpuX/domainY directories.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
-
Submitted by Dietmar Eggemann

to #28739709

commit 5e83eafbfd3b351537c0d74467fc43e8a88f4ae4 upstream

With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx]
any more.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Rik van Riel <riel@surriel.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Morten Rasmussen <morten.rasmussen@arm.com>
Cc: Patrick Bellasi <patrick.bellasi@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Quentin Perret <quentin.perret@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Valentin Schneider <valentin.schneider@arm.com>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
-
Submitted by Daniel Lezcano

to #28739709

commit a7fe5190c03f8137ef08db84a58dd4daf2c4785d upstream

The function get_loadavg() returns almost always zero. To be more precise,
statistically speaking, out of a total of 1023379 passes through the
function, the load is equal to zero 1020728 times and greater than 100
just 610 times; the remainder are between 0 and 5.

In 2011, get_loadavg() was removed from the Android tree because of the
above [1]. At that time, the load was:

    unsigned long this_cpu_load(void)
    {
            struct rq *this = this_rq();
            return this->cpu_load[0];
    }

In 2014, the code was changed by commit 372ba8cb (cpuidle: menu: Lookup
CPU runqueues less) and the load is:

    void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
    {
            struct rq *rq = this_rq();
            *nr_waiters = atomic_read(&rq->nr_iowait);
            *load = rq->load.weight;
    }

with the same result. Both measurements show that using the load in this
code path no longer matters. Remove it.

[1] https://android.googlesource.com/kernel/common/+/4dedd9f124703207895777ac6e91dacde0f7cc17

Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
-
- 23 June 2020, 4 commits
-
-
Submitted by Yihao Wu

to #28739709

Assume the workloads are composed of a massive number of short tasks. Then
periodic load tracking is unnecessary, because load tracking is already
guaranteed by the frequent sleeps and wake-ups.

If these massive short tasks run in their individual cgroups, the load
tracking becomes extremely heavy. This patch adds a switch to bypass
scheduler_tick load tracking, in order to reduce scheduler overhead
without sacrificing much balance in this scenario.

Performance tests:

1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
   sched overhead (each HT): 0.74% -> 0.48%
   (This test's baseline is from the previous patch)

2) sysbench-threads with 96 threads, running for 5min
   latency_ms 95th: 63.07 -> 54.01

Besides these, no regression is found on our test platform.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Yihao Wu

to #28739709

Unless the workloads are IO-bound, update_blocked_averages doesn't help
load balancing. This patch adds a switch to bypass
update_blocked_averages if prior knowledge about the workloads indicates
IO is negligible.

Performance tests:

1) 1100+ tasks in their individual cgroups, on a 96-HT Skylake machine
   sched overhead (each HT): 3.78% -> 0.74%

2) cgroup-overhead benchmark in our sched-test suite on a 96-HT Skylake
   overhead: 21.06 -> 18.08

3) unixbench context1 with 96 threads running for 1min
   Score: 15409.40 -> 16821.77

Besides these, UnixBench has some performance ups and downs, but
generally the performance of UnixBench hasn't changed.

Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Acked-by: Michael Wang <yun.wang@linux.alibaba.com>
-
Submitted by Yang Shi

task #27327988

The commit ("thp: change CoW semantics for anon-THP") rewrites the THP CoW
page fault handler to allocate a base page only, but there is a request to
keep the old behavior just in case. So, introduce a new sysfs knob,
fast_cow, to control the behavior. The default is the new behavior; write
0 to the knob to switch to the old behavior.

Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
[ caspar: fix checkpatch.pl warnings ]
Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
-
Submitted by Kirill A. Shutemov

task #27327988

commit 71a2c112a0f6da497e1b44e18e97b1716c240518 upstream

'max_ptes_shared' specifies how many pages can be shared across multiple
processes. Exceeding the number would block the collapse:

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

A higher value may increase the memory footprint for some workloads. By
default, at least half of the pages must not be shared.

[colin.king@canonical.com: fix several spelling mistakes]
Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Tested-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: William Kucharski <william.kucharski@oracle.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Yang Shi <yang.shi@linux.alibaba.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: John Hubbard <jhubbard@nvidia.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Ralph Campbell <rcampbell@nvidia.com>
Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 15 June 2020, 2 commits
-
-
Submitted by Sahitya Tummala

task #28557799

[ Upstream commit 30a2da7b7e225ef6c87a660419ea04d3cef3f6a7 ]

There is a potential race between ioc_release_fn() and ioc_clear_queue(),
as shown below, due to which the kernel crash below is observed. It can
also result in a use-after-free issue.

context#1:                          context#2:
ioc_release_fn()                    __ioc_clear_queue() gets the same icq
  ->spin_lock(&ioc->lock);          ->spin_lock(&ioc->lock);
  ->ioc_destroy_icq(icq);
    ->list_del_init(&icq->q_node);
    ->call_rcu(&icq->__rcu_head,
               icq_free_icq_rcu);
  ->spin_unlock(&ioc->lock);
                                    ->ioc_destroy_icq(icq);
                                      ->hlist_del_init(&icq->ioc_node);

This results in the crash below, as this memory is now used by
icq->__rcu_head in context#1. There is a chance that icq could be free'd
as well.

22150.386550: <6> Unable to handle kernel write to read-only memory at
virtual address ffffffaa8d31ca50
...
Call trace:
22150.607350: <2>  ioc_destroy_icq+0x44/0x110
22150.611202: <2>  ioc_clear_queue+0xac/0x148
22150.615056: <2>  blk_cleanup_queue+0x11c/0x1a0
22150.619174: <2>  __scsi_remove_device+0xdc/0x128
22150.623465: <2>  scsi_forget_host+0x2c/0x78
22150.627315: <2>  scsi_remove_host+0x7c/0x2a0
22150.631257: <2>  usb_stor_disconnect+0x74/0xc8
22150.635371: <2>  usb_unbind_interface+0xc8/0x278
22150.639665: <2>  device_release_driver_internal+0x198/0x250
22150.644897: <2>  device_release_driver+0x24/0x30
22150.649176: <2>  bus_remove_device+0xec/0x140
22150.653204: <2>  device_del+0x270/0x460
22150.656712: <2>  usb_disable_device+0x120/0x390
22150.660918: <2>  usb_disconnect+0xf4/0x2e0
22150.664684: <2>  hub_event+0xd70/0x17e8
22150.668197: <2>  process_one_work+0x210/0x480
22150.672222: <2>  worker_thread+0x32c/0x4c8

Fix this by adding a new ICQ_DESTROYED flag in ioc_destroy_icq() to
indicate this icq has already been marked as destroyed. Also, ensure
__ioc_clear_queue() accesses the icq within rcu_read_lock/unlock so that
the icq doesn't get free'd up while it is still being used.

Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Co-developed-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Pradeep P V K <ppvk@codeaurora.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
Submitted by Mikulas Patocka

task #28557799

commit ad6bf88a6c19a39fb3b0045d78ea880325dfcf15 upstream.

Logical block size has type unsigned short. That means it can be at most
32768. However, there are architectures that can run with 64k pages (for
example arm64), and on these architectures it may be possible to create
block devices with a 64k block size. For example (run this on an
architecture with 64k pages), mount will fail with this error because it
tries to read the superblock using 2-sector access:

    device-mapper: writecache: I/O is not aligned, sector 2, size 1024, block size 65536
    EXT4-fs (dm-0): unable to read superblock

This patch changes the logical block size from unsigned short to unsigned
int to avoid the overflow.

Cc: stable@vger.kernel.org
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
-
- 11 June 2020, 1 commit
-
-
Submitted by Joseph Qi

fix #28528017

In the case of a virtio-blk device, checking
/sys/block/<device>/queue/io_poll shows 1 and the user can't disable it.
Actually virtio-blk doesn't support poll yet, so this will confuse the end
user. The root cause is that mq initialization sets the QUEUE_FLAG_POLL
bit by default.

This fix takes ideas from the following upstream commits:
6544d229bf43 ("block: enable polling by default if a poll map is initalized")
6e0de61107f0 ("blk-mq: remove QUEUE_FLAG_POLL from default MQ flags")

Since we don't want to get HCTX_TYPE_POLL related logic involved, just
check mq_ops->poll and then set QUEUE_FLAG_POLL.

Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
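A minimal sketch of the described check; its exact placement in the mq
initialization path is assumed here:

    #include <linux/blk-mq.h>
    #include <linux/blkdev.h>

    /* only advertise io_poll when the driver actually implements ->poll() */
    static void my_setup_poll_flag(struct request_queue *q,
                                   struct blk_mq_tag_set *set)
    {
            if (set->ops->poll)
                    blk_queue_flag_set(QUEUE_FLAG_POLL, q);
            else
                    blk_queue_flag_clear(QUEUE_FLAG_POLL, q);
    }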
-
- 09 June 2020, 5 commits
-
-
Submitted by Johannes Weiner

task #28327019

commit b8e24a9300b0836a9d39f6b20746766b3b81f1bd upstream

psi tracks the time tasks wait for refaulting pages to become uptodate,
but it does not track the time spent submitting the IO. The submission
part can be significant if backing storage is contended or when cgroup
throttling (io.latency) is in effect - a lot of time is spent in
submit_bio(). In that case, we underreport memory pressure.

Annotate submit_bio() to account submission time as memory stall when the
bio is reading userspace workingset pages.

Tested-by: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
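The core of the annotation looks roughly like the sketch below, where
BIO_WORKINGSET marks bios carrying refaulting userspace workingset pages:

    #include <linux/bio.h>
    #include <linux/psi.h>

    blk_qc_t submit_bio(struct bio *bio)
    {
            /* ... request accounting elided ... */

            /* submitting a workingset refault read counts as a memstall */
            if (unlikely(bio_op(bio) == REQ_OP_READ &&
                         bio_flagged(bio, BIO_WORKINGSET))) {
                    unsigned long pflags;
                    blk_qc_t ret;

                    psi_memstall_enter(&pflags);
                    ret = generic_make_request(bio);
                    psi_memstall_leave(&pflags);
                    return ret;
            }

            return generic_make_request(bio);
    }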
-
Submitted by zhongjiang-ali

task #28327019

Commit bc0cc360 ("alinux: blk-throttle: fix tg NULL pointer dereference")
added a self-defined bio flag to fix a use-after-free issue. But bio flags
are limited to 13 entries, which have all been used up, so syncing related
patches will fail. This patch replaces the reserved field with extended
bio_flags to allow us to define more bio flags.

Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
-
Submitted by Yafang Shao

task #28327019

commit 1066d1b6974e095d5a6c472ad9180a957b496cd6 upstream

task->flags is a 32-bit bitmask, of which 31 bits have already been
consumed, so it is hard to introduce another new per-process flag.
Currently there is still enough space in the bit-field section of
task_struct, so we can define the memstall state as a single bit in
task_struct instead. This patch also removes an out-of-date comment
pointed out by Matthew.

Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Johannes Weiner

task #28327019

commit 36b238d5717279163859fb6ba0f4360abcafab83 upstream

When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now we don't exploit
that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.

This patch implements an optimization where we only update cgroups whose
state actually changes during a task switch. These are all cgroups that
contain one task but not the other, up to the first shared ancestor. When
both tasks are in the same group, we don't need to update anything at all.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow code
reuse, move the task flag maintenance code into a new function and the
poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
Submitted by Johannes Weiner

task #28327019

commit b05e75d611380881e73edc58a20fd8c6bb71720b upstream

For simplicity, cpu pressure is defined as having more than one runnable
task on a given CPU. This works on the system level, but it has
limitations in a cgrouped reality: when cpu.max is in use, it doesn't
capture the time in which a task is not executing on the CPU due to
throttling. Likewise, it doesn't capture the time in which a competing
cgroup is occupying the CPU - meaning it only reflects cgroup-internal
competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

    NR_RUNNING > 1

to

    NR_RUNNING > ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max setting of
5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org
Signed-off-by: zhongjiang-ali <zhongjiang-ali@linux.alibaba.com>
Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
-
- 05 June 2020, 2 commits
-
-
Submitted by Christian Borntraeger

fix #28092200

commit cdd6ad3ac63d2fa320baefcf92a02a918375c30f upstream

There are cases where halt polling is unwanted. For example, when running
KVM on an overcommitted LPAR we would rather give the CPU back to
neighbour LPARs instead of polling. Let us provide a callback that allows
architectures to disable polling.

Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
Signed-off-by: chenxiangzuo <cxz18821786681@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
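A sketch of the callback: the weak default keeps polling enabled, and the
halt-polling path in kvm_vcpu_block() additionally consults it (shown here
in comment form):

    #include <linux/kvm_host.h>

    /* default: no architecture-specific reason to avoid halt polling */
    bool __weak kvm_arch_no_poll(struct kvm_vcpu *vcpu)
    {
            return false;
    }

    /*
     * In kvm_vcpu_block(), polling is now gated on the hook:
     *
     *      if (vcpu->halt_poll_ns && !kvm_arch_no_poll(vcpu)) {
     *              ... poll for a wakeup before scheduling out ...
     *      }
     */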
-
Submitted by Paolo Bonzini

to #28092200

commit d970a325561da5e611596cbb06475db3755ce823 upstream

Reported with "make W=1" due to -Wmissing-prototypes.

Reported-by: Qian Cai <cai@lca.pw>
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: chenxiangzuo <cxz18821786681@linux.alibaba.com>
Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
-
- 04 June 2020, 9 commits
-
-
Submitted by Lukas Bulwahn

to #28170604

commit 9f5834c868e901b00f1bfe4d0052b5906b4a2b7f upstream

Commit bbbdeb4720a0 ("io_uring: dual license io_uring.h uapi header") uses
a nested SPDX-License-Identifier to dual license the header. Since then,
./scripts/spdxcheck.py complains:

    include/uapi/linux/io_uring.h: 1:60 Missing parentheses: OR

Add parentheses to make spdxcheck.py happy.

Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
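For reference, the resulting first line of include/uapi/linux/io_uring.h:

    /* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */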
-
Submitted by Jens Axboe

to #28170604

commit bbbdeb4720a0759ec90e3bcb20ad28d19e531346 upstream

This just syncs the header with the liburing version, so there's no
confusion on the license of the header parts.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe

to #28170604

commit 067524e914cb23e20d59480b318fe2625eaee7c8 upstream

We have IORING_OP_PROVIDE_BUFFERS, but the only way to remove buffers is
to trigger IO on them. The usual case of shrinking a buffer pool would be
to just not replenish the buffers when IO completes, and instead just free
them. But it may be nice to have a way to manually remove a number of
buffers from a given group, and IORING_OP_REMOVE_BUFFERS provides that
functionality.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe

to #28170604

commit 0a384abfae66651b28e4bbe16883b1ff046ba3b3 upstream

This splits it into two parts, one that imports the message, and one that
imports the iovec. This allows a caller to only do the first part, and
import the iovec manually afterwards.

No functional changes in this patch.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe

to #28170604

commit bcda7baaa3f15c7a95db3c024bb046d6e298f76b upstream

If a server process has tons of pending socket connections, generally it
uses epoll to wait for activity. When the socket is ready for reading (or
writing), the task can select a buffer and issue a recv/send on the given
fd.

Now that we have fast (non-async thread) support, a task can have tons of
pending reads or writes. But that means they need buffers to back that
data, and if the number of connections is high enough, having them
preallocated for all possible connections is unfeasible.

With IORING_OP_PROVIDE_BUFFERS, an application can register buffers to use
for any request. The request then sets IOSQE_BUFFER_SELECT in the sqe, and
a given group ID in sqe->buf_group. When the fd becomes ready, a free
buffer from the specified group is selected. If none are available, the
request is terminated with -ENOBUFS. If successful, the CQE on completion
will contain the buffer ID chosen in the cqe->flags member, encoded as:

    (buffer_id << IORING_CQE_BUFFER_SHIFT) | IORING_CQE_F_BUFFER;

Once a buffer has been consumed by a request, it is no longer available
and must be registered again with IORING_OP_PROVIDE_BUFFERS.

Requests need to support this feature. For now, IORING_OP_READ and
IORING_OP_RECV support it. This is checked on SQE submission; a CQE with
res == -EOPNOTSUPP will be posted if attempted on unsupported requests.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
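A hedged userspace sketch using liburing helpers (assuming a liburing
recent enough to know about buffer selection); buffers for group BGID must
already have been registered with IORING_OP_PROVIDE_BUFFERS:

    #include <liburing.h>

    #define BGID 1  /* buffer group chosen for this example */

    static int recv_with_selected_buffer(struct io_uring *ring, int sockfd)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
            struct io_uring_cqe *cqe;
            int bid;

            /* NULL buffer: the kernel picks one from group BGID when ready */
            io_uring_prep_recv(sqe, sockfd, NULL, 4096, 0);
            sqe->flags |= IOSQE_BUFFER_SELECT;
            sqe->buf_group = BGID;

            io_uring_submit(ring);
            io_uring_wait_cqe(ring, &cqe);

            /* the chosen buffer ID comes back encoded in cqe->flags */
            bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;
            io_uring_cqe_seen(ring, cqe);
            return bid;
    }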
-
Submitted by Jens Axboe

to #28170604

commit ddf0322db79c5984dc1a1db890f946dd19b7d6d9 upstream

IORING_OP_PROVIDE_BUFFERS uses the buffer registration infrastructure to
support passing in an addr/len that is associated with a buffer ID and
buffer group ID. The group ID is used to index and look up the buffers,
while the buffer ID can be used to notify the application which buffer in
the group was used. The addr passed in is the starting buffer address, and
length is the length of each buffer. The number of buffers to add can be
specified, in which case addr is incremented by length for each addition,
and each buffer increments the buffer ID specified.

No validation is done of the buffer ID. If the application provides
buffers within the same group with identical buffer IDs, then it'll have a
hard time telling which buffer ID was used. The only restriction is that
the buffer ID can be a max of 16 bits in size, so USHRT_MAX is the maximum
ID that can be used.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Notes: use VERIFY_WRITE for access_ok()
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
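A hedged sketch of providing a group of buffers through liburing (group
ID, count, and sizes are illustrative):

    #include <liburing.h>
    #include <stdlib.h>

    #define BGID    1
    #define BUF_LEN 4096
    #define NR_BUFS 8

    static int provide_buffers(struct io_uring *ring)
    {
            void *base = malloc((size_t)NR_BUFS * BUF_LEN);
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!base || !sqe)
                    return -1;
            /* addr advances by BUF_LEN per buffer; IDs run 0..NR_BUFS-1 */
            io_uring_prep_provide_buffers(sqe, base, BUF_LEN, NR_BUFS,
                                          BGID, 0);
            return io_uring_submit(ring);
    }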
-
Submitted by Jens Axboe

to #28170604

commit d7718a9d25a61442da8ee8aeeff6a0097f0ccfd6 upstream

Currently io_uring tries any request in a non-blocking manner, if it can,
and then retries from a worker thread if we get -EAGAIN. Now that we have
a new and fancy poll based retry backend, use that to retry requests if
the file supports it.

This means that, for example, an IORING_OP_RECVMSG on a socket no longer
requires an async thread to complete the IO. If we get -EAGAIN reading
from the socket in a non-blocking manner, we arm a poll handler for
notification on when the socket becomes readable. When it does, the
pending read is executed directly by the task again, through the io_uring
task work handlers. Not only is this faster and more efficient, it also
means we're not generating potentially tons of async threads that just sit
and block, waiting for the IO to complete.

The feature is marked with IORING_FEAT_FAST_POLL, meaning that async
pollable IO is fast, and that poll<link>other_op is fast as well.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Pavel Begunkov

to #28170604

commit 7d67af2c013402537385dae343a2d0f6a4cb3bfd upstream

Add support for splice(2).

- output file is specified as sqe->fd, so it's handled by generic code
- hash_reg_file is handled by generic code as well
- len is 32bit, but should be fine
- the fd_in is a registered file when SPLICE_F_FD_IN_FIXED is set, which
  is a splice flag (i.e. sqe->splice_flags)

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
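A hedged liburing sketch queueing a splice from a pipe into a file;
offsets of -1 mean "pipe has no offset / use the file position":

    #include <liburing.h>

    static int queue_splice(struct io_uring *ring, int pipe_in,
                            int file_out, unsigned int nbytes)
    {
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -1;
            io_uring_prep_splice(sqe, pipe_in, -1, file_out, -1, nbytes, 0);
            return io_uring_submit(ring);
    }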
-
Submitted by Pavel Begunkov

to #28170604

commit 444ebb5768c5c43aadfc60111fecd6c4f946e77b upstream

Make do_splice() public, so other kernel parts can reuse it.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
- 28 May 2020, 5 commits
-
-
Submitted by Jens Axboe

to #26323588

commit 09952e3e7826119ddd4357c453d54bcc7ef25156 upstream.

Just like commit 4022e7af86be, this fixes the fact that IORING_OP_ACCEPT
ends up using get_unused_fd_flags(), which checks
current->signal->rlim[] for limits.

Add an extra argument to __sys_accept4_file() that allows us to pass in
the proper nofile limit, and grab it at request prep time.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe

to #26323588

commit 4022e7af86be2dd62975dedb6b7ea551d108695e upstream.

Dmitry reports that a test case shows that io_uring isn't honoring a
modified rlimit nofile setting. get_unused_fd_flags() checks the task's
signal->rlim[] for the limits. As this isn't easily inheritable, provide
a __get_unused_fd_flags() that takes the value instead. Then we can grab
it when the request is prepared (from the original task), and pass that
in when we do the async part of the open.

Reported-by: Dmitry Kadashev <dkadashev@gmail.com>
Tested-by: Dmitry Kadashev <dkadashev@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
Submitted by Jens Axboe

to #26323588

commit b5e683d5cab8cd433b06ae178621f083cabd4f63 upstream.

eventfd use cases from aio and io_uring can deadlock due to circular or
recursive calling, when eventfd_signal() tries to grab the waitqueue
lock. On top of that, it's also possible to construct notification chains
that are deep enough that we could blow the stack.

Add a percpu counter that tracks the percpu recursion depth, and warn if
we exceed it. The counter is also exposed so that users of
eventfd_signal() can do the right thing if it's non-zero in the context
where it is called.

Cc: stable@vger.kernel.org # 4.19+
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
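A hedged sketch of how a caller can consult the exposed counter; the
defer-to-workqueue strategy is an assumption for illustration, not part
of this commit:

    #include <linux/eventfd.h>
    #include <linux/workqueue.h>

    /*
     * If we are already inside an eventfd_signal() on this CPU,
     * signalling again could recurse; defer to process context instead.
     */
    static void notify_safely(struct eventfd_ctx *ctx,
                              struct work_struct *work)
    {
            if (eventfd_signal_count())
                    schedule_work(work);
            else
                    eventfd_signal(ctx, 1);
    }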
-
Submitted by Jens Axboe

to #26323588

commit 3e4827b05d2ac2d377ed136a52829ec46787bf4b upstream.

This adds IORING_OP_EPOLL_CTL, which can perform the same work as the
epoll_ctl(2) system call.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
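A hedged liburing sketch adding a descriptor to an epoll set through the
ring; the epoll_event must stay alive until completion, hence static here:

    #include <liburing.h>
    #include <sys/epoll.h>

    static int queue_epoll_add(struct io_uring *ring, int epfd, int fd)
    {
            static struct epoll_event ev = { .events = EPOLLIN };
            struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

            if (!sqe)
                    return -1;
            ev.data.fd = fd;
            io_uring_prep_epoll_ctl(sqe, epfd, fd, EPOLL_CTL_ADD, &ev);
            return io_uring_submit(ring);
    }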
-
Submitted by Jens Axboe

to #26323588

commit 39220e8d4a2aaab045ea03cc16d737e85d0817bf upstream.

Also make it available outside of epoll, along with the helper that
decides if we need to copy the passed-in epoll_event.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-