提交 · 25ab3dc432c7a65c88b28c833607009b2248e459 · openanolis / cloud-kernel

27 12月, 2019 40 次提交

irqchip/gic-v3-its: Change initialization ordering for LPIs · 25ab3dc4

由 Marc Zyngier 提交于 7月 27, 2018

commit d38a71c5452529fd3326b0ae488292e5fbd8d2a1 upstream

We currently initialize the LPIs (and the ITS) fairly early, even
before the SMP support and the CPU interface. This is a bit odd
(as LPIs are not exactly crutial for the early boot process),
and is going to cause issues when reorganizing the probing code.

Let's move this initialization later.
Reviewed-by: NJulien Thierry <julien.thierry@arm.com>
Tested-by: NJeremy Linton <jeremy.linton@arm.com>
Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
Tested-by: NLei Zhang <zhang.lei@jp.fujitsu.com>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

25ab3dc4

iommu/arm-smmu-v3: Remove unnecessary wrapper function · 607e86d5

由 Andrew Murray 提交于 10月 10, 2018

commit 5e731073bc0a4a53a213412dbd33982d829560f1 upstream

Simplify the code by removing an unnecessary wrapper function.

This was left behind by commit 2f657add
("iommu/arm-smmu-v3: Specialise CMD_SYNC handling")
Signed-off-by: NAndrew Murray <andrew.murray@arm.com>
Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NJoerg Roedel <jroedel@suse.de>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

607e86d5

iommu/arm-smmu: Support non-strict mode · a1fdb30f

由 Robin Murphy 提交于 9月 20, 2018

commit 44f6876a00e83df5fd28681502b19b0f51e4a3c6 upstream

All we need is to wire up .flush_iotlb_all properly and implement the
domain attribute, and iommu-dma and io-pgtable will do the rest for us.
The only real subtlety is documenting the barrier semantics we're
introducing between io-pgtable and the drivers for non-strict flushes.
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

a1fdb30f

iommu/io-pgtable-arm-v7s: Add support for non-strict mode · 989011e7

由 Robin Murphy 提交于 9月 20, 2018

commit b2dfeba654cb08db327d0ed4547b66c2f8fce997 upstream

As for LPAE, it's simply a case of skipping the leaf invalidation for a
regular unmap, and ensuring that the one in split_blk_unmap() is paired
with an explicit sync ASAP rather than relying on one which might only
eventually happen way down the line.
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

989011e7

iommu/arm-smmu-v3: Add support for non-strict mode · c35eee6e

由 Zhen Lei 提交于 9月 20, 2018

commit 9662b99a19abccb0b7bfc91abb3fec1447c35bf0 upstream

Now that io-pgtable knows how to dodge strict TLB maintenance, all
that's left to do is bridge the gap between the IOMMU core requesting
DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE for default domains, and showing the
appropriate IO_PGTABLE_QUIRK_NON_STRICT flag to alloc_io_pgtable_ops().
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
[rm: convert to domain attribute, tweak commit message]
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

c35eee6e

iommu/io-pgtable-arm: Add support for non-strict mode · 0e8d86ed

由 Zhen Lei 提交于 9月 20, 2018

commit b6b65ca20bc93d14319f9b5cf98fd3c19a4244e3 upstream

Non-strict mode is simply a case of skipping 'regular' leaf TLBIs, since
the sync is already factored out into ops->iotlb_sync at the core API
level. Non-leaf invalidations where we change the page table structure
itself still have to be issued synchronously in order to maintain walk
caches correctly.

To save having to reason about it too much, make sure the invalidation
in arm_lpae_split_blk_unmap() just performs its own unconditional sync
to minimise the window in which we're technically violating the break-
before-make requirement on a live mapping. This might work out redundant
with an outer-level sync for strict unmaps, but we'll never be splitting
blocks on a DMA fastpath anyway.
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
[rm: tweak comment, commit message, split_blk_unmap logic and barriers]
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

0e8d86ed

iommu: Add "iommu.strict" command line option · eb59e389

由 Zhen Lei 提交于 9月 20, 2018

commit 68a6efe86f6a16e25556a2aff40efad41097b486 upstream

Add a generic command line option to enable lazy unmapping via IOVA
flush queues, which will initally be suuported by iommu-dma. This echoes
the semantics of "intel_iommu=strict" (albeit with the opposite default
value), but in the driver-agnostic fashion of "iommu.passthrough".
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
[rm: move handling out of SMMUv3 driver, clean up documentation]
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
[will: dropped broken printk when parsing command-line option]
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

eb59e389

iommu/dma: Add support for non-strict mode · bd385efe

由 Zhen Lei 提交于 9月 20, 2018

commit 2da274cdf998a1c12afa6b5975db2df1df01edf1 upstream

With the flush queue infrastructure already abstracted into IOVA
domains, hooking it up in iommu-dma is pretty simple. Since there is a
degree of dependency on the IOMMU driver knowing what to do to play
along, we key the whole thing off a domain attribute which will be set
on default DMA ops domains to request non-strict invalidation. That way,
drivers can indicate the appropriate support by acknowledging the
attribute, and we can easily fall back to strict invalidation otherwise.

The flush queue callback needs a handle on the iommu_domain which owns
our cookie, so we have to add a pointer back to that, but neatly, that's
also sufficient to indicate whether we're using a flush queue or not,
and thus which way to release IOVAs. The only slight subtlety is
switching __iommu_dma_unmap() from calling iommu_unmap() to explicit
iommu_unmap_fast()/iommu_tlb_sync() so that we can elide the sync
entirely in non-strict mode.
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
[rm: convert to domain attribute, tweak comments and commit message]
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

bd385efe

iommu/arm-smmu-v3: Implement flush_iotlb_all hook · f83ff7c9

由 Zhen Lei 提交于 9月 20, 2018

commit 07fdef34d2be6811f00c6f9e4e2a1483cf86696c upstream

.flush_iotlb_all is currently stubbed to arm_smmu_iotlb_sync() since the
only time it would ever need to actually do anything is for callers
doing their own explicit batching, e.g.:

	iommu_unmap_fast(domain, ...);
	iommu_unmap_fast(domain, ...);
	iommu_iotlb_flush_all(domain, ...);

where since io-pgtable still issues the TLBI commands implicitly in the
unmap instead of implementing .iotlb_range_add, the "flush" only needs
to ensure completion of those already-in-flight invalidations.

However, we're about to start using it in anger with flush queues, so
let's get a proper implementation wired up.
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: NRobin Murphy <robin.murphy@arm.com>
[rm: document why it wasn't a bug]
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

f83ff7c9

iommu/arm-smmu-v3: Avoid back-to-back CMD_SYNC operations · 57afa222

由 Zhen Lei 提交于 8月 19, 2018

commit 901510ee32f7190902f6fe4affb463e5d86a804c upstream

Putting adjacent CMD_SYNCs into the command queue is nonsensical, but
can happen when multiple CPUs are inserting commands. Rather than leave
the poor old hardware to chew through these operations, we can instead
drop the subsequent SYNCs and poll for completion of the first. This
has been shown to improve IO performance under pressure, where the
number of SYNC operations reduces by about a third:

	CMD_SYNCs reduced:	19542181
	CMD_SYNCs total:	58098548	(include reduced)
	CMDs total:		116197099	(TLBI:SYNC about 1:1)
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

57afa222

iommu/arm-smmu-v3: Fix a couple of minor comment typos · e4416a58

由 John Garry 提交于 8月 17, 2018

commit 657135f3108122556c3cf60a78c6f0e76aeb60e6 upstream.

Fix some comment typos spotted.
Signed-off-by: NJohn Garry <john.garry@huawei.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NZou Cao <zoucao@linux.alibaba.com>
Reviewed-by: NBaoyou Xie <xie.baoyou@linux.alibaba.com>

e4416a58

sched/psi: Correct overly pessimistic size calculation · eee77c97

由 Miles Chen 提交于 9月 12, 2019

commit 4adcdcea717cb2d8436bef00dd689aa5bc76f11b upstream.

When passing a equal or more then 32 bytes long string to psi_write(),
psi_write() copies 31 bytes to its buf and overwrites buf[30]
with '\0'. Which makes the input string 1 byte shorter than
it should be.

Fix it by copying sizeof(buf) bytes when nbytes >= sizeof(buf).

This does not cause problems in normal use case like:
"some 500000 10000000" or "full 500000 10000000" because they
are less than 32 bytes in length.

	/* assuming nbytes == 35 */
	char buf[32];

	buf_size = min(nbytes, (sizeof(buf) - 1)); /* buf_size = 31 */
	if (copy_from_user(buf, user_buf, buf_size))
		return -EFAULT;

	buf[buf_size - 1] = '\0'; /* buf[30] = '\0' */

Before:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [   22.473497] nbytes=35,buf_size=31
 [   22.473775] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [   64.916162] nbytes=32,buf_size=31
 [   64.916331] 123456789|123456789|123456789| (print 30 chars)
 %sh: write error: Invalid argument

After:

 %cd /proc/pressure/
 %echo "123456789|123456789|123456789|1234" > memory
 [  254.837863] nbytes=35,buf_size=32
 [  254.838541] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

 %echo "123456789|123456789|123456789|1" > memory
 [ 9965.714935] nbytes=32,buf_size=32
 [ 9965.715096] 123456789|123456789|123456789|1 (print 31 chars)
 %sh: write error: Invalid argument

Also remove the superfluous parentheses.
Signed-off-by: NMiles Chen <miles.chen@mediatek.com>
Cc: <linux-mediatek@lists.infradead.org>
Cc: <wsd_upstream@mediatek.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/20190912103452.13281-1-miles.chen@mediatek.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

eee77c97

sched/psi: Do not require setsched permission from the trigger creator · 39c75373

由 Suren Baghdasaryan 提交于 7月 29, 2019

commit 04e048cf09d7b5fc995817cdc5ae1acd4482429c upstream.

When a process creates a new trigger by writing into /proc/pressure/*
files, permissions to write such a file should be used to determine whether
the process is allowed to do so or not. Current implementation would also
require such a process to have setsched capability. Setting of psi trigger
thread's scheduling policy is an implementation detail and should not be
exposed to the user level. Remove the permission check by using _nocheck
version of the function.
Suggested-by: NNick Kralevich <nnk@google.com>
Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Cc: lizefan@huawei.com
Cc: mingo@redhat.com
Cc: akpm@linux-foundation.org
Cc: kernel-team@android.com
Cc: dennisszhou@gmail.com
Cc: dennis@kernel.org
Cc: hannes@cmpxchg.org
Cc: axboe@kernel.dk
Link: https://lkml.kernel.org/r/20190730013310.162367-1-surenb@google.comSigned-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

39c75373

sched/psi: Reduce psimon FIFO priority · a4a4fdd9

由 Peter Zijlstra 提交于 8月 01, 2019

commit 14f5c7b46a41a595fc61db37f55721714729e59e upstream.

PSI defaults to a FIFO-99 thread, reduce this to FIFO-1.

FIFO-99 is the very highest priority available to SCHED_FIFO and
it not a suitable default; it would indicate the psi work is the
most important work on the machine.

Since Real-Time tasks will have pre-allocated memory and locked it in
place, Real-Time tasks do not care about PSI. All it needs is to be
above OTHER.
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Tested-by: NSuren Baghdasaryan <surenb@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

a4a4fdd9

blk-cgroup: turn on psi memstall stuff · df56e1cf

由 Josef Bacik 提交于 7月 09, 2019

commit fd112c74652371a023f85d87b70bee7169e8f4d0 upstream.

With the psi stuff in place we can use the memstall flag to indicate
pressure that happens from throttling.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NTejun Heo <tj@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

df56e1cf

zswap: use movable memory if zpool support allocate movable memory · 149e0aae

由 Hui Zhu 提交于 9月 27, 2019

commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream

This is the third version that was updated according to the comments from
Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
https://lkml.org/lkml/2019/6/4/973

zswap compresses swap pages into a dynamically allocated RAM-based memory
pool.  The memory pool should be zbud, z3fold or zsmalloc.  All of them
will allocate unmovable pages.  It will increase the number of unmovable
page blocks that will bad for anti-fragment.

zsmalloc support page migration if request movable page:
        handle = zs_malloc(zram->mem_pool, comp_len,
                GFP_NOIO | __GFP_HIGHMEM |
                __GFP_MOVABLE);

And commit "zpool: Add malloc_support_movable to zpool_driver" add
zpool_malloc_support_movable check malloc_support_movable to make sure if
a zpool support allocate movable memory.

This commit let zswap allocate block with gfp
__GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.

Following part is test log in a pc that has 8G memory and 2G swap.

Without this commit:
~# echo lz4 > /sys/module/zswap/parameters/compressor
~# echo zsmalloc > /sys/module/zswap/parameters/zpool
~# echo 1 > /sys/module/zswap/parameters/enabled
~# swapon /swapfile
~# cd /home/teawater/kernel/vm-scalability/
/home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
/home/teawater/kernel/vm-scalability# ./case-anon-w-seq
2717908992 bytes / 4826062 usecs = 549973 KB/s
2717908992 bytes / 4864201 usecs = 545661 KB/s
2717908992 bytes / 4867015 usecs = 545346 KB/s
2717908992 bytes / 4915485 usecs = 539968 KB/s
397853 usecs to free memory
357820 usecs to free memory
421333 usecs to free memory
420454 usecs to free memory
/home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA            1            7            0            0            0            0
Node 0, zone    DMA32            4         1652            0            0            0            0
Node 0, zone   Normal          931         1485           15            0            0            0

With this commit:
~# echo lz4 > /sys/module/zswap/parameters/compressor
~# echo zsmalloc > /sys/module/zswap/parameters/zpool
~# echo 1 > /sys/module/zswap/parameters/enabled
~# swapon /swapfile
~# cd /home/teawater/kernel/vm-scalability/
/home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
/home/teawater/kernel/vm-scalability# ./case-anon-w-seq
2717908992 bytes / 4689240 usecs = 566020 KB/s
2717908992 bytes / 4760605 usecs = 557535 KB/s
2717908992 bytes / 4803621 usecs = 552543 KB/s
2717908992 bytes / 5069828 usecs = 523530 KB/s
431546 usecs to free memory
383397 usecs to free memory
456454 usecs to free memory
224487 usecs to free memory
/home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
Page block order: 9
Pages per block:  512

Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0

Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
Node 0, zone      DMA            1            7            0            0            0            0
Node 0, zone    DMA32            4         1650            2            0            0            0
Node 0, zone   Normal           79         2326           26            0            0            0

You can see that the number of unmovable page blocks is decreased
when the kernel has this commit.

Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

149e0aae

zpool: add malloc_support_movable to zpool_driver · d3fc8a97

由 Hui Zhu 提交于 9月 27, 2019

commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream

As a zpool_driver, zsmalloc can allocate movable memory because it support
migate pages.  But zbud and z3fold cannot allocate movable memory.

Add malloc_support_movable to zpool_driver.  If a zpool_driver support
allocate movable memory, set it to true.  And add
zpool_malloc_support_movable check malloc_support_movable to make sure if
a zpool support allocate movable memory.

Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Cc: Dan Streetman <ddstreet@ieee.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
Cc: Seth Jennings <sjenning@redhat.com>
Cc: Vitaly Wool <vitalywool@gmail.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

d3fc8a97

ip_sockglue: Fix missing-check bug in ip_ra_control() · 3560e0b3

由 Gen Zhang 提交于 5月 24, 2019

commit 425aa0e1d01513437668fa3d4a971168bbaa8515 upstream.

In function ip_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc(). And it is used in the following codes. However,
when  there is a memory allocation error, kmalloc() fails. Thus null
pointer dereference may happen. And it will cause the kernel to crash.
Therefore, we should check the return value and handle the error.
Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

3560e0b3

efi/x86/Add missing error handling to old_memmap 1:1 mapping code · ddb88d84

由 Gen Zhang 提交于 5月 25, 2019

commit 4e78921ba4dd0aca1cc89168f45039add4183f8e upstream.

The old_memmap flow in efi_call_phys_prolog() performs numerous memory
allocations, and either does not check for failure at all, or it does
but fails to propagate it back to the caller, which may end up calling
into the firmware with an incomplete 1:1 mapping.

So let's fix this by returning NULL from efi_call_phys_prolog() on
memory allocation failures only, and by handling this condition in the
caller. Also, clean up any half baked sets of page tables that we may
have created before returning with a NULL return value.

Note that any failure at this level will trigger a panic() two levels
up, so none of this makes a huge difference, but it is a nice cleanup
nonetheless.

[ardb: update commit log, add efi_call_phys_epilog() call on error path]
Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rob Bradford <robert.bradford@intel.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-efi@vger.kernel.org
Link: http://lkml.kernel.org/r/20190525112559.7917-2-ard.biesheuvel@linaro.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

ddb88d84

ipv6_sockglue: Fix a missing-check bug in ip6_ra_control() · 45d22128

由 Gen Zhang 提交于 5月 24, 2019

commit 95baa60a0da80a0143e3ddd4d3725758b4513825 upstream.

In function ip6_ra_control(), the pointer new_ra is allocated a memory
space via kmalloc(). And it is used in the following codes. However,
when there is a memory allocation error, kmalloc() fails. Thus null
pointer dereference may happen. And it will cause the kernel to crash.
Therefore, we should check the return value and handle the error.
Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

45d22128

scsi: mpt3sas_ctl: fix double-fetch bug in _ctl_ioctl_main() · 8d62aaac

由 Gen Zhang 提交于 5月 30, 2019

commit f9e3ebeea4521652318af903cddeaf033527e93e upstream.

In _ctl_ioctl_main(), 'ioctl_header' is fetched the first time from
userspace. 'ioctl_header.ioc_number' is then checked. The legal result is
saved to 'ioc'. Then, in condition MPT3COMMAND, the whole struct is fetched
again from the userspace. Then _ctl_do_mpt_command() is called, 'ioc' and
'karg' as inputs.

However, a malicious user can change the 'ioc_number' between the two
fetches, which will cause a potential security issues.  Moreover, a
malicious user can provide a valid 'ioc_number' to pass the check in first
fetch, and then modify it in the second fetch.

To fix this, we need to recheck the 'ioc_number' in the second fetch.
Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
Acked-by: NSuganath Prabu S <suganath-prabu.subramani@broadcom.com>
Signed-off-by: NMartin K. Petersen <martin.petersen@oracle.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8d62aaac

clk-sunxi: fix a missing-check bug in sunxi_divs_clk_setup() · ab73d500

由 Gen Zhang 提交于 5月 28, 2019

commit fcdf445ff42f036d22178b49cf64e92d527c1330 upstream.

In sunxi_divs_clk_setup(), 'derived_name' is allocated by kstrndup().
It returns NULL when fails. 'derived_name' should be checked.
Signed-off-by: NGen Zhang <blackgod016574@gmail.com>
Signed-off-by: NMaxime Ripard <maxime.ripard@bootlin.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

ab73d500

e1000e: increase pause and refresh time · 5cfae4b6

由 Miguel Bernal Marin 提交于 3月 27, 2017

commit f74dc880098b4a29f76d756b888fb31d81ad9a0c upstream.
Suggested-by: NTim Pepper <timothy.c.pepper@linux.intel.com>
Signed-off-by: NMiguel Bernal Marin <miguel.bernal.marin@linux.intel.com>
Signed-off-by: NPaul Menzel <pmenzel@molgen.mpg.de>
Acked-by: NSasha Neftin <sasha.neftin@intel.com>
Tested-by: NAaron Brown <aaron.f.brown@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

5cfae4b6

reduce e1000e boot time by tightening sleep ranges · e005918b

由 Arjan van de Ven 提交于 7月 25, 2016

commit ab6973aed6200510662856afce5e3d1e386b7b64 upstream.

The e1000e driver is a great user of the usleep_range() API,
and has any nice ranges that in principle help power management.

However the ranges that are used only during system startup are
very long (and can add easily 100 msec to the boot time) while
the power savings of such long ranges is irrelevant due to the
one-off, boot only, nature of these functions.

This patch shrinks some of the longest ranges to be shorter
(while still using a power friendly 1 msec range); this saves
100msec+ of boot time on my BDW NUCs
Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
Signed-off-by: NPaul Menzel <pmenzel@molgen.mpg.de>
Tested-by: NAaron Brown <aaron.f.brown@intel.com>
Signed-off-by: NJeff Kirsher <jeffrey.t.kirsher@intel.com>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

e005918b

psi: get poll_work to run when calling poll syscall next time · c91dc1e1

由 Jason Xing 提交于 7月 30, 2019

commit 7b2b55da1db10a5525460633ae4b6fb0be060c41 upstream.

Only when calling the poll syscall the first time can user
receive POLLPRI correctly. After that, user always fails to
acquire the event signal.

Reproduce case:
1. Get the monitor code in Documentation/accounting/psi.txt
2. Run it, and wait for the event triggered.
3. Kill and restart the process.

The question is why we can end up with poll_scheduled = 1 but the work
not running (which would reset it to 0). And the answer is because the
scheduling side sees group->poll_kworker under RCU protection and then
schedules it, but here we cancel the work and destroy the worker. The
cancel needs to pair with resetting the poll_scheduled flag.

Link: http://lkml.kernel.org/r/1566357985-97781-1-git-send-email-joseph.qi@linux.alibaba.comSigned-off-by: NJason Xing <kerneljasonxing@linux.alibaba.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: NSuren Baghdasaryan <surenb@google.com>
Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

c91dc1e1

ext4: fix bigalloc cluster freeing when hole punching under load · 9e24ae5c

由 Eric Whitney 提交于 2月 28, 2019

commit 7bd75230b43727b258a4f7a59d62114cffe1b6c8 upstream.

Ext4 may not free clusters correctly when punching holes in bigalloc
file systems under high load conditions. If it's not possible to
extend and restart the journal in ext4_ext_rm_leaf() when preparing to
remove blocks from a punched region, a retry of the entire punch
operation is triggered in ext4_ext_remove_space(). This causes a
partial cluster to be set to the first cluster in the extent found to
the right of the punched region. However, if the punch operation
prior to the retry had made enough progress to delete one or more
extents and a partial cluster candidate for freeing had already been
recorded, the retry would overwrite the partial cluster. The loss of
this information makes it impossible to correctly free the original
partial cluster in all cases.

This bug can cause generic/476 to fail when run as part of
xfstests-bld's bigalloc and bigalloc_1k test cases. The failure is
reported when e2fsck detects bad iblocks counts greater than expected
in units of whole clusters and also detects a number of negative block
bitmap differences equal to the iblocks discrepancy in cluster units.
Signed-off-by: NEric Whitney <enwlinux@gmail.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9e24ae5c

x86: uaccess: Inhibit speculation past access_ok() in user_access_begin() · 514f013a

由 Will Deacon 提交于 1月 19, 2019

commit 6e693b3ffecb0b478c7050b44a4842854154f715 upstream.

Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'")
makes the access_ok() check part of the user_access_begin() preceding a
series of 'unsafe' accesses.  This has the desirable effect of ensuring
that all 'unsafe' accesses have been range-checked, without having to
pick through all of the callsites to verify whether the appropriate
checking has been made.

However, the consolidated range check does not inhibit speculation, so
it is still up to the caller to ensure that they are not susceptible to
any speculative side-channel attacks for user addresses that ultimately
fail the access_ok() check.

This is an oversight, so use __uaccess_begin_nospec() to ensure that
speculation is inhibited until the access_ok() check has passed.
Reported-by: NJulien Thierry <julien.thierry@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

514f013a

make 'user_access_begin()' do 'access_ok()' · 83460ef1

由 Linus Torvalds 提交于 1月 04, 2019

commit 594cc251fdd0d231d342d88b2fdff4bc42fb0690 upstream.

Originally, the rule used to be that you'd have to do access_ok()
separately, and then user_access_begin() before actually doing the
direct (optimized) user access.

But experience has shown that people then decide not to do access_ok()
at all, and instead rely on it being implied by other operations or
similar.  Which makes it very hard to verify that the access has
actually been range-checked.

If you use the unsafe direct user accesses, hardware features (either
SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged
Access Never - on ARM) do force you to use user_access_begin().  But
nothing really forces the range check.

By putting the range check into user_access_begin(), we actually force
people to do the right thing (tm), and the range check vill be visible
near the actual accesses.  We have way too long a history of people
trying to avoid them.
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

[ Shile: fix following conflicts by adding a dummy arguments ]
Conflicts:
	kernel/compat.c
	kernel/exit.c
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>

83460ef1

i915: fix missing user_access_end() in page fault exception case · b3ebfe9a

由 Linus Torvalds 提交于 1月 04, 2019

commit 0b2c8f8b6b0c7530e2866c95862546d0da2057b0 upstream.

When commit fddcd00a49e9 ("drm/i915: Force the slow path after a
user-write error") unified the error handling for various user access
problems, it didn't do the user_access_end() that is needed for the
unsafe_put_user() case.

It's not a huge deal: a missed user_access_end() will only mean that
SMAP protection isn't active afterwards, and for the error case we'll be
returning to user mode soon enough anyway.  But it's wrong, and adding
the proper user_access_end() is trivial enough (and doing it for the
other error cases where it isn't needed doesn't hurt).

I noticed it while doing the same prep-work for changing
user_access_begin() that precipitated the access_ok() changes in commit
96d4f267e40f ("Remove 'type' argument from access_ok() function").

Fixes: fddcd00a49e9 ("drm/i915: Force the slow path after a user-write error")
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
Cc: stable@kernel.org # v4.20
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b3ebfe9a

drm/i915: Force the slow path after a user-write error · 69b1d0e6

由 Chris Wilson 提交于 9月 03, 2018

commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e upstream.

If we fail to write the user relocation back when it is changed, force
ourselves to take the slow relocation path where we can handle faults in
the write path. There is still an element of dubiousness as having
patched up the batch to use the correct offset, it no longer matches the
presumed_offset in the relocation, so a second pass may miss any changes
in layout.
Signed-off-by: NChris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: NJoonas Lahtinen <joonas.lahtinen@linux.intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.ukSigned-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

69b1d0e6

userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 56c691f6

由 Andrea Arcangeli 提交于 10月 26, 2018

commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.

get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
not be waiting for userfaults before failing and it would hit on a SIGBUS
instead.  Using get_user_pages_locked/unlocked instead will allow
get_mempolicy to allow userfaults to resolve the fault and fill the hole,
before grabbing the node id of the page.

If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
address inside an area managed by uffd and there is no page at that
address, the page allocation from within get_mempolicy() will fail
because get_user_pages() does not allow for page fault retry required
for uffd; the user will get SIGBUS.

With this patch, the page fault will be resolved by the uffd and the
get_mempolicy() will continue normally.

Background:

Via code review, previously the syscall would have returned -EFAULT
(vm_fault_to_errno), now it will block and wait for an userfault (if
it's waken before the fault is resolved it'll still -EFAULT).

This way get_mempolicy will give a chance to an "unaware" app to be
compliant with userfaults.

The reason this visible change is that becoming "userfault compliant"
cannot regress anything: all other syscalls including read(2)/write(2)
had to become "userfault compliant" long time ago (that's one of the
things userfaultfd can do that PROT_NONE and trapping segfaults can't).

So this is just one more syscall that become "userfault compliant" like
all other major ones already were.

This has been happening on virtio-bridge dpdk process which just called
get_mempolicy on the guest space post live migration, but before the
memory had a chance to be migrated to destination.

I didn't run an strace to be able to show the -EFAULT going away, but
I've the confirmation of the below debug aid information (only visible
with CONFIG_DEBUG_VM=y) going away with the patch:

    [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
    [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
    [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
    [20116.371466] Call Trace:
    [20116.371473]  dump_stack+0x5c/0x80
    [20116.371476]  handle_userfault.cold.37+0x1b/0x22
    [20116.371479]  ? remove_wait_queue+0x20/0x60
    [20116.371481]  ? poll_freewait+0x45/0xa0
    [20116.371483]  ? do_sys_poll+0x31c/0x520
    [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
    [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
    [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
    [20116.371493]  shmem_fault+0x78/0x1e0
    [20116.371495]  ? filemap_map_pages+0x3a1/0x450
    [20116.371498]  __do_fault+0x1f/0xc0
    [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
    [20116.371502]  handle_mm_fault+0xda/0x200
    [20116.371504]  __get_user_pages+0x238/0x790
    [20116.371506]  get_user_pages+0x3e/0x50
    [20116.371510]  kernel_get_mempolicy+0x40b/0x700
    [20116.371512]  ? vfs_write+0x170/0x1a0
    [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
    [20116.371517]  do_syscall_64+0x5b/0x160
    [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

The above harmless debug message (not a kernel crash, just a
dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
and improve kernel spots that may have to become "userfaultfd
compliant" like this one (without having to run an strace and search
for syscall misbehavior).  Spots like the above are more closer to a
kernel bug for the non-cooperative usages that Mike focuses on, than
for for dpdk qemu-cooperative usages that reproduced it, but it's still
nicer to get this fixed for dpdk too.

The part of the patch that caused me to think is only the
implementation issue of mpol_get, but it looks like it should work safe
no matter the kind of mempolicy structure that is (the default static
policy also starts at 1 so it'll go to 2 and back to 1 without crashing
everything at 0).

[rppt@linux.vnet.ibm.com: changelog addition]
  http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

56c691f6

alinux: random: speed up the initialization of module · 7afda44c

由 Xingjun Liu 提交于 6月 26, 2019

During the module initialization phase, entropy will be added
to entropy pool for every interrupt, the change should speed up
initialization of the random module.

Before optimization:
[   22.180236] random: crng init done

After optimization:
[    1.474832] random: crng init done
Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>

7afda44c

alinux: random: introduce the initialization seed · f6446129

由 Xingjun Liu 提交于 6月 24, 2019

Add random entropy with the module parameter as the initialization
seed when the kernel startup.

For guest OS working in VM, the random entropy will be less,
it cause the random module to initialize very slowly, and if
the application which running in guest os gets a certain amount of
random numbers in the initialization phase, it will be blocked.

This patch allows the VMM to provide a certain amount of random seed
when starting guest OS, speeding up the initialization of the entire
guest OS random module.

Before optimization:
[   22.180236] random: crng init done

After optimization:
[    1.553362] random: crng init done
Signed-off-by: NXingjun Liu <xingjun.lxj@alibaba-inc.com>
Reviewed-by: NLiu Jiang <gerry@linux.alibaba.com>
Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
Reviewed-by: Jia Zhang <zhang.jia@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>

f6446129

cpufreq/intel_pstate: Load only on Intel hardware · cedb3c3f

由 Borislav Petkov 提交于 7月 29, 2019

commit 4ab526468344c11d2d1807ae95feb1f5305dc014 upstream.

This driver is Intel-only so loading on anything which is not Intel is
pointless. Prevent it from doing so.

While at it, correct the "not supported" print statement to say CPU
"model" which is what that test does.

Fixes: 076b862c7e44 (cpufreq: intel_pstate: Add reasons for failure and debug messages)
Suggested-by: NErwan Velu <e.velu@criteo.com>
Signed-off-by: NBorislav Petkov <bp@suse.de>
Reviewed-by: NThomas Renninger <trenn@suse.de>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

cedb3c3f

cpufreq: intel_pstate: Add reasons for failure and debug messages · c34a3d19

由 Erwan Velu 提交于 7月 29, 2019

commit 076b862c7e4409d2dcacfda19f7eaf8d07ab9200 upstream.

The init code path has several exceptions where the driver can
decide not to load.

As CONFIG_X86_INTEL_PSTATE is generally set to Y, the return code is
not reachable.  The initialization code is neither verbose of the
reason why it did choose to prematurely exit, so it is difficult for
a user to determine, on a given platform, why the driver didn't load
properly.

This patch is about reporting to the user the reason/context of why
the driver failed to load.  That is a precious hint when debugging
a platform.
Signed-off-by: NErwan Velu <e.velu@criteo.com>
[ rjw: Subject & changelog, minor fixups ]
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

c34a3d19

cpufreq: intel_pstate: Force HWP min perf before offline · 270ab6d9

由 Srinivas Pandruvada 提交于 7月 29, 2019

commit af3b7379e2d709f2d7c6966b8a6f5ec6bd134241 upstream.

Force HWP Request MAX = HWP Request MIN = HWP Capability MIN and EPP to
0xFF. In this way the performance limits on the offlined CPU will not
influence performance limits on its sibling CPU, which is still online.

If the sibling CPU is calling for higher performance, it will impact the
max core performance. Here core performance will follow higher of the
performance requests from each sibling.
Reported-and-tested-by: NChen Yu <yu.c.chen@intel.com>
Signed-off-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NShanpei Chen <shanpeic@linux.alibaba.com>
Acked-by: NMichael Wang <yun.wang@linux.alibaba.com>

270ab6d9

dm: add missing trace_block_split() to __split_and_process_bio() · 2caf8302

由 Mike Snitzer 提交于 1月 18, 2019

commit 075c18c3e124a1511ebc10a89f1858c8a77dcb01 upstream.

Provides useful context about bio splits in blktrace.
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

2caf8302

dm: fix dm_wq_work() to only use __split_and_process_bio() if appropriate · df918dc8

由 Mike Snitzer 提交于 1月 17, 2019

commit 6548c7c538e5658cbce686c2dd1a9b4f5398bf34 upstream.

Otherwise targets that don't support/expect IO splitting could resubmit
bios using code paths with unnecessary IO splitting complexity.

Depends-on: 24113d487843 ("dm: avoid indirect call in __dm_make_request")
Fixes: 978e51ba ("dm: optimize bio-based NVMe IO submission")
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

df918dc8

dm: avoid indirect call in __dm_make_request · 39eb7095

由 Mikulas Patocka 提交于 11月 06, 2018

commit 24113d4878439baf1f23c1a33dfcc340fba66e97 upstream.

Indirect calls are inefficient because of retpolines that are used for
spectre workaround. This patch replaces an indirect call with a condition
(that can be predicted by the branch predictor).
Signed-off-by: NMikulas Patocka <mpatocka@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

39eb7095

dm: fix redundant IO accounting for bios that need splitting · 9ea8684c

由 Mike Snitzer 提交于 1月 17, 2019

commit a1e1cb72d96491277ede8d257ce6b48a381dd336 upstream.

[Joseph: cherry-pick part_stat_get() from commit 1226b8dd0e91 ("block:
switch to per-cpu in-flight counters") since we don't want the whole
patch series get involved.]

The risk of redundant IO accounting was not taken into consideration
when commit 18a25da8 ("dm: ensure bio submission follows a
depth-first tree walk") introduced IO splitting in terms of recursion
via generic_make_request().

Fix this by subtracting the split bio's payload from the IO stats that
were already accounted for by start_io_acct() upon dm_make_request()
entry.  This repeat oscillation of the IO accounting, up then down,
isn't ideal but refactoring DM core's IO splitting to pre-split bios
_before_ they are accounted turned out to be an excessive amount of
change that will need a full development cycle to refine and verify.

Before this fix:

  /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so
  bios are split on 32k boundaries.

  # fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
    	--iodepth=1 --ioengine=libaio --direct=1 --refill_buffers

  with debugging added:
  [103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
  [103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
  [103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
  ...

  16M written yet 136M (278528 * 512b) accounted:
  # cat /sys/block/dm-2/stat | awk '{ print $7 }'
  278528

After this fix:

  16M written and 16M (32768 * 512b) accounted:
  # cat /sys/block/dm-2/stat | awk '{ print $7 }'
  32768

Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: NBryan Gurney <bgurney@redhat.com>
Reviewed-by: NMing Lei <ming.lei@redhat.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

9ea8684c

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功