提交 · 068400dfd4e2b957384e283387b754eab1294479 · openeuler / Kernel

22 9月, 2020 40 次提交

nvme: fix memory leak caused by incorrect subsystem free · 068400df

由 Logan Gunthorpe 提交于 9月 22, 2020

mainline inclusion
from mainline-5.3-rc2
commit e654dfd3
category: bugfix
bugzilla: 19960
CVE: NA
---------------------------

When freeing the subsystem after finding another match with
__nvme_find_get_subsystem(), use put_device() instead of
__nvme_release_subsystem() which calls kfree() directly.

Per the documentation, put_device() should always be used
after device_initialization() is called. Otherwise, leaks
like the one below which was detected by kmemleak may occur.

Once the call of __nvme_release_subsystem() is removed it no
longer makes sense to keep the helper, so fold it back
into nvme_release_subsystem().

unreferenced object 0xffff8883d12bfbc0 (size 16):
  comm "nvme", pid 2635, jiffies 4294933602 (age 739.952s)
  hex dump (first 16 bytes):
    6e 76 6d 65 2d 73 75 62 73 79 73 32 00 88 ff ff  nvme-subsys2....
  backtrace:
    [<000000007d8fc208>] __kmalloc_track_caller+0x16d/0x2a0
    [<0000000081169e5f>] kvasprintf+0xad/0x130
    [<0000000025626f25>] kvasprintf_const+0x47/0x120
    [<00000000fa66ad36>] kobject_set_name_vargs+0x44/0x120
    [<000000004881f8b3>] dev_set_name+0x98/0xc0
    [<000000007124dae3>] nvme_init_identify+0x1995/0x38e0
    [<000000009315020a>] nvme_loop_configure_admin_queue+0x4fa/0x5e0
    [<000000001a63e766>] nvme_loop_create_ctrl+0x489/0xf80
    [<00000000a46ecc23>] nvmf_dev_write+0x1a12/0x2220
    [<000000002259b3d5>] __vfs_write+0x66/0x120
    [<000000002f6df81e>] vfs_write+0x154/0x490
    [<000000007e8cfc19>] ksys_write+0x10a/0x240
    [<00000000ff5c7b85>] __x64_sys_write+0x73/0xb0
    [<00000000fee6d692>] do_syscall_64+0xaa/0x470
    [<00000000997e1ede>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: ab9e00cc ("nvme: track subsystems")
Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSun Ke <sunke32@huawei.com>
Reviewed-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

068400df

nvme: fix possible deadlock when nvme_update_formats fails · 066b9c4c

由 Sagi Grimberg 提交于 9月 22, 2020

mainline inclusion
from mainline-5.4-rc4
commit 6abff1b9
category: bugfix
bugzilla: 24170
CVE: NA
---------------------------

nvme_update_formats may fail to revalidate the namespace and
attempt to remove the namespace. This may lead to a deadlock
as nvme_ns_remove will attempt to acquire the subsystem lock
which is already acquired by the passthru command with effects.

Move the invalid namepsace removal to after the passthru command
releases the subsystem lock.
Reported-by: NJudy Brock <judy.brock@samsung.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NSun Ke <sunke32@huawei.com>
Reviewed-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

066b9c4c

dm verity: don't prefetch hash blocks for already-verified data · 12f27b72

由 xianrong.zhou 提交于 9月 22, 2020

mainline inclusion
from mainline-5.6-rc1
commit 0a531c5a
category: bugfix
bugzilla: 29444
CVE: NA
---------------------------

Try to skip prefetching hash blocks that won't be needed due to the
"check_at_most_once" option being enabled and the corresponding data
blocks already having been verified.

Since prefetching operates on a range of data blocks, do this by just
trimming the two ends of the range.  This doesn't skip every unneeded
hash block, since data blocks in the middle of the range could also be
unneeded, and hash blocks are still prefetched in large clusters as
controlled by dm_verity_prefetch_cluster.  But it can still help a lot.

In a test on Android Q launching 91 apps every 15s repeated 21 times,
prefetching was only done for 447177/4776629 = 9.36% of data blocks.
Tested-by: Nruxian.feng <ruxian.feng@transsion.com>
Co-developed-by: Nyuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: Nyuanjiong.gao <yuanjiong.gao@transsion.com>
Signed-off-by: Nxianrong.zhou <xianrong.zhou@transsion.com>
[EB: simplified the 'while' loops and improved the commit message]
Signed-off-by: NEric Biggers <ebiggers@google.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NSun Ke <sunke32@huawei.com>
Reviewed-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

12f27b72

arm64: kprobes: Recover pstate.D in single-step exception handler · 9bcec7a7

由 Masami Hiramatsu 提交于 9月 22, 2020

mainline inclusion
from mainline-5.3-rc3
commit b3980e48
category: bugfix
bugzilla: 20080
CVE: NA

-------------------------------------------------

kprobes manipulates the interrupted PSTATE for single step, and
doesn't restore it. Thus, if we put a kprobe where the pstate.D
(debug) masked, the mask will be cleared after the kprobe hits.

Moreover, in the most complicated case, this can lead a kernel
crash with below message when a nested kprobe hits.

[  152.118921] Unexpected kernel single-step exception at EL1

When the 1st kprobe hits, do_debug_exception() will be called.
At this point, debug exception (= pstate.D) must be masked (=1).
But if another kprobes hits before single-step of the first kprobe
(e.g. inside user pre_handler), it unmask the debug exception
(pstate.D = 0) and return.
Then, when the 1st kprobe setting up single-step, it saves current
DAIF, mask DAIF, enable single-step, and restore DAIF.
However, since "D" flag in DAIF is cleared by the 2nd kprobe, the
single-step exception happens soon after restoring DAIF.

This has been introduced by commit 7419333f ("arm64: kprobe:
Always clear pstate.D in breakpoint exception handler")

To solve this issue, this stores all DAIF bits and restore it
after single stepping.
Reported-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
Fixes: 7419333f ("arm64: kprobe: Always clear pstate.D in breakpoint exception handler")
Reviewed-by: NJames Morse <james.morse@arm.com>
Tested-by: NJames Morse <james.morse@arm.com>
Signed-off-by: NMasami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: NWill Deacon <will@kernel.org>
Signed-off-by: NWei Li <liwei391@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

9bcec7a7

nbd: fix possible page fault for nbd disk · 86d4595d

由 Xiubo Li 提交于 9月 22, 2020

mainline inclusion
from mainline-5.4-rc1
commit 8454d685
category: bugfix
bugzilla: 23253
CVE: NA
---------------------------

When the NBD_CFLAG_DESTROY_ON_DISCONNECT flag is set and at the same
time when the socket is closed due to the server daemon is restarted,
just before the last DISCONNET is totally done if we start a new connection
by using the old nbd_index, there will be crashing randomly, like:

<3>[  110.151949] block nbd1: Receive control failed (result -32)
<1>[  110.152024] BUG: unable to handle page fault for address: 0000058000000840
<1>[  110.152063] #PF: supervisor read access in kernel mode
<1>[  110.152083] #PF: error_code(0x0000) - not-present page
<6>[  110.152094] PGD 0 P4D 0
<4>[  110.152106] Oops: 0000 [#1] SMP PTI
<4>[  110.152120] CPU: 0 PID: 6698 Comm: kworker/u5:1 Kdump: loaded Not tainted 5.3.0-rc4+ #2
<4>[  110.152136] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
<4>[  110.152166] Workqueue: knbd-recv recv_work [nbd]
<4>[  110.152187] RIP: 0010:__dev_printk+0xd/0x67
<4>[  110.152206] Code: 10 e8 c5 fd ff ff 48 8b 4c 24 18 65 48 33 0c 25 28 00 [...]
<4>[  110.152244] RSP: 0018:ffffa41581f13d18 EFLAGS: 00010206
<4>[  110.152256] RAX: ffffa41581f13d30 RBX: ffff96dd7374e900 RCX: 0000000000000000
<4>[  110.152271] RDX: ffffa41581f13d20 RSI: 00000580000007f0 RDI: ffffffff970ec24f
<4>[  110.152285] RBP: ffffa41581f13d80 R08: ffff96dd7fc17908 R09: 0000000000002e56
<4>[  110.152299] R10: ffffffff970ec24f R11: 0000000000000003 R12: ffff96dd7374e900
<4>[  110.152313] R13: 0000000000000000 R14: ffff96dd7374e9d8 R15: ffff96dd6e3b02c8
<4>[  110.152329] FS:  0000000000000000(0000) GS:ffff96dd7fc00000(0000) knlGS:0000000000000000
<4>[  110.152362] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[  110.152383] CR2: 0000058000000840 CR3: 0000000067cc6002 CR4: 00000000001606f0
<4>[  110.152401] Call Trace:
<4>[  110.152422]  _dev_err+0x6c/0x83
<4>[  110.152435]  nbd_read_stat.cold+0xda/0x578 [nbd]
<4>[  110.152448]  ? __switch_to_asm+0x34/0x70
<4>[  110.152468]  ? __switch_to_asm+0x40/0x70
<4>[  110.152478]  ? __switch_to_asm+0x34/0x70
<4>[  110.152491]  ? __switch_to_asm+0x40/0x70
<4>[  110.152501]  ? __switch_to_asm+0x34/0x70
<4>[  110.152511]  ? __switch_to_asm+0x40/0x70
<4>[  110.152522]  ? __switch_to_asm+0x34/0x70
<4>[  110.152533]  recv_work+0x35/0x9e [nbd]
<4>[  110.152547]  process_one_work+0x19d/0x340
<4>[  110.152558]  worker_thread+0x50/0x3b0
<4>[  110.152568]  kthread+0xfb/0x130
<4>[  110.152577]  ? process_one_work+0x340/0x340
<4>[  110.152609]  ? kthread_park+0x80/0x80
<4>[  110.152637]  ret_from_fork+0x35/0x40

This is very easy to reproduce by running the nbd-runner.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NXiubo Li <xiubli@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSunKe <sunke32@huawei.com>
Reviewed-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

86d4595d

nbd: rename the runtime flags as NBD_RT_ prefixed · 1d6a11af

由 Xiubo Li 提交于 9月 22, 2020

mainline inclusion
from mainline-5.4-rc1
commit ec76a7b9
category: bugfix
bugzilla: 23253
CVE: NA
---------------------------

Preparing for the destory when disconnecting crash fixing.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NXiubo Li <xiubli@redhat.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NSun Ke <sunke32@huawei.com>
Reviewed-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

1d6a11af

jbd2: flush_descriptor(): Do not decrease buffer head's ref count · 16b5dfc8

由 Chandan Rajendra 提交于 9月 22, 2020

mainline inclusion
from mainline-5.4-rc1
commit 547b9ad6
category: bugfix
bugzilla: 23039
CVE: NA
---------------------------

When executing generic/388 on a ppc64le machine, we notice the following
call trace,

VFS: brelse: Trying to free free buffer
WARNING: CPU: 0 PID: 6637 at /root/repos/linux/fs/buffer.c:1195 __brelse+0x84/0xc0

Call Trace:
 __brelse+0x80/0xc0 (unreliable)
 invalidate_bh_lru+0x78/0xc0
 on_each_cpu_mask+0xa8/0x130
 on_each_cpu_cond_mask+0x130/0x170
 invalidate_bh_lrus+0x44/0x60
 invalidate_bdev+0x38/0x70
 ext4_put_super+0x294/0x560
 generic_shutdown_super+0xb0/0x170
 kill_block_super+0x38/0xb0
 deactivate_locked_super+0xa4/0xf0
 cleanup_mnt+0x164/0x1d0
 task_work_run+0x110/0x160
 do_notify_resume+0x414/0x460
 ret_from_except_lite+0x70/0x74

The warning happens because flush_descriptor() drops bh reference it
does not own. The bh reference acquired by
jbd2_journal_get_descriptor_buffer() is owned by the log_bufs list and
gets released when this list is processed. The reference for doing IO is
only acquired in write_dirty_buffer() later in flush_descriptor().
Reported-by: NHarish Sriram <harish@linux.ibm.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NChandan Rajendra <chandan@linux.ibm.com>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: NZhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

16b5dfc8

Revert "dm crypt: use WQ_HIGHPRI for the IO and crypt workqueues" · 659d4a14

由 Mike Snitzer 提交于 9月 22, 2020

mainline inclusion
from mainline-5.5-rc1
commit f612b213
category: bugfix
bugzilla: 25149
CVE: NA
---------------------------

This reverts commit a1b89132.

Revert required hand-patching due to subsequent changes that were
applied since commit a1b89132.

Requires: ed0302e8 ("dm crypt: make workqueue names device-specific")
Cc: stable@vger.kernel.org
Bug: https://bugzilla.kernel.org/show_bug.cgi?id=199857Reported-by: NVito Caputo <vcaputo@pengaru.com>
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NSun Ke <sunke32@huawei.com>
Reviewed-by: NHou Tao <houtao1@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

659d4a14

ACPICA: Win OSL: Replace get_tick_count with get_tick_count64 · fc11de88

由 Bob Moore 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.5-rc1
commit 197aba20
category: bugfix
bugzilla: 25975
CVE: NA

-------------------------------------------------

ACPICA commit 7bc16c650317001bc82d4bae227b888a49c51f5e

Avoid possible overflow from get_tick_count. Also, cast math
using ACPI_100NSEC_PER_MSEC to uint64.

Link: https://github.com/acpica/acpica/commit/7bc16c65Signed-off-by: NBob Moore <robert.moore@intel.com>
Signed-off-by: NErik Schmauss <erik.schmauss@intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

fc11de88

ext4: avoid fetching btime in ext4_getattr() unless requested · bfdd7679

由 Theodore Ts'o 提交于 9月 22, 2020

mainline inclusion
from mainline-5.6-rc1
commit d4c5e960
category: bugfix
bugzilla: 29909
CVE: NA
---------------------------

Linus observed that an allmodconfig build which does a lot of stat(2)
calls that ext4_getattr() was a noticeable (1%) amount of CPU time,
due to the cache line for i_extra_isize getting pulled in.  Since the
normal stat system call doesn't return btime, it's a complete waste.
So only calculate btime when it is explicitly requested.

[ Fixed to check against request_mask instead of query_flags. ]

Link: https://lore.kernel.org/r/CAHk-=wivmk_j6KbTX+Er64mLrG8abXZo0M10PNdAnHc8fWXfsQ@mail.gmail.comReported-by: NLinus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: NAndreas Dilger <adilger@dilger.ca>
Signed-off-by: NTheodore Ts'o <tytso@mit.edu>
Signed-off-by: Nzhangyi (F) <yi.zhang@huawei.com>
Reviewed-by: NYang Erkun <yangerkun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

bfdd7679

mm: pagewalk: fix termination condition in walk_pte_range() · 7035b83e

由 Steven Price 提交于 9月 22, 2020

mainline inclusion
from mainline-5.6-rc1
commit c02a9875
category: bugfix
bugzilla: 30160
CVE: NA

-------------------------------------------------
If walk_pte_range() is called with a 'end' argument that is beyond the
last page of memory (e.g.  ~0UL) then the comparison between 'addr' and
'end' will always fail and the loop will be infinite.  Instead change the
comparison to >= while accounting for overflow.

Link: http://lkml.kernel.org/r/20191218162402.45610-15-steven.price@arm.comSigned-off-by: NSteven Price <steven.price@arm.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexandre Ghiti <alex@ghiti.fr>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: James Hogan <jhogan@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: "Liang, Kan" <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Paul Burton <paul.burton@mips.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zong Li <zong.li@sifive.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit c02a9875)
Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

7035b83e

mm/huge_memory.c: use head to check huge zero page · 6881cad7

由 Wei Yang 提交于 9月 22, 2020

mainline inclusion
from mainline-5.6-rc1
commit cb829624
category: bugfix
bugzilla: 29957
CVE: NA

-------------------------------------------------
The page could be a tail page, if this is the case, this BUG_ON will
never be triggered.

Link: http://lkml.kernel.org/r/20200110032610.26499-1-richardw.yang@linux.intel.com
Fixes: e9b61f19 ("thp: reintroduce split_huge_page()")
Signed-off-by: NWei Yang <richardw.yang@linux.intel.com>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit cb829624)
Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

6881cad7

mm/page-writeback.c: improve arithmetic divisions · 23a9944d

由 Wen Yang 提交于 9月 22, 2020

mainline inclusion
from mainline-5.5-rc7
commit 0a5d1a7f
category: bugfix
bugzilla: 28317
CVE: NA

-------------------------------------------------
Use div64_ul() instead of do_div() if the divisor is unsigned long, to
avoid truncation to 32-bit on 64-bit platforms.

Link: http://lkml.kernel.org/r/20200102081442.8273-4-wenyang@linux.alibaba.comSigned-off-by: NWen Yang <wenyang@linux.alibaba.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit 0a5d1a7f)
Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

23a9944d

mm/page-writeback.c: use div64_ul() for u64-by-unsigned-long divide · 3b6cc741

由 Wen Yang 提交于 9月 22, 2020

mainline inclusion
from mainline-5.5-rc7
commit d3ac946e
category: bugfix
bugzilla: 28317
CVE: NA

-------------------------------------------------
The two variables 'numerator' and 'denominator', though they are
declared as long, they should actually be unsigned long (according to
the implementation of the fprop_fraction_percpu() function)

And do_div() does a 64-by-32 division, while the divisor 'denominator'
is unsigned long, thus 64-bit on 64-bit platforms.  Hence the proper
function to call is div64_ul().

Link: http://lkml.kernel.org/r/20200102081442.8273-3-wenyang@linux.alibaba.comSigned-off-by: NWen Yang <wenyang@linux.alibaba.com>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Qian Cai <cai@lca.pw>
Cc: Tejun Heo <tj@kernel.org>
Cc: Jens Axboe <axboe@kernel.dk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
(cherry picked from commit d3ac946e)
Signed-off-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: NJing Xiangfeng <jingxiangfeng@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

3b6cc741

PCI: PM/ACPI: Refresh all stale power state data in pci_pm_complete() · 6e50c9b5

由 Rafael J. Wysocki 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.3-rc1
commit b51033e0
category: bugfix
bugzilla: 17644
CVE: NA

-------------------------------------------------

In pci_pm_complete() there are checks to decide whether or not to
resume devices that were left in runtime-suspend during the preceding
system-wide transition into a sleep state.  They involve checking the
current power state of the device and comparing it with the power
state of it set before the preceding system-wide transition, but the
platform component of the device's power state is not handled
correctly in there.

Namely, on platforms with ACPI, the device power state information
needs to be updated with care, so that the reference counters of
power resources used by the device (if any) are set to ensure that
the refreshed power state of it will be maintained going forward.

To that end, introduce a new ->refresh_state() platform PM callback
for PCI devices, for asking the platform to refresh the device power
state data and ensure that the corresponding power state will be
maintained going forward, make it invoke acpi_device_update_power()
(for devices with ACPI PM) on platforms with ACPI and make
pci_pm_complete() use it, through a new pci_refresh_power_state()
wrapper function.

Fixes: a0d2a959 (PCI: Avoid unnecessary resume after direct-complete)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>

Conflicts:
	drivers/pci/pci.c
[wangxiongfeng: fix a little conflicts]
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

6e50c9b5

ACPI: PM: Fix regression in acpi_device_set_power() · d9cf2cac

由 Rafael J. Wysocki 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.3-rc3
commit 42787ed7
category: bugfix
bugzilla: 17645
CVE: NA

-------------------------------------------------

Commit f850a48a ("ACPI: PM: Allow transitions to D0 to occur in
special cases") overlooked the fact that acpi_power_transition() may
change the power.state value for the target device and if that
happens, it may confuse acpi_device_set_power() and cause it to
omit the _PS0 evaluation which on some systems is necessary to
change power states of devices from low-power to D0.

Fix that by saving the current value of power.state for the
target device before passing it to acpi_power_transition() and
using the saved value in a subsequent check.

Fixes: f850a48a ("ACPI: PM: Allow transitions to D0 to occur in special cases")
Reported-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
Reported-by: NMario Limonciello <mario.limonciello@dell.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Tested-by: NKai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: NMario Limonciello <mario.limonciello@dell.com>
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

d9cf2cac

ACPI: PM: Allow transitions to D0 to occur in special cases · d4fee400

由 Rafael J. Wysocki 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.3-rc1
commit f850a48a
category: bugfix
bugzilla: 17645
CVE: NA

-------------------------------------------------

If a device with ACPI PM is left in D0 during a system-wide
transition to the S3 (suspend-to-RAM) or S4 (hibernation) sleep
state, the actual state of the device need not be D0 during resume
from it, although its power.state value will still reflect D0 (that
is, the power state from before the system-wide transition).

In that case, the acpi_device_set_power() call made to ensure that
the power state of the device will be D0 going forward has no effect,
because the new state (D0) is equal to the one reflected by the
device's power.state value.  That does not affect power resources,
which are taken care of by acpi_resume_power_resources() called from
acpi_pm_finish() during resume from system-wide sleep states, but it
still may be necessary to invoke _PS0 for the device on top of that
in order to finalize its transition to D0.

For this reason, modify acpi_device_set_power() to allow transitions
to D0 to occur even if D0 is the current power state of the device
according to its power.state value.

That will not affect power resources, which are assumed to be in
the right configuration already (as reflected by the current values
of their reference counters), but it may cause _PS0 to be evaluated
for the device.  However, evaluating _PS0 for a device already in D0
may lead to confusion in general, so invoke _PSC (if present) to
check the device's current power state upfront and only evaluate
_PS0 for it if _PSC has returned a power state different from D0.
[If _PSC is not present or the evaluation of it fails, the power
state of the device is assumed to be D0 at this point.]

Fixes: 20dacb71 (ACPI / PM: Rework device power management to follow ACPI 6)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

d4fee400

ACPI: PM: Avoid evaluating _PS3 on transitions from D3hot to D3cold · ebbaf956

由 Rafael J. Wysocki 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.3-rc1
commit 21ba2379
category: bugfix
bugzilla: NA
CVE: NA

-------------------------------------------------

If the power state of a device with ACPI PM is changed from D3hot to
D3cold, it merely is a matter of dropping references to additional
power resources (specifically, those in the list returned by _PR3),
and the _PS3 method should not be invoked for the device then (as
it has already been evaluated during the previous transition to
D3hot).

Fixes: 20dacb71 (ACPI / PM: Rework device power management to follow ACPI 6)
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: NMika Westerberg <mika.westerberg@linux.intel.com>
Signed-off-by: NXiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

ebbaf956

iommu/arm-smmu: Mark expected switch fall-through · f3d941e8

由 Anders Roxell 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.3-rc5
commit 11f4fe9b
category: bugfix
bugzilla: 20500
CVE: NA

-------------------------------------------------------------------------

Now that -Wimplicit-fallthrough is passed to GCC by default, the
following warning shows up:

../drivers/iommu/arm-smmu-v3.c: In function ‘arm_smmu_write_strtab_ent’:
../drivers/iommu/arm-smmu-v3.c:1189:7: warning: this statement may fall
 through [-Wimplicit-fallthrough=]
    if (disable_bypass)
       ^
../drivers/iommu/arm-smmu-v3.c:1191:3: note: here
   default:
   ^~~~~~~

Rework so that the compiler doesn't warn about fall-through. Make it
clearer by calling 'BUG_ON()' when disable_bypass is set, and always
'break;'
Signed-off-by: NAnders Roxell <anders.roxell@linaro.org>
Acked-by: NWill Deacon <will@kernel.org>
Signed-off-by: NJoerg Roedel <jroedel@suse.de>
Signed-off-by: NZhen Lei <thunder.leizhen@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

f3d941e8

efi/memreserve: Register reservations as 'reserved' in /proc/iomem · db59945b

由 Ard Biesheuvel 提交于 9月 22, 2020

mainline inclusion
from mainline-5.5-rc3
commit ab0eb162
category: bugfix
bugzilla: 27656
CVE: NA

-------------------------------------------------

Memory regions that are reserved using efi_mem_reserve_persistent()
are recorded in a special EFI config table which survives kexec,
allowing the incoming kernel to honour them as well. However,
such reservations are not visible in /proc/iomem, and so the kexec
tools that load the incoming kernel and its initrd into memory may
overwrite these reserved regions before the incoming kernel has a
chance to reserve them from further use.

Address this problem by adding these reservations to /proc/iomem as
they are created. Note that reservations that are inherited from a
previous kernel are memblock_reserve()'d early on, so they are already
visible in /proc/iomem.
Tested-by: NMasayoshi Mizuma <m.mizuma@jp.fujitsu.com>
Tested-by: NBhupesh Sharma <bhsharma@redhat.com>
Signed-off-by: NArd Biesheuvel <ardb@kernel.org>
Reviewed-by: NBhupesh Sharma <bhsharma@redhat.com>
Cc: <stable@vger.kernel.org> # v5.4+
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: linux-efi@vger.kernel.org
Link: https://lkml.kernel.org/r/20191206165542.31469-2-ardb@kernel.orgSigned-off-by: NIngo Molnar <mingo@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

db59945b

compat_ioctl: handle SIOCOUTQNSD · 5b093c61

由 Arnd Bergmann 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.5-rc1
commit 9d7bf41f
category: bugfix
bugzilla: 26804
CVE: NA

-------------------------------------

Unlike the normal SIOCOUTQ, SIOCOUTQNSD was never handled in compat
mode. Add it to the common socket compat handler along with similar
ones.

Fixes: 2f4e1b39 ("tcp: ioctl type SIOCOUTQNSD returns amount of data not sent")
Cc: Eric Dumazet <edumazet@google.com>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: Nguodeqing <geffrey.guo@huawei.com>
Reviewed-by: NWenan Mao <maowenan@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

5b093c61

mm: slub: fix conversion of freelist_corrupted() · 1c6abe16

由 Eugeniu Rosca 提交于 9月 22, 2020

mainline inclusion
from mainline-v5.9-rc4
commit dc07a728
category: bugfix
bugzilla: NA
CVE: NA

-------------------------------------------------

Commit 52f23478 ("mm/slub.c: fix corrupted freechain in
deactivate_slab()") suffered an update when picked up from LKML [1].

Specifically, relocating 'freelist = NULL' into 'freelist_corrupted()'
created a no-op statement.  Fix it by sticking to the behavior intended
in the original patch [1].  In addition, make freelist_corrupted()
immune to passing NULL instead of &freelist.

The issue has been spotted via static analysis and code review.

[1] https://lore.kernel.org/linux-mm/20200331031450.12182-1-dongli.zhang@oracle.com/

Fixes: 52f23478 ("mm/slub.c: fix corrupted freechain in deactivate_slab()")
Signed-off-by: NEugeniu Rosca <erosca@de.adit-jv.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Joe Jin <joe.jin@oracle.com>
Cc: Christoph Lameter <cl@linux.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: <stable@vger.kernel.org>
Link: https://lkml.kernel.org/r/20200824130643.10291-1-erosca@de.adit-jv.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Reviewed-by: NKefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

1c6abe16

khugepaged: retract_page_tables() remember to test exit · 768c38b6

由 Hugh Dickins 提交于 9月 22, 2020

stable inclusion
from linux-4.19.141
commit 2406c45db3de8da8052eccb1dcfde69ea5988b19

--------------------------------

commit 18e77600 upstream.

Only once have I seen this scenario (and forgot even to notice what forced
the eventual crash): a sequence of "BUG: Bad page map" alerts from
vm_normal_page(), from zap_pte_range() servicing exit_mmap();
pmd:00000000, pte values corresponding to data in physical page 0.

The pte mappings being zapped in this case were supposed to be from a huge
page of ext4 text (but could as well have been shmem): my belief is that
it was racing with collapse_file()'s retract_page_tables(), found *pmd
pointing to a page table, locked it, but *pmd had become 0 by the time
start_pte was decided.

In most cases, that possibility is excluded by holding mmap lock; but
exit_mmap() proceeds without mmap lock.  Most of what's run by khugepaged
checks khugepaged_test_exit() after acquiring mmap lock:
khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
for example.  But retract_page_tables() did not: fix that.

The fix is for retract_page_tables() to check khugepaged_test_exit(),
after acquiring mmap lock, before doing anything to the page table.
Getting the mmap lock serializes with __mmput(), which briefly takes and
drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
mm_users makes sure we don't touch the page table once exit_mmap() might
reach it, since exit_mmap() will be proceeding without mmap lock, not
expecting anyone to be racing with it.

Fixes: f3f0e1d2 ("khugepaged: add support of collapse for tmpfs/shmem pages")
Signed-off-by: NHugh Dickins <hughd@google.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Song Liu <songliubraving@fb.com>
Cc: <stable@vger.kernel.org>	[4.8+]
Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvilsSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

768c38b6

kprobes: Fix NULL pointer dereference at kprobe_ftrace_handler · bf46a479

由 Muchun Song 提交于 9月 22, 2020

stable inclusion
from linux-4.19.141
commit 46c9d3925ab0ceb4d19cee4be1a061b87faf1e11

--------------------------------

commit 0cb2f137 upstream.

We found a case of kernel panic on our server. The stack trace is as
follows(omit some irrelevant information):

  BUG: kernel NULL pointer dereference, address: 0000000000000080
  RIP: 0010:kprobe_ftrace_handler+0x5e/0xe0
  RSP: 0018:ffffb512c6550998 EFLAGS: 00010282
  RAX: 0000000000000000 RBX: ffff8e9d16eea018 RCX: 0000000000000000
  RDX: ffffffffbe1179c0 RSI: ffffffffc0535564 RDI: ffffffffc0534ec0
  RBP: ffffffffc0534ec1 R08: ffff8e9d1bbb0f00 R09: 0000000000000004
  R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
  R13: ffff8e9d1f797060 R14: 000000000000bacc R15: ffff8e9ce13eca00
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 0000000000000080 CR3: 00000008453d0005 CR4: 00000000003606e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <IRQ>
   ftrace_ops_assist_func+0x56/0xe0
   ftrace_call+0x5/0x34
   tcpa_statistic_send+0x5/0x130 [ttcp_engine]

The tcpa_statistic_send is the function being kprobed. After analysis,
the root cause is that the fourth parameter regs of kprobe_ftrace_handler
is NULL. Why regs is NULL? We use the crash tool to analyze the kdump.

  crash> dis tcpa_statistic_send -r
         <tcpa_statistic_send>: callq 0xffffffffbd8018c0 <ftrace_caller>

The tcpa_statistic_send calls ftrace_caller instead of ftrace_regs_caller.
So it is reasonable that the fourth parameter regs of kprobe_ftrace_handler
is NULL. In theory, we should call the ftrace_regs_caller instead of the
ftrace_caller. After in-depth analysis, we found a reproducible path.

  Writing a simple kernel module which starts a periodic timer. The
  timer's handler is named 'kprobe_test_timer_handler'. The module
  name is kprobe_test.ko.

  1) insmod kprobe_test.ko
  2) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'
  3) echo 0 > /proc/sys/kernel/ftrace_enabled
  4) rmmod kprobe_test
  5) stop step 2) kprobe
  6) insmod kprobe_test.ko
  7) bpftrace -e 'kretprobe:kprobe_test_timer_handler {}'

We mark the kprobe as GONE but not disarm the kprobe in the step 4).
The step 5) also do not disarm the kprobe when unregister kprobe. So
we do not remove the ip from the filter. In this case, when the module
loads again in the step 6), we will replace the code to ftrace_caller
via the ftrace_module_enable(). When we register kprobe again, we will
not replace ftrace_caller to ftrace_regs_caller because the ftrace is
disabled in the step 3). So the step 7) will trigger kernel panic. Fix
this problem by disarming the kprobe when the module is going away.

Link: https://lkml.kernel.org/r/20200728064536.24405-1-songmuchun@bytedance.com

Cc: stable@vger.kernel.org
Fixes: ae6aa16f ("kprobes: introduce ftrace based optimization")
Acked-by: NMasami Hiramatsu <mhiramat@kernel.org>
Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
Co-developed-by: NChengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: NChengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

bf46a479

ftrace: Setup correct FTRACE_FL_REGS flags for module · ef544e15

由 Chengming Zhou 提交于 9月 22, 2020

stable inclusion
from linux-4.19.141
commit 892fd3637a2c0c29d3dd7402bbb0802eb231b69d

--------------------------------

commit 8a224ffb upstream.

When module loaded and enabled, we will use __ftrace_replace_code
for module if any ftrace_ops referenced it found. But we will get
wrong ftrace_addr for module rec in ftrace_get_addr_new, because
rec->flags has not been setup correctly. It can cause the callback
function of a ftrace_ops has FTRACE_OPS_FL_SAVE_REGS to be called
with pt_regs set to NULL.
So setup correct FTRACE_FL_REGS flags for rec when we call
referenced_filters to find ftrace_ops references it.

Link: https://lkml.kernel.org/r/20200728180554.65203-1-zhouchengming@bytedance.com

Cc: stable@vger.kernel.org
Fixes: 8c4f3c3f ("ftrace: Check module functions being traced on reload")
Signed-off-by: NChengming Zhou <zhouchengming@bytedance.com>
Signed-off-by: NMuchun Song <songmuchun@bytedance.com>
Signed-off-by: NSteven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

ef544e15

mm/page_counter.c: fix protection usage propagation · 58c0de9b

由 Michal Koutný 提交于 9月 22, 2020

stable inclusion
from linux-4.19.141
commit e88a72e86bd0c0c19ba00865ed68dce22824aac6

--------------------------------

commit a6f23d14 upstream.

When workload runs in cgroups that aren't directly below root cgroup and
their parent specifies reclaim protection, it may end up ineffective.

The reason is that propagate_protected_usage() is not called in all
hierarchy up.  All the protected usage is incorrectly accumulated in the
workload's parent.  This means that siblings_low_usage is overestimated
and effective protection underestimated.  Even though it is transitional
phenomenon (uncharge path does correct propagation and fixes the wrong
children_low_usage), it can undermine the intended protection
unexpectedly.

We have noticed this problem while seeing a swap out in a descendant of a
protected memcg (intermediate node) while the parent was conveniently
under its protection limit and the memory pressure was external to that
hierarchy.  Michal has pinpointed this down to the wrong
siblings_low_usage which led to the unwanted reclaim.

The fix is simply updating children_low_usage in respective ancestors also
in the charging path.

Fixes: 23067153 ("mm: memory.low hierarchical behavior")
Signed-off-by: NMichal Koutný <mkoutny@suse.com>
Signed-off-by: NMichal Hocko <mhocko@suse.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Acked-by: NMichal Hocko <mhocko@suse.com>
Acked-by: NRoman Gushchin <guro@fb.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>	[4.18+]
Link: http://lkml.kernel.org/r/20200803153231.15477-1-mhocko@kernel.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

58c0de9b

driver core: Avoid binding drivers to dead devices · df4bb640

由 Lukas Wunner 提交于 9月 22, 2020

stable inclusion
from linux-4.19.141
commit 706695d477fb16a5920098535380ee39337e7ea8

--------------------------------

commit 65488832 upstream.

Commit 3451a495 ("driver core: Establish order of operations for
device_add and device_del via bitflag") sought to prevent asynchronous
driver binding to a device which is being removed.  It added a
per-device "dead" flag which is checked in the following code paths:

* asynchronous binding in __driver_attach_async_helper()
*  synchronous binding in device_driver_attach()
* asynchronous binding in __device_attach_async_helper()

It did *not* check the flag upon:

*  synchronous binding in __device_attach()

However __device_attach() may also be called asynchronously from:

deferred_probe_work_func()
  bus_probe_device()
    device_initial_probe()
      __device_attach()

So if the commit's intention was to check the "dead" flag in all
asynchronous code paths, then a check is also necessary in
__device_attach().  Add the missing check.

Fixes: 3451a495 ("driver core: Establish order of operations for device_add and device_del via bitflag")
Signed-off-by: NLukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v5.1+
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Link: https://lore.kernel.org/r/de88a23a6fe0ef70f7cfd13c8aea9ab51b4edab6.1594214103.git.lukas@wunner.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

df4bb640

genirq/affinity: Make affinity setting if activated opt-in · 7cf94405

由 Thomas Gleixner 提交于 9月 22, 2020

stable inclusion
from linux-4.19.141
commit 5c4d9eefd314e763dcb2a499797176c17ad6ab69

--------------------------------

commit f0c7baca upstream.

John reported that on a RK3288 system the perf per CPU interrupts are all
affine to CPU0 and provided the analysis:

 "It looks like what happens is that because the interrupts are not per-CPU
  in the hardware, armpmu_request_irq() calls irq_force_affinity() while
  the interrupt is deactivated and then request_irq() with IRQF_PERCPU |
  IRQF_NOBALANCING.

  Now when irq_startup() runs with IRQ_STARTUP_NORMAL, it calls
  irq_setup_affinity() which returns early because IRQF_PERCPU and
  IRQF_NOBALANCING are set, leaving the interrupt on its original CPU."

This was broken by the recent commit which blocked interrupt affinity
setting in hardware before activation of the interrupt. While this works in
general, it does not work for this particular case. As contrary to the
initial analysis not all interrupt chip drivers implement an activate
callback, the safe cure is to make the deferred interrupt affinity setting
at activation time opt-in.

Implement the necessary core logic and make the two irqchip implementations
for which this is required opt-in. In hindsight this would have been the
right thing to do, but ...

Fixes: baedb87d ("genirq/affinity: Handle affinity setting on inactive interrupts correctly")
Reported-by: NJohn Keeping <john@metanate.com>
Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
Tested-by: NMarc Zyngier <maz@kernel.org>
Acked-by: NMarc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/87blk4tzgm.fsf@nanos.tec.linutronix.deSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

7cf94405

mm/mmap.c: Add cond_resched() for exit_mmap() CPU stalls · c5570966

由 Paul E. McKenney 提交于 9月 22, 2020

stable inclusion
from linux-4.19.140
commit abfa9c47ece7c712d3fab494d1771c8f15bba8fa

--------------------------------

[ Upstream commit 0a3b3c25 ]

A large process running on a heavily loaded system can encounter the
following RCU CPU stall warning:

  rcu: INFO: rcu_sched self-detected stall on CPU
  rcu: 	3-....: (20998 ticks this GP) idle=4ea/1/0x4000000000000002 softirq=556558/556558 fqs=5190
  	(t=21013 jiffies g=1005461 q=132576)
  NMI backtrace for cpu 3
  CPU: 3 PID: 501900 Comm: aio-free-ring-w Kdump: loaded Not tainted 5.2.9-108_fbk12_rc3_3858_gb83b75af7909 #1
  Hardware name: Wiwynn   HoneyBadger/PantherPlus, BIOS HBM6.71 02/03/2016
  Call Trace:
   <IRQ>
   dump_stack+0x46/0x60
   nmi_cpu_backtrace.cold.3+0x13/0x50
   ? lapic_can_unplug_cpu.cold.27+0x34/0x34
   nmi_trigger_cpumask_backtrace+0xba/0xca
   rcu_dump_cpu_stacks+0x99/0xc7
   rcu_sched_clock_irq.cold.87+0x1aa/0x397
   ? tick_sched_do_timer+0x60/0x60
   update_process_times+0x28/0x60
   tick_sched_timer+0x37/0x70
   __hrtimer_run_queues+0xfe/0x270
   hrtimer_interrupt+0xf4/0x210
   smp_apic_timer_interrupt+0x5e/0x120
   apic_timer_interrupt+0xf/0x20
   </IRQ>
  RIP: 0010:kmem_cache_free+0x223/0x300
  Code: 88 00 00 00 0f 85 ca 00 00 00 41 8b 55 18 31 f6 f7 da 41 f6 45 0a 02 40 0f 94 c6 83 c6 05 9c 41 5e fa e8 a0 a7 01 00 41 56 9d <49> 8b 47 08 a8 03 0f 85 87 00 00 00 65 48 ff 08 e9 3d fe ff ff 65
  RSP: 0018:ffffc9000e8e3da8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
  RAX: 0000000000020000 RBX: ffff88861b9de960 RCX: 0000000000000030
  RDX: fffffffffffe41e8 RSI: 000060777fe3a100 RDI: 000000000001be18
  RBP: ffffea00186e7780 R08: ffffffffffffffff R09: ffffffffffffffff
  R10: ffff88861b9dea28 R11: ffff88887ffde000 R12: ffffffff81230a1f
  R13: ffff888854684dc0 R14: 0000000000000206 R15: ffff8888547dbc00
   ? remove_vma+0x4f/0x60
   remove_vma+0x4f/0x60
   exit_mmap+0xd6/0x160
   mmput+0x4a/0x110
   do_exit+0x278/0xae0
   ? syscall_trace_enter+0x1d3/0x2b0
   ? handle_mm_fault+0xaa/0x1c0
   do_group_exit+0x3a/0xa0
   __x64_sys_exit_group+0x14/0x20
   do_syscall_64+0x42/0x100
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

And on a PREEMPT=n kernel, the "while (vma)" loop in exit_mmap() can run
for a very long time given a large process.  This commit therefore adds
a cond_resched() to this loop, providing RCU any needed quiescent states.

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-mm@kvack.org>
Reviewed-by: NShakeel Butt <shakeelb@google.com>
Reviewed-by: NJoel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: NPaul E. McKenney <paulmck@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

c5570966

sched: correct SD_flags returned by tl->sd_flags() · 54e08f07

由 Peng Liu 提交于 9月 22, 2020

stable inclusion
from linux-4.19.140
commit 008560eb23a8ad242c15cb90478c428a5a69f546

--------------------------------

[ Upstream commit 9b1b234b ]

During sched domain init, we check whether non-topological SD_flags are
returned by tl->sd_flags(), if found, fire a waning and correct the
violation, but the code failed to correct the violation. Correct this.

Fixes: 143e1e28 ("sched: Rework sched_domain topology definition")
Signed-off-by: NPeng Liu <iwtbavbm@gmail.com>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NVincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: NValentin Schneider <valentin.schneider@arm.com>
Link: https://lkml.kernel.org/r/20200609150936.GA13060@iZj6chx1xj0e0buvshuecpZSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

54e08f07

sched/fair: Fix NOHZ next idle balance · 7dad97ba

由 Vincent Guittot 提交于 9月 22, 2020

stable inclusion
from linux-4.19.140
commit 519252e38ca1e1d614b11f982b1f489d62c135ee

--------------------------------

[ Upstream commit 3ea2f097 ]

With commit:
  'b7031a02 ("sched/fair: Add NOHZ_STATS_KICK")'
rebalance_domains of the local cfs_rq happens before others idle cpus have
updated nohz.next_balance and its value is overwritten.

Move the update of nohz.next_balance for other idles cpus before balancing
and updating the next_balance of local cfs_rq.

Also, the nohz.next_balance is now updated only if all idle cpus got a
chance to rebalance their domains and the idle balance has not been aborted
because of new activities on the CPU. In case of need_resched, the idle
load balance will be kick the next jiffie in order to address remaining
ilb.

Fixes: b7031a02 ("sched/fair: Add NOHZ_STATS_KICK")
Reported-by: NPeng Liu <iwtbavbm@gmail.com>
Signed-off-by: NVincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: NValentin Schneider <valentin.schneider@arm.com>
Acked-by: NMel Gorman <mgorman@suse.de>
Link: https://lkml.kernel.org/r/20200609123748.18636-1-vincent.guittot@linaro.orgSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

7dad97ba

xattr: break delegations in {set, remove}xattr · 9f6e50a6

由 Frank van der Linden 提交于 9月 22, 2020

stable inclusion
from linux-4.19.139
commit fabe9d6cc1663deafba59499829e0c4c28971345

--------------------------------

commit 08b5d501 upstream.

set/removexattr on an exported filesystem should break NFS delegations.
This is true in general, but also for the upcoming support for
RFC 8726 (NFSv4 extended attribute support). Make sure that they do.

Additionally, they need to grow a _locked variant, since callers might
call this with i_rwsem held (like the NFS server code).

Cc: stable@vger.kernel.org # v4.9+
Cc: linux-fsdevel@vger.kernel.org
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NFrank van der Linden <fllinden@amazon.com>
Signed-off-by: NChuck Lever <chuck.lever@oracle.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

9f6e50a6

firmware: Fix a reference count leak. · 2c7e1495

由 Qiushi Wu 提交于 9月 22, 2020

stable inclusion
from linux-4.19.139
commit 937dafe8682044e70821c886d9063869744b3057

--------------------------------

[ Upstream commit fe3c6068 ]

kobject_init_and_add() takes reference even when it fails.
If this function returns an error, kobject_put() must be called to
properly clean up the memory associated with the object.
Callback function fw_cfg_sysfs_release_entry() in kobject_put()
can handle the pointer "entry" properly.
Signed-off-by: NQiushi Wu <wu000273@umn.edu>
Link: https://lore.kernel.org/r/20200613190533.15712-1-wu000273@umn.eduSigned-off-by: NMichael S. Tsirkin <mst@redhat.com>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

2c7e1495

ext4: fix direct I/O read error · 126a0166

由 Jiang Ying 提交于 9月 22, 2020

stable inclusion
from linux-4.19.138
commit aa0962310814356b10a973c578cfe595d89a3e39

--------------------------------

This patch is used to fix ext4 direct I/O read error when
the read size is not aligned with block size.

Then, I will use a test to explain the error.

(1) Make a file that is not aligned with block size:
	$dd if=/dev/zero of=./test.jar bs=1000 count=3

(2) I wrote a source file named "direct_io_read_file.c" as following:

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/file.h>
	#include <sys/types.h>
	#include <sys/stat.h>
	#include <string.h>
	#define BUF_SIZE 1024

	int main()
	{
		int fd;
		int ret;

		unsigned char *buf;
		ret = posix_memalign((void **)&buf, 512, BUF_SIZE);
		if (ret) {
			perror("posix_memalign failed");
			exit(1);
		}
		fd = open("./test.jar", O_RDONLY | O_DIRECT, 0755);
		if (fd < 0){
			perror("open ./test.jar failed");
			exit(1);
		}

		do {
			ret = read(fd, buf, BUF_SIZE);
			printf("ret=%d\n",ret);
			if (ret < 0) {
				perror("write test.jar failed");
			}
		} while (ret > 0);

		free(buf);
		close(fd);
	}

(3) Compile the source file:
	$gcc direct_io_read_file.c -D_GNU_SOURCE

(4) Run the test program:
	$./a.out

	The result is as following:
	ret=1024
	ret=1024
	ret=952
	ret=-1
	write test.jar failed: Invalid argument.

I have tested this program on XFS filesystem, XFS does not have
this problem, because XFS use iomap_dio_rw() to do direct I/O
read. And the comparing between read offset and file size is done
in iomap_dio_rw(), the code is as following:

	if (pos < size) {
		retval = filemap_write_and_wait_range(mapping, pos,
				pos + iov_length(iov, nr_segs) - 1);

		if (!retval) {
			retval = mapping->a_ops->direct_IO(READ, iocb,
						iov, pos, nr_segs);
		}
		...
	}

...only when "pos < size", direct I/O can be done, or 0 will be return.

I have tested the fix patch on Ext4, it is up to the mustard of
EINVAL in man2(read) as following:
	#include <unistd.h>
	ssize_t read(int fd, void *buf, size_t count);

	EINVAL
		fd is attached to an object which is unsuitable for reading;
		or the file was opened with the O_DIRECT flag, and either the
		address specified in buf, the value specified in count, or the
		current file offset is not suitably aligned.

So I think this patch can be applied to fix ext4 direct I/O error.

However Ext4 introduces direct I/O read using iomap infrastructure
on kernel 5.5, the patch is commit <b1b4705d>
("ext4: introduce direct I/O read using iomap infrastructure"),
then Ext4 will be the same as XFS, they all use iomap_dio_rw() to do direct
I/O read. So this problem does not exist on kernel 5.5 for Ext4.

>From above description, we can see this problem exists on all the kernel
versions between kernel 3.14 and kernel 5.4. It will cause the Applications
to fail to read. For example, when the search service downloads a new full
index file, the search engine is loading the previous index file and is
processing the search request, it can not use buffer io that may squeeze
the previous index file in use from pagecache, so the serch service must
use direct I/O read.

Please apply this patch on these kernel versions, or please use the method
on kernel 5.5 to fix this problem.

Fixes: 9fe55eea ("Fix race when checking i_size on direct i/o read")
Reviewed-by: NJan Kara <jack@suse.cz>
Co-developed-by: NWang Long <wanglong19@meituan.com>
Signed-off-by: NWang Long <wanglong19@meituan.com>
Signed-off-by: NJiang Ying <jiangying8582@126.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

126a0166

arm64: csum: Fix handling of bad packets · 83d7b1b8

由 Robin Murphy 提交于 9月 22, 2020

stable inclusion
from linux-4.19.137
commit 0be9b57b5bb5a1e643624ea68decc4a8a14ffda6

--------------------------------

[ Upstream commit 05fb3dbd ]

Although iph is expected to point to at least 20 bytes of valid memory,
ihl may be bogus, for example on reception of a corrupt packet. If it
happens to be less than 5, we really don't want to run away and
dereference 16GB worth of memory until it wraps back to exactly zero...

Fixes: 0e455d8e ("arm64: Implement optimised IP checksum helpers")
Reported-by: Nguodeqing <geffrey.guo@huawei.com>
Signed-off-by: NRobin Murphy <robin.murphy@arm.com>
Signed-off-by: NWill Deacon <will@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

83d7b1b8

arm64/alternatives: move length validation inside the subsection · fa2a5217

由 Sami Tolvanen 提交于 9月 22, 2020

stable inclusion
from linux-4.19.137
commit 53f941777b9df64fa70fad775efabfcfb5db92c4

--------------------------------

[ Upstream commit 966a0acc ]

Commit f7b93d42 ("arm64/alternatives: use subsections for replacement
sequences") breaks LLVM's integrated assembler, because due to its
one-pass design, it cannot compute instruction sequence lengths before the
layout for the subsection has been finalized. This change fixes the build
by moving the .org directives inside the subsection, so they are processed
after the subsection layout is known.

Fixes: f7b93d42 ("arm64/alternatives: use subsections for replacement sequences")
Signed-off-by: NSami Tolvanen <samitolvanen@google.com>
Link: https://github.com/ClangBuiltLinux/linux/issues/1078
Link: https://lore.kernel.org/r/20200730153701.3892953-1-samitolvanen@google.comSigned-off-by: NWill Deacon <will@kernel.org>
Signed-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

fa2a5217

bpf: Fix map leak in HASH_OF_MAPS map · 4aab68e5

由 Andrii Nakryiko 提交于 9月 22, 2020

stable inclusion
from linux-4.19.137
commit 634d42cadc4771fbe3b70e0fa8b82b334fd41959

--------------------------------

[ Upstream commit 1d4e1eab ]

Fix HASH_OF_MAPS bug of not putting inner map pointer on bpf_map_elem_update()
operation. This is due to per-cpu extra_elems optimization, which bypassed
free_htab_elem() logic doing proper clean ups. Make sure that inner map is put
properly in optimized case as well.

Fixes: 8c290e60 ("bpf: fix hashmap extra_elems logic")
Signed-off-by: NAndrii Nakryiko <andriin@fb.com>
Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
Acked-by: NSong Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200729040913.2815687-1-andriin@fb.comSigned-off-by: NSasha Levin <sashal@kernel.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

4aab68e5

dm integrity: fix integrity recalculation that is improperly skipped · c476abe3

由 Mikulas Patocka 提交于 9月 22, 2020

stable inclusion
from linux-4.19.135
commit cab7ef0066401efb5ab2dea1d85f490ce9c626f3

--------------------------------

commit 5df96f2b upstream.

Commit adc0daad ("dm: report suspended
device during destroy") broke integrity recalculation.

The problem is dm_suspended() returns true not only during suspend,
but also during resume. So this race condition could occur:
1. dm_integrity_resume calls queue_work(ic->recalc_wq, &ic->recalc_work)
2. integrity_recalc (&ic->recalc_work) preempts the current thread
3. integrity_recalc calls if (unlikely(dm_suspended(ic->ti))) goto unlock_ret;
4. integrity_recalc exits and no recalculating is done.

To fix this race condition, add a function dm_post_suspending that is
only true during the postsuspend phase and use it instead of
dm_suspended().

Signed-off-by: Mikulas Patocka <mpatocka redhat com>
Fixes: adc0daad ("dm: report suspended device during destroy")
Cc: stable vger kernel org # v4.18+
Signed-off-by: NMike Snitzer <snitzer@redhat.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

c476abe3

io-mapping: indicate mapping failure · d47cb8b7

由 Michael J. Ruhl 提交于 9月 22, 2020

stable inclusion
from linux-4.19.135
commit 4daa403143be80d373f520029a5220602db2e4cd

--------------------------------

commit e0b3e0b1 upstream.

The !ATOMIC_IOMAP version of io_maping_init_wc will always return
success, even when the ioremap fails.

Since the ATOMIC_IOMAP version returns NULL when the init fails, and
callers check for a NULL return on error this is unexpected.

During a device probe, where the ioremap failed, a crash can look like
this:

    BUG: unable to handle page fault for address: 0000000000210000
     #PF: supervisor write access in kernel mode
     #PF: error_code(0x0002) - not-present page
     Oops: 0002 [#1] PREEMPT SMP
     CPU: 0 PID: 177 Comm:
     RIP: 0010:fill_page_dma [i915]
       gen8_ppgtt_create [i915]
       i915_ppgtt_create [i915]
       intel_gt_init [i915]
       i915_gem_init [i915]
       i915_driver_probe [i915]
       pci_device_probe
       really_probe
       driver_probe_device

The remap failure occurred much earlier in the probe.  If it had been
propagated, the driver would have exited with an error.

Return NULL on ioremap failure.

[akpm@linux-foundation.org: detect ioremap_wc() errors earlier]

Fixes: cafaf14a ("io-mapping: Always create a struct to hold metadata about the io-mapping")
Signed-off-by: NMichael J. Ruhl <michael.j.ruhl@intel.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: <stable@vger.kernel.org>
Link: http://lkml.kernel.org/r/20200721171936.81563-1-michael.j.ruhl@intel.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

d47cb8b7

vt: Reject zero-sized screen buffer size. · d876ff28

由 Tetsuo Handa 提交于 9月 22, 2020

stable inclusion
from linux-4.19.135
commit 74752b81eae8ae64e97de222320026367e92c4b5

--------------------------------

commit ce684552 upstream.

syzbot is reporting general protection fault in do_con_write() [1] caused
by vc->vc_screenbuf == ZERO_SIZE_PTR caused by vc->vc_screenbuf_size == 0
caused by vc->vc_cols == vc->vc_rows == vc->vc_size_row == 0 caused by
fb_set_var() from ioctl(FBIOPUT_VSCREENINFO) on /dev/fb0 , for
gotoxy(vc, 0, 0) from reset_terminal() from vc_init() from vc_allocate()
 from con_install() from tty_init_dev() from tty_open() on such console
causes vc->vc_pos == 0x10000000e due to
((unsigned long) ZERO_SIZE_PTR) + -1U * 0 + (-1U << 1).

I don't think that a console with 0 column or 0 row makes sense. And it
seems that vc_do_resize() does not intend to allow resizing a console to
0 column or 0 row due to

  new_cols = (cols ? cols : vc->vc_cols);
  new_rows = (lines ? lines : vc->vc_rows);

exception.

Theoretically, cols and rows can be any range as long as
0 < cols * rows * 2 <= KMALLOC_MAX_SIZE is satisfied (e.g.
cols == 1048576 && rows == 2 is possible) because of

  vc->vc_size_row = vc->vc_cols << 1;
  vc->vc_screenbuf_size = vc->vc_rows * vc->vc_size_row;

in visual_init() and kzalloc(vc->vc_screenbuf_size) in vc_allocate().

Since we can detect cols == 0 or rows == 0 via screenbuf_size = 0 in
visual_init(), we can reject kzalloc(0). Then, vc_allocate() will return
an error, and con_write() will not be called on a console with 0 column
or 0 row.

We need to make sure that integer overflow in visual_init() won't happen.
Since vc_do_resize() restricts cols <= 32767 and rows <= 32767, applying
1 <= cols <= 32767 and 1 <= rows <= 32767 restrictions to vc_allocate()
will be practically fine.

This patch does not touch con_init(), for returning -EINVAL there
does not help when we are not returning -ENOMEM.

[1] https://syzkaller.appspot.com/bug?extid=017265e8553724e514e8Reported-and-tested-by: Nsyzbot <syzbot+017265e8553724e514e8@syzkaller.appspotmail.com>
Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: stable <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20200712111013.11881-1-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>

d876ff28

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功