1. 29 Aug 2022, 2 commits
  2. 24 Aug 2022, 3 commits
  3. 23 Aug 2022, 1 commit
    • net/mlx5: Avoid false positive lockdep warning by adding lock_class_key · d59b73a6
      Moshe Shemesh authored
      Add a lock_class_key per mlx5 device to avoid a false positive
      "possible circular locking dependency" warning by lockdep, on flows
      which lock more than one mlx5 device, such as adding SF.
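
      The general pattern is roughly the following (a sketch only; the struct
      and function names are illustrative, not necessarily the exact mlx5
      ones):

        #include <linux/mutex.h>
        #include <linux/lockdep.h>

        struct mlx5_dev_ctx {
                struct mutex intf_state_mutex;
                struct lock_class_key lock_key;   /* one lockdep key per device */
        };

        static void dev_locks_init(struct mlx5_dev_ctx *dev)
        {
                mutex_init(&dev->intf_state_mutex);
                /* give this device's mutex its own lockdep class */
                lockdep_register_key(&dev->lock_key);
                lockdep_set_class(&dev->intf_state_mutex, &dev->lock_key);
        }

        static void dev_locks_cleanup(struct mlx5_dev_ctx *dev)
        {
                mutex_destroy(&dev->intf_state_mutex);
                lockdep_unregister_key(&dev->lock_key);
        }

      With a separate class per device instance, lockdep no longer treats the
      parent PF's intf_state_mutex and the SF child's intf_state_mutex as the
      same lock, so the nesting shown in the log below is not reported as a
      circular dependency.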
      
      kernel log:
       ======================================================
       WARNING: possible circular locking dependency detected
       5.19.0-rc8+ #2 Not tainted
       ------------------------------------------------------
       kworker/u20:0/8 is trying to acquire lock:
       ffff88812dfe0d98 (&dev->intf_state_mutex){+.+.}-{3:3}, at: mlx5_init_one+0x2e/0x490 [mlx5_core]
      
       but task is already holding lock:
       ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (&(&notifier->n_head)->rwsem){++++}-{3:3}:
              down_write+0x90/0x150
              blocking_notifier_chain_register+0x53/0xa0
              mlx5_sf_table_init+0x369/0x4a0 [mlx5_core]
              mlx5_init_one+0x261/0x490 [mlx5_core]
              probe_one+0x430/0x680 [mlx5_core]
              local_pci_probe+0xd6/0x170
              work_for_cpu_fn+0x4e/0xa0
              process_one_work+0x7c2/0x1340
              worker_thread+0x6f6/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
       -> #0 (&dev->intf_state_mutex){+.+.}-{3:3}:
              __lock_acquire+0x2fc7/0x6720
              lock_acquire+0x1c1/0x550
              __mutex_lock+0x12c/0x14b0
              mlx5_init_one+0x2e/0x490 [mlx5_core]
              mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
              auxiliary_bus_probe+0x9d/0xe0
              really_probe+0x1e0/0xaa0
              __driver_probe_device+0x219/0x480
              driver_probe_device+0x49/0x130
              __device_attach_driver+0x1b8/0x280
              bus_for_each_drv+0x123/0x1a0
              __device_attach+0x1a3/0x460
              bus_probe_device+0x1a2/0x260
              device_add+0x9b1/0x1b40
              __auxiliary_device_add+0x88/0xc0
              mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
              blocking_notifier_call_chain+0xd5/0x130
              mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
              process_one_work+0x7c2/0x1340
              worker_thread+0x59d/0xec0
              kthread+0x28f/0x330
              ret_from_fork+0x1f/0x30
      
        other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(&(&notifier->n_head)->rwsem);
                                      lock(&dev->intf_state_mutex);
                                      lock(&(&notifier->n_head)->rwsem);
         lock(&dev->intf_state_mutex);
      
        *** DEADLOCK ***
      
       4 locks held by kworker/u20:0/8:
        #0: ffff888150612938 ((wq_completion)mlx5_events){+.+.}-{0:0}, at: process_one_work+0x6e2/0x1340
        #1: ffff888100cafdb8 ((work_completion)(&work->work)#3){+.+.}-{0:0}, at: process_one_work+0x70f/0x1340
        #2: ffff888101aa7898 (&(&notifier->n_head)->rwsem){++++}-{3:3}, at: blocking_notifier_call_chain+0x5a/0x130
        #3: ffff88813682d0e8 (&dev->mutex){....}-{3:3}, at:__device_attach+0x76/0x460
      
       stack backtrace:
       CPU: 6 PID: 8 Comm: kworker/u20:0 Not tainted 5.19.0-rc8+
       Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
       Workqueue: mlx5_events mlx5_vhca_state_work_handler [mlx5_core]
       Call Trace:
        <TASK>
        dump_stack_lvl+0x57/0x7d
        check_noncircular+0x278/0x300
        ? print_circular_bug+0x460/0x460
        ? lock_chain_count+0x20/0x20
        ? register_lock_class+0x1880/0x1880
        __lock_acquire+0x2fc7/0x6720
        ? register_lock_class+0x1880/0x1880
        ? register_lock_class+0x1880/0x1880
        lock_acquire+0x1c1/0x550
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        __mutex_lock+0x12c/0x14b0
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? mlx5_init_one+0x2e/0x490 [mlx5_core]
        ? _raw_read_unlock+0x1f/0x30
        ? mutex_lock_io_nested+0x1320/0x1320
        ? __ioremap_caller.constprop.0+0x306/0x490
        ? mlx5_sf_dev_probe+0x269/0x370 [mlx5_core]
        ? iounmap+0x160/0x160
        mlx5_init_one+0x2e/0x490 [mlx5_core]
        mlx5_sf_dev_probe+0x29c/0x370 [mlx5_core]
        ? mlx5_sf_dev_remove+0x130/0x130 [mlx5_core]
        auxiliary_bus_probe+0x9d/0xe0
        really_probe+0x1e0/0xaa0
        __driver_probe_device+0x219/0x480
        ? auxiliary_match_id+0xe9/0x140
        driver_probe_device+0x49/0x130
        __device_attach_driver+0x1b8/0x280
        ? driver_allows_async_probing+0x140/0x140
        bus_for_each_drv+0x123/0x1a0
        ? bus_for_each_dev+0x1a0/0x1a0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        ? trace_hardirqs_on+0x2d/0x100
        __device_attach+0x1a3/0x460
        ? device_driver_attach+0x1e0/0x1e0
        ? kobject_uevent_env+0x22d/0xf10
        bus_probe_device+0x1a2/0x260
        device_add+0x9b1/0x1b40
        ? dev_set_name+0xab/0xe0
        ? __fw_devlink_link_to_suppliers+0x260/0x260
        ? memset+0x20/0x40
        ? lockdep_init_map_type+0x21a/0x7d0
        __auxiliary_device_add+0x88/0xc0
        ? auxiliary_device_init+0x86/0xa0
        mlx5_sf_dev_state_change_handler+0x67e/0x9d0 [mlx5_core]
        blocking_notifier_call_chain+0xd5/0x130
        mlx5_vhca_state_work_handler+0x2b0/0x3f0 [mlx5_core]
        ? mlx5_vhca_event_arm+0x100/0x100 [mlx5_core]
        ? lock_downgrade+0x6e0/0x6e0
        ? lockdep_hardirqs_on_prepare+0x286/0x400
        process_one_work+0x7c2/0x1340
        ? lockdep_hardirqs_on_prepare+0x400/0x400
        ? pwq_dec_nr_in_flight+0x230/0x230
        ? rwlock_bug.part.0+0x90/0x90
        worker_thread+0x59d/0xec0
        ? process_one_work+0x1340/0x1340
        kthread+0x28f/0x330
        ? kthread_complete_and_exit+0x20/0x20
        ret_from_fork+0x1f/0x30
        </TASK>
      
      Fixes: 6a327321 ("net/mlx5: SF, Port function state change support")
      Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
      Reviewed-by: Shay Drory <shayd@nvidia.com>
      Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
      d59b73a6
  4. 21 Aug 2022, 5 commits
    • mm/shmem: fix chattr fsflags support in tmpfs · cb241339
      Hugh Dickins authored
      ext[234] have always allowed unimplemented chattr flags to be set, but
      other filesystems have tended to be stricter.  Follow the stricter
      approach for tmpfs: I don't want to have to explain why csu attributes
      don't actually work, and we won't need to update the chattr(1) manpage;
      and it's never wrong to start off strict, relaxing later if persuaded. 
      Allow only a (append only), i (immutable), A (no atime) and d (no dump).
      
      Although lsattr showed 'A' inherited, the NOATIME behavior was not being
      inherited: because nothing sync'ed FS_NOATIME_FL to S_NOATIME.  Add
      shmem_set_inode_flags() to sync the flags, using inode_set_flags() to
      avoid that instant of lost immutability during fileattr_set().
      
      But that change switched generic/079 from passing to failing: because
      FS_IMMUTABLE_FL and FS_APPEND_FL had been unconventionally included in the
      INHERITED fsflags: remove them and generic/079 is back to passing.
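
      A sketch of the flag syncing described above (close to, but not
      necessarily identical to, the helper added by this patch):

        static void shmem_set_inode_flags(struct inode *inode, unsigned int fsflags)
        {
                unsigned int i_flags = 0;

                if (fsflags & FS_NOATIME_FL)
                        i_flags |= S_NOATIME;
                if (fsflags & FS_APPEND_FL)
                        i_flags |= S_APPEND;
                if (fsflags & FS_IMMUTABLE_FL)
                        i_flags |= S_IMMUTABLE;
                /*
                 * inode_set_flags() updates i_flags under the mask in a single
                 * store, so there is no instant at which S_IMMUTABLE is lost.
                 */
                inode_set_flags(inode, i_flags,
                                S_NOATIME | S_APPEND | S_IMMUTABLE);
        }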
      
      Link: https://lkml.kernel.org/r/2961dcb0-ddf3-b9f0-3268-12a4ff996856@google.com
      Fixes: e408e695 ("mm/shmem: support FS_IOC_[SG]ETFLAGS in tmpfs")
      Signed-off-by: Hugh Dickins <hughd@google.com>
      Cc: "Theodore Ts'o" <tytso@mit.edu>
      Cc: Radoslaw Burny <rburny@google.com>
      Cc: "Darrick J. Wong" <djwong@kernel.org>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      cb241339
    • mm/uffd: reset write protection when unregister with wp-mode · f369b07c
      Peter Xu authored
      The motivation of this patch comes from a recent report and patchfix from
      David Hildenbrand on hugetlb shared handling of wr-protected page [1].
      
      With the reproducer provided in commit message of [1], one can leverage
      the uffd-wp lazy-reset of ptes to trigger a hugetlb issue which can affect
      not only the attacker process, but also the whole system.
      
      The lazy-reset mechanism of uffd-wp was used to make unregister faster,
      but it rests on the assumption that any leftover pgtable entries only
      affect the process itself: the user should be aware of anything it does,
      and nothing outside the process should be affected.
      
      But it seems that this is not true, and it can also be utilized to make
      some exploit easier.
      
      So far there's no clue showing that the lazy-reset is important to any
      userfaultfd users because normally the unregister will only happen once
      for a specific range of memory of the lifecycle of the process.
      
      Considering all of the above, what this patch proposes is to do explicit
      pte resets when unregistering an uffd region with wr-protect mode enabled.
      
      It should be the same as calling ioctl(UFFDIO_WRITEPROTECT, wp=false)
      right before ioctl(UFFDIO_UNREGISTER) for the user.  So potentially it'll
      make the unregister slower.  From that point of view it is a very slight
      ABI change, but hopefully nothing should break with this change either.
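
      For reference, the userspace-visible equivalence described above looks
      roughly like this (an illustrative sketch, not code from the patch):

        #include <linux/userfaultfd.h>
        #include <sys/ioctl.h>

        /* what the kernel now effectively does when a wp-mode range is
         * unregistered: clear write protection first, then unregister */
        static int unprotect_and_unregister(int uffd, unsigned long addr,
                                            unsigned long len)
        {
                struct uffdio_writeprotect wp = {
                        .range = { .start = addr, .len = len },
                        .mode  = 0,             /* wp=false */
                };
                struct uffdio_range unreg = { .start = addr, .len = len };

                if (ioctl(uffd, UFFDIO_WRITEPROTECT, &wp))
                        return -1;
                return ioctl(uffd, UFFDIO_UNREGISTER, &unreg);
        }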
      
      Regarding the change itself: the core of the uffd write-[un]protect
      operation is moved into a separate function (uffd_wp_range()) and reused
      in the unregister code path.
      
      Note that the new function will not check for anything, e.g.  ranges or
      memory types, because they should have been checked during the previous
      UFFDIO_REGISTER or it should have failed already.  It also doesn't check
      mmap_changing because the mmap write lock is held anyway.
      
      I added a Fixes tag pointing at the commit that introduced uffd-wp for
      shmem+hugetlbfs, because that's the only issue reported so far and that's
      the commit with which David's reproducer starts working (v5.19+).  But
      the whole idea actually applies not only to file-backed memory but also
      to anonymous memory.  It's just that we don't need to fix anonymous
      memory prior to v5.19 because there's no known way to exploit it there.
      
      IOW, this patch can also fix the issue reported in [1] as the patch 2 does.
      
      [1] https://lore.kernel.org/all/20220811103435.188481-3-david@redhat.com/
      
      Link: https://lkml.kernel.org/r/20220811201340.39342-1-peterx@redhat.com
      Fixes: b1f9e876 ("mm/uffd: enable write protection for shmem & hugetlbfs")
      Signed-off-by: Peter Xu <peterx@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Mike Kravetz <mike.kravetz@oracle.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      f369b07c
    • mm: add DEVICE_ZONE to FOR_ALL_ZONES · a39c5d3c
      Hao Lee authored
      FOR_ALL_ZONES should be consistent with enum zone_type.  Otherwise,
      __count_zid_vm_events has the potential to add the count to the wrong
      item when zid is ZONE_DEVICE.
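
      The shape of the fix is roughly the following (a sketch; see
      include/linux/vm_event_item.h for the exact definitions):

        /* the optional zone contributes "xx##_DEVICE," or nothing at all */
        #ifdef CONFIG_ZONE_DEVICE
        #define DEVICE_ZONE(xx) xx##_DEVICE,
        #else
        #define DEVICE_ZONE(xx)
        #endif

        /* must list the same zones, in the same order, as enum zone_type */
        #define FOR_ALL_ZONES(xx) DMA_ZONE(xx) DMA32_ZONE(xx) xx##_NORMAL, \
                HIGHMEM_ZONE(xx) xx##_MOVABLE, DEVICE_ZONE(xx)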
      
      Link: https://lkml.kernel.org/r/20220807154442.GA18167@haolee.io
      Signed-off-by: Hao Lee <haolee.swjtu@gmail.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      a39c5d3c
    • mm/gup: fix FOLL_FORCE COW security issue and remove FOLL_COW · 5535be30
      David Hildenbrand authored
      Ever since the Dirty COW (CVE-2016-5195) security issue happened, we know
      that FOLL_FORCE can be possibly dangerous, especially if there are races
      that can be exploited by user space.
      
      Right now, it would be sufficient to have some code that sets a PTE of a
      R/O-mapped shared page dirty, in order for it to erroneously become
      writable by FOLL_FORCE.  The implications of setting a write-protected PTE
      dirty might not be immediately obvious to everyone.
      
      And in fact ever since commit 9ae0f87d ("mm/shmem: unconditionally set
      pte dirty in mfill_atomic_install_pte"), we can use UFFDIO_CONTINUE to map
      a shmem page R/O while marking the pte dirty.  This can be used by
      unprivileged user space to modify tmpfs/shmem file content even if the
      user does not have write permissions to the file, and to bypass memfd
      write sealing -- Dirty COW restricted to tmpfs/shmem (CVE-2022-2590).
      
      To fix such security issues for good, the insight is that we really only
      need that fancy retry logic (FOLL_COW) for COW mappings that are not
      writable (!VM_WRITE).  And in a COW mapping, we really only broke COW if
      we have an exclusive anonymous page mapped.  If we have something else
      mapped, or the mapped anonymous page might be shared (!PageAnonExclusive),
      we have to trigger a write fault to break COW.  If we don't find an
      exclusive anonymous page when we retry, we have to trigger COW breaking
      once again because something intervened.
      
      Let's move away from this mandatory-retry + dirty handling and rely on our
      PageAnonExclusive() flag for making a similar decision, to use the same
      COW logic as in other kernel parts here as well.  In case we stumble over
      a PTE in a COW mapping that does not map an exclusive anonymous page, COW
      was not properly broken and we have to trigger a fake write-fault to break
      COW.
      
      Just like we do in can_change_pte_writable() added via commit 64fe24a3
      ("mm/mprotect: try avoiding write faults for exclusive anonymous pages
      when changing protection") and commit 76aefad6 ("mm/mprotect: fix
      soft-dirty check in can_change_pte_writable()"), take care of softdirty
      and uffd-wp manually.
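
      The gist of the new decision is the following (a condensed sketch of the
      helper this patch adds, not its exact text):

        /* may FOLL_FORCE write through a read-only PTE in a COW mapping? */
        static bool can_follow_write_pte(pte_t pte, struct page *page,
                                         struct vm_area_struct *vma,
                                         unsigned int flags)
        {
                if (pte_write(pte))
                        return true;
                if (!(flags & FOLL_FORCE))
                        return false;
                /* FOLL_FORCE only applies to private, currently non-writable
                 * mappings that may become writable (i.e. COW mappings) */
                if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
                        return false;
                if (!(vma->vm_flags & VM_MAYWRITE) || (vma->vm_flags & VM_WRITE))
                        return false;
                /* COW was only properly broken if an exclusive anonymous page
                 * is mapped here; otherwise force a write fault to break COW */
                if (!page || !PageAnon(page) || !PageAnonExclusive(page))
                        return false;
                /* take care of softdirty and uffd-wp manually, as in mprotect */
                if (vma_soft_dirty_enabled(vma) && !pte_soft_dirty(pte))
                        return false;
                return !userfaultfd_pte_wp(vma, pte);
        }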
      
      For example, a write() via /proc/self/mem to a uffd-wp-protected range has
      to fail instead of silently granting write access and bypassing the
      userspace fault handler.  Note that FOLL_FORCE is not only used for debug
      access, but also triggered by applications without debug intentions, for
      example, when pinning pages via RDMA.
      
      This fixes CVE-2022-2590. Note that only x86_64 and aarch64 are
      affected, because only those support CONFIG_HAVE_ARCH_USERFAULTFD_MINOR.
      
      Fortunately, FOLL_COW is no longer required to handle FOLL_FORCE. So
      let's just get rid of it.
      
      Thanks to Nadav Amit for pointing out that the pte_dirty() check in
      FOLL_FORCE code is problematic and might be exploitable.
      
      Note 1: We don't check for the PTE being dirty because it doesn't matter
      	for making a "was COWed" decision anymore, and whoever modifies the
      	page has to set the page dirty either way.
      
      Note 2: Kernels before extended uffd-wp support and before
      	PageAnonExclusive (< 5.19) can simply revert the problematic
      	commit instead and be safe regarding UFFDIO_CONTINUE. A backport to
      	v5.19 requires minor adjustments due to lack of
      	vma_soft_dirty_enabled().
      
      Link: https://lkml.kernel.org/r/20220809205640.70916-1-david@redhat.com
      Fixes: 9ae0f87d ("mm/shmem: unconditionally set pte dirty in mfill_atomic_install_pte")
      Signed-off-by: David Hildenbrand <david@redhat.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Axel Rasmussen <axelrasmussen@google.com>
      Cc: Nadav Amit <nadav.amit@gmail.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Jason Gunthorpe <jgg@nvidia.com>
      Cc: David Laight <David.Laight@ACULAB.COM>
      Cc: <stable@vger.kernel.org>	[5.16]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      5535be30
    • ata: libata: Set __ATA_BASE_SHT max_sectors · a357f7b4
      John Garry authored
      Commit 0568e612 ("ata: libata-scsi: cap ata_device->max_sectors
      according to shost->max_sectors") inadvertently capped the max_sectors
      value for some SATA disks to a value which is lower than we would want.
      
      For a device which supports LBA48, we would previously have request queue
      max_sectors_kb and max_hw_sectors_kb values of 1280 and 32767 respectively.
      
      For AHCI controllers, the value chosen for shost max sectors comes from
      the minimum of the SCSI host default max sectors in
      SCSI_DEFAULT_MAX_SECTORS (1024) and the shost DMA device mapping limit.
      
      This means that we would now set the max_sectors_kb and max_hw_sectors_kb
      values for a disk which supports LBA48 to 512, ignoring the DMA mapping
      limit.
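
      For reference, with 512-byte sectors (kb = sectors / 2) the numbers
      quoted above correspond to:

         1024 sectors  ->   512 KB  (SCSI_DEFAULT_MAX_SECTORS cap, the regression)
         2560 sectors  ->  1280 KB  (previous max_sectors_kb for LBA48 disks)
        65535 sectors  -> 32767 KB  (previous max_hw_sectors_kb for LBA48 disks)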
      
      As reported by Oliver at [0], this caused a performance regression.
      
      Fix by picking a large enough max sectors value for ATA host controllers
      such that we don't needlessly reduce max_sectors_kb for LBA48 disks.
      
      [0] https://lore.kernel.org/linux-ide/YvsGbidf3na5FpGb@xsang-OptiPlex-9020/T/#m22d9fc5ad15af66066dd9fecf3d50f1b1ef11da3
      
      Fixes: 0568e612 ("ata: libata-scsi: cap ata_device->max_sectors according to shost->max_sectors")
      Reported-by: Oliver Sang <oliver.sang@intel.com>
      Signed-off-by: John Garry <john.garry@huawei.com>
      Signed-off-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
      a357f7b4
  5. 19 Aug 2022, 2 commits
    • KVM: Rename mmu_notifier_* to mmu_invalidate_* · 20ec3ebd
      Chao Peng authored
      The motivation for this renaming is to make these variables and related
      helper functions less mmu_notifier-bound, so that they can also be used
      for page invalidation that is not based on mmu_notifier.
      mmu_invalidate_* was chosen to better describe the purpose these
      variables serve: 'invalidating' a page.
      
        - mmu_notifier_seq/range_start/range_end are renamed to
          mmu_invalidate_seq/range_start/range_end.
      
        - mmu_notifier_retry{_hva} helper functions are renamed to
          mmu_invalidate_retry{_hva}.
      
        - mmu_notifier_count is renamed to mmu_invalidate_in_progress to
          avoid confusion with mn_active_invalidate_count.
      
        - While here, also update kvm_inc/dec_notifier_count() to
          kvm_mmu_invalidate_begin/end() to match the change for
          mmu_notifier_count.
      
      No functional change intended.
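
      For context, a typical consumer of these fields in a page-fault path
      looks roughly like this after the rename (a sketch of the call pattern
      only, not a complete excerpt):

        mmu_seq = kvm->mmu_invalidate_seq;
        smp_rmb();
        /* ... resolve gfn to pfn outside mmu_lock ... */
        write_lock(&kvm->mmu_lock);
        if (mmu_invalidate_retry_hva(kvm, mmu_seq, hva))
                goto retry;     /* an invalidation raced with us; start over */
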
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-3-chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      20ec3ebd
    • KVM: Rename KVM_PRIVATE_MEM_SLOTS to KVM_INTERNAL_MEM_SLOTS · bdd1c37a
      Chao Peng authored
      KVM_INTERNAL_MEM_SLOTS better reflects the fact that those slots are used
      internally by KVM (they are invisible to userspace) and avoids confusion
      with future private slots, which can have a different meaning.
      Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
      Message-Id: <20220816125322.1110439-2-chao.p.peng@linux.intel.com>
      Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
      bdd1c37a
  6. 18 Aug 2022, 1 commit
  7. 16 Aug 2022, 7 commits
  8. 15 Aug 2022, 1 commit
    • radix-tree: replace gfp.h inclusion with gfp_types.h · 9f162193
      Yury Norov authored
      Radix tree header includes gfp.h for __GFP_BITS_SHIFT only. Now we
      have gfp_types.h for this.
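
      The change itself is essentially just the include swap (sketch):

        /* include/linux/radix-tree.h: only the gfp type definitions are needed */
        #include <linux/gfp_types.h>    /* was: #include <linux/gfp.h> */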
      
      Fixes powerpc allmodconfig build:
      
         In file included from include/linux/nodemask.h:97,
                          from include/linux/mmzone.h:17,
                          from include/linux/gfp.h:7,
                          from include/linux/radix-tree.h:12,
                          from include/linux/idr.h:15,
                          from include/linux/kernfs.h:12,
                          from include/linux/sysfs.h:16,
                          from include/linux/kobject.h:20,
                          from include/linux/pci.h:35,
                          from arch/powerpc/kernel/prom_init.c:24:
         include/linux/random.h: In function 'add_latent_entropy':
      >> include/linux/random.h:25:46: error: 'latent_entropy' undeclared (first use in this function); did you mean 'add_latent_entropy'?
            25 |         add_device_randomness((const void *)&latent_entropy, sizeof(latent_entropy));
               |                                              ^~~~~~~~~~~~~~
               |                                              add_latent_entropy
         include/linux/random.h:25:46: note: each undeclared identifier is reported only once for each function it appears in
      Reported-by: kernel test robot <lkp@intel.com>
      CC: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Jason A. Donenfeld <Jason@zx2c4.com>
      Signed-off-by: Yury Norov <yury.norov@gmail.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9f162193
  9. 14 Aug 2022, 2 commits
    • NFS: Fix another fsync() issue after a server reboot · 67f4b5dc
      Trond Myklebust authored
      Currently, when the writeback code detects a server reboot, it redirties
      any pages that were not committed to disk, and it sets the flag
      NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
      that dirtied the file. While this allows the file descriptor in question
      to redrive its own writes, it violates the fsync() requirement that we
      should be synchronising all writes to disk.

      While the problem is infrequent, we do see corner cases where an
      untimely server reboot causes the fsync() call to abandon its attempt to
      sync data to disk, causing data corruption issues due to missed error
      conditions or similar.
      
      In order to tighten up the client's ability to deal with this situation
      without introducing livelocks, add a counter that records the number of
      times pages are redirtied due to a server reboot-like condition, and use
      that in fsync() to redrive the sync to disk.
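
      Conceptually the fsync() path becomes a loop of the following shape (a
      sketch; the counter's field name here is illustrative):

        do {
                /* snapshot the "redirtied due to server reboot" counter */
                nr_rewrites = atomic_long_read(&nfsi->redirtied_pages);
                ret = file_write_and_wait_range(file, start, end);
                if (!ret)
                        ret = nfs_commit_inode(inode, FLUSH_SYNC);
                /* if a reboot redirtied pages meanwhile, drive the sync again */
        } while (!ret &&
                 nr_rewrites != atomic_long_read(&nfsi->redirtied_pages));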
      
      Fixes: 2197e9b0 ("NFS: Fix up fsync() when the server rebooted")
      Cc: stable@vger.kernel.org
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      67f4b5dc
  10. 13 Aug 2022, 2 commits
  11. 11 Aug 2022, 14 commits