Commits · b84ba30b6c7a75babdf73b83bc3c7b59b944501a · openeuler / Kernel

29 Nov, 2021 17 commits

block: remove the gendisk argument to blk_execute_rq · b84ba30b

Authored by Christoph Hellwig on Nov 26, 2021

Remove the gendisk argument to blk_execute_rq and blk_execute_rq_nowait
given that it is unused now. Also convert the boolean at_head parameter
to actually use the bool type while touching the prototype.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-5-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

b84ba30b

block: remove the ->rq_disk field in struct request · f3fa33ac

Authored by Christoph Hellwig on Nov 26, 2021

Just use the disk attached to the request_queue instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com>
Link: https://lore.kernel.org/r/20211126121802.2090656-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f3fa33ac

block: remove get_io_context_active · 50569c24

Authored by Christoph Hellwig on Nov 26, 2021

Fold it into its only caller, and remove a lot of the debug checks
that are not needed.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

50569c24

block: mark put_io_context_active static · 33047425

Authored by Christoph Hellwig on Nov 26, 2021

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

33047425

fork: move copy_io to block/blk-ioc.c · 88c9a2ce

Authored by Christoph Hellwig on Nov 26, 2021

Move the copying of the I/O context to the block layer as that is where
we can use the proper low-level interfaces.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211126115817.2087431-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

88c9a2ce

blk-mq: Add blk_mq_complete_request_direct() · e8dc17e2

Authored by Sebastian Andrzej Siewior on Oct 25, 2021

Add blk_mq_complete_request_direct() which completes the block request
directly instead of deferring it to softirq for single queue devices.

This is useful for devices which complete the requests in preemptible
context, where raising a softirq means scheduling ksoftirqd.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211025070658.1565848-2-bigeasy@linutronix.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

e8dc17e2

blk-crypto: remove blk_crypto_unregister() · 72cd9df2

Authored by Eric Biggers on Nov 23, 2021

This function is trivial and is only used in one place. Having this
function is misleading because it implies that blk_crypto_register()
needs to be paired with blk_crypto_unregister(), which is not the case.
Just set disk->queue->crypto_profile to NULL directly.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211124013733.347612-1-ebiggers@kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>

72cd9df2

block: only allocate poll_stats if there's a user of them · 48b5c1fb

Authored by Jens Axboe on Nov 13, 2021

This is essentially never used, yet it's about 1/3rd of the total
queue size. Allocate it when needed, and don't embed it in the queue.

Kill the queue flag for this while at it, since we can just check the
assigned pointer now.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

48b5c1fb
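
The allocate-on-demand idea in the poll_stats commit above can be illustrated with a small user-space sketch. Everything here is hypothetical (`struct queue`, the `queue_*` helpers, and the bucket count are stand-ins, not the kernel's definitions); it only shows how a NULL check on the pointer can replace a separate "enabled" flag.

```c
#include <stdbool.h>
#include <stdlib.h>

#define POLL_STAT_BUCKETS 16  /* assumed bucket count, for illustration only */

struct poll_stat {
    unsigned long long nr_samples;
    unsigned long long mean_ns;
};

/* Hypothetical stand-in for a request queue. */
struct queue {
    struct poll_stat *poll_stat;  /* NULL until a user asks for stats */
};

/* The pointer itself replaces the former "stats enabled" flag. */
static bool queue_poll_stats_enabled(const struct queue *q)
{
    return q->poll_stat != NULL;
}

/* Allocate the stats array on first use; returns 0 on success, -1 on failure. */
static int queue_enable_poll_stats(struct queue *q)
{
    if (q->poll_stat)
        return 0;  /* already allocated */
    q->poll_stat = calloc(POLL_STAT_BUCKETS, sizeof(*q->poll_stat));
    return q->poll_stat ? 0 : -1;
}

/* Release the array and return to the "disabled" state. */
static void queue_free_poll_stats(struct queue *q)
{
    free(q->poll_stat);
    q->poll_stat = NULL;
}
```

A queue that never enables polling statistics pays only for one pointer. The sketch ignores the locking a real implementation would need when two users enable statistics concurrently.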

block: cleanup the GENHD_FL_* definitions · 430cc5d3

Authored by Christoph Hellwig on Nov 22, 2021

Switch to an enum and tidy up the documentation.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-14-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

430cc5d3

block: remove GENHD_FL_EXT_DEVT · 1ebe2e5f

Authored by Christoph Hellwig on Nov 22, 2021

All modern drivers can support extra partitions using the extended
dev_t.  In fact, except for the ioctl method, drivers never even see
partitions in normal operation.

So remove the GENHD_FL_EXT_DEVT and allow extra partitions for all
block devices that do support partitions, and require those that
do not support partitions to explicitly disallow them using
GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-12-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1ebe2e5f

block: remove GENHD_FL_SUPPRESS_PARTITION_INFO · 3b5149ac

Authored by Christoph Hellwig on Nov 22, 2021

This flag is not set directly anywhere and only inherited from
GENHD_FL_HIDDEN. Just check for GENHD_FL_HIDDEN instead.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-11-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

3b5149ac

block: rename GENHD_FL_NO_PART_SCAN to GENHD_FL_NO_PART · 46e7eac6

Authored by Christoph Hellwig on Nov 22, 2021

The GENHD_FL_NO_PART_SCAN flag controls more than just partition scanning,
so rename it to GENHD_FL_NO_PART.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20211122130625.1136848-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

46e7eac6

block: remove GENHD_FL_CD · 1a827ce1

Authored by Christoph Hellwig on Nov 22, 2021

GENHD_FL_CD marks a gendisk as a vaguely CD-ROM like device.
Besides being used internally inside of sunvdc.c and xen-blkfront, it
is used by xen-blkback as a hint to claim a device exported to a
guest is a CD-ROM like device. Just check for disk->cdi instead,
which is the right indicator for "real" CD-ROM or DVD drivers. This
will miss the paravirtualized guest drivers, but those make little
sense to report anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-4-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1a827ce1

block: move GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE to disk->event_flags · 1545e0b4

Authored by Christoph Hellwig on Nov 22, 2021

GENHD_FL_BLOCK_EVENTS_ON_EXCL_WRITE is all about the event reporting
mechanism, so move it to the event_flags field.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

1545e0b4

block: move GENHD_FL_NATIVE_CAPACITY to disk->state · 86416916

Authored by Christoph Hellwig on Nov 22, 2021

The flag to indicate an unlocked native capacity is dynamic state,
not a driver capability flag, so move it to disk->state.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211122130625.1136848-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

86416916

block: remove rq_flush_dcache_pages · 786d4e01

Authored by Christoph Hellwig on Nov 17, 2021

This function is trivial, and flush_dcache_page is always defined, so
just open code it in the 2.5 callers.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-3-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

786d4e01

block: move blk_rq_err_bytes to scsi · 79478bf9

Authored by Christoph Hellwig on Nov 17, 2021

blk_rq_err_bytes is only used by the scsi midlayer, so move it there.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20211117061404.331732-2-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>

79478bf9

27 Nov, 2021 1 commit

net: ptp: add a definition for the UDP port for IEEE 1588 general messages · ec15baec

Authored by Vladimir Oltean on Nov 26, 2021

As opposed to event messages (Sync, PdelayReq etc) which require
timestamping, general messages (Announce, FollowUp etc) do not.
In PTP they are part of different streams of data.

IEEE 1588-2008 Annex D.2 "UDP port numbers" states that the UDP
destination port assigned by IANA is 319 for event messages, and 320 for
general messages. Yet the kernel seems to be missing the definition for
general messages. This patch adds it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>

ec15baec
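
As a rough illustration of the two port numbers cited in the PTP commit above, the following user-space sketch classifies a UDP destination port into the event or general stream. The macro names are assumptions modeled on kernel naming, and `enum ptp_stream` and `ptp_stream_from_port()` are hypothetical helpers, not kernel API.

```c
#include <stdint.h>

/* IEEE 1588-2008 Annex D.2 UDP destination ports (macro names assumed). */
#define PTP_EV_PORT  319  /* event messages (Sync, PdelayReq, ...): timestamped */
#define PTP_GEN_PORT 320  /* general messages (Announce, FollowUp, ...): not timestamped */

/* Hypothetical classification result. */
enum ptp_stream {
    PTP_STREAM_NONE,     /* not a PTP-over-UDP port */
    PTP_STREAM_EVENT,
    PTP_STREAM_GENERAL,
};

/* dport is the UDP destination port in host byte order. */
static enum ptp_stream ptp_stream_from_port(uint16_t dport)
{
    switch (dport) {
    case PTP_EV_PORT:
        return PTP_STREAM_EVENT;
    case PTP_GEN_PORT:
        return PTP_STREAM_GENERAL;
    default:
        return PTP_STREAM_NONE;
    }
}
```

A caller reading the port from a packet header would convert it with `ntohs()` first, since the wire format is big-endian.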

25 Nov, 2021 1 commit

Revert "virtio_ring: validate used buffer length" · f124034f

Authored by Michael S. Tsirkin on Nov 24, 2021

This reverts commit 939779f5.

Attempts to validate length in the core did not work out: there turn out
to exist multiple broken devices, and in particular legacy devices are
known to be broken in this respect.

We have ideas for handling this better in the next version but for now
let's revert to a known good state to make sure drivers work for people.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

f124034f

24 Nov, 2021 1 commit

ACPI: Make acpi_node_get_parent() local · 985e9ece

Authored by Sakari Ailus on Nov 17, 2021

acpi_node_get_parent() isn't used outside drivers/acpi/property.c.

Make it local.

Signed-off-by: Sakari Ailus <sakari.ailus@linux.intel.com>
Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

985e9ece

21 Nov, 2021 2 commits

hugetlb: fix hugetlb cgroup refcounting during mremap · afe041c2

Authored by Bui Quang Minh on Nov 19, 2021

When hugetlb_vm_op_open() is called during copy_vma(), we may take the
reference to resv_map->css.  Later, when clearing the reservation
pointer of old_vma after transferring it to new_vma, we forget to drop
the reference to resv_map->css.  This leads to a reference leak of css.

Fix this by adding a check to drop the reservation css reference in
clear_vma_resv_huge_pages().

Link: https://lkml.kernel.org/r/20211113154412.91134-1-minhquangbui99@gmail.com
Fixes: 550a7d60 ("mm, hugepages: add mremap() support for hugepage backed vma")
Signed-off-by: Bui Quang Minh <minhquangbui99@gmail.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

afe041c2

shm: extend forced shm destroy to support objects from several IPC nses · 85b6d246

Authored by Alexander Mikhalitsyn on Nov 19, 2021

Currently, the exit_shm() function is not designed to work properly when
task->sysvshm.shm_clist holds shm objects from different IPC namespaces.

This is a real pain when sysctl kernel.shm_rmid_forced = 1, because it
leads to use-after-free (reproducer exists).

This is an attempt to fix the problem by extending exit_shm mechanism to
handle shm's destroy from several IPC ns'es.

To achieve that we do several things:

1. add a namespace (non-refcounted) pointer to the struct shmid_kernel

2. during new shm object creation (newseg()/shmget syscall) we
   initialize this pointer with the current task's IPC ns

3. exit_shm() is fully reworked such that it traverses all shp's in
   task->sysvshm.shm_clist and gets the IPC namespace not from the current
   task as it did before but from the shp object itself, then calls
   shm_destroy(shp, ns).

Note: We need to be really careful here, because as noted in (1), our
pointer to the IPC ns is not refcounted.  To be on the safe side we use
the special helper get_ipc_ns_not_zero(), which takes a reference on the
IPC ns only if the IPC ns is not in the "state of destruction".

Q/A

Q: Why can we access shp->ns memory using a non-refcounted pointer?
A: Because the shp object lifetime is always shorter than the IPC namespace
   lifetime, so if we get the shp object from task->sysvshm.shm_clist
   while holding task_lock(task), nobody can steal our namespace.

Q: Does this patch change the semantics of the unshare/setns/clone syscalls?
A: No. It just fixes a non-covered case where a process may leave an IPC
   namespace without getting its task->sysvshm.shm_clist list cleaned up.

Link: https://lkml.kernel.org/r/67bb03e5-f79c-1815-e2bf-949c67047418@colorfullife.com
Link: https://lkml.kernel.org/r/20211109151501.4921-1-manfred@colorfullife.com
Fixes: ab602f79 ("shm: make exit_shm work proportional to task activity")
Co-developed-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Manfred Spraul <manfred@colorfullife.com>
Signed-off-by: Alexander Mikhalitsyn <alexander.mikhalitsyn@virtuozzo.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Cc: Vasily Averin <vvs@virtuozzo.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

85b6d246
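
The shm commit above relies on taking a namespace reference only when the namespace is not already being destroyed. A minimal user-space sketch of that "increment unless zero" pattern, using C11 atomics, might look as follows. `struct ns_ref`, `ns_get_not_zero()` and `ns_put()` are hypothetical names; the kernel helper's actual implementation is not reproduced here.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical refcounted object; a count of 0 means "being destroyed". */
struct ns_ref {
    atomic_int count;
};

/* Take a reference only if the count is still non-zero. */
static bool ns_get_not_zero(struct ns_ref *ns)
{
    int old = atomic_load(&ns->count);

    while (old != 0) {
        /* On failure, old is reloaded with the current value and we retry. */
        if (atomic_compare_exchange_weak(&ns->count, &old, old + 1))
            return true;
    }
    return false;  /* object is already on its way out; do not touch it */
}

/* Drop a reference; returns true when the caller released the last one. */
static bool ns_put(struct ns_ref *ns)
{
    return atomic_fetch_sub(&ns->count, 1) == 1;
}
```

The compare-and-exchange loop is what prevents a count that has already reached zero from being raised back to one, which is the hazard a plain increment would have with a non-refcounted pointer.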

19 Nov, 2021 2 commits

signal: Replace force_fatal_sig with force_exit_sig when in doubt · fcb116bc

Authored by Eric W. Biederman on Nov 18, 2021

Recently, to prevent issues with SECCOMP_RET_KILL and similar signals
being changed before they are delivered, SA_IMMUTABLE was added.

Unfortunately this broke debuggers[1][2] which reasonably expect
to be able to trap synchronous SIGTRAP and SIGSEGV even when
the target process is not configured to handle those signals.

Add force_exit_sig and use it instead of force_fatal_sig where
historically the code has directly called do_exit.  This has the
implementation benefits of going through the signal exit path
(including generating core dumps) without the danger of allowing
userspace to ignore or change these signals.

This avoids userspace regressions as older kernels exited with do_exit
which debuggers also can not intercept.

In the future it should be possible to improve the quality of
implementation of the kernel by changing some of these force_exit_sig
calls to force_fatal_sig.  That can be done where it matters on
a case-by-case basis with careful analysis.

Reported-by: Kyle Huey <me@kylehuey.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
[1] https://lkml.kernel.org/r/CAP045AoMY4xf8aC_4QU_-j7obuEPYgTcnQQP3Yxk=2X90jtpjw@mail.gmail.com
[2] https://lkml.kernel.org/r/20211117150258.GB5403@xsang-OptiPlex-9020
Fixes: 00b06da2 ("signal: Add SA_IMMUTABLE to ensure forced siganls do not get changed")
Fixes: a3616a3c ("signal/m68k: Use force_sigsegv(SIGSEGV) in fpsp040_die")
Fixes: 83a1f27a ("signal/powerpc: On swapcontext failure force SIGSEGV")
Fixes: 9bc508cf ("signal/s390: Use force_sigsegv in default_trap_handler")
Fixes: 086ec444 ("signal/sparc32: In setup_rt_frame and setup_fram use force_fatal_sig")
Fixes: c317d306 ("signal/sparc32: Exit with a fatal signal when try_to_clear_window_buffer fails")
Fixes: 695dd0d6 ("signal/x86: In emulate_vsyscall force a signal instead of calling do_exit")
Fixes: 1fbd60df ("signal/vm86_32: Properly send SIGSEGV when the vm86 state cannot be saved.")
Fixes: 941edc5b ("exit/syscall_user_dispatch: Send ordinary signals on failure")
Link: https://lkml.kernel.org/r/871r3dqfv8.fsf_-_@email.froward.int.ebiederm.org
Reviewed-by: Kees Cook <keescook@chromium.org>
Tested-by: Kees Cook <keescook@chromium.org>
Tested-by: Kyle Huey <khuey@kylehuey.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>

fcb116bc

mm: Add functions to zero portions of a folio · c0357139

Authored by Matthew Wilcox (Oracle) on Nov 05, 2021

These functions are wrappers around zero_user_segments(), which means
that zero_user_segments() can now be called for compound pages even when
CONFIG_TRANSPARENT_HUGEPAGE is disabled.

Use 'xend' as the name of the parameter to indicate that this is an
excluded end, not the more usual included end.  Excluding the end makes
more sense to the callers, but can cause confusion to readers who are
more used to seeing included ends.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

c0357139
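
The "excluded end" convention described in the folio-zeroing commit above is the usual half-open range `[start, xend)`. The sketch below shows it on a plain byte buffer; the `buf_zero_*` helpers are hypothetical user-space analogues and make no claim about the kernel functions' exact signatures.

```c
#include <stddef.h>
#include <string.h>

/* Zero bytes in [start, xend): xend is the first byte that is NOT zeroed. */
static void buf_zero_segment(unsigned char *buf, size_t start, size_t xend)
{
    if (xend > start)
        memset(buf + start, 0, xend - start);
}

/* Zero two half-open ranges, [start1, xend1) and [start2, xend2). */
static void buf_zero_segments(unsigned char *buf,
                              size_t start1, size_t xend1,
                              size_t start2, size_t xend2)
{
    buf_zero_segment(buf, start1, xend1);
    buf_zero_segment(buf, start2, xend2);
}

/* Length-based variant: zero length bytes starting at start. */
static void buf_zero_range(unsigned char *buf, size_t start, size_t length)
{
    buf_zero_segment(buf, start, start + length);
}
```

With an excluded end, the byte count is simply `xend - start` and adjacent ranges compose without off-by-one adjustments: zeroing `[0, 4)` and then `[4, 8)` covers exactly eight bytes.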

18 Nov, 2021 2 commits

page_pool: Revert "page_pool: disable dma mapping support..." · f915b75b

Authored by Yunsheng Lin on Nov 17, 2021

This reverts commit d00e60ee.

As reported by Guillaume in [1]:
Enabling LPAE always enables CONFIG_ARCH_DMA_ADDR_T_64BIT
on 32-bit systems, which breaks the boot process when an
Ethernet driver is using page pool with the PP_FLAG_DMA_MAP flag.
We had hoped there were no active consumers on such systems
when we removed the DMA mapping support, but LPAE seems to be
a common feature on 32-bit systems, so revert it.

1. https://www.spinics.net/lists/netdev/msg779890.html

Fixes: d00e60ee ("page_pool: disable dma mapping support for 32-bit arch with 64-bit DMA")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Reported-by: "kernelci.org bot" <bot@kernelci.org>
Tested-by: "kernelci.org bot" <bot@kernelci.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>

f915b75b

KVM: Kill kvm_map_gfn() / kvm_unmap_gfn() and gfn_to_pfn_cache · 357a18ad

Authored by David Woodhouse on Nov 15, 2021

In commit 7e2175eb ("KVM: x86: Fix recording of guest steal time /
preempted status") I removed the only user of these functions because
it was basically impossible to use them safely.

There are two stages to the GFN->PFN mapping; first through the KVM
memslots to a userspace HVA and then through the page tables to
translate that HVA to an underlying PFN. Invalidations of the former
were being handled correctly, but no attempt was made to use the MMU
notifiers to invalidate the cache when the HVA->GFN mapping changed.

As a prelude to reinventing the gfn_to_pfn_cache with more usable
semantics, rip it out entirely and untangle the implementation of
the unsafe kvm_vcpu_map()/kvm_vcpu_unmap() functions from it.

All current users of kvm_vcpu_map() also look broken right now, and
will be dealt with separately. They broadly fall into two classes:

* Those which map, access the data and immediately unmap. This is
  mostly gratuitous and could just as well use the existing user
  HVA, and could probably benefit from a gfn_to_hva_cache as they
  do so.

* Those which keep the mapping around for a longer time, perhaps
  even using the PFN directly from the guest. These will need to
  be converted to the new gfn_to_pfn_cache and then kvm_vcpu_map()
  can be removed too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Message-Id: <20211115165030.7422-8-dwmw2@infradead.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

357a18ad

17 Nov, 2021 7 commits

fs: Rename AS_THP_SUPPORT and mapping_thp_support · ed2145c4

Authored by Matthew Wilcox (Oracle) on Aug 29, 2021

These are now indicators of large folio support, not THP support.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

ed2145c4

fs: Remove FS_THP_SUPPORT · ff36da69

Authored by Matthew Wilcox (Oracle) on Aug 29, 2021

Instead of setting a bit in the fs_flags to set a bit in the
address_space, set the bit in the address_space directly.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

ff36da69

mm: Remove folio_test_single · a1efe484

Authored by Matthew Wilcox (Oracle) on Nov 16, 2021

There's no need for this predicate; callers can just use
!folio_test_large().

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

a1efe484

mm: Rename folio_test_multi to folio_test_large · 9c325215

Authored by Matthew Wilcox (Oracle) on Nov 16, 2021

This is a better name.  Also add kernel-doc.

Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>

9c325215

Add linux/cacheflush.h · 522a0032

Authored by Matthew Wilcox (Oracle) on Nov 06, 2021

Many architectures do not include asm-generic/cacheflush.h, so turn
the includes on their head and add linux/cacheflush.h which includes
asm/cacheflush.h.

Move the flush_dcache_folio() declaration from asm-generic/cacheflush.h
to linux/cacheflush.h and change linux/highmem.h to include
linux/cacheflush.h instead of asm/cacheflush.h so that all necessary
places will see flush_dcache_folio().

More functions should have their default implementations moved in the
future, but those are for follow-on patches.  This fixes csky, sparc and
sparc64 which were missed in the commit which added flush_dcache_folio().

Fixes: 08b0b005 ("mm: Add flush_dcache_folio()")
Suggested-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Acked-by: Geert Uytterhoeven <geert@linux-m68k.org>

522a0032

net: virtio_net_hdr_to_skb: count transport header in UFO · cf9acc90

Authored by Jonathan Davies on Nov 16, 2021

virtio_net_hdr_to_skb does not set the skb's gso_size and gso_type
correctly for UFO packets received via virtio-net that are a little over
the GSO size. This can lead to problems elsewhere in the networking
stack, e.g. ovs_vport_send dropping over-sized packets if gso_size is
not set.

This is due to the comparison

  if (skb->len - p_off > gso_size)

not properly accounting for the transport layer header.

p_off includes the size of the transport layer header (thlen), so
skb->len - p_off is the size of the TCP/UDP payload.

gso_size is read from the virtio-net header. For UFO, fragmentation
happens at the IP level so does not need to include the UDP header.

Hence the calculation could be comparing a TCP/UDP payload length with
an IP payload length, causing legitimate virtio-net packets to lack
gso_type/gso_size information.

Example: a UDP packet with payload size 1473 has IP payload size 1481.
If the guest used UFO, it is not fragmented and the virtio-net header's
flags indicate that it is a GSO frame (VIRTIO_NET_HDR_GSO_UDP), with
gso_size = 1480 for an MTU of 1500.  skb->len will be 1515 and p_off
will be 42, so skb->len - p_off = 1473.  Hence the comparison fails, and
shinfo->gso_size and gso_type are not set as they should be.

Instead, add the UDP header length before comparing to gso_size when
using UFO. In this way, it is the size of the IP payload that is
compared to gso_size.

Fixes: 6dd912f8 ("net: check untrusted gso_size at kernel entry")
Signed-off-by: Jonathan Davies <jonathan.davies@nutanix.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

cf9acc90
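
The arithmetic in the UFO example above (skb->len = 1515, p_off = 42, gso_size = 1480) can be checked with a small stand-alone sketch. The two helper functions are hypothetical and only model the comparison; the actual patch may express the adjustment differently, for instance by changing the offset rather than adding the header length.

```c
#include <stdbool.h>

#define UDP_HDR_LEN 8u  /* size of a UDP header in bytes */

/* Old check: compares the transport payload length against gso_size. */
static bool exceeds_gso_old(unsigned int skb_len, unsigned int p_off,
                            unsigned int gso_size)
{
    return skb_len - p_off > gso_size;
}

/*
 * Adjusted check: for UFO, gso_size describes the IP payload, so the UDP
 * header is added back before comparing.
 */
static bool exceeds_gso_fixed(unsigned int skb_len, unsigned int p_off,
                              unsigned int gso_size, bool is_ufo)
{
    unsigned int len = skb_len - p_off;

    if (is_ufo)
        len += UDP_HDR_LEN;
    return len > gso_size;
}
```

For the numbers in the commit message, the old check evaluates 1473 > 1480 and skips the GSO fields, while the adjusted check evaluates 1481 > 1480 and sets them.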

net/mlx5: E-Switch, Fix resetting of encap mode when entering switchdev · d7751d64

Authored by Paul Blakey on May 20, 2021

E-Switch encap mode is relevant only when in switchdev mode.
The RDMA driver can query the encap configuration via
mlx5_eswitch_get_encap_mode(). Make sure it returns the currently
used mode and not the set one.

This reverts the cited commit which reset the encap mode
on entering switchdev and fixes the original issue properly.

Fixes: 9a64144d ("net/mlx5: E-Switch, Fix default encap mode")
Signed-off-by: Paul Blakey <paulb@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Maor Dickman <maord@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>

d7751d64

16 Nov, 2021 3 commits

bpf: Fix toctou on read-only map's constant scalar tracking · 353050be

Authored by Daniel Borkmann on Nov 09, 2021

Commit a23740ec ("bpf: Track contents of read-only maps as scalars") is
checking whether maps are read-only both from BPF program side and user space
side, and then, given their content is constant, reading out their data via
map->ops->map_direct_value_addr() which is then subsequently used as known
scalar value for the register, that is, it is marked as __mark_reg_known()
with the read value at verification time. Before a23740ec, the register
content was marked as an unknown scalar so the verifier could not make any
assumptions about the map content.

The current implementation however is prone to a TOCTOU race, meaning, the
value read as known scalar for the register is not guaranteed to be exactly
the same at a later point when the program is executed, and as such, the
prior made assumptions of the verifier with regards to the program will be
invalid which can cause issues such as OOB access, etc.

While the BPF_F_RDONLY_PROG map flag is always fixed and required to be
specified at map creation time, the map->frozen property is initially set to
false for the map given the map value needs to be populated, e.g. for global
data sections. Once complete, the loader "freezes" the map from user space
such that no subsequent updates/deletes are possible anymore. For the rest
of the lifetime of the map, this freeze one-time trigger cannot be undone
anymore after a successful BPF_MAP_FREEZE cmd return. Meaning, any new BPF_*
cmd calls which would update/delete map entries will be rejected with -EPERM
since map_get_sys_perms() removes the FMODE_CAN_WRITE permission. This also
means that pending update/delete map entries must still complete before this
guarantee is given. This corner case is not an issue for loaders since they
create and prepare such program private map in successive steps.

However, a malicious user is able to trigger this TOCTOU race in two different
ways: i) via userfaultfd, and ii) via batched updates. For i) userfaultfd is
used to expand the competition interval, so that map_update_elem() can modify
the contents of the map after map_freeze() and bpf_prog_load() were executed.
This works, because userfaultfd halts the parallel thread which triggered a
map_update_elem() at the time where we copy key/value from the user buffer and
this already passed the FMODE_CAN_WRITE capability test given at that time the
map was not "frozen". Then, the main thread performs the map_freeze() and
bpf_prog_load(), and once that had completed successfully, the other thread
is woken up to complete the pending map_update_elem() which then changes the
map content. For ii) the idea of the batched update is similar, meaning, when
there are a large number of updates to be processed, it can increase the
competition interval between the two. It is therefore possible in practice to
modify the contents of the map after executing map_freeze() and bpf_prog_load().

One way to fix both i) and ii) at the same time is to expand the use of the
map's map->writecnt. The latter was introduced in fc970227 ("bpf: Add mmap()
support for BPF_MAP_TYPE_ARRAY") and further refined in 1f6cb19b ("bpf:
Prevent re-mmap()'ing BPF map as writable for initially r/o mapping") with
the rationale to make a writable mmap()'ing of a map mutually exclusive with
read-only freezing. The counter indicates writable mmap() mappings and then
prevents/fails the freeze operation. Its semantics can be expanded beyond
just mmap() by generally indicating ongoing write phases. This would essentially
span any parallel regular and batched flavor of update/delete operation and
then also have map_freeze() fail with -EBUSY. For the check_mem_access() in
the verifier we expand upon the bpf_map_is_rdonly() check ensuring that all
last pending writes have completed via bpf_map_write_active() test. Once the
map->frozen is set and bpf_map_write_active() indicates a map->writecnt of 0
only then we are really guaranteed to use the map's data as known constants.
For map->frozen being set and pending writes in process of still being completed
we fall back to marking that register as unknown scalar so we don't end up
making assumptions about it. With this, both TOCTOU reproducers from i) and
ii) are fixed.

Note that the map->writecnt has been converted into an atomic64 in the fix in
order to avoid a double freeze_mutex mutex_{un,}lock() pair when updating
map->writecnt in the various map update/delete BPF_* cmd flavors. Spanning
the freeze_mutex over entire map update/delete operations in syscall side
would not be possible due to then causing everything to be serialized.
Similarly, something like synchronize_rcu() after setting map->frozen to wait
for update/deletes to complete is not possible either since it would also
have to span the user copy which can sleep. On the libbpf side, this won't
break d66562fb ("libbpf: Add BPF object skeleton support") as the
anonymous mmap()-ed "map initialization image" is remapped as a BPF map-backed
mmap()-ed memory where for .rodata it's non-writable.

Fixes: a23740ec ("bpf: Track contents of read-only maps as scalars")
Reported-by: w1tcher.bupt@gmail.com
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>

353050be
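
The interplay between map->writecnt, map->frozen and the verifier check described in the BPF commit above can be modeled in user space roughly as follows. This is a simplified sketch under stated assumptions: `struct map` and all `map_*` helpers are hypothetical, the freeze mutex is omitted, and only the ordering of counter updates and checks is shown.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a map with a write-phase counter and a frozen bit. */
struct map {
    atomic_llong writecnt;  /* in-flight write phases: updates, deletes, writable mmaps */
    atomic_bool frozen;
};

static bool map_write_active(struct map *m)
{
    return atomic_load(&m->writecnt) != 0;
}

/* Announce the write first, then check permission, so a freeze cannot miss it. */
static int map_update_begin(struct map *m)
{
    atomic_fetch_add(&m->writecnt, 1);
    if (atomic_load(&m->frozen)) {
        atomic_fetch_sub(&m->writecnt, 1);
        return -EPERM;
    }
    return 0;
}

static void map_update_end(struct map *m)
{
    atomic_fetch_sub(&m->writecnt, 1);
}

/* Freezing fails with -EBUSY while any write phase is still in flight. */
static int map_freeze(struct map *m)
{
    if (atomic_load(&m->frozen) || map_write_active(m))
        return -EBUSY;
    atomic_store(&m->frozen, true);
    return 0;
}

/* Verifier side: contents are known constants only if no write is pending. */
static bool map_content_is_constant(struct map *m, bool rdonly_prog)
{
    return rdonly_prog && atomic_load(&m->frozen) && !map_write_active(m);
}
```

Even if an update slips in between the freeze's counter check and its store to `frozen`, the counter stays non-zero until that update finishes, so `map_content_is_constant()` reports false and the register would be treated as an unknown scalar.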

Remove unused header <linux/sdb.h> · a3143f78

Authored by Jonathan Corbet on Nov 02, 2021

Commit 6a80b300 ("fmc: Delete the FMC subsystem") removed the last user
of <linux/sdb.h>, but left the header file behind.  Nothing uses this file;
delete it now.

Cc: Linus Walleij <linus.walleij@linaro.org>
Cc: Alessandro Rubini <rubini@gnudd.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Acked-by: Alessandro Rubini <rubini@gnudd.com>
Link: https://lore.kernel.org/r/20211102220203.940290-5-corbet@lwn.net
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

a3143f78

ACPI: Add stubs for wakeup handler functions · e9380df8

Authored by Mario Limonciello on Oct 31, 2021

The commit ddfd9dcf ("ACPI: PM: Add acpi_[un]register_wakeup_handler()")
added new functions for drivers to use during the s2idle wakeup path, but
didn't add stubs for when CONFIG_ACPI wasn't set.

Add those stubs in for other drivers to be able to use.

Fixes: ddfd9dcf ("ACPI: PM: Add acpi_[un]register_wakeup_handler()")
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Link: https://lore.kernel.org/r/20211101014853.6177-1-mario.limonciello@amd.com
Signed-off-by: Linus Walleij <linus.walleij@linaro.org>

e9380df8
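
The stub pattern referred to in the ACPI commit above usually takes the form of `static inline` no-op functions in the header, selected when the feature is compiled out. The sketch below illustrates it with hypothetical names: `HAVE_ACPI_SKETCH` stands in for the real config option, the `sketch_*` functions are not the kernel's, and the `-ENXIO` return value is an assumption.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* HAVE_ACPI_SKETCH is a hypothetical stand-in for the real config option. */
#ifdef HAVE_ACPI_SKETCH
int sketch_register_wakeup_handler(int wake_irq,
                                   bool (*wakeup)(void *context),
                                   void *context);
void sketch_unregister_wakeup_handler(bool (*wakeup)(void *context),
                                      void *context);
#else
/* Stubs: let callers build unchanged when the feature is compiled out. */
static inline int sketch_register_wakeup_handler(int wake_irq,
                                                 bool (*wakeup)(void *context),
                                                 void *context)
{
    (void)wake_irq;
    (void)wakeup;
    (void)context;
    return -ENXIO;  /* assumed error code: no such device */
}

static inline void sketch_unregister_wakeup_handler(bool (*wakeup)(void *context),
                                                    void *context)
{
    (void)wakeup;
    (void)context;
}
#endif

/* A driver can call the API unconditionally, without its own #ifdef. */
static bool demo_wakeup(void *context)
{
    (void)context;
    return false;
}

static int demo_driver_probe(void)
{
    return sketch_register_wakeup_handler(9, demo_wakeup, NULL);
}
```

The benefit is that the conditional compilation lives in one header instead of being repeated in every driver that uses the interface.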

15 Nov, 2021 3 commits

printk: Remove printk.h inclusion in percpu.h · a0ddee65

Authored by Andy Shevchenko on Nov 12, 2021

After the commit 42a0bb3f ("printk/nmi: generic solution for safe
printk in NMI") the printk.h is not needed anymore in percpu.h.

Moreover `make headerdep` complains (an excerpt)

In file included from linux/printk.h,
                 from linux/dynamic_debug.h:188
                 from linux/printk.h:559 <-- here
                 from linux/percpu.h:9
                 from linux/idr.h:17
include/net/9p/client.h:13: warning: recursive header inclusion

Yeah, it's not a root cause of this, but removing will help to reduce
the noise.

Fixes: 42a0bb3f ("printk/nmi: generic solution for safe printk in NMI")
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Acked-by: Dennis Zhou <dennis@kernel.org>
Reviewed-by: Petr Mladek <pmladek@suse.com>
Signed-off-by: Petr Mladek <pmladek@suse.com>
Link: https://lore.kernel.org/r/20211112140749.80042-1-andriy.shevchenko@linux.intel.com

a0ddee65

net: Clean up some inconsistent indenting · 10a2308f

Authored by Jiapeng Chong on Nov 12, 2021

Eliminate the following smatch warning:

./include/linux/skbuff.h:4229 skb_remcsum_process() warn: inconsistent
indenting.

Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

10a2308f

tracing: Add length protection to histogram string copies · 938aa33f

Authored by Steven Rostedt (VMware) on Nov 14, 2021

The string copies to the histogram storage have a max size of 256 bytes
(defined by MAX_FILTER_STR_VAL). Only the string size of the event field
needs to be copied to the event storage, but no more than what is in the
event storage. Although nothing should be bigger than 256 bytes, there's
no protection against overwriting of the storage if one day there is.

Copy no more than the destination size, and enforce it.

Also had to turn MAX_FILTER_STR_VAL into an unsigned int, to keep the
min() comparison of the string sizes of comparable types.

Link: https://lore.kernel.org/all/CAHk-=wjREUihCGrtRBwfX47y_KrLCGjiq3t6QtoNJpmVrAEb1w@mail.gmail.com/
Link: https://lkml.kernel.org/r/20211114132834.183429a4@rorschach.local.home

Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Tom Zanussi <zanussi@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Masami Hiramatsu <mhiramat@kernel.org>
Fixes: 63f84ae6 ("tracing/histogram: Do not copy the fixed-size char array field over the field size")
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

938aa33f
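
The "copy no more than the destination size" rule from the tracing commit above can be sketched as a bounded copy that takes the minimum of the field size and the storage size. `copy_field_string()` and `STR_VAL_MAX` are hypothetical user-space names; the kernel change uses its own helpers and constants.

```c
#include <stddef.h>
#include <string.h>

#define STR_VAL_MAX 256u  /* unsigned, so size comparisons share one type */

/*
 * Copy a string field of field_size bytes into dst (dst_size bytes),
 * never writing past dst and always NUL-terminating.
 * Returns the number of characters copied, excluding the terminator.
 */
static size_t copy_field_string(char *dst, size_t dst_size,
                                const char *src, size_t field_size)
{
    const char *nul;
    size_t n;

    if (dst_size == 0)
        return 0;

    /* min(field size, destination capacity minus the terminator) */
    n = field_size < dst_size - 1 ? field_size : dst_size - 1;

    /* Stop early if the field holds a shorter NUL-terminated string. */
    nul = memchr(src, '\0', n);
    if (nul)
        n = (size_t)(nul - src);

    memcpy(dst, src, n);
    dst[n] = '\0';
    return n;
}
```

Keeping both sizes as unsigned values of the same type avoids the signed/unsigned mismatch that a type-checked `min()` would reject, which is the reason the commit gives for changing the constant's type.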

14 Nov, 2021 1 commit

net,lsm,selinux: revert the security_sctp_assoc_established() hook · 1aa3b220

Authored by Paul Moore on Nov 12, 2021

This patch reverts two prior patches, e7310c94
("security: implement sctp_assoc_established hook in selinux") and
7c2ef024 ("security: add sctp_assoc_established hook"), which
create the security_sctp_assoc_established() LSM hook and provide a
SELinux implementation.  Unfortunately these two patches were merged
without proper review (the Reviewed-by and Tested-by tags from
Richard Haines were for previous revisions of these patches that
were significantly different) and there are outstanding objections
from the SELinux maintainers regarding these patches.

Work is currently ongoing to correct the problems identified in the
reverted patches, as well as others that have come up during review,
but it is unclear at this point in time when that work will be ready
for inclusion in the mainline kernel.  In the interest of not keeping
objectionable code in the kernel for multiple weeks, and potentially
a kernel release, we are reverting the two problematic patches.

Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>

1aa3b220

openeuler / Kernel synced successfully almost 2 years ago