- 27 December 2019, 40 commits
-
Committed by Julien Thierry
hulk inclusion
category: feature
bugzilla: 9291
CVE: NA
ported from https://lore.kernel.org/patchwork/patch/1037469/

--------------------------------

Mask the IRQ priority through PMR and re-enable IRQs at CPU level, allowing only higher priority interrupts to be received during interrupt handling.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Julien Thierry
hulk inclusion
category: feature
bugzilla: 9291
CVE: NA
ported from https://lore.kernel.org/patchwork/patch/1037468/

--------------------------------

Add helper functions to access system registers related to interrupt priorities: PMR and RPR.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
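A minimal sketch of what such sysreg accessors can look like on arm64, assuming the SYS_ICC_PMR_EL1 and SYS_ICC_RPR_EL1 encodings are defined; the helper names are illustrative, not necessarily the exact ones the patch adds:

  /* Illustrative GICv3 priority-register accessors (sketch only). */
  static inline unsigned long gic_read_pmr(void)
  {
  	return read_sysreg_s(SYS_ICC_PMR_EL1);	/* current priority mask */
  }

  static inline void gic_write_pmr(unsigned long val)
  {
  	write_sysreg_s(val, SYS_ICC_PMR_EL1);	/* raise or lower the mask */
  }

  static inline unsigned long gic_read_rpr(void)
  {
  	return read_sysreg_s(SYS_ICC_RPR_EL1);	/* running (active) priority */
  }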
-
Committed by Julien Thierry
hulk inclusion
category: feature
bugzilla: 9291
CVE: NA
ported from https://lore.kernel.org/patchwork/patch/1037490/

--------------------------------

Add a cpufeature indicating whether a cpu supports masking interrupts by priority. The feature will be properly enabled in a later patch.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Julien Thierry
hulk inclusion
category: feature
bugzilla: 9291
CVE: NA
ported from https://lore.kernel.org/patchwork/patch/1037467/

--------------------------------

It is not supported to have some CPUs using the GICv3 sysreg CPU interface while others do not. Once ICC_SRE_EL1.SRE is set on a CPU, the bit cannot be cleared. Since matching this feature requires setting ICC_SRE_EL1.SRE, it cannot be turned off if found on a CPU.

Set the feature as STRICT_BOOT: if the boot CPU has it, all other CPUs are required to have it.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Suggested-by: Daniel Thompson <daniel.thompson@linaro.org>
Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
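For illustration, a sketch of what a strict-boot cpufeature table entry looks like on arm64; the names follow arch/arm64/kernel/cpufeature.c conventions but should be read as assumptions, not the exact diff:

  {
  	.desc = "GIC system register CPU interface",
  	.capability = ARM64_HAS_SYSREG_GIC_CPUIF,	/* assumed capability name */
  	.type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,	/* boot CPU decides; late CPUs must match */
  	.matches = has_useable_gicv3_cpuif,		/* assumed matcher */
  	.sys_reg = SYS_ID_AA64PFR0_EL1,
  	.field_pos = ID_AA64PFR0_GIC_SHIFT,
  	.min_field_value = 1,
  },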
-
Committed by Julien Thierry
hulk inclusion
category: feature
bugzilla: 9291
CVE: NA
ported from https://lore.kernel.org/patchwork/patch/1037466/

--------------------------------

There are some helpers to modify PSR.[DAIF] bits that are not referenced anywhere. The less these bits are available outside of local_irq_* functions the better. Get rid of those unused helpers.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
Acked-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Marc Zyngier <marc.zyngier@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Julien Thierry
hulk inclusion
category: feature
bugzilla: 9291
CVE: NA
ported from https://lore.kernel.org/patchwork/patch/1037465/

--------------------------------

When using VHE, the host needs to clear the HCR_EL2.TGE bit in order to interact with guest TLBs, switching from the EL2&0 translation regime to EL1&0. However, some non-maskable asynchronous event, such as SDEI, could happen while TGE is cleared. Because of this, address translation operations relying on the EL2&0 translation regime could fail (TLB invalidation, userspace access, ...).

Fix this by properly setting HCR_EL2.TGE when entering NMI context and clearing it if necessary when returning to the interrupted context.

Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Suggested-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: James Morse <james.morse@arm.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Cc: James Morse <james.morse@arm.com>
Cc: linux-arch@vger.kernel.org
Cc: stable@vger.kernel.org
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Julien Thierry
mainline inclusion
from mainline-v4.20-rc1
commit 342677d70ab92142b483fc68bcade74cdf969785
category: bugfix
bugzilla: 9291
CVE: NA

--------------------------------

Multiple interrupts pending for a CPU is actually rare. Doing an acknowledge loop does not give much better performance and can even degrade it. Do not loop when an interrupt has been acknowledged; just return from the interrupt and wait for another one to be raised.

Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Julien Thierry
mainline inclusion
from mainline-v4.20-rc1
commit 2130b789
category: bugfix
bugzilla: 9291
CVE: NA

--------------------------------

LPIs use the same priority value as other GIC interrupts. Make the GIC default priority definition visible to ITS implementation and use this same definition for LPI priorities.

Tested-by: Daniel Thompson <daniel.thompson@linaro.org>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Jason Cooper <jason@lakedaemon.net>
Cc: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Julien Thierry
mainline inclusion
from mainline-v4.20-rc1
commit 9a0c032825e073e002ee80cf2577431d7296a7f7
category: bugfix
bugzilla: 9291
CVE: NA

--------------------------------

For EL0 entries requiring bp_hardening, daif status is kept at DAIF_PROCCTX_NOIRQ until after hardening has been done. Then interrupts are enabled through local_irq_enable(). Before using local_irq_* functions, daifflags should be properly restored to a state where IRQs are enabled.

Enable IRQs by restoring DAIF_PROCCTX state after bp hardening.

Acked-by: James Morse <james.morse@arm.com>
Signed-off-by: Julien Thierry <julien.thierry@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
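The fix is essentially a one-liner; a sketch of the idea, assuming the arm64 daifflags helpers:

  /* After bp hardening: restore a daif state with IRQs enabled instead of
   * calling local_irq_enable() on a DAIF_PROCCTX_NOIRQ state. */
  local_daif_restore(DAIF_PROCCTX);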
-
Committed by Kees Cook
mainline inclusion
from mainline-5.0
commit b77fa617a2ff
category: bugfix
bugzilla: 5889
CVE: NA

-------------------------------------------------

Since the console writer does not use the preallocated crash dump buffer any more, there is no reason to perform locking around it.

Fixes: 70ad35db ("pstore: Convert console write to use ->write_buf")
Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Kees Cook
mainline inclusion
from mainline-5.0
commit 8665569e
category: bugfix
bugzilla: 5808
CVE: NA

-------------------------------------------------

Given corruption in the ftrace records, it might be possible to allocate tmp_prz without assigning prz to it, but still marking it as needing to be freed, which would cause at least a NULL dereference.

smatch warnings:
fs/pstore/ram.c:340 ramoops_pstore_read() error: we previously assumed 'prz' could be null (see line 255)

https://lists.01.org/pipermail/kbuild-all/2018-December/055528.html

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 2fbea82b ("pstore: Merge per-CPU ftrace records into one")
Cc: "Joel Fernandes (Google)" <joel@joelfernandes.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Kees Cook
mainline inclusion
from mainline-4.20
commit 1227daa4
category: bugfix
bugzilla: 5899
CVE: NA

-------------------------------------------------

When ramoops reserved a memory region in the kernel, it had an unhelpful label of "persistent_memory". When reading /proc/iomem, it would be repeated many times, did not hint that it was ramoops in particular, and didn't clarify very much about what each was used for:

400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : persistent_memory
  400001000-400001fff : persistent_memory
  ...
  4000ff000-4000fffff : persistent_memory

Instead, this adds meaningful labels for how the various regions are being used:

400000000-407ffffff : Persistent Memory (legacy)
  400000000-400000fff : ramoops:dump(0/252)
  400001000-400001fff : ramoops:dump(1/252)
  ...
  4000fc000-4000fcfff : ramoops:dump(252/252)
  4000fd000-4000fdfff : ramoops:console
  4000fe000-4000fe3ff : ramoops:ftrace(0/3)
  4000fe400-4000fe7ff : ramoops:ftrace(1/3)
  4000fe800-4000febff : ramoops:ftrace(2/3)
  4000fec00-4000fefff : ramoops:ftrace(3/3)
  4000ff000-4000fffff : ramoops:pmsg

Signed-off-by: Kees Cook <keescook@chromium.org>
Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Kees Cook
mainline inclusion
from mainline-4.20
commit 95047b0519c1
category: bugfix
bugzilla: 5899
CVE: NA

-------------------------------------------------

This refactors compression initialization slightly to better handle getting potentially called twice (via early pstore_register() calls and later pstore_init()) and improves the comments and reporting to be more verbose.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Joel Fernandes (Google)
mainline inclusion
from mainline-4.20
commit 416031653eb5
category: bugfix
bugzilla: 5899
CVE: NA

-------------------------------------------------

ramoops's call of pstore_register() was recently moved to run during late_initcall() because the crypto backend may not have been ready during postcore_initcall(). This meant early-boot crash dumps were not getting caught by pstore any more.

Instead, let's allow calls to pstore_register() earlier, and once crypto is ready we can initialize the compression.

Reported-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
Tested-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
Fixes: cb3bee03 ("pstore: Use crypto compress API")
[kees: trivial rebase]
Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Kees Cook
mainline inclusion
from mainline-4.20
commit cb095afd4476
category: bugfix
bugzilla: 5899
CVE: NA

-------------------------------------------------

In preparation for having additional actions during init/exit, this moves the init/exit into platform.c, centralizing the logic to make call outs to the fs init/exit.

Signed-off-by: Kees Cook <keescook@chromium.org>
Tested-by: Guenter Roeck <groeck@chromium.org>
Signed-off-by: Hou Tao <houtao1@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Keith Busch
mainline inclusion
from mainline-4.20-rc1
commit 9fe5c59f
category: bugfix
bugzilla: 6373
CVE: NA

---------------------------

The nvme pci driver had been adding its CMB resource to the P2P DMA subsystem every time on a controller reset. This results in the following warning:

------------[ cut here ]------------
nvme 0000:00:03.0: Conflicting mapping in same section
WARNING: CPU: 7 PID: 81 at kernel/memremap.c:155 devm_memremap_pages+0xa6/0x380
...
Call Trace:
 pci_p2pdma_add_resource+0x153/0x370
 nvme_reset_work+0x28c/0x17b1 [nvme]
 ? add_timer+0x107/0x1e0
 ? dequeue_entity+0x81/0x660
 ? dequeue_entity+0x3b0/0x660
 ? pick_next_task_fair+0xaf/0x610
 ? __switch_to+0xbc/0x410
 process_one_work+0x1cf/0x350
 worker_thread+0x215/0x3d0
 ? process_one_work+0x350/0x350
 kthread+0x107/0x120
 ? kthread_park+0x80/0x80
 ret_from_fork+0x1f/0x30
---[ end trace f7ea76ac6ee72727 ]---
nvme nvme0: failed to register the CMB

This patch fixes this by registering the CMB with P2P only once.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
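One possible shape of a register-once guard, with a hypothetical flag field; a sketch only, not the exact mainline diff:

  /* Register the CMB with the P2P DMA subsystem once, not on every reset. */
  if (!dev->cmb_p2p_registered &&
      pci_p2pdma_add_resource(pdev, bar, size, offset) == 0)
  	dev->cmb_p2p_registered = true;	/* hypothetical flag; skip on later resets */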
-
Committed by Xiongfeng Wang
euler inclusion
category: feature
bugzilla: 5515
CVE: NA

----------------------------------------

This patch changes the SDEI watchdog timer hardware IRQ number so that the secure arm_arch_timer is used as the SDEI watchdog timer; with that, the host OS can still use the SDEI NMI watchdog while a guest is running on it. The BIOS programs the secure timer, so this patch removes the code that programs the timer; the secure timer cannot be programmed from non-secure state anyway.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Mike Snitzer
mainline inclusion
from mainline-5.0-rc4
commit 6548c7c5
category: bugfix
bugzilla: 7220
CVE: NA

---------------------------

Otherwise targets that don't support/expect IO splitting could resubmit bios using code paths with unnecessary IO splitting complexity.

Depends-on: 24113d48 ("dm: avoid indirect call in __dm_make_request")
Fixes: 978e51ba ("dm: optimize bio-based NVMe IO submission")
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Mikulas Patocka
mainline inclusion
from mainline-5.0-rc1
commit 24113d48
category: bugfix
bugzilla: 7220
CVE: NA

---------------------------

Indirect calls are inefficient because of retpolines that are used for spectre workaround. This patch replaces an indirect call with a condition (that can be predicted by the branch predictor).

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: yangerkun <yangerkun@huawei.com>
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Mike Snitzer
mainline inclusion
from mainline-5.0-rc4
commit a1e1cb72
category: bugfix
bugzilla: 7221
CVE: NA

---------------------------

The risk of redundant IO accounting was not taken into consideration when commit 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk") introduced IO splitting in terms of recursion via generic_make_request().

Fix this by subtracting the split bio's payload from the IO stats that were already accounted for by start_io_acct() upon dm_make_request() entry. This repeat oscillation of the IO accounting, up then down, isn't ideal but refactoring DM core's IO splitting to pre-split bios _before_ they are accounted turned out to be an excessive amount of change that will need a full development cycle to refine and verify.

Before this fix: /dev/mapper/stripe_dev is a 4-way stripe using a 32k chunksize, so bios are split on 32k boundaries.

# fio --name=16M --filename=/dev/mapper/stripe_dev --rw=write --bs=64k --size=16M \
      --iodepth=1 --ioengine=libaio --direct=1 --refill_buffers

with debugging added:
[103898.310264] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=0 len=128
[103898.318704] device-mapper: core: __split_and_process_bio: recursing for following split bio:
[103898.329136] device-mapper: core: start_io_acct: dm-2 WRITE bio->bi_iter.bi_sector=64 len=64
...

16M written yet 136M (278528 * 512b) accounted:
# cat /sys/block/dm-2/stat | awk '{ print $7 }'
278528

After this fix: 16M written and 16M (32768 * 512b) accounted:
# cat /sys/block/dm-2/stat | awk '{ print $7 }'
32768

Fixes: 18a25da8 ("dm: ensure bio submission follows a depth-first tree walk")
Cc: stable@vger.kernel.org # 4.16+
Reported-by: Bryan Gurney <bgurney@redhat.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
[Conflict: drivers/md/dm.c
 Since patch 1226b8dd ("block: switch to per-cpu in-flight counters") has not been included.]
Signed-off-by: yangerkun@huawei.com
Reviewed-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Linus Torvalds
mainline inclusion
from mainline-5.0-rc1
commit 94bd8a05cd4de344a9a57e52ef7d99550251984f
category: bugfix
bugzilla: 9284
CVE: NA

-------------------------------------------------

Commit 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'") broke both alpha and SH booting in qemu, as noticed by Guenter Roeck.

It turns out that the bug wasn't actually in that commit itself (which would have been surprising: it was mostly a no-op), but in how the addition of access_ok() to the strncpy_from_user() and strnlen_user() functions now triggered the case where those functions would test the access of the very last byte of the user address space.

The string functions actually did that user range test before too, but they did it manually by just comparing against user_addr_max(). But with user_access_begin() doing the check (using "access_ok()"), it now exposed problems in the architecture implementations of that function.

For example, on alpha, the access_ok() helper macro looked like this:

  #define __access_ok(addr, size) \
  	((get_fs().seg & (addr | size | (addr+size))) == 0)

and what it basically tests is if any of the high bits get set (the USER_DS masking value is 0xfffffc0000000000). And that's completely wrong for the "addr+size" check: it's off-by-one for the case where we check to the very end of the user address space, which is exactly what the strn*_user() functions do.

Why? Because "addr+size" will be exactly the size of the address space, so trying to access the last byte of the user address space will fail the __access_ok() check, even though it shouldn't. As a result, the user string accessor functions failed consistently - because they literally don't know how long the string is going to be, and the max access is going to be that last byte of the user address space.

Side note: that alpha macro is buggy for another reason too - it re-uses the arguments twice.

And SH has another version of almost the exact same bug:

  #define __addr_ok(addr) \
  	((unsigned long __force)(addr) < current_thread_info()->addr_limit.seg)

so far so good: yes, a user address must be below the limit. But then:

  #define __access_ok(addr, size) \
  	(__addr_ok((addr) + (size)))

is wrong with the exact same off-by-one case: the case when "addr+size" is exactly _equal_ to the limit is actually perfectly fine (think "one byte access at the last address of the user address space").

The SH version is actually seriously buggy in another way: it doesn't actually check for overflow, even though it did copy the _comment_ that talks about overflow.

So it turns out that both SH and alpha actually have completely buggy implementations of access_ok(), but they happened to work in practice (although the SH overflow one is a serious serious security bug, not that anybody likely cares about SH security).

This fixes the problems by using a similar macro on both alpha and SH. It isn't trying to be clever, the end address is based on this logic:

  unsigned long __ao_end = __ao_a + __ao_b - !!__ao_b;

which basically says "add start and length, and then subtract one unless the length was zero". We can't subtract one for a zero length, or we'd just hit an underflow instead.

For a lot of access_ok() users the length is a constant, so this isn't actually as expensive as it initially looks.

Reported-and-tested-by: Guenter Roeck <linux@roeck-us.net>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
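To make the off-by-one concrete, here is a small self-contained demonstration of the end-address logic quoted above; the limit value is a made-up stand-in for user_addr_max():

  #include <stdio.h>

  /* "add start and length, then subtract one unless the length was zero" */
  static unsigned long range_end(unsigned long addr, unsigned long size)
  {
  	return addr + size - !!size;
  }

  int main(void)
  {
  	unsigned long limit = 0x0000040000000000UL; /* stand-in user_addr_max() */
  	unsigned long last = limit - 1;             /* last valid user address */

  	/* A naive "addr + size must be below the limit" test rejects this: */
  	printf("addr+size == limit?   %d\n", last + 1 == limit);          /* 1 */
  	/* The fixed logic accepts a 1-byte access at the last address: */
  	printf("end < limit (ok)?     %d\n", range_end(last, 1) < limit); /* 1 */
  	/* And a zero-length access does not underflow below addr: */
  	printf("zero-len end == addr? %d\n", range_end(last, 0) == last); /* 1 */
  	return 0;
  }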
-
Committed by Stafford Horne
mainline inclusion
from mainline-5.0-rc2
commit 9cb2feb4
category: bugfix
bugzilla: 9284
CVE: NA

-------------------------------------------------

The commit 594cc251 ("make 'user_access_begin()' do 'access_ok()'") exposed incorrect implementations of the access_ok() macro in several architectures. This change fixes 2 issues found in OpenRISC.

OpenRISC was not properly using parentheses for arguments and was also using arguments twice. This patch fixes those 2 issues.

I test booted this patch with v5.0-rc1 on qemu and it's working fine.

Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Stafford Horne <shorne@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Linus Torvalds
mainline inclusion
from mainline-5.0-rc1
commit 4caf4ebf
category: bugfix
bugzilla: 9284
CVE: NA

-------------------------------------------------

These two architectures actually had an intentional use of the 'type' argument to access_ok() just to avoid warnings.

I had actually noticed the powerpc one, but forgot to then fix it up. And I missed the sparc32 case entirely.

This is hopefully all of it.

Reported-by: Mathieu Malaterre <malat@debian.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Fixes: 96d4f267 ("Remove 'type' argument from access_ok() function")
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Will Deacon
mainline inclusion
from mainline-5.0-rc1
commit 6e693b3f
category: bugfix
bugzilla: 9284
CVE: NA

-------------------------------------------------

Commit 594cc251 ("make 'user_access_begin()' do 'access_ok()'") makes the access_ok() check part of the user_access_begin() preceding a series of 'unsafe' accesses. This has the desirable effect of ensuring that all 'unsafe' accesses have been range-checked, without having to pick through all of the callsites to verify whether the appropriate checking has been made.

However, the consolidated range check does not inhibit speculation, so it is still up to the caller to ensure that they are not susceptible to any speculative side-channel attacks for user addresses that ultimately fail the access_ok() check.

This is an oversight, so use __uaccess_begin_nospec() to ensure that speculation is inhibited until the access_ok() check has passed.

Reported-by: Julien Thierry <julien.thierry@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Linus Torvalds
mainline inclusion
from mainline-5.0-rc1
commit 594cc251
category: bugfix
bugzilla: 9284
CVE: CVE-2018-20669

-------------------------------------------------

Originally, the rule used to be that you'd have to do access_ok() separately, and then user_access_begin() before actually doing the direct (optimized) user access.

But experience has shown that people then decide not to do access_ok() at all, and instead rely on it being implied by other operations or similar. Which makes it very hard to verify that the access has actually been range-checked.

If you use the unsafe direct user accesses, hardware features (either SMAP - Supervisor Mode Access Protection - on x86, or PAN - Privileged Access Never - on ARM) do force you to use user_access_begin(). But nothing really forces the range check.

By putting the range check into user_access_begin(), we actually force people to do the right thing (tm), and the range check will be visible near the actual accesses. We have way too long a history of people trying to avoid them.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Linus Torvalds
mainline inclusion
from mainline-5.0-rc1
commit 96d4f267e40f9509e8a66e2b39e8b95655617693
category: cleanup
bugzilla: 9284
CVE: NA

It's a cleanup patch that prepares for applying the CVE-2018-20669 patch 594cc251fdd0 ("make 'user_access_begin()' do 'access_ok()'").

-------------------------------------------------

Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument of the user address range verification function since we got rid of the old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect bit when in kernel mode, so you had to do COW by hand before doing any user access. But we haven't supported that in a long time, and these days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range checking resulted in this patch, because there is no way we're going to move the old VERIFY_xyz interface to that model. And it's best done at the end of the merge window when I've done most of my merges, so let's just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual values of VERIFY_{READ,WRITE} (not that they mattered, since nothing really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for access_ok() uses, and the changes are trivial, but I may have missed something. Any missed conversion should be trivially fixable, though.

Conflicts:
	drivers/media/v4l2-core/v4l2-compat-ioctl32.c
	drivers/infiniband/core/uverbs_main.c
	drivers/platform/goldfish/goldfish_pipe.c
	fs/namespace.c
	fs/select.c
	kernel/compat.c
	arch/powerpc/include/asm/uaccess.h
	arch/arm64/kernel/perf_callchain.c
	arch/arm64/include/asm/uaccess.h
	arch/ia64/kernel/signal.c
	arch/x86/entry/vsyscall/vsyscall_64.c

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
[yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Daniel Borkmann
mainline inclusion
from mainline-4.20
commit fdadd049
category: bugfix
bugzilla: 5995
CVE: NA

-------------------------------------------------

Michael and Sandipan report:

  Commit ede95a63 introduced a bpf_jit_limit tuneable to limit BPF JIT allocations. At compile time it defaults to PAGE_SIZE * 40000, and is adjusted again at init time if MODULES_VADDR is defined.

  For ppc64 kernels, MODULES_VADDR isn't defined, so we're stuck with the compile-time default at boot-time, which is 0x9c400000 when using 64K page size. This overflows the signed 32-bit bpf_jit_limit value:

  root@ubuntu:/tmp# cat /proc/sys/net/core/bpf_jit_limit
  -1673527296

  and can cause various unexpected failures throughout the network stack. In one case `strace dhclient eth0` reported:

  setsockopt(5, SOL_SOCKET, SO_ATTACH_FILTER, {len=11, filter=0x105dd27f8}, 16) = -1 ENOTSUPP (Unknown error 524)

  and similar failures can be seen with tools like tcpdump. This doesn't always reproduce however, and I'm not sure why. The more consistent failure I've seen is an Ubuntu 18.04 KVM guest booted on a POWER9 host would time out on systemd/netplan configuring a virtio-net NIC with no noticeable errors in the logs.

Given this and also given that in near future some architectures like arm64 will have a custom area for BPF JIT image allocations we should get rid of the BPF_JIT_LIMIT_DEFAULT fallback / default entirely. For 4.21, we have an overridable bpf_jit_alloc_exec(), bpf_jit_free_exec() so therefore add another overridable bpf_jit_alloc_exec_limit() helper function which returns the possible size of the memory area for deriving the default heuristic in bpf_jit_charge_init().

Like bpf_jit_alloc_exec() and bpf_jit_free_exec(), the new bpf_jit_alloc_exec_limit() assumes that module_alloc() is the default JIT memory provider, and therefore in case archs implement their custom module_alloc() we use MODULES_{END,_VADDR} for limits and otherwise for vmalloc_exec() cases like on ppc64 we use VMALLOC_{END,_START}.

Additionally, for archs supporting large page sizes, we should change the sysctl to be handled as long to not run into sysctl restrictions in future.

Fixes: ede95a63 ("bpf: add bpf_jit_limit knob to restrict unpriv allocations")
Reported-by: Sandipan Das <sandipan@linux.ibm.com>
Reported-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Tested-by: Michael Roth <mdroth@linux.vnet.ibm.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
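The overflow is easy to reproduce in isolation; a minimal userspace check of the arithmetic described above:

  #include <stdio.h>

  int main(void)
  {
  	long long limit = 40000LL * 65536; /* PAGE_SIZE * 40000 with 64K pages */
  	int old_sysctl = (int)limit;       /* the sysctl was a signed 32-bit int;
  	                                      on common ABIs this wraps around */

  	printf("limit     = %lld (0x%llx)\n", limit, limit); /* 2621440000 (0x9c400000) */
  	printf("as 32-bit = %d\n", old_sysctl);              /* -1673527296, as reported */
  	return 0;
  }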
-
Committed by Daniel Borkmann
mainline inclusion
from mainline-4.20
commit ede95a63
category: bugfix
bugzilla: 5995
CVE: NA

-------------------------------------------------

Rick reported that the BPF JIT could potentially fill the entire module space with BPF programs from unprivileged users which would prevent later attempts to load normal kernel modules or privileged BPF programs, for example. If JIT was enabled but unsuccessful to generate the image, then before commit 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config") we would always fall back to the BPF interpreter. Nowadays in the case where the CONFIG_BPF_JIT_ALWAYS_ON could be set, then the load will abort with a failure since the BPF interpreter was compiled out.

Add a global limit and enforce it for unprivileged users such that in case of BPF interpreter compiled out we fail once the limit has been reached or we fall back to BPF interpreter earlier w/o using module mem if latter was compiled in. In a next step, fair share among unprivileged users can be resolved in particular for the case where we would fail hard once limit is reached.

Fixes: 290af866 ("bpf: introduce BPF_JIT_ALWAYS_ON config")
Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
Co-Developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jann Horn <jannh@google.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: LKML <linux-kernel@vger.kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Willem de Bruijn
mainline inclusion
from mainline-4.20
commit 8f932f76
category: bugfix
bugzilla: 6159
CVE: NA

-------------------------------------------------

SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets. But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN.

Add skb_setup_tx_timestamp that configures both tx_flags and tskey for these paths that do not need corking or use bytestream keys.

Fixes: 09c2d251 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
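A sketch of the helper's shape, assuming the internal _sock_tx_timestamp() signature; treat it as illustrative rather than the exact mainline definition:

  /* Configure both tx_flags and tskey for non-corking, non-bytestream paths. */
  static inline void skb_setup_tx_timestamp(struct sk_buff *skb, __u16 tsflags)
  {
  	_sock_tx_timestamp(skb->sk, tsflags, &skb_shinfo(skb)->tx_flags,
  			   &skb_shinfo(skb)->tskey);
  }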
-
Committed by Willem de Bruijn
mainline inclusion
from mainline-4.20
commit fbfb2321
category: bugfix
bugzilla: 6159
CVE: NA

-------------------------------------------------

Raw sockets support tx timestamping, but one case is missing. IPPROTO_RAW takes a separate packet construction path. raw_send_hdrinc has an explicit call to sock_tx_timestamp, but rawv6_send_hdrinc does not. Add it.

Fixes: 11878b40 ("net-timestamp: SOCK_RAW and PING timestamping")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Xin Long
mainline inclusion
from mainline-4.20
commit 0d32f17717e6
category: bugfix
bugzilla: 6163
CVE: NA

-------------------------------------------------

I changed to count sk_wmem_alloc by skb truesize instead of 1 to fix the sk_wmem_alloc leak caused by later truesize's change in xfrm in Commit 02968ccf0125 ("sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit"). But I should have also increased sk_wmem_alloc when head->truesize is increased in sctp_packet_gso_append() as xfrm does. Otherwise, sctp gso packet will cause sk_wmem_alloc underflow.

Fixes: 02968ccf0125 ("sctp: count sk_wmem_alloc by skb truesize in sctp_packet_transmit")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
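A sketch of the balancing add described above, assuming sk_wmem_alloc is a refcount_t as in v4.14+ kernels:

  /* When the GSO head's truesize grows, mirror the growth in the
   * owning socket's write-memory accounting so free stays balanced. */
  head->truesize += skb->truesize;
  refcount_add(skb->truesize, &head->sk->sk_wmem_alloc);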
-
Committed by Xin Long
mainline inclusion
from mainline-4.20
commit 02968ccf0125
category: bugfix
bugzilla: 6163
CVE: NA

-------------------------------------------------

Now sctp increases sk_wmem_alloc by 1 when doing set_owner_w for the skb allocated in sctp_packet_transmit and decreases by 1 when freeing this skb. But when this skb goes through the networking stack, some subcomponents might change skb->truesize and add the same amount on sk_wmem_alloc. However sctp doesn't know the amount to decrease by; it would cause a leak on sk->sk_wmem_alloc and the sock can never be freed.

Xiumei found this issue when it hit esp_output_head() by using sctp over ipsec, where skb->truesize is added and so is sk->sk_wmem_alloc.

Since sctp has used sk_wmem_queued to count for writable space since Commit cd305c74b0f8 ("sctp: use sk_wmem_queued to check for writable space"), it's ok to fix it by counting sk_wmem_alloc by skb truesize in sctp_packet_transmit.

Fixes: cac2661c ("esp4: Avoid skb_cow_data whenever possible")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
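Counting by truesize pairs the charge with the generic destructor; a sketch of the ownership setup, assuming sock_wfree() is used as the destructor:

  skb->sk = sk;
  skb->destructor = sock_wfree;	/* sock_wfree() subtracts skb->truesize on free */
  refcount_add(skb->truesize, &sk->sk_wmem_alloc);	/* charge truesize, not 1 */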
-
Committed by Alexey Dobriyan
mainline inclusion
from mainline-v5.0-rc4-142-g1fde6f2
commit 1fde6f21d90f
category: bugfix
bugzilla: 7439
CVE: NA

-------------------------------------------------

/proc entries under /proc/net/* can't be cached into dcache because setns(2) can change current net namespace.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: avoid vim miscolorization]
[adobriyan@gmail.com: write test, add dummy ->d_revalidate hook: necessary if /proc/net/* is pinned at setns time]
Link: http://lkml.kernel.org/r/20190108192350.GA12034@avx2
Link: http://lkml.kernel.org/r/20190107162336.GA9239@avx2
Fixes: 1da4d377 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Reported-by: Mateusz Stępień <mateusz.stepien@netrounds.com>
Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
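The dummy hook mentioned in the changelog is tiny; a sketch of its shape (names assumed from fs/proc conventions):

  /* Always report the dentry as invalid, forcing a fresh lookup so a task
   * that switched net namespaces via setns(2) never sees stale entries. */
  static int proc_net_d_revalidate(struct dentry *dentry, unsigned int flags)
  {
  	return 0;
  }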
-
Committed by Leon Romanovsky
mainline inclusion
from mainline-5.0
commit 199fa087dc6b
category: bugfix
bugzilla: 6166
CVE: NA

-------------------------------------------------

Failure to create a debugfs entry is unpleasant, but not serious enough to abort driver initialization. Align the mlx5_core code with the debugfs design and continue execution whether debugfs_create_dir() succeeds or not.

Fixes: e126ba97 ("mlx5: Add driver for Mellanox Connect-IB adapters")
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Eric Dumazet
mainline inclusion
from mainline-5.0
commit aa6daaca
category: bugfix
bugzilla: 6169
CVE: NA

-------------------------------------------------

In order to cook skbs the same way Ethernet drivers do, it is probably better not to use GFP_KERNEL, but rather the GFP_ATOMIC and PFMEMALLOC mechanisms provided by netdev_alloc_frag(). This allows the tun driver to be used even under memory stress, especially if swap is used over this tun channel.

Fixes: 90e33d45 ("tun: enable napi_gro_frags() for TUN/TAP driver")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Petar Penkov <peterpenkov96@gmail.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
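A sketch of the allocation switch, assuming a fragment size already computed by the caller:

  /* Atomic, PFMEMALLOC-aware page-frag allocation, as NIC drivers use,
   * instead of a kmalloc(..., GFP_KERNEL) that can block or fail under
   * memory stress. */
  void *buf = netdev_alloc_frag(fragsz);

  if (!buf)
  	return NULL;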
-
Committed by J. Bruce Fields
mainline inclusion
from mainline-v4.20
commit 826799e66e86
category: bugfix
bugzilla: 6170
CVE: NA

-------------------------------------------------

Commits ffb6ca33 and e08ea3a9 prevent setting xprt_min_resvport greater than xprt_max_resvport, but may also break simple code that sets one parameter then the other, if the new range does not overlap the old. Also it looks racy to me, unless there's some serialization I'm not seeing.

Granted it would probably require malicious privileged processes (unless there's a chance these might eventually be settable in unprivileged containers), but still it seems better not to let userspace panic the kernel. Simpler seems to be to allow setting the parameters to whatever you want but interpret xprt_min_resvport > xprt_max_resvport as the empty range.

Fixes: ffb6ca33 ("sunrpc: Prevent resvport min/max inversion...")
Fixes: e08ea3a9 ("sunrpc: Prevent rexvport min/max inversion...")
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Lin Miaohe <linmiaohe@huawei.com>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
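A sketch of the "empty range" interpretation, assuming helper names from net/sunrpc/xprtsock.c; not the exact mainline diff:

  static int xs_get_random_port(void)
  {
  	unsigned short min = xprt_min_resvport, max = xprt_max_resvport;
  	unsigned short range;

  	if (max < min)
  		return -EADDRINUSE;	/* empty range: fail the bind, don't panic */
  	range = max - min + 1;
  	return min + prandom_u32() % range;
  }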
-
Committed by Yuchung Cheng
mainline inclusion
from mainline-5.0
commit 31aa6503
category: bugfix
bugzilla: 6982
CVE: NA

-------------------------------------------------

The existing BPF TCP initial congestion window (TCP_BPF_IW) does not work on an (active) Fast Open sender. This is because it changes the (initial) window only if data_segs_out is zero - but data_segs_out is also incremented on SYN-data. This patch fixes the issue by properly accounting for SYN-data as well.

Fixes: fc747810 ("bpf: Adds support for setting initial cwnd")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
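A sketch of the accounting change (field names from struct tcp_sock; read as illustrative, not the exact diff):

  /* Treat the SYN-data segment that Fast Open already counted as "no
   * payload sent yet" when deciding whether BPF may still set the IW. */
  if (val > 0 && tp->data_segs_out <= tp->syn_data)	/* was: == 0 */
  	tp->snd_cwnd = val;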
-
Committed by Andrei Vagin
mainline inclusion
from mainline-4.20
commit c34c1287
category: bugfix
bugzilla: 6053
CVE: NA

-------------------------------------------------

IPPROTO_RAW isn't registered as an inet protocol, so inet_protos[protocol] is always NULL for it.

Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Fixes: bf2ae2e4 ("sock_diag: request _diag module only when the family or proto has been registered")
Signed-off-by: Andrei Vagin <avagin@gmail.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Florian Westphal
mainline inclusion
from mainline-5.0
commit 2035f3ff8eaa
category: bugfix
bugzilla: 7269
CVE: NA

-------------------------------------------------

Unlike ip(6)tables ebtables only counts user-defined chains. The effect is that a 32bit ebtables binary on a 64bit kernel can do 'ebtables -N FOO' only after adding at least one rule, else the request fails with -EINVAL.

This is a similar fix as done in 3f1e53ab ("netfilter: ebtables: don't attempt to allocate 0-sized compat array").

Fixes: 7d7d7e02 ("netfilter: compat: reject huge allocation requests")
Reported-by: Francesco Ruggeri <fruggeri@arista.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Committed by Fernando Fernandez Mancera

mainline inclusion
from mainline-5.0
commit 1a6a0951
category: bugfix
bugzilla: 7252
CVE: NA

-------------------------------------------------

When we check the tcp options of a packet and it doesn't match the current fingerprint, the tcp packet option pointer must be restored to its initial value in order to do the proper tcp options check for the next fingerprint.

Here we can see an example. Assuming the following fingerprint base with two lines:

S10:64:1:60:M*,S,T,N,W6: Linux:3.0::Linux 3.0
S20:64:1:60:M*,S,T,N,W7: Linux:4.19:arch:Linux 4.1

Where TCP options are the last field in the OS signature, all of them overlap except by the last one, ie. 'W6' versus 'W7'. In case a packet for Linux 4.19 kicks in, the osf finds no match because the TCP options pointer is updated after checking for the TCP options in the first line.

Therefore, reset the pointer back to where it should be.

Fixes: 11eeef41 ("netfilter: passive OS fingerprint xtables match")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
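A sketch of the restore, with hypothetical variable and helper names (the real code lives in net/netfilter/nfnetlink_osf.c):

  /* Rewind the option pointer before every fingerprint so a failed partial
   * match doesn't poison the check for the next fingerprint line. */
  const unsigned char *optp_start = ctx.optp;	/* saved once, before the loop */

  list_for_each_entry(f, &nf_osf_fingers[df], finger_entry) {
  	ctx.optp = optp_start;			/* restore for this fingerprint */
  	if (fingerprint_matches(f, &ctx))	/* hypothetical matcher */
  		return f;
  }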
-